What is the algorithm/data structure used by Lucene to compute the term frequency of documents?
It looks like Lucene uses a Map structure to compute the term frequency, but I would like a more detailed answer.
Answer:
Terms and their frequencies are represented as term vectors stored in the inverted index. A Term is the basic unit for searching and consists of a pair of string elements: <fieldname,text>. A term vector is a collection of terms. The inverted index maps terms to documents: for each term T, it stores the set of all documents containing that term. It is therefore the analyzer's job to find the terms in each document and produce a token stream so that they can be mapped.

Terms are stored in segments, in sorted order. The .frq file contains the ids of the documents that contain each term, along with the frequency of the term in each of those documents. Lucene stores the term data in inverted index format as described in the image below:

[Image not available]

TermDocs gives the TF of a given term in each document that contains the term. We can get the term documents from an IndexReader, using the term of interest. Note that docFreq returns the number of documents containing a term, not the per-document TF. I hope the following code makes this easier to understand.

    // Lucene 3.x API; classes come from java.util and org.apache.lucene.index.
    List<String> termlist = new ArrayList<String>();
    IndexReader reader = IndexReader.open(indexFolder); // indexFolder is assumed to be the index Directory
    TermEnum terms = reader.terms();
    while (terms.next()) {
        Term term = terms.term();
        termlist.add(term.text());
        int docFreq = reader.docFreq(term); // number of documents containing the term
        TermDocs termDocs = reader.termDocs(term);
        while (termDocs.next()) {
            int tf = termDocs.freq(); // TF of the term in document termDocs.doc()
        }
        termDocs.close();
    }
    terms.close();
    reader.close();
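To make the idea of an inverted index with per-document frequencies concrete, here is a minimal plain-Java sketch. It is not Lucene's implementation: the class name TinyInvertedIndex, its methods, and the simple lowercase-and-split tokenizer are all hypothetical stand-ins chosen for illustration. It only shows the general shape of the data: a sorted map from each term to the documents containing it, with the term frequency stored next to each document id, roughly the information the .frq file holds.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration class; not part of Lucene.
final class TinyInvertedIndex {
    // term -> (docId -> frequency of the term in that document).
    // TreeMap keeps both the terms and the document ids in sorted order.
    private final Map<String, TreeMap<Integer, Integer>> postings = new TreeMap<>();

    // Stand-in for an analyzer: lowercase the text and split on non-alphanumerics.
    void addDocument(int docId, String text) {
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.isEmpty()) {
                continue; // split can yield an empty leading token
            }
            postings.computeIfAbsent(token, t -> new TreeMap<>())
                    .merge(docId, 1, Integer::sum);
        }
    }

    // Term frequency: how many times the term occurs in one document.
    int termFreq(String term, int docId) {
        TreeMap<Integer, Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.getOrDefault(docId, 0);
    }

    // Document frequency: how many documents contain the term.
    int docFreq(String term) {
        TreeMap<Integer, Integer> docs = postings.get(term);
        return docs == null ? 0 : docs.size();
    }
}
```

After adding a few documents with addDocument, termFreq("lucene", 1) plays the role of TermDocs.freq() for one document, and docFreq("lucene") plays the role of IndexReader.docFreq(term).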
Dhwaj Raj at Quora
Other answers
Lucene uses an inverted index (http://en.wikipedia.org/wiki/Inverted_index) to store term vectors. For more detail, see the Lucid Imagination blog posts (http://www.lucidimagination.com/blog/2009/03/04/exploring-lucenes-indexing-code-part-1/, http://www.lucidimagination.com/blog/2009/03/18/exploring-lucenes-indexing-code-part-2/) and the Lucene documentation (http://lucene.apache.org/core/old_versioned_docs/versions/3_0_3/fileformats.html#Index%20File%20Formats).
Vineet Yadav