Wildcard search using Lucene on a large file containing 100 million strings takes too much time. I want the result in 1-2 seconds. Any help?
-
I have a 1.43 GB file containing 100 million strings (3 to 80 characters long), one per line. I am running a wildcard search on the file using Lucene. Right now I create one document for each string. I want the total count of matches for the search keyword (*searchkeyword*). Here is my code.

LuceneDemo.java

    import java.io.FileNotFoundException;
    import java.io.IOException;

    import org.apache.lucene.index.CorruptIndexException;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.Searcher;
    import org.apache.lucene.search.WildcardQuery;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class LuceneDemo {
        // a path to directory where Lucene will store index files
        private static String indexDirectory = "C:\\indextofile";
        // a path to directory which contains data files that need to be indexed
        private static String dataDirectory = "C:\\indexofilef";
        public static int count = 0;
        private Searcher indexSearcher;

        public static void main(String[] args) throws FileNotFoundException, IOException {
            LuceneDemo luceneDemo = new LuceneDemo();
            // create Lucene index
            luceneDemo.createLuceneIndex();
            // create IndexSearcher
            luceneDemo.createIndexSearcher();
            luceneDemo.termQueryExample();
        }

        private void createLuceneIndex() {
            Indexer indexer = new Indexer(indexDirectory, dataDirectory);
            // Create IndexWriter
            System.out.println("testing-4");
            indexer.createIndexWriter();
            try {
                // Index data
                indexer.indexData();
            } catch (FileNotFoundException e) {
                throw new RuntimeException(e);
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        private void createIndexSearcher() throws CorruptIndexException, IOException {
            /* Create instance of IndexSearcher */
            indexSearcher = new IndexSearcher(indexDirectory);
        }

        private void termQueryExample() throws CorruptIndexException, IOException {
            try {
                Directory directory = FSDirectory.getDirectory(indexDirectory);
                // IndexSearcher indexSearcher = new IndexSearcher(directory);
                BooleanQuery.setMaxClauseCount(102400000);
                Term term = new Term("reversecontent", "bubble*com");
                Query query = new WildcardQuery(term);
                Hits hits = indexSearcher.search(query);
                System.out.println("######## Hits :" + hits.length());
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

Indexer.java

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileNotFoundException;
    import java.io.FileReader;
    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.KeywordAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexDeletionPolicy;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    public class Indexer {
        private IndexWriter indexWriter;
        /* Location of directory where index files are stored */
        private String indexDirectory;
        /* Location of data directory */
        private String dataDirectory;
        public String FIELD_CONTENTS = "contents";

        public Indexer(String indexDirectory, String dataDirectory) {
            this.indexDirectory = indexDirectory;
            this.dataDirectory = dataDirectory;
        }

        /**
         * This method creates an instance of IndexWriter which is used
         * to add Documents and write indexes on the disc.
         */
        void createIndexWriter() {
            if (indexWriter == null) {
                try {
                    // Create instance of Directory where index files will be stored
                    Directory fsDirectory = FSDirectory.getDirectory(indexDirectory);
                    /* Create instance of analyzer, which will be used to tokenize
                       the input data */
                    Analyzer standardAnalyzer = new KeywordAnalyzer();
                    // Create a new index
                    boolean create = true;
                    // Create the instance of deletion policy
                    IndexDeletionPolicy deletionPolicy = new KeepOnlyLastCommitDeletionPolicy();
                    indexWriter = new IndexWriter(fsDirectory, standardAnalyzer, create,
                            deletionPolicy, IndexWriter.MaxFieldLength.UNLIMITED);
                } catch (IOException ie) {
                    System.out.println("Error in creating IndexWriter");
                    throw new RuntimeException(ie);
                }
            }
        }

        void indexData() throws FileNotFoundException, IOException {
            File[] files = getFilesToBeIndxed();
            for (File file : files) {
                FileReader fr = new FileReader(file);
                // To store the contents read via File Reader
                BufferedReader br = new BufferedReader(fr);
                // Read br and store a line in 'data'
                String data;
                System.out.println("start");
                while ((data = br.readLine()) != null) {
                    String newdata = data + ".com";
                    // Add the field to a Lucene Document
                    Document doc = new Document();
                    // doc.add(new Field("content", newdata,
                    //         Field.Store.NO, Field.Index.NOT_ANALYZED));
                    doc.add(new Field("reversecontent",
                            new StringBuffer(newdata).reverse().toString(),
                            Field.Store.NO, Field.Index.NOT_ANALYZED));
                    // Add this document to the Lucene index
                    indexWriter.addDocument(doc);
                }
                br.close();
                System.out.println("end");
            }
            /* Requests an "optimize" operation on an index, priming the index
               for the fastest available search */
            indexWriter.optimize();
            System.out.println("optimization done");
            /* Commits all changes to the index and closes all associated files. */
            indexWriter.close();
        }

        private File[] getFilesToBeIndxed() {
            File dataDir = new File(dataDirectory);
            if (!dataDir.exists()) {
                throw new RuntimeException(dataDirectory + " does not exist");
            }
            File[] files = dataDir.listFiles();
            return files;
        }
    }
-
Answer:
a. I believe the problem is that you're using a KeywordAnalyzer. This means each of your 100 million strings gets its own index term, unless some of them are identical. Try switching to, say, a StandardAnalyzer, which should make the index much more efficient.
b. Try testing this on a small scale (say 10,000 strings) to check that you are getting correct results.
c. I believe the Lucene users mailing list will give you better responses than Quora for a specific technical Lucene question.
d. Other than that, read http://www.manning.com/hatcher3/. This is the best book I know about information retrieval, and it explains Lucene like no other resource.
Yuval Feinstein at Quora
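One way to carry out the small-scale check suggested in point b is to compute a brute-force reference count over the sample and compare it with the hit count the index reports. The sketch below is not part of the original answer and uses only the JDK. The class name `ReferenceCount`, the method `countContaining`, and the sample file passed on the command line are hypothetical names chosen for illustration. It assumes case-sensitive matching, which seems appropriate for a field indexed without analysis, and it counts lines containing the keyword anywhere, which is what `*searchkeyword*` asks for.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

/** Brute-force reference count for sanity-checking index results on a small sample. */
class ReferenceCount {

    /**
     * Counts the lines that contain the keyword anywhere in the line.
     * Matching is case-sensitive (an assumption, see the text above).
     */
    static long countContaining(Stream<String> lines, String keyword) {
        return lines.filter(line -> line.contains(keyword)).count();
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: java ReferenceCount sample.txt bubble
        Path sample = Paths.get(args[0]);
        String keyword = args[1];
        // Files.lines streams the file lazily, so the sample is not loaded all at once.
        try (Stream<String> lines = Files.lines(sample, StandardCharsets.UTF_8)) {
            System.out.println("Reference count: " + countContaining(lines, keyword));
        }
    }
}
```

If the index was built with the ".com" suffix and reversed strings as in the question, the keyword has to be compared against the same form of the data, so either apply the same transformation to the sample lines or index the sample unmodified for the test. A linear scan like this is only meant for validating results on a few thousand strings, not for meeting the 1-2 second target on the full file.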