How can one build a search engine for some specific search?
-
I want to build a search engine like http://www.indiabookstore.net/ for specific searches over some collection of websites. Where should I start, and what tools or courses (like IR) should I take?
-
Answer:
Here's what you need to do:

Step 1 : Find the domain and the set of websites (or type of websites) you need to crawl and index.
Step 2 : Write a crawler to get the data from these websites. Recursively check all the pages within each category and subcategory, like traversing a tree.
Step 3 : Parse the content of their webpages and index the data based on book titles, descriptions and reviews (this depends entirely on you).
Step 4 : Create a retrieval mechanism. Define proper ranking measures so that you show the results a customer expects.
Step 5 : Create a front-end (of course, you know that pretty well).
Optional : To stay up to date, repeat from Step 2 periodically.
Additional : Make sure your systems can handle large amounts of data and many retrieval requests.

Here's how you're going to do these:

Step 1 is one of the most crucial parts. Emulating IndiaBookStore would be simple, as they index a fixed set of websites. Remember that the amount and quality of what you retrieve depends on the 'seed'.

Step 2 :
http://stackoverflow.com/questions/102631/how-to-write-a-crawler
http://www.makeuseof.com/tag/build-basic-web-crawler-pull-information-website/
If you don't know much about IR, you can get an overview from a Udacity course that gives a very nice intro. The course introduces you to crawlers as part of its programming projects. https://www.udacity.com/course/cs101

Step 3 : Begin with a SAX parser or a third-party library to retrieve the body of the pages. You will need lots and lots of regex. The next part is the most important component, definitely needed in a search facility: INDEXING. You need to build an inverted index, which maps a word to the name/ID of a book or webpage. These links will introduce you to indexing and parsing: http://stackoverflow.com/questions/1786689/parsing-web-pages

Step 4 : All you have to do is retrieve relevant data from the index you already created in Step 3. To make retrieval faster, make sure to create a hierarchy of secondary indexes (like hashing).

Step 5 : You know it pretty well ;) If you are not a champ with that, you might want to hire a professional web designer who can improve the aesthetics and usability while you work on the tech part.

Prof. Chris Manning's book, http://nlp.stanford.edu/IR-book/, will teach you all you need to work on this project. And yes, use Java for this project. Java is cool!

P.S. If you have some experience with software projects and have time to learn something new, try out tools like Lucene, Solr and Nutch (all are Apache projects).
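Steps 2 to 4 above can be sketched end to end in a few lines. This is only a minimal illustration, not how IndiaBookStore or any real engine is built: the PAGES dictionary is a made-up in-memory stand-in for real HTTP fetches, all function names are hypothetical, the ranking is a plain TF-IDF sum chosen for simplicity, and Python is used for brevity even though the answer suggests Java.

```python
import math
import re
from collections import Counter, defaultdict, deque
from html.parser import HTMLParser

# Hypothetical in-memory "site" standing in for real HTTP fetches.
PAGES = {
    "/": '<a href="/fiction">Fiction</a> <a href="/tech">Tech</a>',
    "/fiction": '<a href="/book/1">Malgudi Days</a> <a href="/">Home</a>',
    "/tech": '<a href="/book/2">IR book</a>',
    "/book/1": "<h1>Malgudi Days</h1><p>Short stories set in a small town.</p>",
    "/book/2": "<h1>Introduction to Information Retrieval</h1>"
               "<p>Indexing, retrieval and ranking of text.</p>",
}

class PageParser(HTMLParser):
    """Collects outgoing links and visible text from one page."""
    def __init__(self):
        super().__init__()
        self.links, self.text = [], []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links += [v for k, v in attrs if k == "href" and v]

    def handle_data(self, data):
        self.text.append(data)

def crawl(seed):
    """Step 2: breadth-first crawl from the seed, visiting each URL once."""
    seen, queue, docs = {seed}, deque([seed]), {}
    while queue:
        url = queue.popleft()
        parser = PageParser()
        parser.feed(PAGES.get(url, ""))
        docs[url] = " ".join(parser.text)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return docs

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Step 3: inverted index mapping term -> {url: term frequency}."""
    index = defaultdict(dict)
    for url, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][url] = tf
    return index

def search(index, n_docs, query):
    """Step 4: rank matching pages by a simple TF-IDF sum."""
    scores = Counter()
    for term in tokenize(query):
        postings = index.get(term, {})
        if postings:
            idf = math.log(n_docs / len(postings)) + 1.0
            for url, tf in postings.items():
                scores[url] += tf * idf
    return [url for url, _ in scores.most_common()]

docs = crawl("/")
index = build_index(docs)
print(search(index, len(docs), "information retrieval"))  # prints ['/book/2']
```

In a real crawler the PAGES lookup would be replaced by an HTTP fetch with politeness delays and robots.txt checks, and the index would live on disk or in a tool such as Lucene rather than in a dictionary.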
Aditya Joshi at Quora
Other answers
Both sites most likely use concepts outlined in http://isbn.net.in/ and http://swaroopch.com/2010/12/31/isbnnetin-open-source/. Basically, they perform remote searches on a specific set of websites and retrieve price and other details based on DOM structure. They also crawl and index information from these sites beforehand.

Crawling and Parsing: Pretty much as already described. Nutch might be overkill if you are doing only 10-15 sites. Some options are given in .

Search: Either https://lucene.apache.org/solr/ or http://www.elasticsearch.org. I personally find Elasticsearch easier to set up and use.

If you have decided on Java, I would recommend the http://www.grails.org framework. With its out-of-the-box scaffolding and its plugin for Elasticsearch, you can potentially get the front end up and running in a day.
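Retrieving "price and other details based on DOM structure" can be illustrated with a small sketch. The markup in SAMPLE_HTML and the class names "title" and "price" are invented for the example; every real store uses its own structure, so the set of wanted fields would have to be adapted per site. Only the Python standard library is used.

```python
from html.parser import HTMLParser

# Hypothetical product-page snippet; real sites use their own markup.
SAMPLE_HTML = """
<div class="product">
  <h1 class="title">Malgudi Days</h1>
  <span class="price">Rs. 175</span>
  <span class="seller">Example Store</span>
</div>
"""

class FieldExtractor(HTMLParser):
    """Captures the text of elements whose class names a wanted field."""
    def __init__(self, wanted):
        super().__init__()
        self.wanted = set(wanted)
        self.current = None  # field whose text we are waiting for, if any
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in self.wanted:
                self.current = name

    def handle_data(self, data):
        # Store the first non-blank text seen after a wanted start tag.
        if self.current and data.strip():
            self.fields[self.current] = data.strip()
            self.current = None

def extract(html, wanted=("title", "price")):
    """Return a dict of wanted field name -> extracted text."""
    parser = FieldExtractor(wanted)
    parser.feed(html)
    return parser.fields

details = extract(SAMPLE_HTML)
print(details)  # prints {'title': 'Malgudi Days', 'price': 'Rs. 175'}
```

For a handful of sites, one small set of rules like this per site is often enough; the extracted fields can then be sent to Solr or Elasticsearch for indexing.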
Haribabu Thilakar
There is an open source web crawler called Nutch. It has facilities to parse the crawled data and to build a search engine from that data using Solr: https://nutch.apache.org/ https://lucene.apache.org/solr/
Dan Brown
A great option is to build it with Elasticsearch! Elasticsearch is a distributed, real-time search and analytics platform that will help you build a modern search engine. If you have any Python experience, this quick tutorial will help you build a great search engine: http://blog.tryolabs.com/2015/02/17/python-elasticsearch-first-steps/ Hope it is helpful!
Martin Fagioli