What makes one search engine better than another?

How can one build a search engine for some specific search?

  • I want to build a search engine like http://www.indiabookstore.net/ for specific search over some collection of websites. Where should I start, and what tools or courses (like IR) should I take?

  • Answer:

    Here's what you need to do:

    Step 1: Find the domain and the set of websites (or types of websites) you need to crawl and index.
    Step 2: Write a crawler to get the data from these websites. Recursively check all the pages within each category and subcategory, like traversing a tree.
    Step 3: Parse the content of the webpages and index the data based on book titles, descriptions, and reviews (this is entirely up to you).
    Step 4: Create a retrieval mechanism. Define proper ranking measures so that you show the results a customer expects.
    Step 5: Create a front-end (of course, you know that pretty well).
    Optional: To stay up to date, repeat from Step 2 periodically.
    Additional: Make sure your systems can handle large amounts of data and retrieval requests.

    Here's how you're going to do these:

    Step 1 is one of the most crucial parts. Emulating IndiaBookStore would be simple, as they index a fixed set of websites. Remember that the amount and quality of what you retrieve depends on the 'seed'.

    Step 2: See http://stackoverflow.com/questions/102631/how-to-write-a-crawler and http://www.makeuseof.com/tag/build-basic-web-crawler-pull-information-website/ If you don't know much about IR, there is a Udacity course that gives a very nice intro and introduces you to crawlers as part of its programming projects: https://www.udacity.com/course/cs101

    Step 3: Begin with a SAX parser or a third-party library to retrieve the body of the pages. You will need lots and lots of regex. The next part is the most important component, definitely needed in any search facility: INDEXING. You need to build an inverted index which maps a word to the name/ID of a book or webpage. This link will introduce you to indexing and parsing: http://stackoverflow.com/questions/1786689/parsing-web-pages

    Step 4: All you have to do is retrieve relevant data from the index you already created in Step 3. To make retrieval faster, create a hierarchy of secondary indexes (like hashing).

    Step 5: You know it pretty well ;) If you are not a champ at that, you may want to hire a professional web designer who can improve the aesthetics and usability while you work on the tech part.

    Prof. Chris Manning's book http://nlp.stanford.edu/IR-book/ will teach you all you need for this project. And yes, use Java for this project. Java is cool!

    p.s. If you have some experience with software projects and have time to learn something new, try out a tool like Lucene, Solr, or Nutch (all Apache projects).
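To make the indexing and retrieval steps above concrete, here is a minimal sketch of an inverted index with simple AND retrieval. It is illustrative only (the answer itself recommends Java; function names like `build_index` and `search` and the tiny document set are made up for this example), and a real engine would add ranking, stemming, and persistence:

```python
import re
from collections import defaultdict

def tokenize(text):
    # Very naive tokenizer: lowercase and split on non-word characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]

def build_index(docs):
    """Build an inverted index: each word maps to the set of doc IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return IDs of documents containing every query word (AND semantics)."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

docs = {
    "book1": "The Art of Computer Programming",
    "book2": "Programming Pearls",
    "book3": "Introduction to Information Retrieval",
}
index = build_index(docs)
print(search(index, "programming"))            # {'book1', 'book2'}
print(search(index, "information retrieval"))  # {'book3'}
```

The "hierarchy of secondary indexes" the answer mentions corresponds here to the hash-based dictionary lookup: each query word is resolved in constant time rather than by scanning documents.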

Aditya Joshi at Quora


Other answers

Both sites most likely use concepts outlined in http://isbn.net.in/ and http://swaroopch.com/2010/12/31/isbnnetin-open-source/. Basically, they perform remote searches on a specific set of websites and retrieve the price and other details based on DOM structure. They also crawl and index information from these sites beforehand. Crawling and parsing: pretty much as the answer above has described. Nutch might be overkill if you are doing only 10-15 sites. Search: either https://lucene.apache.org/solr/ or http://www.elasticsearch.org I personally find Elasticsearch easier to set up and use. If you have decided on Java, I would recommend the http://www.grails.org framework. With its out-of-the-box scaffolding and plugin for Elasticsearch, you can potentially get the front end up and running in a day.
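Extracting "price and other details based on DOM structure", as this answer describes, can be sketched with Python's standard-library HTML parser. The markup and the `class="price"` attribute below are hypothetical; a real scraper has to be adapted to each target site's actual DOM:

```python
from html.parser import HTMLParser

class PriceParser(HTMLParser):
    """Collect the text of any element whose class list contains 'price'.

    The 'price' class name is an assumption for illustration; inspect the
    real site's DOM to find the correct selector.
    """
    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = dict(attrs).get("class", "")
        if "price" in classes.split():
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())

# Hypothetical product markup for demonstration.
html = '<div class="book"><span class="title">IR Book</span><span class="price">Rs. 450</span></div>'
parser = PriceParser()
parser.feed(html)
print(parser.prices)  # ['Rs. 450']
```

For only 10-15 sites, one small parser class per site like this is often simpler to maintain than a generic extraction framework.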

Haribabu Thilakar

There is an open-source web crawler called Nutch. It has facilities to parse the crawled data and to build a search engine from that data using Solr: https://nutch.apache.org/ https://lucene.apache.org/solr/

Dan Brown

A great option is to build it with Elasticsearch! Elasticsearch is a distributed, real-time search and analytics platform which will help you build a modern search engine. If you have any Python experience, this quick tutorial will help you build a great search engine: http://blog.tryolabs.com/2015/02/17/python-elasticsearch-first-steps/ Hope it is helpful!
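As a rough idea of what the tutorial above involves, a full-text search in Elasticsearch is just a JSON query body. The sketch below builds one as a plain dict so it runs without a server; the commented lines show roughly how it would be sent with the official `elasticsearch` Python client (the index name "books" and the field names are assumptions):

```python
def book_search_query(text, fields=("title", "description"), size=10):
    """Build an Elasticsearch multi_match query body for full-text book search."""
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": list(fields),
            }
        },
    }

query = book_search_query("information retrieval")
print(query["query"]["multi_match"]["query"])  # information retrieval

# Against a running cluster, this would be sent with the official client,
# roughly like this (untested here; requires an Elasticsearch server):
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# results = es.search(index="books", body=query)
```

`multi_match` searches several fields at once and ranks hits by relevance, which covers the "proper ranking measures" the first answer asks for without implementing scoring yourself.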

Martin Fagioli
