How to make a web crawler?

What are all the concepts I need to understand in order to build a web crawler?

  • I want to make a web crawler just for fun, but I'd like to know what I need to understand before starting. Are there particular data structures or technologies I need to know?

  • Answer:

There is a really nice course on http://www.udacity.com (Intro to CS) that builds a web crawler in Python. It's a good way to start off and learn the basics.

Anonymous at Quora


Other answers

A web crawler is basically very simple:

    # the answer's pseudocode made runnable; assumes the third-party
    # requests and beautifulsoup4 libraries
    import requests
    from bs4 import BeautifulSoup

    def crawl(page):
        html = requests.get(page).text      # GET page
        soup = BeautifulSoup(html, "html.parser")
        links = [a["href"] for a in soup.find_all("a", href=True)]
        # do something with html here
        for link in links:                  # (no duplicate detection yet)
            crawl(link)

(Also, if you are a good citizen, you will respect robots.txt.) So the important things to understand are HTTP and HTML: enough to issue GET requests (easy), parse the HTML (very hard; use a library), and do whatever you want with each page.
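For the robots.txt part, here is a minimal sketch using Python's standard urllib.robotparser; the URL and user-agent string are placeholders, not anything prescribed:

    # hedge: example.com and MyCrawler/1.0 are hypothetical
    from urllib.robotparser import RobotFileParser

    rp = RobotFileParser("https://example.com/robots.txt")
    rp.read()                                # fetch and parse robots.txt
    if rp.can_fetch("MyCrawler/1.0", "https://example.com/some/page"):
        print("allowed to crawl this page")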

Jonathan Paulson

I'm somewhat surprised that no one here has mentioned duplicate detection: once you crawl a URL, you don't want to crawl it again anytime soon. Past the basic link-extracting and crawling logic, it depends on exactly what you want to do. Some pages require cookies or render content with JavaScript, so interpreting their content may not be as simple as just searching through their HTML for hrefs and other interesting attributes.
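A minimal sketch of that duplicate detection, assuming an in-memory set is enough (a large crawl would want a database or a Bloom filter instead); fetch and get_links are stand-ins for the HTTP and parsing steps from the earlier answer:

    # hypothetical sketch: remember visited URLs so each is fetched once
    visited = set()

    def crawl(page):
        if page in visited:
            return                       # already crawled; skip it
        visited.add(page)
        html = fetch(page)               # stand-in for an HTTP GET
        for link in get_links(html):     # stand-in for link extraction
            crawl(link)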

David Goldstein

The last crawler I wrote was in Node.js because of its asynchronous I/O: I was able to launch plenty of requests at once. It was also easy to embed a backend with some basic stats. Plus I wanted to dive into a "new" technology. Node.js did not disappoint, and it was extremely easy to build a crawler in, although JS can get a bit messy!
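The same fire-off-many-requests-at-once idea, as a rough Python sketch using the standard library's thread pool; requests is a third-party assumption and the URLs are placeholders:

    import requests
    from concurrent.futures import ThreadPoolExecutor

    urls = ["https://example.com/a", "https://example.com/b"]   # hypothetical

    def fetch(url):
        return url, requests.get(url, timeout=10).text

    with ThreadPoolExecutor(max_workers=8) as pool:
        for url, html in pool.map(fetch, urls):
            print(url, len(html))        # do something with each page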

Jack Penny

Web crawling has become very easy with the availability of open source tools such as jsoup, tagsoup, etc. Jsoup even allows you to select elements with CSS selectors, which is a big bonus if you already have some web development knowledge:

    // Jsoup's connect/select API, as in the original snippet
    Document doc = Jsoup.connect(url).get();
    Elements links = doc.select("a[href]");   // all the links on the page

With the above code you now have all the links in a page; that's how powerful Jsoup is. For building a web crawler, you need to know basic HTML in terms of how a page is structured. Most web crawlers use XPath or CSS selectors to extract elements, so understanding them would be a big bonus. There are tons of commercial and open source options available; looking at them before you build yours would give you a good understanding. As an earlier answer mentions, respect robots.txt, or else your IP could be blacklisted.
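For the XPath side of that, a short sketch assuming the third-party lxml library; page_html stands in for HTML you have already fetched:

    from lxml import html as lxml_html

    tree = lxml_html.fromstring(page_html)    # parse the fetched HTML
    links = tree.xpath("//a/@href")           # every href value on the page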

Arjun Raman

Creating a web crawler is simple. The following recursive algorithm is used in most web crawlers (a sketch follows below):

  1) Crawl the given URL and store the URL and its HTML content in a DB.
  2) Find all <a> tags in the current page and fetch each one's href value.
  3) For each <a> tag's href, check:
     a) whether the address contains the root address of the website (accomplish this with a substring match, e.g. grep);
     b) whether this URL is being crawled for the first time (accomplish this with a simple DB query);
     c) if a) and b) are both true, go to step 1 with that URL.
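A minimal Python sketch of that algorithm, assuming sqlite3 for the crawled-before check and a plain substring test for the root-address check; the root URL, database name, and table layout are illustrative only:

    import sqlite3
    import requests
    from bs4 import BeautifulSoup

    ROOT = "https://example.com"       # hypothetical root address
    db = sqlite3.connect("crawl.db")
    db.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT)")

    def crawl(url):
        html = requests.get(url).text
        db.execute("INSERT OR IGNORE INTO pages VALUES (?, ?)", (url, html))  # step 1
        db.commit()
        links = BeautifulSoup(html, "html.parser").find_all("a", href=True)  # step 2
        for a in links:
            link = a["href"]
            first_time = db.execute(
                "SELECT 1 FROM pages WHERE url = ?", (link,)).fetchone() is None
            if ROOT in link and first_time:    # steps 3a and 3b
                crawl(link)                    # step 3c: back to step 1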

Vishnu Jayavel
