How to make a web crawler?

What are some really interesting web crawling projects?

  • I want to build a small project using a web crawler to get used to Python. Any ideas for things I could build? It could be a console or web application. Please share your experience or details if you have done this before :)

  • Answer:

    The Udacity CS101 course is the best resource for a beginner, and it's in Python :) http://www.udacity.com/overview/Course/cs101/CourseRev/apr2012 If you just want to build a crawler, skip to the related videos.

Dhiraj Thakur at Quora

Other answers

A nice little project that I've done in the past is a simple 'sites similar to X' recommendation engine. You crawl sites, strip out any HTML tags, and build a word list from the content of each one. Then you just need some metric for comparing how similar they are to one another (I used the Jaccard index, http://en.wikipedia.org/wiki/Jaccard_index, to calculate the similarity of the word lists for each site). The user can then enter a URL and get a list of the most similar sites (it could be a console app; I did mine with a web front end). It's a heap of fun to tweak, and you end up with something pretty usable at the end of it.

Tom Robert
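
For anyone who wants a starting point, the tag-stripping and word-list comparison described above might look roughly like this. It is only a sketch using Python's standard library, not the answerer's actual code; the names TextExtractor, word_set, jaccard and most_similar are made up for illustration, and the pages argument is assumed to be a dict of already-downloaded HTML keyed by URL.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def word_set(html):
    """Strip tags and return the set of lowercase words on the page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return set(re.findall(r"[a-z0-9]+", text))


def jaccard(a, b):
    """Jaccard index: size of the intersection over size of the union."""
    if not a and not b:
        return 0.0  # assumption: two empty pages count as not similar
    return len(a & b) / len(a | b)


def most_similar(target_url, pages):
    """Rank every other page in `pages` by similarity to the target page."""
    target_words = word_set(pages[target_url])
    scores = [
        (jaccard(target_words, word_set(html)), url)
        for url, html in pages.items()
        if url != target_url
    ]
    return sorted(scores, reverse=True)
```

Calling most_similar(url, pages) returns (score, url) pairs with the closest match first. Treating the word lists as sets ignores how often each word appears, which is what the plain Jaccard index does; removing very common words before comparing is one of the obvious things to tweak.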

I've got an ongoing, partially unresolved inquiry into converting sitemaps into mindmaps [1]. Under ideal conditions this would involve crawling a website, parsing the directory structure into a nested outline, and then inserting the crawled page titles into the outline. A cloud-based web API would be the ideal form for this. I can also recommend my Meta Guide webpage, "100 Best Web Crawler Videos" [2]. [1] http://www.meta-guide.com/home/ai-engine/mindmap-conversion [2] http://www.meta-guide.com/home/about/best-of-the-best-videos/100-best-web-crawler-videos

Marcus L Endicott
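
The outline step from the answer above (URL paths to a nested outline, with page titles filled in) could be sketched as follows. This is one possible reading of the idea, not the answerer's method: build_outline and render are hypothetical names, and the input is assumed to be a dict mapping each crawled URL to its page title.

```python
from urllib.parse import urlparse


def build_outline(pages):
    """Turn {url: title} into a tree keyed by URL path segment."""
    root = {"title": None, "children": {}}
    for url, title in pages.items():
        node = root
        segments = [s for s in urlparse(url).path.split("/") if s]
        for segment in segments:
            # Create intermediate directories even if they were never crawled.
            node = node["children"].setdefault(
                segment, {"title": None, "children": {}}
            )
        node["title"] = title
    return root


def render(node, depth=0):
    """Return the outline as indented lines, one per page or directory."""
    lines = []
    for segment in sorted(node["children"]):
        child = node["children"][segment]
        # Fall back to the path segment when no title was crawled.
        label = child["title"] or segment
        lines.append("  " * depth + "- " + label)
        lines.extend(render(child, depth + 1))
    return lines
```

Printing "\n".join(render(build_outline(pages))) gives an indented text outline, which many mindmap tools can import; the exact import format depends on the tool and is not covered here.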

http://Import.io is by far the best crawling software. Unfortunately, it will no longer be free starting next month (and it is overpriced). The best thing to do is to learn Python and use BeautifulSoup and/or any of the frameworks built on that library. It has no limits: with some research you can crawl any website, even ones known to be "uncrawlable" (Google results, LinkedIn, ...).

Hamadi Lanouar
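
To give a feel for the "learn Python" route, here is a rough sketch of the core crawl loop: pull the links out of a page, then visit them breadth-first. Note that it swaps in the standard library's html.parser in place of BeautifulSoup so that it runs with no extra installs. The names LinkExtractor, extract_links and crawl are illustrative, and fetch is an assumed callable that takes a URL and returns the page's HTML (or None on failure).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(base_url, html):
    """Return absolute http(s) links found in `html`, without duplicates."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    links = []
    for href in parser.hrefs:
        # Resolve relative links and drop "#fragment" parts.
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        if urlparse(absolute).scheme in ("http", "https") and absolute not in links:
            links.append(absolute)
    return links


def crawl(start_url, fetch, max_pages=50):
    """Breadth-first crawl that stays on the start URL's host."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        for link in extract_links(url, html):
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Passing fetch in as a parameter keeps the loop easy to try against a dict of canned pages before pointing it at a real site. For real use, fetch could wrap urllib.request.urlopen, and a polite crawler would also add a delay between requests and respect robots.txt.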
