Solucija - What are the best resources to learn about web crawling and scraping?

Answer:

I found these articles and Q&As useful:
Introduction to Compassionate Screen Scraping: http://dev.lethain.com/an-introduction-to-compassionate-screenscraping/
An answer to "Etiquette of Screen-scraping Stack Overflow": http://meta.stackoverflow.com/questions/443/etiquette-of-screen-scraping-stack-overflow
Options for HTML scraping: http://stackoverflow.com/questions/2861/options-for-html-scraping
Blog post: http://petewarden.typepad.com/searchbrowser/2010/04/how-i-got-sued-by-facebook.html
Blog post: http://paulbutler.org/archives/groupon-math-data-scraping-to-estimate-revenue/
Web scraping legalities: http://stackoverflow.com/questions/396778/legalities-of-screen-scraping
http://www.emanueleminotto.it/how-to-write-a-crawler

Also see these HN threads:
http://news.ycombinator.com/item?id=158902
http://news.ycombinator.com/item?id=2450841
http://news.ycombinator.com/item?id=150077
http://news.ycombinator.com/item?id=96057

Other answers

The best way, IMHO, to learn web crawling and scraping is to download and run an open-source crawler such as Nutch or Heritrix. They are fairly simple to use, and you will very quickly have some crawled data to play with. For scraping, the best thing to do is to write a simple web agent: a program that fetches the source HTML of web pages and processes it. Most modern scripting languages (e.g. PHP, Python, Perl) include built-in primitives and libraries for fetching source HTML in a single line of code. The easiest way to process the fetched HTML is to use the regular expression facilities of these languages, which are easy to use and quite comprehensive. Regular expressions are only the first step; there are many other ways to do more detailed and faster processing.

Borislav Agapiev
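
As a rough illustration of the "simple web agent" Borislav Agapiev describes, here is a minimal Python sketch using only the standard library. It is not from the original answer: the function names and the two patterns are assumptions made for this example, and the patterns are deliberately simple, so they will miss many real-world pages.

```python
import re
import urllib.request

# Hypothetical, deliberately simple patterns for this sketch.
# They only handle double-quoted href attributes and a plain <title> tag.
LINK_RE = re.compile(r'<a\s[^>]*?href="([^"]+)"', re.IGNORECASE)
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def fetch_html(url, timeout=10):
    """Fetch the source HTML of a page and decode it to text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_title(html):
    """Return the page title, or None if no <title> element is found."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def extract_links(html):
    """Return the href values of <a> tags, in document order."""
    return LINK_RE.findall(html)


if __name__ == "__main__":
    # Offline sample so the sketch runs without network access.
    sample = (
        "<html><head><title> Demo </title></head><body>"
        '<a href="/a">A</a> <a class="x" href="http://example.com/b">B</a>'
        "</body></html>"
    )
    print(extract_title(sample))
    print(extract_links(sample))
```

To process a live page, pass the result of fetch_html(url) to the two extract functions; be polite about request rates when doing so.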

Web scraping using ASP.NET (http://www.amazon.com/Data-Scraping-Information-Web-ASP-NET/dp/B00095M8P6)
Web scraping using PHP (http://www.phparch.com/books/phparchitects-guide-to-web-scraping-with-php/)
Mining the Web (http://www.amazon.com/Mining-Web-Discovering-Knowledge-Hypertext/dp/1558607544/ref=sr_1_3?ie=UTF8&s=books&qid=1288062780&sr=1-3)
Introduction to Information Retrieval (http://nlp.stanford.edu/IR-book/information-retrieval-book.html)
For general purposes, I recommend the last two books.

Vineet Yadav

The most important tools in scraping are an HTML parser (NekoHTML, the Mozilla parser) and a query language (XPath, regexp, GATE, Tregexp). On this basis, complex rule-based systems can be built, with or without machine learning. Many web testing frameworks, such as HtmlUnit and Selenium, are also useful for building scraping tools.

Yura Koroliov
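
To make the "parser plus query language" idea from Yura Koroliov's answer concrete, here is a small Python sketch. It is an illustration only, not one of the tools named above: it assumes the page is well-formed XHTML so that the standard library's xml.etree.ElementTree (which supports a limited XPath subset) can parse it. The sample markup, the table id and the function name are all made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML sample. Real pages are often not
# well-formed and would need a tolerant HTML parser instead.
SAMPLE = """<html><body>
<table id="prices">
<tr class="item"><td>apple</td><td>1.20</td></tr>
<tr class="item"><td>pear</td><td>0.90</td></tr>
<tr class="footer"><td>total</td><td>2.10</td></tr>
</table>
</body></html>"""


def extract_items(xhtml):
    """Parse the document, then select rows with one XPath-style rule."""
    root = ET.fromstring(xhtml)
    # The rule: item rows inside the table whose id is 'prices'.
    rows = root.findall(".//table[@id='prices']/tr[@class='item']")
    # Each row has two <td> children: a name and a price.
    return [(row[0].text, float(row[1].text)) for row in rows]


if __name__ == "__main__":
    print(extract_items(SAMPLE))
```

The extraction rule lives in a single query string, which is what makes it practical to build larger rule-based systems on top of a parser.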

The best way to learn web crawling is to learn the Python Scrapy framework. It is very simple to use, and for crawling AJAX/JavaScript-heavy sites we can use PhantomJS along with Scrapy. Scrapy crawling is fast, since it uses asynchronous operations (on top of Twisted). Scrapy also has good, fast support for parsing (X)HTML on top of libxml2. Scrapy is a mature framework with full Unicode support, redirection handling, gzipped responses, handling of odd encodings, an integrated HTTP cache, etc. I suggest reading the Scrapy documentation, which is the best way to learn it from a programmer's perspective. Here is the link: http://scrapy.org/doc/ For an introduction and to grab some basic ideas, read http://quadloops.com/scrapy-python-web-scrapping-framework/

Tony Paul
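
Tony Paul's point about asynchronous operations can be illustrated without Scrapy. The sketch below is not Scrapy or Twisted code; it uses Python's standard asyncio module and a simulated request (fake_fetch, a made-up stand-in that just sleeps) to show why overlapping the waiting time of many requests speeds up a crawl.

```python
import asyncio
import time


async def fake_fetch(url, delay=0.1):
    """Stand-in for a network request: waits `delay` seconds, returns fake HTML."""
    await asyncio.sleep(delay)
    return f"<html><title>{url}</title></html>"


async def crawl(urls):
    """Start all requests at once and wait for them together."""
    # The waits overlap, so the total time is roughly one delay,
    # not one delay per URL as in a sequential loop.
    return await asyncio.gather(*(fake_fetch(url) for url in urls))


def timed_crawl(urls):
    """Run the crawl and return (pages, elapsed_seconds)."""
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls))
    return pages, time.perf_counter() - start


if __name__ == "__main__":
    pages, elapsed = timed_crawl(["a", "b", "c", "d", "e"])
    print(len(pages), "pages in", round(elapsed, 2), "seconds")
```

With five simulated requests of 0.1 seconds each, a sequential loop would need about 0.5 seconds, while the concurrent version finishes in roughly the time of one request.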

I think it depends on how you want to do web scraping and web crawling. You can learn to master a programming language, or to master some web scrapers. You may get some ideas from these articles:
http://www.octoparse.com/blog/scraping-websites-what-for/
http://www.octoparse.com/blog/extract-text-from-html-document/

"1. Programming language
For simple HTML documents, people who have basic coding knowledge can write a program that removes all HTML tags and retains only the text inside the HTML files, using regular expressions or XPath. There are several widely used programming languages available, such as C#, Java, Python, JS, PHP, Go and NodeJs. You can pick a suitable one to start your project. Some of these languages have their own HTML parsers that are freely available online; you can learn more about these parsers here: https://en.wikipedia.org/wiki/Comparison_of_HTML_parsers. It is worth mentioning that the code you write can only be used for one type of web page, which means that different types of web pages need different code. Besides, you need to test your code after you have written it, and writing and testing code takes longer for people who have no experience.

2. Web data extraction tools
There are many powerful web extraction tools, such as http://import.io, Mozenda and Octoparse, that let users harvest almost everything on a web page, including the text, links, images, etc. You can convert what you get into a structured data format. You don't need to write any code, so it's especially good for those who have no coding experience. In most cases, you don't need to write regular expressions or XPath. The visual interface lets users interact with the web page more easily. It's easy to check and export the data without any IDE."

I found some useful web scraping tools that may help you better fetch what you need. :)
http://www.octoparse.com/?qu
http://Dataextractionservices.com
http://Habiledata.com
http://Computyne.com
http://datahut.co/
http://Datoin.com
http://grepsr.com
http://priceonomics.com
http://promptcloud.com
http://scrapehero.com
http://scrapinghub.com
http://thewebminer.com
http://vnpglobal.com
http://webscraping.com
http://webrobots.io
http://80legs.com
http://apifier.com
http://cloudscrape.com
http://datafiniti.net
http://DataScraping.co
http://diffbot.com
http://fminer.com
http://GooSeeker.com
http://Import.io
http://moreover.com
http://mozenda.com
http://parsehub.com
http://scrape.it
http://spinn3r.com
http://thepriceminer.com
http://uipath.com
http://webcrawling.org

Daisy Hung
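
The "remove all HTML tags and retain only the text" program mentioned in Daisy Hung's answer could look roughly like the following Python sketch. It is only one possible approach, not taken from the linked articles: it uses the standard library's html.parser instead of regular expressions or XPath, and the class and function names are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores tags, scripts and styles."""

    SKIP = {"script", "style"}  # elements whose content is not page text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def html_to_text(html):
    """Return the text content of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = (
        "<html><head><style>p{}</style></head><body><h1>Title</h1>"
        "<p>Hello <b>world</b> &amp; all</p><script>var x=1;</script>"
        "</body></html>"
    )
    print(html_to_text(page))
```

Unlike site-specific extraction rules, this kind of tag stripping works on any page, but it loses all structure, so it is mainly useful when you only need the raw text.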

There are hundreds of scraping tools on the market, but there is one tool I keep mentioning because it is very good compared to other scraping tools: it is easy to understand, it is regularly updated with new versions, it gives the expected output, and it has a lot of user-friendly features. It is called "Easy Data Feed" and it is available at Easy Data Feed - Web Data Extraction Scraping Software. These are some features of this tool: data manipulation, multiple profiles, custom values, and measurement conversion. You can read about how to use it here: http://www.easydatafeed.com/open-source/ They also have developers you can hire to do the job for you; their Skype is "easydatafeed".

Nor Rieh

It depends on what you want to learn and why. If you just want to be able to scrape sites, there isn't much need to learn these days, with companies like Import.io (http://import.io) and OutWit Hub. http://support.import.io has some good articles on how to use their tool to get data from websites. If you want to learn how to program, ScraperWiki is good, and I've heard Python is THE language to use. There is no shortage of tutorials out there on navigating XPath and on why NOT to use regex on HTML: http://blog.codinghorror.com/parsing-html-the-cthulhu-way/

Daniel Cave
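
Daniel Cave's warning about regex on HTML can be seen in a few lines of Python. The sketch below is an illustration made up for this page, not taken from the linked article: it compares a naive link regex with the standard library's html.parser on a small sample with a single-quoted, upper-case link and a commented-out link.

```python
import re
from html.parser import HTMLParser

# Naive pattern: assumes lower-case tags, href first, double quotes.
NAIVE_LINK_RE = re.compile(r'<a href="([^"]*)"')

SAMPLE = """
<a href="/one">one</a>
<A CLASS="nav" HREF='/two'>two</A>
<!-- <a href="/old">commented out</a> -->
"""


class LinkCollector(HTMLParser):
    """Collects href values from real <a> start tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names for us.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def links_with_regex(html):
    return NAIVE_LINK_RE.findall(html)


def links_with_parser(html):
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


if __name__ == "__main__":
    print(links_with_regex(SAMPLE))
    print(links_with_parser(SAMPLE))
```

On this sample the regex misses the second link and picks up the commented-out one, while the parser returns the two links a browser would actually show. A more careful regex can fix each case individually, but the list of special cases keeps growing.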

Manning, Raghavan, and Schütze's Introduction to Information Retrieval has a couple of interesting chapters on basic web search, including crawling. You can find a copy of the book online at http://nlp.stanford.edu/IR-book/information-retrieval-book.html

Anonymous

Web scraping can be pretty tricky, so it's really helpful to use web scraping tools, both to assist you and to teach you more about it! http://www.kimonolabs.com is a great free web scraping tool that uses CSS selectors to save the data structure of the properties you wish to scrape and then uses them to automatically extract the data for you. It also lets you modify your results with a customizable JS function, so you can make adjustments beyond HTML tag delineations. Though Kimono is very user friendly, it's also super powerful, and people have used it for complex projects like this awesome interactive map of no-fly zones: http://builtwith.kimonolabs.com/post/101375093149/mapping-no-fly-zones-this-awesome-interactive-map P.S. I totally work here, but it really is an awesome way to get more familiar with web scraping, and it has taught me so much about data structuring on the web!

Laura Nguyen
