What are some good tools for interactive web scraping and automation?

What are the fastest open source scraping tools?

  • Ones that can parse millions of web pages in a reasonable amount of time. Some of the desired features would be multi-threaded HTTP/1.1 persistent connections, distributed capabilities, ease of use, XPath or similar.

  • Answer:

    It's difficult to say what's "fastest", since it really depends on how you use the crawler. For example, if you use Scrapy (http://scrapy.org/) you can choose how (if at all) you want to parse web pages. The most common way to extract data in Scrapy is with XPath, relying on the underlying libxml2 library. This parsing and extraction is usually the bottleneck, and any solution that uses something as convenient as XPath typically comes with similar (or worse) overhead.

    Some crawlers that are worth a mention when talking about performance:

    • Apache Nutch (http://nutch.apache.org/): distributed, with support for parsing and extraction. It also has an extension mechanism via plugins. This one sounds like it meets a lot of your requirements.

    • Heritrix (https://webarchive.jira.com/wiki/display/Heritrix/Heritrix): the Internet Archive's web-scale crawler.

    For a more complete list of tools see

    I'm disappointed that the other answers to this question are marketing for closed source solutions, despite the question.
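To make the XPath-style extraction mentioned above concrete, here is a minimal sketch. It does not use Scrapy: it assumes well-formed markup and relies on Python's standard-library `xml.etree.ElementTree`, which supports only a limited XPath subset. The sample page, the class names and the `extract_products` helper are all made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page (real-world HTML is often not
# well-formed and would need a more lenient parser).
PAGE = """<html><body>
<div class="product"><h2>Widget</h2><span class="price">9.99</span></div>
<div class="product"><h2>Gadget</h2><span class="price">24.50</span></div>
</body></html>"""


def extract_products(markup):
    """Return a list of (name, price) tuples selected with XPath-style paths."""
    root = ET.fromstring(markup)
    products = []
    # Select every <div class="product"> anywhere in the tree.
    for node in root.findall(".//div[@class='product']"):
        name = node.findtext("h2")
        price = node.findtext("span[@class='price']")
        products.append((name, float(price)))
    return products
```

Building the tree is the costly step here, which is consistent with the point above that parsing, rather than fetching, tends to be the bottleneck.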

Shane Evans at Quora


Other answers

The fastest scraping tools can be developed in a Java environment by implementing proxy management (multiple proxies) with a multi-threaded architecture. Scraping is the art of first analysing the internal structure of a website and then developing the crawler to maximize scraping speed and overcome all the obstacles the website imposes.

In my view, there are no open source scraping tools available in the market. There are a few free providers (for example http://www.scrapingexpert.com/) which ask the user to provide the website list and then scrape it automatically (http://www.scrapingexpert.com/) without asking the user to map the fields. However, the code base of such free providers is not available either.

TANSTAAFL: these providers let users fetch data from structured websites free of charge, but in return they earn heavily from the traffic on their own website. The scraped content provided by such websites is poor in quality.

Heena Desai

To start with, let's explain the meaning of scraping. Web scraping is a software technique for extracting information from websites. At first sight, free open source software may seem like a great solution for your business, but it isn't. Open source tools usually offer poor quality. It is hard to scrape data by yourself and, to my mind, it is better to hire a specialist for this hard task.

Web scraping tools like Easy Data Feed are the easiest way to extract data from any website to develop your online business. It is an exclusive desktop shopping cart migration tool. Easy Data Feed is a perfect solution for those who are planning to move their current store to a better shopping cart solution or want to move to the latest version of their current cart. Moreover, the tool can be used for quickly downloading the inventory, pricing, and product information from a competitor's website into a usable spreadsheet.

By the way, EDF is powered by Shopping.CartElite! I was satisfied when I found Shopping.CartElite. They are a powerful and easily scalable turn-key solution for e-commerce businesses. You will not need heavy development or integrations with them, and when you're ready, you can quickly scale.

Disclosure: I wrote this post, and I have used the company.

Bryan Dersam

http://easydatafeed.com is the fastest open source scraping tool. It is:

  • Free

  • Open source

  • Well working

  • Always kept updated

In the Easy Data Feed software, we can:

  • Run daily schedules

  • Set scrape limits

  • Visit as anonymous

  • Set thread and speed

  • Do password-protected scrapes

  • Solve Decaptcha

  • Use proxies

That's why I think it is the fastest and best web scraping / data scraping tool. You can read about how to use it here: http://www.easydatafeed.com/open-source/ They also have developers you can hire to do the job for you; their Skype is "easydatafeed".

David Johnson

Scrapy (http://scrapy.org/) seems a safe bet. Other options are jSpider (http://j-spider.sourceforge.net), jARVEST (http://sing.ei.uvigo.es/jarvest/index.html) and Web Harvest (http://web-harvest.sourceforge.net/). I'm sure there are many more.

Bala Mahadevan

There are several web scraping tools available in the market. One of the best web scraping tools is developed by the netUcon company. This company specializes in services like:

1. .NET development projects (Microsoft .NET Framework 1.1/2.0/3.0/3.5/4.0/4.5)
2. Ecommerce integration (Amazon integration, eBay integration, Shopify integration, Volusion integration)
3. Web data scraping (Yelp, Just Dial, CARiD, LinkedIn, Amazon, government websites, social networking sites and so on)
4. QuickBooks integration
5. Accounting software integration
6. Custom website development
7. ERP development
8. Data entry
9. Data mining
10. Lead generation on LinkedIn, Twitter and Facebook
11. BPO: data processing
12. Digital marketing and so on

You can use the LinkedIn Connection Creator (LCC) that it developed. The LCC is useful for scraping CEOs: seeking connections to different CEOs, creating B2B contacts, creating B2C contacts, lead generators, digital marketers, bloggers who post their blogs on LinkedIn, and so on.

For more details have a look: https://www.dropbox.com/s/zhcgcpojwz6wvfh/LinkedIn%20Connection%20Creator%28LCC%29.docx?dl=0

You can read more about netUcon here: http://www.netucon.com/ They also have developers you can hire to do the job for you; their Skype is "netrocks7".

Ati Jain

You can try VietSpider - http://vietspider.org/webextractor/

- The tool is advanced but has an easy-to-use interface.
- Implements Web 3.0 crawling and the Website Parse Template concept (http://en.wikipedia.org/wiki/Website_Parse_Template/).
- The web crawler can use proxies, and multi-threading is configurable per website.
- VietSpider includes a built-in web browser and supports JavaScript.
- Many built-in plugins, such as plugins to synchronize data with Joomla, Drupal, WordPress, vBulletin, ...
- Supports exporting data to MS Excel, CSV, XML, ...
- Data cleansing and transformation.

Thuan Nhu Dinh

I like to use Python's Requests and multiprocessing libraries. I have been able to achieve faster speeds with this combo than with Scrapy (with the bottleneck being network speed). Then you can use lxml (which supports XPath) or Beautiful Soup to parse. Beautiful Soup will eat up a lot more processing power than lxml, but it is more lenient on syntax. Use Sessions in Requests for increased speed. Also, if you can use regex to just parse a few data points, that is usually faster than using lxml.
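A rough sketch of that pattern, under some assumptions: it uses the thread-based `ThreadPool` from the `multiprocessing` package (threads rather than processes, since the work is network-bound), keeps one `requests.Session` per worker thread, and pulls a single data point with a regex. The names `TITLE_RE`, `extract_title`, `fetch_with_session` and `scrape_titles` are invented for this example, and `requests` is a third-party package that is only imported when a real fetch happens.

```python
import re
import threading
from multiprocessing.pool import ThreadPool

# Hypothetical pattern for illustration: text of the first <title> tag.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)

# One Session per worker thread, so each thread reuses its own connections.
_local = threading.local()


def extract_title(html):
    """Return the page title via regex, or None if there is no <title>."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def fetch_with_session(url):
    """Download url with a per-thread requests.Session (keep-alive reuse)."""
    import requests  # third-party; imported lazily so the rest runs without it

    session = getattr(_local, "session", None)
    if session is None:
        session = _local.session = requests.Session()
    response = session.get(url, timeout=10)
    response.raise_for_status()
    return response.text


def scrape_titles(urls, fetch=fetch_with_session, workers=8):
    """Fetch all urls concurrently and return a {url: title} dict."""
    def work(url):
        return url, extract_title(fetch(url))

    with ThreadPool(workers) as pool:
        return dict(pool.map(work, urls))
```

Calling `scrape_titles(list_of_urls)` would fetch over the network with the default fetcher; passing a different `fetch` callable makes it easy to try the extraction on cached pages. A regex like this is fast but brittle, so lxml or Beautiful Soup remain the safer choice once more than a few fields are needed.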

Ryan Daniel Hovey

The http://www.30Digits.com Web Extractor has a distributed architecture, multi-threaded spidering, and multi-threaded extraction processes. It is also very precise about the parts of the site that are spidered. It avoids rendering pages and other steps that slow a spider down. XPath can be used when needed, or simply regex when that is sufficient, for even quicker parsing. Is it open source? Sorry, no, this is a commercial product. If you would like to know more though, feel free to contact me.

Justin Gilbreath

I work for http://screen-scraper.com and our software is amongst the most efficient web data aggregation software available. This is because screen-scraper only targets the assets that it needs. In other words, it does not request every asset of a web page (i.e. images, JavaScript, CSS, etc.), only the assets containing the data that it needs. It is written in 100% Java, is fully multi-threaded, and can scale to handle a virtually unlimited number of websites. screen-scraper can be centrally controlled and distributed across any number of instances on any number of servers. New instances can be created programmatically, on demand, based on load. Each instance can draw from a central data source and can store captured data back to a database, flat text file, XML, SOAP or virtually any other data type. Check it out and let us know if you have any questions. http://screen-scraper.com

Scott Wilson
