What are the fastest open source scraping tools?
Ones that can parse millions of web pages in a reasonable amount of time. Some of the desired features would be multi-threaded HTTP/1.1 persistent connections, distributed capabilities, ease of use, XPath or similar.
Answer:
It's difficult to say what's 'fastest', since it really depends on how you use the crawler. For example, if you use Scrapy (http://scrapy.org/) you can choose how (if at all) you want to parse web pages. The most common way to extract data in Scrapy is with XPath, relying on the underlying libxml2 library. This parsing and extraction is usually the bottleneck, and any solution that uses something as convenient as XPath typically comes with similar (or worse) overhead.
Some crawlers that are worth a mention when talking about performance:
- Apache Nutch (http://nutch.apache.org/): distributed, with support for parsing and extraction. It also has an extension mechanism via plugins. This one sounds like it meets a lot of your requirements.
- Heritrix (https://webarchive.jira.com/wiki/display/Heritrix/Heritrix): the Internet Archive's web-scale crawler.
For a more complete list of tools see
I'm disappointed that the other answers to this question are marketing for closed source solutions, despite the question.
Shane Evans at Quora
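To make the XPath-style extraction mentioned in the answer above a little more concrete, here is a minimal sketch. It is not Scrapy's or libxml2's API: to stay dependency-free it swaps in Python's standard-library ElementTree, which supports only a subset of XPath and needs well-formed markup. The sample page, the field names, and the function `extract_products` are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for illustration.
SAMPLE_PAGE = """
<html>
  <body>
    <div class="product">
      <h2>Widget</h2>
      <span class="price">9.99</span>
      <a href="/items/widget">details</a>
    </div>
    <div class="product">
      <h2>Gadget</h2>
      <span class="price">24.50</span>
      <a href="/items/gadget">details</a>
    </div>
  </body>
</html>
"""


def extract_products(markup):
    """Return one dict per product block, selected with XPath-style paths."""
    root = ET.fromstring(markup)
    products = []
    # ".//div[@class='product']" matches every product block in the tree.
    for block in root.findall(".//div[@class='product']"):
        products.append({
            "name": block.findtext("h2"),
            "price": float(block.findtext("span[@class='price']")),
            "url": block.find("a").get("href"),
        })
    return products
```

Every page has to be parsed into a full tree before any path can be evaluated, which is the kind of per-page cost the answer describes as the usual bottleneck.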
Other answers
The fastest scraping tools can be developed in a Java environment by implementing proxy management (multiple proxies) with a multi-threaded architecture. Scraping is an art: you first analyse the internal structure of a website and then develop the crawler to maximize scraping speed and overcome the obstacles the website imposes. In my view, there are no open source scraping tools available in the market. There are a few free providers (for example http://www.scrapingexpert.com/) which ask the user to provide a list of websites and automatically extract the data (http://www.scrapingexpert.com/) without asking the user to map the fields. However, the code base of such free providers is not available either. TANSTAAFL: these providers let users fetch data from some structured websites free of charge, but in return they earn heavily from the traffic this brings to their own site. The scraped content such websites provide is poor in quality.
Heena Desai
To start with, let's explain the meaning of scraping. Web scraping is a computer software technique for extracting information from websites. At first sight, free open source may seem like a great solution for your business, but it isn't. Open-source tools usually offer bad quality services. It is hard to scrape data by yourself and, to my mind, it is better to hire a specialist for this hard task. Web scraping tools like Easy Data Feed are the easiest way to extract data from any website for developing your online business. It is an exclusive desktop shopping cart migration tool. Easy Data Feed is a perfect solution for those who are planning to move their current store to a better shopping cart solution or want to move to the latest version of their current cart. Moreover, the tool can be used for quickly downloading the inventory, pricing, and product information from a competitor's website into a usable spreadsheet. By the way, EDF is powered by Shopping.CartElite! I was satisfied when I found Shopping.CartElite. They are a powerful and easily scalable turn-key solution for e-commerce businesses. You will not need heavy development or integrations with them, and when you're ready, you can quickly scale. Disclosure: I wrote this post, and I have used the company.
Bryan Dersam
http://easydatafeed.com is the fastest Open Source Scraping tool. It is:
- Free
- Open Source
- Well Working
- Always kept updated
In Easy Data Feed Software, we can:
- Run Daily Schedules
- Set Scrape Limits
- Visit as Anonymous
- Set Thread and Speed
- Password Protected Scrape
- Solve Decaptcha
- Use Proxies
That's why I think it is the fastest and best web scraping / data scraping tool. You can read about how to use it here: http://www.easydatafeed.com/open-source/ They also have developers you can hire to do the job for you; their Skype is "easydatafeed".
David Johnson
Scrapy (http://scrapy.org/) seems a safe bet. Other options are jSpider (http://j-spider.sourceforge.net), jARVEST (http://sing.ei.uvigo.es/jarvest/index.html) and Web Harvest (http://web-harvest.sourceforge.net/). I'm sure there are many more.
Bala Mahadevan
There are several web scraping tools available in the market. One of the best web scraping tools is developed by the company netUcon. This company specializes in providing services like:
1. .NET development projects (Microsoft .NET Framework 1.1/2.0/3.0/3.5/4.0/4.5)
2. Ecommerce integration (Amazon integration, eBay integration, Shopify integration, Volusion integration)
3. Web data scraping (Yelp, Just dial, Carid, LinkedIn, Amazon, government websites, social networking sites and so on)
4. Quick Book integration
5. Accounting software integration
6. Custom website development
7. ERP development
8. Data entry
9. Data mining
10. Lead generation on LinkedIn, Twitter and Facebook
11. BPO: data processing
12. Digital marketing and so on
The company also developed the LinkedIn Connection Creator (LCC), which you can use. This LCC is useful for scraping, for CEOs seeking connections to other CEOs, for creating B2B and B2C contacts, and for lead generators, digital marketers, bloggers who post their blogs on LinkedIn, and so on. For more details have a look: https://www.dropbox.com/s/zhcgcpojwz6wvfh/LinkedIn%20Connection%20Creator%28LCC%29.docx?dl=0
You can read more about Netucon here: http://www.netucon.com/ They also have developers you can hire to do the job for you; their Skype is "netrocks7".
Ati Jain
You can try VietSpider - http://vietspider.org/webextractor/
- The tool is advanced but has an easy-to-use interface.
- It implements the Web 3.0 crawling concept of Website Parse Templates (http://en.wikipedia.org/wiki/Website_Parse_Template/).
- The web crawler can use proxies, and multi-threading is configurable per website.
- VietSpider includes a built-in web browser and supports JavaScript.
- Many built-in plugins, such as plugins to synchronize data with Joomla, Drupal, WordPress, VBulletin,...
- Supports exporting data to MS Excel, CSV, XML,...
- Data cleansing and transformation.
Thuan Nhu Dinh
I like to use Python's Requests and multiprocessing libraries. I have been able to achieve faster speeds with this combo than with Scrapy (with the bottleneck being network speed). Then you can use lxml (with XPath) or Beautiful Soup to parse. Beautiful Soup will eat up a lot more processing power than lxml, but it is more lenient on syntax. Use Sessions in Requests for increased speed. Also, if you can use a regex to parse just a few data points, that is usually faster than using lxml.
Ryan Daniel Hovey
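The approach in the answer above (a reused connection per worker, several workers in parallel, a regex for the few fields needed) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the answerer's code: to avoid third-party dependencies it swaps in the standard library, with `http.client` keep-alive connections standing in for a Requests `Session` and a thread pool standing in for `multiprocessing`. The names `extract_title`, `fetch_paths`, and `crawl` are hypothetical.

```python
import http.client
import re
from concurrent.futures import ThreadPoolExecutor

# Matches the contents of the first <title> element, case-insensitively.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Pull one data point with a regex instead of building a full parse tree."""
    match = TITLE_RE.search(html)
    return match.group(1).strip() if match else None


def fetch_paths(host, port, paths):
    """Fetch several paths from one host over a single reusable connection."""
    conn = http.client.HTTPConnection(host, port, timeout=10)
    pages = {}
    try:
        for path in paths:
            conn.request("GET", path)
            response = conn.getresponse()
            # The body must be read fully before the connection can be reused.
            pages[path] = response.read().decode("utf-8", errors="replace")
    finally:
        conn.close()
    return pages


def crawl(host, port, paths, workers=4, fetch=fetch_paths):
    """Split paths across worker threads; each worker reuses its own connection.

    `fetch` is injectable (an assumption of this sketch) so the network layer
    can be replaced, for example with a Requests-based function.
    """
    chunks = [paths[i::workers] for i in range(workers)]
    titles = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(fetch, host, port, chunk) for chunk in chunks if chunk]
        for future in futures:
            for path, html in future.result().items():
                titles[path] = extract_title(html)
    return titles
```

A call might look like `crawl("example.com", 80, ["/", "/about"])`, where the host and paths are placeholders. Threads are used here because fetching is network-bound; if parsing dominated instead, a process pool would be the closer match to the multiprocessing setup the answer describes.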
The http://www.30Digits.com Web Extractor has a distributed architecture, multi-threaded spidering, and multi-threaded extraction processes. It is also very precise about the parts of the site that are spidered. It avoids rendering pages and other steps that slow a spider down. XPath can be used when needed, or simply a regex when that is sufficient, for even quicker parsing. Is it open source? Sorry, no, this is a commercial product. If you would like to know more, though, feel free to contact me.
Justin Gilbreath
I work for http://screen-scraper.com and our software is amongst the most efficient web data aggregation software available. This is because screen-scraper only targets the assets that it needs. In other words, it does not request every part of a web page (images, JavaScript, CSS, etc.), only the assets containing the data that it needs. It is written in 100% Java, is fully multi-threaded, and can scale to handle a virtually unlimited number of websites. screen-scraper can be centrally controlled and distributed across any number of instances on any number of servers. New instances can be created programmatically, on demand, based on load. Each instance can draw from a central data source and can store captured data back to a database, flat text file, XML, SOAP or virtually any other data type. Check it out and let us know if you have any questions. http://screen-scraper.com
Scott Wilson