What web scraping tool is the best to extract data?

What are the best annotation-based scraping tools and techniques?

  • I want to extract structured data from news websites (title, body, etc.). Most tools and scrapers depend on a programmer who has to carefully specify each of the CSS or XPath selectors. Of course, hardcoding is the way to go on a small scale, but with 500 websites to scrape it becomes impractical. Is there any tool that records user annotations (and, possibly, user navigation too) that can later be used to scrape data? (Some sort of updated Solvent, the now-dead Firefox add-on project, would be ideal.) Any insights are greatly appreciated!

  • Answer:

    http://scrapinghub.com/portia/ is a visual scraping tool based on Scrapy! Scrapinghub offers a hosted version, but you can export your work if you like. Portia is completely open source. EDITED 2015-08-05: updated to mention Portia instead of older technology

Shane Evans at Quora

Other answers

Option 1: If you just want to extract the title, body, etc., use the built-in fields option: http://www.datascraping.co/doc/29/built-in-fields. With this feature you can extract from thousands or millions of websites without writing any code or any CSS, XPath, or regex selectors. Just set up your scraping agent, choose built-in fields only, enter the list of URLs, and enjoy scraping.

Option 2: Use the point-and-click CSS selector tool https://chrome.google.com/webstore/detail/web-scraping-app/gpolcofcjjiooogejfbaamdgmgfehgff to create a jQuery-style scraping agent, e.g. Column1 to extract the body tag, Column2 to extract the title tag, and so on. Once the setup is done, save the agent and run it in Data Scraping Studio (http://www.datascraping.co/download.aspx) for batch URL crawling. TITLE, BODY, etc. are common tags that all websites have; where a tag is missing, the extracted value will be NULL.

Option 3: Check whether your target websites use structured microdata (http://schema.org/). If they do, create a microdata agent (http://www.datascraping.co/doc/questions/36/extracting-schema-org-microdata) and pass all the URLs through it.

Prince Rathee
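To make the "built-in fields" and microdata ideas from the answer above more concrete, here is a minimal Python sketch using only the standard library. It is not how Data Scraping Studio works internally; the class name `FieldParser` and the function `extract_fields` are hypothetical, and the parser only handles flat `itemprop` values (no nested `itemscope` items, and an element nested inside another element of the same tag will be cut short).

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so we never try to collect text from them.
VOID_TAGS = {"meta", "img", "link", "br", "hr", "input"}


class FieldParser(HTMLParser):
    """Collects the <title> text and flat schema.org itemprop values."""

    def __init__(self):
        super().__init__()
        self.title = None       # stays None (the "NULL" case) if there is no <title>
        self.microdata = {}     # itemprop name -> first value seen
        self._capture = None    # (kind, name, tag) while collecting element text
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if prop and attrs.get("content") is not None:
            # e.g. <meta itemprop="datePublished" content="...">
            self.microdata.setdefault(prop, attrs["content"])
        elif self._capture is None and tag not in VOID_TAGS:
            if tag == "title":
                self._capture, self._buffer = ("title", None, tag), []
            elif prop:
                self._capture, self._buffer = ("prop", prop, tag), []

    def handle_data(self, data):
        if self._capture:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._capture and tag == self._capture[2]:
            kind, name, _ = self._capture
            text = " ".join("".join(self._buffer).split())  # normalize whitespace
            if kind == "title":
                self.title = text
            else:
                self.microdata.setdefault(name, text)
            self._capture = None


def extract_fields(html):
    """Return the page title plus any flat microdata properties found."""
    parser = FieldParser()
    parser.feed(html)
    parser.close()
    return {"title": parser.title, "microdata": parser.microdata}
```

Running `extract_fields` over each downloaded page gives one record per URL; pages without a `<title>` or without microdata simply yield `None` or an empty dict.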

If you want to extract structured data from news websites, I can help. My company dealt with scraping recently. I intended to hire a developer, but then read about ShoppingCartElit on a forum. They offer customized data entry tools. It's a powerful solution with features that are exactly what you're looking for. They are not the largest platform around, but that's why their packages are reasonably priced, and they serve businesses ranging from startups to enterprises.

If you work with marketplaces, you can sync your items to one place and scrape the data from each account. The scraping works in both directions: loading data into your ShoppingCartElit back end, or the other way around. When I chose the second option, I received an exported CSV file. I also contacted their support team, and you can do so too.

Disclosure: I am reviewing the company in my answer, and I have used it before.

Emily Smith

Scrape IT is a company that delivers the information you need from websites on a daily basis. Just tell them something like "I would like to have the data of those 500 websites in this format" and they will deal with the rest. They can also search the internet for you if you are looking for specific information, without you telling them which websites to look at. See http://www.scrape-it.nl

Gertjan van Montfoort

Web scraping can be difficult without a lot of programming knowledge, but there is a great, free service, http://kimonolabs.com, that lets you scrape data based solely on point-and-click selections. Kimono uses the HTML tags of the page to recognize other matching data and makes the rest of your selections for you. All you need to do is an initial set-up: click on an element of the property you are looking for, and Kimono saves the CSS selector behind it. This lets you turn your selections into an API that can get the data from any page with the same data structure. Although Kimono is very simple to use, it's also really powerful and can be used for much more advanced work. People have even combined it with other applications to do complex analysis, like this sentiment analysis project: http://blog.kimonolabs.com/2014/12/17/guest-blog-sentiment-analysis-on-web-scraped-data-with-kimono-and-monkeylearn/ So yes, I do work here, but it really is an awesome product and it's absolutely free! Definitely worth a try.

Laura Nguyen
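The idea in the answer above (save the selector behind a clicked element, then replay it on any page with the same structure) can be sketched in a few lines of standard-library Python. This is only an illustration, not Kimono's implementation: `SelectorExtractor` and `apply_saved_selectors` are hypothetical names, and the sketch understands only simple `tag.class` selectors rather than full CSS.

```python
from html.parser import HTMLParser


class SelectorExtractor(HTMLParser):
    """Collects the text of every element matching a simple 'tag.class' selector."""

    def __init__(self, selector):
        super().__init__()
        # "h1.headline" -> tag "h1", class "headline"; "p" -> tag "p", no class
        self.tag, _, self.cls = selector.partition(".")
        self.matches = []
        self._depth = 0      # > 0 while we are inside a matching element
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Count nested elements of the same tag so we find the right close.
            if tag == self.tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and (not self.cls or self.cls in classes):
            self._depth = 1
            self._buffer = []

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._depth and tag == self.tag:
            self._depth -= 1
            if self._depth == 0:
                self.matches.append(" ".join("".join(self._buffer).split()))


def apply_saved_selectors(saved, html):
    """Replay saved {field: selector} annotations on one page."""
    result = {}
    for field, selector in saved.items():
        extractor = SelectorExtractor(selector)
        extractor.feed(html)
        extractor.close()
        result[field] = extractor.matches
    return result
```

A saved annotation is then just a small dict such as `{"title": "h1.headline", "body": "div.story"}`, recorded once per site and applied to every article page on that site.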

Your best bet for extracting from article-specific sites is to use a tool tailored to extracting articles. There are a couple of open-source libraries and some services dedicated to clean article extraction from any article/news page, and they don't have to be individually trained on the sites you want to extract from:

  • Boilerpipe: https://code.google.com/p/boilerpipe/

  • Goose: https://github.com/GravityLabs/goose

  • Diffbot uses computer vision to extract articles, products, images or other clean text from web pages. Our Article API returns the title, clean text or normalized HTML, relevant images, author, date, videos if embedded within the article, etc.: http://www.diffbot.com/products/automatic/article

  • Two other services, whose names are missing from this copy of the answer, offer an Extract API for article pages and a developer API for extracting articles from web pages.

  • Pocket / ReadItLater used to offer an extraction API, but no longer does: http://getpocket.com/developer/docs/v3/article-view

John Davi
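For a feel of what "clean article extraction without per-site training" means, here is a deliberately naive Python sketch using only the standard library. It is not the algorithm used by Boilerpipe, Goose, or any of the services above; it simply keeps `<p>` blocks that are long enough and not dominated by link text. The names `ParagraphCollector` and `extract_article_text`, and the thresholds `min_words=10` and `max_link_density=0.3`, are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collects (text, link_chars) for every <p> block in a page."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._in_a = False
        self._skip = 0          # > 0 while inside <script> or <style>
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip += 1
        elif tag == "p":
            self._flush()       # an unclosed previous <p> ends here
            self._in_p = True
        elif tag == "a":
            self._in_a = True

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip:
            self._skip -= 1
        elif tag == "p":
            self._flush()
        elif tag == "a":
            self._in_a = False

    def handle_data(self, data):
        if self._in_p and not self._skip:
            self._text.append(data)
            if self._in_a:
                self._link_chars += len(data.strip())

    def close(self):
        super().close()
        self._flush()           # keep a trailing paragraph that was never closed

    def _flush(self):
        text = " ".join("".join(self._text).split())
        if text:
            self.paragraphs.append((text, self._link_chars))
        self._in_p = False
        self._text, self._link_chars = [], 0


def extract_article_text(html, min_words=10, max_link_density=0.3):
    """Keep paragraphs that look like running prose rather than navigation."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    kept = []
    for text, link_chars in collector.paragraphs:
        link_density = link_chars / len(text)
        if len(text.split()) >= min_words and link_density <= max_link_density:
            kept.append(text)
    return "\n\n".join(kept)
```

The same function can be run unchanged over all 500 sites, which is the appeal of this family of tools; the trade-off is that a heuristic this simple will miss short paragraphs and pages that do not use `<p>` for body text.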

http://www.visualscraper.com/ can do this. However, it is still in beta and has no XPath support yet. Hopefully it will support XPath in the future.

Ho Thanh Son

Try Outwit Hub. You still need to know regular expressions, though; some examples are provided inside the tool.

Andrey Strelnikov
