What web scraping tool is the best to extract data?

How can I automatically extract data from PDFs based on keywords?

  • I have a project that involves downloading and gathering selected data from a large number of PDFs stored at various web sites. Right now, I do this manually. Is there any open source or low-cost tool that would let me specify keywords and then automatically extract the text adjacent to those keywords from multiple PDFs?

  • Answer:

    There's an off-the-shelf product called Split Pro (http://www.debenu.com/products/server/debenu-pdf-split-pro/) that has the ability to help you with this task. Otherwise, you should be looking at using an SDK (such as PDFlib TET - http://www.pdflib.com/products/tet/) to extract the text and perform the analysis you require.

Chris Dahl at Quora

Other answers

You can do it yourself with the help of the open source projects PDFBox or Xpdf, or let me do it for you for a small fee.

Steven Lee

Thanks for the three previous answers.  I am organizing a hackathon with the Sunlight Foundation to address the problem of PDF data extraction. Here is the announcement:  http://www.meetup.com/Open-Source-Finance/events/145743722/

Marc Joffe

If you do not require OCR, there are several options available, including:

  • PDFMiner (Python): https://github.com/euske/pdfminer/

  • PyPDF2 (Python): https://github.com/mstamy2/PyPDF2

  • pdftotext (command line): https://en.wikipedia.org/wiki/Pdftotext

Alternatively, if you don't want to code the integrations and manipulations yourself, http://taskpipes.com might be useful for you. TaskPipes allows you to parse PDFs and other file types and reformat the data as required through the online user interface. (Full disclosure: I am the co-founder of TaskPipes)

Fraser Atkins
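For the pdftotext route mentioned above, a minimal Python sketch might look like the following. It assumes the pdftotext command-line tool is installed and on the PATH; the function names and the idea of looping over a local folder of already-downloaded PDFs are illustrative assumptions, not part of any answer here.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path):
    """Build the pdftotext call; '-layout' keeps the page layout,
    and the trailing '-' sends the text to stdout instead of a file."""
    return ["pdftotext", "-layout", str(pdf_path), "-"]


def pdf_to_text(pdf_path):
    """Run pdftotext on one file and return the extracted text."""
    result = subprocess.run(
        pdftotext_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise if pdftotext exits with an error
    )
    return result.stdout


def texts_for_folder(folder):
    """Yield (path, text) for every PDF in a folder (hypothetical layout)."""
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        yield pdf_path, pdf_to_text(pdf_path)
```

The text returned by pdf_to_text can then be searched for keywords with ordinary string or regular-expression tools. Scanned PDFs without a text layer would still need OCR first.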

You can use IntelliGet ( http://www.mountonetech.com/intelliget ) for this. You can specify markers in multiple ways: keyword(s) in the same line or on a different line, a combination of keywords, length of the line, line number, etc. You can create as complex a marker as you require. Once you have tested the extraction, you can run it on multiple files - it can extract from hundreds of files in a minute.

S Garg
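The marker idea described above (a keyword on the same line as the value, or on a line before it) can be illustrated in a few lines of Python. This is only a generic sketch of the concept working on already-extracted text; it does not reflect how IntelliGet itself is implemented, and the function name and parameters are made up for illustration.

```python
def extract_by_marker(text, keyword, offset=0):
    """Return the text found relative to each line containing `keyword`.

    offset=0 returns whatever follows the keyword on its own line;
    offset=1 returns the next line, offset=2 the one after, and so on.
    """
    lines = text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        pos = line.lower().find(keyword.lower())  # case-insensitive match
        if pos == -1:
            continue
        if offset == 0:
            # Drop the keyword plus any separator such as ':' or spaces.
            hits.append(line[pos + len(keyword):].strip(" :\t"))
        elif i + offset < len(lines):
            hits.append(lines[i + offset].strip())
    return hits
```

For example, with the text "Invoice Number: 12345" the call extract_by_marker(text, "Invoice Number") picks up the value on the same line, while offset=1 suits layouts where the label sits on the line above the value.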

A few email parsers handle this pretty well, provided you don't need Optical Character Recognition (OCR). I use http://mailparser.io; here is their blog article on parsing PDF data: https://mailparser.io/blog/convert-pdf-files-to-google-spreadsheet?utm_source=quora. Of the email parsers I've seen, it offers the most flexibility to get what you need. Zapier also has some pretty good options for email parsing (albeit a lighter version) and can then pass the results on to other apps; see https://parser.zapier.com/. Good luck!

Tom Kincheloe

You can use the PDFBox or iText PDF-processing APIs to convert PDFs into HTML or plain-text documents. You can then use regular expressions to identify the patterns that mark the data you want to extract. Both PDFBox and iText provide open source libraries, which you can call from a high-level programming language such as Java.

Vijay Kumar R
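The regular-expression step described above might be sketched as follows, once the PDF has been converted to plain text by whichever tool you choose. Python is used here for brevity rather than Java, and the function name, the separator characters, and the 60-character limit are all assumptions chosen for illustration.

```python
import re


def text_after_keywords(text, keywords, max_chars=60):
    """Map each keyword to the snippets that follow it on the same line."""
    found = {}
    for keyword in keywords:
        # keyword, optional ':' or '-' separator, then up to max_chars
        # characters that are not a line break
        pattern = re.compile(
            re.escape(keyword) + r"[ \t]*[:\-]?[ \t]*([^\n]{1,%d})" % max_chars,
            re.IGNORECASE,
        )
        found[keyword] = [m.group(1).strip() for m in pattern.finditer(text)]
    return found
```

Real documents usually need patterns tuned to their layout (for instance, values that wrap onto the next line), so treat this as a starting point rather than a general solution.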
