How can I automatically extract data from PDFs based on keywords?
-
I have a project that involves downloading and gathering selected data from a large number of PDFs stored at various web sites. Right now, I do this manually. Is there any open-source or low-cost tool that would allow me to specify keywords and then automatically extract any text adjacent to those keywords from multiple PDFs?
-
Answer:
There's an off-the-shelf product called Split Pro (http://www.debenu.com/products/server/debenu-pdf-split-pro/) that has the ability to help you with this task. Otherwise, you should be looking at using an SDK (such as PDFlib TET - http://www.pdflib.com/products/tet/) to extract the text and perform the analysis you require.
Chris Dahl at Quora
Other answers
You can do it yourself with the open-source projects PDFBox or Xpdf, or let me do it for you for a small fee.
Steven Lee
See if this helps: http://tv.adobe.com/watch/learn-acrobat-x/saving-search-results-in-acrobat/
Abhigyan Modi
Thanks for the three previous answers. I am organizing a hackathon with the Sunlight Foundation to address the problem of PDF data extraction. Here is the announcement: http://www.meetup.com/Open-Source-Finance/events/145743722/
Marc Joffe
If you do not require OCR, then there are several options available, including:
- PDFMiner (Python): https://github.com/euske/pdfminer/
- PyPDF2 (Python): https://github.com/mstamy2/PyPDF2
- pdftotext (Command Line): https://en.wikipedia.org/wiki/Pdftotext
Alternatively, if you don't want to have to code the integrations and manipulations yourself, then http://taskpipes.com might be useful for you. TaskPipes allows you to parse PDFs and other file types and reformat the data as required through the online user interface. (Full disclosure: I am the co-founder of TaskPipes)
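For the pdftotext option, a minimal Python sketch might look like the following. It assumes the pdftotext command-line tool is installed and on your PATH; the function names pdf_to_text and lines_near_keyword are made up for this illustration and are not part of any library. It returns each line that contains a keyword, plus a configurable number of following lines.

```python
import subprocess


def pdf_to_text(pdf_path):
    """Extract text from a PDF by calling the pdftotext command-line tool.

    Assumes pdftotext is installed and on PATH. "-layout" asks it to
    preserve the page layout; "-" sends the text to stdout, not a file.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def lines_near_keyword(text, keyword, after=1):
    """Return each line containing `keyword` plus the `after` lines below it.

    Matching is case-insensitive. This is a hypothetical helper, not a
    library function.
    """
    lines = text.splitlines()
    needle = keyword.lower()
    hits = []
    for i, line in enumerate(lines):
        if needle in line.lower():
            hits.append("\n".join(lines[i:i + 1 + after]))
    return hits
```

To process many files, you could loop over the PDFs in a folder and call lines_near_keyword(pdf_to_text(path), "your keyword") for each one. Scanned PDFs would still need OCR first.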
Fraser Atkins
You can use IntelliGet ( http://www.mountonetech.com/intelliget ) for this. You can specify markers in multiple ways: keyword(s) in the same line or on a different line, a combination of keywords, length of the line, line number, etc. You can create as complex a marker as you require. Once you have tested the extraction, you can run it on multiple files - it can extract from hundreds of files in a minute.
S Garg
There are a few email parsers out there that do this pretty well, provided you don't need optical character recognition (OCR). I use http://mailparser.io; here is their blog article on parsing PDF data: https://mailparser.io/blog/convert-pdf-files-to-google-spreadsheet?utm_source=quora. Of the parsers I've seen, it offers the most flexibility to get what you need. Zapier also has some pretty good options for email parsing (albeit a lighter version) and for then passing the data on to other apps; see https://parser.zapier.com/. Good luck!
Tom Kincheloe
You can use the PDFBox or iText PDF-processing APIs to convert PDFs into HTML or text documents. You can then use regular expressions to identify patterns in that text and extract the data. Both PDFBox and iText provide open-source libraries, which you can use from a high-level programming language such as Java.
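The regular-expression step could be sketched as follows, in Python rather than Java, assuming the PDF has already been converted to plain text by whichever tool you choose. The field names and patterns here are hypothetical and would need adjusting to the wording your documents actually use.

```python
import re

# Hypothetical field patterns for illustration only; each has one capture
# group holding the value that follows the keyword.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s+(?:No\.?|Number)\s*[:#]?\s*(\S+)", re.IGNORECASE
    ),
    "total": re.compile(
        r"Total\s*(?:Due)?\s*:?\s*\$?\s*([\d,]+(?:\.\d{2})?)", re.IGNORECASE
    ),
}


def extract_fields(text, patterns=FIELD_PATTERNS):
    """Return a dict of field name -> first captured value, or None if absent."""
    record = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        record[name] = match.group(1) if match else None
    return record
```

Running extract_fields on the text of each PDF gives one record per document, which you could then write out as a row in a CSV file.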
Vijay Kumar R