Is there an easy-to-use Python library to read a PDF file and extract its text?

All of the ones I've seen on GitHub are insanely complicated and do not function properly. Surely there must be something more polished out there.
Answer:
Yes. Textract (http://textract.readthedocs.org/en/latest/) is very easy to use. It relies on a couple of different possible underlying solutions: pdftotext and pdfminer. I haven't used the underlying tools directly, but Textract is as simple as it gets.

Another great solution is Apache Tika. Tika runs a Java server in the background. It's sort of a Rosetta Stone of text ingestion (reads documents, images, etc.), so it might be overkill if you only care about PDF. Still, it's production-quality, extremely fast, and there's an easy-to-use Python wrapper: https://github.com/chrismattmann/tika-python.

Here's an example of Textract:

    from textract import process
    text = process('/tmp/mydocument.pdf')  # returns bytes; use .decode('utf-8') for a str

And the same example using Tika:

    from tika import parser
    parsed = parser.from_file('/tmp/mydocument.pdf')  # returns a dict
    text = parsed['content']

Both are good solutions. Which one you choose would depend on how you're going to use it.
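If it isn't clear up front which backend will cope with a given file, one option is to try them in order and keep the first usable result. The following is only a minimal sketch of that idea: extract_with_fallback, with_textract and with_tika are hypothetical helper names, not part of either library, and the two wrappers assume textract and tika-python are installed.

```python
from typing import Callable, Iterable, Optional

# An "extractor" here is any function that takes a file path and returns text.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(path: str, extractors: Iterable[Extractor]) -> str:
    """Return the first non-empty result from the given extractors.

    Hypothetical helper; not part of textract or tika-python.
    """
    errors = []
    for extractor in extractors:
        try:
            text = extractor(path)
        except Exception as exc:  # a backend may be missing or choke on the file
            errors.append(f"{getattr(extractor, '__name__', extractor)}: {exc}")
            continue
        if text and text.strip():
            return text
    raise RuntimeError("no extractor produced text: " + "; ".join(errors))


def with_textract(path: str) -> str:
    import textract  # imported lazily so the sketch loads without it installed
    return textract.process(path).decode("utf-8")


def with_tika(path: str) -> Optional[str]:
    from tika import parser  # assumes tika-python and a Java runtime
    return parser.from_file(path)["content"]
```

A call such as extract_with_fallback('/tmp/mydocument.pdf', [with_textract, with_tika]) would then try Textract first and only start Tika if Textract fails or returns nothing.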
Nathan Hemingway at Quora
Other answers
This may come to you as a surprise, but PDF was never actually intended as a format for easy text extraction. Indeed, its primary purpose is to make sure that whatever is in the document would be displayed in a consistent manner across multiple platforms, as well as print identically (as much as that is possible) everywhere. This is achieved in many different ways, and sometimes goes as far as literally saying "this letter goes here, this letter goes here, and this letter goes here" in its internal language. So yes, the libraries to extract text from PDF are insanely complicated by necessity, and they sometimes do not even function properly, because PDF is not meant for that.
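To make the "this letter goes here" point concrete, here is a rough sketch of what an extractor has to do when all it gets is individually positioned characters. It is a toy illustration and not how any particular library works: the (x, y, character) tuple format, the function name glyphs_to_text and both thresholds are assumptions chosen for the example.

```python
from typing import List, Tuple

Glyph = Tuple[float, float, str]  # (x, y, character); y grows upward, as in PDF


def glyphs_to_text(glyphs: List[Glyph], line_tol: float = 2.0,
                   space_gap: float = 8.0) -> str:
    """Rebuild reading-order text from individually positioned characters."""
    # Sort top-to-bottom (larger y first), then left-to-right.
    ordered = sorted(glyphs, key=lambda g: (-g[1], g[0]))
    lines: List[List[Glyph]] = []
    for glyph in ordered:
        # Stay on the current line while the vertical offset is within tolerance.
        if lines and abs(lines[-1][0][1] - glyph[1]) <= line_tol:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])
    out = []
    for line in lines:
        line.sort(key=lambda g: g[0])
        text = line[0][2]
        for prev, cur in zip(line, line[1:]):
            # A wide horizontal gap is guessed to be a word break.
            if cur[0] - prev[0] > space_gap:
                text += " "
            text += cur[2]
        out.append(text)
    return "\n".join(out)
```

Even this toy version has to guess where lines and word breaks are, and both guesses fail easily with columns, varying font sizes or rotated text.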
Paulina Jonušaitė
PyPDF2 has an extractText() function that is very easy to use. As other people have noted, it is harder than it seems to get text out of a PDF, and extractText() is known not to work on every PDF, especially if it's formatted in a complicated way (e.g. pictures with captions, columns). I've heard it also has trouble with non-English-language PDFs, but I haven't tried it.

That said, it has worked fine for me in the past on simple PDFs, and it only takes a few lines of code to get going (although the resulting string will probably need some regex massaging afterwards). It may be worth a shot for your case. A good tutorial is here: https://automatetheboringstuff.com/chapter13/

Example:

    import PyPDF2

    def get_pdf_content(pdf_path, page_nums=(0,)):
        """Return the concatenated text of the given zero-based pages."""
        content = ''
        # open() replaces the Python 2-only file() builtin
        with open(pdf_path, 'rb') as p:
            # PyPDF2 1.x/2.x names; 3.x renames them to PdfReader, pages[n], extract_text()
            pdf = PyPDF2.PdfFileReader(p)
            for page_num in page_nums:
                content += pdf.getPage(page_num).extractText()
        return content
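As for the regex massaging, the sketch below shows the kind of cleanup that is often needed on extracted text. It uses only the standard re module; the function name clean_extracted_text and the specific rules are assumptions for illustration, and real documents will usually need their own tweaks.

```python
import re


def clean_extracted_text(raw: str) -> str:
    """Tidy common artifacts in text pulled out of a PDF."""
    # Rejoin words hyphenated across a line break ("extrac-" + "tion").
    text = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", raw)
    # Treat a blank line as a paragraph break; protect it with a placeholder.
    text = re.sub(r"\n\s*\n", "\u0000", text)
    # Collapse remaining whitespace runs (including single newlines) to one space.
    text = re.sub(r"\s+", " ", text)
    # Restore paragraph breaks and trim each paragraph.
    paragraphs = [p.strip() for p in text.split("\u0000")]
    return "\n\n".join(p for p in paragraphs if p)
```

Note that the de-hyphenation rule will also glue together words that are genuinely hyphenated and happen to break at the hyphen, so it is a trade-off rather than a safe default.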
Kayla Andersen
Easy-to-use libraries will have a hard time extracting information from a PDF. Usually you will find space characters scattered randomly through the text, or what looks like one line is actually two lines on top of each other (so if the PDF says "Hello World" your library might read it as "H l W r d\n el o or l").

However, the guys and gals at https://www.gini.net/ have a service that you can easily integrate into a Python program and that will extract information from the PDF (it also works on images, etc.).
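If you stay with a local library, it can help to check automatically whether the output shows the stray-space symptom described above before trusting it. The following is a crude heuristic sketch, not an established method: the name looks_garbled and the 0.3 threshold are assumptions picked for the example.

```python
def looks_garbled(text: str, max_single_char_ratio: float = 0.3) -> bool:
    """Guess whether extracted text has stray spaces splitting words apart."""
    tokens = text.split()
    if not tokens:
        return True  # nothing extracted at all is also a bad sign
    # Count one-letter alphabetic tokens; ordinary prose has few ("a", "I").
    singles = sum(1 for t in tokens if len(t) == 1 and t.isalpha())
    return singles / len(tokens) > max_single_char_ratio
```

For the broken "Hello World" output quoted above, seven of the nine tokens are single letters, so the check flags it, while a normal sentence passes.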
Tobias Kommerell
Since Python is designed as a language to process data of various types, it is natural to think of using Python to extract and manipulate text data in the ubiquitous PDF format.

For reading PDF files, you can use the PyPDF module (see https://automatetheboringstuff.com/chapter13/).

For writing PDF files, I would recommend the ReportLab library, which appears to be quite mature and has been deployed on many systems (Windows, Linux, etc.). See here: http://www.reportlab.com.
Chien Nguyen
Python has a lot of libraries for PDF extraction, and many of them have been discussed in the other answers. I would like to add PDFMiner and Slate to the list.

PDFMiner

PDFMiner is a tool for extracting information from PDF documents. Unlike other PDF-related tools, it focuses entirely on getting and analyzing text data. PDFMiner allows one to obtain the exact location of text in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for purposes other than text analysis.

Features:
- Written entirely in Python (for version 2.4 or newer).
- Parse, analyze, and convert PDF documents.
- PDF-1.7 specification support (well, almost).
- CJK languages and vertical writing scripts support.
- Various font types (Type1, TrueType, Type3, and CID) support.
- Basic encryption (RC4) support.
- PDF to HTML conversion (with a sample converter web app).
- Outline (TOC) extraction.
- Tagged contents extraction.
- Reconstruct the original layout by grouping text chunks.

Source distribution: http://pypi.python.org/pypi/pdfminer/
GitHub: https://github.com/euske/pdfminer/
Online demo (PDF -> HTML conversion web app): http://pdf2html.tabesugi.net:8080/
Community (mailing list for users who want to ask questions): http://groups.google.com/group/pdfminer-users/

Slate 0.5.2

Slate is a Python package that simplifies the process of extracting text from PDF files. It depends on the PDFMiner package.

PDFMiner, however, has its own drawbacks:
- Getting simple things done, like extracting the text, is quite complex.
- The program is not designed to return Python objects, which makes interfacing things irritating.
- It's an extremely complete set of tools, with multiple and moderately steep learning curves.
- It's not written with hackability in mind.

Slate provides one class, PDF. PDF takes a file-like object and will extract all text from the document, presenting each page as a string of text:

    >>> import slate
    >>> with open('example.pdf', 'rb') as f:
    ...     doc = slate.PDF(f)
    ...
    >>> doc
    [..., ..., ...]
    >>> doc[1]
    'Text from page 2...'

If your PDF is password protected, pass the password as the second argument:

    >>> with open('secrets.pdf', 'rb') as f:
    ...     doc = slate.PDF(f, 'password')
    ...
    >>> doc[0]
    "My mother doesn't know this, but..."

Author: Tim McNamara
Home page: http://github.com/timClicks/slate

Source: https://github.com/euske/pdfminer/ and http://github.com/timClicks/slate

Happy learning Python :) Cheers!
Radhakrishnan Ramesh