Extract Text and Images From a PDF
I'd like to extract the text and images from a multi-page PDF to use on the web. I've got a number of large PDFs (100 to 200 pages each) whose text and images I need for a website's CMS.

I've looked at a number of PDF-to-HTML software options, but they all seem to rely on CSS for positioning text, whereas I'd prefer to just have the text in simple paragraph format. If there's a PDF-to-HTML solution that doesn't format like this, I'd be interested in that, of course. Proper handling of tables would be quite beneficial. The way Acrobat extracts text is decent, but I'd prefer it if there were no line breaks inside individual paragraphs (although I could probably set up a PHP script to run through and fix that if it's my only option).

As for the images, it'd be great if I had control over how they were saved: in a folder called PDF_NAME/PAGE_1, for example, for all the page 1 images. At the very least I need to know which page they were extracted from, either by the folder or the filename. I know I can extract images from a PDF using Photoshop, but unless there's a way to bulk extract images and store them in the layout mentioned above, I don't think that solution will really work.

I'm running Windows XP, but a solution that works on Linux (Ubuntu) would be fine. If there's an absolutely amazing piece of software for OS X, I'd be inclined to find a system I could use to run it. I'm also open to any web-based solutions (maybe I could use PHP for such a task?).
Answer:
I've been through tons of PDF converters, and the one that I liked the most is http://buy.abbyy.com/content/pdftransformer/default.aspx. Despite the dodgy name, it's really useful: it lets you select which regions of the PDF you'd like to convert, can dump to simple paragraphs, and converts to a variety of formats. I'm not sure how it handles your more advanced requirements, as my trial has finished now. It's worth downloading the trial, however, if you haven't already.
backwards guitar at Ask.Metafilter.Com
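If whichever tool you end up with still leaves hard line breaks inside paragraphs (the problem the question describes with Acrobat's text export), a short script can rejoin them. Here is a minimal sketch in Python rather than PHP. It assumes that a blank line marks a real paragraph break and that every other newline is just a wrapped line; that is an assumption about the exported text, not something any converter guarantees. The function names are made up for illustration.

```python
import html
import re


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines; blank lines are treated as paragraph breaks."""
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = []
    # A paragraph break is a newline, optional whitespace, then another newline.
    for block in re.split(r"\n\s*\n", text):
        lines = [line.strip() for line in block.split("\n") if line.strip()]
        if lines:
            # Wrapped lines inside one paragraph are joined with a single space.
            paragraphs.append(" ".join(lines))
    return "\n\n".join(paragraphs)


def paragraphs_to_html(text: str) -> str:
    """Wrap each unwrapped paragraph in <p> tags, escaping HTML characters."""
    unwrapped = unwrap_paragraphs(text)
    if not unwrapped:
        return ""
    return "\n".join(
        "<p>" + html.escape(paragraph) + "</p>"
        for paragraph in unwrapped.split("\n\n")
    )
```

Words hyphenated across a line break are left alone here, since blindly removing trailing hyphens would also damage genuine compounds. Tables will not survive this treatment and would need separate handling.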
Other answers
The images part is easy. Acrobat 7 Pro has an "Advanced > Export All Images..." command that names them by page and then by a serial number. Getting nicely formatted text from PDFs is the tricky part. Are the PDFs tagged for XML? That would be an easy workflow to set up. Otherwise, you might need to go at the text manually with a word processor and macros (what I use) or with some tailor-made code.
cowbellemoo
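Since the exported images are named by page and serial number, sorting them into the PDF_NAME/PAGE_N folders the question asks for can be scripted. The sketch below is in Python and rests on a loud assumption: that the files are named like `manual_Page_03_Image_0002.jpg`. That pattern is a guess used for illustration, so check what your Acrobat version actually writes and adjust the regular expression to match. The function names are hypothetical too.

```python
import re
import shutil
from pathlib import Path
from typing import List, Optional

# ASSUMED naming scheme: <pdf name>_Page_<page>_Image_<serial>.<ext>
# Adjust this pattern to whatever the export really produces.
PATTERN = re.compile(
    r"^(?P<name>.+)_Page_(?P<page>\d+)_Image_(?P<serial>\d+)\.(?P<ext>\w+)$",
    re.IGNORECASE,
)


def target_path(filename: str, out_root: Path) -> Optional[Path]:
    """Return out_root/PDF_NAME/PAGE_N/filename, or None if the name doesn't match."""
    match = PATTERN.match(filename)
    if match is None:
        return None
    page = int(match.group("page"))  # int() drops leading zeros: "03" -> 3
    return out_root / match.group("name") / f"PAGE_{page}" / filename


def sort_images(src_dir: str, out_root: str) -> List[Path]:
    """Move every matching image in src_dir into its per-page folder."""
    moved = []
    for path in sorted(Path(src_dir).iterdir()):
        if not path.is_file():
            continue
        dest = target_path(path.name, Path(out_root))
        if dest is None:
            continue  # leave files with unrecognised names where they are
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(dest))
        moved.append(dest)
    return moved
```

Calling `sort_images("exported", "sorted")` would move each matching file from the `exported` folder into `sorted/PDF_NAME/PAGE_N/`, keeping the original filename so the page and serial number stay visible.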