
Extract Text and Images From a PDF

I'd like to extract the text and images from a multi-page PDF to use on the web. I've got a number of large (100 to 200 pages) PDFs whose text and images I need to pull out for use with a CMS for a website.

I've looked at a number of PDF to HTML software options, but they all seem to rely on CSS for positioning text, whereas I'd prefer to just have the text in simple paragraph format. If there's a PDF to HTML solution that doesn't format like this, I'd be interested in that, of course. Proper handling of tables would be quite beneficial. The way Acrobat extracts text is decent, but I'd prefer it if there were no line breaks inside individual paragraphs (although I could probably set up a PHP script to run through and fix that if it's my only option).

As for the images, it'd be great if I had control over how they were saved: in a folder called PDF_NAME/PAGE_1, for example, for all the page 1 images. At the very least I need to know which page they were extracted from, either by the folder or the filename. I also know I can extract images from a PDF using Photoshop, but unless there's a way to bulk extract images and store them in the layout mentioned above, I don't think that solution will really work.

I'm running Windows XP, but a solution that works with Linux (Ubuntu) would be fine. If there's an absolutely amazing piece of software for OS X, I'd be inclined to find a system I could use to run it. I'm also open to any web-based solutions (maybe I could use PHP for such a task?).
Answer:

I've been through tons of PDF converters, and the one that I liked the most is http://buy.abbyy.com/content/pdftransformer/default.aspx. Despite the dodgy name, it's really useful: it lets you select which regions of the PDF you'd like to convert, it can dump the text as simple paragraphs, and it converts to a variety of formats. I'm not sure how it handles your more advanced requirements (the per-page image folders, for instance), as my trial has finished now. It's worth downloading the trial, however, if you haven't already.
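
Whichever converter you end up with, the two cleanup steps from the question (removing line breaks inside paragraphs, and filing images by page) are easy to script. Here's a rough sketch in Python rather than PHP. It assumes the extracted text separates paragraphs with blank lines, which may not hold for every converter, and the function names `unwrap_paragraphs` and `image_path` and the `image_N` filename pattern are made up for illustration. It doesn't extract anything from the PDF itself; it only tidies text you already have and works out where each image should be saved.

```python
import re
from pathlib import Path


def unwrap_paragraphs(text: str) -> str:
    """Join hard-wrapped lines into single-line paragraphs.

    Assumption: paragraphs are separated by one or more blank lines.
    """
    # Normalise Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    cleaned = []
    for para in re.split(r"\n\s*\n", text.strip()):
        # Collapse every run of whitespace (including newlines) to one space.
        para = " ".join(para.split())
        if para:
            cleaned.append(para)
    return "\n\n".join(cleaned)


def image_path(pdf_name: str, page: int, index: int, ext: str = "png") -> Path:
    """Build a PDF_NAME/PAGE_N/image_I.ext path (hypothetical naming scheme)."""
    folder = Path(Path(pdf_name).stem) / f"PAGE_{page}"
    return folder / f"image_{index}.{ext}"


if __name__ == "__main__":
    sample = "First line\nsecond line.\n\nNext para\ncontinues."
    print(unwrap_paragraphs(sample))
    print(image_path("manual.pdf", 1, 2).as_posix())
```

You'd call `image_path(...).parent.mkdir(parents=True, exist_ok=True)` before writing each image, so the page folders get created as you go.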

backwards guitar at Ask.Metafilter.Com
