Is there a clean way to parse HTML?

Turning HTML into a book?

  • I'm trying to turn someone's blog into annual books for them as a gift (not Christmas). What's the best way to do this? The book printer wants a PDF. I've already captured the HTML for the blog into files, and have written a Perl script to parse the HTML into a basic structured text file that identifies the title, date, and other blog metadata in a standardized way.

    My specific question: how can I import this text (with some HTML content in the bodies) into a word processor or other page layout program so that (a) the (simple) HTML formatting inside the blog content is preserved, and (b) styles are automatically assigned to the title, date and such so that I don't have to do it manually?

    Constraints: I'm flexible about the layout software, but if it's not MS Word 2003 or Publisher, it needs to be free (and able to handle 200-300 page manuscripts in a single file). The structure of the file to import is flexible, since I'm creating it... for example, creating XML would not be difficult. I'm aware there are numerous HTML -> PDF options, but I really want to use a WYSIWYG-style layout program that supports TOC creation, page numbering, and so on. The work to automate can't be too elaborate, since this is a one-off (there are about 600 blog entries spread out over three books).

    Any suggestions?

  • Answer:

    You could use one of http://wiki.docbook.org/topic/ConvertOtherFormatsToDocBook to convert your HTML to DocBook, and then one of http://wiki.docbook.org/topic/DocBookPublishingTools to convert your DocBook to PDF.

reborndata at Ask.Metafilter.Com

Other answers

In my defense I did download a couple of blogs. One came into Word as the entire 3 years of blogging with each field associated with its particular style. 300+ pages with images, but sadly no comments. Maybe the step I missed to tell you about was that you must save the imported blog as a Word document (not as HTML).

cabb-chase

Word will open HTML files, so why not just do that, then save as a .doc and work on it in Word? You'll probably have to be a bit wary of its interpretation of HTML (you might have to tweak your Perl script and re-export a couple of times), but it should work.

AmbroseChapel
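A minimal sketch of the "open it in Word" route, assuming the parsed entries are available as dictionaries with hypothetical `title`, `date`, and `body` keys (the thread does not say what the asker's structured file looks like). The idea is to emit one HTML file in which every title is an `<h1>`, since word processors generally import HTML headings as their built-in heading styles, which is what a generated TOC is built from. Whether a class name such as `entry-date` survives the import as a named style is an assumption worth testing on a small sample first.

```python
from html import escape


def entries_to_html(entries):
    """Build one HTML document from a list of entry dicts.

    Each entry uses the hypothetical keys 'title', 'date', and 'body'.
    'body' is assumed to already be simple HTML and is passed through as-is.
    """
    parts = ['<html><head><meta charset="utf-8"><title>Blog book</title></head><body>']
    for entry in entries:
        # Titles become <h1> so the word processor can treat them as headings.
        parts.append(f"<h1>{escape(entry['title'])}</h1>")
        # The date gets its own paragraph and class so it can be restyled in one go.
        parts.append(f'<p class="entry-date">{escape(entry["date"])}</p>')
        parts.append(f'<div class="entry-body">{entry["body"]}</div>')
    parts.append("</body></html>")
    return "\n".join(parts)


if __name__ == "__main__":
    sample = [
        {"title": "First post", "date": "2006-01-02", "body": "<p>Hello <b>world</b></p>"},
    ]
    print(entries_to_html(sample))
```

Writing the result to a single `.html` file per book, opening it in the word processor, and saving it in the native document format would then leave only the style definitions to adjust by hand.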

latex!! it's perfect - assuming it's just text, and not heavy with images or other stuff. it's not wysiwyg, in that you don't have precise control over the style. it works by 'compiling' tagged text into beautifully typeset documents. table of contents and sections and stuff are trivial. it is built for automation and large documents. I really think it's exactly what you need. many free/open source tools exist. you could probably do the whole thing with one script. see http://www.w3.org/Tools/html2things.html, http://www.iwriteiam.nl/html2tex.html, and other stuff http://www.google.ca/search?q=html+to+latex&start=0&ie=utf-8&oe=utf-8&client=firefox-a&rls=org.mozilla:en-US:official

PercussivePaul
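For the LaTeX route, a rough sketch of what "one script" might look like, written in Python rather than the asker's Perl. It assumes the entry bodies are well-formed and contain only paragraphs and bold/italic markup; every function and class name here is made up for illustration, and any other tags are dropped while their text is kept. The html2tex tools linked above handle far more than this.

```python
from html.parser import HTMLParser

# Characters LaTeX treats specially, mapped to safe replacements.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#",
    "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def tex_escape(text):
    """Escape LaTeX special characters one character at a time."""
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)


class BodyToLatex(HTMLParser):
    """Convert a small subset of HTML (p, b/strong, i/em) to LaTeX."""

    OPEN = {"b": r"\textbf{", "strong": r"\textbf{", "i": r"\emph{", "em": r"\emph{"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in self.OPEN:
            self.out.append(self.OPEN[tag])

    def handle_endtag(self, tag):
        if tag in self.OPEN:
            self.out.append("}")
        elif tag == "p":
            # A blank line ends a paragraph in LaTeX.
            self.out.append("\n\n")

    def handle_data(self, data):
        self.out.append(tex_escape(data))


def body_to_latex(html_body):
    parser = BodyToLatex()
    parser.feed(html_body)
    parser.close()
    return "".join(parser.out).strip()


def entry_to_latex(title, date, html_body):
    """Render one blog entry as a section with the date in italics."""
    return (
        f"\\section{{{tex_escape(title)}}}\n"
        f"\\textit{{{tex_escape(date)}}}\n\n"
        f"{body_to_latex(html_body)}\n"
    )


if __name__ == "__main__":
    print(entry_to_latex("My #1 post", "2006-03-04", "<p>Hello <b>world</b> &amp; all</p>"))
```

Concatenating the output for each entry between a preamble with `\tableofcontents` and `\end{document}` would give the table of contents and page numbering without manual styling.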

You seem like a programmer type - if so, you may find http://www.fpdf.org/ to be useful for this purpose. It's a minimalist PHP library for writing PDFs that contains very intuitive handling of margins, line breaks, page breaks, and standard headers & footers (the things I originally assumed would be very complicated when I first experimented with PDF generation). CSS-like text styles would be handled by writing methods that set font characteristics and so on.

migurski

Not exactly what you're looking for, but there's http://www.blurb.com/create/book/blogbook that claims to do all this automagically (but is presumably locked into their book-making service).

stavrosthewonderchicken

how about openoffice? i know it can export straight to PDF and it certainly fits your free criteria.

moochoo

You might like this option... get the new Internet Explorer 7, then click on the pull-down menu "Page" and select "Edit with Microsoft Word". It's magic! Loads the whole thing into Word, preserving most of the formatting and images. For the blog I tried, it was as good as I could hope for.

cabb-chase

I can't tell if the advice above is meant to be a joke or not. If so, stop. If not, I'm sorry, but please lurk more, cabb-chase. That's fucking cretinous.

stavrosthewonderchicken
