Is there a clean way to parse HTML?

What's the best way to convert a Microsoft Word document to clean unstyled HTML?

  • Goals and constraints: I'd like to take a Word doc, maintain only basic styling, and get very clean HTML or, ideally, Markdown out. It also has to be able to be automated via an API, script, Mac program, etc. (i.e. no Windows programs or pastable HTML forms). Like clean enough such that my grandmother could view the source and recognize it as her own delicious pie recipe.

    Here is what I'd like to maintain (anything else is gravy):

      • Chapter headings, or things with large font treated the same way (h1, h2, h3)
      • Paragraphs (p)
      • Lists (ol, ul)
      • Links (a)
      • Bold (b)
      • Tables (if possible)

    I don't want any CSS... just HTML tags.

    It's ok if the user has some minimal work to do: I'm ok with a solution that requires some tagging by the user (e.g. chapter headings), but I would much prefer a tagging solution that could find other instances (e.g. if I mark one section as header 1, it looks for others with the same font attributes) and that wouldn't require the user to tag until their fingers bleed (think #Chapter1 as good, but <p>My laborious task.</p><p>My laborious task.</p> as bad).

    What I've already found:

      • I've seen web tools like: http://word2cleanhtml.com/
      • I've played around with Pandoc: http://johnmacfarlane.net/pandoc/
      • I've even used Google Docs (it does a pretty good job, and you can API-a-tize it, but I'm not yet convinced it's the golden path).

    I've heard about, but not yet tried:

      • http://wordoff.org/api
      • http://www.w3.org/People/Raggett/tidy/ / http://tidy.sourceforge.net/
      • Dreamweaver Export (wild!)

    So, I ask you now, friends, Romans, countrymen of the Internet, how would you do this? Heck, how would you even approach this problem? Maybe I'm doing it all wrong!

  • Answer:

    Email the Word file to yourself in Gmail as an attachment. Then open the attachment using "View as HTML" and save or share the result. The same can be done with .pdf files too.

Anonymous at Quora

Other answers

I'm a Roman*, so I'll tell you what I use**: LibreOffice, which reads all those Microsoft files and saves into a lot of different formats. It costs you nothing to try. You can even edit PDF files. http://libreoffice.org

* Rome, NY.
** By necessity, because I run Linux.

James Van Damme
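Since the question asks for something that can be automated, here is a minimal Python sketch of driving the LibreOffice suggestion above from a script. It assumes LibreOffice is installed and that its `soffice` (or `libreoffice`) binary is on PATH; on a Mac it usually lives inside the application bundle instead, so the path may need adjusting. The function names `build_convert_command` and `convert_to_html` are made up for this example, and the HTML LibreOffice produces will likely still need cleaning afterwards.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, binary="soffice"):
    """Build the LibreOffice command line for a headless conversion to HTML."""
    return [
        binary,
        "--headless",             # run without opening any window
        "--convert-to", "html",   # target format
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_to_html(doc_path, out_dir="."):
    """Run the conversion and return the path LibreOffice is expected to write."""
    binary = shutil.which("soffice") or shutil.which("libreoffice")
    if binary is None:
        raise RuntimeError("LibreOffice was not found on PATH")
    subprocess.run(build_convert_command(doc_path, out_dir, binary), check=True)
    # LibreOffice names the output after the input file, with the new extension.
    return Path(out_dir) / (Path(doc_path).stem + ".html")
```

Calling `convert_to_html("recipe.docx", "out")` would be the typical use; the file name here is only a placeholder.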

Hi, you can save in Word as filtered HTML. Have you tried that? It removes all the Office-specific tags etc. There are some font and style defs left in the HTML, but it is pretty clean.

Brian Phillips
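The font and style definitions that filtered HTML leaves behind can be stripped in a second pass. The following is a rough sketch using only Python's standard library, not a full sanitizer: it keeps the tags listed in the question (plus `li`, `tr`, `td` and `th`, which lists and tables need), drops every attribute except `href` on links, and throws away the contents of `style`, `script` and `title`. The names `ALLOWED`, `TagWhitelist` and `clean_html` are invented for this example.

```python
from html import escape
from html.parser import HTMLParser

# Tags the question asks to keep; every other tag is dropped (its text is kept).
ALLOWED = {"h1", "h2", "h3", "p", "ol", "ul", "li", "a", "b",
           "table", "tr", "td", "th"}
# Elements whose text content is discarded along with the tag.
SKIP_CONTENT = {"style", "script", "title"}


class TagWhitelist(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a SKIP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
        elif tag in ALLOWED and not self.skip_depth:
            href = dict(attrs).get("href") if tag == "a" else None
            if href:
                self.out.append('<a href="%s">' % escape(href, quote=True))
            else:
                self.out.append("<%s>" % tag)  # all attributes dropped

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in ALLOWED and not self.skip_depth:
            self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_html(html):
    """Return html reduced to the whitelisted tags, with no CSS or attributes."""
    parser = TagWhitelist()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

A whitelist is used rather than a blacklist because Word's markup varies a lot between versions; anything not explicitly allowed simply disappears while its text survives.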

Unfortunately, Wordoff has stopped working. Use http://www.html-cleaner.com instead. It's a very user-friendly way of converting Word documents to clean HTML!

John Johnson

Aptara and Innodata are the partners Inkling used to do this to make hundreds of super clean HTML books. Even with 50+ amazing engineers, the most efficient way to do this with the highest quality was with people + technology (ticketing systems on the content, revision control, and individual logins made process improvements possible). There were too many variables and things that went wrong to be able to do it efficiently and automatically (at least in 2012) with the quality you're talking about. Heck, even having the original InDesign files that the print book was made from didn't help much. Clean HTML and great page layout are just so different. Try printing out a random complex webpage: the mess that's produced will give you a sense of just how different HTML and print are when you dig a bit.

Jordan Crawford
