How do I open this word document in html?

How do you convert a Word Document into very simple html in Python?

  • Every now and then I receive a Word document that I have to display as a web page. I'm currently using Django's flatpages to do this, pasting in the HTML content generated by MS Word. The generated HTML is quite messy. Is there a better way, using Python, to generate very simple HTML instead?

  • Answer:

    A good solution involves uploading the document to Google Docs and exporting the HTML version from it. (There must be an API for that?) Google Docs does a lot of clean-up on its own; BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) can then be used down the road to make any further changes, as appropriate. BeautifulSoup is, in my view, the most powerful and elegant HTML parsing library on the planet. This approach is a known standard among journalism companies.

Thierry Lam at Stack Overflow
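
As a rough illustration of the "further changes" step mentioned in the answer above, here is a minimal sketch that strips Word-style markup down to a small whitelist of tags. It deliberately uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without extra dependencies. The `KEEP` whitelist, the `SimpleHtmlCleaner` class and the `clean_html` helper are illustrative names and choices, not part of any library or of the original answer.

```python
from html import escape
from html.parser import HTMLParser

# Tags kept in the simplified output. This whitelist is an assumption;
# adjust it to whatever "very simple HTML" means for your pages.
KEEP = {"p", "b", "strong", "i", "em", "a", "ul", "ol", "li",
        "h1", "h2", "h3", "br"}
VOID = {"br"}                                  # kept tags with no closing tag
SKIP_CONTENT = {"style", "script", "head", "title"}  # dropped with their content


class SimpleHtmlCleaner(HTMLParser):
    """Re-emit only whitelisted tags, without attributes (except href)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a SKIP_CONTENT element

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_CONTENT:
            self.skip_depth += 1
            return
        if self.skip_depth or tag not in KEEP:
            return
        href = dict(attrs).get("href") if tag == "a" else None
        if href:
            self.out.append('<a href="%s">' % escape(href, quote=True))
        else:
            self.out.append("<%s>" % tag)

    def handle_endtag(self, tag):
        if tag in SKIP_CONTENT:
            if self.skip_depth:
                self.skip_depth -= 1
            return
        if self.skip_depth or tag not in KEEP or tag in VOID:
            return
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))


def clean_html(markup):
    """Return markup reduced to the whitelisted tags."""
    parser = SimpleHtmlCleaner()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

For example, `clean_html('<p class="MsoNormal"><span style="font-family:Arial"><b>Hi</b> there</span></p>')` returns `<p><b>Hi</b> there</p>`. A real clean-up pass would probably also want to collapse empty paragraphs, which this sketch does not attempt.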

Other answers

It depends on how much formatting and how many images you're dealing with. I do one of a couple of things:

  • Google Docs: Probably the closest you'll get to the original formatting and usable HTML.

  • Markdown: Abandon formatting. Paste it into a plain text editor, run it through Markdown and fix the rest by hand.

Chris Amico

You can also use AbiWord (http://www.abisource.com/) or wvWare (http://wvware.sourceforge.net/) to convert a Word document to XHTML, and then parse it with BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/), ElementTree (http://effbot.org/zone/element-index.htm), etc. to preprocess it if you need to. In my experience, AbiWord does a pretty good job of converting Word files and produces relatively clean XHTML files. I should mention that AbiWord can be run from the command line, so it's easy to integrate into an automated process.

Etienne
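
To make the command-line integration in the answer above more concrete, here is a minimal sketch of calling AbiWord from Python. It assumes the `abiword` binary is on the PATH and that your version accepts the `--to` and `--to-name` options described in the AbiWord man page; check `abiword --help` on your install. The helper names `abiword_command` and `convert_to_html` are made up for this example.

```python
import subprocess
from pathlib import Path


def abiword_command(doc_path, out_path=None):
    """Build the argument list for converting a Word file to HTML.

    Assumes AbiWord's --to / --to-name options; verify them against
    your installed version before relying on this.
    """
    doc_path = Path(doc_path)
    # Default output: same name as the input, with an .html extension.
    out_path = Path(out_path) if out_path else doc_path.with_suffix(".html")
    return ["abiword", "--to=html", "--to-name=%s" % out_path, str(doc_path)]


def convert_to_html(doc_path, out_path=None):
    """Run AbiWord and return the converted HTML as text."""
    cmd = abiword_command(doc_path, out_path)
    # check=True raises CalledProcessError if AbiWord exits with an error.
    subprocess.run(cmd, check=True)
    html_path = cmd[2].split("=", 1)[1]
    # The output encoding is an assumption; errors="replace" avoids a crash
    # if AbiWord writes something other than UTF-8.
    return Path(html_path).read_text(encoding="utf-8", errors="replace")
```

Building the argument list separately from running it keeps the command easy to inspect and log before wiring it into an automated process.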

My super-simple app http://wordoff.org/ has an API (http://wordoff.org/api) for cleaning up cruft from Word-exported HTML. You could override the save method of your flatpages model to pipe your HTML through the API the first time it gets saved. Something like this:

    import urllib
    import urllib2

    def decruft(html):
        data = urllib.urlencode({'html': html})
        req = urllib2.Request('http://wordoff.org/api/clean', data)
        response = urllib2.urlopen(req)
        return response.read()

    def save(self, **kwargs):
        if not self.pk:  # only de-cruft when content is first added
            self.content = decruft(self.content)
        super(FlatPage, self).save(**kwargs)

tomd
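
The snippet in tomd's answer targets Python 2 (`urllib` and `urllib2`). A Python 3 port might look roughly like the sketch below. It assumes the endpoint and the `html` form field still work as described in the answer, and that the response is UTF-8 text; neither has been verified here. The `build_request` helper and the `timeout` parameter are additions for this example.

```python
import urllib.parse
import urllib.request

# Endpoint as given in the answer; whether it is still live is not verified.
WORDOFF_CLEAN_URL = "http://wordoff.org/api/clean"


def build_request(html):
    """Build a POST request carrying the HTML as a form field named 'html'."""
    data = urllib.parse.urlencode({"html": html}).encode("utf-8")
    # Passing data makes urllib send the request as a POST.
    return urllib.request.Request(WORDOFF_CLEAN_URL, data=data)


def decruft(html, timeout=10):
    """Send Word-exported HTML to the API and return the cleaned markup."""
    with urllib.request.urlopen(build_request(html), timeout=timeout) as response:
        # Decoding as UTF-8 is an assumption about the service's output.
        return response.read().decode("utf-8")
```

The `save` override from the answer can call this `decruft` unchanged. Wrapping the call in a `try`/`except urllib.error.URLError` would let a page still save if the service is unreachable.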

There are many other approaches, depending on your specific circumstances, beyond the good ones already suggested -- see http://stackoverflow.com/questions/910730/python-ms-word and its answers for a good survey!

Alex Martelli

Word 2010 has the ability to "save as filtered web page". This will eliminate the overwhelming majority of the HTML that Word inserts.

Greg Burdett

I found this web page: http://www.textfixer.com/html/convert-word-to-html.php It converts formatted text to simple HTML markup, preserving bold, italics, links and paragraphs, but without adding tags for font sizes and faces. Exactly what I needed to save some time.

DerVO
