Best way to convert this HTML file into an XML file using Python
-
this html is https://mail.google.com/mail/?ui=2&ik=a0b1e46c9c&view=att&th=1296be43b8e3bbd9&attid=0.1&disp=inline&zw : <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"><html><head><META http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body> <div bgcolor="#48486c"> <table width="720" border="0" cellspacing="0" cellpadding="0" align="center" background="http://title.jpg" height="130"> <tr height="129"> <td width="719" height="129"></td> <td width="1" height="129"></td> </tr> <tr height="1"> <td width="720" height="1"></td> <td width="1" height="1"></td> </tr> </table> <table width="720" border="0" cellspacing="0" cellpadding="0" align="center" height="203"> <tr height="20"> <td width="719" height="20"></td> <td width="1" height="20"></td> </tr> <tr height="69"> <td width="719" height="69" valign="top" align="left"> <table width="719" border="1" cellspacing="2" cellpadding="0"> <tr> <td bgcolor="a5fdf8" width="390"><b>Stream Name</b></td> <td bgcolor="a5fdf8" width="61"><b>Status</b></td> <td bgcolor="a5fdf8" width="61"><b>Duration</b></td> <td bgcolor="a5fdf8" width="185"><b>Start</b></td> </tr> <tr bgcolor="white"> <td width="390">c:\streams\ours\Sony_AVCHD_<WBR>Test_Discs_60Hz_00001.m2ts</td> <td width="61"><font color="#D0D0D0">----</font></td> <td width="61">00:00:02</td> <td width="185">2010/06/15-15:06:17</td> </tr> </table> </td> <td width="1" height="69"></td> </tr> <tr height="113"> <td width="720" height="113" colspan="2" valign="top" align="left"> <table width="721" border="1" cellspacing="2" cellpadding="0"> <tr bgcolor="a5fdf8"> <td width="299"><b>Test Category</b></td> <td width="61"><b>Error</b></td> <td width="62"><b>Warning</b></td> <td width="275"><b>Details</b></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#099eac">All Tests (Sony_AVCHD_Test_Discs_60Hz_<WBR>00001.m2ts)</font></td> <td width="61"><font color="#ff0000">34787</font></td> <td width="61"><font color="#000000">0</font></td> <td 
width="275"></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#800000"> ETSI TR-101-290 Tests</font></td> <td width="61"><font color="#800000">No Lic</font></td> <td width="61"><font color="#800000">No Lic</font></td> <td width="275"></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#800000"> ISO/IEC Transport Stream Tests</font></td> <td width="61"><font color="#800000">No Lic</font></td> <td width="61"><font color="#800000">No Lic</font></td> <td width="275"></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#800000"> System Data T-STD Tests</font></td> <td width="61"><font color="#800000">No Lic</font></td> <td width="61"><font color="#800000">No Lic</font></td> <td width="275"></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#099eac"> Prog(1)</font></td> <td width="61"><font color="#ff0000">34787</font></td> <td width="61"><font color="#000000">0</font></td> <td width="275"></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#099eac"> VES(0xe0)</font></td> <td width="61"><font color="#ff0000">34787</font></td> <td width="61"><font color="#000000">0</font></td> <td width="275"></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#1010F0"> H.264/AVC Conformance</font></td> <td width="61"><font color="#ff0000">34718</font></td> <td width="61"><font color="#000000">0</font></td> <td width="275"> <a><font color="#ff0000">Sony_AVCHD_Test_Discs_60Hz_<WBR>00001.m2ts_Prog(1)_PID(0x1011)<WBR>_H264_Conf.txt</font></a><br> </td> </tr> <tr bgcolor="white"> <td width="299"><font color="#101010"> Sequence</font></td> <td width="61"><font color="#000000">0</font></td> <td width="61"><font color="#000000">0</font></td> <td width="275"></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#101010"> Picture</font></td> <td width="61"><font color="#000000">0</font></td> <td width="61"><font color="#000000">0</font></td> <td width="275"></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#101010"> 
Slice</font></td> <td width="61"><font color="#000000">0</font></td> <td width="61"><font color="#000000">0</font></td> <td width="275"></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#101010"> Macroblock</font></td> <td width="61"><font color="#ff0000">34718</font></td> <td width="61"><font color="#000000">0</font></td> <td width="275"></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#101010"> Block</font></td> <td width="61"><font color="#000000">0</font></td> <td width="61"><font color="#000000">0</font></td> <td width="275"></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#1010F0"> HRD Tests</font></td> <td width="61"><font color="#ff0000">69</font></td> <td width="61"><font color="#000000">0</font></td> <td width="275"> <a><font color="#ff0000">Sony_AVCHD_Test_Discs_60Hz_<WBR>00001.m2ts_Prog(1)_PID(0x1011)<WBR>_H264_HRD.txt</font></a><br> </td> </tr> <tr bgcolor="white"> <td width="299"><font color="#101010"> HRD level</font></td> <td width="61"><font color="#ff0000">69</font></td> <td width="61"><font color="#000000">0</font></td> <td width="275"></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#800000"> Video T-STD Tests</font></td> <td width="61"><font color="#800000">No Lic</font></td> <td width="61"><font color="#800000">No Lic</font></td> <td width="275"></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#099eac"> AES(0xfd)</font></td> <td width="61"><font color="#000000">0</font></td> <td width="61"><font color="#000000">0</font></td> <td width="275"></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#808080"> Audio Level Tests</font></td> <td width="61"><font color="#808080">Disabled</font></td> <td width="61"><font color="#808080">Disabled</font></td> <td width="275"></td> </tr> <tr bgcolor="white"> <td width="299"><font color="#800000"> Audio T-STD Tests</font></td> <td width="61"><font color="#800000">No Lic</font></td> <td width="61"><font color="#800000">No Lic</font></td> 
<td width="275"></td> </tr> </table> </td> </tr> <tr height="1"> <td width="719" height="1"></td> <td width="1" height="1"></td> </tr> </table> </div> </body></html> Is there a Python library that can do this? Thanks.
-
Answer:
BeautifulSoup (http://www.crummy.com/software/BeautifulSoup/) gets you almost all the way there. The session below uses Python 2 syntax and the BeautifulSoup 3 module:

>>> import BeautifulSoup
>>> f = open('a.html')
>>> soup = BeautifulSoup.BeautifulSoup(f)
>>> f.close()
>>> g = open('a.xml', 'w')
>>> print >> g, soup.prettify()
>>> g.close()

This closes all tags properly. The only remaining issue is that the doctype is still HTML. To change it to the doctype of your choice, you only need to replace the first line, which is not hard. For example, instead of printing the prettified text directly:

>>> lines = soup.prettify().splitlines()
>>> lines[0] = ('<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
...             '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">')
>>> print >> g, '\n'.join(lines)
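The first-line replacement described above can be wrapped in a small helper. This is only a sketch in Python 3 using plain string handling. The names `replace_doctype` and `XHTML_DOCTYPE` are made up for illustration, and it assumes the doctype sits alone on the first line, as it does in prettified output. Note the space that separates the public identifier from the DTD URL when the two string pieces are joined.

```python
# Hypothetical helper (not part of BeautifulSoup): swap the first-line doctype.
XHTML_DOCTYPE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">'
)


def replace_doctype(markup, doctype=XHTML_DOCTYPE):
    """Return markup with its first-line doctype replaced by `doctype`.

    Assumes the existing doctype, if any, occupies exactly the first line.
    If the first line is not a doctype, the new one is prepended instead.
    """
    lines = markup.splitlines()
    if lines and lines[0].lstrip().lower().startswith("<!doctype"):
        lines[0] = doctype
    else:
        lines.insert(0, doctype)
    return "\n".join(lines)
```

With a parsed `soup` object, the call would look like `replace_doctype(soup.prettify())`, and the result can then be written to the output file.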
zjm1126 at Stack Overflow
Other answers
lxml works well:

from lxml import html, etree
doc = html.fromstring(open('a.html').read())
out = open('a.xhtml', 'wb')
out.write(etree.tostring(doc))
out.close()
Ian Bicking
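If neither BeautifulSoup nor lxml can be installed, a rough equivalent can be put together from the Python 3 standard library alone. The sketch below is a different technique from either answer: it feeds the HTML through `html.parser.HTMLParser` and builds an `xml.etree.ElementTree` tree, which is then serialized as well-formed XML. The names `HTMLToTree`, `VOID_TAGS`, and `html_to_xml` are hypothetical. It is much less forgiving than the two libraries: it drops the doctype and comments, and it assumes every tag and attribute name is also a valid XML name.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# HTML elements that never take a closing tag (e.g. <meta>, <br>, <wbr>).
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "param", "source", "track", "wbr"}


class HTMLToTree(HTMLParser):
    """Hypothetical helper: build an ElementTree from lenient HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = None   # first element seen becomes the root
        self.stack = []    # currently open (non-void) elements

    def handle_starttag(self, tag, attrs):
        # Valueless attributes such as "checked" get their own name as value.
        attrib = {k: (v if v is not None else k) for k, v in attrs}
        if self.stack:
            elem = ET.SubElement(self.stack[-1], tag, attrib)
        elif self.root is None:
            elem = self.root = ET.Element(tag, attrib)
        else:
            return  # ignore markup after the root element has closed
        if tag not in VOID_TAGS:
            self.stack.append(elem)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, together with
        # anything left unclosed inside it; stray end tags are ignored.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if not self.stack:
            return
        parent = self.stack[-1]
        if len(parent):
            # Text after a child element belongs to that child's tail.
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_xml(markup):
    """Return `markup` re-serialized as a well-formed XML string."""
    builder = HTMLToTree()
    builder.feed(markup)
    builder.close()
    if builder.root is None:
        raise ValueError("no elements found in input")
    return ET.tostring(builder.root, encoding="unicode")
```

For a file such as the `a.html` in the question, the returned string from `html_to_xml(open('a.html', encoding='utf-8').read())` can be written to `a.xml`. For real-world pages with messier markup, the two libraries above remain the safer choice.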