Is there a clean way to parse HTML?

Parse an Internet Explorer Bookmark file in Python or Java

  • I'm quite embarrassed to ask for help on this as I've already banged my head against the wall trying to figure this one out for 8 hours over this weekend and I think this should be easily solvable! My question is: How do I parse an IE saved bookmark.html file such that no matter how many subfolders are contained by a top level folder, the correct parent/child folder path will be maintained. Here is what I mean. For the following list of bookmark folders and subfolders, insert the path into a database or write to a file: Let's say we have 3 top level folders (all starting at an index of 4 by the way in IE). Arts Business Computers In Arts we have: Animation. In Animation we have: Education. In Education we have: On-line And so on. Each top level folder might have no sub folders, or it might have 1000's. Each sub folder in turn might have no sub folders, or it might have 1000's. Help me get past my stuck point. I can get a list of the top level folders and every sub folder within, but not the actual parent / child relationship as detailed above. The end result of the above example would look like this inserted as a single row in a MySQL database, or written to a text file: Arts Arts/Animation Arts/Animation/Education Arts/Animation/Education/On-line Business (assuming no sub folders) Computers (assuming no sub folders) Here is my existing Python code. Your code may be in Python or Java. Preferably in Python. You may modify the code as you see fit. Simplify it, or start over from scratch. You are ONLY reading the bookmark.html file!!! You are not traversing the Windows Favorites directory. (Actually, I've already written something to do that, and that was easy. The OS tells you what the path is.) No other language is acceptable as I only have experience with the above 2. Also, if you could explain what was difficult about this? I mean it seems so straightforward, and I was able to write it out in pseudocode no problem, but when it came to implementing it....I just can't get my mind to go beyond the second level of folders. Perhaps a suggestion on how to do these problems in the future? import os import string import re f=open('all_bookmark.html', 'r+') line_list = f.readlines() xy = 4 top_folder = '' sub_folder = [] folder = [] filename = 'out.txt' localfile = open(filename, 'wb') p = re.compile('<DT><H3 FOLDED ADD_DATE=.*?>') o = re.compile(' ADD_DATE.*?.>') for x in line_list: y = string.find(x,'<'); if(y == xy): strip_version = x.lstrip() if(strip_version.startswith('<DT><H3 FOLDED ADD_DATE')): sub_folder = [] strip_version = x.lstrip() strip_version = p.sub('', strip_version) strip_version = strip_version.replace('</H3>','') strip_version = strip_version.replace('\n','') top_folder = strip_version list.append(folder, top_folder) if(y > xy): strip_version = x.lstrip() if(strip_version.startswith('<DT><H3 FOLDED ADD_DATE')): strip_version = p.sub('', strip_version) strip_version = strip_version.replace('</H3>','') strip_version = strip_version.replace('\n','') sub_folder = [top_folder + '/' + strip_version] list.append(folder, sub_folder) if(y < xy): strip_version = x.lstrip() if(strip_version.startswith('<DT><H3 FOLDED ADD_DATE')): strip_version = p.sub('', strip_version) strip_version = strip_version.replace('</H3>','') strip_version = strip_version.replace('\n','') sub_folder = [sub_folder + '/' + strip_version] folder[0:-1] list.append(folder, sub_folder) '''Now that I have all of the folder paths, I iterate through the folder list and write each line to a file. Problem is, in Python, you can't write the element of a list to a file.....must be a way, but I haven't found it.''' for x in folder: localfile.write(string.join(x) + '\n') localfile.close()

  • Answer:

    Hi coolguy90210, In a Netscape-format bookmark file exported from Internet Explorer, each folder is listed in a DT element, which is followed by a DL element, which may be empty, or may contain folders or links. That means if you see a folder inside a DL element, you know it's inside the folder named in the preceding DT element, and if you see the end of a DL element, you know you have reached the end of the containing folder. So when you see a folder name, you know you want to put out a line, and the problem is what previously seen folder names to prefix to the one just read. To get the prefix, every time you see a folder name, you start a scope and append the name to the current prefix, and every time you see the end of a DL element, you know you have left a folder scope, so you remove the last folder name from the prefix. So I wrote the program (in Python) to look just for DT elements for folder names and the ends of DL elements (coded as "</DL>") that mark the ends of folder scopes. Here's the program: # Extract a list of folder pathnames from a Netscape-style bookmark file # as exported by Microsoft Internet Explorer. # Constants ENDFOLDER = "</DL><p>" PREFIX = "<DT><H3 FOLDED ADD_DATE=" # Input file f=open('c:/Doc/bookmark.htm', 'r+') # Output file filename = 'out.txt' localfile = open(filename, 'w') currentFolder = "" # Read in the whole input file. line_list = f.readlines() for x in line_list: # Remove all leading whitespace. line = x.lstrip() if line.startswith(PREFIX): # This line tells us about a folder. # Find the next '>' after the prefix. lpos = line.find('>', len(PREFIX)) if lpos > 0: # Find the next '<' after the '>'. rpos = line.find('<', lpos) # Extract the folder name. name = line[lpos + 1:rpos] # If at the top level, we don't want a leading slash. if len(currentFolder) == 0: currentFolder = name else: # This folder must be within the last one we saw, # so append its name to the currentFolder name. currentFolder = currentFolder + '/' + name # Put out the constructed pathname. localfile.write(currentFolder + '\n') elif line.startswith(ENDFOLDER): # The other kind of interesting input marks the end of a # containing folder's scope. # Trim the last folder name from the current path. trimpos = currentFolder.rfind('/') if trimpos < 0: # Must be top level, so trash it all. currentFolder = "" else: currentFolder = currentFolder[:trimpos] localfile.close() I expect it is also possible to solve the problem using the standard Python htmllib module, which would actually parse all the HTML. http://www.python.org/doc/current/lib/module-htmllib.html It's hard to say why this was hard for you. It looks like the difficulty was in finding the algorithm rather than implementing it. I'm also not sure how to advise you about tackling such problems in the future. In this case, it might have helped to review basic computer science material on parsing and tree data structures, as well as the structure of HTML. And with Python, unless you are dealing with something very specialized, it's often a good idea to look for existing modules that do some of the work for you. I hope this answer is helpful. If anything is unclear or you need more information about any part of it, please ask for a clarification. --efn

coolguy90210-ga at Google Answers Visit the source

Was this solution helpful to you?

Related Q & A:

Just Added Q & A:

Find solution

For every problem there is a solution! Proved by Solucija.

  • Got an issue and looking for advice?

  • Ask Solucija to search every corner of the Web for help.

  • Get workable solutions and helpful tips in a moment.

Just ask Solucija about an issue you face and immediately get a list of ready solutions, answers and tips from other Internet users. We always provide the most suitable and complete answer to your question at the top, along with a few good alternatives below.