Are there any papers/theses/research out there that prove that RegEx should not be used for HTML parsing and that an XML parser should be used instead?

The general consensus is to never use RegEx for HTML parsing; an XML parser should be used instead. Are there any commendable papers/theses out there which state or prove this?

After reading this answer (http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1732454) to the "RegEx match open tags except XHTML self-contained tags" StackOverflow question and the subsequent 'Parsing Html The Cthulhu Way' blog post by Jeff Atwood, I am a bit annoyed that the head of our Computer Science department taught us 2/3 modules on parsing HTML with RegEx. I'm looking for something I can approach the department with so we can 'save the children'. I don't think they'd change course content based on a blog/SO post. It would be really nice if the documentation also proves that an XML parser is the way forward.

I understand there might be value in learning it even if you wouldn't use it. But is there any documentation that makes the case against using RegEx for HTML parsing? (Preferably one that points out XML parsers as the better solution and why.)

This answer (http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags#answer-1758162) to the same StackOverflow question, about the Chomsky hierarchy (http://en.wikipedia.org/wiki/Chomsky_hierarchy), is interesting. Is there research out there to prove this is the case and therefore prove the overall point?

Even if you don't agree with me going to my department and protesting (feel free to say that too, by the way), please post links to any papers/research, because they could be very helpful for the project I am currently in the middle of, which is (right now) using RegEx for a lot of HTML parsing. Any papers/research could greatly improve the documentation side of my project, if not the whole thing.

A question like this was closed on StackOverflow as 'not constructive'.
Answer:

This whole cthulhu-never-parse-xml-with-regex meme has to die. Shouting doesn't make you right. Swearing doesn't make you right. Even using Unicode to make your text look all weird doesn't make you right.

This meme is technically correct but basically just wrong. It is technically correct in the sense that a single, classical regular expression is not capable of recognizing the set of valid XML documents. But to say that you therefore "cannot parse XML with regex" is like saying that you "cannot blow things up with dynamite", because dynamite after all won't ignite without a match, and you need the oxygen from the air to fuel the combustion: it's irritating pedantry. So, just to set the record straight:

1) Parsers use regex. Parsers are most often made of two components: a scanner written as a regular expression, and a parser written as a context-free grammar. If you've heard of the Unix tools lex and yacc, well, the first one is for writing scanners and the second one is for writing parsers. Lex is just a fancy tool for writing regular expressions. In fact, it is quite unusual to find *any* parser which *doesn't* use regular expressions. The Haskell combinator library Parsec allows you to combine the lexing and parsing into one, and packrat parsers don't require a separate scan phase. Most everything else uses regex (or hand-written finite-state machines).

2) HTML parsers use regex. The source code for Python's HTML parser is here: http://hg.python.org/cpython/file/default/Lib/html/parser.py Check out lines 16-65. Nothing but regex. Python's parser, like most parsers for HTML, SGML or XML, is completely reliant on regex. I've just looked through several such parsers. A few of the low-level ones like Xerces or expat don't use regex libraries, but they still have scanner phases with their characteristic finite-state machines.
So saying "don't use regex for parsing HTML; instead, use a parser" is kind of like saying "don't use an engine to travel long distances; try a car instead." Ignorant silliness.

3) In fact, the scanning phase is pretty much all there is to parsing HTML. In a language like, say, C, the grammar parser will often be much more complicated than the scanner. This is because it must deal with all the complex nested structural patterns of C: it must know which "else" belongs to which "if", when x * * y is a declaration and when it is multiplication and dereference, and so on. HTML doesn't have that kind of structural complexity. It's basically just tags inside tags. In fact, XML or HTML parsers often operate in SAX mode, which is really just a scanner. The grammar parser for HTML, whose hoary complexity supposedly precludes one from EVER parsing HTML with a regex, could literally be written as a short switch statement inside a while loop.

The real reason you shouldn't use a regex to parse XML is that XML is an overcomplicated abomination, whose correct processing is impossible without a serious amount of code. A correct XML processor has to handle namespaces, entity references, encodings, CDATA, processing instructions, and endless other shit that you've never heard of and don't care about. HTML by contrast has a de facto standard of "what will work on at least the most dominant browsers", and in practice you can parse it with regex, and in practice everyone does.

Finally, I'd just like to say: don't always trust programmers' conventional wisdom or the crowd at StackOverflow. It's the digital equivalent of bullshit around the water cooler (ditto Quora, for that matter). If you're looking to parse HTML with regex, I would look at the Python parser I posted above, or the one here: https://github.com/tautologistics/node-htmlparser/blob/master/lib/htmlparser.js or just generally look around.
The theory is explained in books like "Introduction to Automata Theory, Languages, and Computation", but it's probably a lot more effort to try to learn all that than it is to understand a simple regex-based parser.
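
To make the "regex scanner plus a short loop" idea above concrete, here is a minimal sketch in Python. It is not taken from any of the parsers linked in this answer: the names `TOKEN`, `scan` and `parse` are made up for illustration, and it assumes tidy markup with no comments, CDATA, script bodies, or `>` characters inside attribute values.

```python
import re

# Scanner: a single regular expression that recognises tags and text runs.
# Simplifying assumption: no comments, CDATA, or '>' inside attribute values.
TOKEN = re.compile(
    r"<(?P<close>/)?(?P<name>[A-Za-z][A-Za-z0-9]*)(?P<attrs>[^>]*?)(?P<selfclose>/)?>"
    r"|(?P<text>[^<]+)"
)

def scan(markup):
    """Yield (kind, value) tokens: open, close, selfclose, or text."""
    for m in TOKEN.finditer(markup):
        if m.group("text") is not None:
            yield ("text", m.group("text"))
        elif m.group("close"):
            yield ("close", m.group("name").lower())
        elif m.group("selfclose"):
            yield ("selfclose", m.group("name").lower())
        else:
            yield ("open", m.group("name").lower())

def parse(markup):
    """Build a nested dict tree; the 'grammar' is a loop, a branch, and a stack."""
    root = {"tag": "#root", "children": []}
    stack = [root]
    for kind, value in scan(markup):
        if kind == "text":
            if value.strip():
                stack[-1]["children"].append(value.strip())
        elif kind in ("open", "selfclose"):
            node = {"tag": value, "children": []}
            stack[-1]["children"].append(node)
            if kind == "open":
                stack.append(node)  # children that follow belong to this node
        else:  # kind == "close"
            if len(stack) == 1 or stack[-1]["tag"] != value:
                raise ValueError("unexpected closing tag: " + value)
            stack.pop()
    if len(stack) != 1:
        raise ValueError("unclosed tag: " + stack[-1]["tag"])
    return root

tree = parse("<p>Hello <b>world</b><br/></p>")
print(tree)
```

The regex only ever sees one tag or one text run at a time; the nesting is tracked by the stack, which is the part a regular expression alone cannot do. A real HTML parser also has to recover from mismatched tags rather than raise an error, which this sketch deliberately does not attempt.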

Jason Priestley at Quora

Other answers

Regular expressions are basically finite state machines. This means that they are not Turing complete and have certain limitations (such as not being able to detect whether tags were closed and/or nested correctly). A regex on its own will thus not be sufficient. Highly recommended theory behind this stuff is the book by Michael Sipser: http://www-math.mit.edu/~sipser/book.html That does not mean a regex does not have its merits for simple ad-hoc stuff. Just don't build a browser based on it.
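
A small illustration of the nesting limitation, as a sketch in Python using a made-up `<div>` pattern: neither the lazy nor the greedy form of a single pattern pairs an opening tag with its own closing tag once tags nest or repeat.

```python
import re

nested = "<div>outer <div>inner</div> tail</div>"
siblings = "<div>a</div><div>b</div>"

# Lazy match: stops at the first </div>, which actually closes the inner div.
lazy = re.search(r"<div>(.*?)</div>", nested).group(1)

# Greedy match: runs to the last </div>, swallowing the boundary between siblings.
greedy = re.search(r"<div>(.*)</div>", siblings).group(1)

print(repr(lazy))    # 'outer <div>inner'
print(repr(greedy))  # 'a</div><div>b'
```

Picking the right closing tag requires counting how many tags are currently open, and a finite state machine has no unbounded counter to do that with.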

Ruben Vermeersch

The biggest argument against using regexp to parse all your XML is performance. A general regexp will be a lot slower than a focused regexp to find tokens combined with a proper tokenized grammar. Having said that, XML has some particularly ugly syntax that makes parsing harder than it should be, such as the significance of whitespace.

Jeff Kesselman

As long as your HTML is well-formed, you should be able to hack together some regexes and parse an HTML document with no problem. When you remove the caveats above, however, the reasons that you shouldn't use regexes for this are practical, not academic.

In general, the HTML you may need to parse will be of multiple HTML versions, and may not be well formed. There is still an expectation, however, that the document can be presented, so parsers have to handle a lot of corner cases to work with badly formed HTML. As long as your professor controls the input, and doesn't assault you with a bunch of broken HTML in multiple versions, there shouldn't be any problem in applying regexes to your class assignments.

I wouldn't contact your department's dean demanding a change in curriculum over this. I suspect that your professor assumes that most students have enough of an understanding of HTML to be able to build parsers for it using regexes, and prefers to use HTML to avoid teaching you Lisp or some toy language for you to work with. I'm guessing that if I heard the full story of why you've been parsing HTML with regexes in class, my reaction would be, "kids these days have it so easy."

In practice, when working with arbitrary HTML documents, you should consider parsing the HTML into a DOM and querying with XPath or CSS selectors. I prefer Selenium WebDriver for this. It may seem like overkill, but the parsers in Firefox and WebKit are state of the art, and if the page is updated via AJAX calls after the document is loaded, then the updates will be queryable via the DOM, all without needing to write a single regex.
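
As a rough sketch of the "parse into a tree, then query" approach, the snippet below uses Python's standard-library `xml.etree.ElementTree` and its limited XPath subset instead of Selenium, so that it runs without a browser. That is a substitution for illustration, not the setup described above, and it only works because the made-up `DOC` fragment is well-formed XHTML; arbitrary real-world HTML needs a lenient HTML parser or a browser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed XHTML fragment used only for illustration.
DOC = """<html><body>
<div class="nav"><a href="/home">Home</a><a href="/about">About</a></div>
<div class="content"><p>See <a href="/docs">the docs</a>.</p><a name="top">no href</a></div>
</body></html>"""

root = ET.fromstring(DOC)

# Every <a> element, at any depth, that carries an href attribute.
hrefs = [a.get("href") for a in root.findall(".//a[@href]")]

# Narrow the query to links inside the div whose class is "content".
content = root.find(".//div[@class='content']")
content_hrefs = [a.get("href") for a in content.findall(".//a[@href]")]

print(hrefs)          # ['/home', '/about', '/docs']
print(content_hrefs)  # ['/docs']
```

The queries describe what to find in terms of document structure, so they keep working when whitespace, attribute order, or quoting style changes, which is where hand-written regexes tend to break.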

Kevin von Horn
