Why is C++ grammar so hard to parse?
-
I am working on a project that uses a "yacc-able" C grammar. I thought it would be easy to experiment with the C grammar by writing my own actions in the yacc file, and then to follow the same procedure for a C++ grammar. Now, after 2 weeks, when I searched for a "yacc-able" grammar for C++, I found the links http://www.nobugs.org/developer/parsingcpp/index.html and http://www.jguru.com/faq/view.jsp?EID=531848, which clearly state how hard it is to parse C++. I just want to know what the main hurdles are in parsing with a C++ grammar. Is it the inclusion of templates and namespaces that makes it complex, or was I just wrong in assuming that C is just a subset of C++?
-
Answer:
C is hard to parse, and C++ is harder. I don't remember the specifics, but what I heard is that the problem is grammar ambiguities. There are many situations in which you don't know how to assign elements semantically until you look ahead several tokens, and you may even have to tentatively parse something one way, then back out and take a different tack. For example, the characters ;, (), <>, &, and comma (,) can all be used in many different ways depending on context, type structure, etc.
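To make the ambiguity concrete, here is a small sketch (my own illustration, not taken from any of the linked pages) in which the identical statement `a * b;` is a declaration in one scope and an expression in another. The namespace names, the `Counter` type and `run_examples` are made up for this example.

```cpp
#include <cassert>

// Case 1: `a` names a type, so `a * b;` DECLARES b as a pointer to int.
namespace a_is_type {
using a = int;
inline bool demo() {
    a * b;          // declaration: b has type int*
    b = nullptr;    // only legal because b is a pointer
    return b == nullptr;
}
}  // namespace a_is_type

// Case 2: `a` names an object, so `a * b;` is an EXPRESSION statement.
namespace a_is_object {
struct Counter { int hits = 0; };

// Overloaded operator* with a visible side effect, so the call can be observed.
inline Counter& operator*(Counter& lhs, Counter&) {
    ++lhs.hits;
    return lhs;
}

inline int demo() {
    Counter a, b;
    a * b;          // expression: calls operator*
    return a.hits;
}
}  // namespace a_is_object

// Usage: both readings of the same token sequence compile and behave differently.
inline void run_examples() {
    assert(a_is_type::demo());
    assert(a_is_object::demo() == 1);
}
```

A parser cannot choose between the two readings from the tokens alone; it needs to know what `a` was declared as earlier.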
Paul King at Quora
Other answers
From when I worked with this, many many years ago, this is what I remember:
First, a given token can have different meanings based on context. Thus, the lexer has to integrate with the parser.
Second, the meaning of tokens depends on what semantics are already bound to those tokens. For example, whether "foo" is a local variable, a typedef, or a function reference is not known at the point of discovery -- and more than one meaning could be conceivable.
Third, the C/C++ preprocessor has actually grown tentacles and whiskers and taken on a life of its own. Look at the various implementations of "stringify this token" to start understanding the depth of the problem.
Finally, the language is in itself ambiguous. If you have code like "if blah then if blah then blah else blah," a typical parser won't know whether the "else" belongs to the first or the second if statement. (The language specifies the second.) You can avoid this ambiguity by writing a version of the parser that accepts "no short ifs" (greedily eats "else") and using that on the inside of parsing if statements only. This duplicates the code you need to write, and introduces various possibilities for bugs! Each ambiguity in the language has the possibility of adding combinatorial complexity.
Another example: ">>" can mean "shift-right" or "close two templates." You can implement this by coding the shift operator in the grammar as a sequence of two greater-than operators, and separately erroring out if you see whitespace between the two. But this now requires more than a single-token look-ahead to determine whether what you're parsing is a greater-than operator or a shift operator, and because they have different precedence, the syntax parser needs to know which one it is. Another option is to allow the close-template code to accept the shift-right token as a "closer," and then somehow push another "close template" operation onto the list of tokens to be parsed, which has its own problems.
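For reference, the two meanings of ">>" look like this in source code. This is only a sketch; the names `Grid`, `quarter`, `Tag`, `Four` and `run_examples` are invented for the illustration.

```cpp
#include <cassert>
#include <vector>

// Here `>>` closes two template argument lists. This is accepted since C++11;
// C++03 required a space between the brackets (`> >`).
using Grid = std::vector<std::vector<int>>;

// Here `>>` is the right-shift operator.
inline int quarter(int value) { return value >> 2; }

template <int N>
struct Tag { static constexpr int value = N; };

// Inside a template argument list an unparenthesized `>>` is read as two
// closing brackets, so a shift has to be wrapped in parentheses.
using Four = Tag<(8 >> 1)>;

// Usage: exercise both meanings.
inline void run_examples() {
    Grid grid(2, std::vector<int>(3, 0));
    assert(grid.size() == 2 && grid[0].size() == 3);
    assert(quarter(20) == 5);
    static_assert(Four::value == 4, "8 >> 1 is 4");
}
```

Writing `Tag<8 >> 1>` without the parentheses does not compile in C++11 and later, because the first `>` of `>>` is taken as the end of the argument list.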
To get more insight, I would recommend reading the source code for clang, which is actually reasonably easy to get into. I would not recommend reading the source for gcc, which I found a lot harder to get into.
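As a rough illustration of the lexer/parser integration described in this answer, here is a toy sketch of a lexer that consults a symbol table before classifying an identifier. It is not how clang or gcc are implemented; `TokenKind`, `SymbolTable`, `classify` and `interpret_star_statement` are hypothetical names, and scopes are ignored entirely.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical token kinds, for this sketch only.
enum class TokenKind { TypeName, Identifier };

// A toy symbol table: the parser records type names here as it sees them.
// A real one would also track scopes, namespaces, templates and so on.
class SymbolTable {
public:
    void declare_type(const std::string& name) { types_.insert(name); }
    bool is_type(const std::string& name) const { return types_.count(name) != 0; }

private:
    std::set<std::string> types_;
};

// The lexer asks the symbol table how to classify an identifier.
inline TokenKind classify(const std::string& spelling, const SymbolTable& symbols) {
    return symbols.is_type(spelling) ? TokenKind::TypeName : TokenKind::Identifier;
}

// With the token kind known, a statement like `foo * bar;` can be routed
// to the declaration rule or to the expression rule.
inline std::string interpret_star_statement(const std::string& lhs,
                                            const SymbolTable& symbols) {
    return classify(lhs, symbols) == TokenKind::TypeName
               ? "declaration"   // foo is a type: bar is a pointer to foo
               : "expression";   // foo is a value: multiplication
}

// Usage: the same spelling is classified differently once a typedef is seen.
inline void run_example() {
    SymbolTable symbols;
    assert(interpret_star_statement("foo", symbols) == "expression");
    symbols.declare_type("foo");  // e.g. after parsing `typedef int foo;`
    assert(interpret_star_statement("foo", symbols) == "declaration");
}
```

The point of the sketch is the dependency direction: the lexer cannot finish its job until the parser has processed earlier declarations.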
Jon Watte
The question is a bit bent, but the intent is clear. OP wants to know why C++ source code is hard to parse, and relatedly, why grammars for C++ are so complex.
The fundamental problem is that a committee has been jamming new features into the language for over 20 years and they won't stop, regardless of the ever-decreasing utility of new features. The consequences are that the grammar is huge (especially for C++11 and worse for C++14; C++17 is coming...), and ambiguous. Most parser generators don't handle ambiguous grammars, so the front ends that parse must use some kind of hacky solution to get around this (usually feedback between the symbol table and parser, whose very idea starts out by coupling mechanisms that shouldn't be coupled). C++ has lots of grammar cases that require arbitrary lookahead, which most parser generators don't handle either. A last resort is to code a parser by hand to avoid the troubles of using a parser generator, but that gets you the cost of not using a parser generator. The right cure for this is to use a strong parser generator that handles ambiguities; our DMS Software Reengineering Toolkit uses a GLR parser generator and completely decouples all the symbol table construction from parsing, and thus makes it pretty practical to "easily" parse C++.
The second problem is that having a bare parser really isn't useful; to analyze or manipulate C++ code, the tool has to know what the symbols mean, and how to reason about the code in various ways. For a long-winded answer, google my essay on "Life After Parsing". The short answer is you need a preprocessor, you need a symbol table, and you have to expand templates to fill in the symbol table. The preprocessor is just a complicated pain to implement because it has many dark corners. Building the symbol table means getting Koenig lookup right, and that's a mess compounded by the sheer number of cases induced by the grammar rules (check out lambdas!).
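For readers who have not met Koenig lookup (argument-dependent lookup), a minimal sketch of what the symbol table has to get right follows. The `geometry` namespace, `Point`, `manhattan` and `run_example` are invented for this example.

```cpp
#include <cassert>

namespace geometry {
struct Point { int x; int y; };

// Declared in the same namespace as Point, not in the caller's scope.
inline int manhattan(const Point& p) {
    return (p.x < 0 ? -p.x : p.x) + (p.y < 0 ? -p.y : p.y);
}
}  // namespace geometry

// Usage: note that this function lives outside namespace geometry.
inline void run_example() {
    geometry::Point p{3, -4};
    // No `geometry::` qualifier and no using-declaration: the call resolves
    // only because name lookup also searches the namespaces associated with
    // the argument types.
    int distance = manhattan(p);
    assert(distance == 7);
}
```

So resolving a plain function name requires knowing the types of all the arguments first, which in turn requires the rest of the symbol table to be correct.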
Finally, one has to do template expansion (and ideally, recently, constant expression evaluation). The effort to get all this right is easily 10x what it takes to get a raw parser, e.g., years plural of smart guys. GCC and Clang succeed by using huge open source teams on this, and, well, decades of time. Our DMS Software Reengineering Toolkit does this too; our team is just a few people but DMS is really, really good (IMHO, but draw your own conclusion) at supporting the implementation of complex language processors. (The proof is that we have it without the huge teams of GCC and Clang.)
The third problem is that deep reasoning requires control and data flow analysis of the code. There are 600 pages of C++ reference that tell you in highly convoluted ways how constructors, destructors, references, etc. are all evaluated under various strange circumstances. Getting this right is the same kind of effort as getting the symbol tables right. Again, this is technically just (a *lot* of) sweat. GCC, Clang, and DMS all do this. DMS does this for GCC and Microsoft C++ dialects. GCC and Clang only do it for GCC, as I understand them.
Finally, you'll actually want to *do* something with all this detail. GCC wants to be a compiler, and will resist your every attempt to make it do something else; some superhumans have succeeded. Clang, I hear, is better; it is really organized as a library of tools and has support for climbing over ASTs and generating code patches; I don't know how Clang integrates ASTs and LLVM (data flow) stuff. DMS was designed from the beginning to provide facilities to visit ASTs, symbol tables, and control and dataflow graphs; they are all interlinked in a "convenient" (IMHO) manner so you can navigate from one to the other and back easily. More importantly, DMS provides ways to build more complex analyzers via attribute grammars, BDDs, abstract interpretation frameworks and pattern matching using source-level patterns.
On the modification side, DMS allows one to modify ASTs procedurally or using source-to-source rewrites. After modifying a program, the modified AST can be prettyprinted, preserving indentation, comments, radix of number literals, etc.; that is, as directly compilable code. Clang has some support for prettyprinting; GCC has nothing at all.
The bottom line: you don't want to build your own C++ parser; it's literally a man-decades-of-effort exercise. Get one that already works.
Ira Baxter