Specifying a parser for a language grammar

In my previous post, I introduced some basic concepts for Mitopia’s parser abstraction. In this post I’d like to give a sample complete language definition, in this case for a C programming language expression evaluator. We will use this language grammar definition in later posts to discuss more complex concepts.

Firstly, we will need a lexical analyzer capable of recognizing all the necessary C language operators and operand types that might be involved in an expression or assignment. In an earlier post, I have already introduced the basic concepts of the Mitopia® lexical analysis package so I will not describe the lexical analyzer in further detail herein. Also it is assumed that the reader is familiar with, or has access to documentation on, the C language in order to assist them in interpreting the lexical analyzer specification (Calculator.LXS) below:

Mitopia Lexical Analyzer Specification for C expression evaluator

It should be obvious by comparing the recognizer specification above with the C language grammar syntax specification in any C text book that this recognizer can identify all potential tokens appearing in a C expression or assignment. Given this recognizer, we can now define a parser for C expressions/assignments (Calculator.BNF) that uses these token numbers as follows:

Mitopia Parser Specification for C expression evaluator

This parser specification utilizes the tokens from our lexical analyzer specification (both the ‘oneCat’ tokens like ‘++’ and the ‘catRange’ tokens like ‘<3:DecInt>’) to define the operator precedence and syntax for our expression and statement evaluator. Note that in this BNF, expressions can include parameterized function calls (as for C itself), the productions to handle this are the last four listed in the BNF. Obviously this allows even this simple parser to support a language involving an extensible function set.

For now we will ignore how these specifications are tied together with the code that will actually execute the calculator functionality and focus on the insights that can be gained into the Mitopia parser abstraction by use of this example grammar.

Referring to our example, this BNF gives a relatively complete description of the C language expression syntax together with enforcement of all operator precedence specified by ANSI. This BNF is sufficient to create a program to recognize and interpret C expressions and assignments.

You will notice that precedence order is specified simply by choosing the order in which one production leads to another with the lowest precedence grammar constructs/operators being refined through a series of productions into the higher precedence ones. Note also that many productions lead directly to themselves (e.g. more_statements ::= statement ; more_statements); this is the mechanism used to represent the fact that a list of similar constructs is permitted at this point. If you look in any C book, you will find a section describing the syntax of the language grammar either as syntax diagrams for a series of grammar productions similar to that above (ignoring Mitopia’s unique EBNF symbols for now). Most computer languages are specified in this way and to make a parser for an existing language using this package, all you essentially have to do is type in the grammar productions as they appear in the language specification. The way of specifying a grammar used above is a custom variant of the Backus-Naur Form. It is the oldest and easiest to understand means of describing a computer language. The symbols (enclosed in ‘<‘, ‘>’ delimiters) described in the table in my previous post, plus the ‘::=‘ symbol are referred to as meta-symbols, that is they are not part of the language, but they are part of the language grammar specification.

A production of the form (non_terminal ::= production_1 <or> production_2) means that there are two alternative constructs that ‘non-terminal‘ can be comprised of; they are ‘production_1‘ or ‘production_2‘. The grammar for a realistic programming language may contain hundreds of these productions, for example, the definition of Algol 60 contains 117. An LL(1) parser must be able to tell at any given time what production out of a series of productions is the right one, simply by looking at the current token in the input stream and the non-terminal that it currently has on the top of it’s parsing stack. This means, effectively, that the sets of all possible first tokens for each production appearing on the right hand side of any grammar production must not overlap; the parser must be able to look at the token in the input stream and tell which production on the right hand side is the ‘right one’.

The set of all terminals that might start any given non-terminal symbol in the language grammar is known as the FIRST set of that non-terminal. When designing a language to be processed by the parser abstraction, ensuring that these FIRST sets are not multiply defined is most of the work. In order to understand how to write productions acceptable to an LL(1) parser, (LL(1) stands for ‘Left-to-right, using Leftmost derivations, using at most 1 token lookahead’) we must understand recursion in a language grammar and the difference between left and right recursion in particular.

If the parser tries to use a production of the form ‘A ::= A anything‘ then it will fall into an infinite loop trying to expand A, this is known as left recursion. Left recursion may be more subtle, as in the pair of productions ‘S ::= X a b‘ and ‘X ::= S c d‘. Here the recursion is indirect; that is the parser expands ‘S‘ into ‘X a‘, then it subsequently expands ‘X‘ into ‘S c‘ which gets it back to trying to expand ‘S‘, thereby creating an infinite loop. This is known as indirect left recursion. You must eliminate all left recursion from your language grammar. You do this simply by replacing all productions of the form ‘A ::= A anything‘ (or indirect equivalents) by a set of productions of the form “A ::= t1 more_t1 <or> … tn more_tn” where t1..tn are the language tokens (or non-terminal grammar symbols) that start the various different forms of ‘A‘.

Recursion is usually used in grammars to express a list of things separated by some separator symbol (e.g. comma). This can be expressed either as “A ::= A , B” or “A ::= B , A“. The first form is left recursive and thus cannot be used; the second form is known as right recursive and is perfectly acceptable. The production “more_statements ::= statement ; more_statements” in the Calculator BNF is an example of a right recursive production.

A standard problem with top down parsers, in general, is that the order of the alternative productions is important in determining if the parser will accept the complete language or not. In the Mitopia® parser abstraction, we avoid this problem by requiring that the FIRST sets of all productions on the right hand side be non-overlapping. Thus in conventional BNF, it is permissible to write:

expression ::= element element + expression element * expression

To meet the requirements of PS_MakeDB() and of an LL(1) parser, we must reformulate this BNF statement into a pair of statements viz:

expression ::= element rest_of_expression

rest_of_expression ::= + expression * expression

As can be seen, the ‘element‘ token has been factored out of the two alternatives (a process known as left-factoring) in order to avoid the possibility of multiply defined FIRST sets. In addition, we have added a new symbol to the BNF meta-language, the <null> symbol. A <null> symbol is used to indicate to the parser generator that a particular grammar non-terminal is nullable, that is, it may not in fact be present at all in certain input streams. There are a large number of examples of the use of this technique in the Calculator BNF language grammar above.

If we wished to limit ourselves to LL(1) grammars then these are the only grammatical issues we would have to face. However, LL(1) grammars are very restrictive and the parser is capable of accepting a much larger set by the use of deliberate ambiguity. Consider the grammar:

operand ::= expression ( address_register )

This might commonly occur when specifying assembly language syntax. The problem is that this is not LL(1) since expression may itself start with a ‘(‘ token, or it may not, thus when processing operand, the parser may under certain circumstances need to look not at the first, but at the second token in the input stream to determine which alternative to take. Such a parser would be an LL(2) parser. We cannot solve the problem by factoring out the ‘(‘ token, as in the expression example above, because expressions do not have to start with a ‘(‘. Thus without extending the language beyond LL(1) we would be unable to handle this situation. Consider however the modified language grammar fragment:

operand ::= …. ( expr_or_indir expression

expr_or_indir ::= Aregister ) expression )

Here we have a production for operand which is deliberately ambiguous because it has a multiply defined first set since ‘(‘ is in FIRST of both of the last two alternatives. However, we have arranged the order of the alternatives such that the parser will take the “( expr_or_indir” production first and, should it fail to find an address register following the initial ‘(‘ token, the parser will then take the second production which correctly processes “expression )” since as stated before expression itself need not begin with a ‘(‘ token. If we allow this case, we have created the equivalent of a two token look-ahead in the parser, hence the language it can accept is now LL(2). An options parameter ‘kIgnoreAmbiguities‘ can be passed to PS_MakeDB() to cause it to accept grammars containing such FIRST set ambiguities, however, it can no longer verify the correctness of the language grammar and it is the responsibility of the language definer to ensure that the first production can always be reduced to the second when such a grammatical trick is used. It is best not to use the ‘kInoreAmbiguities‘ parameter unless one is certain of what they are doing.

We are not quite done yet because language grammars can get considerably nastier than LL(2). To continue a theme, consider the problem of parsing the complete set of 68K assembly language addressing modes, or more particularly the absolute, indirect, pre-decrement and post-increment addressing modes. The absolute and indirect syntax was presented above, however the pre-decrement addressing mode adds the form “-(Aregister)“, while the post-increment adds the form “(Aregister)+“. We would need an LL(3) parser (and some fancy grammar productions) to handle the pre-decrement mode, since the parser cannot positively identify the pre-decrement mode until it has consumed both the leading ‘–‘ and ‘(‘ tokens in the input stream. An LL(4) parser is necessary to recognize the post-increment form.

It is true that if all we were trying to do was recognize valid assembly syntax, we could just left-factor out the “( Aregister )” for the post-increment form, but we want to actually perform some useful function with the parser and this is accomplished by inserting reverse polish plug-in operator calls of the form <@n:m[:hint text]> into the grammar (see the Calculator EBNF above) and later posts. Whenever the parser exposes such a meta-symbol on the top of the parsing stack, it calls the registered plugin function in order to accomplish some sort of semantic action or processing. Given the fact that we might wish to call a different plug-in to handle each of the different 68K addressing modes, we must ensure that we do not call the wrong one before we are absolutely sure that we know what addressing mode we are dealing with.

So, we must extend the parser language set to be LL(n) where ‘n’ could be quite large. One mechanism used to do this is to provide explicit control of limited parser back-up capabilities by adding the <bkup> meta-symbol. Backing up a parser is a lot more complex than it might sound since the parsing stack must be repaired and the lexical analyzer backed-up to an earlier point in the token stream in order to try an alternative production. Nonetheless, the PS_Parse() parser is capable of limited backup within a single input line by use of the flag. Consider the modified grammar fragment:

operand ::= … ( Aregister <bkup> areg_indirect abs_or_displ …

abs_or_displ ::= – ( ARegister ) <@1:1> expression <@1:2>

areg_indirect ::= ) opt_postinc

opt_postinc ::= <@1:3> + <@1:4>

Reverse polish operators are discussed in later posts, but let us assume that <@1:1> is the handler for the pre-decrement mode, <@1:2> for the absolute mode, <@1:3> for the indirect mode, and <@1:4> for the post-increment mode. Now when the parser encounters a ‘(‘ token it will push on the “( Aregister areg_indirect” production. As it does so, it notices the presence of the symbol in the production being pushed and it saves it’s own state as well as that of the input lexical analyzer. Parsing continues and the ‘(‘ is accepted. Now let us assume that the input was actually an expression so when the parser tries to match the ‘ARegister‘ terminal that is now on the top of it’s parsing stack, it fails. Without the backup flag, this is considered a syntax error and the parser aborts; however because the parser has a saved state, it instead restores the parser and lexical analyzer state to that which existed at the time it first encountered the ‘(‘ symbol, it then causes the production that immediately follows the one containing the <bkup> flag to be selected in preference to the original. Since the lexical analyzer has also been backed up, the first token processed is once again ‘(‘ and parsing proceeds normally through “abs_or_displ” to “expression” and finally to invocation of plug-in <@1:2> as appropriate for the absolute mode. Note that a similar but slightly different sequence is caused by the flag in the first production for “abs_or_displ” and that in all cases, the plug-in that is appropriate to the addressing mode encountered will be invoked and no other.

Thus by this use of explicit ambiguity plus controlled parser backup, we have created a parser capable of recognizing languages from a set of grammars that is considerably larger than those normally associated with predictive parsing techniques. Indeed the set is sufficiently large that it can probably handle any practical computer programming language. By judicious use of the plug-in and resolver architectures (see later posts), this language set can be extended even further and can also include grammars that are not context-free (e.g. English) that cannot be handled by conventional predictive parsers. Once again, be warned if your language is LL(1), then the parser generator will automatically check correctness of the grammar. However once you start using the techniques above, it is up to you to ensure that your grammar works as you expect.

One more concept that must be introduced in order to understand how to build grammars for parsers is the concept of a FOLLOW set. For any non-terminal grammar symbol X, FOLLOW(X) is the set of terminal symbols that can appear immediately to the right of X in some sentential form. In other words, it is the set of things that may come immediately after that grammar symbol wherever it can occur in the source text. For example, by looking at the Calculator grammar above we can see that FOLLOW(expression) contains just three symbols, namely ‘)’, ‘;’, and ‘,’.

To see why, we look at everywhere expression occurs in the BNF. In the line “primary ::= … ( expression )” it is followed by ‘)’, in the lines (“parameter_list := … expression rstof_param_list” and “rstof_param_list ::= , expression rstof_param_list”) we see that expression may be followed by another rstof_param_list which begins with ‘,’ and finally since expression may end a statement, and statement is followed by ‘;’, we come up with FOLLOW(expression) = ‘)’ ‘;’ ‘,’.

To build a predictive parser table PS_MakeDB() must compute not only the FIRST set of all non-terminals (which determines what to PUSH onto the parsing stack), but also the FOLLOW sets (which determines when to POP the parsing stack and move to a higher level production). If the FOLLOW sets are not correct the parser will never pop it’s stack and eventually it will die. So for this reason, unlike for FIRST sets, ambiguity in the FOLLOW sets is not allowed. What this means is that for any situation in your grammar, the parser must be able to tell when it is done with a production by looking at the next token in the input stream (i.e., the first token of the next production). This is usually much easier to arrange than is unique FIRST sets. PS_MakeDB() will reject any grammar containing ambiguous FOLLOW sets.

This post discusses some very simple parser concepts, we will explore the Mitopia® parser abstraction further in later posts.