Brief Notes on Parsing



Parsing is the process of determining how a string can be derived using the productions (rewrite rules) of a given grammar. It can be used to check whether or not a string belongs to a given language. When a statement written in a programming language is input, a compiler parses it to check whether or not it is syntactically correct and to extract its components if it is correct. Finding an efficient parser is a nontrivial problem, and a great deal of research has been conducted on parser design.

In this note, basic parsing techniques are introduced with examples, and some of the problems involved in parsing are discussed together with brief explanations of some of the solutions to those problems.

Two basic approaches to parsing are top-down parsing and bottom-up parsing. In the top-down approach, a parser tries to derive the given string from the start symbol by rewriting nonterminals one by one using productions. The nonterminal on the left hand side of a production is replaced by its right hand side in the string being parsed. In the bottom-up approach, a parser tries to reduce the given string to the start symbol step by step using productions. The right hand side of a production found in the string being parsed is replaced by its left hand side.

Let us see how string aababaa might be parsed by these two approaches for the following grammar as an example:

$S \rightarrow aSa \mid bSb \mid a \mid b$

This grammar generates the palindromes of odd lengths.

The top-down approach proceeds as follows:
(1) The start symbol $S$ is pushed into the stack without reading any input symbol.
(2) $S$ is popped and $aSa$ is pushed without reading any input symbol.
(3) As the first a in the input is read, a at the top of the stack is popped.
(4) $S$ is popped and $aSa$ is pushed without reading any input symbol.
(5) As the second a in the input is read, a at the top of the stack is popped.
(6) $S$ is popped and $bSb$ is pushed without reading any input symbol.
(7) As the first b in the input is read, b at the top of the stack is popped.
(8) $S$ is popped and $a$ is pushed without reading any input symbol.
(9) As the remaining input symbols are read, the symbols abaa in the stack are popped one by one.
Since the stack is empty when the entire input string is read, the string is found to be in the language.

If we use configurations without states, that is, pairs of the form (unread portion of input, stack contents), this top-down parsing can be expressed as follows:

$(aababaa, Z_{0}) \vdash (aababaa, SZ_{0}) \vdash (aababaa, aSaZ_{0}) \vdash (ababaa, SaZ_{0}) \vdash$
$(ababaa, aSaaZ_{0}) \vdash (babaa, SaaZ_{0}) \vdash (babaa, bSbaaZ_{0}) \vdash (abaa, SbaaZ_{0}) \vdash$
$(abaa, abaaZ_{0}) \vdash (baa, baaZ_{0}) \vdash (aa, aaZ_{0}) \vdash (a, aZ_{0}) \vdash (\Lambda, Z_{0})$

In general a PDA for top-down parsing has the following four types of transitions: (a) For each production, pop the nonterminal on the left hand side of the production at the top of the stack and push its right hand side string;
(b) Pop the stack if the top of the stack matches the input symbol being read;
(c) Initially push the start symbol into the stack;
(d) Go to the final state if the entire input has been read and the stack is empty.
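The four transition types above can be simulated with a simple backtracking search. The following is a minimal sketch, not part of the original note; the grammar encoding and function names are illustrative assumptions. The stack is a string whose first character is the top, mirroring the configurations used below.

```python
# Nondeterministic top-down parsing for S -> aSa | bSb | a | b,
# simulated by backtracking over all applicable productions.
GRAMMAR = {"S": ["aSa", "bSb", "a", "b"]}

def top_down(stack, s, grammar=GRAMMAR):
    """Return True if the stack contents can be matched against the remaining input s."""
    if not stack:
        return not s                      # accept: empty stack and empty input
    top, rest = stack[0], stack[1:]
    if top in grammar:                    # expand: try every production for the nonterminal
        return any(top_down(rhs + rest, s) for rhs in grammar[top])
    # match: pop the terminal if it agrees with the next input symbol
    return bool(s) and s[0] == top and top_down(rest, s[1:])

def parse(s):
    return top_down("S", s)               # initially push the start symbol
```

Trying every production and backtracking on failure is exponential in the worst case; it is meant only to make the transition types concrete.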

The bottom-up approach proceeds as follows:
(1) The symbols of the string aababaa are read and pushed into the stack one by one from the left until the middle a is pushed.
(2) The middle a is replaced by $S$ at the top of the stack without reading any input symbol.
(3) The second b is read and pushed into the stack.
(4) $bSb$ at the top of the stack is replaced by $S$.
(5) The fourth a is read and pushed into the stack.
(6) $aSa$ at the top of the stack is replaced by $S$.
(7) The last a is read and pushed into the stack.
(8) $aSa$ at the top of the stack is replaced by $S$.
Since the stack has $S$ when the entire input string is read, the string is found to be in the language.

If we use configurations without states, this bottom-up parsing can be expressed as follows:

$(aababaa, Z_{0}) \vdash (ababaa, aZ_{0}) \vdash (babaa, aaZ_{0}) \vdash (abaa, baaZ_{0}) \vdash$
$(baa, abaaZ_{0}) \vdash (baa, SbaaZ_{0}) \vdash (aa, bSbaaZ_{0}) \vdash (aa, SaaZ_{0}) \vdash$
$(a, aSaaZ_{0}) \vdash (a, SaZ_{0}) \vdash (\Lambda, aSaZ_{0}) \vdash (\Lambda, SZ_{0})$

Note that the rightmost symbol on the right hand side of a production appears at the top of the stack.

In general a PDA for bottom-up parsing has the following three types of transitions: (a) Push the input symbol being read into the stack -- this is called shift;
(b) Replace the right hand side of a production at the top of the stack with its left hand side -- this is called reduce;
(c) If the entire input has been read and only the start symbol is in the stack, then pop the stack and go to the final state -- this is called accept.
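The shift and reduce moves can likewise be simulated by backtracking. The following is a minimal sketch under the same illustrative assumptions as before; here the top of the stack is the right end of the string, so the rightmost symbol of a handle sits on top.

```python
# Nondeterministic shift-reduce (bottom-up) parsing for S -> aSa | bSb | a | b,
# simulated by backtracking over all applicable moves.
PRODUCTIONS = [("S", "aSa"), ("S", "bSb"), ("S", "a"), ("S", "b")]

def shift_reduce(stack, s):
    """stack holds grammar symbols with the top at the right end."""
    if stack == "S" and not s:
        return True                       # accept: only the start symbol remains
    # reduce: if the right hand side of a production sits on top of the stack,
    # replace it with the left hand side
    for lhs, rhs in PRODUCTIONS:
        if stack.endswith(rhs) and shift_reduce(stack[: -len(rhs)] + lhs, s):
            return True
    # shift: push the next input symbol
    return bool(s) and shift_reduce(stack + s[0], s[1:])

def parse(s):
    return shift_reduce("", s)
```

As with the top-down sketch, the search may try many wrong reductions before finding the right ones; real shift-reduce parsers avoid this by consulting lookahead tables.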

The structure of a derivation of a string can be represented by a tree called a parse tree or a derivation tree. A parse tree has the start symbol at its root. Its internal nodes correspond to the nonterminals that appear in the derivation. The children of a node are the symbols appearing on the right hand side of the production used to rewrite the nonterminal corresponding to that node in the derivation. For example the following figure shows the parse tree of the string aababaa of the above example.

[Figure: parse tree for aababaa -- the root S has children a, S, a; that S has children a, S, a; the next S has children b, S, b; the innermost S has the single child a.]

The top-down parsing traverses this tree from the root down to the leaves, while the bottom-up parsing goes from the leaves up to the root.
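A parse tree can also be built programmatically. The following is a minimal sketch (the grammar encoding, function names, and nested-tuple tree representation are assumptions for illustration) of a backtracking parser that returns the tree instead of a yes/no answer:

```python
# Derive a string from a symbol and return its parse tree as nested tuples:
# a nonterminal node is (symbol, [children]); a terminal is just its character.
GRAMMAR = {"S": ["aSa", "bSb", "a", "b"]}

def derive(sym, s):
    """Try to derive exactly the string s from the single symbol sym."""
    if sym not in GRAMMAR:                 # terminal: must match s exactly
        return sym if s == sym else None
    for rhs in GRAMMAR[sym]:
        children = split_match(rhs, s)
        if children is not None:
            return (sym, children)
    return None

def split_match(rhs, s):
    """Split s among the symbols of rhs, returning the list of subtrees."""
    if not rhs:
        return [] if not s else None
    first, rest = rhs[0], rhs[1:]
    for i in range(len(s) + 1):            # try every split point
        tree = derive(first, s[:i])
        if tree is not None:
            tail = split_match(rest, s[i:])
            if tail is not None:
                return [tree] + tail
    return None
```

For the string aba, for instance, `derive("S", "aba")` yields the tree `("S", ["a", ("S", ["b"]), "a"])`.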

Example 2 for top-down and bottom-up parsing:

Given the grammar $S \rightarrow S + X \mid X$; $X \rightarrow X * Y \mid Y$; $Y \rightarrow (S) \mid id$, let us parse the expression a + b*c, where a, b and c are identifiers (instances of $id$).

Top-down parsing:
$(a + b*c, Z_{0}) \vdash (a + b*c, SZ_{0}) \vdash (a + b*c, S + XZ_{0}) \vdash (a + b*c, X + XZ_{0}) \vdash $
$(a + b*c, Y + XZ_{0}) \vdash (a+b*c, a+XZ_{0}) \vdash (+b*c, +XZ_{0}) \vdash (b*c, XZ_{0}) \vdash $
$(b*c, X*YZ_{0}) \vdash (b*c, Y*YZ_{0}) \vdash (b*c, b*YZ_{0}) \vdash (*c, *YZ_{0}) \vdash (c, YZ_{0}) \vdash $
$(c, cZ_{0}) \vdash (\Lambda, Z_{0})$

Bottom-up parsing:
$(a + b*c, Z_{0}) \vdash (+ b*c, aZ_{0}) \vdash (+ b*c, YZ_{0}) \vdash (+ b*c, XZ_{0}) \vdash $
$(+ b*c, SZ_{0}) \vdash (b*c, +SZ_{0}) \vdash (*c, b+SZ_{0}) \vdash (*c, Y+SZ_{0}) \vdash $
$(*c, X+SZ_{0}) \vdash (c, *X+SZ_{0}) \vdash (\Lambda, c*X+SZ_{0}) \vdash (\Lambda, Y*X+SZ_{0}) \vdash $
$(\Lambda, X+SZ_{0}) \vdash (\Lambda, SZ_{0})$

Note again that the rightmost symbol on the right hand side of a production appears at the top of the stack.
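For this particular grammar the nondeterminism can be avoided entirely: the left-recursive productions can be read as iteration ($S$ is an $X$ followed by zero or more "+ X" parts, and similarly for $X$). The following is a minimal recursive-descent sketch; the token representation (a list of one-character strings) and all function names are illustrative assumptions.

```python
# Deterministic recursive-descent parsing for
#   S -> S + X | X ;  X -> X * Y | Y ;  Y -> ( S ) | id
# with left recursion handled by iteration.
def parse_expr(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, got {peek()!r}")
        pos += 1

    def parse_S():                 # S -> X ('+' X)*
        parse_X()
        while peek() == "+":
            eat("+")
            parse_X()

    def parse_X():                 # X -> Y ('*' Y)*
        parse_Y()
        while peek() == "*":
            eat("*")
            parse_Y()

    def parse_Y():                 # Y -> ( S ) | id
        if peek() == "(":
            eat("(")
            parse_S()
            eat(")")
        elif peek() is not None and peek().isalnum():
            eat(peek())            # treat any alphanumeric token as an id
        else:
            raise SyntaxError(f"unexpected {peek()!r}")

    parse_S()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return True
```

For example, `parse_expr(list("a+b*c"))` succeeds, while `parse_expr(list("a+"))` raises a SyntaxError.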

Difficulties in parsing

The main difficulty in parsing is nondeterminism. That is, at some point in the derivation of a string more than one production may be applicable, though not all of them lead to the desired string, and one cannot tell which one to use until after the entire string is generated. For example in the parsing of aababaa discussed above, when $S$ is at the top of the stack and a is read in the top-down parsing, there are two applicable productions, namely $S \rightarrow aSa$ and $S \rightarrow a$. However, it is not possible to decide which one to choose using only the input symbol being read and the top of the stack. Similarly for the bottom-up parsing, it is impossible to tell when to apply the production $S \rightarrow a$ with the same information. Some of these nondeterminisms are due to the particular grammar being used and can be removed by transforming the grammar into an equivalent grammar, while others are inherent in the language itself. Below several of these difficulties are briefly discussed.

Factoring:
Consider the following grammar:

$S \rightarrow T$; $T \rightarrow aTb \mid abT \mid ab$.

With this grammar, when a string is parsed top-down, after $S$ is replaced by $T$ in the first step there is no easy way of telling which production to use to rewrite $T$ next, since all three productions for $T$ begin with a. However, if we change this to the following equivalent grammar, this nondeterminism disappears:

$S \rightarrow aU$; $U \rightarrow Sb \mid bT$; $T \rightarrow S \mid \Lambda$.

This transformation operation is called factoring, since the leading a common to the right hand sides of the productions for $T$ in the original grammar is factored out, as seen in the new grammar.
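The simplest form of this operation, factoring a common prefix out of the alternatives for one nonterminal, can be mechanized. The following is a minimal sketch; the grammar encoding (a dict mapping each nonterminal to a list of right-hand sides, each a tuple of symbols) and the function name are assumptions for illustration.

```python
# One left-factoring step: the longest common prefix of a nonterminal's
# alternatives is factored out into a fresh nonterminal.
def left_factor(grammar, nt, fresh):
    alts = grammar[nt]
    prefix = []
    for syms in zip(*alts):               # walk the alternatives in lockstep
        if len(set(syms)) != 1:           # stop at the first disagreement
            break
        prefix.append(syms[0])
    if not prefix:
        return grammar                    # nothing to factor
    new = dict(grammar)
    new[nt] = [tuple(prefix) + (fresh,)]
    # the remainders become the fresh nonterminal's alternatives;
    # an empty tuple stands for Lambda
    new[fresh] = [alt[len(prefix):] for alt in alts]
    return new
```

Applied to $T \rightarrow aTb \mid abT \mid ab$, this factors out the a, giving $T \rightarrow aT'$ with $T' \rightarrow Tb \mid bT \mid b$; repeating the step on the new alternatives leads toward a fully factored grammar like the one shown above.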

Left-recursion:
Consider the following grammar:

$S \rightarrow Sa \mid Sb \mid a$.

When a string, say aaba, is parsed top-down with this grammar, after $S$ is pushed into the stack it needs to be replaced by the right hand side of some production. However, there is no simple way of telling which production to use, and a parser may go into an infinite loop, especially if it is given an illegal string (a string which is not in the language). This kind of grammar is called left-recursive. Left-recursion can be removed by replacing left-recursive pairs of productions with new productions as follows:

If $X \rightarrow X\alpha_{1} \mid X\alpha_{2} \mid \beta_{1} \mid \beta_{2}$ are left-recursive productions, where $\beta$'s don't start with $X$, then replace them with $X \rightarrow \beta_{1}X' \mid \beta_{2}X'$ and $X' \rightarrow \alpha_{1}X' \mid \alpha_{2}X' \mid \Lambda$.

For example the left-recursive grammar given above can be transformed to the following non-left-recursive grammar:

$S \rightarrow aS'$; $S' \rightarrow aS' \mid bS' \mid \Lambda$.
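The removal rule above can be sketched as a small function. The grammar encoding (a dict mapping each nonterminal to a list of right-hand sides, each a tuple of symbols, with the empty tuple standing for $\Lambda$) and the function name are assumptions for illustration.

```python
# Remove immediate left recursion for one nonterminal:
#   X -> X a1 | X a2 | b1 | b2   becomes
#   X -> b1 X' | b2 X'   and   X' -> a1 X' | a2 X' | Lambda.
def remove_left_recursion(grammar, nt, fresh):
    rec = [alt[1:] for alt in grammar[nt] if alt and alt[0] == nt]   # the alphas
    non = [alt for alt in grammar[nt] if not alt or alt[0] != nt]    # the betas
    if not rec:
        return grammar                     # nothing to do
    new = dict(grammar)
    new[nt] = [beta + (fresh,) for beta in non]
    new[fresh] = [alpha + (fresh,) for alpha in rec] + [()]          # () is Lambda
    return new
```

Applied to $S \rightarrow Sa \mid Sb \mid a$, it produces exactly the transformed grammar shown above.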

Ambiguous grammar:
A context-free grammar is called ambiguous if there is at least one string that has two or more distinct derivations (or, equivalently, parse trees). For example, the grammar $S \rightarrow S + S \mid S*S \mid (S) \mid id$, where $id$ represents an identifier, produces the following two derivations for the expression $x + y*z$:

$S \Rightarrow S + S \Rightarrow id + S \Rightarrow id + S * S \Rightarrow id + id * S
\Rightarrow id + id * id$, which corresponds to $x + (y*z)$ and
$S \Rightarrow S * S \Rightarrow S+ S * S \Rightarrow id + S * S \Rightarrow id + id * S
\Rightarrow id + id * id$, which corresponds to $(x + y)*z$.

Though some context-free languages are inherently ambiguous and no unambiguous grammars can be constructed for them, it is often possible to construct unambiguous context-free grammars for unambiguous context-free languages. For example, for the language of algebraic expressions given above, the following grammar is unambiguous:

$S \rightarrow S + X \mid X$; $X \rightarrow X * Y \mid Y$; $Y \rightarrow (S) \mid id$.

Nondeterministic language:
Lastly, there are context-free languages that cannot be parsed by any deterministic PDA; these languages need nondeterministic PDAs, and hence guesswork is necessary in selecting the right production at certain steps of their derivation. For example, take the language of palindromes. When parsing strings of this language, the middle of a given string must be identified, but it can be shown that no deterministic PDA can do that.