Context-Free Grammars and Languages

CS390, Spring 2024

Last modified: Jan 3, 2023

Abstract

The context-free languages are a larger set of languages than the regular languages that we have been studying so far. Context-free languages are particularly important because most programming languages are context-free (or approximately so).

Context-free languages are described by grammars. In this module, we will examine grammars, how they can be used to generate the strings of a language, and how they can be used to parse a string to see if it is in the language.

1 Context-Free Grammars

A context-free grammar (CFG) is a way of describing a language drawn from a useful set of languages called the context-free languages (CFLs).

CFLs are useful for describing “nested” structures (e.g., expressions with properly balanced parentheses) that occur commonly in programming languages but are known to be not regular.

Definition: A context-free grammar $G = (V,T,P,S)$ is composed of

  • V : a set of variables (also known as non-terminals), each denoting a set of strings.

  • T : a set of terminal symbols (“terminals” for short) that constitutes the alphabet over which the strings in the language are composed.

  • P : a set of productions, rules that recursively define the structure of the language.

    A production has the form $A \rightarrow \alpha$ where

    • $A$ is a variable (one of the symbols in $V$).
    • $\alpha$ is a string of zero or more symbols that may be either terminals or variables.
  • S : a starting symbol. This is a variable that denotes the set of strings comprising the entire language.
Example 1: CFG for $\{ 0^n1^n \mid n \geq 1 \}$

\[ V = \{ S \} \]

\[ T = \{0, 1\} \]

Productions (P):

\[ \begin{align} S &\rightarrow 01 \\ S &\rightarrow 0S1 \\ \end{align} \]

Start symbol: $S$

(More often than not, we will only write out the productions. By convention, the upper-case letters in the productions are the variables, the remaining characters are the terminals, and the variable on the left-hand side of the first production is the starting symbol.)
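To make the four components concrete, here is one way the grammar of Example 1 might be written down as plain data in Python. This is just a sketch; representing each production as a pair (left-hand side, tuple of right-hand-side symbols) is one convenient choice, not part of the formal definition.

```python
# The grammar G = (V, T, P, S) for { 0^n 1^n | n >= 1 } as plain data.
# Sentential forms are tuples of symbols, so variables and terminals
# can be told apart by membership in V or T.

V = {"S"}                      # variables (non-terminals)
T = {"0", "1"}                 # terminals
P = [                          # productions A -> alpha, stored as (A, alpha)
    ("S", ("0", "1")),         # S -> 01
    ("S", ("0", "S", "1")),    # S -> 0S1
]
S = "S"                        # start symbol
```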

1.1 Deriving a String from a Grammar

We derive a string from a grammar as follows:

  1. Begin with a string consisting only of the start symbol.
  2. Pick any occurrence of a variable in the current string.
  3. Pick any production that has that variable on the left of the $\rightarrow$.
  4. Replace the chosen occurrence of that variable by the right-hand side of the chosen production.
  5. Repeat steps 2-4 until the string contains only terminals.
Example 2: Deriving 000111

From the grammar of the previous Example,

String      Choices
$S$         1) Begin with the starting symbol.
            2) Choose a variable: it has to be $S$.
            3) Choose a production: we’ll take $S \rightarrow 0S1$.
$0S1$       4) Replace $S$ by $0S1$.
            2) Choose a variable: it has to be $S$.
            3) Choose a production: we’ll take $S \rightarrow 0S1$.
$00S11$     4) Replace $S$ by $0S1$.
            2) Choose a variable: it has to be $S$.
            3) Choose a production: we’ll take $S \rightarrow 01$.
$000111$    4) Replace $S$ by $01$.

One thing that becomes clear from this example is the recursive nature of even this simple grammar. The production $S \rightarrow 0S1$ expands $S$ into a longer string that still includes $S$. That is very similar to a recursive call in a programming language or an induction step in a proof. The production $S \rightarrow 01$ plays the role of the base case in the recursion, allowing a description of $S$ that does not rely upon itself.
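The five-step procedure translates almost directly into code. Below is a minimal Python sketch of step 4 (using the tuple-of-symbols representation from the sketch above), followed by a replay of Example 2’s choices:

```python
def apply_production(string, occurrence, production, variables):
    """Replace the `occurrence`-th variable occurrence in `string`
    (a tuple of symbols) with the right-hand side of `production`."""
    head, alpha = production
    seen = -1
    for i, symbol in enumerate(string):
        if symbol in variables:
            seen += 1
            if seen == occurrence:
                assert symbol == head, "production must rewrite this variable"
                return string[:i] + alpha + string[i + 1:]
    raise ValueError("no such variable occurrence")

# Example 2: S => 0S1 => 00S11 => 000111.
s = ("S",)
s = apply_production(s, 0, ("S", ("0", "S", "1")), {"S"})   # S -> 0S1
s = apply_production(s, 0, ("S", ("0", "S", "1")), {"S"})   # S -> 0S1
s = apply_production(s, 0, ("S", ("0", "1")), {"S"})        # S -> 01
print("".join(s))                                           # prints 000111
```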

In a more compact form, we use $\Rightarrow$ to mean “derives in one step”.


Definition: We say that $\alpha A \beta \Rightarrow \alpha \gamma \beta$ ($\alpha A \beta$ derives $\alpha \gamma \beta$ in one step) if $A \rightarrow \gamma$ is a production.

We also introduce $\Rightarrow^*$ to mean “derives in zero or more steps”.

Example 3: Deriving 000111 Again

The $\Rightarrow$ operator allows us to express a derivation in a much more compact format:

\[ S \Rightarrow 0S1 \Rightarrow 00S11 \Rightarrow 000111 \]

This does, though, leave it up to the reader to verify that each derivation step really does correspond to a production in the grammar.

We can also read the above as a proof that $S \Rightarrow^* 000111$.
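This definition is easy to check mechanically. As a sketch (same tuple representation as the earlier sketches): a single step holds exactly when some variable occurrence, rewritten by some matching production, turns the first string into the second.

```python
def derives_in_one_step(before, after, productions, variables):
    """True iff `before` => `after`: writing before = alpha A beta,
    `after` must equal alpha gamma beta for some production A -> gamma."""
    for i, symbol in enumerate(before):
        if symbol not in variables:
            continue
        for head, gamma in productions:
            if head == symbol and before[:i] + gamma + before[i + 1:] == after:
                return True
    return False

P = [("S", ("0", "1")), ("S", ("0", "S", "1"))]
# The steps of Example 3 check out:
assert derives_in_one_step(("S",), ("0", "S", "1"), P, {"S"})
assert derives_in_one_step(("0", "S", "1"), ("0", "0", "S", "1", "1"), P, {"S"})
assert derives_in_one_step(("0", "0", "S", "1", "1"),
                           ("0", "0", "0", "1", "1", "1"), P, {"S"})
```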

1.2 A More Complicated Example

Your text gave an example of a grammar for simple expressions:

\[ \begin{align} E &\rightarrow I \\ E &\rightarrow E + E \\ E &\rightarrow E * E \\ E &\rightarrow (E) \\ I &\rightarrow a \\ I &\rightarrow b \\ I &\rightarrow Ia \\ I &\rightarrow Ib \\ I &\rightarrow I0 \\ I &\rightarrow I1 \\ \end{align} \]

which describes expressions like ‘a1 + b1 $*$ (abab + baba0)’.

What are the terminals in this grammar?
What are the variables in this grammar and, what, intuitively, do they stand for?
What would the start symbol for this grammar be?

More formally, remember that I said that each variable represents a set of strings? We regard each variable as the name of the set of strings that can be derived from that variable.

So $I$ in this grammar stands for the strings $\{ a, b, a0, a1, b0, b1, aa, ab, aa0, aa1, ab0, ab1, \ldots\}$, the “identifiers” in this language.

$E$ stands for all of the strings that we would regard as expressions formed by combining one or more identifiers from $I$ using plus, multiplication, and parentheses.
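To make the “a variable names a set of strings” view concrete, here is a small breadth-first search (a sketch, reusing the pair representation of productions from the earlier sketches) that expands variables in every possible way and collects the all-terminal strings up to a length bound. Run on the productions for $I$ alone, it reproduces the start of the identifier set listed above:

```python
from collections import deque

def strings_of(variable, productions, variables, max_len):
    """All terminal strings of length <= max_len derivable from `variable`.
    Bounding by length is safe here because no production shrinks a string."""
    results, seen = set(), set()
    queue = deque([(variable,)])
    while queue:
        form = queue.popleft()
        if len(form) > max_len or form in seen:
            continue
        seen.add(form)
        idx = next((i for i, s in enumerate(form) if s in variables), None)
        if idx is None:                      # all terminals: a string of the set
            results.add("".join(form))
            continue
        for head, gamma in productions:
            if head == form[idx]:
                queue.append(form[:idx] + gamma + form[idx + 1:])
    return results

P_I = [("I", ("a",)), ("I", ("b",)), ("I", ("I", "a")),
       ("I", ("I", "b")), ("I", ("I", "0")), ("I", ("I", "1"))]
print(sorted(strings_of("I", P_I, {"I"}, 2)))
# ['a', 'a0', 'a1', 'aa', 'ab', 'b', 'b0', 'b1', 'ba', 'bb']
```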

All in all, this isn’t a terribly complicated grammar. But the derivations to produce even a simple expression can be pretty lengthy:

Example 4: Deriving 'a0 + ab $*$ (a + b1)'

\[ \begin{align} E &\Rightarrow E + E \Rightarrow E + E * E \Rightarrow E + E * (E) \\ &\Rightarrow I + E * (E) \Rightarrow I0 + E * (E) \Rightarrow a0 + E * (E) \\ &\Rightarrow a0 + I * (E) \Rightarrow a0 + Ib * (E) \Rightarrow a0 + ab * (E) \\ &\Rightarrow a0 + ab * (E + E) \Rightarrow a0 + ab * (I + E) \Rightarrow a0 + ab * (I + I) \\ &\Rightarrow a0 + ab * (a + I) \Rightarrow a0 + ab * (a + I1) \Rightarrow a0 + ab * (a + b1) \end{align} \]

It should be pretty obvious that we could expand that grammar to include other common operators such as subtraction and division. Of course, real programming languages would also add the relational operators $>, <, \geq, \leq, =, \neq$ as well as boolean operators for and, or, and not, and maybe some others as well.

On the other hand, when discussing real programming languages, we seldom take things all the way down to the character level. There’s no rule that says that each symbol in our alphabet has to be written as a single glyph or blob of ink on the page. I could define a grammar like this


\[ \begin{align} \mbox{Expr} &\rightarrow \mbox{id} \\ \mbox{Expr} &\rightarrow \mbox{Expr} + \mbox{Expr} \\ \mbox{Expr} &\rightarrow \mbox{Expr} * \mbox{Expr} \\ \mbox{Expr} &\rightarrow (\mbox{Expr}) \\ \end{align} \]

In this case, “Expr” and “id” are each a single symbol in our grammar. They just happen to be symbols that we write using multiple unconnected blobs of ink.

Generally, we do this with the understanding that terminals like id above name a class of tokens, and that these tokens are recognized separately using a regular-expression-based scanner:

\[ \mbox{id} \equiv (a | b)(a | b | 0 | 1)^* \]

or, for a “real” programming language, in a Unix-style regular expression notation:

\[ \mbox{id} \equiv [\_A-Za-z]\,[\_A-Za-z0-9]* \]
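As a tiny illustration of that division of labor (plain Python with the standard `re` module, not any particular scanner generator’s API), the id token class can be recognized with exactly the regular expression just given:

```python
import re

# The Unix-style token definition from above, verbatim.
ID = re.compile(r"[_A-Za-z][_A-Za-z0-9]*")

for candidate in ("a1", "total_count", "_tmp", "0abc"):
    print(candidate, bool(ID.fullmatch(candidate)))
# a1 True, total_count True, _tmp True, 0abc False
```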

1.3 Leftmost and Rightmost Derivations

The definition/procedure given above for deriving a string from a grammar gives us a lot of flexibility in choosing which variable to expand and which production to apply. Sometimes we will want to discuss a more restricted form:

We do a leftmost derivation of a string from a grammar as follows:

  1. Begin with a string consisting only of the start symbol.
  2. Pick the leftmost variable occurrence in the current string.
  3. Pick any production that has that variable on the left of the $\rightarrow$.
  4. Replace the chosen occurrence of that variable by the right-hand side of the chosen production.
  5. Repeat steps 2-4 until the string contains only terminals.

We do a rightmost derivation of a string from a grammar as follows:

  1. Begin with a string consisting only of the start symbol.
  2. Pick the rightmost variable occurrence in the current string.
  3. Pick any production that has that variable on the left of the $\rightarrow$.
  4. Replace the chosen occurrence of that variable by the right-hand side of the chosen production.
  5. Repeat steps 2-4 until the string contains only terminals.
Example 5: Leftmost derivation of 'a0 + ab $*$ (a + b1)'

For the grammar with productions

\[ \begin{align} E &\rightarrow I \\ E &\rightarrow E + E \\ E &\rightarrow E * E \\ E &\rightarrow (E) \\ I &\rightarrow a \\ I &\rightarrow b \\ I &\rightarrow Ia \\ I &\rightarrow Ib \\ I &\rightarrow I0 \\ I &\rightarrow I1 \\ \end{align} \]

we previously obtained this derivation

\[ \begin{align} {\color{red} E} &\Rightarrow E + {\color{red} E} \Rightarrow E + E * {\color{red} E} \Rightarrow {\color{red} E} + E * (E) \\ &\Rightarrow {\color{red} I} + E * (E) \Rightarrow {\color{red} I}0 + E * (E) \Rightarrow a0 + {\color{red} E} * (E) \\ &\Rightarrow a0 + {\color{red} I} * (E) \Rightarrow a0 + {\color{red} I}b * (E) \Rightarrow a0 + ab * ({\color{red} E}) \\ &\Rightarrow a0 + ab * ({\color{red} E} + E) \Rightarrow a0 + ab * (I + {\color{red} E}) \Rightarrow a0 + ab * ({\color{red} I} + I) \\ &\Rightarrow a0 + ab * (a + {\color{red} I}) \Rightarrow a0 + ab * (a + {\color{red} I}1) \Rightarrow a0 + ab * (a + b1) \end{align} \]

I have used color to highlight the chosen variable being replaced at each step.

Now, suppose that we wished to perform a leftmost derivation instead.

\[ \begin{align} {\color{red} E} &\Rightarrow {\color{red} E} + E \Rightarrow {\color{red} I} + E \Rightarrow {\color{red} I}0 + E \\ &\Rightarrow a0 + {\color{red} E} \Rightarrow a0 + {\color{red} E} * E \Rightarrow a0 + {\color{red} I} * E \\ &\Rightarrow a0 + {\color{red} I}b * E \Rightarrow a0 + ab * {\color{red} E} \Rightarrow a0 + ab * ({\color{red} E}) \\ &\Rightarrow a0 + ab * ({\color{red} E} + E) \Rightarrow a0 + ab * ({\color{red} I} + E) \Rightarrow a0 + ab * (a + {\color{red} E}) \\ &\Rightarrow a0 + ab * (a + {\color{red} I}) \Rightarrow a0 + ab * (a + {\color{red} I}1) \Rightarrow a0 + ab * (a + b1) \end{align} \]
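Notice that, because the leftmost variable occurrence is always the one expanded, a leftmost derivation is completely determined by the sequence of productions chosen. The sketch below (same tuple representation as the earlier sketches) replays such a sequence; the first few steps of the derivation above serve as the test:

```python
def leftmost_derivation(start, choices, variables):
    """Replay a leftmost derivation: at every step, expand the leftmost
    variable occurrence using the next production in `choices`."""
    form = (start,)
    steps = [form]
    for head, gamma in choices:
        i = next(j for j, s in enumerate(form) if s in variables)
        assert form[i] == head, "production must match the leftmost variable"
        form = form[:i] + gamma + form[i + 1:]
        steps.append(form)
    return steps

steps = leftmost_derivation(
    "E",
    [("E", ("E", "+", "E")),   # E => E + E
     ("E", ("I",)),            # => I + E
     ("I", ("I", "0")),        # => I0 + E
     ("I", ("a",))],           # => a0 + E
    {"E", "I"})
print(["".join(s) for s in steps])
# ['E', 'E+E', 'I+E', 'I0+E', 'a0+E']
```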

Example 6: Rightmost derivation of 'a0 + ab * (a + b1)'

For the same grammar,

\[ \begin{align} {\color{red} E} &\Rightarrow E + {\color{red} E} \Rightarrow E + E * {\color{red} E} \Rightarrow E + E * ({\color{red} E}) \\ &\Rightarrow E + E * (E + {\color{red} E}) \Rightarrow E + E * (E + {\color{red} I}) \Rightarrow E + E * (E + {\color{red} I}1) \\ &\Rightarrow E + E * ({\color{red} E} + b1) \Rightarrow E + E * ({\color{red} I} + b1) \Rightarrow E + {\color{red} E} * (a + b1) \\ &\Rightarrow E + {\color{red} I} * (a + b1) \Rightarrow E + {\color{red} I}b * (a + b1) \Rightarrow {\color{red} E} + ab * (a + b1) \\ &\Rightarrow {\color{red} I} + ab * (a + b1) \Rightarrow {\color{red} I}0 + ab * (a + b1) \Rightarrow a0 + ab * (a + b1) \end{align} \]

2 Parse Trees

Parse trees are a way of recording derivations that focus less upon the order in which derivation steps were applied and more on which productions were employed.

A parse tree for a string $s$ in a language described by a CFG is a tree in which

  1. The root of the tree is labeled with the grammar’s start symbol.
  2. Each internal node is labeled with a variable.
  3. Each leaf node is labeled with a terminal.
  4. The leaf nodes, read from left to right, provide the string $s$.
  5. Each internal node labeled with a variable $A$ has children labeled with the symbols of $\alpha$ (reading left to right through the children) only if $A \rightarrow \alpha$ is a production in the grammar.
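Before the example, here is one minimal way such a tree might be represented in code (a sketch; the class and function names are illustrative). Each node carries a label and its children, and reading off the string of rule 4 is a simple left-to-right recursion over the leaves:

```python
class Node:
    """A parse-tree node: a label plus children (empty for a leaf)."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def leaves(node):
    """The tree's string: leaf labels, read left to right (rule 4)."""
    if not node.children:
        return node.label
    return "".join(leaves(child) for child in node.children)

# The parse tree for 0011 in the grammar of Example 1:
# the root uses S -> 0S1, the inner S uses S -> 01.
tree = Node("S", [Node("0"),
                  Node("S", [Node("0"), Node("1")]),
                  Node("1")])
print(leaves(tree))   # prints 0011
```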
Example 7: A Parse Tree for 'a0 + ab * (a + b1)'

We can build a parse tree by tracing one of our derivations:

[figure: the parse tree built so far]

\[ \begin{align} {\color{red} E} &\Rightarrow {\color{red} E} + E \\ \end{align} \]

[figure: the parse tree built so far]

\[ \begin{align} {\color{red} E} &\Rightarrow {\color{red} E} + E \Rightarrow {\color{red} I} + E \end{align} \]

[figure: the parse tree built so far]

\[ \begin{align} {\color{red} E} &\Rightarrow {\color{red} E} + E \Rightarrow {\color{red} I} + E \Rightarrow {\color{red} I}0 + E \\ \end{align} \]

[figure: the parse tree built so far]

\[ \begin{align} {\color{red} E} &\Rightarrow {\color{red} E} + E \Rightarrow {\color{red} I} + E \Rightarrow {\color{red} I}0 + E \\ &\Rightarrow a0 + {\color{red} E} \end{align} \]

[figure: the parse tree built so far]

\[ \begin{align} {\color{red} E} &\Rightarrow {\color{red} E} + E \Rightarrow {\color{red} I} + E \Rightarrow {\color{red} I}0 + E \\ &\Rightarrow a0 + {\color{red} E} \Rightarrow a0 + {\color{red} E} * E \end{align} \]

Eventually, by the end of the derivation

\[ \begin{align} {\color{red} E} &\Rightarrow {\color{red} E} + E \Rightarrow {\color{red} I} + E \Rightarrow {\color{red} I}0 + E \\ &\Rightarrow a0 + {\color{red} E} \Rightarrow a0 + {\color{red} E} * E \Rightarrow a0 + {\color{red} I} * E \\ &\Rightarrow a0 + {\color{red} I}b * E \Rightarrow a0 + ab * {\color{red} E} \Rightarrow a0 + ab * ({\color{red} E}) \\ &\Rightarrow a0 + ab * ({\color{red} E} + E) \Rightarrow a0 + ab * ({\color{red} I} + E) \Rightarrow a0 + ab * (a + {\color{red} E}) \\ &\Rightarrow a0 + ab * (a + {\color{red} I}) \Rightarrow a0 + ab * (a + {\color{red} I}1) \Rightarrow a0 + ab * (a + b1) \end{align} \]

we will have

[figure: the completed parse tree for “a0 + ab $*$ (a + b1)”]

Some points to observe about this:

  • Verify for yourself that the leaves, read left to right, do indeed spell out “a0 + ab $*$ (a + b1)”.
  • The tree makes it easy to see which production was applied to replace each variable, something that could be a little hard to tell by looking at the derivation.
  • To the degree that the grammar expresses the “logical structure” of the language, the parse tree makes it easy to see how that structure maps onto the string.
  • If we just look at the tree, we can’t tell the order in which the derivation steps were applied. We can’t tell, for example, whether the left “E” or the right “E” in the second-level “E + E” was expanded first.
    • In fact, all three of our derivations for this string would yield the same parse tree!

2.1 Ambiguity

In the example we just looked at, all three derivations we had previously obtained would have yielded the same parse tree.

That’s not always the case.

Let’s return to our simpler expression grammar:

\[ \begin{align} \mbox{Expr} &\rightarrow \mbox{id} \\ \mbox{Expr} &\rightarrow \mbox{Expr} + \mbox{Expr} \\ \mbox{Expr} &\rightarrow \mbox{Expr} * \mbox{Expr} \\ \mbox{Expr} &\rightarrow (\mbox{Expr}) \\ \end{align} \]

and consider the string “a + b * c”, or, more precisely, “id + id * id”.

Here is one parse tree.

[figure: a parse tree in which the product “id $*$ id” sits below the sum, i.e., id + (id $*$ id)]

Read the leaves and you can see that it matches our string. Inspect the internal nodes and you can see that each has children that correspond to a production in our grammar.

Here is another parse tree.

[figure: a parse tree in which the sum “id + id” sits below the product, i.e., (id + id) $*$ id]

Read the leaves and you can see that it also matches our string. Inspect the internal nodes and you can see that each has children that correspond to a production in our grammar.

Both of these are valid parse trees for this string in this grammar.

We say that a string is ambiguous with respect to a grammar G if it has two distinct parse trees derivable from that grammar.

We say that a grammar is ambiguous if there is any ambiguous string in its language.

Is that a problem? Well, if we believe that the grammar is supposed to reflect the “logical structure” of our language, then this could indeed be a problem.

Our first parse tree expresses the idea that “a is added to the product of b and c”. The second tree expresses the idea that “the sum of a and b will be multiplied by c”. These are two very different and incompatible views of that calculation.
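To see just how incompatible, plug in numbers: with a = 2, b = 3, and c = 4, the first reading computes 2 + (3 * 4) = 14, while the second computes (2 + 3) * 4 = 20.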

If we were writing a compiler, we would “attach” actions to each production. These actions would generate code reflecting what we thought we had discovered about the structure of the underlying computation:

production        action
E -> id           E.loc = id.loc;
E -> E1 + E2      E.loc = new-temp-variable; Generate(“add E.loc E1.loc E2.loc”)
E -> E1 * E2      E.loc = new-temp-variable; Generate(“mult E.loc E1.loc E2.loc”)

The code generated from the second parse tree would be incorrect for our usual interpretation of “a + b * c”.
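As a concrete (and heavily simplified) sketch of attaching such actions, here is a post-order walk over an expression tree in Python. The helper `new_temp` and the textual three-address output are invented for this example; they stand in for whatever a real compiler’s temporary allocator and Generate routine would do.

```python
temp_count = 0

def new_temp():
    """A fresh temporary variable name (illustrative helper)."""
    global temp_count
    temp_count += 1
    return f"t{temp_count}"

def gen(expr):
    """Emit three-address code for an expression tree and return the
    location holding its value. Trees are tuples: ("id", name) or
    (op, left, right) with op in {"+", "*"}."""
    if expr[0] == "id":
        return expr[1]                            # E -> id: E.loc = id.loc
    op, left, right = expr
    left_loc, right_loc = gen(left), gen(right)
    loc = new_temp()                              # E.loc = new-temp-variable
    print({"+": "add", "*": "mult"}[op], loc, left_loc, right_loc)
    return loc

# The conventional tree for a + b * c, i.e. a + (b * c):
gen(("+", ("id", "a"), ("*", ("id", "b"), ("id", "c"))))
# prints:  mult t1 b c
#          add t2 a t1
```

Fed the other parse tree, `("*", ("+", ("id", "a"), ("id", "b")), ("id", "c"))`, the same walk would emit the add before the mult, computing (a + b) * c instead.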

2.1.1 Fixing Ambiguity

Comparing these parse trees to our usual ideas of algebraic expressions, we can see that the second parse tree violates our conventional idea that “multiplication takes precedence over addition”. “Precedence” is an informal idea that introduces a kind of layering among different operators. We can actually enforce precedence by introducing different “layers” of expressions and subexpressions:

\[ \begin{align} \mbox{Expr} &\rightarrow \mbox{Expr} + \mbox{Expr} \\ \mbox{Expr} &\rightarrow \mbox{Expr} - \mbox{Expr} \\ \mbox{Expr} &\rightarrow \mbox{Term} \\ \mbox{Term} &\rightarrow \mbox{Term} * \mbox{Term} \\ \mbox{Term} &\rightarrow \mbox{Term} / \mbox{Term} \\ \mbox{Term} &\rightarrow (\mbox{Expr}) \\ \mbox{Term} &\rightarrow \mbox{id} \\ \end{align} \]

I’ve expanded this language to add subtraction and division as well, to illustrate that addition and subtraction are considered to be the same level of precedence, while multiplication and division are together at another level of precedence.

If we wanted to introduce still more operators, we might need more layers. For example, the relational operators in most programming languages are at lower precedence than both addition and multiplication. Languages that have an exponentiation (“to the power of”) operator usually have yet another level of precedence.

So, what does this layered approach do for us?

Let’s look at ‘a + b * c’ in this new grammar.

Here is a parse tree for “a + b $*$ c” in the new grammar.

[figure: the parse tree in the layered grammar, with the product grouped under a Term]

Again, take the time to read the leaves and see for yourself that it matches our string. Inspect the internal nodes and you can see that each has children that correspond to a production in our grammar.

Now, suppose that we deliberately try to “do it wrong” and express this string as a product of a sum with c. We can get a start on that idea as shown here.

[figure: the attempted parse tree, with a Term on the left that would have to yield “a + b”]

But now we’re stuck. There’s no way to expand the Term on the left to get the string “a + b”. The only way to get a “+” is from an Expr, and the only way to get from a Term to an Expr is via the production

\[ \begin{align} \mbox{Term} &\rightarrow (\mbox{Expr}) \\ \end{align} \]

which adds parentheses around the Expr that aren’t actually present in our string.

So we can’t actually violate precedence with this grammar even if we wanted to!


Does that mean we have reached an unambiguous grammar? Unfortunately, no. Consider the string “a - b - c”.

Here is one parse tree for that string.

[figure: a parse tree structuring the calculation as a - (b - c)]

And here is another.

[figure: a parse tree structuring the calculation as (a - b) - c]

The previous tree suggested that the calculation a - b - c is structured as a - (b - c). This tree, on the other hand, suggests that it is structured as (a - b) - c. These two interpretations lead to very different results.

The first tree in this example violates our common interpretation of “associativity”. We generally say that addition, subtraction, multiplication, and division “associate to the left” or “are left-associative”. On the other hand, “raise to the power of” operators, in languages that have them, are usually right-associative.

Associativity rules imply that, even though we often think of these binary operators as being “symmetric” in their treatment of their operands, they really are not. And we can embed this associativity into our grammar by breaking the symmetry of production right-hand-sides like Expr + Expr and Term / Term:

\[ \begin{align} \mbox{Expr} &\rightarrow \mbox{Expr} + \mbox{Term} \\ \mbox{Expr} &\rightarrow \mbox{Expr} - \mbox{Term} \\ \mbox{Expr} &\rightarrow \mbox{Term} \\ \mbox{Term} &\rightarrow \mbox{Term} * \mbox{Factor} \\ \mbox{Term} &\rightarrow \mbox{Term} / \mbox{Factor} \\ \mbox{Factor} &\rightarrow (\mbox{Expr}) \\ \mbox{Factor} &\rightarrow \mbox{id} \\ \end{align} \]

Here is a parse tree for “a - b - c” in this new grammar.

[figure: the parse tree in the associativity-fixed grammar, structured as (a - b) - c]

Again, take the time to verify for yourself that this is a valid parse tree for the new grammar.

Now, again, suppose that we tried to deliberately build a parse tree that would violate associativity. We can start out as shown here, with the intention that the Term on the right would expand into “b - c”. But that won’t be possible, because we can’t get another subtraction operator without deriving an Expr, and the only way to derive an Expr from a Term is via the production that introduces parentheses.

[figure: the attempted parse tree, stuck at a Term on the right that cannot yield “b - c”]
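To tie this final grammar back to parsing, here is a hedged sketch of a hand-written recursive-descent parser for it in Python (all names illustrative). The left-recursive productions such as Expr → Expr - Term cannot be coded as direct recursion, so each left-recursive layer becomes a loop, and the loop builds exactly the left-associative trees the grammar demands:

```python
def parse(tokens):
    """Recursive-descent parser for the final grammar. `tokens` is a
    list like ["a", "-", "b", "-", "c"]; returns a nested tuple tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def eat(expected=None):
        nonlocal pos
        tok = tokens[pos]
        assert expected is None or tok == expected, f"expected {expected}"
        pos += 1
        return tok

    def expr():                      # Expr -> Expr (+|-) Term | Term
        tree = term()
        while peek() in ("+", "-"):  # the loop realizes the left recursion,
            op = eat()               # so the tree grows to the left
            tree = (op, tree, term())
        return tree

    def term():                      # Term -> Term (*|/) Factor | Factor
        tree = factor()
        while peek() in ("*", "/"):
            op = eat()
            tree = (op, tree, factor())
        return tree

    def factor():                    # Factor -> ( Expr ) | id
        if peek() == "(":
            eat("(")
            tree = expr()
            eat(")")
            return tree
        return ("id", eat())

    tree = expr()
    assert pos == len(tokens), "trailing input"
    return tree

print(parse(["a", "-", "b", "-", "c"]))
# ('-', ('-', ('id', 'a'), ('id', 'b')), ('id', 'c'))   i.e. (a - b) - c
```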

2.2 The Bad News about Ambiguity

It’s not always easy to remove ambiguity from a CFG.

In fact, it’s not always possible to remove ambiguity from a CFG. Some CFLs are inherently ambiguous: they have only ambiguous grammars.

We’ll see later that there is no algorithm that can even determine whether an arbitrary CFG is ambiguous.

2.3 The Good News about Ambiguity

The common forms of ambiguity that arise when processing expressions and other common programming language structures (precedence and associativity) are well understood, and the parser generators typically used by compiler writers and other developers of programming language tools, such as YACC and bison, can handle them directly through declarations. As a result, it’s fairly rare to introduce asymmetry and layering into a grammar solely for the purpose of dealing with ambiguity.