Properties of Context-Free Languages

In this section we focus on the important properties of the languages themselves: closure properties, which say how we can build new context-free languages from combinations of existing ones, and decision properties, which allow us to ask (and algorithmically answer) questions about context-free languages.

1 Normal Forms for CFLs

The notation for CFGs allows us a lot of flexibility, which we can exploit to improve the “quality” of our grammars:

Expressiveness and readability
Avoiding or removing ambiguity
Compatibility with various parsing algorithms.
- Some common algorithms, for example, have problems with “left-recursive” productions of the form
  
  \[ A \rightarrow A \alpha \]
  
  which can be quite common, for example
  
  \[ \begin{align} \mbox{Expr} &\rightarrow \mbox{Expr} + \mbox{Term} \\ \mbox{Expr} &\rightarrow \mbox{Expr} - \mbox{Term} \\ \mbox{Expr} &\rightarrow \mbox{Term} \\ \mbox{Term} &\rightarrow \mbox{Term} * \mbox{Factor} \\ \mbox{Term} &\rightarrow \mbox{Term} / \mbox{Factor} \\ \mbox{Term} &\rightarrow \mbox{Factor} \\ \mbox{Factor} &\rightarrow (\mbox{Expr}) \\ \mbox{Factor} &\rightarrow \mbox{id} \\ \end{align} \]

But all that flexibility can complicate proving properties of CFLs, so we will work, in this section, toward a restricted syntax for CFGs.

1.1 Eliminating Useless Variables

A variable X in a grammar G is useful if there is a derivation in G: \[ S \Rightarrow^* \alpha X \beta \Rightarrow^* w \]

This breaks down to two conditions:

Can we reach $X$ from the starting symbol?
Can we use $X$ to generate at least one (possibly empty) string?

1.1.1 Eliminating Unreachable Symbols

We can eliminate unreachable variables by applying a simple recursion to mark the reachable ones:

Basis: The starting symbol S is reachable.
Induction: If variable A has been marked as reachable, examine all productions of the form
\[ A \rightarrow \alpha \]

Any variables appearing in $\alpha$ are reachable.

Apply the induction step until no more variables can be marked as reachable. Then any variables not marked as reachable are unreachable.

If a variable is unreachable, we can eliminate the variable and all productions containing that variable on the left or right of the $\rightarrow$.
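The marking procedure above can be sketched in a few lines of Python. This is only an illustration: the dictionary representation of a grammar (each variable mapped to a list of right-hand sides, each a tuple of symbols) and the function names are assumptions of this sketch, not notation from the text.

```python
def reachable_symbols(productions, start):
    """Return every symbol reachable from `start`.

    `productions` maps each variable to a list of right-hand sides,
    each right-hand side being a tuple of symbols (assumed representation).
    """
    reachable = {start}              # basis: the start symbol is reachable
    worklist = [start]
    while worklist:                  # induction: repeat until nothing new is marked
        head = worklist.pop()
        for body in productions.get(head, []):
            for symbol in body:
                if symbol not in reachable:
                    reachable.add(symbol)
                    worklist.append(symbol)
    return reachable


def drop_unreachable(productions, start):
    """Keep only the productions whose left-hand side is reachable."""
    keep = reachable_symbols(productions, start)
    return {head: bodies for head, bodies in productions.items() if head in keep}


example = {
    "S": [("A", "b")],
    "A": [("a",)],
    "C": [("c", "A")],   # C never appears in anything derived from S
}
cleaned = drop_unreachable(example, "S")
```

Filtering on the left-hand side is enough here, because every symbol on the right of a reachable variable's production is itself reachable.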

1.1.2 Eliminating Non-generating Variables

Any grammar that generates a non-empty language will have at least one production of the form $A \rightarrow w$, where $w$ is a possibly empty string of terminals. If that weren’t the case, every step in a derivation would introduce a new variable, and we could never actually derive any strings that did not include variables.

With that in mind, we can detect which variables generate terminal strings and which do not:

Basis: All variables appearing on the left-hand side of a production of the form $A \rightarrow w$ are generating.
Induction: If there is a production $B \rightarrow \alpha$ and all variables appearing in $\alpha$ have been marked as generating, then $B$ is generating.

Apply the induction step until no more variables can be marked as generating.

Any variables that are not generating can be dropped from a grammar, along with any productions that mention a non-generating variable on the left or right of the $\rightarrow$.
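Here is a minimal sketch of that marking loop, under the assumption that a grammar is stored as a dictionary from each variable to a list of right-hand-side tuples, and that every variable appears as a key (possibly with an empty list). The names are illustrative only.

```python
def generating_variables(productions):
    """Return the set of variables that derive at least one terminal string."""
    variables = set(productions)
    generating = set()
    changed = True
    while changed:                       # repeat the induction step to a fixed point
        changed = False
        for head, bodies in productions.items():
            if head in generating:
                continue
            for body in bodies:
                # A body qualifies if each symbol is a terminal or a marked variable.
                # A body of terminals only (or an empty body) is the basis case.
                if all(s in generating or s not in variables for s in body):
                    generating.add(head)
                    changed = True
                    break
    return generating


def drop_non_generating(productions):
    """Remove non-generating variables and every production that mentions one."""
    gen = generating_variables(productions)
    variables = set(productions)
    return {
        head: [b for b in bodies
               if all(s in gen or s not in variables for s in b)]
        for head, bodies in productions.items()
        if head in gen
    }


example = {
    "S": [("A", "B"), ("a",)],
    "A": [("b",)],
    "B": [("B", "b")],       # B can never get rid of itself
}
cleaned = drop_non_generating(example)
```

In this example, removing the non-generating `B` also removes `S -> A B`, which leaves `A` unreachable. That is why a sketch like this is usually run before the reachability pass rather than after it.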

1.2 Eliminating $\epsilon$-Productions

Suppose that a language $L$ includes the empty string. Then we could designate that in a grammar $G$ by a specific production

\[ S \rightarrow \epsilon \]

We might then seek to eliminate any other $\epsilon$ productions in the grammar, so that the language of the modified grammar, without that special production, would be $L(G) - \{\epsilon\}$.

A variable is nullable if it can derive the empty string.

Basis: All variables appearing on the left-hand side of a production of the form $A \rightarrow \epsilon$ are nullable.
Induction: If there is a production $B \rightarrow \alpha$ and all symbols appearing in $\alpha$ are variables that have been marked as nullable, then $B$ is nullable.

Apply the induction step until no more variables can be marked as nullable.

Eliminating $\epsilon$-productions:

Suppose we have a production $A \rightarrow X_1 X_2 X_3$ where $X_1$ and $X_2$ are nullable. That doesn’t mean that $X_1$ and $X_2$ always derive the empty string, only that sometimes they can do so.

Because we know that $X_1 \Rightarrow^* \epsilon$ and $X_2 \Rightarrow^* \epsilon$, we could add these productions to the grammar without changing the language being generated:

\[ \begin{align} A &\rightarrow X_2 X_3 \\ A &\rightarrow X_1 X_3 \\ A &\rightarrow X_3 \\ \end{align} \]

The first production is valid because, sometimes, $X_1$ will derive the empty string. The second production works because, sometimes, $X_2$ will derive the empty string. The third production works because, sometimes, both of them will derive the empty string.

With that idea, we can eliminate $\epsilon$-productions by:

Replace every production $A \rightarrow X_1\ldots X_n$ where at least one of the $X_i$ is nullable by a series of productions formed by removing each distinct combination of the nullable symbols (including removing none of them), but never adding a production with an empty right-hand side.
Remove all productions of the form $A \rightarrow \epsilon$.
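The nullable test and the two elimination steps can be sketched together as follows. The grammar representation (a dictionary from variables to lists of right-hand-side tuples, with `()` standing for $\epsilon$) and the function names are assumptions made for this illustration.

```python
from itertools import product


def nullable_variables(productions):
    """Return the set of variables that can derive the empty string."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            # An empty body is the basis; a body of nullable variables is the induction.
            if head not in nullable and any(
                all(s in nullable for s in body) for body in bodies
            ):
                nullable.add(head)
                changed = True
    return nullable


def eliminate_epsilon(productions):
    """Return an equivalent grammar (up to the empty string) with no epsilon-productions."""
    nullable = nullable_variables(productions)
    result = {}
    for head, bodies in productions.items():
        new_bodies = set()
        for body in bodies:
            # Each nullable symbol may be kept or dropped; other symbols must be kept.
            options = [((s,), ()) if s in nullable else ((s,),) for s in body]
            for choice in product(*options):
                candidate = tuple(s for part in choice for s in part)
                if candidate:            # never emit an empty right-hand side
                    new_bodies.add(candidate)
        result[head] = sorted(new_bodies)
    return result


example = {
    "A": [("X1", "X2", "X3")],
    "X1": [("a",), ()],
    "X2": [("b",), ()],
    "X3": [("c",)],
}
cleaned = eliminate_epsilon(example)
```

For the production $A \rightarrow X_1 X_2 X_3$ this produces exactly the original body plus the three shortened bodies discussed above. Note that the output may contain unit productions such as $A \rightarrow X_3$, which are dealt with separately.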

1.3 Eliminating Unit Productions

A unit production is a production of the form $A \rightarrow B$.

If we wished to eliminate unit productions, we might first consider simple text substitution. Suppose, for example, we replace all occurrences of $A$ in the productions by $B$. After all, we can see that $A \Rightarrow B$.

Unfortunately, this can change the language being generated.

For example,

\[ \begin{align} S &\rightarrow ABB \\ A &\rightarrow 0 \\ A &\rightarrow B \\ B &\rightarrow 1 \\ \end{align} \]

Not the most interesting grammar, as it derives only the strings “011” and “111”. But if we replaced $A$ by $B$

\[ \begin{align} S &\rightarrow BBB \\ B &\rightarrow 0 \\ B &\rightarrow B \\ B &\rightarrow 1 \\ \end{align} \]

(and then dropped the useless $B \rightarrow B$), the resulting grammar derives a number of strings (e.g., “000”) that were not in the original language. You should convince yourself that simple variants on this rule (e.g., replace $A$ by $B$ only on the right-hand side of productions, replace $B$ by $A$, etc.) are also invalid.

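So we need a more sophisticated approach. If $A \rightarrow B$, then for each non-unit production $B \rightarrow \alpha$, we add a new production $A \rightarrow \alpha$, and then we drop the unit production.

Applied one unit production at a time, this needs care when unit productions form a chain (e.g., $A \rightarrow B$, $B \rightarrow C$) or a “cycle” (e.g., $A \rightarrow B$, $B \rightarrow C$, $C \rightarrow A$), because $A$ must also pick up the non-unit productions of every variable it can reach through unit productions alone. Your text explains how to handle those.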
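One common way to cope with chains and cycles is to first collect, for each variable, every variable it can reach using unit productions only, and then copy over the non-unit productions of all of those. The sketch below takes that approach; it is an illustration under an assumed dictionary representation of grammars, with hypothetical function names, and is not necessarily the exact procedure in your text.

```python
def eliminate_unit_productions(productions):
    """Return an equivalent grammar with no productions of the form A -> B."""
    variables = set(productions)

    def is_unit(body):
        return len(body) == 1 and body[0] in variables

    result = {}
    for head in productions:
        # Collect every variable reachable from `head` via unit productions alone.
        # The `closure` set guarantees termination even if there is a cycle.
        closure = {head}
        stack = [head]
        while stack:
            current = stack.pop()
            for body in productions[current]:
                if is_unit(body) and body[0] not in closure:
                    closure.add(body[0])
                    stack.append(body[0])
        # `head` inherits the non-unit productions of everything in the closure.
        new_bodies = []
        for var in sorted(closure):
            for body in productions[var]:
                if not is_unit(body) and body not in new_bodies:
                    new_bodies.append(body)
        result[head] = new_bodies
    return result


# The example grammar from this section: S -> ABB, A -> 0 | B, B -> 1
example = {
    "S": [("A", "B", "B")],
    "A": [("0",), ("B",)],
    "B": [("1",)],
}
cleaned = eliminate_unit_productions(example)

# A cycle of unit productions is handled the same way.
cyclic = {"A": [("B",)], "B": [("C",)], "C": [("A",), ("c",)]}
```

On the example grammar, `A` ends up with the productions `A -> 0` and `A -> 1`, so the grammar still derives only “011” and “111”.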

1.4 The Chomsky Normal Form

First, let’s summarize what we have accomplished so far:

Theorem 7.14 If $G$ is a CFG generating $L(G)$ that contains at least one non-empty string, then there exists a grammar $G_1$ such that

$L(G_1) = L(G) - \{\epsilon\}$

$G_1$ has no $\epsilon$-productions, unit productions, or useless symbols.

This leads us to the Chomsky Normal Form (CNF):

For every CFL $L$ that contains at least one non-empty string, there is a grammar $G$ with $L(G) = L - \{\epsilon\}$ in which every production has the form

\[A \rightarrow B C \]

or

\[A \rightarrow a \]

and, furthermore, $G$ has no useless symbols.

The conversion to Chomsky normal form starts by applying all of the transformations we have already looked at: eliminating useless symbols, $\epsilon$-productions, and unit productions. That would leave us with some productions that are already in the desired format for Chomsky Normal Form, and some productions that are too long, in the form

\[ A \rightarrow X_1 X_2 \alpha \]

But we can break that production up by introducing a new variable:

\[ \begin{align} A &\rightarrow X_1 C \\ C &\rightarrow X_2 \alpha \\ \end{align} \]

The right-hand side of the “C” production is shorter than the original right-hand side of the “A” production. We repeat this step as necessary until all productions have been broken down to length 1 or 2.

Now, some of the remaining productions might be in the form $A \rightarrow B a$ or $A \rightarrow a B$, which is not quite Chomsky Normal Form. But we can take care of that by introducing still more variables. For example

\[ A \rightarrow a B \]

is replaced by

\[ \begin{align} A &\rightarrow C B \\ C &\rightarrow a \\ \end{align} \]

which is perfectly Chomsky-ish.
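The two rewriting steps can be sketched as below. This assumes the input grammar has already had its $\epsilon$-productions, unit productions, and useless symbols removed, that grammars are stored as dictionaries from variables to lists of right-hand-side tuples, and that the generated names `N1`, `N2`, ... do not clash with any terminal. All of those, and the function name, are assumptions of this sketch. It also replaces terminals first and splits long bodies second, which is the reverse of the order described above; the result is the same kind of grammar.

```python
def to_cnf(productions):
    """Convert a cleaned-up grammar to Chomsky Normal Form."""
    variables = set(productions)
    result = {head: [] for head in productions}
    terminal_var = {}            # terminal -> the new variable that derives only it
    counter = 0

    def fresh():
        """Create a new variable name not already used as a variable."""
        nonlocal counter
        while True:
            counter += 1
            name = f"N{counter}"
            if name not in variables:
                variables.add(name)
                result[name] = []
                return name

    def lift(symbol):
        """Return a variable standing for `symbol` (itself, if already a variable)."""
        if symbol in variables:
            return symbol
        if symbol not in terminal_var:
            var = fresh()
            terminal_var[symbol] = var
            result[var].append((symbol,))       # C -> a
        return terminal_var[symbol]

    for head, bodies in productions.items():
        for body in bodies:
            if len(body) == 1:
                result[head].append(body)        # already of the form A -> a
                continue
            symbols = [lift(s) for s in body]    # now variables only
            current = head
            while len(symbols) > 2:              # A -> X1 C, C -> X2 ...
                rest = fresh()
                result[current].append((symbols[0], rest))
                current = rest
                symbols = symbols[1:]
            result[current].append(tuple(symbols))
    return result


# S -> a S b | a b
example = {"S": [("a", "S", "b"), ("a", "b")]}
cnf = to_cnf(example)
```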

1.4.1 Why?

Now, let’s be clear. We don’t convert perfectly good grammars to Chomsky Normal Form because the grammar is more useful in that form. We use grammars in Chomsky Normal Form to prove properties of CFLs, knowing that anything we prove about the languages of grammars in that form will apply to all CFLs.

Why is Chomsky Normal Form useful in proofs? Think about the parse trees that it generates. A parse tree has variables in its internal nodes, with each variable node being the parent of the symbols in a production right-hand-side for that variable.

If the grammar is in Chomsky Normal Form, each production “parent” expands to either two variable children or a single terminal child.

All parse trees for a Chomsky Normal Form grammar will be binary trees.

2 The Pumping Lemma for Context-Free Languages

Let $L$ be a CFL. Then there exists a constant $n$ such that if $z$ is a string in $L$ with $|z| \geq n$, then we can divide $z$ into parts $uvwxy$, such that

$|vwx| \leq n$ (the middle is not “too long”)

$vx \neq \epsilon$ (either $v$ or $x$ or both are non-empty)

$\forall i \geq 0, uv^iwx^iy \in L$ (we can “pump” or repeat the $v$ and $x$ parts arbitrarily often without leaving the language $L$)

Example 1: Pumping $0^n1^n$

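This is pretty easy. Take any string $z = 0^k1^k$ with $k \geq 1$ and let $u = 0^{k-1}$, $v = 0$, $w = \epsilon$, $x = 1$, $y = 1^{k-1}$. (For the shortest such string, “01”, this is just $u = \epsilon$, $v = 0$, $w = \epsilon$, $x = 1$, $y = \epsilon$.)

Then pumping gives us $0^{k-1} 0^i \epsilon 1^i 1^{k-1} = 0^{k-1+i}1^{k-1+i}$, which again has equal numbers of 0’s and 1’s, with all the 0’s first.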

Example 2: Pumping $ww^R$

Here the value of $n$ matters because it bounds the length of the middle part: $|vwx| \leq n$. Take $n=3$; the decomposition below uses a middle part $vwx$ of length 2, which satisfies that bound.

Pick a string in the language, e.g., “01011010”.

Let $v$ be the character immediately to the left of the middle (e.g. 1)

Let $x$ be the character immediately to the right of the middle (e.g. 1)

That means $w = \epsilon$.

Let $u$ be all of the characters before the one we chose for $v$ (e.g., “010”)

Let $y$ be all of the characters after the one we chose for $x$ (e.g., “010”).

Then we pump $uv^iwx^iy$. In this example:

i=0, “010010” is in $ww^R$

i=1, “01011010”, our original string, is in $ww^R$

i=2, “0101111010”, is in $ww^R$

i=3, “010111111010”, is in $ww^R$

And so on. Each higher value of $i$ adds a pair of additional 1’s right in the middle, which preserves the mirror imaging we want for this language.
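The pumping in this example is easy to check mechanically. The short sketch below uses hypothetical helper names (`pump`, `is_mirror`) and treats $ww^R$ as the set of even-length strings that read the same forwards and backwards.

```python
def pump(u, v, w, x, y, i):
    """Build the string u v^i w x^i y."""
    return u + v * i + w + x * i + y


def is_mirror(s):
    """True if s has the form w followed by the reverse of w."""
    half = len(s) // 2
    return len(s) % 2 == 0 and s[:half] == s[half:][::-1]


# The decomposition of "01011010" chosen above.
u, v, w, x, y = "010", "1", "", "1", "010"
pumped = [pump(u, v, w, x, y, i) for i in range(4)]
all_in_language = all(is_mirror(s) for s in pumped)
```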

3 Decision Properties

Now we consider some important questions that algorithms exist to answer.

3.1 Is a given string in a CFL?

Given a string $w$, is it in a given CFL?

If we are given the CFL as a PDA, we can start by converting the PDA to a grammar.

Given the language as a grammar (either originally or after converting a PDA), we can convert the grammar to Chomsky Normal Form and parse the string to find a derivation for it.

It may not be obvious that our procedures for finding derivations will always terminate. We have seen that, when finding a derivation, we have choices as to which variable to replace and which production to use in the replacement. However,

We have previously noted that if a string is in the language, then it will have a leftmost derivation. So we can systematically always choose to replace the leftmost variable.
That leaves the choice of production. We can systematically try all available choices, in a form of backtracking.

This terminates because we only have a finite number of productions to choose among, and one of the key properties of Chomsky Normal Form is that each derivation step either increases the total number of symbols by one or leaves the total number of symbols unchanged while replacing a variable by a terminal. So any time the total number of symbols in a derived string exceeds the length of the string we are trying to derive, we know we have chosen an incorrect production somewhere along the way and can backtrack and try a different one.
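A minimal sketch of this backtracking search, assuming the grammar is already in Chomsky Normal Form and is stored as a dictionary from variables to lists of right-hand-side tuples. The function name and the example grammar are illustrative only, and the search is exponential in the worst case.

```python
def derives(productions, start, target):
    """Search leftmost derivations of a CNF grammar for `target`, with backtracking."""
    variables = set(productions)
    target = tuple(target)

    def search(form):
        # In CNF a sentential form never shrinks, so an over-long form is a dead end.
        if len(form) > len(target):
            return False
        # Locate the leftmost variable, if any.
        pos = next((k for k, s in enumerate(form) if s in variables), None)
        if pos is None:
            return form == target
        # Try every production for that variable; returning False backtracks.
        for body in productions[form[pos]]:
            if search(form[:pos] + body + form[pos + 1:]):
                return True
        return False

    return search((start,))


# A CNF grammar for strings of one or more 0's followed by the same number of 1's.
cnf_grammar = {
    "S": [("A", "T"), ("A", "B")],
    "T": [("S", "B")],
    "A": [("0",)],
    "B": [("1",)],
}
```

For instance, `derives(cnf_grammar, "S", "0011")` succeeds after abandoning the branch that grows to five symbols.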

Actually, we can do better than that. The CYK algorithm can parse a string $w$ for a Chomsky Normal Form grammar in $O(|w|^3)$ time.
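For comparison, here is a compact sketch of the CYK table-filling idea. It assumes a Chomsky Normal Form grammar stored as a dictionary from variables to lists of right-hand-side tuples; the function name, the table layout, and the example grammar are choices made for this illustration.

```python
def cyk(productions, start, w):
    """Return True if the CNF grammar derives the string w."""
    n = len(w)
    if n == 0:
        return False                     # a CNF grammar never derives the empty string
    # table[i][length] = set of variables that derive w[i : i + length]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, ch in enumerate(w):           # substrings of length 1 use A -> a
        for head, bodies in productions.items():
            if (ch,) in bodies:
                table[i][1].add(head)
    for length in range(2, n + 1):       # longer substrings use A -> B C
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split]
                right = table[i + split][length - split]
                for head, bodies in productions.items():
                    for body in bodies:
                        if len(body) == 2 and body[0] in left and body[1] in right:
                            table[i][length].add(head)
    return start in table[0][n]


# A CNF grammar for strings of one or more 0's followed by the same number of 1's.
cnf_grammar = {
    "S": [("A", "T"), ("A", "B")],
    "T": [("S", "B")],
    "A": [("0",)],
    "B": [("1",)],
}
```

The three nested loops over `length`, `i`, and `split` are where the $O(|w|^3)$ bound comes from, treating the size of the grammar as a constant.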

3.2 Is a CFL empty?

We’ve already seen how to detect whether a variable generates terminal strings.

We apply that test and determine if the grammar’s start symbol generates any terminal strings.
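As a sketch, the emptiness test is just the generating-variable marking followed by one membership check. The grammar representation (a dictionary from variables to lists of right-hand-side tuples, with `()` for $\epsilon$) and the function name are assumptions of this example.

```python
def is_empty_language(productions, start):
    """Return True if the grammar's start symbol derives no terminal string."""
    variables = set(productions)
    generating = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head in generating:
                continue
            # A variable is generating if some body contains only terminals
            # and variables that are already known to be generating.
            if any(
                all(s in generating or s not in variables for s in body)
                for body in bodies
            ):
                generating.add(head)
                changed = True
    return start not in generating


nonempty = {"S": [("0", "S", "1"), ()]}
empty = {"S": [("A", "S")], "A": [("a",)]}   # S can never get rid of itself
```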

3.3 These are not decision properties for CFLs

No algorithm exists to determine if

Two CFLs are the same.
- Note that we were able to determine this for regular languages.
Two CFLs are disjoint (have no strings in common).

4 Closure Properties

4.1 Substitution

Given a CFG $G$, if we replace each terminal symbol by a set of strings that is itself a CFL, the result is still a CFL.

Example 3: Expressions over Mirrors

Here is a grammar for simple expressions:

\[ \begin{align} E &\rightarrow I \\ E &\rightarrow E + E \\ E &\rightarrow E * E \\ E &\rightarrow (E) \\ I &\rightarrow a \\ I &\rightarrow b \\ \end{align} \]

Here is our grammar for $ww^R$:

\[ \begin{align} S &\rightarrow 0 S 0 \\ S &\rightarrow 1 S 1 \\ S &\rightarrow \epsilon \\ \end{align} \]

Now, suppose that I wanted an expression language in which, instead of the variable name “a”, I could use any mirror-imaged string of 0’s and 1’s.

All I really have to do is to change ‘a’ in the first grammar to a variable, make that variable the new starting symbol for the $ww^R$ grammar, then combine the two grammars:

\[ \begin{align} E &\rightarrow I \\ E &\rightarrow E + E \\ E &\rightarrow E * E \\ E &\rightarrow (E) \\ I &\rightarrow S \\ I &\rightarrow b \\ S &\rightarrow 0 S 0 \\ S &\rightarrow 1 S 1 \\ S &\rightarrow \epsilon \\ \end{align} \]

It’s pretty obvious that this is still a CFG, so the resulting language is still a CFL.

4.2 Closure Under Union

Suppose you have CFLs $L_1$ and $L_2$.

Consider a language $L = \{1, 2\}$. This is clearly a CFL as well.

Now substitute $L_1$ for $1$ in $L$ and $L_2$ for $2$ in $L$. The resulting L-with-substitutions contains exactly the strings that are in $L_1$ or $L_2$, in other words $L_1 \cup L_2$.
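At the grammar level, this substitution amounts to adding a new start symbol with one production for each of the original start symbols. The sketch below illustrates that construction. It assumes grammars stored as dictionaries from variables to lists of right-hand-side tuples, and it renames variables with the suffixes `_1` and `_2` (an arbitrary choice) so the two grammars cannot interfere; the new start name `S0` is likewise hypothetical and is assumed not to clash with any existing symbol.

```python
def union_grammar(g1, start1, g2, start2, new_start="S0"):
    """Build a grammar for L(g1) union L(g2); returns (productions, start symbol)."""

    def rename(grammar, tag):
        variables = set(grammar)

        def ren(symbol):
            # Only variables are renamed; terminals are left alone.
            return f"{symbol}_{tag}" if symbol in variables else symbol

        return {
            ren(head): [tuple(ren(s) for s in body) for body in bodies]
            for head, bodies in grammar.items()
        }

    # New start symbol chooses between the two original start symbols.
    combined = {new_start: [(f"{start1}_1",), (f"{start2}_2",)]}
    combined.update(rename(g1, "1"))
    combined.update(rename(g2, "2"))
    return combined, new_start


g1 = {"S": [("0", "S", "1"), ()]}          # 0^n 1^n
g2 = {"S": [("a", "S"), ("a",)]}           # one or more a's
combined, start = union_grammar(g1, "S", g2, "S")
```

The result is plainly still a context-free grammar, which is the grammar-level view of closure under union.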

4.3 Other Closure Properties

Similar substitutions allow us to quickly show that the CFLs are closed under concatenation, Kleene closure, and homomorphisms.

CFLs are not closed under intersection or difference.

But the difference of a CFL and a regular language is still a CFL.

CFLs are closed under reversal. (No surprise, given the stack-nature of PDAs.)