Properties of Regular Languages

CS390, Fall 2021

Last modified: Aug 2, 2021

Abstract

In the previous modules, we have seen two very different ways to describe regular languages: by giving a FA or by a regular expression.

In this section we focus on the important properties of the languages themselves:

  • closure properties, which say how we can build new regular languages from combinations of existing ones, and
  • decision properties, which allow us to decide questions about regular languages.

We will also look at the problem of minimizing DFAs, reducing a DFA to the smallest possible number of states without changing the language it recognizes.

1 Introduction

In this section we focus on the important properties of the languages themselves: closure properties, decision properties, and the minimization of DFAs.

2 The Pumping Lemma

 

The FA on the right accepts strings over $\{a, b\}$ that start with ‘a’. This illustrates that an FA can distinguish input strings based upon the opening symbols of its input.

 

This FA, on the other hand, accepts strings over $\{a, b\}$ that end with ‘aa’. So FAs can also distinguish input strings based upon the closing symbols of their input.

If you are willing to set up enough states, you can use an FA to recognize any non-repeated pattern at the start and/or end of strings. But there are limits to how discriminating an FA can be “in between”.

The Pumping Lemma basically indicates that a FA can distinguish based upon some initial input symbols and also upon some closing input symbols, but because it has only a finite number of states it cannot “remember” everything that goes on in the middle. It must “loop” repeatedly back to the same state(s) in the middle of the input. (This is somewhat reminiscent of the argument we made in converting FAs to regular expressions.)

The Pumping Lemma is most useful as a way of proving that a language cannot possibly be accepted by an FA. The example given in the book

\[ L = \{x \in \{a,b\}^* | n_a(x) > n_b(x) \} \]

is a “classic” of CS theory.

Also of note is this language:

\[ L = \{0^{n} 1^{n} | n \geq 1\} \]

which is suggested in exercise 4.1.1a.

These languages have a certain practical implication. If you cannot use an FA to guarantee that two different characters occur the same number of times in a string, even if they are arranged conveniently with all of the occurrences of one character coming before any occurrences of the other, then you almost certainly cannot use an FA to solve the even more complicated problem of determining whether parentheses or brackets are balanced in a program. Such checks are, however, critical in compilers and other programming language processors. This tells us that the regular languages are not powerful enough to express some rather mundane characteristics of typical programming languages.

Let’s look at how we can use the Pumping Lemma to prove that this second language is not regular.

According to the Pumping Lemma, if this language were regular, then there would be a constant $n$ such that every string in the language of length at least $n$ could be divided into three pieces, $x$, $y$, and $z$, such that $xy^mz$ is also in the language for every $m \geq 0$. Let $n$ be that critical value for this language. Then consider the string $s = 0^n1^n$ (which is clearly longer than $n$). The Pumping Lemma says that, if this language is regular, we can divide this string into pieces $x$, $y$, and $z$ such that $|xy| \leq n$, $|y| > 0$, and $xy^mz$ is in the language for all $m \geq 0$.

Let’s try to divide that string into those pieces $x$, $y$, and $z$ such that $s = xyz$. There are two cases to consider:

  1. $y$ contains only zeros. So we have $x=0^i$, $y=0^j$ with $j > 0$, and $z=0^k1^n$, such that $i + j + k = n$. The Pumping Lemma says that $xy^2z$ would also need to be in the language, but $xy^2z = 0^{i+2j+k}1^n$. That string is in the language $0^n1^n$ only if $i + 2j + k = n$. Since

    \[ i + j + k = n = i + 2j + k\]

    has no solution when $j > 0$, this is not possible.

  2. $y$ contains at least one ‘1’. But then $xy$ must contain all $n$ of the 0’s plus at least one ‘1’, so $|xy| > n$, which violates the conditions of the Pumping Lemma.

Hence the Pumping Lemma tells us that $\{0^n1^n | n \geq 1\}$ cannot be a regular language. (This is often used as a justification for the informal claim that an FA can only “remember” a finite amount of prior history of what has been input into it.)
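The case analysis above can be spot-checked by brute force. The following is only a sketch for illustration, not a proof: it tries every split of $0^n1^n$ that satisfies $|xy| \leq n$ and $|y| > 0$ for a few small values of $n$, and pumps only a few times. The function names `in_language` and `pumpable_splits` and the bound `max_m` are made up for this example.

```python
def in_language(s: str) -> bool:
    """Membership test for {0^n 1^n | n >= 1}."""
    half = len(s) // 2
    return len(s) >= 2 and len(s) % 2 == 0 and s == "0" * half + "1" * half


def pumpable_splits(s: str, n: int, max_m: int = 3):
    """Return every split (x, y, z) of s with |xy| <= n and |y| > 0
    for which x y^m z stays in the language for all m in 0..max_m."""
    survivors = []
    for i in range(n + 1):                 # i = |x|
        for j in range(1, n - i + 1):      # j = |y|, so |xy| = i + j <= n
            x, y, z = s[:i], s[i:i + j], s[i + j:]
            if all(in_language(x + y * m + z) for m in range(max_m + 1)):
                survivors.append((x, y, z))
    return survivors


# For each small n, no legal split of 0^n 1^n survives pumping.
for n in range(1, 8):
    assert pumpable_splits("0" * n + "1" * n, n) == []
```

Every legal split puts $y$ entirely inside the block of zeros, so pumping changes the number of zeros without changing the number of ones, which is exactly case 1 of the argument.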

3 Closure Properties

We say that a set $S$ is closed under a binary operation $\oplus$ iff $\forall a, b \in S,\ a \oplus b \in S$. That is, pick any two elements of $S$ and apply the operator to them and the result is also in $S$.

Example 1: Familiar Closures

The set of integers, $\cal{Z}$, is closed under addition, because for any two integers $i$ and $j$, $i + j$ is still an integer. Similarly, the set of integers is closed under subtraction and multiplication.

The set of natural numbers, $\cal{N}$, on the other hand, is not closed under subtraction. We can prove this by example: 2 and 5 are both natural numbers, but $2 - 5$ is negative and therefore not a natural number.

In this section, we are interested in the kinds of operations that we can perform on regular languages and still be guaranteed that we will get a regular language as a result.

3.1 Closure Under Set Operations

Languages are sets of strings, and it is natural to think of the common set operations such as union, intersection, etc., as ways to combine languages. Clearly the set of all languages will be closed under any common set operations. The union/intersection/complement/difference of sets of strings will still yield a set of strings.

But do regular languages, the set of languages described by regular expressions and accepted by FAs, have the same kind of closure?

3.1.1 The Regular Languages are Closed under Union

We’ve actually already shown this by showing that, given a pair of FAs, we can easily construct an NFA that accepts the union of their languages.

Your text makes the equally simple argument that, given a pair of regular expressions $R_1$ and $R_2$, we can write $(R_1) + (R_2)$, which is still a regular expression and therefore still describes a regular language.

3.1.2 The Regular Languages are Closed under Complement

A complement of something, in English, is something that “completes” that something. For example, you may be familiar with the idea that all colors of light can be broken down into red, blue and green components. So if you were designing an adjustable mood lighting system and you had a red bulb and a blue bulb, you might see that as incomplete. If someone then handed you a green bulb, you might say that it complements the other two bulbs. (Don’t confuse “complement” with “compliment”, to make encouraging statements about something.)

A complement of a set S would be the set T that “completes” S in the sense that $S \cup T$ is the entire universe of set elements, and $S \cap T = \emptyset$, there is no overlap between $S$ and $T$.

For a language $L$, a complement would be the set of strings in the same universe ($\Sigma^*$) that are not members of $L$.

It’s not hard to show that regular languages are closed under complements.

 

Take any DFA for a regular language $L$. For example, consider this DFA for the language of strings over alphabet $\{a, b\}$ that end with “aa”. Suppose we want to take $\bar{L}$, the set of all strings over that same alphabet that do not end with “aa”.

 

That’s pretty easy to do with a DFA — just exchange all of the final and non-final states. Clearly the resulting FSA will accept exactly the strings that the original DFA would not have accepted.
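As a minimal sketch of that swap, assume a DFA is stored as a Python dictionary with a complete transition table (every state has a move on every symbol). The dictionary layout, the state names `q0` to `q2`, and the helper names are all assumptions made for this example.

```python
def complement(dfa):
    """Swap accepting and non-accepting states of a complete DFA."""
    return {
        "states": dfa["states"],
        "alphabet": dfa["alphabet"],
        "delta": dfa["delta"],
        "start": dfa["start"],
        "accepting": dfa["states"] - dfa["accepting"],
    }


def accepts(dfa, s):
    """Run the DFA on string s and report whether it ends in an accepting state."""
    state = dfa["start"]
    for ch in s:
        state = dfa["delta"][(state, ch)]
    return state in dfa["accepting"]


# A DFA for strings over {a, b} that end with "aa" (state names are made up).
ENDS_AA = {
    "states": {"q0", "q1", "q2"},
    "alphabet": {"a", "b"},
    "delta": {
        ("q0", "a"): "q1", ("q0", "b"): "q0",
        ("q1", "a"): "q2", ("q1", "b"): "q0",
        ("q2", "a"): "q2", ("q2", "b"): "q0",
    },
    "start": "q0",
    "accepting": {"q2"},
}

NOT_ENDS_AA = complement(ENDS_AA)
```

Note that the swap relies on the automaton being deterministic and complete. Applying the same swap to an NFA does not, in general, give the complement.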

One of the interesting things about this construction is that it’s so easy to do with FAs, but trying to write a regular expression that describes the complement of the language described by another regular expression can be quite tricky.

3.1.3 The Regular Languages are Closed under Intersection

One way to prove this is to modify our argument for closure under union.

  1. Start with our earlier construction for running two FAs “in parallel”.
  2. Then convert that NFA into a DFA. The resulting DFA will be labeled with sets of labels from the original two FAs, representing states that could be simultaneously active when we ran those FAs in parallel.
  3. Choose as final states in our new DFA only those states that contain a final state from each of the two original FAs. (When the originals are DFAs, this means that every label in the set is final.)

Alternatively, we can prove closure under intersection by reducing intersection to other operators:

\[ A \cap B = \overline{\bar{A} \cup \bar{B}} \]

and since the regular languages are closed under union and complement, they must be closed under intersection as well.
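For two complete DFAs over the same alphabet, the “run in parallel” idea can also be sketched directly as a product automaton whose states are pairs of original states. This is a shortcut to the same result rather than the NFA-then-DFA route listed in the steps. The dictionary layout, the example machines, and all names here are assumptions made for illustration.

```python
from itertools import product


def intersect(d1, d2):
    """Product DFA: state (p, q) tracks d1 in state p and d2 in state q."""
    alphabet = d1["alphabet"]
    states = set(product(d1["states"], d2["states"]))
    delta = {
        ((p, q), ch): (d1["delta"][(p, ch)], d2["delta"][(q, ch)])
        for (p, q) in states
        for ch in alphabet
    }
    # A pair is accepting only when both components are accepting.
    accepting = {(p, q) for (p, q) in states
                 if p in d1["accepting"] and q in d2["accepting"]}
    return {"states": states, "alphabet": alphabet, "delta": delta,
            "start": (d1["start"], d2["start"]), "accepting": accepting}


def accepts(dfa, s):
    """Run the DFA on string s and report whether it ends in an accepting state."""
    state = dfa["start"]
    for ch in s:
        state = dfa["delta"][(state, ch)]
    return state in dfa["accepting"]


# Strings over {a, b} that start with 'a' (s2 is a trap state).
STARTS_A = {
    "states": {"s0", "s1", "s2"}, "alphabet": {"a", "b"},
    "delta": {("s0", "a"): "s1", ("s0", "b"): "s2",
              ("s1", "a"): "s1", ("s1", "b"): "s1",
              ("s2", "a"): "s2", ("s2", "b"): "s2"},
    "start": "s0", "accepting": {"s1"},
}

# Strings over {a, b} that end with "aa".
ENDS_AA = {
    "states": {"q0", "q1", "q2"}, "alphabet": {"a", "b"},
    "delta": {("q0", "a"): "q1", ("q0", "b"): "q0",
              ("q1", "a"): "q2", ("q1", "b"): "q0",
              ("q2", "a"): "q2", ("q2", "b"): "q0"},
    "start": "q0", "accepting": {"q2"},
}

BOTH = intersect(STARTS_A, ENDS_AA)
```

Changing the `and` in the accepting-state test to `or` would give a DFA for the union instead.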

3.1.4 The Regular Languages are Closed under Difference

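Set difference selects all of the elements in one set that are not present in a second set. To show that the regular languages are closed under difference, we only have to note that difference can be written in terms of intersection and complement, both of which preserve regularity: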

\[ A - B = A \cap \bar{B} \]

3.2 Other Closure Properties

3.2.1 The Regular Languages are Closed under Reversal

If we have an FA for a language $L$, we can construct an FA that accepts $L^R$, the set of strings that are reversed versions of some string in $L$.

For example, if we start with this FA for strings over $\{a, b\}$ that do not end with “aa”,

 

  1. Reverse all of the arcs in the transition diagram.

     

    If this leaves us with any states that lack a transition on some symbol of $\Sigma$, add a transition to a special, unescapable “trap” state.

     

  2. Make the start state the only accepting state.

     

  3. Create a new start state with transitions on $\epsilon$ to each of the original accepting states.

 

You may be able to see that this automaton accepts all strings that do not begin with “aa” (i.e., that are reversed forms of strings that do not end in “aa”). If not, here is the JFLAP file. Try running this automaton and convince yourself that it works.
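The three steps can be sketched in code as follows. This is an illustrative sketch that assumes a dictionary representation for the automata, uses the empty string `""` as the label for $\epsilon$ moves, and invents the name `new_start` for the added state. The trap state is left out because, in an NFA, a missing transition simply ends that path.

```python
def reverse(dfa):
    """Build an NFA (with epsilon moves) for the reversal of the DFA's language."""
    new_start = "new_start"  # hypothetical name, assumed not already in use
    delta = {}
    for (src, ch), dst in dfa["delta"].items():
        delta.setdefault((dst, ch), set()).add(src)     # step 1: reverse each arc
    delta[(new_start, "")] = set(dfa["accepting"])      # step 3: epsilon to old finals
    return {"delta": delta, "start": new_start,
            "accepting": {dfa["start"]}}                # step 2: old start accepts


def eps_closure(nfa, states):
    """All states reachable from `states` using only epsilon moves."""
    stack, seen = list(states), set(states)
    while stack:
        for nxt in nfa["delta"].get((stack.pop(), ""), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def nfa_accepts(nfa, s):
    """Simulate the NFA on s by tracking the set of active states."""
    current = eps_closure(nfa, {nfa["start"]})
    for ch in s:
        moved = set()
        for q in current:
            moved |= nfa["delta"].get((q, ch), set())
        current = eps_closure(nfa, moved)
    return bool(current & nfa["accepting"])


# DFA for strings over {a, b} that do NOT end with "aa" (state names made up).
NOT_ENDS_AA = {
    "delta": {("q0", "a"): "q1", ("q0", "b"): "q0",
              ("q1", "a"): "q2", ("q1", "b"): "q0",
              ("q2", "a"): "q2", ("q2", "b"): "q0"},
    "start": "q0", "accepting": {"q0", "q1"},
}

NOT_BEGINS_AA = reverse(NOT_ENDS_AA)
```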

Could we do a simpler DFA than this? Certainly. You might want to try converting this to a DFA and minimizing that DFA (either by hand or in JFLAP). But having done the construction is enough to demonstrate the closure.

Your text carries out the formal argument that this construction can be shown in general to guarantee closure.

3.2.2 The Regular Languages are Closed under Homomorphism

A homomorphism in general is a transformation (“morph”) accomplished by a uniformly applied (“homo” as in “homogeneous”) substitution.

In this context, we are considering the substitution of strings for characters in $\Sigma$.

Single-Character Substitutions

Now it should be obvious that, if we have a regular language $L$, we could substitute characters in $\Sigma$ by other single characters and the resulting language would still be regular.

 

For example, if I show you this FA, you would have to agree that, whatever language it accepts, that language is regular.

 

And, having said that, you might agree that if I replace ‘a’ by ‘x’, ‘b’ by ‘y’, and ‘c’ by ‘z’, the existence of the resulting FA proves that the language is still regular.

 

In fact, the replacement doesn’t even have to be one-to-one. If I replace ‘a’ and ‘b’ by ‘w’, and replace ‘c’ by ‘z’, the FA becomes non-deterministic, but it’s still an FA and so the language that it accepts is therefore still regular.

String Substitutions

OK, so single character homomorphisms preserve regularity. But what about replacing each character in the language by an entire string?

After all, if we wanted to replace ‘a’ by “cat” and ‘b’ by “dog”, we can’t go from

 

to

 

can we? That’s not a FA.

But we can do this: “Split” each of the original single-character arcs by introducing intermediate states.

 

Your text formalizes this construction to prove the regular languages are closed under homomorphism.
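Here is a rough sketch of the arc-splitting idea, assuming every symbol is replaced by a non-empty string. The dictionary layout, the tuple names used for the intermediate states, and the function names are assumptions made for this example, not the construction as formalized in the text.

```python
def apply_homomorphism(dfa, h):
    """Replace each arc labeled ch by a chain of arcs spelling out h[ch]."""
    delta = {}
    for (src, ch), dst in dfa["delta"].items():
        word = h[ch]
        if not word:
            raise ValueError("this sketch assumes non-empty replacement strings")
        prev = src
        for i, out in enumerate(word):
            # The last letter lands on the original target; earlier letters
            # land on fresh intermediate states named (src, ch, i).
            nxt = dst if i == len(word) - 1 else (src, ch, i)
            delta.setdefault((prev, out), set()).add(nxt)
            prev = nxt
    return {"delta": delta, "start": dfa["start"],
            "accepting": set(dfa["accepting"])}


def nfa_accepts(nfa, s):
    """Simulate an NFA without epsilon moves on string s."""
    current = {nfa["start"]}
    for ch in s:
        current = set().union(*(nfa["delta"].get((q, ch), set()) for q in current))
    return bool(current & nfa["accepting"])


# DFA for strings over {a, b} that end with "aa" (state names made up).
ENDS_AA = {
    "delta": {("q0", "a"): "q1", ("q0", "b"): "q0",
              ("q1", "a"): "q2", ("q1", "b"): "q0",
              ("q2", "a"): "q2", ("q2", "b"): "q0"},
    "start": "q0", "accepting": {"q2"},
}

ENDS_CATCAT = apply_homomorphism(ENDS_AA, {"a": "cat", "b": "dog"})
```

The result is treated as an NFA because, as with the many-to-one single-character substitution, two different arcs out of a state may end up starting with the same letter.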

3.2.3 The Regular Languages are Closed under Reverse Homomorphism

A reverse homomorphism (many textbooks call this an inverse homomorphism) replaces entire strings in a language by individual symbols. This is fairly easy to envision in a “set of strings” view, e.g., if I had a language of all strings ending in “aa”:

\[ \{ aa, aaa, baa, aaaa, abaa, baaa, bbaa, \ldots \} \]

and I decided to replace all strings “ba” by ‘w’, I would get

\[ \{ aa, aaa, wa, aaaa, awa, waa, bwa, \ldots \} \]

What I would be left with is still a language, but it’s hard to describe exactly what’s in it. If I had started with an FA or regular expression for the language, it would be hard to even describe the substitution.

Nonetheless, the regular languages are closed under this operation as well.

4 Decision Problems for Regular Languages

Decision problems are true/false questions that we might want to ask. We would like to know if certain questions are decidable, that is, whether there is some systematic way of answering the question.

Closely related is the idea of a decision procedure, an algorithm for computing the answer to such a problem. We frequently demonstrate that a problem is decidable by actually giving an algorithm for deciding it.

When we talk about “decision procedures” or “algorithms”, we refer to step-by-step procedures that are guaranteed to terminate with an answer.

You cannot, for example, answer the question “is P decidable for any regular language L?” by starting with “make a list of all the strings in L” or “loop through all strings in L” because many regular languages (perhaps, most of the interesting ones) are infinite, so such a loop might not terminate.

Many decision procedures for regular languages work by treating an FA as a graph and applying common graph algorithms to it. If that doesn’t ring a bell, and particularly if you have not yet taken CS361, you might want to look at Graphs – the Basics and Traversing a Graph. You don’t really need anything more than the idea of traversing a graph, as that is enough to let you answer questions like “can I go from q0 to q12?” or “which states can I reach starting from q4?”, which are the kind of questions you need for most regular language decision problems.

4.1 Is a Regular Language Empty?

If someone shows us a regular language as a set of strings, this is trivially easy to answer. But we could ask the same question about a regular language presented to us in the form of an FA or a regular expression.
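For a language given as an FA, one way such a check could go is the graph-traversal idea mentioned earlier: the language is empty exactly when no accepting state can be reached from the start state. The sketch below assumes a dictionary representation of a DFA. The layout and the names are made up for this example.

```python
from collections import deque


def is_empty(dfa):
    """True if no accepting state is reachable from the start state."""
    seen = {dfa["start"]}
    queue = deque([dfa["start"]])
    while queue:
        state = queue.popleft()
        if state in dfa["accepting"]:
            return False               # found a path to an accepting state
        for ch in dfa["alphabet"]:
            nxt = dfa["delta"].get((state, ch))
            if nxt is not None and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True


# Accepts strings ending in "aa", so its language is not empty.
ENDS_AA = {
    "alphabet": {"a", "b"},
    "delta": {("q0", "a"): "q1", ("q0", "b"): "q0",
              ("q1", "a"): "q2", ("q1", "b"): "q0",
              ("q2", "a"): "q2", ("q2", "b"): "q0"},
    "start": "q0", "accepting": {"q2"},
}

# The only accepting state, r1, can never be reached from r0.
UNREACHABLE_FINAL = {
    "alphabet": {"a", "b"},
    "delta": {("r0", "a"): "r0", ("r0", "b"): "r0",
              ("r1", "a"): "r1", ("r1", "b"): "r1"},
    "start": "r0", "accepting": {"r1"},
}
```

The loop visits each state at most once, so it always terminates, which is what makes this a decision procedure.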

4.2 Is the String s a Member of a Regular Language L?

4.3 Do Two FAs Describe the Same Language?

Before we can answer this question, we need to introduce an intermediate problem: given any two states in an FA, are those states equivalent or distinguishable?

We’ll say that two states $p$ and $q$ are equivalent if

  1. Both are accepting or neither is accepting.

    Formally, $p \in F \Leftrightarrow q \in F$.

  2. If we feed any string $w$ into the FA starting from $p$ and then again starting from $q$, we wind up in states that agree on acceptance: both accept or neither does.

    Formally,

    \[ \forall w \in \Sigma^*, \hat{\delta}(p,w) \in F \Leftrightarrow \hat{\delta}(q,w) \in F \]

If two states are not equivalent, we say that they are distinguishable.

4.3.1 Distinguishable States

Numerous studies have shown that most people cannot distinguish the taste of one of the two most popular-selling cola brands from the other. If I were red-green color blind, I would be unable to distinguish red objects from green ones. At night, most of us cannot distinguish blue objects from green ones. I have a relative who is tone-deaf. She cannot distinguish a C from the neighboring D.

So what does it mean when we talk about a state or an FA not being able to distinguish one string from another? It means that it cannot tell them apart, no matter what we do in the way of further inspection.

 

For example, consider this FA for the language of strings in which every ‘b’ is followed immediately by an ‘a’. If I were to execute this on the strings “aaab” and “abba”, for example, winding up in states Y and W, respectively, both strings are “not accepted” as is.

So do we really need both W and Y in our FA? Yes. These are distinguishable by this language, because if I consider adding the string “a” to each of them, then “aaaba” is accepted but “abbaa” is still not accepted. So it is possible to tell those two states apart by feeding additional characters into the FA.

We can determine which pairs of states in a DFA are distinguishable from others by a “table filling algorithm”. We set up a table mapping each state onto each other state. We make a first pass by marking all of the accepting states as distinguishable from the non-accepting states.

Then the algorithm extends those sets of states that can be distinguished by taking each unmarked pair (q,r) that we have not analyzed yet and considering where we could go by inputting one more additional character. If those would take us to a pair of states already known to be distinguished, then (q,r) are distinguished as well. We keep making passes over the table until a complete pass marks no new pairs.
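A compact sketch of the table-filling idea follows. It stores the “table” as a set of marked (distinguishable) unordered pairs rather than as a grid, and it assumes a complete DFA. The function name and the small three-state example, in which `B` and `C` are deliberately redundant, are made up for illustration.

```python
from itertools import combinations


def distinguishable_pairs(states, alphabet, delta, accepting):
    """Return the set of unordered state pairs that are distinguishable."""
    ordered = sorted(states)
    # First pass: accepting vs. non-accepting states.
    marked = {frozenset((p, q)) for p, q in combinations(ordered, 2)
              if (p in accepting) != (q in accepting)}
    changed = True
    while changed:                      # repeat until a pass marks nothing new
        changed = False
        for p, q in combinations(ordered, 2):
            pair = frozenset((p, q))
            if pair in marked:
                continue
            for ch in alphabet:
                successors = frozenset((delta[(p, ch)], delta[(q, ch)]))
                if successors in marked:
                    marked.add(pair)
                    changed = True
                    break
    return marked


# Strings over {a, b} ending in 'a'; B and C behave identically.
STATES = {"A", "B", "C"}
ALPHABET = {"a", "b"}
DELTA = {("A", "a"): "B", ("A", "b"): "A",
         ("B", "a"): "C", ("B", "b"): "A",
         ("C", "a"): "B", ("C", "b"): "A"}
ACCEPTING = {"B", "C"}

MARKED = distinguishable_pairs(STATES, ALPHABET, DELTA, ACCEPTING)
```

Using unordered pairs sidesteps the question of whether to look at position $(i,j)$ or $(j,i)$. When both states move to the same state on a symbol, the successor “pair” collapses to a single state and is never marked.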

 

Let’s look at this FA as an example.

 

Represent the possible pairs of states with a table. The text shows only the lower half of these tables, as it will be symmetric about the diagonal. But I find that annoying because I don’t want to have to worry about whether I should be looking at position $(i,j)$ or $(j,i)$.

We’ll put a number in a position (i,j) to indicate that the states i and j cannot be combined because they represent states that are known to be distinguishable.


4.4 Determining if Two FAs Accept the Same Language

With the ability to determine if two states are equivalent, we can now answer the question of whether two FAs accept the same language.

  1. Pretend that we are going to run the two FSAs in parallel, much as we have done before.
  2. Run the table-filling algorithm on the combined set of states from both FSAs.
  3. Check the starting states of the original FSAs. If they are marked as distinguished, that means that there exists at least one input string on which the two FSAs disagree over whether to accept that string. Thus they do not accept the same language. If the two states are unmarked and therefore equivalent, then no such string exists and the two FSAs agree on every possible input string, meaning that they accept the same language.
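These three steps might look like the following in code. This is a sketch under some assumptions: both machines are complete DFAs over the same alphabet, each is stored as a dictionary, and states are tagged with 1 or 2 so that the two state sets cannot collide. All names are made up for this example.

```python
from itertools import combinations


def same_language(d1, d2):
    """Decide whether two complete DFAs accept the same language."""
    machines = {1: d1, 2: d2}
    # Step 1: pool the states of both DFAs, tagged by machine.
    states = ([(1, q) for q in sorted(d1["states"])] +
              [(2, q) for q in sorted(d2["states"])])

    def step(state, ch):
        tag, q = state
        return (tag, machines[tag]["delta"][(q, ch)])

    def is_final(state):
        tag, q = state
        return q in machines[tag]["accepting"]

    # Step 2: table filling over the pooled states.
    marked = {frozenset((p, q)) for p, q in combinations(states, 2)
              if is_final(p) != is_final(q)}
    changed = True
    while changed:
        changed = False
        for p, q in combinations(states, 2):
            if frozenset((p, q)) in marked:
                continue
            if any(frozenset((step(p, ch), step(q, ch))) in marked
                   for ch in d1["alphabet"]):
                marked.add(frozenset((p, q)))
                changed = True

    # Step 3: the languages match iff the two start states are unmarked.
    return frozenset(((1, d1["start"]), (2, d2["start"]))) not in marked


# Two different DFAs for "ends in a", and one for "ends in b".
ENDS_A_SMALL = {"states": {"X", "Y"}, "alphabet": {"a", "b"},
                "delta": {("X", "a"): "Y", ("X", "b"): "X",
                          ("Y", "a"): "Y", ("Y", "b"): "X"},
                "start": "X", "accepting": {"Y"}}
ENDS_A_BIG = {"states": {"A", "B", "C"}, "alphabet": {"a", "b"},
              "delta": {("A", "a"): "B", ("A", "b"): "A",
                        ("B", "a"): "C", ("B", "b"): "A",
                        ("C", "a"): "B", ("C", "b"): "A"},
              "start": "A", "accepting": {"B", "C"}}
ENDS_B = {"states": {"P", "Q"}, "alphabet": {"a", "b"},
          "delta": {("P", "a"): "P", ("P", "b"): "Q",
                    ("Q", "a"): "P", ("Q", "b"): "Q"},
          "start": "P", "accepting": {"Q"}}
```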

4.5 Minimizing the Number of States in a Finite Automaton

We can also make use of this idea of equivalent/distinguishable states to minimize the number of states in a DFA.

  1. Start by determining which pairs of states are distinguishable as described above.

  2. Look at the unmarked spaces in the final table. Those represent pairs of equivalent states. Equivalence is transitive: if A and B are equivalent, and B and C are equivalent, then A and C are equivalent.

  3. Partition the set of states by taking the transitive closure of the state equivalence for each state. This yields multiple sets of states, with each set containing only states that are equivalent to one another.

  4. Create an FA with a state for each set in that partition. The set containing the original start state becomes the new start state, and the sets containing accepting states become the accepting states.

  5. Trace the state transitions from each of these combined states for each symbol in $\Sigma$ to get the new transitions.
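The merging steps can be sketched as below, assuming the set of marked (distinguishable) pairs has already been computed and is passed in as a set of two-element `frozenset`s. The dictionary layout, the function names, and the tiny example with its hand-written marked pairs are assumptions for illustration. The sketch does not remove states that are unreachable from the start state.

```python
def merge_equivalent(dfa, marked):
    """Collapse each group of mutually equivalent states into one state."""
    def block_of(q):
        # All states not marked as distinguishable from q (including q itself).
        return frozenset(r for r in dfa["states"]
                         if r == q or frozenset((q, r)) not in marked)

    blocks = {block_of(q) for q in dfa["states"]}
    delta = {}
    for blk in blocks:
        representative = next(iter(blk))   # any member gives the same target block
        for ch in dfa["alphabet"]:
            delta[(blk, ch)] = block_of(dfa["delta"][(representative, ch)])
    return {"states": blocks, "alphabet": dfa["alphabet"], "delta": delta,
            "start": block_of(dfa["start"]),
            "accepting": {blk for blk in blocks if blk & dfa["accepting"]}}


def accepts(dfa, s):
    """Run the DFA on string s and report whether it ends in an accepting state."""
    state = dfa["start"]
    for ch in s:
        state = dfa["delta"][(state, ch)]
    return state in dfa["accepting"]


# Strings ending in 'a'; B and C are equivalent, A is distinguishable from both.
ENDS_A = {"states": {"A", "B", "C"}, "alphabet": {"a", "b"},
          "delta": {("A", "a"): "B", ("A", "b"): "A",
                    ("B", "a"): "C", ("B", "b"): "A",
                    ("C", "a"): "B", ("C", "b"): "A"},
          "start": "A", "accepting": {"B", "C"}}
MARKED = {frozenset(("A", "B")), frozenset(("A", "C"))}

MINIMAL = merge_equivalent(ENDS_A, MARKED)
```

Each new state is a `frozenset` of original state names, which mirrors the way the combined states are labeled with the states they replace.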

 

Let’s walk through the minimization of this FA.

 

We’ve already constructed the table of equivalent/distinguishable state pairs. So let’s focus on the unmarked spaces.

Looking row by row, we see that

  • State 0 is distinguishable from every other state.
  • States 1, 2, and 5 are equivalent and can be combined.
  • States 3, 4, and 8 are equivalent and can be combined.
  • States 6 and 7 are equivalent and can be combined.
  • State 9 is distinguishable from every other state.

 

And we’re done.

You can run this minimization algorithm in JFLAP, but it uses a different way of visualizing the process of dividing up the states.

 

Starting from the same FA,

 

Start by dividing the set of all states into a set of final (accepting) states and a set of non-final states.

Now, pick one of those sets. Pick an input symbol and look at where each of the states in the set goes on that input. If all of the states in this set transition to the same set, leave this one alone. But if they transition on that input to two or more different existing sets, divide this one accordingly.

Repeat until you cannot find an input that splits any of the existing sets of states.
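This set-splitting view can be sketched as follows. It is a plain, unoptimized illustration and is not JFLAP’s actual implementation. The function name `refine` and the four-state example, in which `q0` and `q3` are deliberately redundant, are made up, and the DFA is assumed to be complete.

```python
def refine(states, alphabet, delta, accepting):
    """Split the states into groups that no input symbol can split further."""
    # Start with the final / non-final division (dropping an empty side).
    partition = [blk for blk in (set(accepting), set(states) - set(accepting))
                 if blk]
    changed = True
    while changed:
        changed = False
        for blk in partition:
            for ch in sorted(alphabet):
                # Group the states of blk by the existing set they move to on ch.
                groups = {}
                for q in blk:
                    target = next(i for i, b in enumerate(partition)
                                  if delta[(q, ch)] in b)
                    groups.setdefault(target, set()).add(q)
                if len(groups) > 1:        # this input splits the set
                    partition.remove(blk)
                    partition.extend(groups.values())
                    changed = True
                    break
            if changed:
                break                      # restart the scan on the new partition
    return partition


# Strings ending in "aa", with q3 behaving exactly like q0.
STATES = {"q0", "q1", "q2", "q3"}
ALPHABET = {"a", "b"}
DELTA = {("q0", "a"): "q1", ("q0", "b"): "q0",
         ("q1", "a"): "q2", ("q1", "b"): "q0",
         ("q2", "a"): "q2", ("q2", "b"): "q3",
         ("q3", "a"): "q1", ("q3", "b"): "q3"}
ACCEPTING = {"q2"}

GROUPS = refine(STATES, ALPHABET, DELTA, ACCEPTING)
```

In this example the non-final set is split on input ‘a’, because `q1` moves to the final set while `q0` and `q3` stay in the non-final one. After that no input splits anything further.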

first prev1 of 4next last