Strings and Languages

CS390, Spring 2024

Last modified: Jan 3, 2023

Abstract

These notes discuss the mathematical abstractions of functions and languages, as described in Chapter 1 of Hopcroft et al. These are ideas that will be employed frequently in the remainder of the course, but because you will have encountered these terms in the context of programming, there can be a bit of a cognitive disconnect with their mathematical abstractions.

1 Functions

What do you think of when you think of a “function”?

As a programmer, you might think of the kind of function that you might implement in your favorite programming languages:

double abs (double x)
{
    if (x < 0.0)
        return -x;
    else
        return x;
}


char upCase (char c)
{
    if (c >= 'a' && c <= 'z')
        return c - 'a' + 'A';
    else
        return c;
}

int modulo (unsigned k, unsigned m)
{
    return k % m;
}

Alternatively, you might hearken back to some earlier mathematics classes and recall that some of those programming functions, like abs(x) above, were traditionally written as operators, e.g., $|x|$, and yet there were some traditional mathematical functions that were written in an almost programming-like style: $\tan(a)$, $\log(x)$.

So it’s tempting to think of a function as something that “computes” based on its parameters to generate a return value.

In this class, however, the notion of “function” is a more basic one. A function is a set of tuples, each tuple describing a mapping from one value to another. For example, we might describe the absolute value function as

\[ \{ (0, 0), (1, 1), (-1, 1), (2, 2), (-2, 2), \ldots \} \]

A tuple like $(-1, 1)$ expresses the idea that “given input -1, we should see output 1”, except that theoreticians don’t really want us using programmer-ese terms like “input” and “output”, so they would prefer that we say “-1 maps onto 1”. In fact, other authors will sometimes write these tuples using arrows to indicate that a mapping is going on:

\[ \{ (0 \Rightarrow 0), (1 \Rightarrow 1), (-1 \Rightarrow 1), (2 \Rightarrow 2), (-2 \Rightarrow 2), \ldots \} \]

Definition: For two sets A and B, a function from A to B is a subset of $A \times B$ in which no element of A appears as the first component of more than one tuple.
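Even when a function has infinitely many tuples, like the absolute value function above, we can still work with it in code by testing membership rather than storing the set. Here is a small C sketch of my own (the name inAbs is illustrative) that decides whether a given tuple belongs to that infinite set:

#include <stdio.h>

/* The absolute-value function is an infinite set of tuples,
   so we cannot store it in an array. We can, however, decide
   membership: is the tuple (x, y) in the set? */
int inAbs (double x, double y)
{
    return y == (x < 0.0 ? -x : x);
}

int main (void)
{
    printf("%d\n", inAbs(-1.0, 1.0));  /* prints 1: (-1, 1) is in the set */
    printf("%d\n", inAbs(2.0, -2.0));  /* prints 0: (2, -2) is not        */
    return 0;
}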

Not every function has an infinite number of tuples.

Suppose that I wanted to track the years of birth of a number of historical figures:

Who?                Born in
George Washington   1731
Thomas Jefferson    1743
John Adams          1735
Paul Revere         1735

I could consider that I have a set P of historical figures {washington, adams, jefferson, revere, madison, hamilton, …} and the set of natural numbers $\cal{N}$. My table above uses only a subset of the possible historical figures and a subset of the possible year numbers.

A single element of “type” $P \times \cal{N}$ would be a pair or tuple consisting of one historical figure and one year, for example, (washington, 1731).

The entire set $P \times \cal{N}$ would be the infinite set {(washington, 0), (washington, 1), …, (washington, 1731), …, (adams, 0), (adams, 1), …, (jefferson, 0), …}.

But our table represents a subset of that infinite set $P \times \cal{N}$, the subset: { (washington, 1731), (adams, 1735), (jefferson, 1743), (revere, 1735) }.

With this relation, we can answer questions such as “When was Adams born?” or “Do we know anyone born in 1742?”.
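To make the “function as a set of tuples” view concrete, here is a small C sketch of my own (the struct and function names are illustrative, not from the textbook) that stores this function as an array of tuples and answers “When was Adams born?” by searching the set:

#include <stdio.h>
#include <string.h>

/* One tuple of the function: (historical figure, birth year). */
struct Birth
{
    const char *who;
    int year;
};

/* The function itself is just this finite set of tuples. */
static const struct Birth births[] = {
    { "washington", 1731 },
    { "adams",      1735 },
    { "jefferson",  1743 },
    { "revere",     1735 }
};

/* "When was `who` born?": look for a tuple whose first
   component matches. Returns -1 if no such tuple exists. */
int birthYearOf (const char *who)
{
    int n = sizeof births / sizeof births[0];
    for (int i = 0; i < n; ++i)
        if (strcmp(births[i].who, who) == 0)
            return births[i].year;
    return -1;
}

int main (void)
{
    printf("%d\n", birthYearOf("adams"));   /* prints 1735             */
    printf("%d\n", birthYearOf("madison")); /* prints -1: not in our set */
    return 0;
}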

We could, just as easily, have tabulated the mapping in the other direction, from $\cal{N}$ to $P$: { (1731, washington), (1735, adams), (1743, jefferson), (1735, revere) }.

There is a difference between this function and its inverse. In the function { (washington, 1731), (adams, 1735), (jefferson, 1743), (revere, 1735) }, no domain element occurs in more than one tuple. We might say that, although we may not know the birth year of every historical figure, we believe that any given historical figure will have been born in one year, not two or more. Such a relation is said to be single-valued, and that is precisely what makes it a function: given one domain element, we will find at most one corresponding range element.

By contrast, the relation { (1731, washington), (1735, adams), (1743, jefferson), (1735, revere) } has two tuples where the domain element is 1735. There is no unique answer to the question “Which historical figure was born in 1735?” This is a one to many relation (but not a function).

“Functions as sets” can be easily related to our world of “functions as computational procedures”. For example, the function equivalent to

char upCase (char c)
{
    if (c >= 'a' && c <= 'z')
        return c - 'a' + 'A';
    else
        return c;
}

would reflect the fact that there are only a small number of possible values for char:

\[ \begin{array}{rcl}
\mbox{upCase} & = & \{ (0, 0), (1, 1), \ldots, (9, 9), \\
 & & (A, A), \ldots, (Z, Z), \\
 & & (a, A), \ldots, (z, Z), \\
 & & \ldots \}
\end{array} \]

(OK, I’m lazy. Even the 128 legal values for char is more than I really want to type out.)
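Rather than typing the tuples out, we can let the computer enumerate them. This little program (my own sketch, not from the text) feeds every legal char value through the upCase procedure above and prints the resulting tuple, using numeric character codes since some of the 128 values are unprintable:

#include <stdio.h>

char upCase (char c)
{
    if (c >= 'a' && c <= 'z')
        return c - 'a' + 'A';
    else
        return c;
}

int main (void)
{
    /* Print the function as a set: one tuple per legal char value. */
    for (int c = 0; c < 128; ++c)
        printf("(%d, %d)\n", c, (int) upCase((char) c));
    return 0;
}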

What about functions like

int modulo (unsigned k, unsigned m)
{
    return k % m;
}

that take two or more parameters? Well, the mathematical notion of “function” always maps one domain value onto one range value. The trick is to allow those domain values to be tuples in their own right.

So the mathematical equivalent of modulo would be a function with domain $\cal{N} \times \cal{N}$ and range $\cal{N}$.

\[ \begin{array}{rcl}
\mbox{modulo} & = & \{ ((0, 1), 0), ((0, 2), 0), ((0, 3), 0), \ldots \\
 & & ((1, 1), 0), ((1, 2), 1), ((1, 3), 1), \ldots \\
 & & ((2, 1), 0), ((2, 2), 0), ((2, 3), 2), ((2, 4), 2), \ldots \\
 & & \ldots \}
\end{array} \]

A tuple like $((1, 2), 1)$ expresses the idea that the tuple $(1,2)$ maps onto $1$ (because 1 modulo 2 is 1).

You can extend this to function domains comprising any number of different “inputs” of any combination of data types.
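We can even mirror this “tuple domain” view in C itself. Here is a sketch of my own (the names NatPair and moduloPair are illustrative): pack the two parameters into a single struct, so that the function literally maps one domain value, a pair, onto one range value.

/* A single domain value of "type" N x N. */
typedef struct
{
    unsigned k;
    unsigned m;
} NatPair;

/* Maps one domain value (a pair) onto one range value. */
unsigned moduloPair (NatPair p)
{
    return p.k % p.m;
}

/* The tuple ((1, 2), 1) corresponds to the call:
       NatPair p = { 1, 2 };
       moduloPair(p)         == 1   (because 1 modulo 2 is 1) */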

One thing that should be clear now is that, although the theoretical notion of “function” is different from what you may have expected, the theoretical “function” can capture the notion of function-as-computation. (Near the end of the course we’ll see that it can even capture some functions that can’t be computed!)

2 Languages

In many ways, this course is really about languages.

You know what a language is. You’ve been speaking in one since you were a toddler. The theoretical concept of “language” isn’t nearly as general as that of a “natural language”, though they are related.

You know what a language is. You’ve been programming in one or more for at least a few semesters now. The theoretical concept of “language” isn’t nearly as general as that of a “programming language” either, though the two are related, somewhat more closely than theoretical and natural languages are.

A language in CS theory is just a set of strings. Of course, strings are usually based upon characters, but we’ll generalize that to allow any finite set of symbols to serve as the alphabet for one of our languages. And since “symbols” can stand for just about anything, that gives us quite a lot of flexibility.

We do have one special string that needs to be pointed out immediately. The “empty string” is a string containing no symbols at all. In programming languages we generally write this as "". However, in our mathematical notation we do not usually surround our strings with quotation marks, and dropping the quotation marks would make "" completely invisible. So we will use $\epsilon$ (the lower-case Greek letter epsilon) to denote the empty string.

2.1 Languages as Sets

Because languages are sets, they can be manipulated using the same operators we use for ordinary sets, including $\in \; \cup \; \cap$, etc.

There are a few special operations/notations associated with languages as sets of strings, however.

If $A$ and $B$ are sets of strings, then

  • the concatenation $AB$ is the set $\{ xy \; | \; x \in A \mbox{ and } y \in B \}$ of all strings formed by following a string from $A$ with a string from $B$;
  • $A^k$ denotes $A$ concatenated with itself $k$ times, with $A^0 = \{ \epsilon \}$; and
  • the Kleene closure $A^* = A^0 \cup A^1 \cup A^2 \cup \ldots$ is the set of strings formed by concatenating zero or more strings from $A$.
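For finite languages, concatenation is easy to compute by brute force. The sketch below (the two example languages are my own choice) forms the concatenation AB; notice that the string “ab” is produced twice, once as “a” followed by “b” and once as “ab” followed by $\epsilon$, but since a language is a set, AB contains it only once.

#include <stdio.h>

int main (void)
{
    /* Two small finite languages over the alphabet {a, b}.
       "" is the empty string, epsilon. */
    const char *A[] = { "a", "ab" };
    const char *B[] = { "", "b" };
    int nA = sizeof A / sizeof A[0];
    int nB = sizeof B / sizeof B[0];

    /* AB contains every string xy with x in A and y in B. */
    for (int i = 0; i < nA; ++i)
        for (int j = 0; j < nB; ++j)
            printf("%s%s\n", A[i], B[j]);
    /* Prints a, ab, ab, abb. The duplicate "ab" arises in two
       different ways, but as a set AB = { a, ab, abb }. */
    return 0;
}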

2.2 Levels of Language Processing

A sentence in a language is any single string that is a member of that set. For natural language, the theoretical sentences are pretty much what we consider to be a natural sentence in that language. For programming languages, an entire program or separately compilable unit is usually a single sentence.

When we process non-theoretical languages, e.g., to interpret a statement in a natural language or to compile code in a programming language, we generally do so in three distinct levels:

Lexical
At this stage, we take in a continuous stream of … something … and divide it into discrete pieces called tokens. A token is chosen as the smallest unit that cannot be subdivided without losing information.

If we were processing written natural language, our tokens would likely be individual words and punctuation marks. Later in the processing, we might distinguish the token kinds of some of these words as nouns, verbs, and other parts of speech.

If we were processing spoken natural language, our tokens would likely be phonemes, the basic sounds that characterize a spoken language.

In programming languages, a token might be a single variable name, a reserved word, an operator, or possibly a “structural” punctuation mark such as { or }. Comments and whitespace are generally discarded as part of this “lexical analysis” or “scanning”.

We do this for many reasons, not the least of which is that it tremendously reduces the number of elements that need to be passed on for later, more expensive processing. A token in a compiler is generally an ADT (sketched in code after this list) that carries

  • A “token kind”, such as “variable”, “if”, “plus-sign”, etc. Typically all user-defined names are a single token kind, each reserved word is a distinct token kind, and each operator is a distinct token kind.
  • The lexeme, the original string of characters that constitute the token. For example, the user-defined variable names x and maximumValue would both be of the same token kind, but their lexemes would be different: “x” and “maximumValue”, respectively.
  • Location information indicating where we encountered this token. This is used when the compiler wants to issue error messages like “Illegal syntax in line 23 of myCode.cpp”.
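In C, that token ADT might look something like this sketch (the kind names and fields are illustrative, not taken from any particular compiler):

/* The possible token kinds: one for all user-defined names,
   one per reserved word, one per operator, etc. */
enum TokenKind
{
    TK_VARIABLE,
    TK_IF,
    TK_WHILE,
    TK_PLUS,
    TK_LBRACE,
    TK_RBRACE
};

struct Token
{
    enum TokenKind kind;    /* e.g., TK_VARIABLE                        */
    const char    *lexeme;  /* e.g., "x" or "maximumValue"              */
    const char    *file;    /* where we saw it, for error messages ...  */
    int            line;    /* ... like "line 23 of myCode.cpp"         */
};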
Syntactic
Syntax describes the possible ways that different tokens can be organized to make a “legal” sentence in the language.

In English, for example, we would recognize the pattern “noun active-verb noun period” as a likely sentence, such as “Giraffes eat leaves.” or “Jack sees Spot.” English speakers are often bewildered by German, where a common pattern a verb to appear at the end of a sentence allows.

In programming languages, you are familiar with a variety of structural rules such as “parentheses must be balanced”, or “an if is followed by an expression within parentheses”.

The process of trying to match a stream of incoming tokens against a language’s legal syntax is called syntactic analysis or parsing.
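As a tiny taste of what a parser checks, here is a C sketch of my own (far simpler than a real parser, which would consume tokens rather than raw characters) that tests one such structural rule, “parentheses must be balanced”:

#include <stdio.h>

/* Returns 1 if every '(' in s is matched by a later ')'
   and vice versa, 0 otherwise. */
int balanced (const char *s)
{
    int depth = 0;
    for (; *s != '\0'; ++s)
    {
        if (*s == '(')
            ++depth;
        else if (*s == ')' && --depth < 0)
            return 0;           /* a ')' with no matching '(' */
    }
    return depth == 0;          /* 0 if some '(' was never closed */
}

int main (void)
{
    printf("%d\n", balanced("if (x < (y + 1))"));  /* prints 1 */
    printf("%d\n", balanced("f(x))"));             /* prints 0 */
    return 0;
}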

Semantic
The semantics of a language is the mapping of lexical and syntactic structure onto “meaning”. It is semantics that says that “Giraffes eat leaves.” makes sense but that “Leaves eat giraffes.” does not, even though the syntactic structures are identical.

As a programmer, you see reserved words like while or if and the associated structure of condition and body, and you understand that one describes an action that can be repeated any number of times and the other describes an action that is done at most once. That is semantics. You also know of rules that say that the data types of the expressions that appear to each side of an assignment or relational operator must match. That is also semantics. Most programming languages require you to declare variables before you use them. That’s also semantics.

The language theory that we study in this course is largely concerned with syntax.

So, why then am I making such a point here about the theoretical form of “language”? First, let’s not discount the importance of understanding and being able to process lexical and syntactic structure. That is an issue in compiling programming languages and, moreover, in handling many kinds of complicated, structured data inputs.

But beyond that, the question of what strings belong to a given language turns out to be very important.

These statements are equivalent:

  • This course is all about the fundamental power and limits of computing.
  • This course is all about automata.
  • This course is all about languages.