Patterns for Text: Regular Expressions

Last modified: May 25, 2025

If wildcards provide a way to write patterns for file and directory paths, can we also write patterns for text strings? Yes, but this is not built into the shell for use by every command, the way that wildcards are.

Instead, most Unix programs and commands that do some kind of searching or matching for text will share a common notation for patterns of text to be matched. This notation is called regular expressions.

1 Searching for Text

For example, almost every text editor in any operating system will allow you to search a file for a given string. But most Unix text editors (including emacs and vim) will allow you to search for any string matching a regular expression “pattern”. sed, a useful utility for doing simple changes to text files, is most often invoked to use its “substitute” command, which replaces any text matching a regular expression by some desired replacement text. The csplit command splits a single file into multiple pieces, where the point of division is most often indicated via a regular expression. Perl and awk, available on most Unix systems but not covered in this course, are scripting (programming) languages with a heavy emphasis on text manipulation, which is accomplished largely through matching on regular expressions.

1.1 Searching Lines with grep

In an earlier example, we saw that the program grep can be used to list all lines of a file that match a given string. For example,

grep 'def' /usr/include/math.h

would list all lines in the indicated file that contain the string “def”.

The first parameter (‘def’) is actually an example of a regular expression, a special notation for writing patterns for searching and matching text.

As we will see shortly, the notation for these patterns is such that a pattern that matches exactly one string (e.g., “def”) is written as that same string — “def” is both the pattern and the string that it matches.

But regular expressions are much more powerful than that. We can write a regular expression pattern to match a wide range of related strings.

First, though, let’s look at another command that makes heavy use of regular expressions.

1.2 Rewriting lines with sed

Many other commands besides grep will use regular expressions. sed, for example, allows you to enter a variety of editing commands that will be applied to every line of a file. A common use of sed is to scan each line of the file for a pattern and to replace that pattern, wherever it occurs, by some string. The sed command to do this is

sed s/pattern/replacement/g filename

where filename is the file whose contents we want to scan and replace, pattern is a regular expression describing the text to search for in each line, and replacement is the text by which we wish to replace anything that matches the pattern. (The ‘/’ characters simply mark the beginning and end of the pattern and replacement strings. They can be replaced by any other character that does not appear in either the pattern or replacement strings.)
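
The choice of delimiter matters most when the pattern itself contains slashes. The following is a minimal sketch; the sample line and the variable names are made up for illustration and do not come from the course files.

```shell
# Hypothetical sample line; any text containing a path would do.
line='include /usr/local/lib'

# With '/' as the delimiter, every slash in the pattern must be escaped.
with_slash=$(printf '%s\n' "$line" | sed 's/\/usr\/local/\/opt/')

# With ',' as the delimiter, the slashes need no escaping.
with_comma=$(printf '%s\n' "$line" | sed 's,/usr/local,/opt,')

echo "$with_slash"
echo "$with_comma"
```

Both commands should produce the same line; the second is simply easier to read.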

Example 1: Try This: Substitutions with sed
cd ~/playing
cp ~cs252/Assignments/ftpAsst/alas.txt .
more alas.txt
Now try operating on that file with sed:
sed s/o/X/g alas.txt
The ‘g’ at the end of the prior example indicates that the change should be applied every time a match is found (i.e., this is a global replacement). If the ‘g’ is dropped, only the first match in each line will be replaced.

Try the command
sed s/o/X/ alas.txt
to see the effect of dropping the ‘g’.

Neither the pattern nor the replacement is limited to single letters.

Try:
sed 's!I!you!g' alas.txt
sed 's@Horatio@George@g' alas.txt
By default, sed is case-sensitive. In GNU sed, you can add an ‘i’ flag at the end of the substitution to change this. Compare the outputs of these commands:
sed 's/I/you/g' alas.txt
sed 's/I/you/gi' alas.txt
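
If alas.txt is not at hand, the effect of the ‘g’ flag can also be seen on a single line of text. This is only a sketch: the sample sentence is invented, and it is fed to sed through a pipe instead of a file.

```shell
# Hypothetical sample text, piped into sed instead of read from a file.
sample='look to the moon'

# Without 'g': only the first 'o' on the line is replaced.
first_only=$(printf '%s\n' "$sample" | sed 's/o/X/')

# With 'g': every 'o' on the line is replaced.
every_one=$(printf '%s\n' "$sample" | sed 's/o/X/g')

echo "$first_only"   # lXok to the moon
echo "$every_one"    # lXXk tX the mXXn
```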

2 Regular Expressions

To write more powerful patterns, we need to understand how regular expressions are constructed.

A regular expression consisting of a single “non-special” character will match any string containing that character.
As it happens, none of the alphabetic and numeric characters are “special”, so the regular expression d, for example, would match any string containing a “d”.

Example 2: Try This: Regular expressions - basic characters
cd ~/playing
grep H alas.txt
Since grep works line by line, this would select every line containing an “H”.

If a set of regular expressions $r_1, r_2, \ldots, r_k$ are concatenated together to form a single larger regular expression $r_1r_2\ldots r_k$ , it matches any string that contains a substring formed from a concatenation of strings $s_1s_2\ldots s_k$ , each of which matches the corresponding regular expression.
So when we write
```
grep def /usr/include/math.h
```
the def is actually the concatenation of three regular expressions d, e, and f, and matches any string that contains a substring matching d followed immediately by a substring matching e followed immediately by a substring matching f.

That may seem an unnecessarily complex way to get to the original idea of “matches the string ‘def’,” but this idea of concatenation is a general one that becomes more important as we consider other combinations of smaller regular expressions.

Example 3: Try This: Regular expressions - concatenation

Compare the results of
grep H ~/playing/alas.txt
grep or ~/playing/alas.txt
grep Hor ~/playing/alas.txt

The real power of regular expressions comes into play when we consider the various “special” characters that serve as regular expression operators.

Regular expressions can be grouped in parentheses, written as \(…\).
- The backslash is a special character, not only to grep and sed, but also to the command shell. So any command parameters that use it will need to be quoted.

Example 4: Try This: Regular expressions - parentheses

Based on the definition of parentheses, these two commands should do exactly the same thing:
grep 'def' /usr/include/math.h
grep '\(def\)' /usr/include/math.h
The introduction of the parentheses does not change what is matched. It does, however, group things together (just as parentheses do in conventional algebra), which we can take advantage of with the operators we will introduce shortly.

But compare also
grep 'x' /usr/include/math.h
grep '\(x\)' /usr/include/math.h
grep '(x)' /usr/include/math.h

The vertical line | separating two regular expressions means that a string matching either of those regular expressions would be accepted.
- Again, though, to get this special behavior, the | must be preceded by a backslash.
- This vertical bar is called the alternation operator. Alternation here means a choice (same root as “alternative”), not alternating from one thing to another and back again.

Example 5: Try This: Regular expressions - Alternation

Try:
grep '\(Y\|H\)' ~/playing/alas.txt
grep '\(Y\|H\)or' ~/playing/alas.txt
sed 's/\(Y\|H\)or/XXX/g' ~/playing/alas.txt
It is worth noting that, in sed, the expression that we match the text against in a substitution is a regular expression, but the replacement text is not. Try:
sed 's/\(Y\|H\)or/|||/g' ~/playing/alas.txt
sed 's/\(Y\|H\)or/\|\|\|/g' ~/playing/alas.txt
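
To see why the grouping matters, compare a grouped alternation with an ungrouped one. This is a sketch that assumes GNU grep and sed (the \| operator is an extension to the basic syntax); the four sample words are made up and are not taken from alas.txt.

```shell
# Hypothetical word list used only for this illustration.
words=$(printf '%s\n' Horatio Yorick Yes orchard)

# Grouped: (Y or H), followed by "or".
grouped=$(printf '%s\n' "$words" | grep '\(Y\|H\)or')

# Ungrouped: either "Y" alone, or "Hor".
ungrouped=$(printf '%s\n' "$words" | grep 'Y\|Hor')

# In the replacement text, | is an ordinary character.
bars=$(printf '%s\n' Horatio | sed 's/\(Y\|H\)or/|||/')

echo "$grouped"
echo "$ungrouped"
echo "$bars"
```

The ungrouped pattern also selects “Yes”, because without the parentheses the “or” is attached only to the H.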

Square brackets [ ] containing any set of characters not beginning with ^ will match a string containing any one of those characters.

Example 6: Try This: Regular expressions - brackets

Try these commands:
grep Alas ~/playing/alas.txt
grep '[Alas]' ~/playing/alas.txt
Most Unix systems have a file /usr/share/dict/words, which is a “dictionary” used by spellcheck programs. It is not a dictionary in the sense of a list of words and definitions. It is simply a list of words, one per line, in alphabetical order.

Try:
more /usr/share/dict/words
(Remember you can quit with q). A typical version of this file will have nearly 100,000 words.

That words file can be used to search for words that match a selected criterion. For example, if you can’t remember whether a word is spelled ‘belief’ or ‘beleif’, try:
grep 'bel[ei][ei]f' /usr/share/dict/words

A particularly useful variant on the brackets is the use of character ranges. If you write two characters separated by a hyphen (-) inside brackets, e.g., [A-Z], then that is taken to mean all the characters starting at the first one and up to the last one, according to the usual ASCII character encoding rules.

Example 7: Try This: Regular expressions - brackets with character ranges

Try:
grep 'i[A-Z]' /usr/share/dict/words
sed 's/[A-Z]/*/g' ~/playing/alas.txt
sed 's/[a-z]/*/g' ~/playing/alas.txt
sed 's/[A-Hm-z]/*/g' ~/playing/alas.txt
This can be combined with ^:
sed 's/[^A-Z]/*/g' ~/playing/alas.txt

The character set inside a regular expression’s square brackets can be abbreviated by giving a range of characters separated by a hyphen. For example, [a-z] would match any one of the lower-case alphabetic characters.
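
Bracket sets and ranges can be tried on a few words without the dictionary file. In this sketch the word list and the sample phrase are invented. Setting LC_ALL=C is an added precaution, not part of the commands above: it pins the range to ASCII order, since other locales may interpret a range differently.

```shell
# Hypothetical word list standing in for /usr/share/dict/words.
words=$(printf '%s\n' belief beleif relief)

# [ei][ei] accepts ie, ei, ee, or ii between "bel" and "f".
spelled=$(printf '%s\n' "$words" | grep 'bel[ei][ei]f')

# [A-Z] matches any single upper-case letter (ASCII order under LC_ALL=C).
masked=$(printf '%s\n' 'Alas, poor Yorick' | LC_ALL=C sed 's/[A-Z]/*/g')

echo "$spelled"
echo "$masked"   # *las, poor *orick
```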

Example 8: Try This: grep and -v

The -v option causes grep to list only those lines that do not match the pattern.

Try:
grep -v a ~/playing/alas.txt
grep -v e ~/playing/alas.txt
grep -v i ~/playing/alas.txt
grep -v o ~/playing/alas.txt
And
grep the ~/playing/alas.txt
grep -v the ~/playing/alas.txt
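
The same comparison can be made on a few lines typed directly into the command. This is a sketch with made-up sample lines; note that “other” counts as a match because it contains the substring “the”.

```shell
# Hypothetical sample lines.
lines=$(printf '%s\n' 'the skull' 'a jest' 'other men')

# Lines that contain "the" anywhere, even inside another word.
with_the=$(printf '%s\n' "$lines" | grep 'the')

# Lines that do not contain "the" at all.
without_the=$(printf '%s\n' "$lines" | grep -v 'the')

echo "$with_the"
echo "$without_the"   # a jest
```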

Square brackets containing any set of characters beginning with ^, written [^…], will match a string containing any character not in that set.

Example 9: Try This: Regular expressions - brackets with ^

Do you remember being taught in grade-school spelling class that the letter ‘q’ is always followed by a ‘u’ in English?

Let’s check:
grep 'q[^u]' /usr/share/dict/words
(Actually, ardent Scrabble players and crossword puzzlers will recognize that /usr/share/dict/words is missing the word “qat”.)

Or how about that rule “‘i’ before ‘e’ except after ‘c’”?
grep 'cie' /usr/share/dict/words
grep '[^c]ei' /usr/share/dict/words
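
One subtlety is that [^u] must still match some character; it does not match “nothing”. A small sketch with an invented three-word list shows the difference:

```shell
# Hypothetical word list used only for this illustration.
words=$(printf '%s\n' quick Iraq qat)

# 'q[^u]' needs a 'q' followed by one character that is not 'u'.
# "quick" fails (the q is followed by u), and "Iraq" fails too,
# because nothing at all follows its q.
no_u=$(printf '%s\n' "$words" | grep 'q[^u]')

echo "$no_u"   # qat
```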

. matches any single printable character, including blanks, but not including the end-of-line character.

Example 10: Try This: Regular Expressions - Matching any single character
grep 'd.f' /usr/include/math.h
Can you think of a word that has two ‘z’s separated by a single letter?
grep 'z.z' /usr/share/dict/words

If $r$ is a regular expression, then $r*$ matches zero or more successive strings, each of which matches $r$ .

Example 11: Try This: Regular Expressions - Zero or more repeats
grep 'ER*N' /usr/include/math.h
grep '#.*if' /usr/include/math.h
Note that the behavior of * is very different in regular expressions than it is in file path wildcards.
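
The difference can be seen by applying the same text, ab*c, once as a regular expression and once as a shell wildcard pattern. This is a sketch for a POSIX-style shell such as sh or bash; the names being tested are made up.

```shell
# Hypothetical names to test against both kinds of pattern.
names=$(printf '%s\n' ac abc abxc)

# Regular expression: b* means "zero or more b's".
regex_hits=$(printf '%s\n' "$names" | grep 'ab*c')

# Wildcard: * means "any run of characters" after the literal "ab".
glob_hits=''
for name in $names; do
  case $name in
    ab*c) glob_hits="${glob_hits:+$glob_hits }$name" ;;
  esac
done

echo "$regex_hits"
echo "$glob_hits"
```

The regular expression accepts “ac” (zero b’s) but not “abxc”, while the wildcard does the reverse.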

If $r$ is a regular expression, then $r\backslash +$ matches one or more successive strings, each of which matches $r$ .

Example 12: Try This: Regular Expressions - One or more repeats

grep 'ERN' /usr/include/math.h
egrep 'ER+N' /usr/include/math.h
grep 'ER*N' /usr/include/math.h
grep 'ER\+N' /usr/include/math.h
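
The three forms can be compared side by side on a few short strings. This sketch assumes GNU grep, where \+ is an extension to the basic syntax; grep -E is used here as the equivalent of egrep, and the sample strings are invented.

```shell
# Hypothetical sample strings with zero, one, and two R's.
lines=$(printf '%s\n' EN ERN ERRN)

# R* allows zero R's, so all three match.
star=$(printf '%s\n' "$lines" | grep 'ER*N')

# R\+ (basic syntax, GNU extension) requires at least one R.
plus_basic=$(printf '%s\n' "$lines" | grep 'ER\+N')

# R+ (extended syntax) means the same thing without the backslash.
plus_extended=$(printf '%s\n' "$lines" | grep -E 'ER+N')

echo "$star"
echo "$plus_basic"
echo "$plus_extended"
```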

If $r$ is a regular expression, then $r\backslash ?$ matches zero or one strings matching $r$ .

Example 13: Try This: Regular Expressions - Optional occurrence
egrep 'def' /usr/include/math.h
egrep 'define' /usr/include/math.h
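egrep 'def(ine)?' /usr/include/math.h
egrep 'def(ine)?d' /usr/include/math.h
In essence, the ? makes the preceding item optional. (So the first and third commands in this example are actually equivalent.) Because egrep uses the extended regular expression syntax, the ? and the parentheses are written here without backslashes.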
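
A sketch of the optional operator on a handful of invented lines, using grep -E (the extended syntax that egrep provides), where ? and the parentheses are written without backslashes:

```shell
# Hypothetical sample lines.
lines=$(printf '%s\n' '#ifdef X' '#define Y' 'undefined' 'default')

# Plain "def" matches every line here.
plain=$(printf '%s\n' "$lines" | grep 'def')

# "def" with an optional "ine" still matches wherever "def" does.
optional=$(printf '%s\n' "$lines" | grep -E 'def(ine)?')

# Requiring a trailing "d" accepts only "defd" or "defined".
with_d=$(printf '%s\n' "$lines" | grep -E 'def(ine)?d')

echo "$optional"
echo "$with_d"   # undefined
```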

If r is a regular expression, then ^r matches any string that begins with a substring matching r.
- The grep and sed programs both use an entire line of text as the string to be searched, so for those programs ^ matches the beginning of a line.

Example 14: Try This: Regular Expressions - Beginning of the string

grep 'ex' /usr/include/math.h
grep '^ex' /usr/include/math.h

and

grep '[nN]' ~/playing/alas.txt
grep '^[nN]' ~/playing/alas.txt
grep '^ *[nN]' ~/playing/alas.txt
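
The three patterns above select progressively different sets of lines. A sketch on made-up input, where the second line begins with two blanks:

```shell
# Hypothetical sample lines; the second one starts with two blanks.
lines=$(printf '%s\n' 'next' '  Now' 'and then')

# An n or N anywhere on the line.
anywhere=$(printf '%s\n' "$lines" | grep '[nN]')

# An n or N as the very first character of the line.
at_start=$(printf '%s\n' "$lines" | grep '^[nN]')

# An n or N preceded only by zero or more blanks.
after_blanks=$(printf '%s\n' "$lines" | grep '^ *[nN]')

echo "$anywhere"
echo "$at_start"
echo "$after_blanks"
```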

If r is a regular expression, then r$ matches any string that ends with a substring matching r.
- The grep and sed programs both use an entire line of text as the string to be searched, so for those programs $ matches the end of a line.

Example 15: Try This: Regular Expressions - End of the string

Try:
grep '_' /usr/include/math.h
grep '_$' /usr/include/math.h
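
The same contrast on a few invented lines, as a sketch:

```shell
# Hypothetical sample lines with an underscore in different positions.
lines=$(printf '%s\n' 'a_b' 'name_' '_start')

# An underscore anywhere on the line.
anywhere=$(printf '%s\n' "$lines" | grep '_')

# An underscore as the last character of the line.
at_end=$(printf '%s\n' "$lines" | grep '_$')

echo "$anywhere"
echo "$at_end"   # name_
```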

With so many special characters, you might wonder just what you’re supposed to do if you really want to search for lines containing a “*”, or a “?”, or a … The answer is given in our final rule:

If c is a special character, then \c matches that character in a string.

Example 16: Try This: Regular Expressions - Quoting special characters
grep '.$' ~/playing/alas.txt
grep '\.$' ~/playing/alas.txt
grep '3.' /usr/include/math.h
grep '3\.' /usr/include/math.h
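
A sketch of the difference between the unescaped and escaped dot, using three invented strings:

```shell
# Hypothetical sample strings.
lines=$(printf '%s\n' '3.14' '30' 'x3')

# Unescaped: a 3 followed by any one character.
unescaped=$(printf '%s\n' "$lines" | grep '3.')

# Escaped: a 3 followed by a literal period.
escaped=$(printf '%s\n' "$lines" | grep '3\.')

echo "$unescaped"
echo "$escaped"   # 3.14
```

The string “x3” matches neither pattern, because no character follows its 3.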

This is by no means an exhaustive list of all the regular expression operations, but it’s probably enough for most purposes.

3 Sed Redux

sed has some regular expression features not useful in grep. In particular, if you place part of a regular expression inside parentheses (written as \( and \)), then in the replacement string you can refer to whatever got matched by the parenthesized part of the expression via a back reference. If you have just one parenthesized expression, the back reference is \1. If you add another parenthesized expression, you can refer back to it as \2, and so on.

Example 17: Try This: Back References in sed
cd ~/playing
cp ~cs252/Assignments/ftpAsst/names.txt .
cat names.txt
sed 's/\([A-Za-z]\+\) /*/' names.txt
sed 's/\([A-Za-z]\+\) \(.*\)/*/' names.txt
The first parenthesized expression matches a block of one or more alphabetic characters. That is followed by a blank, which matches a blank in the file. The second parenthesized expression then matches a block of zero or more of any character, in effect swallowing up the rest of the line.
sed 's/\([A-Za-z]\+\) \(.*\)/\1/' names.txt
sed 's/\([A-Za-z]\+\) \(.*\)/\2/' names.txt
sed 's/\([A-Za-z]\+\) \(.*\)/\2, \1/' names.txt
Note how, in the final replacement pattern, we reversed the order of the two matched blocks, as well as inserting a comma between them.
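
The swap can also be tried without names.txt. In this sketch the two names are invented, and GNU sed is assumed because \+ is an extension to the basic syntax. Only the first word is captured by \1, so everything after the first blank ends up in \2.

```shell
# Hypothetical names standing in for the contents of names.txt.
names=$(printf '%s\n' 'John Smith' 'Mary Ann Jones')

# \1 is the first word; \2 is the rest of the line after the blank.
swapped=$(printf '%s\n' "$names" | sed 's/\([A-Za-z]\+\) \(.*\)/\2, \1/')

echo "$swapped"
```

For a three-part name, the result is “Ann Jones, Mary”, which shows that this pattern splits at the first blank, not the last.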