Patterns for Text: Regular Expressions

Last modified: Aug 24, 2023

If wildcards provide a way to write patterns for file and directory paths, can we also write patterns for text strings? Yes, but this is not built into the shell for use by every command, the way that wildcards are.

Instead, most Unix programs and commands that search for or match text share a common notation for the patterns to be matched. This notation is called regular expressions.

1 Searching for Text

For example, almost every text editor in any operating system will allow you to search a file for a given string. But most Unix text editors (including emacs and vim) will allow you to search for any string matching a regular expression “pattern”. sed, a useful utility for doing simple changes to text files, is most often invoked to use its “substitute” command, which replaces any text matching a regular expression by some desired replacement text. The csplit command splits a single file into multiple pieces, where the point of division is most often indicated via a regular expression. Perl and awk, available on most Unix systems but not covered in this course, are scripting (programming) languages with a heavy emphasis on text manipulation, which is accomplished largely through matching on regular expressions.

1.1 Searching Lines with grep

In an earlier example, we saw that the program grep can be used to list all lines of a file that match a given string. For example,

grep 'def' /usr/include/math.h

would list all lines in the indicated file that contain the string “def”.

The first parameter (‘def’) is actually an example of a regular expression, a special notation for writing patterns for searching and matching text.

As we will see shortly, the notation for these patterns is such that a pattern that matches exactly one string (e.g., “def”) is written as that same string — “def” is both the pattern and the string that it matches.

But regular expressions are much more powerful than that. We can write a regular expression pattern to match a wide range of related strings.

First, though, let’s look at another command that makes heavy use of regular expressions.

1.2 Rewriting lines with sed

Many other commands besides grep will use regular expressions. sed, for example, allows you to enter a variety of editing commands that will be applied to every line of a file. A common use of sed is to scan each line of the file for a pattern and to replace that pattern, wherever it occurs, by some string. The sed command to do this is

sed s/pattern/replacement/g filename

where filename is the file whose contents we want to scan and replace, pattern is a regular expression describing the text to search for in each line, and replacement is the text by which we wish to replace anything that matches the pattern. (The ‘/’ characters are simply necessary to indicate the beginning and end of the pattern and replacement strings. They can be replaced by any character that does not appear in either the pattern or replacement strings.)
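
A different delimiter is most helpful when the pattern itself contains ‘/’, as a file path does. The following is a minimal sketch of that situation; the path in it is made up for illustration and is not one of the course files.

```shell
# Hypothetical input line containing slashes (not from the course files)
line='/usr/local/bin/tool'

# With '#' as the delimiter, the slashes in the pattern need no special treatment
with_hash=$(printf '%s\n' "$line" | sed 's#/usr/local#/opt#')

# With '/' as the delimiter, each slash in the pattern and replacement
# must be preceded by a backslash
with_slash=$(printf '%s\n' "$line" | sed 's/\/usr\/local/\/opt/')

printf '%s\n' "$with_hash" "$with_slash"
```

Both commands should print the same rewritten path; the first is simply easier to read.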

Example 1: Try This: Substitutions with sed

cd ~/playing
cp ~cs252/Assignments/ftpAsst/alas.txt .
more alas.txt

Now try operating on that file with sed:

sed s/o/X/g alas.txt

The ‘g’ at the end of the prior example indicates that the change should be applied every time a match is found (i.e., this is a global replacement). If the ‘g’ is dropped, only the first match in each line will be replaced.

Try the command

sed s/o/X/ alas.txt

to see the effect of dropping the ‘g’.
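
If you do not have alas.txt at hand, the effect of the ‘g’ flag can be sketched on a single line of text. The sample line below is a stand-in chosen for illustration; it is not claimed to be the contents of alas.txt.

```shell
# Stand-in sample line (an assumption, not the course's alas.txt)
text='Alas, poor Yorick! I knew him, Horatio.'

# With g: every "o" on the line is replaced
all=$(printf '%s\n' "$text" | sed 's/o/X/g')

# Without g: only the first "o" on the line is replaced
first=$(printf '%s\n' "$text" | sed 's/o/X/')

printf '%s\n' "$all" "$first"
```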

Neither the pattern nor the replacement is limited to single letters.

Try:

sed 's!I!you!g' alas.txt
sed 's@Horatio@George@g' alas.txt

By default, sed is case-sensitive. In GNU sed, and in some other versions, you can add an ‘i’ flag at the end of the substitution to change this. Compare the outputs of these commands:

sed 's/I/you/g' alas.txt
sed 's/I/you/gi' alas.txt

2 Regular Expressions

To write more powerful patterns, we need to understand how regular expressions are constructed.

Example 2: Try This: Regular expressions - basic characters

cd ~/playing
grep H alas.txt

Since grep works line by line, this would select every line containing an “H”.

Example 3: Try This: Regular expressions - concatenation

Compare the results of

grep H ~/playing/alas.txt
grep or ~/playing/alas.txt
grep Hor ~/playing/alas.txt

The real power of regular expressions comes into play when we consider the various “special” characters that serve as regular expression operators.

Example 4: Try This: Regular expressions - parentheses

Based on the definition of parentheses, these two commands should do exactly the same thing:

grep 'def' /usr/include/math.h
grep '\(def\)' /usr/include/math.h

The introduction of the parentheses does not change what is matched. It does, however, group things together (just as parentheses do in conventional algebra), which we can take advantage of with the operators we will introduce shortly.

But compare also

grep 'x' /usr/include/math.h
grep '\(x\)' /usr/include/math.h
grep '(x)' /usr/include/math.h
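
The difference among those three commands can be seen on a small input. This sketch assumes grep's default (basic) syntax, in which \( and \) group while a bare ( or ) is an ordinary character. The two sample lines are invented for illustration.

```shell
# Two made-up lines standing in for a header file
sample=$(printf '%s\n' 'double exp(x);' 'int max;')

# Count the matching lines for each of the three patterns
plain=$(printf '%s\n' "$sample" | grep -c 'x')         # any line containing x
grouped=$(printf '%s\n' "$sample" | grep -c '\(x\)')   # same pattern, merely grouped
literal=$(printf '%s\n' "$sample" | grep -c '(x)')     # the three characters "(x)"

printf '%s %s %s\n' "$plain" "$grouped" "$literal"
```

The first two counts agree, while the third is smaller because only one line contains an x surrounded by actual parentheses.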

Example 5: Try This: Regular expressions - Alternation

Try:

grep '\(Y\|H\)' ~/playing/alas.txt
grep '\(Y\|H\)or' ~/playing/alas.txt
sed 's/\(Y\|H\)or/XXX/g' ~/playing/alas.txt
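
The \| form of alternation is not guaranteed by POSIX in the basic syntax, so it may not work in every grep or sed. Here is a sketch of the same substitution in the extended syntax, assuming your sed accepts the -E option (GNU and BSD versions do). The sample line is a stand-in for alas.txt, whose contents are not reproduced here.

```shell
# Stand-in line (an assumption, not the course's alas.txt)
line='Alas, poor Yorick! I knew him, Horatio.'

# In extended syntax, ( ) and | are operators without backslashes
swapped=$(printf '%s\n' "$line" | sed -E 's/(Y|H)or/XXX/g')

printf '%s\n' "$swapped"
```

Note that “poor” is left alone: it contains “or”, but not preceded by a Y or an H.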

It is worth noting that, in sed, the expression that we match the text against in a substitution is a regular expression, but the replacement text is not. Try:

sed 's/\(Y\|H\)or/|||/g' ~/playing/alas.txt
sed 's/\(Y\|H\)or/\|\|\|/g' ~/playing/alas.txt

Example 6: Try This: Regular expressions - brackets

Try these commands:

grep Alas ~/playing/alas.txt
grep '[Alas]' ~/playing/alas.txt

Most Unix systems have a file /usr/share/dict/words, which is a “dictionary” used by spellcheck programs. It is not a dictionary in the sense of a list of words and definitions. It is simply a list of words, one per line, in alphabetical order.

Try:

more /usr/share/dict/words

(Remember that you can quit with q.) A typical version of this file will have nearly 100,000 words.

That words file can be used to search for words that match a selected criterion. For example, if you can’t remember whether a word is spelled ‘belief’ or ‘beleif’, try:

grep 'bel[ei][ei]f' /usr/share/dict/words
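
To see exactly which spellings the bracket pattern admits, here is a small sketch run against a made-up word list instead of /usr/share/dict/words, whose contents vary from system to system.

```shell
# Invented word list: one real spelling, two misspellings, one unrelated word
words=$(printf '%s\n' belief beleif beleef relief)

# Each [ei] matches exactly one character, either e or i
matches=$(printf '%s\n' "$words" | grep 'bel[ei][ei]f')

printf '%s\n' "$matches"
```

The pattern accepts all four combinations (ee, ei, ie, ii), so it is the dictionary, not the pattern, that tells you which spelling is a real word.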

A particularly useful variant on the brackets is the use of character ranges. If you write two characters separated by a hyphen (-) inside brackets, e.g., [A-Z], then that is taken to mean all the characters starting at the first one and up to the last one, according to the usual ASCII character encoding rules.

Example 7: Try This: Regular expressions - brackets with character ranges

Try:

grep 'i[A-Z]' /usr/share/dict/words
sed 's/[A-Z]/*/g' ~/playing/alas.txt
sed 's/[a-z]/*/g' ~/playing/alas.txt
sed 's/[A-Hm-z]/*/g' ~/playing/alas.txt

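A range can be combined with ^, which, when it is the first character inside the brackets, inverts the set so that it matches any single character not listed: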

sed 's/[^A-Z]/*/g' ~/playing/alas.txt
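
The following sketch shows what three of these substitutions do to one line. It sets LC_ALL=C on the assumption that you want ranges to follow ASCII order, as described above; in some other locales a range such as [A-Z] may pick up additional characters. The sample line is invented for illustration.

```shell
# Force ASCII ordering so that [A-Z] means exactly the 26 uppercase letters
LC_ALL=C
export LC_ALL

line='Alas, poor Yorick!'   # invented sample line

upper=$(printf '%s\n' "$line" | sed 's/[A-Z]/*/g')      # uppercase letters only
others=$(printf '%s\n' "$line" | sed 's/[^A-Z]/*/g')    # everything except uppercase letters
mixed=$(printf '%s\n' "$line" | sed 's/[A-Hm-z]/*/g')   # two ranges in a single set

printf '%s\n' "$upper" "$others" "$mixed"
```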

Example 8: Try This: grep and -v

The -v option causes grep to list only those lines that do not match the pattern.

Try:

grep -v a ~/playing/alas.txt
grep -v e ~/playing/alas.txt
grep -v i ~/playing/alas.txt
grep -v o ~/playing/alas.txt

And

grep the ~/playing/alas.txt
grep -v the ~/playing/alas.txt
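
One way to convince yourself of what -v does is to check that the matching and non-matching lines together account for every line of the input. A sketch on three invented lines, using grep's -c option to count lines instead of printing them:

```shell
# Invented three-line sample
sample=$(printf '%s\n' 'the first line' 'another line' 'a third one')

with=$(printf '%s\n' "$sample" | grep -c 'the')        # lines containing "the"
without=$(printf '%s\n' "$sample" | grep -vc 'the')    # lines not containing "the"

echo "$with + $without = $((with + without))"
```

The second line counts as a match because “the” occurs inside “another”; grep looks for the pattern anywhere in the line, not only as a separate word.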

Example 9: Try This: Regular expressions - brackets with ^

Do you remember being taught in grade-school spelling class that the letter ‘q’ is always followed by a ‘u’ in English?

Let’s check:

grep 'q[^u]' /usr/share/dict/words

(Actually, ardent Scrabble players and crossword puzzlers will recognize that /usr/share/dict/words is missing the word “qat”.)

Or how about that rule “‘i’ before ‘e’ except after ‘c’”?

grep 'cie' /usr/share/dict/words
grep '[^c]ei' /usr/share/dict/words
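
One subtlety of a pattern such as q[^u] is that [^u] must still match some character. A word that ends in q therefore does not match, because nothing follows the q. A sketch on a short invented word list:

```shell
# Invented word list: q followed by u, q followed by another letter, q at the end
words=$(printf '%s\n' quit qat Iraq)

hits=$(printf '%s\n' "$words" | grep 'q[^u]')

printf '%s\n' "$hits"
```

Only the word with a non-u character after the q is listed.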

Example 10: Try This: Regular Expressions - Matching any single character

grep 'd.f' /usr/include/math.h

Can you think of a word that has two ‘z’s separated by a single letter?

grep 'z.z' /usr/share/dict/words
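
Because the dot stands for exactly one character, adjacent z’s alone are not enough to match. A brief sketch on an invented word list:

```shell
# Invented word list
words=$(printf '%s\n' pizza pizzazz zigzag)

# z, then any one character, then z
hits=$(printf '%s\n' "$words" | grep 'z.z')

printf '%s\n' "$hits"
```

Here “pizzazz” matches through its “zaz”, while “pizza” and “zigzag” contain no z that is followed two characters later by another z.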

Example 11: Try This: Regular Expressions - Zero or more repeats

grep 'ER*N' /usr/include/math.h
grep '#.*if' /usr/include/math.h

Note that the behavior of * is very different in regular expressions than it is in file path wildcards.
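
To make the difference concrete: in a file path wildcard, * by itself stands for any run of characters, while in a regular expression it only says how many times the preceding item may repeat. A sketch on a few invented strings:

```shell
# Invented test strings, one per line
lines=$(printf '%s\n' EN ERN ERRN EXN)

# R* means "zero or more R", so EN matches as well as ERN and ERRN
star=$(printf '%s\n' "$lines" | grep 'ER*N')

# .* means "zero or more of any character", the closest analogue of the shell's *
dotstar=$(printf '%s\n' "$lines" | grep 'E.*N')

printf '%s\n' "$star" '--' "$dotstar"
```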

Example 12: Try This: Regular Expressions - One or more repeats

grep 'ERN' /usr/include/math.h
egrep 'ER+N' /usr/include/math.h
grep 'ER*N' /usr/include/math.h
grep 'ER\+N' /usr/include/math.h
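
In grep's basic syntax, \+ is a GNU extension rather than part of POSIX. A portable way to say “one or more R” is to write the item once and follow it with a starred copy. The sketch below compares the two spellings on invented strings, using grep -E, which is the standard way to request the extended syntax that egrep provides.

```shell
# Invented test strings, one per line
lines=$(printf '%s\n' EN ERN ERRN)

plus=$(printf '%s\n' "$lines" | grep -E 'ER+N')     # extended syntax: one or more R
spelled=$(printf '%s\n' "$lines" | grep 'ERR*N')    # basic syntax: an R, then zero or more R

printf '%s\n' "$plus" '--' "$spelled"
```

Both commands should list the same two strings and skip EN, which has no R at all.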

Example 13: Try This: Regular Expressions - Optional occurrence

egrep 'def' /usr/include/math.h
egrep 'define' /usr/include/math.h
egrep 'def(ine)?' /usr/include/math.h
egrep 'def(ine)?d' /usr/include/math.h

In essence, the ? makes the preceding item optional. (So the first and third commands in this example are actually equivalent.)
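
In the extended syntax used by egrep, the operator is written as a bare ?; with a backslash in front, it would instead match a literal question mark. A sketch of the optional group on three invented strings:

```shell
# Invented test strings, one per line
lines=$(printf '%s\n' define defined defd)

# "def", then optionally "ine", then "d"
hits=$(printf '%s\n' "$lines" | grep -E 'def(ine)?d')

printf '%s\n' "$hits"
```

“define” is not listed: with or without the optional “ine”, there is no “d” in the right place to finish the match.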

Example 14: Try This: Regular Expressions - Beginning of the string

grep 'ex' /usr/include/math.h
grep '^ex' /usr/include/math.h

and

grep '[nN]' ~/playing/alas.txt
grep '^[nN]' ~/playing/alas.txt
grep '^ *[nN]' ~/playing/alas.txt
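
The three patterns above grow steadily more selective. A sketch that counts the matching lines for each on an invented sample (the second line begins with two blanks):

```shell
# Invented sample; the second line is indented by two blanks
sample=$(printf '%s\n' 'Now get you gone' '  not one' 'and now')

anywhere=$(printf '%s\n' "$sample" | grep -c '[nN]')      # n or N anywhere in the line
at_start=$(printf '%s\n' "$sample" | grep -c '^[nN]')     # n or N as the very first character
indented=$(printf '%s\n' "$sample" | grep -c '^ *[nN]')   # n or N after zero or more leading blanks

printf '%s %s %s\n' "$anywhere" "$at_start" "$indented"
```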

Example 15: Try This: Regular Expressions - End of the string

Try:

grep '_' /usr/include/math.h
grep '_$' /usr/include/math.h

With so many special characters, you might wonder just what you’re supposed to do if you really want to search for lines containing a “*”, or a “?”, or a … The answer is given in our final rule:

Example 16: Try This: Regular Expressions - Quoting special characters

grep '.$' ~/playing/alas.txt
grep '\.$' ~/playing/alas.txt
grep '3.' /usr/include/math.h
grep '3\.' /usr/include/math.h
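
The effect of the backslash is easiest to see by counting matches with and without it. A sketch on four invented lines:

```shell
# Invented sample lines
sample=$(printf '%s\n' 'To be.' 'or not' 'v3x' '3.14')

any_last=$(printf '%s\n' "$sample" | grep -c '.$')      # any character at the end of the line
dot_last=$(printf '%s\n' "$sample" | grep -c '\.$')     # a literal period at the end of the line
three_any=$(printf '%s\n' "$sample" | grep -c '3.')     # a 3 followed by any character
three_dot=$(printf '%s\n' "$sample" | grep -c '3\.')    # a 3 followed by a literal period

printf '%s %s %s %s\n' "$any_last" "$dot_last" "$three_any" "$three_dot"
```

Without the backslash, the dot matches any character, so every non-empty line ends with a match for '.$'.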

This is by no means an exhaustive list of all the regular expression operations, but it’s probably enough for most purposes.

3 Sed Redux

sed has some regular expression features not useful in grep. In particular, if you place part of a regular expression inside parentheses (written as \( and \) ), then in the replacement string you can refer to whatever got matched by the parenthesized part of the expression via a back reference. If you have just one parenthesized expression, the back reference is \1. If you add another parenthesized expression, you can refer back to it as \2, and so on.

Example 17: Try This: Back References in sed

cd ~/playing
cp ~cs252/Assignments/ftpAsst/names.txt .
cat names.txt
sed 's/\([A-Za-z]\+\) /*/' names.txt
sed 's/\([A-Za-z]\+\) \(.*\)/*/' names.txt

The first parenthesized expression matches a block of 1 or more alphabetic characters. That is followed by a blank, which matches a blank in the file. The second parenthesized expression then matches a block of 0 or more of any character, in effect swallowing up the rest of the line.

sed 's/\([A-Za-z]\+\) \(.*\)/\1/' names.txt
sed 's/\([A-Za-z]\+\) \(.*\)/\2/' names.txt
sed 's/\([A-Za-z]\+\) \(.*\)/\2, \1/' names.txt

Note how, in the final replacement pattern, we reversed the order of the two matched blocks, as well as inserting a comma between them.
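
If names.txt is not available, the same reordering can be sketched on a pair of invented names. This version spells “one or more letters” as [A-Za-z][A-Za-z]* rather than with \+, on the assumption that you may be using a sed that lacks that GNU extension.

```shell
# Invented names (an assumption, not the course's names.txt)
names=$(printf '%s\n' 'Jane Doe' 'John Q Public')

# \1 is the first word; \2 is everything after the first blank
flipped=$(printf '%s\n' "$names" | sed 's/\([A-Za-z][A-Za-z]*\) \(.*\)/\2, \1/')

printf '%s\n' "$flipped"
```

Because the second group takes the whole remainder of the line, a middle initial travels with the last name.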