Patterns for Text: Regular Expressions
If wildcards provide a way to write patterns for file and directory paths, can we also write patterns for text strings? Yes, but this is not built into the shell for use by every command, the way that wildcards are.
Instead, most Unix programs and commands that do some kind of searching or matching for text will share a common notation for patterns of text to be matched. This notation is called regular expressions.
1 Searching for Text
For example, almost every text editor in any operating system will allow you to search a file for a given string. But most Unix text editors (including emacs and vim) will allow you to search for any string matching a regular expression “pattern”. sed, a useful utility for doing simple changes to text files, is most often invoked to use its “substitute” command, which replaces any text matching a regular expression by some desired replacement text. The csplit command splits a single file into multiple pieces, where the point of division is most often indicated via a regular expression. Perl and awk, available on most Unix systems but not covered in this course, are scripting (programming) languages with a heavy emphasis on text manipulation, which is accomplished largely through matching on regular expressions.
1.1 Searching Lines with grep
In an earlier example, we saw that the program grep can be used to list all lines of a file that match a given string. For example,
grep 'def' /usr/include/math.h
would list all lines in the indicated file that contain the string “def”.
The first parameter (‘def’) is actually an example of a regular expression, a special notation for writing patterns for searching and matching text.
As we will see shortly, the notation for these patterns is such that a pattern that matches exactly one string (e.g., “def”) is written as that same string — “def” is both the pattern and the string that it matches.
But regular expressions are much more powerful than that. We can write a regular expression pattern to match a wide range of related strings.
First, though, let’s look at another command that makes heavy use of regular expressions.
1.2 Rewriting lines with sed
Many other commands besides grep will use regular expressions. sed, for example, allows you to enter a variety of editing commands that will be applied to every line of a file. A common use of sed is to scan each line of the file for a pattern and to replace that pattern, wherever it occurs, by some string. The sed command to do this is
sed s/pattern/replacement/g filename
where filename is the file whose contents we want to scan and replace, pattern is a regular expression describing the text to search for in each line, and replacement is the text by which we wish to replace anything that matches the pattern. (The ‘/’ characters are simply necessary to indicate the beginning and end of the pattern and replacement strings. They can be replaced by any character that does not appear in either the pattern or replacement strings.)
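As a small sketch of why a different delimiter can be handy, the following replaces a directory prefix that itself contains ‘/’ characters. The sample path is piped in with printf so that nothing depends on a file, and the replacement directory /opt/headers is invented purely for illustration.

```shell
# Hypothetical sample line; the replacement directory is invented for this sketch.
path='/usr/include/math.h'

# With '/' as the delimiter, each '/' inside the pattern would need a backslash.
# Choosing '|' as the delimiter avoids that.
moved=$(printf '%s\n' "$path" | sed 's|/usr/include|/opt/headers|')

printf '%s\n' "$moved"
```

The same command could be written with ‘/’ delimiters as sed 's/\/usr\/include/\/opt\/headers/', which is harder to read.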
Example 1: Try This: Substitutions with sed
cd ~/playing
cp ~cs252/Assignments/ftpAsst/alas.txt .
more alas.txt
Now try operating on that file with sed:
sed s/o/X/g alas.txt
The ‘g’ at the end of the prior example indicates that the change should be applied every time a match is found (i.e., this is a global replacement). If the ‘g’ is dropped, only the first match in each line will be replaced.
Try the command
sed s/o/X/ alas.txt
to see the effect of dropping the ‘g’.
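If you do not have alas.txt at hand, the difference can be sketched on a single made-up line. The sample text here is an assumption chosen only because it contains several ‘o’ characters.

```shell
# Made-up sample line with several 'o' characters.
line='to be or not to be'

# Without 'g': only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/o/X/')

# With 'g': every match on the line is replaced.
every=$(printf '%s\n' "$line" | sed 's/o/X/g')

printf '%s\n' "$first_only" "$every"
```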
Neither the pattern nor the replacement is limited to single letters.
Try:
sed 's!I!you!g' alas.txt
sed 's@Horatio@George@g' alas.txt
By default, sed is case-sensitive. In GNU sed, you can add an ‘i’ flag at the end of the substitution to change this. Compare the outputs of these commands:
sed 's/I/you/g' alas.txt
sed 's/I/you/gi' alas.txt
2 Regular Expressions
To write more powerful patterns, we need to understand how regular expressions are constructed.
- A regular expression consisting of a single “non-special” character will match any string containing that character.
As it happens, none of the alphabetic and numeric characters are “special”, so the regular expression d, for example, would match any string containing a “d”.
Example 2: Try This: Regular expressions - basic characters
cd ~/playing
grep H alas.txt
Since grep works line by line, this would select every line containing an “H”.
- If regular expressions \( r_1, r_2, \ldots, r_k \) are concatenated together to form a single larger regular expression \( r_1r_2\ldots r_k \), it matches any string that contains a substring formed from a concatenation of strings \( s_1s_2\ldots s_k \), where each \( s_i \) matches the corresponding regular expression \( r_i \).
So when we write
grep def /usr/include/math.h
the def is actually the concatenation of three regular expressions d, e, and f, and matches any string that contains a substring matching d followed immediately by a substring matching e followed immediately by a substring matching f.
That may seem an unnecessarily complex way to get to the original idea of “matches the string ‘def’,” but this idea of concatenation is a general one that becomes more important as we consider other combinations of smaller regular expressions.
Example 3: Try This: Regular expressions - concatenation
Compare the results of
grep H ~/playing/alas.txt
grep or ~/playing/alas.txt
grep Hor ~/playing/alas.txt
The real power of regular expressions comes into play when we consider the various “special” characters that serve as regular expression operators.
- Regular expressions can be grouped in parentheses, written as \( … \).
  - Without the backslash (\), the parentheses are just regular characters: they match a parenthesis in the lines of text.
  - The backslash is a special character, not only to grep and sed, but also to the command shell. So any command parameters that want to use it will need to be quoted.
Example 4: Try This: Regular expressions - parentheses
Based on the definition of parentheses, these two commands should do exactly the same thing:
grep 'def' /usr/include/math.h
grep '\(def\)' /usr/include/math.h
The introduction of the parentheses does not change what is matched. It does, however, group things together (just as parentheses do in conventional algebra), which we can take advantage of with the operators we will introduce shortly.
But compare also
grep 'x' /usr/include/math.h
grep '\(x\)' /usr/include/math.h
grep '(x)' /usr/include/math.h
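The last comparison can be sketched without math.h by piping in two made-up lines, only one of which contains literal parentheses. The sample lines are assumptions for illustration.

```shell
# Two made-up lines: only the second contains the literal text "(x)".
sample=$(printf '%s\n' 'exp x' 'exp (x)')

plain=$(printf '%s\n' "$sample" | grep 'x')        # any line containing x
grouped=$(printf '%s\n' "$sample" | grep '\(x\)')  # grouping only: same lines as 'x'
literal=$(printf '%s\n' "$sample" | grep '(x)')    # unescaped: matches the text "(x)"

printf 'grouped:\n%s\nliteral:\n%s\n' "$grouped" "$literal"
```

The escaped parentheses select both lines, exactly as the plain pattern does, while the unescaped ones select only the line that really contains “(x)”.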
- The vertical line | separating two regular expressions means that a string matching either of those regular expressions would be accepted.
  - Again, though, to get this special behavior, the | must be preceded by a backslash.
  - This vertical bar is called the alternation operator. Alternation here means a choice (same root as “alternative”), not alternating from one thing to another and back again.
Example 5: Try This: Regular expressions - Alternation
Try:
grep '\(Y\|H\)' ~/playing/alas.txt
grep '\(Y\|H\)or' ~/playing/alas.txt
sed 's/\(Y\|H\)or/XXX/g' ~/playing/alas.txt
It is worth noting that, in sed, the expression that we match the text against in a substitution is a regular expression, but the replacement text is not. Try:
sed 's/\(Y\|H\)or/|||/g' ~/playing/alas.txt
sed 's/\(Y\|H\)or/\|\|\|/g' ~/playing/alas.txt
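Here is a self-contained sketch of alternation on three made-up one-word lines. It assumes GNU grep and GNU sed, since the backslashed \| in this notation is a GNU extension and may not be accepted by other implementations.

```shell
# Made-up sample lines (assumes GNU grep and sed for the \| operator).
sample=$(printf '%s\n' 'Horatio' 'Yorick' 'Hamlet')

# Lines containing either "Yor" or "Hor".
either=$(printf '%s\n' "$sample" | grep '\(Y\|H\)or')

# In the replacement text, | is an ordinary character.
bars=$(printf '%s\n' "$sample" | sed 's/\(Y\|H\)or/|||/')

printf '%s\n' "$either" "$bars"
```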
- Square brackets [ ] containing any set of characters not beginning with ^ will match a string containing any one of those characters.
Example 6: Try This: Regular expressions - brackets
Try these commands:
grep Alas ~/playing/alas.txt
grep '[Alas]' ~/playing/alas.txt
Most Unix systems have a file /usr/share/dict/words, which is a “dictionary” used by spellcheck programs. It is not a dictionary in the sense of a list of words and definitions. It is simply a list of words, one per line, in alphabetical order.
Try:
more /usr/share/dict/words
(Remember you can quit with q). A typical version of this file will have nearly 100,000 words.
That words file can be used to search for words that match a selected criterion. For example, if you can’t remember whether a word is spelled ‘belief’ or ‘beleif’, try:
grep 'bel[ei][ei]f' /usr/share/dict/words
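Not every system has /usr/share/dict/words installed, so here is a sketch of the same search against a tiny stand-in word list. The four words are an assumption chosen to show what the pattern accepts and rejects.

```shell
# Tiny stand-in for the words file (made up for this sketch).
words=$(printf '%s\n' 'belief' 'relief' 'beleaf' 'belt')

# "bel", then two characters that are each e or i, then "f".
found=$(printf '%s\n' "$words" | grep 'bel[ei][ei]f')

printf '%s\n' "$found"
```

Each bracket pair stands for exactly one character, so the pattern would also accept “beleef” or “beliif” if such words were in the list.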
A particularly useful variant on the brackets is the use of character ranges. If you write two characters separated by a hyphen (-) inside brackets, e.g., [A-Z], then that is taken to mean all the characters starting at the first one and up to the last one, according to the usual ASCII character encoding rules.
Example 7: Try This: Regular expressions - brackets with character ranges
Try:
grep 'i[A-Z]' /usr/share/dict/words
sed 's/[A-Z]/*/g' ~/playing/alas.txt
sed 's/[a-z]/*/g' ~/playing/alas.txt
sed 's/[A-Hm-z]/*/g' ~/playing/alas.txt
This can be combined with ^:
sed 's/[^A-Z]/*/g' ~/playing/alas.txt
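The range substitutions can be sketched on one short made-up line. Setting LC_ALL=C here is a precaution of ours, not something the commands above require: it asks sed to use plain ASCII ordering for the ranges, which some locales do not guarantee.

```shell
# Short made-up sample line.
line='To Be'

# LC_ALL=C requests plain ASCII ordering for the ranges (a precaution).
upper=$(printf '%s\n' "$line" | LC_ALL=C sed 's/[A-Z]/*/g')      # capitals only
mixed=$(printf '%s\n' "$line" | LC_ALL=C sed 's/[A-Hm-z]/*/g')   # two ranges in one set
not_upper=$(printf '%s\n' "$line" | LC_ALL=C sed 's/[^A-Z]/*/g') # everything except capitals

printf '%s\n' "$upper" "$mixed" "$not_upper"
```

In the second command the ‘T’ and the ‘e’ survive because neither falls in A-H or m-z. In the third, the blank is replaced too, since a blank is not a capital letter.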
- The character set inside a regular expression’s square brackets can be abbreviated by giving a range of characters separated by a hyphen. For example, [a-z] would match all the lower-case alphabetic characters.
Example 8: Try This: grep and -v
The -v option causes grep to list only those lines that do not match the pattern.
Try:
grep -v a ~/playing/alas.txt
grep -v e ~/playing/alas.txt
grep -v i ~/playing/alas.txt
grep -v o ~/playing/alas.txt
And
grep the ~/playing/alas.txt
grep -v the ~/playing/alas.txt
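A sketch of the same pair of commands on three made-up lines shows that, between them, the two searches account for every line of the input exactly once.

```shell
# Three made-up lines; two of them contain "the".
sample=$(printf '%s\n' 'the skull' 'a jest' 'the grave')

with=$(printf '%s\n' "$sample" | grep 'the')       # lines that match
without=$(printf '%s\n' "$sample" | grep -v 'the') # lines that do not match

printf 'with:\n%s\nwithout:\n%s\n' "$with" "$without"
```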
- Square brackets containing any set of characters beginning with ^, written [^…], will match a string containing any character not in that set.
Example 9: Try This: Regular expressions - brackets with ^
Do you remember being taught in grade-school spelling class that the letter ‘q’ is always followed by a ‘u’ in English?
Let’s check:
grep 'q[^u]' /usr/share/dict/words
(Actually, ardent Scrabble players and crossword puzzlers will recognize that /usr/share/dict/words is missing the word “qat”.)
Or how about that rule “‘i’ before ‘e’ except after ‘c’”?
grep 'cie' /usr/share/dict/words
grep '[^c]ei' /usr/share/dict/words
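One detail of the ‘q’ search is easy to miss: [^u] still has to match some character. The sketch below uses a made-up three-word list to show that a ‘q’ at the very end of a line is not reported, because nothing follows it.

```shell
# Made-up word list: q followed by u, q followed by i, and q at end of line.
words=$(printf '%s\n' 'queen' 'Iraqi' 'Iraq')

# A q followed by one character that is not u.
found=$(printf '%s\n' "$words" | grep 'q[^u]')

printf '%s\n' "$found"
```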
- . matches any single character, including blanks, but not including the end-of-line character.
Example 10: Try This: Regular Expressions - Matching any single character
grep 'd.f' /usr/include/math.h
Can you think of a word that has two ‘z’s separated by a single letter?
grep 'z.z' /usr/share/dict/words
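As a sketch of the ‘z’ puzzle on a stand-in word list (the three words are our own choice), note that the dot must match exactly one character, so two adjacent ‘z’s are not enough.

```shell
# Made-up stand-in word list.
words=$(printf '%s\n' 'pizzazz' 'zigzag' 'jazz')

# z, then exactly one character of any kind, then z.
found=$(printf '%s\n' "$words" | grep 'z.z')

printf '%s\n' "$found"
```

Only “pizzazz” qualifies, through its “zaz”. In “zigzag” the two ‘z’s are two letters apart, and in “jazz” they are adjacent.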
- If \(r\) is a regular expression, then \(r*\) matches zero or more successive strings, each of which matches \(r\).
Example 11: Try This: Regular Expressions - Zero or more repeats
grep 'ER*N' /usr/include/math.h
grep '#.*if' /usr/include/math.h
Note that the behavior of * is very different in regular expressions than it is in file path wildcards.
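To make that difference concrete, here is a sketch on made-up lines. In a regular expression the * applies to the item just before it, so R* means “any number of R’s, possibly none”, and .* means “any run of characters”.

```shell
# Made-up sample lines for each pattern.
caps=$(printf '%s\n' 'EN' 'ERN' 'ERRN' 'EXN')
code=$(printf '%s\n' '#ifdef X' '# endif' 'if only')

# E, then zero or more R's, then N. 'EXN' fails because X is not an R.
star=$(printf '%s\n' "$caps" | grep 'ER*N')

# '#', then any run of characters (possibly empty), then "if".
dot_star=$(printf '%s\n' "$code" | grep '#.*if')

printf '%s\n' "$star" "$dot_star"
```

As a file name wildcard, ER*N would instead mean “ER, then anything, then N”, which would accept a file named ERXN but not one named EN.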
- If \(r\) is a regular expression, then \(r\backslash +\) matches one or more successive strings, each of which matches \(r\).
Example 12: Try This: Regular Expressions - One or more repeats
grep 'ERN' /usr/include/math.h
egrep 'ER+N' /usr/include/math.h
grep 'ER*N' /usr/include/math.h
grep 'ER\+N' /usr/include/math.h
- If \(r\) is a regular expression, then \(r\backslash ?\) matches zero or one strings matching \(r\).
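Example 13: Try This: Regular Expressions - Optional occurrence
egrep 'def' /usr/include/math.h
egrep 'define' /usr/include/math.h
egrep 'def(ine)?' /usr/include/math.h
egrep 'def(ine)?d' /usr/include/math.h
In essence, the ? makes the preceding item optional. (So the first and third commands in this example are actually equivalent.) Because these commands use egrep, the parentheses and the ? are written without backslashes; with plain grep the third command would be written grep 'def\(ine\)\?' /usr/include/math.h.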
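The one-or-more and optional operators can be compared side by side on three made-up lines. This sketch uses grep -E, which is the standard spelling of egrep, and it assumes GNU grep for the backslashed \+ form.

```shell
# Made-up sample lines with zero, one, and two R's.
sample=$(printf '%s\n' 'EN' 'ERN' 'ERRN')

# One or more R's: backslashed in plain grep (GNU), bare with grep -E.
plus_basic=$(printf '%s\n' "$sample" | grep 'ER\+N')
plus_extended=$(printf '%s\n' "$sample" | grep -E 'ER+N')

# Zero or one R.
optional=$(printf '%s\n' "$sample" | grep -E 'ER?N')

printf 'plus:\n%s\noptional:\n%s\n' "$plus_basic" "$optional"
```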
- If \(r\) is a regular expression, then \(\hat{ }r\) matches any string that begins with a substring matching \(r\).
  - The grep and sed programs both use an entire line of text as the string to be searched, so for those programs ^ matches the beginning of a line.
Example 14: Try This: Regular Expressions - Beginning of the string
grep 'ex' /usr/include/math.h
grep '^ex' /usr/include/math.h
and
grep '[nN]' ~/playing/alas.txt
grep '^[nN]' ~/playing/alas.txt
grep '^ *[nN]' ~/playing/alas.txt
- If \(r\) is a regular expression, then \(r$\) matches any string that ends with a substring matching \(r\).
  - The grep and sed programs both use an entire line of text as the string to be searched, so for those programs $ matches the end of a line.
Example 15: Try This: Regular Expressions - End of the string
Try:
grep '_' /usr/include/math.h
grep '_$' /usr/include/math.h
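Both anchors can be sketched together on three made-up lines, each of which contains “ex” somewhere and an underscore or not in different positions.

```shell
# Made-up sample lines.
sample=$(printf '%s\n' 'extern double cos' 'next_' 'index_of x')

anywhere=$(printf '%s\n' "$sample" | grep 'ex')   # "ex" anywhere in the line
at_start=$(printf '%s\n' "$sample" | grep '^ex')  # "ex" at the beginning only
at_end=$(printf '%s\n' "$sample" | grep '_$')     # underscore at the end only

printf '%s\n' "$at_start" "$at_end"
```

Without anchors all three lines match 'ex'. Each anchor narrows the result to a single line.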
With so many special characters, you might wonder just what you’re supposed to do if you really want to search for lines containing a “*”, or a “?”, or a … The answer is given in our final rule:
- If c is a special character, then \c matches that character in a string.
Example 16: Try This: Regular Expressions - Quoting special characters
grep '.$' ~/playing/alas.txt
grep '\.$' ~/playing/alas.txt
grep '3.' /usr/include/math.h
grep '3\.' /usr/include/math.h
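The effect of the backslash can be sketched on two made-up lines, only one of which has a real period after the 3.

```shell
# Made-up sample lines.
sample=$(printf '%s\n' 'pi is 3.14' 'page 314')

any_char=$(printf '%s\n' "$sample" | grep '3.')   # 3 followed by any character
real_dot=$(printf '%s\n' "$sample" | grep '3\.')  # 3 followed by a literal period

printf 'any:\n%s\nliteral:\n%s\n' "$any_char" "$real_dot"
```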
This is by no means an exhaustive list of all the regular expression operations, but it’s probably enough for most purposes.
3 Sed Redux
sed has some regular expression features not useful in grep. In particular, if you place part of a regular expression inside parentheses (written as \( and \)), then in the replacement string you can refer to whatever got matched by the parenthesized part of the expression via a back reference. If you have just one parenthesized expression, the back reference is \1. If you add another parenthesized expression, you can refer back to it as \2, and so on.
Example 17: Try This: Back References in sed
cd ~/playing
cp ~cs252/Assignments/ftpAsst/names.txt .
cat names.txt
sed 's/\([A-Za-z]\+\) /*/' names.txt
sed 's/\([A-Za-z]\+\) \(.*\)/*/' names.txt
The first parenthesized expression matches a block of one or more alphabetic characters. That is followed by a blank, which matches a blank in the file. The second parenthesized expression then matches a block of 0 or more of any character, in effect swallowing up the rest of the line.
sed 's/\([A-Za-z]\+\) \(.*\)/\1/' names.txt
sed 's/\([A-Za-z]\+\) \(.*\)/\2/' names.txt
sed 's/\([A-Za-z]\+\) \(.*\)/\2, \1/' names.txt
Note how, in the final replacement pattern, we reversed the order of the two matched blocks, as well as inserting a comma between them.
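If names.txt is not available, the swap can be sketched on two made-up names. As an assumption for portability, this version writes “one or more letters” as [A-Za-z][A-Za-z]*, which has the same meaning as [A-Za-z]\+ but does not rely on the GNU \+ extension.

```shell
# Made-up sample names, one "first last" pair per line.
names=$(printf '%s\n' 'Ada Lovelace' 'Alan Turing')

# \1 is the first word; \2 is everything after the blank.
first=$(printf '%s\n' "$names" | sed 's/\([A-Za-z][A-Za-z]*\) \(.*\)/\1/')
swapped=$(printf '%s\n' "$names" | sed 's/\([A-Za-z][A-Za-z]*\) \(.*\)/\2, \1/')

printf '%s\n' "$first" "$swapped"
```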