Using Patterns in Commands
In the command examples we have used so far, we have always written a single file path or a single text string. In many cases, however, we want to supply commands with a whole list of files or text strings. Typing out the whole list, one at a time, would be tedious, so we usually write some kind of pattern that describes multiple items instead.
1 Wild Cards
Whenever we have a command that can take multiple filenames, we can often write a single pattern for several files. Patterns for file names use wildcard characters, the most common of which is "*", which tells the shell (the program that reads your keyboard input, determines what command or program you want to run, then launches that program) to substitute any combination of zero or more characters that results in an existing file name.
Example 1: Try This: Wild cards in common commands
ls ~/playing
rm ~/playing/*
ls ~/playing
What files were matched by the wildcard pattern in the rm command?
ls /usr/include
Notice that there are a number of files ending with .h
cp /usr/include/m*.h /usr/include/s*.h ~/playing
ls ~/playing
ls ~/playing/se*
ls ~/playing/sq*.h
ls ~/playing/se* ~/playing/sq*.h
Again, note the use of the wildcard to form a pattern for multiple file names. In cases like this, where there are multiple possible matches, the shell forms a list of all the matches. So
- The earlier rm command actually saw a list of all the files in the ~/playing directory.
- The cp command saw all the files in the /usr/include directory whose names began with “m” or “s” and ended with “.h”.
- The various ls commands saw restricted sets of files based upon the non-special characters intermixed with the wildcards.
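To see that it is the shell, not the command, that does the expansion, here is a minimal sketch that works in a scratch directory. The directory, the file names, and the helper function count_args are all made up for illustration.

```shell
# Work in a throwaway directory so no real files are touched.
scratch=$(mktemp -d)
touch "$scratch/math.h" "$scratch/stdio.h" "$scratch/string.h" "$scratch/notes.txt"

# Hypothetical helper: reports how many arguments it was given.
count_args() { echo "$#"; }

# The shell replaces each pattern with the list of matching names
# before count_args ever runs, so the helper only sees plain file names.
all_count=$(count_args "$scratch"/*)
s_header_count=$(count_args "$scratch"/s*.h)
echo "all files: $all_count, s*.h files: $s_header_count"

rm -r "$scratch"
```

The helper never sees a “*” at all, only the file names the shell substituted for it.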
One good way to figure out what files will match a wildcard pattern is to use the echo command. echo simply prints out its arguments. But since the arguments in the command line are processed by the shell before invoking the echo program, any wildcard patterns will have already been expanded.
Example 2: Try This: Showing the effects of a wildcard pattern
ls /usr/include
echo /usr/include
echo /usr/include/*.*
echo /usr/include/*
The difference between the last two may be subtle. The “*.*” pattern will match only files that contain a “.”. Unlike Windows, Unix does not require file names to end with a period and a three-letter extension. Some sort of period and extension is common, but directory names and executable programs often have no extension and no period. (In Windows, you can create a file with an empty extension, but Windows insists on adding a period at the end.)
echo /usr/include/f*.*
echo /usr/include/*f*.*
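The difference between the two patterns is easier to see in a small directory. The following is a sketch only; the scratch directory and the four file names are invented for the purpose.

```shell
# Throwaway directory with made-up names, some with a period and some without.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch README Makefile math.h notes.txt

all_names=$(echo *)        # every non-hidden name in the directory
dotted_names=$(echo *.*)   # only the names that contain a period

echo "with *:   $all_names"
echo "with *.*: $dotted_names"

# Return to where we started and clean up.
cd "$OLDPWD" || exit 1
rm -r "$scratch"
```

README and Makefile appear only in the first list, because neither name contains a period.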
Many of the commands that we have already looked at will allow you to specify multiple files to operate on at one time. The easiest way to name multiple files is usually to use wildcards.
Example 3: Try This: Operating on multiple files at once
Do:
man grep
Look at the explanations of the -i and -l flags. Then observe the differences among the following outputs:
grep LARGE /usr/include/*.h
grep -i LARGE /usr/include/*.h
grep -l LARGE /usr/include/*.h
grep -i -l LARGE /usr/include/*.h
2 Regular Expressions
If wildcards provide a way to write patterns for file and directory paths, can we also write patterns for text strings? Yes, but this is not built into the shell for use by every command, the way that wildcards are. Instead, most Unix programs and commands that do some kind of searching or matching for text will share a common notation for patterns of text to be matched. This notation is called regular expressions. For example, almost every text editor in any operating system will allow you to search a file for a given string. But most Unix text editors (including the emacs editor we’ll study later) will allow you to search for any string matching a regular expression “pattern”. sed, a useful utility for doing simple changes to text files, is most often invoked to use its “substitute” command, which replaces any text matching a regular expression by some desired replacement text. The csplit command splits a single file into multiple pieces, where the point of division is most often indicated via a regular expression. Perl and awk, available on most Unix systems but not covered in this course, are scripting (programming) languages with a heavy emphasis on text manipulation, which is accomplished largely through matching on regular expressions.
2.1 Regular Expressions and grep
In an earlier example, we saw that the program grep can be used to list all lines of a file that match a given string. For example,
grep 'def' /usr/include/math.h
would list all lines in the indicated file that contain the string “def”.
The first parameter (‘def’) is actually an example of a regular expression, a special notation for writing patterns for searching and matching text. The above example works because of the way that regular expressions are composed. The principal rules of regular expressions are:
- A regular expression consisting of a single “non-special” character will match any string containing that character.
As it happens, none of the alphabetic and numeric characters are “special”, so the regular expression d would match any string containing a “d”.
Example 4: Try This: Regular expressions - basic characters
grep d /usr/include/math.h
Since grep works line by line, this would select every line containing a “d”.
- If a set of regular expressions \( r_1, r_2, \ldots, r_k \) are concatenated together to form a single larger regular expression \( r_{1}r_{2}\ldots r_{k} \), it matches any string that contains a substring formed from a concatenation of strings \( s_{1}s_{2}\ldots s_{k} \), each of which matches the corresponding regular expression.
So when we write
grep def /usr/include/math.h
the def is actually the concatenation of three regular expressions, d, e, and f, and matches any string that contains a substring matching d followed immediately by a substring matching e followed immediately by a substring matching f.
That may seem an unnecessarily complex way to get to the original idea of “matches the string ‘def’,” but this idea of concatenation is a general one that becomes more important as we consider other combinations of smaller regular expressions.
The real power of regular expressions comes into play when we consider the various “special” characters that serve as regular expression operators.
- Regular expressions can be grouped in parentheses, written as (…).
Example 5: Try This: Regular expressions - parentheses
Based on the definition of parentheses, these two commands should do exactly the same thing:
egrep 'def' /usr/include/math.h
egrep '(def)' /usr/include/math.h
The introduction of the parentheses does not change what is matched. It does, however, group things together (just as parentheses do in conventional algebra), which we can take advantage of with the operators we will introduce shortly.
Try:
grep '(' /usr/include/math.h
egrep '(' /usr/include/math.h
egrep '\(' /usr/include/math.h
“(” is a special character to egrep, but not to grep. To take away the “special-ness”, we can quote the parenthesis character when presenting it to egrep.
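The same contrast can be reproduced on a couple of lines of text without relying on any particular header file. This is a sketch with invented sample lines; it spells egrep as grep -E, which selects the same extended syntax (on some recent GNU systems the egrep name prints an obsolescence warning).

```shell
# Two made-up lines in the style of a C header.
sample='double sqrt(double x);
#define PI 3.14'

# Basic syntax (plain grep): "(" is an ordinary character.
basic_hits=$(printf '%s\n' "$sample" | grep -c '(')

# Extended syntax (grep -E): "\(" stands for a literal parenthesis.
extended_hits=$(printf '%s\n' "$sample" | grep -E -c '\(')

echo "basic: $basic_hits, extended: $extended_hits"
```

Both counts refer to the one line that contains a parenthesis; only the spelling of the pattern differs between the two syntaxes.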
- Square brackets [ ] containing any set of characters not beginning with ^ will match a string containing any one of those characters.
Example 6: Try This: Regular expressions - brackets
Try these commands:
grep math /usr/include/math.h
grep '[mM][aA][tT][hH]' /usr/include/math.h
grep -i math /usr/include/math.h
The regular expression [mM] matches any single character that is either “m” or “M”. It’s sometimes useful, but in this case it is a rather clumsy way to make our match case-insensitive.
Doing case-insensitive matches is so common that grep provides a shortcut. The -i flag makes all character matches case insensitive, so the last two commands are equivalent.
Example 7: Try This: Regular expressions - brackets containing special characters
Compare the output from
grep '(' /usr/include/math.h
grep ' (' /usr/include/math.h
grep '[ (]' /usr/include/math.h
grep '((' /usr/include/math.h
grep '[ (](' /usr/include/math.h
- The character set inside regular expression square brackets can be abbreviated by giving a range of characters separated by a hyphen. For example, [a-z] would match any one of the lower-case alphabetic characters.
Example 8: Try This: grep and -v
What lines are not listed by the following?
grep '_[a-zA-Z]' /usr/include/math.h
If you can’t figure it out, you can check with the command
grep -v '_[a-zA-Z]' /usr/include/math.h
The -v option causes grep to list only those lines that do not match the pattern.
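Because every line either matches or does not, the two commands split the input between them. A small sketch on four made-up lines (the sample text is an assumption, not taken from math.h); the -c flag asks grep for a count of selected lines instead of the lines themselves.

```shell
# Four invented lines: only the first has "_" followed by a letter.
sample='_foo
bar_
_9lives
plain'

matching=$(printf '%s\n' "$sample" | grep -c '_[a-zA-Z]')
non_matching=$(printf '%s\n' "$sample" | grep -v -c '_[a-zA-Z]')

echo "matching: $matching, not matching: $non_matching"
```

The two counts add up to the number of input lines.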
- Square brackets containing any set of characters beginning with ^, written [^…], will match a string containing any character not in that set.
Example 9: Try This: Regular expressions - brackets with ^
grep 'def' /usr/include/math.h
grep 'def[i]' /usr/include/math.h
grep 'def[^i]' /usr/include/math.h
- “.” matches any single printable character, including blanks, but not including the end-of-line character.
Example 10: Try This: Regular Expressions - Matching any single character
grep 'd.f' /usr/include/math.h
- If \(r\) is a regular expression, then \(r*\) matches zero or more successive strings, each of which matches \(r\).
Example 11: Try This: Regular Expressions - Zero or more repeats
grep 'ER*N' /usr/include/math.h
grep '#.*if' /usr/include/math.h
Note that the behavior of * is very different in regular expressions than it is in file path wildcards.
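The contrast shows up clearly when the same characters, ER*N, are used once as a wildcard and once as a regular expression. This is a sketch under assumptions: the scratch directory and the four file names are invented so that the two interpretations pick out different sets.

```shell
scratch=$(mktemp -d)
touch "$scratch/EN" "$scratch/ERN" "$scratch/ERRN" "$scratch/EXN"

# As a shell wildcard, ER*N means: "ER", then any characters, then "N".
glob_matches=$(cd "$scratch" && echo ER*N)

# As a regular expression, ER*N means: "E", then zero or more "R", then "N".
# paste joins grep's output lines into one space-separated line.
regex_matches=$(cd "$scratch" && printf '%s\n' * | grep 'ER*N' | paste -s -d ' ' -)

echo "wildcard: $glob_matches"
echo "regex:    $regex_matches"

rm -r "$scratch"
```

The name EN is picked out only by the regular expression: the wildcard requires a literal “R” after the “E”, while the regular expression allows zero of them.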
- If \(r\) is a regular expression, then \(r+\) matches one or more successive strings, each of which matches \(r\).
Example 12: Try This: Regular Expressions - One or more repeats
grep 'ERN' /usr/include/math.h
egrep 'ER+N' /usr/include/math.h
grep 'ER*N' /usr/include/math.h
grep 'ER\+N' /usr/include/math.h
(The use of the + operator in regular expressions is fairly common. However, in some early standards, it is defined as an “extended” operator available in programs like egrep but not grep. In later years, it was added into programs like grep and, as we’ll see later, sed, but to preserve compatibility with people who had got used to + being an ordinary character in those programs, the “one or more of” operator in grep and sed was made “\+”.)
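The different spellings of “one or more” can be compared on a few made-up lines. This sketch assumes GNU grep, where the backslashed form is accepted in the basic syntax as an extension; other grep implementations may not support it. The sample lines are invented, and grep -E is used as the equivalent of egrep.

```shell
# Four invented lines; the last one contains a literal plus sign.
sample='EN
ERN
ERRN
ER+N'

# Extended syntax: + means "one or more of the preceding item".
extended_plus=$(printf '%s\n' "$sample" | grep -E -c 'ER+N')

# Basic syntax: an unescaped + is just an ordinary character.
basic_plain_plus=$(printf '%s\n' "$sample" | grep -c 'ER+N')

# Basic syntax with a backslash (GNU extension): "one or more" again.
basic_escaped_plus=$(printf '%s\n' "$sample" | grep -c 'ER\+N')

echo "$extended_plus $basic_plain_plus $basic_escaped_plus"
```

With the unescaped + in the basic syntax, only the line containing an actual “+” is selected.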
- If \(r\) is a regular expression, then \(r?\) matches zero or one strings matching \(r\).
Example 13: Try This: Regular Expressions - Optional occurrence
egrep 'def' /usr/include/math.h
egrep 'define' /usr/include/math.h
egrep 'def(ine)?' /usr/include/math.h
egrep 'def(ine)?d' /usr/include/math.h
In essence, the ? makes the preceding item optional. (So the first and third commands in this example are actually equivalent.)
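The equivalence can be checked by counting matches on a few sample lines. A sketch only: the four lines are invented, and grep -E stands in for egrep.

```shell
# Invented sample lines, each containing "def" somewhere.
sample='#define X
#undef X
ifdef
defined(X)'

plain=$(printf '%s\n' "$sample" | grep -E -c 'def')
optional=$(printf '%s\n' "$sample" | grep -E -c 'def(ine)?')

# Requiring a "d" after the optional group narrows the match
# to "defd" or "defined".
followed_by_d=$(printf '%s\n' "$sample" | grep -E -c 'def(ine)?d')

echo "$plain $optional $followed_by_d"
```

The optional group changes nothing until something is required after it, as in the last pattern.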
- If \(r\) is a regular expression, then \(\hat{ }r\) matches any string that begins with a substring matching \(r\).
Example 14: Try This: Regular Expressions - Beginning of the string
grep 'ex' /usr/include/math.h
grep '^ex' /usr/include/math.h
- If \(r\) is a regular expression, then \(r$\) matches any string that ends with a substring matching \(r\).
Example 15: Try This: Regular Expressions - End of the string
grep '_' /usr/include/math.h
grep '_$' /usr/include/math.h
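Both anchors can also be tried on a handful of lines where the position of the match matters. The sample lines below are made up for illustration; since grep works line by line, “string” here means a line.

```shell
# Invented lines: note the leading blanks on the second one.
sample='extern int x;
  extern int y;
index_
_index'

anywhere_ex=$(printf '%s\n' "$sample" | grep -c 'ex')
leading_ex=$(printf '%s\n' "$sample" | grep -c '^ex')
anywhere_underscore=$(printf '%s\n' "$sample" | grep -c '_')
trailing_underscore=$(printf '%s\n' "$sample" | grep -c '_$')

echo "$anywhere_ex $leading_ex $anywhere_underscore $trailing_underscore"
```

The indented line is not counted by the ^ pattern, because the line begins with blanks rather than with “ex”.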
With so many special characters, you might wonder just what you’re supposed to do if you really want to search for lines containing a “*”, or a “?”, or a … The answer is given in our final rule:
- If c is a special character, then \c matches that character in a string.
Example 16: Try This: Regular Expressions - Quoting special characters
grep '3.' /usr/include/math.h
grep '3\.' /usr/include/math.h
This is by no means an exhaustive list of all the regular expression operations, but it’s probably enough for most purposes.
2.2 Regular Expressions and sed
As noted earlier, many other commands besides grep will use regular expressions. sed, for example, allows you to enter a variety of editing commands that will be applied to every line of a file. A common use of sed is to scan each line of the file for a pattern and to replace that pattern, wherever it occurs, by some string. The sed command to do this is
sed s/pattern/replacement/g filename
where filename is the file whose contents we want to scan and replace, pattern is a regular expression describing the text to search for in each line, and replacement is the text by which we wish to replace anything that matches the pattern. (The ‘/’ characters are simply necessary to indicate the beginning and end of the pattern and replacement strings. They can be replaced by any character that does not appear in either the pattern or replacement strings.)
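Choosing a different delimiter is most useful when the pattern itself contains slashes, as file paths do. A minimal sketch, reading from standard input instead of a file; the sample line and the /opt/include path are invented for illustration.

```shell
# A made-up line containing a path with several "/" characters.
line='#include "/usr/include/math.h"'

# Using | as the delimiter means the slashes in the paths need no special care.
rewritten=$(printf '%s\n' "$line" | sed 's|/usr/include|/opt/include|g')

echo "$rewritten"
```

With ‘/’ as the delimiter, each slash inside the pattern and replacement would have to be quoted with a backslash.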
Example 17: Try This: Substitutions with sed
cd ~/playing
cp ~cs252/Assignments/ftpAsst/alas.txt .
more alas.txt
Now try operating on that file with sed:
sed s/e/Q/g alas.txt
for a simple text-to-text substitution, and
sed 's/[aeiou]/X/g' alas.txt
to see that the pattern really is interpreted as a regular expression. (The quotes are required because, as explained in the next section, the ‘[’ and ‘]’ characters are normally considered special characters by the shell, and the quotes tell the shell to leave those characters alone, so that sed can actually see them.)
The ‘g’ at the end of each of the prior examples indicates that the change should be applied every time a match is found. If the ‘g’ is dropped, only the first match in each line will be replaced.
Example 18: Try This: Substitutions (first occurrence only) with sed
sed 's/[aeiou]/X/' alas.txt
to see the effect of dropping the ‘g’.
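The two forms can be compared side by side on a single line of text. This sketch uses an invented sample line, not the contents of alas.txt.

```shell
# A made-up line with several vowels.
line='alas, poor yorick'

# Without g: only the first vowel on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/[aeiou]/X/')

# With g: every vowel on the line is replaced.
every_match=$(printf '%s\n' "$line" | sed 's/[aeiou]/X/g')

echo "$first_only"
echo "$every_match"
```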
sed has some regular expression features not useful in grep. In particular, if you place part of a regular expression inside parentheses (written as \( and \)), then in the replacement string you can refer to whatever got matched by the parenthesized part of the expression via a back reference. If you have just one parenthesized expression, the back reference is \1. If you add another parenthesized expression, you can refer back to it as \2, and so on.
Example 19: Try This: Back References in sed
cd ~/playing
cp ~cs252/Assignments/ftpAsst/names.txt .
cat names.txt
sed 's/\([A-Za-z]\+\) \(.*\)/\2, \1/' names.txt
The first parenthesized expression matches a block of 1 or more alphabetic characters (the \+ means “one or more”). That is followed by a blank, which matches a blank in the file. The second parenthesized expression then matches a block of 0 or more of any character, in effect swallowing up the rest of the line.
Note how, in the replacement pattern, we reversed the order of the two matched blocks, as well as inserting a comma between them.
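The same substitution can be tried without the course file. In this sketch the three names are invented and are fed to sed on standard input; it assumes GNU sed, since \+ in the basic syntax is an extension that other versions of sed may not accept.

```shell
# Three made-up lines: two words, three words, and a single word.
swapped=$(printf 'John Smith\nMary Ann Jones\nCher\n' |
    sed 's/\([A-Za-z]\+\) \(.*\)/\2, \1/')

echo "$swapped"
```

Only the first word is captured by \1, so everything after the first blank moves to the front. A line with no blank does not match the pattern at all and passes through unchanged.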