Using Patterns in Commands

Last modified: Dec 22, 2017

In the command examples we have used so far, we have always written a single file path or a single text string. In many cases, however, we want to supply commands with a whole list of files or text strings. Typing out the whole list, one at a time, would be tedious, so we usually write some kind of pattern that describes multiple items instead.

1 Wild Cards

Whenever we have a command that can take multiple filenames, we can often write a single pattern for several files. Patterns for file names use wildcard characters, the most common of which is "*", which tells the shell (the program that reads your keyboard input, determines what command or program you want to run, then launches that program) to substitute any combination of zero or more characters that results in an existing file name.

Example 1: Try This: Wild cards in common commands
ls ~/playing
rm ~/playing/*
ls ~/playing

What files were matched by the wildcard pattern in the rm command?

ls /usr/include

Notice that there are a number of files ending with .h

cp /usr/include/m*.h /usr/include/s*.h ~/playing
ls ~/playing
ls ~/playing/se*
ls ~/playing/sq*.h
ls ~/playing/se* ~/playing/sq*.h

Again, note the use of the wildcard to form a pattern for multiple file names. In cases like this, where there are multiple possible matches, the shell forms a list of all the matches. So

* the earlier rm command actually saw a list of all the files in the ~/playing directory.
* The cp command saw all the files in the /usr/include directory whose names began with “m” or “s” and ended with “.h”.
* The various ls commands saw restricted sets of files based upon the non-special characters intermixed with the wildcards.

One good way to figure out what files will match a wildcard pattern is to use the echo command. echo simply prints out its arguments. But since the arguments in the command line are processed by the shell before invoking the echo program, any wildcard patterns will have already been expanded.
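As a minimal sketch of this preview technique, the commands below build a scratch directory and let echo show what two patterns expand to. The directory and the file names in it are hypothetical, chosen only for illustration.

```shell
# Hypothetical scratch directory and file names, for illustration only.
demo=$(mktemp -d)
touch "$demo/math.h" "$demo/stdio.h" "$demo/string.h" "$demo/notes"

# The shell expands each pattern before echo runs, so echo just prints
# the list of matching names that any other command would have received.
all=$(cd "$demo" && echo *)
s_headers=$(cd "$demo" && echo s*.h)

echo "$all"        # math.h notes stdio.h string.h
echo "$s_headers"  # stdio.h string.h

rm -rf "$demo"
```

Running echo with a pattern first is a cheap safety check before handing the same pattern to a destructive command such as rm.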

Example 2: Try This: Showing the effects of a wildcard pattern
ls /usr/include
echo /usr/include
echo /usr/include/*.*
echo /usr/include/*

The difference between the last two may be subtle. The “*.*” pattern will match only files whose names contain a “.”. Unlike Windows, Unix does not require file names to end with a period and a three-letter extension. Some sort of period and extension is common, but directory names and executable programs often have no extension and no period. (In Windows, you can create a file with an empty extension, but Windows insists on adding a period at the end.)

echo /usr/include/f*.*
echo /usr/include/*f*.*
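The difference between “*” and “*.*” can be seen in a small sketch like the following. It assumes a hypothetical scratch directory holding two names with a period and two without.

```shell
# Hypothetical scratch directory: two names contain a period, two do not.
demo=$(mktemp -d)
touch "$demo/a.out" "$demo/stdio.h" "$demo/notes"
mkdir "$demo/sys"

every_name=$(cd "$demo" && echo *)     # any name at all
with_period=$(cd "$demo" && echo *.*)  # only names containing a "."

echo "$every_name"   # a.out notes stdio.h sys
echo "$with_period"  # a.out stdio.h

rm -rf "$demo"
```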

Many of the commands that we have already looked at will allow you to specify multiple files to operate on at one time. The easiest way to name multiple files is usually to use wildcards.

Example 3: Try This: Operating on multiple files at once

Do:

man grep

Look at the explanations of the -i and -l flags. Then observe the differences among the following outputs:

grep LARGE /usr/include/*.h
grep -i LARGE /usr/include/*.h
grep -l LARGE /usr/include/*.h
grep -i -l LARGE /usr/include/*.h
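If the output from /usr/include is too long to compare by eye, a sketch on three tiny files shows the same effects. The file names and contents here are made up for illustration.

```shell
# Hypothetical header files, created only for this illustration.
demo=$(mktemp -d)
printf '%s\n' '#define LARGE 1' 'int large_count;' > "$demo/a.h"
printf '%s\n' 'int Large_total;' > "$demo/b.h"
printf '%s\n' 'int small;' > "$demo/c.h"

plain=$(cd "$demo" && grep LARGE *.h)          # exact-case matches, with file names
names=$(cd "$demo" && grep -l LARGE *.h)       # -l: list only the file names
names_i=$(cd "$demo" && grep -i -l LARGE *.h)  # -i: ignore case as well

echo "$plain"    # a.h:#define LARGE 1
echo "$names"    # a.h
echo "$names_i"  # a.h and b.h, one per line

rm -rf "$demo"
```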

2 Regular Expressions

If wildcards provide a way to write patterns for file and directory paths, can we also write patterns for text strings? Yes, but this is not built into the shell for use by every command, the way that wildcards are. Instead, most Unix programs and commands that do some kind of searching or matching for text will share a common notation for patterns of text to be matched. This notation is called regular expressions.

For example, almost every text editor in any operating system will allow you to search a file for a given string. But most Unix text editors (including the emacs editor we’ll study later) will allow you to search for any string matching a regular expression “pattern”. sed, a useful utility for doing simple changes to text files, is most often invoked to use its “substitute” command, which replaces any text matching a regular expression by some desired replacement text. The csplit command splits a single file into multiple pieces, where the point of division is most often indicated via a regular expression. Perl and awk, available on most Unix systems but not covered in this course, are scripting (programming) languages with a heavy emphasis on text manipulation, which is accomplished largely through matching on regular expressions.

2.1 Regular Expressions and grep

In an earlier example, we saw that the program grep can be used to list all lines of a file that match a given string. For example,

grep 'def' /usr/include/math.h

would list all lines in the indicated file that contain the string “def”.

The first parameter (‘def’) is actually an example of a regular expression, a special notation for writing patterns for searching and matching text. The above example works because of the way that regular expressions are composed. The principal rules of regular expressions are:

Example 4: Try This: Regular expressions - basic characters
grep d /usr/include/math.h

Since grep works line by line, this would select every line containing a “d”.

The real power of regular expressions comes into play when we consider the various “special” characters that serve as regular expression operators.

Example 5: Try This: Regular expressions - parentheses
 

Based on the definition of parentheses, these two commands should do exactly the same thing:

egrep 'def' /usr/include/math.h
egrep '(def)' /usr/include/math.h

The introduction of the parentheses does not change what is matched. It does, however, group things together (just as parentheses do in conventional algebra), which we can take advantage of with the operators we will introduce shortly.

Try:

grep '(' /usr/include/math.h
egrep '(' /usr/include/math.h
egrep '\(' /usr/include/math.h

“(” is a special character to egrep, but not to grep. To take away the “special-ness”, we can quote the parenthesis character with a backslash when presenting it to egrep.
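The same contrast can be sketched on a two-line sample file. The file contents are hypothetical, and the sketch uses grep -E, the standard option that selects the same extended syntax as egrep.

```shell
# Hypothetical two-line sample file.
sample=$(mktemp)
printf '%s\n' 'double sqrt(double x);' '#define PI 3.14' > "$sample"

basic=$(grep '(' "$sample")           # "(" is an ordinary character to grep
extended=$(grep -E '\(' "$sample")    # must be escaped in the extended syntax
grouped=$(grep -E '(def)' "$sample")  # unescaped: groups, matches the text "def"

echo "$basic"     # double sqrt(double x);
echo "$extended"  # double sqrt(double x);
echo "$grouped"   # #define PI 3.14

rm -f "$sample"
```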

Example 6: Try This: Regular expressions - brackets

Try these commands:

grep math /usr/include/math.h
grep '[mM][aA][tT][hH]' /usr/include/math.h
grep -i math /usr/include/math.h

The regular expression [mM] matches any single character that is either “m” or “M”. It’s sometimes useful, but in this case it is a rather clumsy way to make our match case-insensitive.

Doing case-insensitive matches is so common that grep provides a shortcut. The -i flag makes all character matches case insensitive, so the last two commands produce the same output.
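A short sketch, using a hypothetical three-line sample, confirms that the bracketed pattern and the -i flag select the same lines:

```shell
# Hypothetical sample text.
sample=$(mktemp)
printf '%s\n' '#include <math.h>' '#ifndef _MATH_H' 'int x;' > "$sample"

exact=$(grep math "$sample")                    # lower-case "math" only
bracketed=$(grep '[mM][aA][tT][hH]' "$sample")  # either case, letter by letter
insensitive=$(grep -i math "$sample")           # either case, via the flag

echo "$exact"        # #include <math.h>
echo "$insensitive"  # the #include line and the #ifndef line

rm -f "$sample"
```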

Example 7: Try This: Regular expressions - brackets containing special characters

Compare the output from

grep '(' /usr/include/math.h
grep ' (' /usr/include/math.h
grep '[ (]' /usr/include/math.h
grep '((' /usr/include/math.h
grep '[ (](' /usr/include/math.h

Example 8: Try This: grep and -v

What lines are not listed by the following?

grep '_[a-zA-Z]' /usr/include/math.h

If you can’t figure it out, you can check with the command

grep -v '_[a-zA-Z]' /usr/include/math.h

The -v option causes grep to list only those lines that do not match the pattern.
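As a sketch of how -v complements an ordinary search, the hypothetical sample below has two lines that match the pattern and one that does not:

```shell
# Hypothetical sample text.
sample=$(mktemp)
printf '%s\n' 'int _count;' 'int total;' '#define _X 1' > "$sample"

matching=$(grep -c '_[a-zA-Z]' "$sample")  # -c counts the matching lines
others=$(grep -v '_[a-zA-Z]' "$sample")    # -v prints the lines that do not match

echo "$matching"  # 2
echo "$others"    # int total;

rm -f "$sample"
```

Every line of the file appears in exactly one of the two outputs (with or without -v), never both.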

Example 9: Try This: Regular expressions - brackets with ^
grep 'def' /usr/include/math.h
grep 'def[i]' /usr/include/math.h
grep 'def[^i]' /usr/include/math.h
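One detail worth seeing in a sketch: [^i] still has to match some character, so a “def” at the very end of a line is selected by neither bracketed pattern. The sample lines are hypothetical.

```shell
# Hypothetical sample text; the last line ends immediately after "def".
sample=$(mktemp)
printf '%s\n' '#define A' '#ifdef B' 'undef' > "$sample"

with_i=$(grep 'def[i]' "$sample")      # "def" followed by an "i"
without_i=$(grep 'def[^i]' "$sample")  # "def" followed by any character except "i"
any_def=$(grep -c 'def' "$sample")     # all three lines contain "def"

echo "$with_i"     # #define A
echo "$without_i"  # #ifdef B
echo "$any_def"    # 3

rm -f "$sample"
```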

Example 10: Try This: Regular Expressions - Matching any single character
grep 'd.f' /usr/include/math.h

Example 11: Try This: Regular Expressions - Zero or more repeats
grep 'ER*N' /usr/include/math.h
grep '#.*if' /usr/include/math.h

Note that the behavior of * is very different in regular expressions than it is in file path wildcards.
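To make the contrast concrete, here is a sketch on a hypothetical four-line sample. In a regular expression, * repeats the item before it; the closest equivalent of the wildcard * is the two-character combination “.*”.

```shell
# Hypothetical sample text.
sample=$(mktemp)
printf '%s\n' 'EN' 'ERN' 'ERRRN' 'EXN' > "$sample"

star_r=$(grep -c 'ER*N' "$sample")    # E, then zero or more R's, then N
dot_star=$(grep -c 'E.*N' "$sample")  # E, then anything at all, then N

echo "$star_r"    # 3 (EXN is not matched)
echo "$dot_star"  # 4

rm -f "$sample"
```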

Example 12: Try This: Regular Expressions - One or more repeats
grep 'ERN' /usr/include/math.h
egrep 'ER+N' /usr/include/math.h
grep 'ER*N' /usr/include/math.h
grep 'ER\+N' /usr/include/math.h

(The use of the + operator in regular expressions is fairly common. However, in some early standards, it is defined as an “extended” operator available in programs like egrep but not grep. In later years, it was added into programs like grep and, as we’ll see later, sed, but to preserve compatibility with people who had got used to + being an ordinary character in those programs, the “one or more of” operator in grep and sed was made “\+”.)
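The sketch below compares three spellings of “one or more R's” on a hypothetical sample. It uses grep -E for the extended syntax. The \+ form is an extension to the basic syntax (GNU grep supports it), so the sketch also shows RR*, a spelling that should work in any grep.

```shell
# Hypothetical sample text.
sample=$(mktemp)
printf '%s\n' 'EN' 'ERN' 'ERRN' > "$sample"

ere=$(grep -E 'ER+N' "$sample")    # extended syntax: + means one or more
portable=$(grep 'ERR*N' "$sample") # basic syntax: one R, then zero or more R's
gnu=$(grep 'ER\+N' "$sample")      # basic syntax with \+ (assumes GNU grep)

echo "$ere"       # ERN and ERRN, one per line
echo "$portable"  # the same two lines

rm -f "$sample"
```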

Example 13: Try This: Regular Expressions - Optional occurrence
egrep 'def' /usr/include/math.h
egrep 'define' /usr/include/math.h
egrep 'def(ine)?' /usr/include/math.h
egrep 'def(ine)?d' /usr/include/math.h

In essence, the ? makes the preceding item optional. (So the first and third commands in this example are actually equivalent.)
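A sketch on a hypothetical four-line sample shows the optional group at work. It uses grep -E, which selects the same extended syntax as egrep.

```shell
# Hypothetical sample text.
sample=$(mktemp)
printf '%s\n' '#ifdef X' '#define Y' 'defd' 'defined' > "$sample"

plain=$(grep -E -c 'def' "$sample")           # every line contains "def"
optional=$(grep -E -c 'def(ine)?' "$sample")  # same lines: "ine" may be absent
then_d=$(grep -E 'def(ine)?d' "$sample")      # "def", optional "ine", then "d"

echo "$plain"     # 4
echo "$optional"  # 4
echo "$then_d"    # defd and defined, one per line

rm -f "$sample"
```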

Example 14: Try This: Regular Expressions - Beginning of the string
grep 'ex' /usr/include/math.h
grep '^ex' /usr/include/math.h

Example 15: Try This: Regular Expressions - End of the string
grep '_' /usr/include/math.h
grep '_$' /usr/include/math.h
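The two anchors can be sketched together on a hypothetical sample: ^ ties the match to the start of the line, and $ ties it to the end.

```shell
# Hypothetical sample text.
sample=$(mktemp)
printf '%s\n' 'extern int x;' 'int extern_y;' '#define FOO_' '#define _BAR' > "$sample"

starts=$(grep '^ex' "$sample")  # "ex" only at the beginning of a line
ends=$(grep '_$' "$sample")     # "_" only at the end of a line

echo "$starts"  # extern int x;
echo "$ends"    # #define FOO_

rm -f "$sample"
```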

With so many special characters, you might wonder just what you’re supposed to do if you really want to search for lines containing a “*”, or a “?”, or a … The answer is given in our final rule:

Example 16: Try This: Regular Expressions - Quoting special characters
grep '3.' /usr/include/math.h
grep '3\.' /usr/include/math.h
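As a sketch of why the backslash matters, compare the two patterns on a hypothetical sample in which only one line contains a literal “3.”:

```shell
# Hypothetical sample text.
sample=$(mktemp)
printf '%s\n' 'x = 3.14' 'y = 30' 'z = 3' > "$sample"

any_char=$(grep '3.' "$sample")  # "3" followed by any single character
literal=$(grep '3\.' "$sample")  # "3" followed by an actual period

echo "$any_char"  # x = 3.14 and y = 30, one per line
echo "$literal"   # x = 3.14

rm -f "$sample"
```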

This is by no means an exhaustive list of all the regular expression operations, but it’s probably enough for most purposes.

2.2 Regular Expressions and sed

As noted earlier, many other commands besides grep will use regular expressions. sed, for example, allows you to enter a variety of editing commands that will be applied to every line of a file. A common use of sed is to scan each line of the file for a pattern and to replace that pattern, wherever it occurs, by some string. The sed command to do this is

sed s/pattern/replacement/g filename

where filename is the file whose contents we want to scan and replace, pattern is a regular expression describing the text to search for in each line, and replacement is the text by which we wish to replace anything that matches the pattern. (The ‘/’ characters are simply necessary to indicate the beginning and end of the pattern and replacement strings. They can be replaced by any character that does not appear in either the pattern or replacement strings.)
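Choosing a different delimiter is most useful when the pattern itself contains slashes, as file paths do. A minimal sketch, in which the input line and the /opt/include replacement path are hypothetical:

```shell
# Hypothetical input line; /opt/include is a made-up replacement path.
line='#include "/usr/include/math.h"'

# Using | as the delimiter avoids having to escape every / in the paths.
changed=$(printf '%s\n' "$line" | sed 's|/usr/include|/opt/include|g')

echo "$changed"  # #include "/opt/include/math.h"
```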

Example 17: Try This: Substitutions with sed
cd ~/playing
cp ~cs252/Assignments/ftpAsst/alas.txt .
more alas.txt

Now try operating on that file with sed:

sed s/e/Q/g alas.txt

for a simple text-to-text substitution, and

sed 's/[aeiou]/X/g' alas.txt

to see that the pattern really is interpreted as a regular expression. (The quotes are required because, as explained in the next section, the ‘[’ and ‘]’ characters are normally considered special characters by the shell, and the quotes tell the shell to leave those characters alone, so that sed can actually see them.)

The ‘g’ at the end of each of the prior examples indicates that the change should be applied every time a match is found. If the ‘g’ is dropped, only the first match in each line will be replaced.

Example 18: Try This: Substitutions (first occurrence only) with sed
sed 's/[aeiou]/X/' alas.txt

to see the effect of dropping the ‘g’.
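If alas.txt is not at hand, the effect of the ‘g’ can be sketched on a single line of text. The line below is a stand-in, not the actual contents of that file.

```shell
# Stand-in text; not the actual contents of alas.txt.
line='alas poor yorick'

every=$(printf '%s\n' "$line" | sed 's/[aeiou]/X/g')  # with g: every vowel
first=$(printf '%s\n' "$line" | sed 's/[aeiou]/X/')   # without g: first vowel only

echo "$every"  # XlXs pXXr yXrXck
echo "$first"  # Xlas poor yorick
```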

sed has some regular expression features not useful in grep. In particular, if you place part of a regular expression inside parentheses (written as \( and \) ), then in the replacement string you can refer to whatever got matched by the parenthesized part of the expression via a back reference. If you have just one parenthesized expression, the back reference is \1. If you add another parenthesized expression, you can refer back to it as \2, and so on.

Example 19: Try This: Back References in sed
cd ~/playing
cp ~cs252/Assignments/ftpAsst/names.txt .
cat names.txt
sed 's/\([A-Za-z]\+\) \(.*\)/\2, \1/' names.txt

The first parenthesized expression matches a block of one or more alphabetic characters. That is followed by a blank, which matches a blank in the file. The second parenthesized expression then matches a block of 0 or more of any character, in effect swallowing up the rest of the line.

Note how, in the replacement pattern, we reversed the order of the two matched blocks, as well as inserting a comma between them.
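The same swap can be sketched without names.txt. The names below are placeholders, not the contents of that file, and the sketch writes “one or more letters” as [A-Za-z][A-Za-z]*, a spelling that does not rely on sed supporting \+.

```shell
# Placeholder names; not the actual contents of names.txt.
names=$(printf '%s\n' 'Jane Doe' 'John Quincy Public')

# \1 is the first word; \2 is everything after the first blank.
swapped=$(printf '%s\n' "$names" |
    sed 's/\([A-Za-z][A-Za-z]*\) \(.*\)/\2, \1/')

echo "$swapped"
# Doe, Jane
# Quincy Public, John
```

Because the first group cannot match a blank, it always stops at the end of the first word, leaving any middle names in the second group.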