Bash Regular Expressions Example
A regular expression is a pattern that describes a set of strings. Regular expressions are constructed analogously to arithmetic expressions by using various operators to combine smaller expressions.
The fundamental building blocks are the regular expressions that match a single character. Most characters, including all letters and digits, are regular expressions that match themselves. Any metacharacter with special meaning may be quoted by preceding it with a backslash.
1. Regular expression metacharacters
A regular expression may be followed by one of several repetition operators. The table below lists these together with the other common metacharacters:
Operator | Effect |
---|---|
. | Matches any single character. |
? | The preceding item is optional and will be matched, at most, once. |
* | The preceding item will be matched zero or more times. |
+ | The preceding item will be matched one or more times. |
{N} | The preceding item is matched exactly N times. |
{N,} | The preceding item is matched N or more times. |
{N,M} | The preceding item is matched at least N times, but not more than M times. |
- | Represents a range if it’s not first or last in a list and not the ending point of a range in a list. |
^ | Matches the empty string at the beginning of a line; also represents the characters not in the range of a list. |
$ | Matches the empty string at the end of a line. |
\b | Matches the empty string at the edge of a word. |
\B | Matches the empty string provided it’s not at the edge of a word. |
\< | Matches the empty string at the beginning of a word. |
\> | Matches the empty string at the end of a word. |
Two regular expressions may be concatenated; the resulting regular expression matches any string formed by concatenating two substrings that respectively match the concatenated subexpressions.
Two regular expressions may be joined by the infix operator “|”. The resulting regular expression matches any string matching either subexpression.
Repetition takes precedence over concatenation, which in turn takes precedence over alternation. A whole subexpression may be enclosed in parentheses to override these precedence rules.
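The precedence rules can be checked with a short sketch. It assumes a grep that supports the -E (extended regular expressions) and -x (match the whole line) options; the sample words are made up for illustration.

```shell
# Sample input, one word per line (made up for illustration).
words=$(printf '%s\n' ab abb abab cd)

# Repetition binds tighter than concatenation:
# "ab+" means "a" followed by one or more "b".
rep=$(printf '%s\n' "$words" | grep -E -x 'ab+')

# Parentheses override this: "(ab)+" means one or more repetitions of "ab".
grp=$(printf '%s\n' "$words" | grep -E -x '(ab)+')

# Alternation has the lowest precedence: "ab|cd" means "ab" or "cd".
alt=$(printf '%s\n' "$words" | grep -E -x 'ab|cd')

printf 'ab+ matches:\n%s\n' "$rep"
printf '(ab)+ matches:\n%s\n' "$grp"
printf 'ab|cd matches:\n%s\n' "$alt"
```

Here “ab+” selects “ab” and “abb”, “(ab)+” selects “ab” and “abab”, and “ab|cd” selects “ab” and “cd”.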
2. Examples using grep
The command grep searches the input files for lines containing a match to a given pattern list. When it finds a match in a line, it copies the line to standard output (by default), or whatever other sort of output you have requested with options.
Though grep expects to do the matching on text, it has no limits on input line length other than available memory, and it can match arbitrary characters within a line. If the final byte of an input file is not a newline, grep silently supplies one. Since newline is also a separator for the list of patterns, there is no way to match newline characters in a text.
The following text file is used in the next examples:
Text.txt
We shall not spend a large expense of time
Before we reckon with your several loves,
And make us even with you. My thanes and kinsmen,
Henceforth be earls, the first that ever Scotland
In such an honour named. What's more to do,
Which would be planted newly with the time,
As calling home our exiled friends abroad
That fled the snares of watchful tyranny;
Producing forth the cruel ministers
Of this dead butcher and his fiend-like queen,
Who, as 'tis thought, by self and violent hands
Took off her life; this, and what needful else
That calls upon us, by the grace of Grace,
We will perform in measure, time and place:
So, thanks to all at once and to each one,
Whom we invite to see us crown'd at Scone.

Macbeth, William Shakespeare
With the first command, the lines from Text.txt containing the string “with” are displayed.
The next command also displays the line number of each line containing this search string.
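The two commands might look like the following sketch. To keep it self-contained, it writes only the first four lines of the sample text to a temporary file (the variable name `file` is an assumption); with the real file you would pass Text.txt instead.

```shell
# Write the first four lines of the sample text to a temporary file.
file=$(mktemp)
printf '%s\n' \
  'We shall not spend a large expense of time' \
  'Before we reckon with your several loves,' \
  'And make us even with you. My thanes and kinsmen,' \
  'Henceforth be earls, the first that ever Scotland' > "$file"

# Display the lines that contain the string "with".
matches=$(grep 'with' "$file")

# -n prefixes every matching line with its line number and a colon.
numbered=$(grep -n 'with' "$file")

printf '%s\n' "$matches"
printf '%s\n' "$numbered"

rm -f "$file"
```

In this excerpt the second and third lines match, so the numbered output starts with “2:” and “3:”.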
2.1 Line and word anchors
In the following example, we now exclusively want to display lines starting with the string “We”.
In the next example, we search for lines ending in “:”.
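Both anchors can be sketched as follows. Three lines of the sample text are piped into grep here so that the block runs on its own; the variable names are assumptions made for illustration.

```shell
# Three lines taken from the sample text.
text=$(printf '%s\n' \
  'We shall not spend a large expense of time' \
  'So, thanks to all at once and to each one,' \
  'We will perform in measure, time and place:')

# ^ anchors the pattern to the beginning of the line.
starts_with_we=$(printf '%s\n' "$text" | grep '^We')

# $ anchors the pattern to the end of the line.
ends_with_colon=$(printf '%s\n' "$text" | grep ':$')

printf '%s\n' "$starts_with_we"
printf '%s\n' "$ends_with_colon"
```

The pattern is single-quoted so that the shell does not try to interpret the “$”.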
2.2 Character classes
A bracket expression is a list of characters enclosed by “[” and “]”. It matches any single character in that list. For example, the regular expression “[0123456789]” matches any single digit.
If the first character of the list is the caret, “^”, then it matches any character NOT in the list. For example, “[^0123456789]” matches any single character that is not a digit.
Within a bracket expression, a range expression consists of two characters separated by a hyphen. It matches any single character that sorts between the two characters, inclusive, using the locale’s collating sequence and character set.
For example, in the default C locale, “[a-d]” is equivalent to “[abcd]”. Many locales sort characters in dictionary order, and in these locales “[a-d]” is typically not equivalent to “[abcd]”; it might be equivalent to “[aBbCcDd]”, for example.
To obtain the traditional interpretation of bracket expressions, you can use the C locale by setting the LC_ALL environment variable to the value “C”.
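As a minimal sketch, LC_ALL can be set for a single grep invocation by prefixing the command with the assignment. The sample letters are made up for illustration.

```shell
# Sample input: a mix of lowercase and uppercase letters, one per line.
letters=$(printf '%s\n' a B c D e)

# Run grep in the C locale, where [a-d] means exactly a, b, c and d.
c_locale=$(printf '%s\n' "$letters" | LC_ALL=C grep '[a-d]')

printf '%s\n' "$c_locale"
```

In the C locale only “a” and “c” are selected. Without the assignment, the result depends on the locale that is active on your system.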
In the following example, all the lines containing either a “y” or “c” character are displayed:
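Such a search could look like this sketch, which feeds three lines of the sample text to grep (the variable names are assumptions):

```shell
# Three lines taken from the sample text.
text=$(printf '%s\n' \
  'That fled the snares of watchful tyranny;' \
  'Took off her life; this, and what needful else' \
  'Producing forth the cruel ministers')

# [yc] matches a single character that is either "y" or "c".
y_or_c=$(printf '%s\n' "$text" | grep '[yc]')

printf '%s\n' "$y_or_c"
```

The line beginning with “Took” is not displayed, because it contains neither character.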
2.3 Wildcards
Use the “.” for a single character match. If you want to get a list of all five-character English dictionary words starting with “c” and ending in “s” (handy for solving crosswords):
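A sketch of such a search is shown below. It uses a small inline word list so that it runs anywhere; on many systems a dictionary with one word per line is available at a path such as /usr/share/dict/words, but that path is an assumption and may differ or be missing on your system.

```shell
# A tiny stand-in for the dictionary, one word per line.
words=$(printf '%s\n' cats cakes coins house castles)

# "c", exactly three arbitrary characters, then "s", anchored to the whole line.
five_letters=$(printf '%s\n' "$words" | grep '^c...s$')

printf '%s\n' "$five_letters"
```

From this list only “cakes” and “coins” are selected.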
If you want to display lines containing the literal dot character, use the -F option to grep.
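The difference between the wildcard and the literal dot can be sketched like this, using two lines of the sample text:

```shell
# Two lines from the sample text; only the second contains a dot.
text=$(printf '%s\n' \
  'Before we reckon with your several loves,' \
  'And make us even with you. My thanes and kinsmen,')

# As a regular expression, "." matches any character, so both lines match.
any_char_count=$(printf '%s\n' "$text" | grep -c '.')

# With -F the pattern is a fixed string, so only a literal dot matches.
literal_dot=$(printf '%s\n' "$text" | grep -F '.')

# Quoting the dot with a backslash has the same effect.
escaped_dot=$(printf '%s\n' "$text" | grep '\.')

printf '%s\n' "$any_char_count" "$literal_dot" "$escaped_dot"
```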
For matching multiple characters, use the asterisk. This example selects all words starting with “c” and ending in “s” from the system’s dictionary:
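A sketch with an inline word list instead of the system's dictionary (whose location varies between systems) might look like this:

```shell
# A tiny stand-in for the dictionary, one word per line.
words=$(printf '%s\n' cats cakes castles house cup)

# ".*" stands for any number of arbitrary characters, including none.
c_to_s=$(printf '%s\n' "$words" | grep '^c.*s$')

printf '%s\n' "$c_to_s"
```

This selects “cats”, “cakes” and “castles”. Because the dictionary holds one word per line, the line anchors “^” and “$” are enough to tie the pattern to whole words.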
3. Pattern matching using Bash features
3.1 Character ranges
Apart from grep and regular expressions, there’s a good deal of pattern matching that you can do directly in the shell, without having to use an external program.
As you already know, the asterisk (*) and the question mark (?) match any string or any single character, respectively. Quote these special characters to match them literally:
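The effect of quoting can be sketched in a scratch directory. The file names are made up for illustration, and the sketch assumes mktemp is available.

```shell
# Work in a scratch directory so that no real files are touched.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch '*' 'a?' 'ab' 'notes'

# Unquoted, "a?" is a pattern: "a" followed by any single character.
unquoted=$(ls a?)

# Quoted or backslash-escaped, the special characters are taken literally.
star=$(ls '*')
question=$(ls a\?)

printf '%s\n' "$unquoted" "$star" "$question"

# Clean up.
cd "$OLDPWD" || exit 1
rm -rf "$dir"
```

The unquoted pattern lists both “a?” and “ab”, while the quoted forms list only the files that are literally named “*” and “a?”.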
Square brackets match any single one of the enclosed characters. For example, the pattern “[ABC]*” matches all files in the current directory whose names start with “A”, “B” or “C”.
If the first character within the brackets is “!” or “^”, any character not enclosed will be matched. To match the dash (“-”), include it as the first or last character in the set.
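These bracket patterns can be tried out in a scratch directory, as in the following sketch. The file names are made up, and Bash arrays are used to capture the expansions.

```shell
# Work in a scratch directory with a few sample files.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch Alpha Beta Charlie delta ./-x

# Names starting with "A", "B" or "C".
abc=( [ABC]* )

# A leading "!" negates the set: names NOT starting with "A", "B" or "C".
others=( [!ABC]* )

# A dash placed first in the set is taken literally.
dash=( [-_]* )

printf '%s\n' "${abc[@]}"
printf '%s\n' "${others[@]}"
printf '%s\n' "${dash[@]}"

# Clean up.
cd "$OLDPWD" || exit 1
rm -rf "$dir"
```

The first pattern expands to “Alpha”, “Beta” and “Charlie”, the second to the two remaining files, and the third to “-x” only.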
The sorting depends on the current locale and on the value of the LC_COLLATE variable, if it is set. Mind that other locales might interpret “[a-cx-z]” as “[aBbCcXxYyZz]” if sorting is done in dictionary order.
If you want to be sure to have the traditional interpretation of ranges, force this behavior by setting LC_COLLATE or LC_ALL to “C”.
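One way to do this without changing the locale of the whole session is to set the variable inside a command substitution, as in this sketch (file names made up for illustration):

```shell
# Work in a scratch directory with mixed-case file names.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch a1 B1 c1

# The assignment only affects the subshell of the command substitution.
# In the C locale the range a-c covers exactly "a", "b" and "c".
ranged=$(LC_ALL=C; echo [a-c]*)

printf '%s\n' "$ranged"

# Clean up.
cd "$OLDPWD" || exit 1
rm -rf "$dir"
```

In the C locale the pattern expands to “a1 c1”; the file “B1” is left out.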
3.2 Character classes
Character classes can be specified within the square brackets, using the syntax “[:CLASS:]”. Most of the classes are defined in the POSIX standard; “ascii” and “word” are extensions supported by Bash. CLASS has one of the values:
- alnum
- alpha
- ascii
- blank
- cntrl
- digit
- graph
- lower
- print
- punct
- space
- upper
- word
- xdigit
A pattern such as “[[:upper:]]*” lists all files whose names begin with an uppercase letter.
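A short sketch of character classes in file name patterns, again in a scratch directory with made-up file names:

```shell
# Work in a scratch directory with a few sample files.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch Alpha beta Gamma 9lives

# The class goes inside a bracket expression, hence the double brackets.
upper=( [[:upper:]]* )
digit=( [[:digit:]]* )

printf '%s\n' "${upper[@]}"
printf '%s\n' "${digit[@]}"

# Clean up.
cd "$OLDPWD" || exit 1
rm -rf "$dir"
```

The first pattern expands to “Alpha” and “Gamma”, the second to “9lives”.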
When the extglob shell option is enabled (using the shopt built-in), several extended pattern matching operators are recognized.
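Two of these operators are sketched below: “!(PATTERN)” matches anything except the pattern, and “@(A|B)” matches exactly one of the alternatives. The file names are made up. Note that extglob has to be enabled on an earlier line than the one using the operators, since Bash must know about the option when it parses the pattern.

```shell
# Enable the extended pattern matching operators.
shopt -s extglob

# Work in a scratch directory with a few sample files.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch report.txt notes.txt image.png image.jpg

# Everything that does not end in ".txt".
not_txt=( !(*.txt) )

# Names ending in exactly one of the listed extensions.
images=( *.@(png|jpg) )

printf '%s\n' "${not_txt[@]}"
printf '%s\n' "${images[@]}"

# Clean up.
cd "$OLDPWD" || exit 1
rm -rf "$dir"
```

Both patterns expand to “image.jpg” and “image.png” here.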
4. Summary
Regular expressions are powerful tools for selecting particular lines from files or output. A lot of UNIX commands use regular expressions: vim, perl, the PostgreSQL database and so on.
They can be made available in any language or application using external libraries, and they have even found their way to non-UNIX systems. For instance, regular expressions are used in the Excel spreadsheet that comes with the Microsoft Office suite.
In this article we got a feel for the grep command, which is indispensable in any UNIX environment, and for the pattern matching that is built into Bash.