Regular Expressions
To see if a line has the string "abc" in
it we can do this:
if (index($line, "abc") >= 0) {
Searching for a string or a more complex
pattern is very useful, and Perl offers another,
much more powerful (and, of course, more concise!)
way to do it:
if ($line =~ /abc/) {
This should be read "if line matches /abc/".
The "=~" is the match operator that takes
a scalar on its left and a pattern on the right.
It returns true if the pattern matches, otherwise false.
Between the two slashes comes the pattern,
which can be very complex with LOTS of
special characters with special meanings.
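As a small sketch of the two approaches side by side, the following compares the index() test with the match operator on a few sample strings. The sub names has_abc_index and has_abc_match are made up for this illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.
# Each returns 1 if "abc" occurs somewhere in $line, 0 otherwise.
sub has_abc_index {
    my ($line) = @_;
    return index($line, "abc") >= 0 ? 1 : 0;   # index() returns -1 when not found
}

sub has_abc_match {
    my ($line) = @_;
    return $line =~ /abc/ ? 1 : 0;             # =~ is true when the pattern matches
}

for my $line ("hello", "abracadabra", "abcdefg") {
    printf "%-12s index: %d  match: %d\n",
        $line, has_abc_index($line), has_abc_match($line);
}
```

Both tests agree on every string: only "abcdefg" contains "abc".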
Another term for these patterns is "regular expressions"
although they are not really very 'regular'!
That term comes from theoretical computer science.
"Pattern" or "Rule" is somehow more appropriate
but you will often see "regular expression" or "regexp".
Developing the patterns can be a real challenge
since they are a little language unto themselves.
This bit of code is a useful way to test and develop
a pattern:
while (my $line = <DATA>) {
if ($line =~ /abc/) {
print "matched: $line";
}
}
__DATA__
hello
abracadabra
abcdefg
The lines after the __DATA__ token are
a representative sample
of the expected input. They are easily read with
the <DATA> construct. You don't need to do an 'open' at all!
With this in place you can keep fiddling with
the pattern until you are satisfied it is correct
for both the lines it matches AND the ones
it doesn't match.
You can apply some syntactic sugar to make
the above more concise, if you wish.
while (<DATA>) {
if (/abc/) {
print "matched: $_";
}
}
If a pattern appears alone without the "=~" operator
then it applies to the (you guessed it) default variable $_.
A further simplification would yield:
while (<DATA>) {
print "matched: $_" if /abc/;
}
So ... what is a pattern?
There are 4 concepts in the construction of
a pattern: Atoms, Anchors, Repetition, and Alternation.
Atoms
An atom matches a single character.
The letters a-z and A-Z and the digits 0-9 each match themselves literally.
/abc/
/9B3/
/hello/
A period '.' matches ANY character (except a newline).
It is like a wildcard when playing poker.
/a.c/ # abc, axc, aEc, ...
/a..d/ # a12d, aqqd, aGGd, ...
If you want to match a real dot precede the '.' with a backslash.
/3\.14159/
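To see the difference between the wildcard dot and an escaped dot, here is a minimal sketch. The sub names wildcard_dot and literal_dot are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helpers: return 1 on a match, 0 otherwise.
sub wildcard_dot {
    my ($text) = @_;
    return $text =~ /a.c/ ? 1 : 0;        # '.' stands for any one character
}

sub literal_dot {
    my ($text) = @_;
    return $text =~ /3\.14159/ ? 1 : 0;   # '\.' stands for a real dot only
}

print "a.c on 'axc':           ", wildcard_dot("axc"), "\n";
print "a.c on 'ac':            ", wildcard_dot("ac"), "\n";
print "3\\.14159 on '3.14159':  ", literal_dot("3.14159"), "\n";
print "3\\.14159 on '3x14159':  ", literal_dot("3x14159"), "\n";
```

The dot must consume exactly one character, so 'ac' does not match /a.c/, and the escaped dot refuses the 'x' in '3x14159'.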
A character class matches any one character
from a set of characters. You enclose the set of
characters in square brackets [ ] like so:
/[abc]/ # matches a, b, or c
/[abcABC]/
/[0-9]/ # ranges are okay
/[a-zA-Z]/ # any letter
/[a-z][0-9]/
There are some abbreviations for common character classes:
\d [0-9] # digits
\w [a-zA-Z0-9_] # word character
\s [ \t\n\r\f] # white space
If the first character in a character class
is a hat/circumflex '^' then the atom matches
any character that is NOT in the set.
[^abc] # matches anything EXCEPT a, b, or c
More abbreviations:
\D [^0-9]
\W [^a-zA-Z0-9_]
\S [^ \t\n\r\f]
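The class abbreviations can be tried out with a small sketch like this one. The sub describe_char is a hypothetical helper, not part of Perl.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a single character using the
# abbreviations \d, \w and \s.
sub describe_char {
    my ($ch) = @_;
    return "digit"      if $ch =~ /\d/;
    return "word char"  if $ch =~ /\w/;   # letters and underscore (digits were caught above)
    return "whitespace" if $ch =~ /\s/;
    return "other";
}

for my $ch ("7", "q", "_", " ", "\t", "!") {
    print "[$ch] is: ", describe_char($ch), "\n";
}
```

The order of the tests matters here because every digit is also a word character.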
Anchors
To match at a particular place in the line
there are some special anchors:
/^abc/ # matches abc at the beginning of the line
/abc$/ # matches abc at the end of the line
Note that the hat '^' serves two purposes in patterns.
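A short sketch of the two anchors in action follows. The sub names are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical helpers: return 1 on a match, 0 otherwise.
sub starts_with_abc {
    my ($line) = @_;
    return $line =~ /^abc/ ? 1 : 0;   # abc must be the first thing in the line
}

sub ends_with_abc {
    my ($line) = @_;
    return $line =~ /abc$/ ? 1 : 0;   # abc must be the last thing in the line
}

for my $line ("abcdef", "xyzabc", "xabcx") {
    printf "%-8s starts: %d  ends: %d\n",
        $line, starts_with_abc($line), ends_with_abc($line);
}
```

The '$' anchor also matches just before a newline at the very end of the string, so lines read with <DATA> can be tested without chomping them first.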
Repetition
There are three pattern characters used for repetition
- a kind of loop construct:
/z+/ # matches one or more z's in a row
/z*/ # matches zero or more z's in a row
/z?/ # matches zero or one z - i.e. the z is optional.
These are also called "quantifiers" in some books.
These quantifiers apply only to the immediately preceding
atom. If you want a quantifier to apply to more than just the
one atom you can use parentheses to group atoms:
/(ab)+/ # one or more sequences of 'ab'.
A common example:
/ab.*yz/ # 'ab' followed by
# ANY AMOUNT OF ANYTHING followed by 'yz'.
You will see '.*' a lot to skip over things you don't care about.
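Here is a sketch that runs a handful of strings against patterns using the three quantifiers. The sub matches is a hypothetical helper, and placing a pattern in a variable and interpolating it between the slashes is done here only to keep the example short.

```perl
use strict;
use warnings;

# Hypothetical helper: 1 if $text matches the pattern held in $pattern, else 0.
sub matches {
    my ($pattern, $text) = @_;
    return $text =~ /$pattern/ ? 1 : 0;
}

my @cases = (
    [ 'z+',      'pizza'     ],   # one or more z's
    [ 'z+',      'pita'      ],   # no z at all
    [ 'xz*y',    'xy'        ],   # zero z's is fine for '*'
    [ 'colou?r', 'color'     ],   # the u is optional
    [ '^(ab)+$', 'ababab'    ],   # the group 'ab' repeated
    [ '^(ab)+$', 'abba'      ],   # 'ba' breaks the repetition
    [ 'ab.*yz',  'ab12345yz' ],   # '.*' skips over the middle
);

for my $case (@cases) {
    my ($pattern, $text) = @$case;
    printf "%-10s %-10s %s\n", "/$pattern/", $text,
        matches($pattern, $text) ? "match" : "no match";
}
```

Every case matches except 'pita' against /z+/ and 'abba' against /^(ab)+$/.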
Alternation
The pipe '|' introduces an OR into patterns:
/hello|goodbye/ # matches hello OR goodbye
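The pipe has very low precedence, so each alternative extends
all the way to the neighboring pipe or to the end of the pattern.
The above matches either the whole word hello or the whole word goodbye.
It is NOT the same as:
/hell[og]oodbye/
which matches only hellooodbye or hellgoodbye.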
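The difference between the two patterns can be checked with a sketch like the following. The sub names are made up, and the third sub shows one way (an assumption about what you might want) to limit the reach of the pipe with parentheses.

```perl
use strict;
use warnings;

# Hypothetical helpers: return 1 on a match, 0 otherwise.
sub either_word {
    my ($text) = @_;
    return $text =~ /hello|goodbye/ ? 1 : 0;       # 'hello' OR 'goodbye'
}

sub class_form {
    my ($text) = @_;
    return $text =~ /hell[og]oodbye/ ? 1 : 0;      # 'hell', then o or g, then 'oodbye'
}

sub whole_line_word {
    my ($text) = @_;
    return $text =~ /^(hello|goodbye)$/ ? 1 : 0;   # parentheses confine the pipe
}

for my $text ("hello", "say goodbye", "hellgoodbye") {
    printf "%-12s either: %d  class: %d  whole line: %d\n",
        $text, either_word($text), class_form($text), whole_line_word($text);
}
```

Without the parentheses, /^hello|goodbye$/ would mean "hello at the beginning OR goodbye at the end", because the anchors belong to the individual alternatives.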
There is a great deal more to what you can put inside
a pattern but this will suffice for the moment.
Learn these things well - they will be used over and over again.
What Matched?
If you surround a portion of the pattern with parentheses
then once a match has been verified you can see
exactly what matched that portion. For example:
$line = "the_count = 45";
if ($line =~ /(\w+)\s*=\s*(\d+)/) {
print "name: $1 and number: $2\n";
}
See the ( ) surrounding \w+ and \d+?
When there is a successful match the name and
the number are put in the variables $1 and $2.
Scalar vs List Context
A regular expression match has a value.
In scalar context the value is a boolean that says whether
the match succeeded or not. This is illustrated by the
many previous examples.
If the match is put in list context it returns
the list of captured substrings ($1, $2, etc.). This is best
illustrated by rewriting the above example:
$line = "the_count = 45";
if (($name, $number) = $line =~ /(\w+)\s*=\s*(\d+)/) {
print "name: $name and number: $number\n";
}
This makes for very clear and compact code, yes?
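A failed match in list context returns an empty list, which is what makes the 'if' test work. The following sketch wraps the idea in a sub; parse_assignment is a hypothetical name chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (name, number) for a line such as
# "the_count = 45", or an empty list if the line does not match.
sub parse_assignment {
    my ($line) = @_;
    my ($name, $number) = $line =~ /(\w+)\s*=\s*(\d+)/
        or return;                 # no match: the assignment yields 0 elements
    return ($name, $number);
}

for my $line ("the_count = 45", "no assignment here") {
    if (my ($name, $number) = parse_assignment($line)) {
        print "name: $name and number: $number\n";
    }
    else {
        print "no match: $line\n";
    }
}
```

A list assignment used as a condition counts the elements on its right-hand side, so zero captures means false.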
Substitutions
Regular expressions can also be used to modify scalars.
$line = "Now is the time for the election";
$line =~ s/e/E/;
The above is a substitute statement.
It will replace the first 'e' in $line with 'E'
resulting in:
Now is thE time for the election
Appending a 'g' will do the replacements globally:
$line =~ s/e/E/g;
Which results in:
Now is thE timE for thE ElEction
Between the first two slashes in the substitute
statement you can put ANY arbitrarily complex pattern.
$line =~ s/[aeiou]/X/g;
will result in:
NXw Xs thX tXmX fXr thX XlXctXXn
Without the "$var =~" the substitution will take
place on $_, as you might expect by now.
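As a sketch of several substitutions working on $_ in a row, the sub below tidies up the white space in a string. The name squeeze_spaces is invented, and copying the argument into a localized $_ is just one way to set this up.

```perl
use strict;
use warnings;

# Hypothetical helper: trim leading and trailing white space and
# collapse runs of white space into a single blank.
sub squeeze_spaces {
    my ($text) = @_;
    local $_ = $text;    # work on a private copy held in $_
    s/^\s+//;            # no "$var =~", so each of these applies to $_
    s/\s+$//;
    s/\s+/ /g;
    return $_;
}

print "[", squeeze_spaces("   Now   is the \t time  "), "]\n";
```

The three statements remove the leading white space, the trailing white space, and then every remaining run of white space in turn.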
Other Modifiers
In addition to the 'g' modifier above there are also
'i', 'x' and 'e'.
Case Insensitivity
Ignoring the case of letters in a pattern is very simple to do:
if ($name =~ /Charles/i) {
...
}
See the little 'i' after the closing slash?
It stands for insensitive - specifically case insensitivity.
The pattern will match Charles, CHARLES and even ChArLeS!
Expanding the Regular Expression
With the 'x' modifier you can eXpand the pattern
by adding white space and comments to make it more
readable. Instead of:
if (/^\s*([a-z]+)\s*=\s*(\d+)\s*$/) {
print "var: $1 num: $2\n";
}
you can do this:
if (/
^\s*
([a-z]+) # variable
\s*=\s* # equals
(\d+) # number
\s*$
/x
) {
print "var: $1 num: $2\n";
}
The important difference is the 'x' after the
final slash. This makes the cryptic expression
much clearer. This also illustrates a conventional exception
to the indentation rules. Note the placement of
the closing parenthesis and the opening brace '{'.
The new code alignment rule is this: If the boolean expression
of an 'if' or 'while' extends across more than one line
then the parenthesis and brace ') {' are not put on the last line (where
they might be lost in the visual shuffle). Instead, they are
aligned with the keyword on a line by themselves.
Evaluating the Regular Expression
With the 'e' modifier the replacement portion of
a substitute statement will be Evaluated as an
arbitrary Perl expression and the result will be
used as the replacement string. Like so:
$line = "hello there 3 and more 67";
$line =~ s/(\d+)/ $1 * 2 /eg;
results in:
hello there 6 and more 134
The ability to evaluate arbitrary Perl code is a very powerful
extension of the regular expression machinery. All kinds of
fancy things become possible!
The 'g' above made the replacement global, so every number in the line was doubled.
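The replacement expression can also call a function. The sketch below repeats the doubling example and adds a second substitution that uses Perl's built-in ucfirst; the sub names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: double every number in the line.
sub double_numbers {
    my ($line) = @_;
    $line =~ s/(\d+)/ $1 * 2 /eg;       # the replacement is evaluated as Perl code
    return $line;
}

# Hypothetical helper: capitalize the first letter of every word.
sub capitalize_words {
    my ($line) = @_;
    $line =~ s/(\w+)/ ucfirst($1) /eg;  # a function call works as well
    return $line;
}

print double_numbers("hello there 3 and more 67"), "\n";
print capitalize_words("hello there 3 and more 67"), "\n";
```

The blanks around the expression in the replacement part are not copied into the result, since that part is treated as code rather than as a string.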
Exercises
- Using the test program to read lines
from DATA, make regular expressions to match:
- ab followed by a digit followed by a capital letter
- xy followed by ANY two characters and then an F.
- lines beginning with a 'G'
- lines that end with a digit
- lines containing the name 'fred' or 'Fred' or 'FRED' or ...
(ignore the case of letters).
- A Perl identifier
(Begins with a letter followed by any number of letters, digits or underscores).
- Describe (in English) what this code does:
while (<DATA>) {
next unless /\S/;
next if /^\s*#/;
print;
}
- Make a program to identify lines as one of these categories:
Have at least 20 lines of input with 2 or 3
examples of each category.
Don't try to be overly precise.
Just get some practice with patterns.
- Read lines from DATA and print to STDOUT.
Skip entirely blank lines and lines that begin with a sharp '#'.
Some lines will be "definition lines".
Use the definitions later when the variable
appears surrounded by percent signs.
Definition lines begin with a name
followed by an equal sign. The rest of the line is the value.
For example:
-- input --
# definitions
age = 23
year = 1979
name = Mary Smith
# lines to process and output:
Her name is %name% and she is %age% years old.
%name% was born in %year%.
-- output --
Her name is Mary Smith and she is 23 years old.
Mary Smith was born in 1979.
- Read lines from DATA and print them
followed by the number of vowels in the line.
Do this with regular expressions
by deleting all non-vowels in the line. All that
will remain will be the vowels so the length
of the line will be the number of vowels!
Include both upper and lower case vowels.
-- input --
HELLO how are you doing?
I'm fine.
-- output --
HELLO how are you doing?: 9
I'm fine.: 3