Regular expressions#
We often need to extract information from text by matching patterns. Regular expressions are a powerful tool for defining patterns.
Motivating Example#
A More Realistic Example#
California bill 1793, passed in 2018, provides for resentencing of people convicted of low-level marijuana crimes before decriminalization. Only a tiny fraction of those eligible for expungement could afford the time and attorney fees to go through the process. San Francisco teamed with Code for America to automatically search legal records to identify tens of thousands eligible people from criminal records.
Source code for identifying and summarizing convictions to be expunged is at codeforamerica/autoclearance. Several uses of regular expressions can be found in codeforamerica/autoclearance. This code is in the Ruby language, but the regular expression code is fairly similar to regular expressions in Python, e.g.,
def strip_addresses(text)
address_regex = /COM:\s*ADR-[\d]{6,8}\s*\([^)]*\)/m
text.gsub(address_regex, 'COM: [ADDRESS REDACTED]')
end
But that’s so unreadable!#
Regular expressions are notoriously difficult
to read. If I encountered the regular expression
above (/COM:\s*ADR-[\d]{6,8}\s*\([^)]*\)/m
)
in code without other contextual cues, I could
not easily guess that it is a pattern that
matches street addresses in criminal records.
Python makes it possible to write somewhat
more readable patterns.
In this chapter and in our project, I do not expect that you will become expert in reading and writing complex regular expressions. My goals for this introduction and pattern are:
I want you to be able to read very basic regular expressions for text that has a restricted, very predictable format.
I want you to be familiar enough with regular expressions that sometime in the future, when you encounter a problem for which more complex regular expressions are the right tool, you will be ready to open up one of the many, many tutorials or reference guides available online and learn what you need to solve your immediate problem.
Patterns are Like Programs#
Regular expressions are like a little
programming language designed just for
matching patterns in text. In Python we
write these little pattern programs as
quoted strings which are interpreted by the
re
module.
The most basic regular expression is a
string to be matched. For example, if we
wanted to find the word cat
in string s
,
we can write
cat_pat = "cat"
s = "My cat's name is Nora"
m = re.search(cat_pat, s)
re.search
will either return a match object
or None
. Since “cat” does appear in
string s
,
if m:
print(f"Matched {m.start()}..{m.end()}")
else:
print("*** No Match ***")
will print Matched 3..6
.
While re.search
looks for any occurrence of
the pattern, re.match
checks whether the
whole string matches the pattern. Thus
m = re.match(cat_pat, s)
if m:
print(f"Matched {m.start()}..{m.end()}")
else:
print("*** No Match ***")
prints *** No Match ***
.
White space and comments in regular expressions#
By default, a space in a regular expression matches a space in a string.
pat = " is "
s = "My cat's name is Nora"
m = re.search(pat, s)
if m:
start, end = m.start(), m.end()
print(f"Matched {start}..{end}")
print(f"Matched substring is '{s[start:end]}'")
else:
print("*** No Match ***")
prints
Matched 13..17
Matched substring is ' is '
Imagine if the only spaces permitted in a Python
program were those required by Python syntax,
and there was no way to include comments.
Our Python programs would be nearly as
unreadable as many regular expressions.
The Python re
module, however, gives us
the option of adding white space, comments,
and newlines. Adding the re.X
flag
argument to re.search
or re.match
enables
these readability aids:
pat = r"""
c # C is for cat
a # an a in the middle
t # This pattern really didn't need comments, but others do
"""
s = "My cat's name is Nora"
m = re.search(pat, s, re.X) ## re.X allows commented patterns
if m:
start, end = m.start(), m.end()
print(f"Matched {start}..{end}")
print(f"Matched substring is '{s[start:end]}'")
else:
print("*** No Match ***")
Composing Patterns#
Regular expressions wouldn’t be very
powerful if we could only search for
fixed strings. Fortunately we can build up
complex patterns with operators. This why
they are called “expressions”; we combine simple
patterns into more complex expressions just as
we combine simple arithmetic expressions into
more complex expressions with arithmetic operators
like +
and -
.
Or (|
)#
Suppose e1 and e2 are regular expressions,
and that they match m1 and m2, respectively.
We can write a pattern e1 | e2 to match
either what e1 matches or what e2 matches,
with parentheses to avoid ambiguity.
pat = "(cat)|(dog)|(ferret)"
s = "My cat's name is Nora"
m = re.search(pat, s, re.X) ## re.X allows commented patterns
if m:
start, end = m.start(), m.end()
print(f"Matched {start}..{end}")
print(f"Matched substring is '{s[start:end]}'")
else:
print("*** No Match ***")
Repetition (*
and +
)#
We can write e1* to match any number of repetitions of what e1 matches, including zero, or we can write e1+ to match one or more repetitions.
pat = "(ch-)*(changes)"
bowie = "Ch-ch-ch-ch-changes # Turn and face the strange"
m = re.search(pat, bowie, re.X) ## re.X allows commented patterns
if m:
start, end = m.start(), m.end()
print(f"Matched {start}..{end}")
print(f"Matched substring is '{bowie[start:end]}'")
Prints:
Matched 3..19
Matched substring is 'ch-ch-ch-changes'
Concatenation#
The examples above are actually using another “operator”
for regular expressions. The pattern "cat"
is actually
three patterns (c
, a
, and t
) that are concatenated
to form one longer pattern to match the concatenation of
what each of the individual letter patterns matches.
A more interesting application of concatenation appears
in "(ch-)*(changes)"
. Here ch-
pattern may be
repeated, but it must be followed by something that
matches changes
. We have used parentheses to be
clear that the whole string “ch-” must be repeated, and
not just the hyphen.
Character sets ([]
)#
Notice that "(ch-)*(changes)"
did not match the
beginning “Ch” of “Ch-ch-ch-ch-changes”. We could
use |
allow the “c” to be matched in lower case or
capitalized:
pat = "((c|C)h-)*(changes)"
bowie = "Ch-ch-ch-ch-changes # Turn and face the strange"
m = re.search(pat, bowie, re.X) ## re.X allows commented patterns
if m:
start, end = m.start(), m.end()
print(f"Matched {start}..{end}")
print(f"Matched substring is '{bowie[start:end]}'")
Often it is simpler to give a set or range of characters.
[axe]
matches any of “a”, “x”, or “e”. [c-f]
matches
any of “c”, “d”, “e”, or “f”.
pat = "([Cc]h-)*([a-z]+) # Turn and face the strange"
bowie = "Ch-ch-ch-ch-changes "
m = re.search(pat, bowie, re.X) ## re.X allows commented patterns
if m:
start, end = m.start(), m.end()
print(f"Matched {start}..{end}")
print(f"Matched substring is '{bowie[start:end]}'")
Prints
Matched 0..19
Matched substring is 'Ch-ch-ch-ch-changes'
Wild card (.
)#
The symbol .
(pronounced “dot”) is a wild card that matches
any single character. We can use .*
to match a substring
of any length (including length zero), containing
anything. The wild card is particularly useful for matching
text between delimiters, but we must be careful because it is
“greedy” and will prefer the longest possible match. For example:
m = re.search(pat, bowie, re.X) ## re.X allows commented patterns
if m:
start, end = m.start(), m.end()
print(f"Matched {start}..{end}")
print(f"Matched substring is '{bowie[start:end]}'")
Prints:
Matched 2..12
Matched substring is '-ch-ch-ch-'
Search, Match, and Fullmatch#
In the examples above, we have used the search
function from
re
to find a substring matching the pattern anywhere in the
string. That is often the right choice if we are searching
in unstructured text, most of which we skip over. When we process
structured text (like a spreadsheet, or assembly language source code)
or semi-structured text (like html or markdown documents),
we may not want skip over everything that doesn’t match
the pattern. If we want the matched substring to start at the
beginning of the string, we can use match
instead of search
.
If the whole string should match, we can use fullmatch
.
|
matches |
---|---|
|
pattern anywhere in string (ignore the rest) |
|
pattern must match beginning of string |
|
pattern must match the whole string |
For example, if the string is “Ch-ch-ch-ch-changes”, then we get the following results:
pattern |
|
|
|
---|---|---|---|
|
|
(no match) |
(no match) |
|
|
|
(no match) |
|
|
|
|
Capturing#
Often it is not enough to know the position of the whole match.
Particularly when we are using re.fullmatch
, we typically want
to know which parts of the pattern matched which parts of the string.
These are called “captures”. Each parenthesized part of the
pattern creates a “capture group”, which can be extracted from
the match object:
pat = "((c|C)h-)*(changes) # Turn and face the strange"
bowie = "Ch-ch-ch-ch-changes"
m = re.search(pat, bowie, re.X)
if m:
start, end = m.start(), m.end()
print(f"Matched {start}..{end}")
print(f"Matched substring is '{bowie[start:end]}'")
print(f"Groups {m.groups()}")
Prints:
Matched 0..19
Matched substring is 'Ch-ch-ch-ch-changes'
Groups ('ch-', 'c', 'changes')
We can improve on this by giving the capture group names.
We assign a name to a group by enclosing it in parentheses
and prepending ?P<name>
. We can then obtain the named
groups in the form of a Python dict
using the groupdict
method:
pat = "(?P<prefix> ((c|C)h-)*) (?P<suffix> changes)"
bowie = "Ch-ch-ch-ch-changes"
m = re.search(pat, bowie, re.X)
if m:
start, end = m.start(), m.end()
print(f"Matched {start}..{end}")
print(f"Matched substring is '{bowie[start:end]}'")
print(f"Groups {m.groupdict()}")
Prints:
Matched 0..19
Matched substring is 'Ch-ch-ch-ch-changes'
Groups {'prefix': 'Ch-ch-ch-ch-', 'suffix': 'changes'}
In the example above, we have made the “prefix”
group contain all repetitions of “Ch-” or “ch-“.
We can also make a group that captures just the
last repetition, and groups can be nested:
pat = "(?P<prefix>(?P<lastch> (c|C)h-)*) (?P<suffix> changes)"
bowie = "Ch-ch-ch-ch-changes"
m = re.search(pat, bowie, re.X)
if m:
start, end = m.start(), m.end()
print(f"Matched {start}..{end}")
print(f"Matched substring is '{bowie[start:end]}'")
print(f"Groups {m.groupdict()}")
This prints:
Matched 0..19
Matched substring is 'Ch-ch-ch-ch-changes'
Groups {'prefix': 'Ch-ch-ch-ch-', 'lastch': 'ch-', 'suffix': 'changes'}
Perspective#
One could write
a whole book
on regular expressions.
Or another.
Or many others. At some time it may be worthwhile to
extensively study regular expressions, but this is probably
not that time. The selection of examples above is designed to
be just enough to get through the project.
My goals in this chapter, and in the
accompanying project, are two:
You should be able to read a fairly simple regular expression and make sense of it.
You should understand why we use regular expressions well enough that in the future, when you are faced with a task for which regular expressions are the best tool, you will be prepared to learn what you need from books and online tutorials and references.