In this tutorial, you’ll explore regular expressions, also known as regexes, in Python. A regex is a special sequence of characters that defines a pattern for complex string-matching functionality.
Earlier in this series, in the tutorial Strings and Character Data in Python, you learned how to define and manipulate string objects. Since then, you’ve seen some ways to determine whether two strings match each other:
- You can test whether two strings are equal using the equality (==) operator.
- You can test whether one string is a substring of another with the in operator or the built-in string methods .find() and .index().
String matching like this is a common task in programming, and you can get a lot done with string operators and built-in methods. At times, though, you may need more sophisticated pattern-matching capabilities.
In this tutorial, you’ll learn:
- How to access the re module, which implements regex matching in Python
- How to use re.search() to match a pattern against a string
- How to create complex matching patterns with regex metacharacters
Fasten your seat belt! Regex syntax takes a little getting used to. But once you get comfortable with it, you’ll find regexes almost indispensable in your Python programming.
Regexes in Python and Their Uses
Imagine you have a string object s. Now suppose you need to write Python code to find out whether s contains the substring '123'. There are at least a couple of ways to do this. You could use the in operator:
>>> s = 'foo123bar'
>>> '123' in s
True
If you want to know not only whether '123' exists in s but also where it exists, then you can use .find() or .index(). Each of these returns the character position within s where the substring resides:
>>> s = 'foo123bar'
>>> s.find('123')
3
>>> s.index('123')
3
In these examples, the matching is done by a straightforward character-by-character comparison. That will get the job done in many cases. But sometimes, the problem is more complicated than that.
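One practical difference between the two methods is how they report a missing substring. The short sketch below shows it; the variable names missing_find and missing_index are illustrative only:

```python
s = 'foo123bar'

# .find() signals "not found" by returning -1.
missing_find = s.find('999')

# .index() raises ValueError instead of returning a sentinel value.
try:
    s.index('999')
    missing_index = 'found'
except ValueError:
    missing_index = 'ValueError'

print(missing_find)   # -1
print(missing_index)  # ValueError
```

Use .find() when a missing substring is an expected case, and .index() when it should be treated as an error.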
For example, rather than searching for a fixed substring like '123', suppose you wanted to determine whether a string contains any three consecutive decimal digit characters, as in the strings 'foo123bar', 'foo456bar', '234baz', and 'qux678'.
Strict character comparisons won’t cut it here. This is where regexes in Python come to the rescue.
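To see why, here is a rough sketch of what the three-digit check might look like with plain string operations only. The helper name has_three_digits is hypothetical, and this is just one of several ways to write it:

```python
def has_three_digits(s):
    """Return True if s contains three consecutive ASCII decimal digits."""
    digits = '0123456789'
    # Slide a three-character window across the string.
    for i in range(len(s) - 2):
        if s[i] in digits and s[i + 1] in digits and s[i + 2] in digits:
            return True
    return False

for candidate in ['foo123bar', 'foo456bar', '234baz', 'qux678', '12foo34']:
    print(candidate, has_three_digits(candidate))
```

It works, but the loop has to be rewritten for every new kind of pattern. A regex expresses the same check in a single short string.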
A (Very Brief) History of Regular Expressions
In 1951, mathematician Stephen Cole Kleene described the concept of a regular language, a language that is recognizable by a finite automaton and formally expressible using regular expressions. In the mid-1960s, computer science pioneer Ken Thompson, one of the original designers of Unix, implemented pattern matching in the QED text editor using Kleene’s notation.
Since then, regexes have appeared in many programming languages, editors, and other tools as a means of determining whether a string matches a specified pattern. Python, Java, and Perl all support regex functionality, as do most Unix tools and many text editors.
The re Module
Regex functionality in Python resides in a module named re. The re module contains many useful functions and methods, most of which you’ll learn about in the next tutorial in this series.
For now, you’ll focus predominantly on one function, re.search().
re.search(<regex>, <string>)
Scans a string for a regex match.
re.search(<regex>, <string>) scans <string> looking for the first location where the pattern <regex> matches. If a match is found, then re.search() returns a match object. Otherwise, it returns None.
re.search() takes an optional third <flags> argument that you’ll learn about at the end of this tutorial.
How to Import re.search()
Because search() resides in the re module, you need to import it before you can use it. One way to do this is to import the entire module and then use the module name as a prefix when calling the function:
import re
re.search(...)
Alternatively, you can import the function from the module by name and then refer to it without the module name prefix:
from re import search
search(...)
You’ll always need to import re.search() by one means or another before you’ll be able to use it.
The examples in the remainder of this tutorial will assume the first approach shown—importing the re module and then referring to the function with the module name prefix: re.search(). For the sake of brevity, the import re statement will usually be omitted, but remember that it’s always necessary.
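As a quick check that the two import styles are interchangeable, the following sketch confirms that both names refer to the same function object:

```python
import re
from re import search

# Both names are bound to the same function object.
print(search is re.search)  # True

# Either call style therefore finds the same match.
print(re.search('123', 'foo123bar') is not None)  # True
print(search('123', 'foo123bar') is not None)     # True
```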
For more information on importing from modules and packages, check out Python Modules and Packages—An Introduction.
First Pattern-Matching Example
Now that you know how to gain access to re.search(), you can give it a try:
>>> s = 'foo123bar'

>>> # One last reminder to import!
>>> import re

>>> re.search('123', s)
<_sre.SRE_Match object; span=(3, 6), match='123'>
Here, the search pattern <regex> is 123 and <string> is s. The returned match object appears on the last line of the output. Match objects contain a wealth of useful information that you’ll explore soon.
For the moment, the important point is that re.search() did in fact return a match object rather than None. That tells you that it found a match. In other words, the specified <regex> pattern 123 is present in s.
A match object is truthy, so you can use it in a Boolean context like a conditional statement:
>>> if re.search('123', s):
...     print('Found a match.')
... else:
...     print('No match.')
...
Found a match.
The interpreter displays the match object as <_sre.SRE_Match object; span=(3, 6), match='123'>. (Python 3.7 and later display the same object as <re.Match object; span=(3, 6), match='123'>.) This contains some useful information.
span=(3, 6) indicates the portion of <string> in which the match was found. This means the same thing as it would in slice notation:
>>> s[3:6]
'123'
In this example, the match starts at character position 3 and extends up to but not including position 6.
match='123' indicates which characters from <string> matched.
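The same details are also available programmatically. The sketch below uses the match object methods .span(), .start(), .end(), and .group(), which are part of the re module but haven't been introduced in the text so far:

```python
import re

s = 'foo123bar'
m = re.search('123', s)

print(m.span())   # (3, 6)
print(m.start())  # 3
print(m.end())    # 6
print(m.group())  # 123

# Slicing with the span reproduces the matched text.
print(s[m.start():m.end()] == m.group())  # True
```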
This is a good start. But in this case, the <regex> pattern is just the plain string '123'. The pattern matching here is still just character-by-character comparison, pretty much the same as the in operator and .find() examples shown earlier. The match object helpfully tells you that the matching characters were '123', but that’s not much of a revelation since those were exactly the characters you searched for.
You’re just getting warmed up.
Python Regex Metacharacters
The real power of regex matching in Python emerges when <regex> contains special characters called metacharacters. These have a special meaning to the regex matching engine and vastly enhance the capability of the search.
Consider again the problem of how to determine whether a string contains any three consecutive decimal digit characters.
In a regex, a set of characters specified in square brackets ([]) makes up a character class. This metacharacter sequence matches any single character that is in the class, as demonstrated in the following example:
>>> s = 'foo123bar'
>>> re.search('[0-9][0-9][0-9]', s)
<_sre.SRE_Match object; span=(3, 6), match='123'>
[0-9] matches any single decimal digit character—any character between '0' and '9', inclusive. The full expression [0-9][0-9][0-9] matches any sequence of three decimal digit characters. In this case, s matches because it contains three consecutive decimal digit characters, '123'.
These strings also match:
>>> re.search('[0-9][0-9][0-9]', 'foo456bar')
<_sre.SRE_Match object; span=(3, 6), match='456'>
>>> re.search('[0-9][0-9][0-9]', '234baz')
<_sre.SRE_Match object; span=(0, 3), match='234'>
>>> re.search('[0-9][0-9][0-9]', 'qux678')
<_sre.SRE_Match object; span=(3, 6), match='678'>
On the other hand, a string that doesn’t contain three consecutive digits won’t match:
>>> print(re.search('[0-9][0-9][0-9]', '12foo34'))
None
With regexes in Python, you can identify patterns in a string that you wouldn’t be able to find with the in operator or with string methods.
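As a small illustration, the pattern can be wrapped in a helper that reports which digits were found. The names THREE_DIGITS and first_three_digits are hypothetical, and .group() is the match object method that returns the matched text:

```python
import re

THREE_DIGITS = '[0-9][0-9][0-9]'  # same pattern as in the examples above

def first_three_digits(s):
    """Return the first run of three digits in s, or None if there is none."""
    m = re.search(THREE_DIGITS, s)
    return m.group() if m else None

for candidate in ['foo123bar', '234baz', 'qux678', '12foo34', 'a1b2c3']:
    print(repr(candidate), '->', first_three_digits(candidate))
```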
Take a look at another regex metacharacter. The dot (.) metacharacter matches any character except a newline, so it functions like a wildcard:
>>> s = 'foo123bar'
>>> re.search('1.3', s)
<_sre.SRE_Match object; span=(3, 6), match='123'>
>>> s = 'foo13bar'
>>> print(re.search('1.3', s))
None
In the first example, the regex 1.3 matches '123' because the '1' and '3' match literally, and the . matches the '2'. Here, you’re essentially asking, “Does s contain a '1', then any character (except a newline), then a '3'?” The answer is yes for 'foo123bar' but no for 'foo13bar'.
These examples provide a quick illustration of the power of regex metacharacters. Character class and dot are but two of the metacharacters supported by the re module. There are many more. Next, you’ll explore them fully.
Metacharacters Supported by the re Module
The following table briefly summarizes all the metacharacters supported by the re module. Some characters serve more than one purpose:
| Character(s) | Meaning |
|---|---|
| . | Matches any single character except newline |
| ^ | ∙ Anchors a match at the start of a string ∙ Complements a character class |
| $ | Anchors a match at the end of a string |
| * | Matches zero or more repetitions |
| + | Matches one or more repetitions |
| ? | ∙ Matches zero or one repetition ∙ Specifies the non-greedy versions of *, +, and ? ∙ Introduces a lookahead or lookbehind assertion ∙ Creates a named group |
| {} | Matches an explicitly specified number of repetitions |
| \ | ∙ Escapes a metacharacter, removing its special meaning ∙ Introduces a special character class ∙ Introduces a grouping backreference |
| [] | Specifies a character class |
| \| | Designates alternation |
| () | Creates a group |
| :#=! | Designate a specialized group |
| <> | Creates a named group |
This may seem like an overwhelming amount of information, but don’t panic! The following sections go over each one of these in detail.
The regex parser regards any character not listed above as an ordinary character that matches only itself. For example, in the first pattern-matching example shown above, you saw this:
>>> s = 'foo123bar'
>>> re.search('123', s)
<_sre.SRE_Match object; span=(3, 6), match='123'>
In this case, 123 is technically a regex, but it’s not a very interesting one because it doesn’t contain any metacharacters. It just matches the string '123'.
Things get much more exciting when you throw metacharacters into the mix. The following sections explain in detail how you can use each metacharacter or metacharacter sequence to enhance pattern-matching functionality.
Metacharacters That Match a Single Character
The metacharacter sequences in this section each try to match a single character from the search string. When the regex parser encounters one of these sequences, a match happens if the character at the current parsing position fits the description the sequence gives.
[]
Specifies a set of characters to match.
Characters contained in square brackets ([]) represent a character class—an enumerated set of characters to match from. A character class metacharacter sequence will match any single character contained in the class.
You can enumerate the characters individually like this:
>>> re.search('ba[artz]', 'foobarqux')
<_sre.SRE_Match object; span=(3, 6), match='bar'>
>>> re.search('ba[artz]', 'foobazqux')
<_sre.SRE_Match object; span=(3, 6), match='baz'>
The metacharacter sequence [artz] matches any single 'a', 'r', 't', or 'z' character. In the example, the regex ba[artz] matches both 'bar' and 'baz' (and would also match 'baa' and 'bat').
A character class can also contain a range of characters separated by a hyphen (-), in which case it matches any single character within the range. For example, [a-z] matches any lowercase alphabetic character between 'a' and 'z', inclusive:
>>> re.search('[a-z]', 'FOObar')
<_sre.SRE_Match object; span=(3, 4), match='b'>
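Ranges follow character code order, so a range can be wider than it looks. The sketch below shows a common pitfall: [A-z] also covers the punctuation characters whose codes fall between 'Z' and 'a'. The variable names are illustrative:

```python
import re

# '_' (code 95) lies between 'Z' (90) and 'a' (97), so [A-z] accepts it.
too_wide = re.search('[A-z]', '_')

# Listing the two ranges separately matches letters only.
letters_only = re.search('[A-Za-z]', '_')

print(too_wide is not None)      # True
print(letters_only is not None)  # False
```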
[0-9] matches any digit character:
>>> re.search('[0-9][0-9]', 'foo123bar')
<_sre.SRE_Match object; span=(3, 5), match='12'>
In this case, [0-9][0-9] matches a sequence of two digits. The first portion of the string 'foo123bar' that matches is '12'.
[0-9a-fA-F] matches any hexadecimal digit character:
>>> re.search('[0-9a-fA-F]', '--- a0 ---')
<_sre.SRE_Match object; span=(4, 5), match='a'>
Here, [0-9a-fA-F] matches the first hexadecimal digit character in the search string, 'a'.
Note: In the above examples, the return value is always the leftmost possible match. re.search() scans the search string from left to right, and as soon as it locates a match for <regex>, it stops scanning and returns the match.
You can complement a character class by specifying ^ as the first character, in which case it matches any character that isn’t in the set. In the following example, [^0-9] matches any character that isn’t a digit:
>>> re.search('[^0-9]', '12345foo')
<_sre.SRE_Match object; span=(5, 6), match='f'>
Here, the match object indicates that the first character in the string that isn’t a digit is 'f'.
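A complemented class is handy for locating the first character that breaks a rule. The following is a minimal sketch; the helper name first_non_digit is hypothetical, and it relies on the match object method .start() to report the position:

```python
import re

def first_non_digit(s):
    """Return the index of the first non-digit character in s, or -1."""
    m = re.search('[^0-9]', s)
    return m.start() if m else -1

print(first_non_digit('12345foo'))  # 5
print(first_non_digit('2024'))      # -1
```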
If a ^ character appears in a character class but isn’t the first character, then it has no special meaning and matches a literal '^' character:
>>> re.search('[#:^]', 'foo^bar:baz#qux')
<_sre.SRE_Match object; span=(3, 4), match='^'>
As you’ve seen, you can specify a range of characters in a character class by separating characters with a hyphen. What if you want the character class to include a literal hyphen character? You can place it as the first or last character or escape it with a backslash (\). In the escaped form, write the pattern as a raw string (note the r prefix) so that Python passes the backslash through to the regex engine unchanged:
>>> re.search('[-abc]', '123-456')
<_sre.SRE_Match object; span=(3, 4), match='-'>
>>> re.search('[abc-]', '123-456')
<_sre.SRE_Match object; span=(3, 4), match='-'>
>>> re.search(r'[ab\-c]', '123-456')
<_sre.SRE_Match object; span=(3, 4), match='-'>
If you want to include a literal ']' in a character class, then you can place it as the first character or escape it with a backslash in a raw string (r prefix):
>>> re.search('[]]', 'foo[1]')
<_sre.SRE_Match object; span=(5, 6), match=']'>
>>> re.search(r'[ab\]cd]', 'foo[1]')
<_sre.SRE_Match object; span=(5, 6), match=']'>
Other regex metacharacters lose their special meaning inside a character class:
>>> re.search('[)*+|]', '123*456')
<_sre.SRE_Match object; span=(3, 4), match='*'>
>>> re.search('[)*+|]', '123+456')
<_sre.SRE_Match object; span=(3, 4), match='+'>
As you saw in the table above, * and + have special meanings in a regex in Python. They designate repetition, which you’ll learn more about shortly. But in this example, they’re inside a character class, so they match themselves literally.
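This makes character classes a convenient way to search for punctuation. The sketch below looks for the first arithmetic operator in a string; the name OPERATOR is illustrative, and the hyphen is placed last so that it is taken literally:

```python
import re

# Inside the class, + and * are literal; the trailing - is literal too.
OPERATOR = '[+*/-]'

for expr in ['12*34', '7+8', '9-1', '100/5', '42']:
    m = re.search(OPERATOR, expr)
    print(expr, '->', m.group() if m else None)
```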
dot (.)
Specifies a wildcard.
The . metacharacter matches any single character except a newline:
>>> re.search('foo.bar', 'fooxbar')
<_sre.SRE_Match object; span=(0, 7), match='fooxbar'>
>>> print(re.search('foo.bar', 'foobar'))
None
>>> print(re.search('foo.bar', 'foo\nbar'))
None
As a regex, foo.bar essentially means the characters 'foo', then any character except newline, then the characters 'bar'. The first string shown above, 'fooxbar', fits the bill because the . metacharacter matches the 'x'.
The second and third strings fail to match. In the last case, although there’s a character between 'foo' and 'bar', it’s a newline, and by default, the . metacharacter doesn’t match a newline. There is, however, a way to force . to match a newline, which you’ll learn about at the end of this tutorial.
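The following sketch tries several different characters in the wildcard position to confirm that only the newline is rejected by default. The dictionary names are illustrative:

```python
import re

separators = {
    'space': ' ',
    'tab': '\t',
    'digit': '7',
    'dash': '-',
    'newline': '\n',
}

# True where 'foo' + character + 'bar' matches the regex foo.bar
results = {
    name: re.search('foo.bar', 'foo' + ch + 'bar') is not None
    for name, ch in separators.items()
}

for name, matched in results.items():
    print(name, matched)
```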


