Regular expressions are specially formatted strings or patterns used to match character combinations in strings. They can be used to find or replace string patterns in texts.
Regular expressions are also used for data validation, especially when applications accept input such as email addresses and telephone numbers.
Regular expressions form a small language designed primarily for matching text patterns. To perform text matching or manipulation, you specify a set of rules in the form of a pattern.
Say you want to match strings that are composed only of digits, or a string that starts with two digits, ends with two digits, and has five letters in between.
The overall concept of regular expressions is the same in almost all programming languages, but the implementations differ in their details.
To use regular expressions in Python, you first have to import the re module.
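As a rough sketch, the two rules mentioned above could be written as follows. The helper names is_all_digits and has_digit_letter_shape are made up for this illustration, and "letters" is assumed to mean ASCII letters. The symbols used in the patterns are explained in the sections that follow.

```python
import re

def is_all_digits(text):
    # \d+ means "one or more digits"; fullmatch requires the whole string to fit.
    return re.fullmatch(r'\d+', text) is not None

def has_digit_letter_shape(text):
    # Two digits, then five ASCII letters, then two digits.
    return re.fullmatch(r'\d{2}[A-Za-z]{5}\d{2}', text) is not None

print(is_all_digits('2024'))
print(is_all_digits('20a4'))
print(has_digit_letter_shape('12hello34'))
print(has_digit_letter_shape('1hello234'))
#outputs
#True
#False
#True
#False
```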
Character patterns with regular expressions
In regular expressions, ordinary characters match themselves. In the following examples, we will use the findall() function in the re module, which returns a list of the non-overlapping matches found in a string.
import re

result = re.findall('b', 'bona')
print(result)
result = re.findall('p', 'apple')
print(result)
result = re.findall('ap', 'apple')
print(result)
#outputs
#['b']
#['p', 'p']
#['ap']
In the above example, the findall() method locates the occurrences of a given substring in a string and returns a list of the matches found.
This kind of matching only works for trivial tasks, because you can only match the exact characters or substrings you specify.
What if you want to find out whether a given string is numeric, alphanumeric or alphabetic?
What if you want to check whether the characters in a string are structured in a given way?
For instance, you may want to check if the first item in a string is a number or whether a string contains a specific number of characters.
For these cases, you have to make use of metacharacters.
Metacharacters
Metacharacters are characters that have special meanings to the regular expression engine instead of matching themselves.
As you’ve seen, an ordinary character or substring matches itself in a text, but with metacharacters you can do much more.
Now, let’s take a look at the various metacharacters in regular expressions and how they can be used to perform matching operations in Python.
Period .
The period matches any single character except the newline. One period matches one character, two periods match two characters and so on.
import re

result = re.findall('.', 'hello')
print(result)
result = re.findall('..', 'hello')
print(result)
result = re.findall('...', 'hello')
print(result)
#outputs
#['h', 'e', 'l', 'l', 'o']
#['he', 'll']
#['hel']
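The newline exception can be seen in a small sketch like the one below. The sample strings and variable names are made up for illustration.

```python
import re

# The dot skips the newline, so only 'a' and 'b' are returned.
no_newline = re.findall('.', 'a\nb')
print(no_newline)

# 'h.t' needs exactly one non-newline character between 'h' and 't'.
three_letter = re.findall('h.t', 'hat hot h\nt heat')
print(three_letter)
#outputs
#['a', 'b']
#['hat', 'hot']
```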
Caret ^
This metacharacter is used to match characters or groups of characters occurring at the beginning of a string.
result = re.findall('^h', 'hello')
print(result)
result = re.findall('^he', 'hello')
print(result)
result = re.findall('^hello', 'hello')
print(result)
result = re.findall('^b', 'banana')
print(result)
result = re.findall('^a', 'banana')
print(result)
result = re.findall('^ba', 'banana')
print(result)
#outputs
#['h']
#['he']
#['hello']
#['b']
#[]
#['ba']
Pipe |
The pipe sign, or alternation, matches either the pattern on its left or the pattern on its right.
result = re.findall('a|n', 'banana')
print(result)
result = re.findall('a|n', 'bona')
print(result)
#output
#['a', 'n', 'a', 'n', 'a']
#['n', 'a']
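Alternation also works with longer patterns, and Python tries the alternatives from left to right, stopping at the first one that matches. The following sketch, with made-up sample strings, illustrates both points.

```python
import re

# Each side of the pipe is a complete alternative.
pets = re.findall('cat|dog', 'a cat, a dog and a bird')
print(pets)

# Alternatives are tried left to right, so 'a' wins before 'ab' is considered.
first_wins = re.findall('a|ab', 'ab')
print(first_wins)

# Putting the longer alternative first lets it match.
longer_first = re.findall('ab|a', 'ab')
print(longer_first)
#outputs
#['cat', 'dog']
#['a']
#['ab']
```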
Dollar Sign $
This is used to match a character or group of characters occurring at the end of a string.
result = re.findall('a$', 'banana')
print(result)
result = re.findall('b$', 'banana')
print(result)
result = re.findall('na$', 'banana')
print(result)
#outputs
#['a']
#[]
#['na']
Square brackets []
Square brackets define a set of characters. The set matches any single character it contains, so a match occurs wherever one of the characters in the brackets is found in a string.
The items inside square brackets are also known as a character class or character set.
result = re.findall('[e]', 'hello')
print(result)
result = re.findall('[le]', 'hello')
print(result)
result = re.findall('[hello]', 'hello')
print(result)
#output
#['e']
#['e', 'l', 'l']
#['h', 'e', 'l', 'l', 'o']
Instead of specifying characters one after the other, you can provide a range of characters in the square brackets using a hyphen (-).
result = re.findall('[a-e]', 'hello world')
print(result)
#output
#['e', 'd']
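Several ranges can share one pair of brackets. The sketch below, using a made-up sample string, pulls out digits and ASCII letters separately.

```python
import re

text = 'Room 42B, floor 7'

# [0-9] matches any single digit from 0 to 9.
digits = re.findall('[0-9]', text)
print(digits)

# Two ranges in one set: uppercase or lowercase ASCII letters.
letters = re.findall('[A-Za-z]', text)
print(letters)
#outputs
#['4', '2', '7']
#['R', 'o', 'o', 'm', 'B', 'f', 'l', 'o', 'o', 'r']
```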
Asterisk *
This metacharacter matches zero or more occurrences of the pattern immediately to its left.
result = re.findall('a*n', 'banana')
print(result)
result = re.findall('a*n', 'bona')
print(result)
result = re.findall('a*p', 'apple')
print(result)
result = re.findall('a*', 'apple')
print(result)
result = re.findall('a*', 'candara')
print(result)
#output
#['an', 'an']
#['n']
#['ap', 'p']
#['a', '', '', '', '', '']
#['', 'a', '', '', 'a', '', 'a', '']
Plus +
The plus matches one or more occurrences of the pattern immediately to its left.
result = re.findall('a+n', 'banana')
print(result)
result = re.findall('a+n', 'bona')
print(result)
result = re.findall('a+p', 'apple')
print(result)
result = re.findall('a+', 'apple')
print(result)
result = re.findall('a+', 'candara')
print(result)
#output
#['an', 'an']
#[]
#['ap']
#['a']
#['a', 'a', 'a']
Question Mark ?
The question mark matches zero or one occurrence of the pattern immediately to its left.
result = re.findall('a?n', 'banana')
print(result)
result = re.findall('a?n', 'bona')
print(result)
result = re.findall('a?p', 'apple')
print(result)
result = re.findall('a?', 'apple')
print(result)
result = re.findall('a?', 'candara')
print(result)
#output
#['an', 'an']
#['n']
#['ap', 'p']
#['a', '', '', '', '', '']
#['', 'a', '', '', 'a', '', 'a', '']
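To compare the asterisk, plus and question mark side by side, here is a small sketch that applies each one to the same made-up string.

```python
import re

text = 'ac abc abbc'

# b* allows zero or more 'b' characters between 'a' and 'c'.
star = re.findall('ab*c', text)
# b+ requires at least one 'b'.
plus = re.findall('ab+c', text)
# b? allows at most one 'b'.
optional = re.findall('ab?c', text)

print(star)
print(plus)
print(optional)
#outputs
#['ac', 'abc', 'abbc']
#['abc', 'abbc']
#['ac', 'abc']
```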
Curly Braces {}
Curly braces indicate the minimum and maximum number of repetitions of the pattern immediately to their left.
- {n} matches exactly n repetitions of the preceding pattern
- {n,} matches at least n repetitions of the preceding pattern
- {n,m} matches between n and m repetitions of the preceding pattern
result = re.findall('a{1}n', 'banana')
print(result)
result = re.findall('a{1,2}n', 'banana')
print(result)
result = re.findall('a{1,2}n', 'banaana')
print(result)
result = re.findall('a{1,2}n', 'banaaaana')
print(result)
result = re.findall('a{1,}n', 'banaaaana')
print(result)
#output
#['an', 'an']
#['an', 'an']
#['an', 'aan']
#['an', 'aan']
#['an', 'aaaan']
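Curly braces are handy when a pattern has a fixed shape. The sketch below assumes a made-up "three digits, hyphen, four digits" number format purely for illustration.

```python
import re

# Exactly three digits, a hyphen, then exactly four digits.
numbers = re.findall('[0-9]{3}-[0-9]{4}', 'call 555-1234, not 55-12345')
print(numbers)

# {2,} means two or more, so a single 'o' is skipped.
at_least_two = re.findall('o{2,}', 'so soo sooo')
print(at_least_two)
#outputs
#['555-1234']
#['oo', 'ooo']
```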
Parentheses ()
Parentheses create a group. Unlike square brackets, a group treats all the characters or patterns inside it as one unit. For instance,
result = re.findall('(baa)', 'banana')
print(result)
result = re.findall('[baa]', 'banana')
print(result)
result = re.findall('(ba)', 'banana')
print(result)
result = re.findall('(na)', 'banana')
print(result)
#output
#[]
#['b', 'a', 'a', 'a']
#['ba']
#['na', 'na']
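Because a group acts as one unit, a quantifier placed after it applies to the whole group. The following sketch uses re.search(), which returns a match object (or None), to show the difference. The variable names are illustrative only.

```python
import re

# The quantifier applies to the whole group, so 'na' must appear twice in a row.
grouped = re.search('(na){2}', 'banana')
print(grouped.group(0))

# Without the group, {2} applies only to the 'a', so 'naa' is required.
ungrouped = re.search('na{2}', 'banana')
print(ungrouped)
#outputs
#nana
#None
```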
Backslash \
The backslash is used to escape metacharacters so that they are treated like regular characters. It is also combined with certain ordinary characters to form special sequences.
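A short sketch of escaping, using made-up sample strings: the first call uses the period as a metacharacter, the second escapes it, and the third relies on re.escape() to escape a whole piece of literal text.

```python
import re

# Unescaped, the dot matches every character.
any_char = re.findall('.', 'a.b')
print(any_char)

# Escaped, it matches only a literal period.
literal_dot = re.findall(r'\.', 'a.b')
print(literal_dot)

# re.escape() escapes the metacharacters in text that should be taken literally.
literal_sum = re.findall(re.escape('1+1'), '1+1=2 but 11 is not')
print(literal_sum)
#outputs
#['a', '.', 'b']
#['.']
#['1+1']
```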
Regular expression special sequences
A special sequence begins with a backslash \ and has a special meaning in regular expressions. Here are some of them and their meanings. The patterns are written as raw strings (with an r prefix) so that Python passes the backslash to the re module unchanged.
\A – returns a match if the characters or pattern following the sequence are at the beginning of the string.
result = re.findall(r"\AGo", 'Good morning sir!')
print(result)
result = re.findall(r"\Amo", 'Good morning sir!')
print(result)
result = re.findall(r"\A.+d", 'Good morning sir!')
print(result)
#output
#['Go']
#[]
#['Good']
\d – matches any digit character in the string.
result = re.findall(r"\d", 'Hello 123')
print(result)
result = re.findall(r"\d", '2uj0hs')
print(result)
#output
#['1', '2', '3']
#['2', '0']
\D – matches any character in the string that is not a digit.
result = re.findall(r"\D", 'Hello 123')
print(result)
result = re.findall(r"\D", '2uj0hs')
print(result)
#output
#['H', 'e', 'l', 'l', 'o', ' ']
#['u', 'j', 'h', 's']
\s – matches any whitespace character in the string, such as a space, tab or newline.
result = re.findall(r"\s", 'Hello, how are you?')
print(result)
result = re.findall(r"\s", 'Hello \nSteven!')
print(result)
#outputs
#[' ', ' ', ' ']
#[' ', '\n']
\S – matches any character in the string that is not whitespace.
result = re.findall(r"\S", 'Hello, how are you?')
print(result)
result = re.findall(r"\S", 'Hello \nSteven!')
print(result)
#outputs
#['H', 'e', 'l', 'l', 'o', ',', 'h', 'o', 'w', 'a', 'r', 'e', 'y', 'o', 'u', '?']
#['H', 'e', 'l', 'l', 'o', 'S', 't', 'e', 'v', 'e', 'n', '!']
\w – matches any word character in the string, that is, a letter, a digit or the underscore.
result = re.findall(r"\w", 'Hello!')
print(result)
result = re.findall(r"\w", 'Hellothere 123')
print(result)
result = re.findall(r"\w", 'Hello there &3')
print(result)
#outputs
#['H', 'e', 'l', 'l', 'o']
#['H', 'e', 'l', 'l', 'o', 't', 'h', 'e', 'r', 'e', '1', '2', '3']
#['H', 'e', 'l', 'l', 'o', 't', 'h', 'e', 'r', 'e', '3']
\W – matches any character in the string that is not a word character.
result = re.findall(r"\W", 'Hello!')
print(result)
result = re.findall(r"\W", 'Hello-there 123')
print(result)
result = re.findall(r"\W", 'Hello there &3')
print(result)
#outputs
#['!']
#['-', ' ']
#[' ', ' ', '&']
\b – matches at a word boundary, that is, at the beginning or end of a word. In the examples here it is placed before the pattern, so the pattern must appear at the beginning of a word. Because '\b' in a normal Python string literal means the backspace character, the pattern has to be written as a raw string as shown below.
result = re.findall(r'\bhello', 'hello there')
print(result)
result = re.findall(r'\bre', 'remember here')
print(result)
result = re.findall(r'\bre', 'member here')
print(result)
result = re.findall(r'\bare', 'hello how are you?')
print(result)
#outputs
#['hello']
#['re']
#[]
#['are']
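The sketch below, with made-up sample strings, shows why the raw string matters and how \b can be used on either side of a pattern.

```python
import re

# In a normal string literal, '\b' is the backspace character, not a boundary.
is_backspace = '\b' == '\x08'
print(is_backspace)

# \b on both sides keeps only the standalone word 'cat'.
whole_words = re.findall(r'\bcat\b', 'cat concat cats cat.')
print(whole_words)

# \b after the pattern requires the pattern to end a word.
word_endings = re.findall(r're\b', 'here remember')
print(word_endings)
#outputs
#True
#['cat', 'cat']
#['re']
```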
\B – the opposite of \b. It matches only at a position that is not a word boundary, so a pattern such as \Bre matches 're' only when it is not at the beginning of a word.
result = re.findall(r'\Bhello', 'hello there')
print(result)
result = re.findall(r'\Bre', 'remember')
print(result)
result = re.findall(r'\Bre', 'here here')
print(result)
result = re.findall(r'\Bare', 'hello how are you?')
print(result)
#outputs
#[]
#[]
#['re', 're']
#[]
\Z – returns a match if the specified pattern is at the end of the string.
result = re.findall(r'hello\Z', 'hello there')
print(result)
result = re.findall(r'hello\Z', 'hello there hello')
print(result)
result = re.findall(r're\Z', 'member here')
print(result)
result = re.findall(r'are\Z', 'hello how are you?')
print(result)
#outputs
#[]
#['hello']
#['re']
#[]
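\Z looks similar to the dollar sign, but the two differ when the string ends with a newline: $ also matches just before a trailing newline, while \Z matches only at the very end. A small sketch with a made-up string:

```python
import re

text = 'hello there\n'

# $ matches just before the trailing newline.
with_dollar = re.findall(r'there$', text)
print(with_dollar)

# \Z matches only at the absolute end, which comes after the newline.
with_z = re.findall(r'there\Z', text)
print(with_z)
#outputs
#['there']
#[]
```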
Built-in functions in the re module
Let’s look at some of the functions available in Python's re module for working with regular expressions.
Findall
This is the function we have been using in the previous examples. It returns a list containing all the non-overlapping matches found in the string.
result = re.findall('[a-z]', 'hello world')
print(result)
result = re.findall(r'\w+', 'hello world')
print(result)
#outputs
#['h', 'e', 'l', 'l', 'o', 'w', 'o', 'r', 'l', 'd']
#['hello', 'world']
Search
This function scans the string and returns a match object for the first location where the pattern matches, or None if there is no match, as shown in the example below:
result = re.search(r'\w+', 'hello world')
print(result)
#output
#<re.Match object; span=(0, 5), match='hello'>
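Since search() can return None, it is usually worth checking the result before using it. Here is a minimal sketch; first_number is a hypothetical helper written for this example.

```python
import re

def first_number(text):
    # Look for the first run of digits anywhere in the text.
    match = re.search(r'\d+', text)
    if match is None:
        return None
    # group() returns the text that was matched.
    return match.group()

print(first_number('order 66 shipped in 3 days'))
print(first_number('no digits here'))
#outputs
#66
#None
```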
Split
This function splits the string at every occurrence of a given pattern and returns a list of the remaining substrings. Adjacent matches produce empty strings in the result.
result = re.split(r'\d', 'he scored 23 in one of the tests and 50 in another test')
print(result)
#output
#['he scored ', '', ' in one of the tests and ', '', ' in another test']
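The empty strings appear because each digit is treated as its own separator. One way to avoid them, sketched below, is to split on runs of digits with \d+. The maxsplit argument limits how many splits are made.

```python
import re

sentence = 'he scored 23 in one of the tests and 50 in another test'

# \d+ treats '23' and '50' as single separators, so no empty strings appear.
parts = re.split(r'\d+', sentence)
print(parts)

# maxsplit=1 stops after the first split.
first_split = re.split(r'\d+', sentence, maxsplit=1)
print(first_split)
#outputs
#['he scored ', ' in one of the tests and ', ' in another test']
#['he scored ', ' in one of the tests and 50 in another test']
```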
Sub
This function returns a new string in which every match of the pattern is replaced with a different substring.
result = re.sub('h', 'j', 'hello world')
print(result)
result = re.sub(r'\d', '-', 'he scored 23 in one of the tests and 50 in another test')
print(result)
#output
#jello world
#he scored -- in one of the tests and -- in another test
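The replacement string can also refer back to groups captured by the pattern, and the count argument limits the number of replacements. A brief sketch with made-up strings:

```python
import re

# \1 and \2 in the replacement refer to the first and second groups.
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')
print(swapped)

# count=1 replaces only the first match.
first_only = re.sub('l', 'L', 'hello world', count=1)
print(first_only)
#outputs
#world hello
#heLlo world
```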
Subn
This function makes replacements in a string and returns a tuple containing the new string and the total number of replacements made.
result = re.subn('h', 'j', 'hello world')
print(result)
result = re.subn('h', 'j', 'hello hello')
print(result)
#outputs
#('jello world', 1)
#('jello jello', 2)
Finditer
This function finds all non-overlapping matches and returns an iterator that yields a match object for each one.
result = re.finditer(r'(h\w+)(lo)\Z', 'hello hello hello')
match = next(result)
print(match[0])
print(match[1])
print(match[2])
#outputs
#hello
#hel
#lo
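In practice the iterator is usually consumed with a for loop. The sketch below, using a made-up string, collects each match together with its position.

```python
import re

found = []
for match in re.finditer(r'\d+', 'a1 bb22 ccc333'):
    # Each item is a match object, so the text and position are both available.
    found.append((match.group(), match.span()))

print(found)
#output
#[('1', (1, 2)), ('22', (5, 7)), ('333', (11, 14))]
```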
Match
The match function returns a match object only if the pattern matches at the beginning of the string. The match object records the start and end indexes of the matched text. The function returns None if the pattern does not match there.
result = re.match(r'h\w+', 'hello')
print(result)
result = re.match(r'e\w+', 'hello')
print(result)
#outputs
#<re.Match object; span=(0, 5), match='hello'>
#None
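The difference between match(), search() and fullmatch() is where the pattern is allowed to match. A small sketch on one made-up string:

```python
import re

text = '123abc'

# match() only looks at the beginning of the string.
at_start = re.match(r'\d+', text)
print(at_start.group())

# The letters are not at the beginning, so match() finds nothing...
letters_at_start = re.match(r'[a-z]+', text)
print(letters_at_start)

# ...but search() scans the whole string.
letters_anywhere = re.search(r'[a-z]+', text)
print(letters_anywhere.group())

# fullmatch() requires the entire string to fit the pattern.
whole = re.fullmatch(r'\d+', text)
print(whole)
#outputs
#123
#None
#abc
#None
```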
Group
A group is a part of a regular expression enclosed in parentheses. For instance, the pattern in the example below contains two groups, (h\w+) and (lo). These are otherwise known as subgroups.
result = re.match(r'(h\w+)(lo)\Z', 'hello')
The complete pattern in the above example, however, is (h\w+)(lo)\Z. The outcome of the expression is a match object containing the complete match found in the string. This brings us to the group and groups methods.
The group method returns the entire match when it is called with no argument or with 0, as shown below.
result = re.match(r'(h\w+)(lo)\Z', 'hello').group()
print(result)
result = re.match(r'(h\w+)(lo)\Z', 'hello').group(0)
print(result)
#outputs
#hello
#hello
Hence, group() and group(0) return the same thing.
Groups
This method returns a tuple of subgroups found in the match. These are matches for the various groups that are part of the match pattern.
Subgroups are numbered from 1 upwards.
result = re.match(r'(h\w+)(lo)\Z', 'hello').groups()
print(result)
result = re.match(r'(h\w+)(lo)\Z', 'hello').group(1)
print(result)
result = re.match(r'(h\w+)(lo)\Z', 'hello').group(2)
print(result)
result = re.match(r'(h\w+)(lo)\Z', 'hello').group()
print(result)
#outputs
#('hel', 'lo')
#hel
#lo
#hello
Lastgroup
This attribute holds the name of the last matched capturing group, or None if that group has no name. The groups in the example below are unnamed, so the result is None:
result = re.match(r'(h\w+)(lo)\Z', 'hello').lastgroup
print(result)
#outputs
#None
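lastgroup reports a name only when the groups are named with the (?P<name>...) syntax. In the sketch below the group names start and end are chosen just for this example.

```python
import re

match = re.match(r'(?P<start>h\w+)(?P<end>lo)\Z', 'hello')

# Named groups can be fetched by name as well as by number.
print(match.group('start'))
# groupdict() returns all named groups as a dictionary.
print(match.groupdict())
# The last matched group is the one named 'end'.
print(match.lastgroup)
#outputs
#hel
#{'start': 'hel', 'end': 'lo'}
#end
```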
Lastindex
This attribute holds the index of the last matched capturing group.
result = re.match(r'(h\w+)(lo)\Z', 'hello').lastindex
print(result)
result = re.match(r'(h\w+)(lo)\Z', 'hello').group(2)
print(result)
result = re.match(r'(h\w+)(lo)\Z', 'hello').group(1)
print(result)
result = re.match(r'(h\w+)(lo)\Z', 'hello').group(0)
print(result)
#outputs
#2
#lo
#hel
#hello
Span
This method returns a tuple containing the start index of the match and its end index, which is the position just after the last matched character.
result = re.match(r'(h\w+)(lo)\Z', 'hello').span()
print(result)
#output
#(0, 5)
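Because the end index points just past the match, the two values can be used directly as a slice. A quick sketch with a made-up string:

```python
import re

text = 'say hello'
match = re.search('hello', text)

start, end = match.span()
print((start, end))
# Slicing with the span gives back exactly the matched text.
print(text[start:end])
#outputs
#(4, 9)
#hello
```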
Start and End
The start and end methods return the start and end indexes of the match.
result = re.match(r'(h\w+)(lo)\Z', 'hello').start()
print(result)
result = re.match(r'(h\w+)(lo)\Z', 'hello').end()
print(result)
#outputs
#0
#5
Flags or Options
These are options that are available in regular expressions to control how patterns are matched.
For example:
re.IGNORECASE - ignores case when matching patterns.
re.MULTILINE - makes ^ and $ match at the start and end of each line, not only at the start and end of the whole string.
re.DOTALL - makes the dot match any character, including newlines.
result = re.findall(r'(h\w+)(lo)\Z', 'hello', re.IGNORECASE)
print(result)
result = re.findall(r'(h\w+)(lo)\Z', 'HELLO', re.IGNORECASE)
print(result)
result = re.findall(r'(h\w+)(lo)\Z', 'HELLO')
print(result)
#outputs
#[('hel', 'lo')]
#[('HEL', 'LO')]
#[]
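The other two flags can be tried out in the same way. The sketch below uses a made-up three-line string and also combines two flags with the | operator.

```python
import re

text = 'One\nTwo\nthree'

# Without MULTILINE, ^ matches only at the start of the whole string.
first_line_only = re.findall(r'^\w+', text)
# With MULTILINE, ^ also matches at the start of every line.
every_line = re.findall(r'^\w+', text, re.MULTILINE)

# By default the dot cannot cross the newline between 'Two' and 'three'.
dot_default = re.findall('o.t', text)
# With DOTALL the dot matches the newline as well.
dot_all = re.findall('o.t', text, re.DOTALL)

# Flags are combined with the | operator.
combined = re.findall(r'^t\w+', text, re.IGNORECASE | re.MULTILINE)

print(first_line_only)
print(every_line)
print(dot_default)
print(dot_all)
print(combined)
#outputs
#['One']
#['One', 'Two', 'three']
#[]
#['o\nt']
#['Two', 'three']
```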
Precompiling regular expressions
The compile method of the re module enables you to precompile regular expressions as shown below:
re_pattern = r'(h\w+)'
p_compile = re.compile(re_pattern)
result = re.findall(p_compile, 'hello world how are you')
print(result)
#output
#['hello', 'how']
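A compiled pattern also has its own methods, so it can be reused without passing it back to the re functions. A minimal sketch, with the variable names chosen for illustration:

```python
import re

pattern = re.compile(r'h\w+')

# The compiled object offers the same operations as the module-level functions.
all_matches = pattern.findall('hello world how are you')
first_match = pattern.search('say hi').group()
replaced = pattern.sub('X', 'hello world how are you')

print(all_matches)
print(first_match)
print(replaced)
#outputs
#['hello', 'how']
#hi
#X world X are you
```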