Introduction
Regular expressions (regex) are a powerful tool for matching patterns in text. They allow you to search, match, and manipulate strings in a flexible and efficient way. This tutorial will cover the basics of regular expressions in Python, common patterns, and practical use cases with examples for each pattern.
What is a Regular Expression?
A regular expression is a sequence of characters that forms a search pattern. You can use it to check if a string contains a specified pattern, extract parts of the string, replace substrings, and more.
In Python, the re module provides support for working with regular expressions.
import re
Basic Operations
1. Search for a Pattern
The search() function scans the string for the first location where the pattern matches and returns a match object. If there is no match, it returns None.
import re
pattern = r"hello"
text = "Hello, world! hello again."
match = re.search(pattern, text, re.IGNORECASE)
if match:
    print("Match found:", match.group()) # Output: Match found: Hello
else:
    print("No match found")
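The match object carries more than the matched text. The following is a minimal sketch, reusing the sample text above, of reading the match position and handling the no-match case; the variable names are illustrative only.

```python
import re

text = "Hello, world! hello again."

# re.search returns only the first match, scanning left to right.
match = re.search(r"hello", text, re.IGNORECASE)
first_span = match.span()                      # (start, end) indices
first_text = text[match.start():match.end()]   # same as match.group()

# Without re.IGNORECASE the capitalized "Hello" is skipped.
case_sensitive = re.search(r"hello", text)

# When nothing matches, re.search returns None instead of raising.
missing = re.search(r"goodbye", text)

print(first_span, first_text)   # (0, 5) Hello
print(case_sensitive.span())    # (14, 19)
print(missing)                  # None
```

Checking for None before calling group() avoids an AttributeError on a failed search.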
2. Find All Matches
The findall() function returns a list of all non-overlapping matches found in the string, in the order they appear.
pattern = r"\d+" # Pattern to match one or more digits
text = "There are 12 apples, 5 bananas, and 3 oranges."
matches = re.findall(pattern, text)
print("Matches found:", matches) # Output: Matches found: ['12', '5', '3']
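findall() returns plain strings, so positions are lost and numbers stay as text. As a rough sketch on the same sentence, re.finditer() can be used when the position of each match matters, and the strings can be converted when numeric values are needed.

```python
import re

text = "There are 12 apples, 5 bananas, and 3 oranges."

# finditer yields match objects, so each match's start index is available.
found = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]
print(found)  # [('12', 10), ('5', 21), ('3', 36)]

# findall returns strings; convert them before doing arithmetic.
total = sum(int(n) for n in re.findall(r"\d+", text))
print(total)  # 20
```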
3. Substitute Patterns
The sub() function replaces every match of the pattern with a specified replacement string and returns the new string.
pattern = r"\d+" # Pattern to match one or more digits
text = "There are 12 apples, 5 bananas, and 3 oranges."
replacement = "#"
result = re.sub(pattern, replacement, text)
print("Result:", result) # Output: Result: There are # apples, # bananas, and # oranges.
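Two further options of re.sub() are often useful: the count argument limits how many matches are replaced, and the replacement may be a function that receives each match object. The sketch below assumes the same sentence; the helper name double is made up for illustration.

```python
import re

text = "There are 12 apples, 5 bananas, and 3 oranges."

# count=1 limits the substitution to the first match only.
first_only = re.sub(r"\d+", "#", text, count=1)
print(first_only)  # There are # apples, 5 bananas, and 3 oranges.

# A callable replacement receives the match object and returns new text.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r"\d+", double, text)
print(doubled)  # There are 24 apples, 10 bananas, and 6 oranges.
```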
Common Patterns
1. Character Classes
\d: Matches any digit (equivalent to [0-9] for ASCII text; by default it also matches other Unicode decimal digits).
pattern = r"\d" # Match any digit
text = "Phone number: 123-456-7890"
matches = re.findall(pattern, text)
print("Digits found:", matches) # Output: Digits found: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
\D: Matches any non-digit.
pattern = r"\D" # Match any non-digit
text = "Phone number: 123-456-7890"
matches = re.findall(pattern, text)
print("Non-digits found:", matches)
# Output: Non-digits found: ['P', 'h', 'o', 'n', 'e', ' ', 'n', 'u', 'm', 'b', 'e', 'r', ':', ' ', '-', '-']
\w: Matches any word character (equivalent to [a-zA-Z0-9_] when the re.ASCII flag is set; by default it also matches Unicode letters and digits).
pattern = r"\w" # Match any word character
text = "Hello, World!"
matches = re.findall(pattern, text)
print("Word characters found:", matches)
# Output: Word characters found: ['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd']
\W: Matches any non-word character.
pattern = r"\W" # Match any non-word character
text = "Hello, World!"
matches = re.findall(pattern, text)
print("Non-word characters found:", matches)
# Output: Non-word characters found: [',', ' ', '!']
\s: Matches any whitespace character (space, tab, newline, and similar).
pattern = r"\s" # Match any whitespace character
text = "Hello, World!"
matches = re.findall(pattern, text)
print("Whitespace characters found:", matches) # Output: Whitespace characters found: [' ']
\S: Matches any non-whitespace character.
pattern = r"\S" # Match any non-whitespace character
text = "Hello, World!"
matches = re.findall(pattern, text)
print("Non-whitespace characters found:", matches)
# Output: Non-whitespace characters found: ['H', 'e', 'l', 'l', 'o', ',', 'W', 'o', 'r', 'l', 'd', '!']
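The character classes above are Unicode-aware for str patterns in Python 3. The short sketch below, using sample strings chosen only for illustration, shows how the re.ASCII flag restricts \w and \d to the ASCII ranges.

```python
import re

word = "caf\u00e9"    # "café", ending in a non-ASCII letter
digits = "3\u0663"    # ASCII 3 followed by ARABIC-INDIC DIGIT THREE

# Default behaviour: Unicode letters and digits are included.
unicode_word = re.findall(r"\w", word)
unicode_digits = re.findall(r"\d", digits)

# With re.ASCII, \w means [a-zA-Z0-9_] and \d means [0-9].
ascii_word = re.findall(r"\w", word, re.ASCII)
ascii_digits = re.findall(r"\d", digits, re.ASCII)

print(len(unicode_word), len(ascii_word))      # 4 3
print(len(unicode_digits), len(ascii_digits))  # 2 1
```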
2. Anchors
^: Matches the start of a string.
pattern = r"^Hello" # Match if the string starts with "Hello"
text = "Hello, world!"
match = re.search(pattern, text)
if match:
    print("Match found:", match.group()) # Output: Match found: Hello
else:
    print("No match found")
$: Matches the end of a string.
pattern = r"world!$" # Match if the string ends with "world!"
text = "Hello, world!"
match = re.search(pattern, text)
if match:
    print("Match found:", match.group()) # Output: Match found: world!
else:
    print("No match found")
\b: Matches a word boundary (the position between a word character and a non-word character).
pattern = r"\bworld\b" # Match the word "world" as a whole word
text = "Hello, world!"
match = re.search(pattern, text)
if match:
    print("Match found:", match.group()) # Output: Match found: world
else:
    print("No match found")
\B: Matches a position that is not a word boundary.
pattern = r"o\B" # Match "o" not at a word boundary
text = "Hello, world!"
matches = re.findall(pattern, text)
print("Matches found:", matches) # Output: Matches found: ['o']
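By default, ^ and $ anchor to the whole string. A small sketch, assuming a two-line sample text, shows how the re.MULTILINE flag makes ^ match at the start of every line, and how \b keeps a pattern from matching inside longer words.

```python
import re

text = "Hello, world!\nHello again."

# Without flags, ^ matches only at the very start of the string.
default = re.findall(r"^Hello", text)
# With re.MULTILINE, ^ also matches right after each newline.
per_line = re.findall(r"^Hello", text, re.MULTILINE)

print(default)   # ['Hello']
print(per_line)  # ['Hello', 'Hello']

# \b restricts the match to "cat" as a standalone word.
words = "cat concat cats"
loose = re.findall(r"cat", words)
whole = re.findall(r"\bcat\b", words)

print(len(loose), whole)  # 3 ['cat']
```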
3. Quantifiers
*: Matches 0 or more repetitions of the preceding element.
pattern = r"lo*" # Match "l" followed by zero or more "o"
text = "Hello, world!"
matches = re.findall(pattern, text)
print("Matches found:", matches) # Output: Matches found: ['l', 'lo', 'l']
+: Matches 1 or more repetitions of the preceding element.
pattern = r"lo+" # Match "l" followed by one or more "o"
text = "Hello, world!"
matches = re.findall(pattern, text)
print("Matches found:", matches) # Output: Matches found: ['lo']
?: Matches 0 or 1 repetition of the preceding element.
pattern = r"lo?" # Match "l" followed by zero or one "o"
text = "Hello, world!"
matches = re.findall(pattern, text)
print("Matches found:", matches) # Output: Matches found: ['l', 'lo', 'l']
{n}: Matches exactly n repetitions.
pattern = r"o{2}" # Match exactly two "o"
text = "Ooooh!"
matches = re.findall(pattern, text)
print("Matches found:", matches) # Output: Matches found: ['oo']
{n,}: Matches n or more repetitions.
pattern = r"o{2,}" # Match two or more "o"
text = "Ooooh!"
matches = re.findall(pattern, text)
print("Matches found:", matches) # Output: Matches found: ['ooo'] (the uppercase "O" is not matched)
{n,m}: Matches between n and m repetitions.
pattern = r"o{2,4}" # Match between two and four "o"
text = "Ooooh!"
matches = re.findall(pattern, text)
print("Matches found:", matches) # Output: Matches found: ['ooo'] (the uppercase "O" is not matched)
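All of the quantifiers above are greedy: they consume as many repetitions as the rest of the pattern allows. Appending ? to a quantifier makes it lazy, so it takes as few as possible. The sketch below illustrates the difference on sample strings picked for this purpose.

```python
import re

# Greedy: o{2,4} takes all three lowercase "o" characters.
greedy = re.findall(r"o{2,4}", "Ooooh!")
# Lazy: o{2,4}? stops as soon as the minimum of two is reached.
lazy = re.findall(r"o{2,4}?", "Ooooh!")

print(greedy)  # ['ooo']
print(lazy)    # ['oo']

tags = "<a><b>"
# Greedy .+ runs to the last ">" in the string.
greedy_tags = re.findall(r"<.+>", tags)
# Lazy .+? stops at the first ">" it can.
lazy_tags = re.findall(r"<.+?>", tags)

print(greedy_tags)  # ['<a><b>']
print(lazy_tags)    # ['<a>', '<b>']
```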
4. Groups and Capturing
Parentheses () are used to create groups and capture parts of the match. Groups are numbered from 1 in the order of their opening parentheses; group(0) is the whole match.
pattern = r"(\d{3})-(\d{3})-(\d{4})" # Capture groups for phone number parts
text = "Phone number: 123-456-7890"
match = re.search(pattern, text)
if match:
    print("Area code:", match.group(1)) # Output: Area code: 123
    print("Exchange:", match.group(2)) # Output: Exchange: 456
    print("Subscriber:", match.group(3)) # Output: Subscriber: 7890
else:
    print("No match found")
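Numbered groups can become hard to track as a pattern grows. One possible variation on the phone number example uses named groups with the (?P<name>...) syntax; the group names area, exchange, and subscriber are chosen here for illustration.

```python
import re

# Same structure as the pattern above, but each group has a name.
pattern = r"(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<subscriber>\d{4})"
match = re.search(pattern, "Phone number: 123-456-7890")

whole = match.group(0)      # the entire matched text
parts = match.groups()      # all captured groups as a tuple
named = match.groupdict()   # captured groups keyed by name

print(whole)                # 123-456-7890
print(parts)                # ('123', '456', '7890')
print(named["area"])        # 123
```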
Practical Use Cases
1. Email Validation
pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
email = "[email protected]"
if re.match(pattern, email):
    print("Valid email")
else:
    print("Invalid email")
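The check above can be wrapped in a reusable helper. The sketch below is one way to do it, not a complete validator: is_valid_email is a hypothetical name, the addresses are placeholders, and the pattern is the simple one from this section, which does not cover every address allowed by the email standards. It uses re.fullmatch(), which also rejects a trailing newline that $ would tolerate.

```python
import re

# Same character rules as the pattern above, compiled once for reuse.
# fullmatch makes the ^ and $ anchors unnecessary.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_valid_email(address):
    """Return True if the whole string fits the simple pattern."""
    return EMAIL_PATTERN.fullmatch(address) is not None

print(is_valid_email("user@example.com"))     # True
print(is_valid_email("user@example"))         # False (no top-level domain)
print(is_valid_email("user@example.com\n"))   # False (trailing newline)
```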
2. Extracting URLs
pattern = r"https?://(?:www\.)?\S+\.\S+"
text = "Visit our website at http://www.example.com or https://example.org for more info."
urls = re.findall(pattern, text)
print("URLs found:", urls) # Output: URLs found: ['http://www.example.com', 'https://example.org']
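Because \S+ is greedy and matches any non-whitespace character, this pattern also captures punctuation that directly follows a URL. A rough workaround, assuming trailing punctuation is never part of the URL, is to strip it afterwards; the sample sentence is invented for the demonstration.

```python
import re

pattern = r"https?://(?:www\.)?\S+\.\S+"
text = "Docs are at https://example.org/guide, and the mirror is http://www.example.com."

raw = re.findall(pattern, text)
print(raw)      # ['https://example.org/guide,', 'http://www.example.com.']

# Strip common sentence punctuation from the right-hand end of each match.
cleaned = [url.rstrip(".,;:!?") for url in raw]
print(cleaned)  # ['https://example.org/guide', 'http://www.example.com']
```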