Understanding Regular Expressions in Python

Introduction

Regular expressions (regex) are a powerful tool for matching patterns in text. They allow you to search, match, and manipulate strings in a flexible and efficient way. This tutorial will cover the basics of regular expressions in Python, common patterns, and practical use cases with examples for each pattern.

What is a Regular Expression?

A regular expression is a sequence of characters that forms a search pattern. You can use it to check if a string contains a specified pattern, extract parts of the string, replace substrings, and more.

In Python, the re module provides support for working with regular expressions.

import re

Basic Operations

1. Search for a Pattern

The search() function scans the string for the first location where the pattern matches and returns a match object, or None if there is no match.

import re

pattern = r"hello"
text = "Hello, world! hello again."

match = re.search(pattern, text, re.IGNORECASE)
if match:
    print("Match found:", match.group())  # Output: Match found: Hello
else:
    print("No match found")
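
As a minimal sketch of what the returned match object carries, the snippet below reads the position of the match and shows the None result for a pattern that does not occur. The "goodbye" pattern and the variable names are illustrative only.

```python
import re

text = "Hello, world! hello again."

# search() stops at the first match, scanning left to right.
match = re.search(r"hello", text, re.IGNORECASE)
first_span = match.span()                       # (start, end) indices of the match
first_text = text[match.start():match.end()]    # same text as match.group()

# Without re.IGNORECASE, the first case-sensitive match is the later "hello".
strict = re.search(r"hello", text)
strict_start = strict.start()

# When nothing matches, search() returns None instead of raising an error.
missing = re.search(r"goodbye", text)

print(first_span, first_text)  # (0, 5) Hello
print(strict_start)            # 14
print(missing)                 # None
```

Because a failed search returns None, the result should be checked (as in the if/else above) before calling group() on it.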

2. Find All Matches

The findall() function returns a list of all non-overlapping matches in the string, in the order they are found.

pattern = r"\d+"  # Pattern to match one or more digits
text = "There are 12 apples, 5 bananas, and 3 oranges."

matches = re.findall(pattern, text)
print("Matches found:", matches)  # Output: Matches found: ['12', '5', '3']

3. Substitute Patterns

The sub() function replaces matches of the pattern with a specified replacement string.

pattern = r"\d+"  # Pattern to match one or more digits
text = "There are 12 apples, 5 bananas, and 3 oranges."
replacement = "#"

result = re.sub(pattern, replacement, text)
print("Result:", result)  # Output: Result: There are # apples, # bananas, and # oranges.
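
The replacement passed to sub() does not have to be a fixed string. The sketch below, which assumes a hypothetical helper named double, passes a function as the replacement and also uses the count argument to limit how many matches are replaced.

```python
import re

text = "There are 12 apples, 5 bananas, and 3 oranges."

def double(match):
    # The callable receives a match object and must return the replacement string.
    return str(int(match.group()) * 2)

# Each number is replaced by the value the function returns for it.
doubled = re.sub(r"\d+", double, text)

# count=1 limits the substitution to the first match only.
first_only = re.sub(r"\d+", "#", text, count=1)

print(doubled)     # There are 24 apples, 10 bananas, and 6 oranges.
print(first_only)  # There are # apples, 5 bananas, and 3 oranges.
```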

Common Patterns

1. Character Classes

\d: Matches any digit (equivalent to [0-9] for ASCII text; by default it also matches decimal digits from other Unicode scripts).

pattern = r"\d"  # Match any digit
text = "Phone number: 123-456-7890"

matches = re.findall(pattern, text)
print("Digits found:", matches)  # Output: Digits found: ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']

\D: Matches any non-digit.

pattern = r"\D"  # Match any non-digit
text = "Phone number: 123-456-7890"

matches = re.findall(pattern, text)
print("Non-digits found:", matches)
# Output: Non-digits found: ['P', 'h', 'o', 'n', 'e', ' ', 'n', 'u', 'm', 'b', 'e', 'r', ':', ' ', '-', '-']

\w: Matches any word character (equivalent to [a-zA-Z0-9_] for ASCII text; by default it also matches letters and digits from other Unicode scripts).

pattern = r"\w"  # Match any word character
text = "Hello, World!"

matches = re.findall(pattern, text)
print("Word characters found:", matches)
# Output: Word characters found: ['H', 'e', 'l', 'l', 'o', 'W', 'o', 'r', 'l', 'd']

\W: Matches any non-word character.

pattern = r"\W"  # Match any non-word character
text = "Hello, World!"

matches = re.findall(pattern, text)
print("Non-word characters found:", matches)
# Output: Non-word characters found: [',', ' ', '!']

\s: Matches any whitespace character.

pattern = r"\s"  # Match any whitespace character
text = "Hello, World!"

matches = re.findall(pattern, text)
print("Whitespace characters found:", matches)  # Output: Whitespace characters found: [' ']

\S: Matches any non-whitespace character.

pattern = r"\S"  # Match any non-whitespace character
text = "Hello, World!"

matches = re.findall(pattern, text)
print("Non-whitespace characters found:", matches)
# Output: Non-whitespace characters found: ['H', 'e', 'l', 'l', 'o', ',', 'W', 'o', 'r', 'l', 'd', '!']
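
The character classes above are Unicode-aware by default when the pattern is a str. The following sketch, using a made-up sample string, compares the default behavior with the re.ASCII flag, which restricts \d and \w to the ASCII ranges.

```python
import re

# Sample text (illustrative): "caf\u00e9" ends in an accented e,
# and "\u0663" is ARABIC-INDIC DIGIT THREE.
text = "caf\u00e9 \u0663 7"

# Default (Unicode) matching: non-ASCII digits and letters count.
unicode_digits = re.findall(r"\d", text)
unicode_words = re.findall(r"\w+", text)

# With re.ASCII, \d behaves like [0-9] and \w like [a-zA-Z0-9_].
ascii_digits = re.findall(r"\d", text, re.ASCII)
ascii_words = re.findall(r"\w+", text, re.ASCII)

print(len(unicode_digits), ascii_digits)  # 2 ['7']
print(len(unicode_words), ascii_words)    # 3 ['caf', '7']
```

If the input is expected to contain only ASCII digits or identifiers, passing re.ASCII (or writing the class out as [0-9]) makes that intent explicit.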

2. Anchors

^: Matches the start of a string.

pattern = r"^Hello"  # Match if the string starts with "Hello"
text = "Hello, world!"

match = re.search(pattern, text)
if match:
    print("Match found:", match.group())  # Output: Match found: Hello
else:
    print("No match found")

$: Matches the end of a string.

pattern = r"world!$"  # Match if the string ends with "world!"
text = "Hello, world!"

match = re.search(pattern, text)
if match:
    print("Match found:", match.group())  # Output: Match found: world!
else:
    print("No match found")
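
By default, ^ and $ anchor to the string as a whole. The sketch below uses an illustrative two-line string to show how the re.MULTILINE flag changes them to match at the start and end of each line as well.

```python
import re

text = "Hello, world!\nHello again!"

# Default: ^ matches only at the start of the string, $ only at the end.
default_starts = re.findall(r"^Hello", text)
default_ends = re.findall(r"!$", text)

# re.MULTILINE: ^ and $ also match right after and right before each newline.
multi_starts = re.findall(r"^Hello", text, re.MULTILINE)
multi_ends = re.findall(r"!$", text, re.MULTILINE)

print(default_starts, default_ends)  # ['Hello'] ['!']
print(multi_starts, multi_ends)      # ['Hello', 'Hello'] ['!', '!']
```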

\b: Matches a word boundary.

pattern = r"\bworld\b"  # Match the word "world" as a whole word
text = "Hello, world!"

match = re.search(pattern, text)
if match:
    print("Match found:", match.group())  # Output: Match found: world
else:
    print("No match found")

\B: Matches a non-word boundary.

pattern = r"o\B"  # Match "o" not at a word boundary
text = "Hello, world!"

matches = re.findall(pattern, text)
print("Matches found:", matches)  # Output: Matches found: ['o']

3. Quantifiers

*: Matches 0 or more repetitions.

pattern = r"lo*"  # Match "l" followed by zero or more "o"
text = "Hello, world!"

matches = re.findall(pattern, text)
print("Matches found:", matches)  # Output: Matches found: ['l', 'lo', 'l']

+: Matches 1 or more repetitions.

pattern = r"lo+"  # Match "l" followed by one or more "o"
text = "Hello, world!"

matches = re.findall(pattern, text)
print("Matches found:", matches)  # Output: Matches found: ['lo']

?: Matches 0 or 1 repetition.

pattern = r"lo?"  # Match "l" followed by zero or one "o"
text = "Hello, world!"

matches = re.findall(pattern, text)
print("Matches found:", matches)  # Output: Matches found: ['l', 'lo', 'l']

{n}: Matches exactly n repetitions.

pattern = r"o{2}"  # Match exactly two "o"
text = "Ooooh!"

matches = re.findall(pattern, text)
print("Matches found:", matches)  # Output: Matches found: ['oo']

{n,}: Matches n or more repetitions.

pattern = r"o{2,}"  # Match two or more "o"
text = "Ooooh!"

matches = re.findall(pattern, text)
print("Matches found:", matches)  # Output: Matches found: ['ooo']

{n,m}: Matches between n and m repetitions.

pattern = r"o{2,4}"  # Match between two and four "o"
text = "Ooooh!"

matches = re.findall(pattern, text)
print("Matches found:", matches)  # Output: Matches found: ['ooo']
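
Two details affect what the quantifier examples above return: patterns are case-sensitive unless told otherwise, and quantifiers are greedy unless followed by ?. The sketch below illustrates both on the same "Ooooh!" string, which has one uppercase "O" followed by three lowercase "o"; the variable names are illustrative.

```python
import re

text = "Ooooh!"  # one uppercase "O", then three lowercase "o"

# Greedy (default): take as many repetitions as possible.
greedy = re.findall(r"o{2,}", text)

# Lazy: appending ? makes the quantifier take as few repetitions as possible.
lazy = re.findall(r"o{2,}?", text)

# With re.IGNORECASE the leading "O" counts too, so four characters match.
any_case = re.findall(r"o{2,4}", text, re.IGNORECASE)

print(greedy)    # ['ooo']
print(lazy)      # ['oo']
print(any_case)  # ['Oooo']
```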

4. Groups and Capturing

Parentheses () are used to create groups and capture parts of the match.

pattern = r"(\d{3})-(\d{3})-(\d{4})"  # Capture groups for phone number parts
text = "Phone number: 123-456-7890"

match = re.search(pattern, text)
if match:
    print("Area code:", match.group(1))  # Output: Area code: 123
    print("Exchange:", match.group(2))   # Output: Exchange: 456
    print("Subscriber:", match.group(3)) # Output: Subscriber: 7890
else:
    print("No match found")
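
Groups can also be read all at once or given names. The following is a small sketch using the (?P<name>...) syntax; the group names area, exchange, and subscriber are chosen here for illustration and mirror the labels printed above.

```python
import re

# Same phone pattern, with each group given a name.
pattern = r"(?P<area>\d{3})-(?P<exchange>\d{3})-(?P<subscriber>\d{4})"
text = "Phone number: 123-456-7890"

match = re.search(pattern, text)

whole = match.group(0)      # the entire match
parts = match.groups()      # all captured groups as a tuple
named = match.groupdict()   # named groups as a dictionary
area = match.group("area")  # a single group looked up by name

print(whole)  # 123-456-7890
print(parts)  # ('123', '456', '7890')
print(named)  # {'area': '123', 'exchange': '456', 'subscriber': '7890'}
print(area)   # 123
```

Named groups tend to keep code readable when a pattern grows, since the numbering no longer shifts as groups are added or removed.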

Practical Use Cases

1. Email Validation

pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
email = "user@example.com"  # placeholder address for illustration

if re.match(pattern, email):
    print("Valid email")
else:
    print("Invalid email")
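
One way to package this check is a small helper, sketched below. The function name is_valid_email and the sample addresses are hypothetical. The sketch uses fullmatch() instead of match() with ^ and $, because $ also matches just before a trailing newline. Like the pattern above, this is only a rough format check, not a full validation of what mail systems accept.

```python
import re

# Same character classes as the pattern above; fullmatch() makes ^ and $ unnecessary.
EMAIL_PATTERN = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")

def is_valid_email(address):
    """Return True if the whole string looks like an email address."""
    return EMAIL_PATTERN.fullmatch(address) is not None

# Hypothetical sample inputs.
samples = [
    "user@example.com",
    "first.last+tag@mail.example.org",
    "no-at-sign.example.com",
    "user@example",
    "user@example.com\n",
]
results = {sample: is_valid_email(sample) for sample in samples}

for sample, ok in results.items():
    print(repr(sample), "->", "Valid email" if ok else "Invalid email")
```

The first two samples are accepted and the last three are rejected, including the address with a trailing newline, which match() with ^...$ would have accepted.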

2. Extracting URLs

pattern = r"https?://(?:www\.)?\S+\.\S+"
text = "Visit our website at http://www.example.com or https://example.org for more info."

urls = re.findall(pattern, text)
print("URLs found:", urls)  # Output: URLs found: ['http://www.example.com', 'https://example.org']
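
Because \S+ accepts any non-whitespace character, this pattern also captures punctuation that directly follows a URL. The sketch below shows the effect on an illustrative sentence and one possible cleanup step; the set of characters stripped is an assumption and may need adjusting for real data.

```python
import re

pattern = r"https?://(?:www\.)?\S+\.\S+"
text = "See https://example.org/docs, or http://www.example.com."

# The trailing comma and period are non-whitespace, so they are part of the matches.
raw = re.findall(pattern, text)

# Strip common sentence punctuation from the right-hand end of each match.
cleaned = [url.rstrip(".,;:!?)") for url in raw]

print(raw)      # ['https://example.org/docs,', 'http://www.example.com.']
print(cleaned)  # ['https://example.org/docs', 'http://www.example.com']
```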