9.3 Regular Expressions

Regular expressions (regex) are patterns that describe sets of strings. They let you search text for a specific pattern, validate input such as email addresses, extract data, and replace or modify parts of a string.

In Python, the re module is used to work with regular expressions. It provides a set of functions for searching, matching, and manipulating strings based on specific patterns.

1. Introduction to Regular Expressions

A regular expression is a special sequence of characters that helps you match or find other strings or sets of strings using a specialized syntax. Common use cases include:

  • Validating input (e.g., email addresses, phone numbers).
  • Searching for patterns in large bodies of text.
  • Extracting parts of a string, such as dates, names, or emails.

A regular expression consists of:

  • Literal characters: Characters that match themselves (e.g., a, 1, @).
  • Special characters: Characters that have a special meaning in regular expressions, such as . (dot), * (asterisk), + (plus), etc.
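
As a rough illustration of the difference between literal and special characters, the sketch below matches a literal sequence, then shows how an unescaped dot behaves compared with an escaped one. The sample strings are made up for this example.

```python
import re

# Literal characters match themselves, in order.
literal_hit = re.search(r"a1@", "xx a1@ yy")

# An unescaped dot is special: it matches any character except a newline.
dot_hits = re.findall(r"a.c", "abc a-c a.c")

# Escaping the dot with a backslash makes it match a literal dot only.
escaped_hits = re.findall(r"a\.c", "abc a-c a.c")

# re.escape() escapes special characters in a string for you.
auto_escaped = re.escape("a.c")

print(literal_hit.group())  # a1@
print(dot_hits)             # ['abc', 'a-c', 'a.c']
print(escaped_hits)         # ['a.c']
```

re.escape() is handy when the text to search for comes from user input and should be treated as literal characters.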

2. Basic Syntax of Regular Expressions

Here are some basic elements of a regular expression:

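  • . (dot): Matches any character except a newline (the re.DOTALL flag makes it match newlines too).
  • ^: Matches the start of a string (and the start of each line when re.MULTILINE is set).
  • $: Matches the end of a string, or just before a newline at the end of the string.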
  • []: A set of characters, matches any single character inside the brackets.
  • |: Alternation, matches one of several patterns.
  • *: Matches 0 or more repetitions of the preceding element (a character, a set, or a group).
  • +: Matches 1 or more repetitions of the preceding element.
  • ?: Matches 0 or 1 occurrence of the preceding element.
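  • \d: Matches any decimal digit. For ordinary (Unicode) string patterns this includes digits from other scripts; it is equivalent to [0-9] only when the re.ASCII flag is used.
  • \w: Matches any word character: letters, digits, and the underscore. For ordinary string patterns this covers Unicode letters and digits; it is equivalent to [a-zA-Z0-9_] only when the re.ASCII flag is used.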
  • \s: Matches any whitespace character (spaces, tabs, newlines).
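
The elements above can be tried out one at a time. The following is a small sketch with made-up sample strings; each dictionary entry exercises one element from the list, and the expected result is shown in the comment beside it.

```python
import re

checks = {
    "dot": re.findall(r"c.t", "cat cot c\nt"),                # ['cat', 'cot']
    "anchors": bool(re.search(r"^hello$", "hello")),          # True
    "set": re.findall(r"[aeiou]", "regex"),                   # ['e', 'e']
    "alternation": re.findall(r"cat|dog", "cat, dog, cow"),   # ['cat', 'dog']
    "star": re.findall(r"ab*", "a ab abbb"),                  # ['a', 'ab', 'abbb']
    "plus": re.findall(r"ab+", "a ab abbb"),                  # ['ab', 'abbb']
    "optional": re.findall(r"colou?r", "color colour"),       # ['color', 'colour']
    "digit": re.findall(r"\d", "a1b22"),                      # ['1', '2', '2']
    "word": re.findall(r"\w+", "snake_case, x9"),             # ['snake_case', 'x9']
    "space": re.split(r"\s", "a b\tc"),                       # ['a', 'b', 'c']
}

for name, result in checks.items():
    print(name, "->", result)
```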

3. Using the re Module

The re module in Python provides several functions to work with regular expressions. Here are some of the most commonly used functions:

3.1 re.match()

This function checks whether the pattern matches at the beginning of the string. It returns a match object on success and None otherwise.

import re

pattern = r"hello"
text = "hello world"
result = re.match(pattern, text)
if result:
    print("Match found:", result.group())
else:
    print("No match")

Output:

Match found: hello

3.2 re.search()

This function searches for the pattern anywhere in the string and returns the first match.

import re

pattern = r"world"
text = "hello world"
result = re.search(pattern, text)
if result:
    print("Search found:", result.group())
else:
    print("No match")

Output:

Search found: world
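
The difference between re.match() and re.search() is easiest to see when the same pattern is tried with both. The sketch below also uses re.fullmatch(), which is not covered above; it succeeds only when the pattern matches the entire string.

```python
import re

text = "hello world"

# match() only looks at the start of the string, so "world" is not found.
at_start = re.match(r"world", text)

# search() scans the whole string and returns the first match.
anywhere = re.search(r"world", text)

# fullmatch() requires the pattern to cover the entire string.
partial = re.fullmatch(r"hello", text)
whole = re.fullmatch(r"hello world", text)

print(at_start)          # None
print(anywhere.span())   # (6, 11)
print(partial)           # None
print(whole.group())     # hello world
```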

3.3 re.findall()

This function finds all non-overlapping matches of the pattern in the string, scanning from left to right, and returns them as a list.

import re

pattern = r"\d+"  # Find all numbers
text = "There are 10 apples and 5 bananas."
matches = re.findall(pattern, text)
print("Numbers found:", matches)

Output:

Numbers found: ['10', '5']
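
One detail worth knowing: when the pattern contains capturing groups, re.findall() returns the groups rather than the whole match. The sketch below shows this, and uses re.finditer() (not covered above) as one way to get match objects with positions instead of plain strings.

```python
import re

text = "There are 10 apples and 5 bananas."

# With two capturing groups, each result is a tuple of the group contents.
pairs = re.findall(r"(\d+) (\w+)", text)

# finditer() yields match objects, so positions are available too.
positions = [(m.group(), m.start()) for m in re.finditer(r"\d+", text)]

print(pairs)      # [('10', 'apples'), ('5', 'bananas')]
print(positions)  # [('10', 10), ('5', 24)]
```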

3.4 re.sub()

This function is used to replace occurrences of the pattern with a specified string.

import re

pattern = r"\d+"  # Match all numbers
text = "There are 10 apples and 5 bananas."
result = re.sub(pattern, "X", text)
print("Replaced text:", result)

Output:

Replaced text: There are X apples and X bananas.
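
re.sub() can do more than insert a fixed string. As a brief sketch, the example below limits the number of replacements with the count argument, reuses captured groups in the replacement, and passes a function that computes each replacement from the match.

```python
import re

text = "There are 10 apples and 5 bananas."

# count=1 replaces only the first occurrence.
first_only = re.sub(r"\d+", "X", text, count=1)

# \1 and \2 in the replacement refer to the captured groups.
swapped = re.sub(r"(\d+) (\w+)", r"\2: \1", text)

# A function receives each match object and returns the replacement string.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), text)

print(first_only)  # There are X apples and 5 bananas.
print(swapped)     # There are apples: 10 and bananas: 5.
print(doubled)     # There are 20 apples and 10 bananas.
```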

3.5 re.split()

This function splits the string based on the given pattern and returns a list.

import re

pattern = r"\s+"  # Split on one or more whitespace characters
text = "Hello   world!    How are   you?"
result = re.split(pattern, text)
print("Split text:", result)

Output:

Split text: ['Hello', 'world!', 'How', 'are', 'you?']
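
A few behaviours of re.split() are easy to miss. The following sketch, using made-up input strings, splits on a delimiter with optional surrounding whitespace, limits the number of splits with maxsplit, and shows that leading or trailing delimiters produce empty strings in the result.

```python
import re

# Split on commas, ignoring any whitespace around them.
parts = re.split(r"\s*,\s*", "a , b,c ,  d")

# maxsplit=2 performs at most two splits; the rest is left untouched.
limited = re.split(r"\s+", "Hello   world!    How are   you?", maxsplit=2)

# Delimiters at the edges yield empty strings at the start and end.
padded = re.split(r"\s+", "  padded  ")

print(parts)    # ['a', 'b', 'c', 'd']
print(limited)  # ['Hello', 'world!', 'How are   you?']
print(padded)   # ['', 'padded', '']
```

Calling strip() on the text before splitting is one simple way to avoid the empty strings.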

4. Common Use Cases of Regular Expressions

4.1 Email Validation

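A common use case for regular expressions is validating email addresses. The pattern below is a simplified check that covers typical addresses; it does not implement the full email address standard.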

import re

pattern = r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"
email = "test@example.com"
if re.match(pattern, email):
    print("Valid email")
else:
    print("Invalid email")

Output:

Valid email
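
If the same check is needed in several places, one option is to compile the pattern once and wrap it in a small helper. The sketch below assumes hypothetical names (EMAIL_RE, is_valid_email) and uses fullmatch() instead of ^ and $, which also rejects input that ends with a stray newline.

```python
import re

# Same simplified pattern as above, without the ^ and $ anchors
# because fullmatch() already requires the whole string to match.
EMAIL_RE = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")


def is_valid_email(address: str) -> bool:
    """Return True if the address fits the simplified email pattern."""
    return EMAIL_RE.fullmatch(address) is not None


print(is_valid_email("test@example.com"))    # True
print(is_valid_email("test@example"))        # False
print(is_valid_email("test@example.com\n"))  # False
```

With re.match() and a trailing $, the last input would be accepted, because $ also matches just before a newline at the end of the string.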

4.2 Extracting Dates

Regular expressions can also be used to extract dates from text.

import re

pattern = r"\d{2}/\d{2}/\d{4}"  # Matches text shaped like DD/MM/YYYY (does not check that the date is real)
text = "The event is on 25/12/2024, and another one on 01/01/2025."
dates = re.findall(pattern, text)
print("Dates found:", dates)

Output:

Dates found: ['25/12/2024', '01/01/2025']
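
The pattern above only checks the shape of the text, so something like 31/02/2024 would also be returned. One possible refinement, sketched below with hypothetical names (DATE_RE, extract_dates), uses named groups to pull out the parts and lets datetime.date reject values that are not real calendar dates.

```python
import re
from datetime import date

# Named groups make the day, month, and year easy to read back.
DATE_RE = re.compile(r"\b(?P<day>\d{2})/(?P<month>\d{2})/(?P<year>\d{4})\b")


def extract_dates(text):
    """Return the real calendar dates found in text as date objects."""
    dates = []
    for m in DATE_RE.finditer(text):
        try:
            dates.append(date(int(m["year"]), int(m["month"]), int(m["day"])))
        except ValueError:
            # Right shape, but not a real date (e.g. 31/02/2024).
            continue
    return dates


sample = "The event is on 25/12/2024, and another one on 01/01/2025. Ignore 31/02/2024."
print(extract_dates(sample))
# [datetime.date(2024, 12, 25), datetime.date(2025, 1, 1)]
```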

4.3 Finding Phone Numbers

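You can use regular expressions to identify phone numbers written in several common formats. The pattern below targets 10-digit numbers with an optional parenthesized area code and optional dash, dot, or space separators.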

import re

pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
text = "Call me at (123) 456-7890 or 987-654-3210."
phone_numbers = re.findall(pattern, text)
print("Phone numbers found:", phone_numbers)

Output:

Phone numbers found: ['(123) 456-7890', '987-654-3210']
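
Once numbers are found, it is often useful to store them in a single format. The sketch below, with hypothetical names (PHONE_RE, normalize_phones), reuses the pattern above and strips everything except digits from each match.

```python
import re

PHONE_RE = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")


def normalize_phones(text):
    """Find phone numbers in text and return them as digit-only strings."""
    # \D matches any character that is not a digit.
    return [re.sub(r"\D", "", number) for number in PHONE_RE.findall(text)]


text = "Call me at (123) 456-7890 or 987-654-3210."
print(normalize_phones(text))  # ['1234567890', '9876543210']
```

Note that this pattern has no boundaries around it, so it can also match the first ten digits of a longer run of digits. Adding checks around the pattern is one way to tighten it if that matters for your data.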

5. Regular Expressions Best Practices

  • Use raw strings (r"pattern"): In a raw string, Python leaves backslashes as they are, so you do not have to double them. For example, r"\d" is much more readable than "\\d".
  • Be cautious with greedy matches: The quantifiers *, +, and ? are greedy by default (they match as much as possible). Add ? after a quantifier to make it non-greedy if necessary (e.g., .*?).
  • Test regular expressions: Use online tools like regex101 to test and debug your regular expressions.
  • Use specific patterns: Avoid overly broad patterns like .* unless necessary, as they can match more than intended and lead to inefficient matching.
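
Two of these practices can be seen in a short experiment. The sketch below compares a greedy pattern, a non-greedy one, and a more specific character set on a made-up string, then confirms that a raw string and its double-backslash form are the same pattern.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* runs to the last ">" in the string, giving one long match.
greedy = re.findall(r"<.*>", html)

# Non-greedy: .*? stops at the first ">" it can.
lazy = re.findall(r"<.*?>", html)

# Specific: [^>]+ cannot cross a ">", so no backtracking over the rest.
specific = re.findall(r"<[^>]+>", html)

print(greedy)    # ['<b>bold</b> and <i>italic</i>']
print(lazy)      # ['<b>', '</b>', '<i>', '</i>']
print(specific)  # ['<b>', '</b>', '<i>', '</i>']

# A raw string keeps the backslash, so both spellings are the same pattern.
print(r"\d" == "\\d")  # True
```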

6. Summary

  • Regular Expressions are powerful tools for pattern matching in strings.
  • The re module provides several functions like match(), search(), findall(), sub(), and split() to work with regular expressions.
  • Regular expressions are commonly used for tasks like validation, searching, and extracting data.
  • It's important to understand the syntax and use regular expressions efficiently to solve real-world problems.