Building a Resilient Phone Number Regex
I recently discovered that using regex to extract phone numbers from text is significantly harder than I initially thought.
What started as a mild curiosity quickly morphed into a descent into a regex rabbit hole, culminating in the magnificent beast you see below:
import re
pattern = re.compile(
    r"""
        (?<![\w$])                     # start word boundary, exclude $ prefix
        (?!\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b)  # negative lookahead for date patterns
        (
            # Pattern 1: International numbers starting with +
            (?:\+\d{1,3}(?:[\s.-]*\(?\d{1,4}\)?)?(?:[\s.-]*\d+)+)
            |
            # Pattern 2: Numbers with area codes in parentheses
            (?:\(\d{1,4}\)(?:[\s.-]*\d+)+)
            |
            # Pattern 3: Long sequences of digits (8+ digits, without separators)
            (?:\d{8,})
            |
            # Pattern 4: Numbers with separators (dots, spaces, or dashes)
            (?:\d{1,4}[\s.-]+\d+(?:[\s.-]+\d+)+)
        )
        (?![\w])                      # end word boundary
    """,
    re.VERBOSE | re.IGNORECASE,
)
text = """
...
"""
result = []
for match in pattern.finditer(text):
    start_idx, end_idx = match.span()
    result.append(text[start_idx:end_idx])

Why is this so hard? At first glance, a phone number is just a sequence of digits. But I discovered there are a lot of subtle nuances when trying to capture phone numbers across a range of dialects:
| Category | Input | Expected Output |
|---|---|---|
| I. Australian Numbers | “My number is 0412 345 678 and office is (02) 9876 5432.” | ["0412 345 678", "(02) 9876 5432"] |
| | “Call me on +61 412 345 678 now!” | ["+61 412 345 678"] |
| | “Emergency: 000. Landline: 0391234567, Mobile: 0412-345-678.” | ["0391234567", "0412-345-678"] |
| | “Support: 1300 123 456 or 1800 987 654.” | ["1300 123 456", "1800 987 654"] |
| | “Another mobile 0412.345.678” | ["0412.345.678"] |
| | “No leading zero landline (2) 1234 5678” | ["(2) 1234 5678"] |
| | “International dial-out from AU: 001161412345678” | ["001161412345678"] |
| | “Some short number 1234” | [] |
| | “A number with an X: 0398765432 x123” | ["0398765432"] |
| II. International Numbers (Various Formats) | “US number: +1 (555) 123-4567 ext 890” | ["+1 (555) 123-4567"] |
| | “UK contact: +44 20 7946 0123” | ["+44 20 7946 0123"] |
| | “German number: +49 (0) 30 12345678” | ["+49 (0) 30 12345678"] |
| | “French mobile: +33 6 12 34 56 78” | ["+33 6 12 34 56 78"] |
| | “Japanese: +81 3-1234-5678” | ["+81 3-1234-5678"] |
| | “No country code US: (212) 555-1234 X12” | ["(212) 555-1234"] |
| | “Plain US number: 555-867-5309” | ["555-867-5309"] |
| | “Swiss number: +41 22 345 67 89” | ["+41 22 345 67 89"] |
| | “Irish number: +353 1 123 4567” | ["+353 1 123 4567"] |
| | “Another ext example: 0412345678 EXT: 123” | ["0412345678"] |
| | “Another ext example: 0412345678 #123” | ["0412345678"] |
| III. Edge Cases and False Positives | “This is not a number: 123abc456” | [] |
| | “A date: 12-12-2023” | [] |
| | “A time: 10:30” | [] |
| | “My postcode is 3000” | [] |
| | “A two-digit number 42” | [] |
| | “Price: $1,234.56” | [] |
| | “Short numbers like 123 (PST codes) might be captured.” | [] |
| | “Sequence of numbers 123456789012345 This is a long number” | ["123456789012345"] |
| | “Multiple numbers in one string 0411111111, +61 2 2222 2222, 001161333333333” | ["0411111111", "+61 2 2222 2222", "001161333333333"] |
| | “Number with parentheses, dashes, and periods: (02)-1234.5678” | ["(02)-1234.5678"] |
| | “Number with common prefix, and no separator +61412345678” | ["+61412345678"] |
| | “Number with common prefix, and a lot of separators +61    412    345    678” | ["+61    412    345    678"] |
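A table like this is easy to turn into a quick regression check. Here is a minimal sketch: the pattern is repeated (without its comments) so the snippet runs on its own, and the `extract_phone_numbers` helper and the `CASES` list are names I made up for illustration, holding just a few rows from the table.

```python
import re

# The pattern from the top of the post, repeated so this snippet is standalone.
PHONE_PATTERN = re.compile(
    r"""
    (?<![\w$])
    (?!\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b)
    (
        (?:\+\d{1,3}(?:[\s.-]*\(?\d{1,4}\)?)?(?:[\s.-]*\d+)+)
        |
        (?:\(\d{1,4}\)(?:[\s.-]*\d+)+)
        |
        (?:\d{8,})
        |
        (?:\d{1,4}[\s.-]+\d+(?:[\s.-]+\d+)+)
    )
    (?![\w])
    """,
    re.VERBOSE | re.IGNORECASE,
)


def extract_phone_numbers(text):
    """Return every substring of `text` that the pattern accepts (hypothetical helper)."""
    return [match.group(0) for match in PHONE_PATTERN.finditer(text)]


# A handful of rows copied from the table: (input, expected output).
CASES = [
    ("My number is 0412 345 678 and office is (02) 9876 5432.",
     ["0412 345 678", "(02) 9876 5432"]),
    ("Emergency: 000. Landline: 0391234567, Mobile: 0412-345-678.",
     ["0391234567", "0412-345-678"]),
    ("A number with an X: 0398765432 x123", ["0398765432"]),
    ("US number: +1 (555) 123-4567 ext 890", ["+1 (555) 123-4567"]),
    ("A date: 12-12-2023", []),
    ("Price: $1,234.56", []),
]

for text, expected in CASES:
    actual = extract_phone_numbers(text)
    status = "ok" if actual == expected else "MISMATCH"
    print(f"{status}: {text!r} -> {actual}")
```

Adding the remaining rows to `CASES` should give a cheap way to catch regressions whenever the pattern is tweaked.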
1. (?<![\w$])
(?<!...): This is the syntax for a negative lookbehind.
- It asserts that whatever is inside the (...) does not immediately precede the current position in the string.
- Crucially, the characters examined by the lookbehind are not included in the overall match. They are just a condition.
[\w$]: This is a character class for “word” characters (letters, digits, and
the underscore) plus the literal $ character.
Here, this prevents matching numbers embedded within words (e.g.,
product123-4567) and numbers that are really prices (e.g., $123.50).
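To see the lookbehind in isolation, here is a small sketch that pairs it with a bare `\d+` instead of the full phone pattern (the name `guarded` is just for illustration):

```python
import re

# Digits are only accepted when NOT preceded by a word character or "$".
guarded = re.compile(r"(?<![\w$])\d+")

print(guarded.findall("call 5551234"))    # ['5551234'] (the space is not part of the match)
print(guarded.findall("product5551234"))  # [] (digits follow a letter)
print(guarded.findall("$5551234"))        # [] (digits follow a dollar sign)
```

In the last two cases a match cannot start further into the run of digits either, because every later digit is preceded by another digit, which is itself a word character.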
2. (?!\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b)
(?!...): This is the syntax for a negative lookahead. It asserts that the
pattern inside the lookahead must not appear immediately after the current
position in the string. If the pattern inside does appear, the lookahead
assertion fails, and thus the match fails at that position.
\d{1,2}: Matches one or two digits (the day or the month).
[-/]: Matches a literal - or /.
\d{2,4}: Matches two to four digits (the year).
\b: This is a word boundary. It asserts that the current position is a
boundary between a word character (like a letter, number, or underscore) and a
non-word character (like a space, punctuation, or the beginning/end of the
string). In this context, it ensures that the date pattern isn’t part of a
larger number or string, but rather a distinct “word” or unit.
This lookahead excludes common date patterns, which were a major source of false positives.
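Here is a rough sketch of the date guard at work. The `candidate` pattern is a simplified stand-in I wrote for this example (the two guards followed by three or more dash- or slash-separated digit groups), not the full pattern from the top of the post.

```python
import re

# Simplified stand-in: lookbehind guard, date guard, then at least three
# digit groups separated by "-" or "/".
candidate = re.compile(
    r"(?<![\w$])(?!\d{1,2}[-/]\d{1,2}[-/]\d{2,4}\b)\d+(?:[-/]\d+){2,}"
)

print(candidate.findall("A date: 12-12-2023"))   # [] (date shape, rejected)
print(candidate.findall("Another: 1/2/24"))      # [] (also a date shape)
print(candidate.findall("Plain: 555-867-5309"))  # ['555-867-5309']
print(candidate.findall("Long: 03-12-345678"))   # ['03-12-345678']
```

The last line shows what the `\b` buys: `03-12-3456` looks like a date, but the digits keep going, so the lookahead does not treat it as one and the number survives.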
3. (?:...) | (?:...) | (?:...) | (?:...)
This is where I split out the different forms of phone number patterns. Each
component is a non-capturing group (?:...), which groups parts of the
pattern together without “capturing” the matched content into a separate
back-reference. That is, unlike a capturing group (...), you can’t refer
to the group later in the regex with \1, \2, … or extract it as a separate
group in your code.
They’re also used for:
- Grouping for applying quantifiers: This is their primary use. If you want to apply a quantifier (like *, +, ?, {n,m}) to multiple characters as a unit, but you don’t care about extracting that unit’s match later, a non-capturing group is perfect.
  - Example: (?:abc)+ will match “abc”, “abcabc”, “abcabcabc”, etc. without each “abc” being stored.
- Applying alternation (|) to a specific segment: They let you define a set of alternatives for a specific part of your pattern without creating a new capture group.
  - Example: foo(?:bar|baz) will match “foobar” or “foobaz”. If it were foo(bar|baz), bar or baz would be captured.
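The difference is easiest to see with `re.findall`, which returns the group contents when a pattern has a capturing group and the whole match when it has none. A short sketch using the same toy patterns:

```python
import re

text = "foobar foobaz"

# Capturing group: findall returns only what the group matched.
print(re.findall(r"foo(bar|baz)", text))    # ['bar', 'baz']

# Non-capturing group: findall returns the whole match.
print(re.findall(r"foo(?:bar|baz)", text))  # ['foobar', 'foobaz']

# A quantifier applied to the group as a unit.
print(re.fullmatch(r"(?:abc)+", "abcabcabc") is not None)  # True

# Non-capturing groups do not count towards the pattern's group total.
print(re.compile(r"(?:abc)+").groups)  # 0
print(re.compile(r"(abc)+").groups)    # 1
```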
3a. \+\d{1,3}(?:[\s.-]*\(?\d{1,4}\)?)?(?:[\s.-]*\d+)+
Targets international numbers.
\+\d{1,3}: Starts with a + followed by 1 to 3 digits (country code).
(?:[\s.-]*\(?\d{1,4}\)?)?: Optionally allows for separators (space, dot,
dash), an optional opening bracket, 1-4 digits (area code/city code), and an
optional closing bracket.
- The ? at the end makes this whole section optional, catering to numbers like +44 20... or +1(555)....
(?:[\s.-]*\d+)+: Continues with one or more groups of separators followed by
one or more digits. This repeatedly matches the remaining parts of the number.
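Pulled out on its own, this alternative can be tried against a few of the inputs from the table. This is a sketch of the sub-pattern in isolation, without the surrounding guards, and the name `international` is mine:

```python
import re

international = re.compile(r"\+\d{1,3}(?:[\s.-]*\(?\d{1,4}\)?)?(?:[\s.-]*\d+)+")

samples = [
    "UK contact: +44 20 7946 0123",
    "US number: +1 (555) 123-4567 ext 890",
    "German number: +49 (0) 30 12345678",
    "No separator +61412345678",
]

for sample in samples:
    print(international.search(sample).group(0))
# +44 20 7946 0123
# +1 (555) 123-4567
# +49 (0) 30 12345678
# +61412345678
```

Note that the match stops before ` ext 890`, since a separator must be followed by at least one digit for the repetition to continue.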
3b. \(\d{1,4}\)(?:[\s.-]*\d+)+
Specifically handles numbers where the area code (1-4 digits) is contained in
brackets, like (555) 123-4567.
\(\d{1,4}\): Matches the parenthesized area code.
(?:[\s.-]*\d+)+: Matches the remaining digits, allowing for separators.
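The same kind of isolated check works for the bracketed alternative. Again this is only the sub-pattern, with a made-up name:

```python
import re

bracketed = re.compile(r"\(\d{1,4}\)(?:[\s.-]*\d+)+")

print(bracketed.search("office is (02) 9876 5432.").group(0))  # (02) 9876 5432
print(bracketed.search("(212) 555-1234 X12").group(0))         # (212) 555-1234
print(bracketed.search("(02)-1234.5678").group(0))             # (02)-1234.5678
print(bracketed.search("(PST) 123"))                           # None
```

The last input fails because the brackets must contain digits, which is what keeps ordinary parenthesised words out.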
3c. \d{8,}
This is our catch-all for long, uninterrupted runs of 8 or more digits.
3d. \d{1,4}[\s.-]+\d+(?:[\s.-]+\d+)+
Captures classic XXX-XXX-XXXX or XXX.XXX.XXXX or XXX XXX XXXX styles.
\d{1,4}[\s.-]+: Starts with 1 to 4 digits followed by one or more spaces,
dots, or dashes.
\d+: Then one or more digits.
(?:[\s.-]+\d+)+: And then one or more repetitions of separators followed by
digits. This ensures we’re dealing with at least three groups of digits joined by at least two separators.
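A quick sketch of this alternative on its own (the name `separated` is just for the example) shows both what it accepts and why it needs the guards around it:

```python
import re

separated = re.compile(r"\d{1,4}[\s.-]+\d+(?:[\s.-]+\d+)+")

print(separated.search("555-867-5309").group(0))  # 555-867-5309
print(separated.search("0412.345.678").group(0))  # 0412.345.678
print(separated.search("1300 123 456").group(0))  # 1300 123 456

# Only two digit groups, so these are rejected.
print(separated.search("1234.56"))  # None
print(separated.search("10:30"))    # None (":" is not a separator)

# Without the date lookahead, a date has exactly the right shape.
print(separated.search("12-12-2023").group(0))  # 12-12-2023
```

That last line is the false positive the negative lookahead in step 2 exists to remove.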
4. (?![\w])
Another negative lookahead, this time ensuring that the match isn’t followed by
a word character. The brackets are redundant here: (?![\w]) behaves the same as
(?!\w). It prevents matching 123-4567abc.
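As a minimal illustration, here is the trailing lookahead attached to a toy `\d{3}-\d{4}` pattern rather than the full one (both pattern names are invented for the example):

```python
import re

unguarded = re.compile(r"\d{3}-\d{4}")
guarded = re.compile(r"\d{3}-\d{4}(?!\w)")

print(unguarded.findall("123-4567abc"))  # ['123-4567']
print(guarded.findall("123-4567abc"))    # [] (followed by a letter)
print(guarded.findall("123-4567."))      # ['123-4567'] (punctuation is fine)
```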