Regex comes up all the time in NLP, and it’s worth understanding the basics. These days the quickest way to construct a regex is to go ‘Hey <favourite LLM>, make me a regex to do xyz’, yet it is unsatisfying not to understand how the resulting regex is built, and LLMs are not perfect, so troubleshooting may still be required. The following cheatsheet is my summary of Jurafsky & Martin’s very thorough walkthrough in Speech and Language Processing.
The website regexr.com is a great place to test out regexes (I usually work with the multiline flag enabled).
Concatenations
This is the simplest kind of regex: finding a matching sequence of characters, in other words looking for an exact match for something.
| RE | Match | Example patterns matched |
/woodchucks/ | woodchucks | interesting links to woodchucks and lemurs |
/!/ | ! | I can’t believe you said that! |
Disjunctions
A string of characters inside square brackets specifies a disjunction of characters to match: think ‘this or that or the other thing’. These examples show how you can specify individual characters or a range of characters to match.
| RE | Match | Example patterns matched |
/[Ww]oodchucks/ | Woodchucks or woodchucks | Woodchucks are my favourites |
/[abc]/ | a or b or c | I bit a crusty apple |
/[A-Z]/ | any upper case letter | we should call it Drenched Blossoms |
/[0-9]/ | any digit | Chapter 1: Down the rabbit hole |
The pipe operator | matches this or that or the other at the level of whole sequences, and is known as the disjunction operator. I want to say word level, but note how and picks up the ‘and’ inside ‘sand’ as well as the standalone ‘and’.
| RE | Match | Example patterns matched |
/the|and|of/ | the or and or of | In the dark sand and muck of Louisiana… |
Also be aware that /[the|and|of]/ would match any one of the individual characters within the brackets (including the | itself)! This is where regex can become quite subtle: each component has a very specific meaning that can determine quite specific behaviours.
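To see the difference between the pipe and the brackets in practice, here is a minimal sketch using Python's built-in re module (Python is my choice for illustration; the slashes around the patterns in the tables are just delimiters and are dropped in Python strings).

```python
import re

text = "In the dark sand and muck of Louisiana"

# Alternation matches whole sequences; "and" is also found inside "sand".
words = re.findall(r"the|and|of", text)

# Inside square brackets the same characters form a set of single characters,
# and the pipe is just another literal member of that set.
chars = re.findall(r"[the|and|of]", "of |")

print(words)  # ['the', 'and', 'and', 'of']
print(chars)  # ['o', 'f', '|']
```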
Negations
The caret operator ^ precedes the things that are being negated, and it only functions as a negator when it is the first character inside the brackets.
| RE | Match | Example patterns matched |
/[^A-Z]/ | any character that is not an upper case letter | I think June should be a holiday |
/[^Ss]/ | neither S nor s | Joe Soap sank his teeth in |
Note how a caret that is escaped, or that sits anywhere other than first inside the brackets, is just a literal caret!
| RE | Match | Example patterns matched |
/e\^c/ | e^c | Look for e^c please |
/[e^c]/ | e or ^ or c | Look for e’s or ^’s or c’s please |
However, the caret may also act as an anchor (see Anchors section below)!
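The three roles of the caret covered so far can be checked with a short Python sketch (the test strings are my own, chosen only for illustration).

```python
import re

# A leading caret inside brackets negates the set: anything except S or s.
not_s = re.findall(r"[^Ss]", "Sass")

# A caret later in the set is an ordinary member of the set.
literal = re.findall(r"[e^c]", "e^c!")

# An escaped caret outside brackets matches a literal caret.
escaped = re.search(r"e\^c", "Look for e^c please")

print(not_s)            # ['a']
print(literal)          # ['e', '^', 'c']
print(escaped.group())  # e^c
```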
Optionality
The question mark ? is short for ‘0 or 1 instances of the previous character’.
| RE | Match | Example patterns matched |
/woodchucks?/ | woodchuck with or without that last s | This woodchuck is the best of woodchucks |
/colou?r/ | colour or color | This is the color you’re looking for |
The Kleene * says ‘0 or more occurrences of the previous character (or regular expression)’.
| RE | Match | Example patterns matched |
/ba*/ | b, ba, baa, baaa, etc. | b, ba, baa, baaa black sheep |
Whereas the Kleene + says ‘1 or more occurrences of the previous character (or regular expression)’.
| RE | Match | Example patterns matched |
/[0-9]+/ | one or more digits | 1 or 100 or 1000000 |
/ba+/ | ba, baa, baaa, etc. (but not b on its own) | b, ba, baa, baaa black sheep |
The period . represents any single character except a newline (line break).
| RE | Match | Example patterns matched |
/beg.n/ | beg, then any single character, then n | begin or began or begun or beginning |
Some additional optionality abbreviations you may find useful:
| RE | Match |
{n} | n occurrences of the previous char or expression |
{n,m} | from n to m occurrences of the previous char or expression |
{n,} | at least n occurrences of the previous char or expression |
{,m} | up to m occurrences of the previous char or expression |
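A quick way to get a feel for the counters is to run them side by side. The following is a sketch in Python's re module with made-up test strings; other regex engines may differ in small ways (for example, not every engine accepts the {,m} form).

```python
import re

sheep = "b ba baa baaa"

# * allows zero a's, so a bare "b" matches too.
star = re.findall(r"ba*", sheep)

# + needs at least one a, so the bare "b" is skipped.
plus = re.findall(r"ba+", sheep)

# ? makes the preceding u optional.
optional = re.findall(r"colou?r", "color and colour")

# {2,3} takes two or three digits at a time, preferring three.
counted = re.findall(r"[0-9]{2,3}", "7 42 123 55555")

print(star)      # ['b', 'ba', 'baa', 'baaa']
print(plus)      # ['ba', 'baa', 'baaa']
print(optional)  # ['color', 'colour']
print(counted)   # ['42', '123', '555', '55']
```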
Aliases
Aliases provide shortcuts for some commonly used patterns.
| RE | Expansion of | Match |
\d | [0-9] | any digit |
\D | [^0-9] | any non-digit |
\w | [a-zA-Z0-9_] | any alphanumeric or underscore |
\W | [^\w] | any character that is neither alphanumeric nor underscore |
\s | [ \r\t\n\f] | any whitespace (space, tab, line break, etc.) |
\S | [^\s] | any non-whitespace |
NOTE \r = carriage return, \t = tab, \n = line feed, \f = form feed.
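Here is a small Python sketch of the aliases at work on a made-up line. One caveat I am fairly sure of: for ordinary Python 3 strings, \d, \w and \s also match non-ASCII digits, letters and whitespace, so passing the re.ASCII flag is the way to get exactly the expansions listed in the table.

```python
import re

line = "Chapter 1:\tDown the rabbit_hole"

# \d picks out single digits.
digits = re.findall(r"\d", line)

# \w+ grabs runs of letters, digits and underscores.
words = re.findall(r"\w+", line, flags=re.ASCII)

# \s matches each whitespace character, including the tab.
spaces = re.findall(r"\s", line)

print(digits)  # ['1']
print(words)   # ['Chapter', '1', 'Down', 'the', 'rabbit_hole']
print(spaces)  # [' ', '\t', ' ', ' ']
```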
Anchors
An anchor restricts regular expressions to particular point(s) in the string.
| RE | Match | Example patterns matched |
/^The/ | The, but only at the beginning of a line | The dog at The Diner |
/dog$/ | dog, but only at the end of a line | The dog chased another dog |
/^The dog$/ | The dog, but only where that is the only thing on the line | The dog The dog and his boy |
/\bdog\b/ | dog (surrounded by word boundaries) but not dogged where it’s part of a greater word | dog his dogged steps |
/dog\B/ | dog but only where it’s followed by a non-word boundary so within dogged | dog his dogged steps |
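The anchors can be tried out in Python as below. This is a sketch with my own two-line test string; note that ^ and $ only match at each line (rather than just at the start and end of the whole string) when the re.MULTILINE flag is set, which corresponds to the multiline flag on regexr.com.

```python
import re

text = "The dog at The Diner\nThe dog"

# ^The only matches at the start of each line, not the "The" before "Diner".
starts = re.findall(r"^The", text, flags=re.MULTILINE)

# ^...$ requires the pattern to be the only thing on the line.
only_line = re.findall(r"^The dog$", text, flags=re.MULTILINE)

phrase = "dog his dogged steps"

# \b on both sides: the standalone word only.
whole = re.findall(r"\bdog\b", phrase)

# \B after dog: only where more word characters follow (inside "dogged").
inside = [m.start() for m in re.finditer(r"dog\B", phrase)]

print(starts)     # ['The', 'The']
print(only_line)  # ['The dog']
print(whole)      # ['dog']
print(inside)     # [8]
```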
Grouping
Parentheses () are used to group terms when specifying the order in which operations should apply.
| RE | Match | Example patterns matched |
/guppy|ies/ | Incorrect: | has the lowest precedence, so this reads as guppy or ies, and the full word guppies is only partially matched (just the ies) | I have a special guppy of guppies |
/gupp(y|ies)/ | Correct: grouping limits | to the alternatives inside the parentheses | I have a special guppy of guppies |
Parentheses () are also used to form capture groups. Every time a capture group matches, the matched text is stored in a numbered register (referenced by \1, \2, \3, etc.).
Non-capture groups are also formed by parentheses but with ?: at the start (?:). They are used when you don’t want to capture the group in the register.
In the following example \1 refers back to whatever is picked up as a match in the first (and only) capture group (people|cats). In other words if the match picked up in the capture group is ‘cats’ then ‘like some’ has to also be followed by ‘cats’ in order for a match to occur. This is why in the third example no match occurs: cats are not people!
| RE | Example patterns |
(?:some|a few) (people|cats) like some \1 | some cats like some cats |
| | a few people like some people |
| | some cats like some people (no match) |
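The backreference behaviour can be verified with a short Python sketch. I use fullmatch here so that each example sentence has to match in its entirety; that is my choice for the illustration, not something the pattern itself requires.

```python
import re

# (?:...) groups without capturing, so (people|cats) is group 1.
pattern = re.compile(r"(?:some|a few) (people|cats) like some \1")

examples = [
    "some cats like some cats",
    "a few people like some people",
    "some cats like some people",
]

# True where \1 repeats whatever group 1 captured, False otherwise.
results = [bool(pattern.fullmatch(s)) for s in examples]
captured = pattern.fullmatch(examples[0]).group(1)

print(results)   # [True, True, False]
print(captured)  # cats
```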
Operator precedence
Note that operators are processed in the following order:
| Operator | Description | Order |
() | Parenthesis | 1st |
* + ? {} | Counters | 2nd |
^ $ \b \B | Sequences and anchors | 3rd |
| | | Disjunctions | 4th |
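The guppy example from the Grouping section is a handy way to see precedence in action. In this Python sketch I use a non-capturing group (?:...) because re.findall returns only the group contents when a pattern contains a capture group.

```python
import re

text = "I have a special guppy of guppies"

# Sequences bind tighter than |, so this means "guppy" or "ies".
ungrouped = re.findall(r"guppy|ies", text)

# Parentheses bind tightest, so | now only chooses between "y" and "ies".
grouped = re.findall(r"gupp(?:y|ies)", text)

print(ungrouped)  # ['guppy', 'ies']
print(grouped)    # ['guppy', 'guppies']
```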
Greediness
[a-z]* applied to ‘once upon a time’ could match nothing (remember Kleene * says 0 or more occurrences), or it could match ‘o’, ‘on’, ‘onc’ or ‘once’. The greediness comes in where regex always matches the largest string it can possibly match. In this case the match would be ‘once’, because [a-z] does not match the space that follows.
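Running the example pattern in Python shows where the greedy match stops. This is a minimal sketch; re.match only tries the pattern at the start of the string, which is what the example above describes.

```python
import re

# Greedy: [a-z]* takes as many lowercase letters as it can, then stops at the space.
greedy = re.match(r"[a-z]*", "once upon a time").group()

# With nothing matchable at the start, * is still satisfied by zero characters.
empty = re.match(r"[a-z]*", " once").group()

print(repr(greedy))  # 'once'
print(repr(empty))   # ''
```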
Positive & negative lookahead
Positive lookaheads are formed by parentheses but with ?= at the start (?=).
Negative lookaheads are also formed by parentheses but with ?! at the start (?!).
| RE | Match | Example patterns matched |
^(?=Volcano).* | Any line that DOES start with Volcano | Volcanoes are so cool. Icebergs are not. |
^(?!Volcano).* | Any line that DOESN’T start with Volcano | Volcanoes are so cool. Icebergs are not. |
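The two lookahead patterns can be tried on a two-line string in Python, as sketched below. I am assuming the two example sentences sit on separate lines, and I set re.MULTILINE so that ^ matches at the start of each line.

```python
import re

text = "Volcanoes are so cool.\nIcebergs are not."

# Positive lookahead: the line must start with "Volcano"; the lookahead itself consumes nothing.
with_volcano = re.findall(r"^(?=Volcano).*", text, flags=re.MULTILINE)

# Negative lookahead: the line must NOT start with "Volcano".
without_volcano = re.findall(r"^(?!Volcano).*", text, flags=re.MULTILINE)

print(with_volcano)     # ['Volcanoes are so cool.']
print(without_volcano)  # ['Icebergs are not.']
```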
Special characters
Any special characters in regex must be escaped when you want to match the actual character, so if you want to find an actual question mark in your text you’ll specify it with \?.
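In Python you can either write the backslash yourself or let re.escape do the escaping for a literal string. The sketch below shows both; the test strings are my own.

```python
import re

# Escaping by hand: \? matches a literal question mark.
question_marks = re.findall(r"\?", "Is this it? Yes.")

# re.escape adds the backslashes needed to match a string literally.
literal_pattern = re.escape("e^c?")
found = re.search(literal_pattern, "does e^c? match")

print(question_marks)  # ['?']
print(found.group())   # e^c?
```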
Fun fact –

One of the earliest chatbots, Eliza, relied extensively on regex patterns to simulate therapeutic conversation. Jurafsky & Martin give some examples: “ELIZA works by having a series or cascade of regular expression substitutions each of which matches and changes some part of the input lines. Input lines are first uppercased. The first substitutions then change all instances of MY to YOUR, and I’M to YOU ARE, and so on. The next set of substitutions matches and replaces other patterns in the input.” In the following example we can see capture groups at work:
s/.* I’M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
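As a rough illustration of how such a cascade might look in code, here is a minimal Python sketch. It is not the original ELIZA implementation. The uppercased patterns, the space padding around the input, the respond function name and the fallback reply are all my own assumptions, added so that the four substitutions above can run on their own.

```python
import re

# (pattern, replacement) pairs, tried in order; uppercased to suit uppercased input.
RULES = [
    (r".* I'M (DEPRESSED|SAD) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* I AM (DEPRESSED|SAD) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* ALL .*", "IN WHAT WAY"),
    (r".* ALWAYS .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]


def respond(line: str) -> str:
    """Return the reply from the first rule that matches (hypothetical helper)."""
    # Uppercase the input and pad with spaces so the leading ".* " and
    # trailing " .*" can also match at the very edges of the line.
    padded = f" {line.upper()} "
    for pattern, reply in RULES:
        if re.fullmatch(pattern, padded):
            # \1 in the reply is filled from the capture group.
            return re.sub(pattern, reply, padded)
    return "PLEASE GO ON"  # assumed fallback, not part of the rules above


print(respond("Well, I'm sad today"))   # I AM SORRY TO HEAR YOU ARE SAD
print(respond("Men are all alike"))     # IN WHAT WAY
print(respond("They are always late"))  # CAN YOU THINK OF A SPECIFIC EXAMPLE
```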
