Regex comes up all the time in NLP, and it’s worth understanding the basics. These days the quickest way to construct a regex is to ask ‘Hey <favourite LLM>, make me a regex to do xyz’, yet it is unsatisfying not to understand how the resulting regex is built, and LLMs are not perfect, so troubleshooting may still be required. The following cheatsheet is my summary of Jurafsky & Martin’s very thorough walkthrough in Speech and Language Processing.

The website regexr.com is a great place to test out regexes (I usually work with the multiline flag enabled).

Concatenations

This is the simplest kind of regex: finding a matching sequence of characters, in other words looking for an exact match for something.

RE            Match       Example patterns matched
/woodchucks/  woodchucks  interesting links to woodchucks and lemurs
/!/           !           I can’t believe you said that!
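
As a minimal sketch of the same idea in code, here is a literal match using Python's standard re module (Python is my choice for illustration, and the variable names are made up). The slashes in the tables are only delimiters and are dropped in Python.

```python
import re

text = "interesting links to woodchucks and lemurs"

# A plain sequence of characters matches itself wherever it occurs.
match = re.search(r"woodchucks", text)
found = match.group() if match else None
print(found)  # woodchucks

# Matching is case-sensitive by default, so a capital W is not found.
capitalised = re.search(r"woodchucks", "Woodchucks are my favourites")
print(capitalised)  # None
```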

Disjunctions


A string of characters inside square brackets [] specifies a disjunction of characters to match – think ‘this or that or the other thing’. These examples show how you can list specific characters or give a range of characters to match.

RE               Match                     Example patterns matched
/[Ww]oodchucks/  Woodchucks or woodchucks  Woodchucks are my favourites
/[abc]/          a or b or c               I bit a crusty apple
/[A-Z]/          any upper case letter     we should call it Drenched Blossoms
/[0-9]/          any digit                 Chapter 1: Down the rabbit hole
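
The bracket examples can be tried out with a short sketch like the one below. It assumes Python's re module, and the second woodchuck sentence is my own addition so that both spellings appear.

```python
import re

# Either capitalisation of the first letter is accepted.
woodchucks = re.findall(r"[Ww]oodchucks", "Woodchucks are my favourites, woodchucks rule")
print(woodchucks)  # ['Woodchucks', 'woodchucks']

# Each bracket expression matches exactly one character from the set.
abc = re.findall(r"[abc]", "I bit a crusty apple")
print(abc)  # ['b', 'a', 'c', 'a']

# Ranges: any upper case letter, any digit.
upper = re.findall(r"[A-Z]", "we should call it Drenched Blossoms")
digits = re.findall(r"[0-9]", "Chapter 1: Down the rabbit hole")
print(upper, digits)  # ['D', 'B'] ['1']
```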


The pipe operator |, known as the disjunction operator, matches this or that or the other at the level of whole sequences rather than single characters. I want to say word level, but note how and is picked up inside ‘sand’ as well as on its own in ‘and’.

RE            Match             Example patterns matched
/the|and|of/  the or and or of  In the dark sand and muck of Louisiana…

Also be aware that /[the|or|of]/ would match any one of the individual characters within the brackets, including the | itself! This is where regex can become quite subtle: each component has a very specific meaning that determines quite specific behaviour.
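
A small Python sketch makes the difference concrete. The last pattern is my own addition: it uses word boundaries and a non-capture group (both covered further down) to get the word-level behaviour that the plain pipe does not give.

```python
import re

text = "In the dark sand and muck of Louisiana"

# The pipe picks between whole sequences; 'and' is also found inside 'sand'.
sequences = re.findall(r"the|and|of", text)
print(sequences)  # ['the', 'and', 'and', 'of']

# Inside brackets the same characters are just a set of single characters,
# and the pipe is an ordinary member of that set.
single_chars = re.findall(r"[the|or|of]", "of |")
print(single_chars)  # ['o', 'f', '|']

# Assumption for illustration: \b restricts the alternatives to whole words.
whole_words = re.findall(r"\b(?:the|and|of)\b", text)
print(whole_words)  # ['the', 'and', 'of']
```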

Negations

The caret operator ^ precedes the things that are being negated, and it only functions as a negator when it is the first character inside the brackets.

RE        Match                   Example patterns matched
/[^A-Z]/  not any capital letter  I think June should be a holiday
/[^Ss]/   neither S nor s         Joe Soap sank his teeth in

Note how a caret in other positions is just a caret!

RE       Match        Example patterns matched
/e\^c/   e^c          Look for e^c please
/[e^c]/  e or ^ or c  Look for e’s or ^’s or c’s please

However, the caret may also act as an anchor (see Anchors section below)!
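
The three roles of the caret inside and outside brackets can be checked with a sketch like this one (Python's re module assumed; the short test strings are my own):

```python
import re

# ^ as the first character inside brackets negates the set.
not_capitals = re.findall(r"[^A-Z]", "JuNe")
print(not_capitals)  # ['u', 'e']

# An escaped caret outside brackets is a literal caret.
literal_caret = re.findall(r"e\^c", "Look for e^c please")
print(literal_caret)  # ['e^c']

# A caret that is not first inside the brackets is just one of the characters.
any_of_three = re.findall(r"[e^c]", "e^c")
print(any_of_three)  # ['e', '^', 'c']
```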

Optionality

The question mark ? is short for ‘0 or 1 instances of the previous character’.

RE             Match                                  Example patterns matched
/woodchucks?/  woodchuck with or without that last s  This woodchuck is the best of woodchucks
/colou?r/      colour or color                        This is the color you’re looking for


The Kleene * says ‘0 or more occurrences of the previous character (or regular expression)’.

RE     Match                   Example patterns matched
/ba*/  b, ba, baa, baaa, etc.  b, ba, baa, baaa black sheep

Whereas the Kleene + says ‘1 or more occurrences of the previous character (or regular expression)’.

RE        Match                                  Example patterns matched
/[0-9]+/  one or more digits                     1 or 100 or 1000000
/ba+/     ba, baa, baaa, etc. (but not b alone)  b, ba, baa, baaa black sheep
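
To see ?, * and + side by side, here is a minimal sketch using Python's re module (the variable names are illustrative only). Note how * also picks up the lone b at the start of ‘black’, while + needs at least one a.

```python
import re

# ? makes the preceding character optional.
optional_s = re.findall(r"woodchucks?", "This woodchuck is the best of woodchucks")
optional_u = re.findall(r"colou?r", "color or colour")
print(optional_s)  # ['woodchuck', 'woodchucks']
print(optional_u)  # ['color', 'colour']

sheep = "b, ba, baa, baaa black sheep"

# * allows zero or more a's, so a bare b matches too.
star = re.findall(r"ba*", sheep)
print(star)  # ['b', 'ba', 'baa', 'baaa', 'b']

# + requires at least one a, so a bare b does not match.
plus = re.findall(r"ba+", sheep)
print(plus)  # ['ba', 'baa', 'baaa']

numbers = re.findall(r"[0-9]+", "1 or 100 or 1000000")
print(numbers)  # ['1', '100', '1000000']
```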

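The period . represents any single character except a line break (Jurafsky & Martin say carriage return; exactly which line-break characters are excluded depends on the regex engine).

RE       Match                                   Example patterns matched
/beg.n/  beg, then any single character, then n  begin or began or begun or beginning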
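
A quick sketch of the dot in Python. In Python's re module specifically, the excluded character is the newline \n, and the re.DOTALL flag lifts that restriction; other engines may differ.

```python
import re

# The dot stands in for exactly one character; 'beginning' contributes 'begin'.
begs = re.findall(r"beg.n", "begin or began or begun or beginning")
print(begs)  # ['begin', 'began', 'begun', 'begin']

# By default the dot does not match a newline in Python...
no_newline = re.search(r"a.c", "a\nc")
print(no_newline)  # None

# ...unless the DOTALL flag is set.
with_dotall = re.search(r"a.c", "a\nc", flags=re.DOTALL)
print(with_dotall is not None)  # True
```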

Some additional counters you may find useful:

RE     Match
{n}    exactly n occurrences of the previous char or expression
{n,m}  from n to m occurrences of the previous char or expression
{n,}   at least n occurrences of the previous char or expression
{,m}   up to m occurrences of the previous char or expression
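
The counters can be compared on one sample string, as in this sketch. It assumes Python's re module, which accepts the {,m} form; support for {,m} varies between engines. The \b word boundaries (see Anchors) are my addition so that each word is tested as a whole.

```python
import re

sample = "b ba baa baaa baaaa"

exactly_two = re.findall(r"\bba{2}\b", sample)
print(exactly_two)  # ['baa']

two_to_three = re.findall(r"\bba{2,3}\b", sample)
print(two_to_three)  # ['baa', 'baaa']

at_least_three = re.findall(r"\bba{3,}\b", sample)
print(at_least_three)  # ['baaa', 'baaaa']

up_to_two = re.findall(r"\bba{,2}\b", sample)
print(up_to_two)  # ['b', 'ba', 'baa']
```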

Aliases

Aliases provide shortcuts for some commonly used patterns.

RE  Expansion of  Match
\d  [0-9]         any digit
\D  [^0-9]        any non-digit
\w  [a-zA-Z0-9_]  any alphanumeric or underscore
\W  [^\w]         any character that is neither alphanumeric nor underscore
\s  [ \r\t\n\f]   any whitespace (space, tab, line break, etc.)
\S  [^\s]         any non-whitespace

NOTE \r = carriage return, \t = tab, \n = line feed, \f = form feed.
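
Here is a short sketch of the aliases in Python. One caveat specific to Python 3: for text patterns, \d, \w and \s are Unicode-aware by default, so they match more than the ASCII expansions in the table unless the re.ASCII flag is passed. The sample strings are my own.

```python
import re

digits = re.findall(r"\d+", "Room 101, floor 3")
print(digits)  # ['101', '3']

words = re.findall(r"\w+", "snake_case, v2!")
print(words)  # ['snake_case', 'v2']

# \s+ treats any run of spaces, tabs or line breaks as one separator.
tokens = re.split(r"\s+", "one  two\tthree\nfour")
print(tokens)  # ['one', 'two', 'three', 'four']

# Unicode-aware by default; re.ASCII restricts \w to [a-zA-Z0-9_].
unicode_word = re.findall(r"\w+", "caf\u00e9")
ascii_word = re.findall(r"\w+", "caf\u00e9", flags=re.ASCII)
print(unicode_word, ascii_word)
```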

Anchors

An anchor restricts regular expressions to particular point(s) in the string.

RE           Match                                                           Example patterns matched
/^The/       The, but only at the beginning of a line                        The dog at The Diner
/dog$/       dog, but only at the end of a line                              The dog chased another dog
/^The dog$/  The dog, but only where it is the only thing on the line        The dog / The dog and his boy (two lines; only the first matches)
/\bdog\b/    dog between word boundaries, so not the dog in dogged           dog his dogged steps
/dog\B/      dog not followed by a word boundary, so only the dog in dogged  dog his dogged steps
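
The anchor examples translate to Python roughly as follows. This is a sketch: in Python, ^ and $ only match at line breaks inside a string when the re.MULTILINE flag is set, which corresponds to the multiline flag mentioned at the top.

```python
import re

# ^ only matches the first The; $ only matches the last dog.
starts = re.findall(r"^The", "The dog at The Diner")
ends = re.findall(r"dog$", "The dog chased another dog")
print(starts, ends)  # ['The'] ['dog']

two_lines = "The dog\nThe dog and his boy"

# With MULTILINE, ^ and $ apply to each line; only the first line is exactly 'The dog'.
whole_line = re.findall(r"^The dog$", two_lines, flags=re.MULTILINE)
print(whole_line)  # ['The dog']

# Without the flag, ^ and $ refer to the whole string, so nothing matches.
whole_string = re.findall(r"^The dog$", two_lines)
print(whole_string)  # []

steps = "dog his dogged steps"

# \b requires a word boundary on both sides, so 'dogged' is skipped.
standalone = re.findall(r"\bdog\b", steps)
print(standalone)  # ['dog']

# \B requires the absence of a boundary after 'dog', so only 'dogged' qualifies.
inside = re.search(r"dog\B", steps)
print(inside.start())  # 8
```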

Grouping

Parentheses () are used to group terms when specifying the order in which operations should apply.

RE             Match                                                                  Example patterns matched
/guppy|ies/    Incorrect: matches guppy or ies, so guppies is only partially matched  I have a special guppy of guppies
/gupp(y|ies)/  Correct: the group limits | to the y or ies ending                     I have a special guppy of guppies
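
The guppy example can be reproduced with the sketch below (Python's re module assumed). One Python-specific detail: when a pattern contains a capture group, re.findall returns the group contents rather than the full matches, so re.finditer is used to show the full matches.

```python
import re

text = "I have a special guppy of guppies"

# Without grouping, the alternatives are 'guppy' and 'ies'.
without_group = re.findall(r"guppy|ies", text)
print(without_group)  # ['guppy', 'ies']

# With grouping, | only chooses between the endings 'y' and 'ies'.
with_group = [m.group() for m in re.finditer(r"gupp(y|ies)", text)]
print(with_group)  # ['guppy', 'guppies']

# findall reports what the capture group matched, not the whole match.
captured_only = re.findall(r"gupp(y|ies)", text)
print(captured_only)  # ['y', 'ies']
```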

Parentheses () also form capture groups. Whatever a capture group matches is stored in a numbered register, which can be referenced later in the pattern as \1, \2, \3, etc., with groups numbered in the order of their opening parentheses.

Non-capture groups are also formed by parentheses, but with ?: at the start, as in (?:...). They are used when you want the grouping without storing the match in a register.

In the following example \1 refers back to whatever is picked up as a match in the first (and only) capture group (people|cats). In other words if the match picked up in the capture group is ‘cats’ then ‘like some’ has to also be followed by ‘cats’ in order for a match to occur. This is why in the third example no match occurs: cats are not people!

RE                                         Example patterns
(?:some|a few) (people|cats) like some \1  some cats like some cats (match)
                                           a few people like some people (match)
                                           some cats like some people (no match)
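
The backreference example runs in Python as written. A minimal sketch, with illustrative variable names:

```python
import re

# (?:...) groups without capturing, so (people|cats) is group 1 and \1 refers to it.
pattern = re.compile(r"(?:some|a few) (people|cats) like some \1")

first = pattern.search("some cats like some cats")
second = pattern.search("a few people like some people")
third = pattern.search("some cats like some people")

print(first.group(1))   # cats
print(second.group(1))  # people
print(third)            # None
```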

Operator precedence

Note that operators are applied in the following order, from highest to lowest precedence:

Operator   Description            Order
()         Parenthesis            1st
* + ? {}   Counters               2nd
^ $ \b \B  Sequences and anchors  3rd
|          Disjunction            4th

Greediness

[a-z]* applied to ‘once upon a time’ could match nothing (remember Kleene * says 0 or more occurrences), or it could match ‘o’, ‘on’, ‘onc’ or ‘once’. This is where greediness comes in: a regex always matches the largest string it possibly can. In this case the match would be ‘once’, because [a-z] does not match the space that follows.
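
Greediness is easy to observe in Python. The non-greedy forms *? and +? are not covered in the cheatsheet above; they are included in this sketch only as a contrast, and the HTML-like string is my own example.

```python
import re

# Greedy: take as many lowercase letters as possible from the start.
greedy = re.match(r"[a-z]*", "once upon a time").group()
print(repr(greedy))  # 'once'

# Non-greedy: take as few as possible, which here is nothing at all.
lazy = re.match(r"[a-z]*?", "once upon a time").group()
print(repr(lazy))  # ''

# The same contrast with a dot: greedy runs to the last '>', lazy stops at the first.
greedy_tag = re.search(r"<.+>", "<b>bold</b>").group()
lazy_tag = re.search(r"<.+?>", "<b>bold</b>").group()
print(greedy_tag, lazy_tag)  # <b>bold</b> <b>
```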

Positive & negative lookahead

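Positive lookaheads are formed by parentheses but with ?= at the start, as in (?=...). They succeed only if the enclosed pattern matches at the current position, without consuming any characters.

Negative lookaheads are also formed by parentheses but with ?! at the start, as in (?!...). They succeed only if the enclosed pattern does not match at the current position.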

RE              Match                                     Example patterns matched
^(?=Volcano).*  Any line that DOES start with Volcano     Volcanoes are so cool. (but not: Icebergs are not.)
^(?!Volcano).*  Any line that DOESN’T start with Volcano  Icebergs are not. (but not: Volcanoes are so cool.)
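
Both lookaheads can be tried on the two-line example. This sketch assumes Python's re module with re.MULTILINE so that ^ applies to each line.

```python
import re

text = "Volcanoes are so cool.\nIcebergs are not."

# Positive lookahead: the line must start with Volcano; .* then consumes the line.
volcano_lines = re.findall(r"^(?=Volcano).*", text, flags=re.MULTILINE)
print(volcano_lines)  # ['Volcanoes are so cool.']

# Negative lookahead: the line must not start with Volcano.
other_lines = re.findall(r"^(?!Volcano).*", text, flags=re.MULTILINE)
print(other_lines)  # ['Icebergs are not.']
```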

Special characters

Any special characters in regex must be escaped when you want to match the actual character, so if you want to find an actual question mark in your text you’ll specify it with \?.
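
In Python, escaping can be done by hand or, for a whole literal string, with re.escape. A small sketch, in which the sample strings are my own:

```python
import re

# A backslash turns the ? counter into a literal question mark.
question_marks = re.findall(r"\?", "Really? Are you sure?")
print(question_marks)  # ['?', '?']

literal = "1+1=2?"
sentence = "Is 1+1=2? Yes."

# Unescaped, + and ? act as counters, so the literal text is not found.
unescaped_hit = re.search(literal, sentence)
print(unescaped_hit)  # None

# re.escape adds the backslashes needed to match the text literally.
escaped_hit = re.search(re.escape(literal), sentence)
print(escaped_hit.group())  # 1+1=2?
```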

Fun fact –

One of the earliest chatbots, ELIZA, relied extensively on regex patterns to simulate therapeutic conversation. Jurafsky & Martin give some examples: “ELIZA works by having a series or cascade of regular expression substitutions each of which matches and changes some part of the input lines. Input lines are first uppercased. The first substitutions then change all instances of MY to YOUR, and I’M to YOU ARE, and so on. The next set of substitutions matches and replaces other patterns in the input.” In the following example we can see capture groups at work:

s/.* I'M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
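
The substitutions above are written in sed-style s/pattern/replacement/ notation. Below is a rough sketch of how they might be run in Python; it is not how ELIZA itself was implemented. RULES and respond are hypothetical names, and I have assumed that a straight apostrophe is used, that only the first matching rule is applied, and that re.IGNORECASE is set so the lowercase words in the patterns still match the uppercased input.

```python
import re

# Hypothetical rule table: (pattern, replacement) pairs taken from the lines above.
RULES = [
    (r".* I'M (depressed|sad) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* I AM (depressed|sad) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* all .*", "IN WHAT WAY"),
    (r".* always .*", "CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]


def respond(line):
    """Uppercase the input, then apply the first rule whose pattern matches."""
    line = line.upper()
    for pattern, replacement in RULES:
        # subn returns the new string and the number of substitutions made.
        new_line, count = re.subn(pattern, replacement, line, count=1, flags=re.IGNORECASE)
        if count:
            return new_line
    return None  # no rule applied


print(respond("Well, I'm sad today"))          # I AM SORRY TO HEAR YOU ARE SAD
print(respond("Lately I am depressed a lot"))  # WHY DO YOU THINK YOU ARE DEPRESSED
print(respond("They are all alike"))           # IN WHAT WAY
print(respond("He is always late"))            # CAN YOU THINK OF A SPECIFIC EXAMPLE
```

Note that each pattern expects a space on both sides of the keyword, so an input that begins with ‘I am sad’ or ends with ‘sad’ would not trigger a rule in this sketch.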