Regex basics

Regex comes up all the time in NLP, and it’s worth understanding the basics. These days the quickest way to construct a regex is to go ‘Hey <favourite LLM>, make me a regex to do xyz’, yet it is unsatisfying not to understand how the resulting regex is built, and LLMs are not perfect, so troubleshooting may still be required. The following cheatsheet is my summary of Jurafsky & Martin’s very thorough walkthrough in Speech and Language Processing.

The website regexr.com is a great place to test out regexes (I usually work with the multiline flag enabled).

Concatenations

This is the simplest kind of regex: finding a matching sequence of characters, in other words looking for an exact match for something.

RE	Match	Example patterns matched
`/woodchucks/`	woodchucks	interesting links to woodchucks and lemurs
`/!/`	!	I can’t believe you said that!
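As a quick check outside regexr.com, here is a minimal sketch using Python’s built-in `re` module (my choice of language for the examples, not something the book prescribes). It shows that a concatenation is an exact, case-sensitive match.

```python
import re

text = "interesting links to woodchucks and lemurs"

# A concatenation matches the exact character sequence, case-sensitively.
match = re.search(r"woodchucks", text)
print(match.group())  # woodchucks
print(match.span())   # (21, 31)

# A capital W is a different character, so this finds nothing.
no_match = re.search(r"Woodchucks", text)
print(no_match)       # None
```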

Disjunctions

A string of characters inside square brackets specifies a disjunction of characters to match – think ‘this or that or the other thing’. These examples show how you can specify individual characters or a range of characters to match.

RE	Match	Example patterns matched
`/[Ww]oodchucks/`	Woodchucks or woodchucks	Woodchucks are my favourites
`/[abc]/`	a or b or c	I bit a crusty apple
`/[A-Z]/`	any upper case letter	we should call it Drenched Blossoms
`/[0-9]/`	any digit	Chapter 1: Down the rabbit hole

The pipe operator |, known as the disjunction operator, matches this or that or the other at the level of whole sequences rather than single characters. I want to say word level, but note how `and` picks up the ‘and’ inside ‘sand’ as well as the standalone ‘and’.

RE	Match	Example patterns matched
`/the\|and\|of/`	the or and or of	In the dark sand and muck of Louisiana…

Also be aware that `/[the|and|of]/` would match any one of the individual characters within the brackets, including the | itself! This is where regex can become quite subtle: each component has a very specific meaning that can determine quite specific behaviours.
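To see the difference between the pipe and the brackets side by side, here is a small sketch with Python’s `re.findall`. The second test string is made up for illustration.

```python
import re

text = "In the dark sand and muck of Louisiana"

# The pipe chooses between whole sequences; 'and' is also found inside 'sand'.
pipe_matches = re.findall(r"the|and|of", text)
print(pipe_matches)     # ['the', 'and', 'and', 'of']

# Inside square brackets every character stands alone, and '|' is just a literal.
bracket_matches = re.findall(r"[the|and|of]", "of | and")
print(bracket_matches)  # ['o', 'f', '|', 'a', 'n', 'd']

# A range inside brackets matches any single character in that range.
capitals = re.findall(r"[A-Z]", "we should call it Drenched Blossoms")
print(capitals)         # ['D', 'B']
```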

Negations

The caret operator ^ precedes the things that are being negated, and it only functions as a negator when it is the first character inside the brackets.

RE	Match	Example patterns matched
`/[^A-Z]/`	not any capital letter	I think June should be a holiday
`/[^Ss]/`	neither S nor s	Joe Soap sank his teeth in

Note how a caret that is escaped with a backslash, or that sits inside the brackets but not in first position, is just a caret!

RE	Match	Example patterns matched
`/e\^c/`	e^c	Look for e^c please
`/[e^c]/`	e or ^ or c	Look for e’s or ^’s or c’s please

However, the caret may also act as an anchor (see Anchors section below)!
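The three roles of the caret covered so far can be checked with a short Python sketch (the test strings are my own):

```python
import re

# First inside the brackets: the caret negates the set.
not_s = re.findall(r"[^Ss]", "Sass")
print(not_s)          # ['a']

# Anywhere else inside the brackets: the caret is a literal character.
literal_in_set = re.findall(r"[e^c]", "e^c")
print(literal_in_set)  # ['e', '^', 'c']

# Outside the brackets and escaped with a backslash: also a literal caret.
escaped = re.search(r"e\^c", "Look for e^c please")
print(escaped.group())  # e^c
```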

Optionality

The question mark ? is short for ‘0 or 1 instances of the previous character’.

RE	Match	Example patterns matched
`/woodchucks?/`	woodchuck with or without that last s	This woodchuck is the best of woodchucks
`/colou?r/`	colour or color	This is the color you’re looking for

The Kleene * says ‘0 or more occurrences of the previous character (or regular expression)’.

RE	Match	Example patterns matched
`/ba*/`	b, ba, baa, baaa, etc.	b, ba, baa, baaa black sheep

Whereas the Kleene + says ‘1 or more occurrences of the previous character (or regular expression)’.

RE	Match	Example patterns matched
`/[0-9]+/`	one or more digits	1 or 100 or 1000000
`/ba+/`	ba, baa, baaa, etc. (but not b on its own)	b, ba, baa, baaa black sheep

The period . represents any single character except a line break (in Python, the newline \n).

RE	Match	Example patterns matched
`/beg.n/`	beg, then any single character, then n	begin or began or begun or beginning

Some additional optionality abbreviations you may find useful:

RE	Match
`{n}`	n occurrences of the previous char or expression
`{n,m}`	from n to m occurrences of the previous char or expression
`{n,}`	at least n occurrences of the previous char or expression
`{,m}`	up to m occurrences of the previous char or expression
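Here is a small Python sketch of the counters in this section, using the table examples plus a made-up string of digits for the `{n,m}` form. Support for `{,m}` varies between engines, so I have left it out.

```python
import re

# ? makes the previous character optional.
optional_s = re.findall(r"woodchucks?", "This woodchuck is the best of woodchucks")
print(optional_s)   # ['woodchuck', 'woodchucks']

# * allows zero a's, + needs at least one.
star = re.findall(r"ba*", "b ba baa")
plus = re.findall(r"ba+", "b ba baa")
print(star)         # ['b', 'ba', 'baa']
print(plus)         # ['ba', 'baa']

# . stands for exactly one character between 'beg' and 'n'.
dot = re.findall(r"beg.n", "begin began begun beginning")
print(dot)          # ['begin', 'began', 'begun', 'begin']

# {2,3} asks for two or three digits in a row.
counted = re.findall(r"[0-9]{2,3}", "1 22 333 4444")
print(counted)      # ['22', '333', '444']
```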

Aliases

Aliases provide shortcuts for some commonly used patterns.

RE	Expansion of	Match
`\d`	`[0-9]`	any digit
`\D`	`[^0-9]`	any non-digit
`\w`	`[a-zA-Z0-9_]`	any alphanumeric or underscore
`\W`	`[^\w]`	any character that is neither alphanumeric nor underscore
`\s`	`[ \r\t\n\f]`	any whitespace (space, tab, line break, etc.)
`\S`	`[^\s]`	any non-whitespace

NOTE \r = carriage return, \t = tab, \n = line feed, \f = form feed.
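A short Python sketch of the aliases in use follows. One caveat I should flag: in Python 3 these aliases are Unicode-aware by default for text patterns, so `\w` also matches accented letters. Passing `re.ASCII` brings it back in line with the expansions in the table. The sample strings are my own.

```python
import re

text = "Room 42, floor_3\tnorth"

digits = re.findall(r"\d+", text)
print(digits)       # ['42', '3']

words = re.findall(r"\w+", text)
print(words)        # ['Room', '42', 'floor_3', 'north']

# \s+ treats the space and the tab alike when splitting.
pieces = re.split(r"\s+", text)
print(pieces)       # ['Room', '42,', 'floor_3', 'north']

# Unicode-aware by default; re.ASCII restricts \w to [a-zA-Z0-9_].
unicode_word = re.findall(r"\w+", "caf\u00e9")
ascii_word = re.findall(r"\w+", "caf\u00e9", flags=re.ASCII)
print(unicode_word, ascii_word)
```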

Anchors

An anchor restricts regular expressions to particular point(s) in the string.

RE	Match	Example patterns matched
`/^The/`	The, but only at the beginning of a line	The dog at The Diner
`/dog$/`	dog, but only at the end of a line	The dog chased another dog
`/^The dog$/`	The dog, but only where that is the only thing on the line	The dog The dog and his boy
`/\bdog\b/`	dog (surrounded by word boundaries) but not dogged where it’s part of a greater word	dog his dogged steps
`/dog\B/`	dog but only where it’s followed by a non-word boundary so within dogged	dog his dogged steps
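Anchors are easiest to understand with a two-line string. In this Python sketch the `re.MULTILINE` flag plays the role of the multiline flag I mentioned for regexr.com; without it, ^ and $ only anchor to the start and end of the whole string. The sample text is made up.

```python
import re

text = "The dog\nThe Diner dog"

# Without MULTILINE, ^ only matches at the very start of the string.
start_only = re.findall(r"^The", text)
print(start_only)     # ['The']

# With MULTILINE, ^ and $ match at the start and end of every line.
line_starts = re.findall(r"^The", text, flags=re.MULTILINE)
line_ends = re.findall(r"dog$", text, flags=re.MULTILINE)
print(line_starts)    # ['The', 'The']
print(line_ends)      # ['dog', 'dog']

# \b wants a word boundary, \B wants the absence of one.
sentence = "dog his dogged steps"
whole_word = [m.start() for m in re.finditer(r"\bdog\b", sentence)]
inside_word = [m.start() for m in re.finditer(r"dog\B", sentence)]
print(whole_word)     # [0]
print(inside_word)    # [8]
```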

Grouping

Parentheses () are used to group terms when specifying the order in which operations should apply.

RE	Match	Example patterns matched
`/guppy\|ies/`	Incorrect: sequences take precedence over \|, so this reads as guppy or ies, and the full word guppies is only partially matched	I have a special guppy of guppies
`/gupp(y\|ies)/`	Correct: grouping helps us say that \| only applies within the group	I have a special guppy of guppies
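The guppy example can be reproduced in Python as a rough sketch. One quirk to be aware of: `re.findall` returns the contents of a capturing group rather than the whole match, so I use a non-capturing group `(?:...)` to get the full words back.

```python
import re

text = "I have a special guppy of guppies"

# Without grouping the alternatives are 'guppy' and 'ies'.
ungrouped = re.findall(r"guppy|ies", text)
print(ungrouped)      # ['guppy', 'ies']

# With grouping the | only applies to the ending.
grouped = re.findall(r"gupp(?:y|ies)", text)
print(grouped)        # ['guppy', 'guppies']

# With a capturing group, findall reports only what the group captured.
captured = re.findall(r"gupp(y|ies)", text)
print(captured)       # ['y', 'ies']
```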

Parentheses () are also used to capture matches in the regex register (referenced by \1, \2, \3, etc.).

Capture groups are formed by parentheses (). Every time a capture group is used, the resulting match is stored in the numbered register.

Non-capture groups are also formed by parentheses but with ?: at the start (?:). They are used when you don’t want to capture the group in the register.

In the following example \1 refers back to whatever is picked up as a match in the first (and only) capture group (people|cats). In other words if the match picked up in the capture group is ‘cats’ then ‘like some’ has to also be followed by ‘cats’ in order for a match to occur. This is why in the third example no match occurs: cats are not people!

RE	Example patterns	Match?
`(?:some\|a few) (people\|cats) like some \1`	some cats like some cats	yes
	a few people like some people	yes
	some cats like some people	no
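Here is the same back-reference example as a small Python sketch. Only the second set of parentheses is a capture group, so it is the one that `\1` and `group(1)` refer to.

```python
import re

pattern = re.compile(r"(?:some|a few) (people|cats) like some \1")

sentences = [
    "some cats like some cats",
    "a few people like some people",
    "some cats like some people",
]

# Record what the capture group picked up, or None when there is no match.
captured = []
for sentence in sentences:
    m = pattern.search(sentence)
    captured.append(m.group(1) if m else None)

print(captured)  # ['cats', 'people', None]
```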

Operator precedence

Note that operators are processed in the following order, from highest to lowest precedence:

Operator	Description	Order
`()`	Parentheses	1st
`* + ? {}`	Counters	2nd
`^ $ \b \B`	Sequences and anchors	3rd
`\|`	Disjunction	4th
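A couple of quick checks make the precedence order concrete. This sketch uses `re.fullmatch`, which only succeeds when the whole string matches, and test strings along the lines of the ones Jurafsky & Martin use.

```python
import re

# Counters bind tighter than sequences: in the* the star applies to 'e' only.
star_on_e = re.fullmatch(r"the*", "theee") is not None
star_on_the = re.fullmatch(r"the*", "thethe") is not None
print(star_on_e, star_on_the)    # True False

# Parentheses come first, so here the star applies to the whole group.
grouped_star = re.fullmatch(r"(the)*", "thethe") is not None
print(grouped_star)              # True

# Sequences bind tighter than |: the|any means 'the' or 'any', not th(e|a)ny.
loose = re.fullmatch(r"the|any", "thany") is not None
tight = re.fullmatch(r"th(e|a)ny", "thany") is not None
print(loose, tight)              # False True
```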

Greediness

[a-z]* applied to ‘once upon a time’ could match nothing (remember Kleene * says 0 or more occurrences), or it could match ‘o’, ‘on’, ‘onc’ or ‘once’. The greediness comes in where regex always matches the largest string it can possibly match. In this case the match would be ‘once’, because [a-z] does not match the space that follows.
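Greediness is easy to confirm with `re.match`, which tries the pattern at the start of the string. Because `[a-z]` does not include the space, the greedy match stops at the end of the first word. The non-greedy `*?` form, which the book also covers, matches as little as possible instead.

```python
import re

text = "once upon a time"

# Greedy: take as many lower-case letters as possible, stopping at the space.
greedy = re.match(r"[a-z]*", text).group()
print(repr(greedy))       # 'once'

# Non-greedy: zero occurrences is enough, so the match is empty.
lazy = re.match(r"[a-z]*?", text).group()
print(repr(lazy))         # ''

# Adding the space to the set lets the greedy match run to the end.
with_spaces = re.match(r"[a-z ]*", text).group()
print(repr(with_spaces))  # 'once upon a time'
```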

Positive & negative lookahead

Positive lookaheads are formed by parentheses but with ?= at the start (?=).

Negative lookaheads are also formed by parentheses but with ?! at the start (?!).

RE	Match	Example patterns matched
`^(?=Volcano).*`	Any line that DOES start with Volcano	Volcanoes are so cool. Icebergs are not.
`^(?!Volcano).*`	Any line that DOESN’T start with Volcano	Volcanoes are so cool. Icebergs are not.
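Here are both lookaheads in a Python sketch, assuming the two sentences sit on separate lines and the multiline flag is on. A lookahead only checks what comes next without consuming it, which is why `.*` still returns the whole line, ‘Volcano’ included.

```python
import re

text = "Volcanoes are so cool.\nIcebergs are not."

# Positive lookahead: keep lines that start with 'Volcano'.
starts_with = re.findall(r"^(?=Volcano).*", text, flags=re.MULTILINE)
print(starts_with)   # ['Volcanoes are so cool.']

# Negative lookahead: keep lines that do not start with 'Volcano'.
does_not = re.findall(r"^(?!Volcano).*", text, flags=re.MULTILINE)
print(does_not)      # ['Icebergs are not.']
```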

Special characters

Any special characters in regex must be escaped when you want to match the actual character, so if you want to find an actual question mark in your text you’ll specify it with \?.
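In Python you can escape by hand or let `re.escape` do it for you, which is handy when the text to find comes from somewhere else. A minimal sketch with made-up strings:

```python
import re

# Escaped by hand: \? is a literal question mark, not the 'optional' counter.
question_marks = re.findall(r"\?", "Really? Yes?")
print(question_marks)   # ['?', '?']

# re.escape adds the backslashes needed to treat a string literally.
literal = "e^c"
found = re.search(re.escape(literal), "Look for e^c please")
print(found.group())    # e^c
```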

Fun fact –

Image from ELIZA on Wikipedia

One of the earliest chatbots, ELIZA, relied extensively on regex patterns to simulate therapeutic conversation. Jurafsky & Martin give some examples: “ELIZA works by having a series or cascade of regular expression substitutions each of which matches and changes some part of the input lines. Input lines are first uppercased. The first substitutions then change all instances of MY to YOUR, and I’M to YOU ARE, and so on. The next set of substitutions matches and replaces other patterns in the input.” In the following example we can see capture groups at work:

s/.* I'M (depressed|sad) .*/I AM SORRY TO HEAR YOU ARE \1/
s/.* I AM (depressed|sad) .*/WHY DO YOU THINK YOU ARE \1/
s/.* all .*/IN WHAT WAY/
s/.* always .*/CAN YOU THINK OF A SPECIFIC EXAMPLE/
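To get a feel for how these substitutions behave, here is a rough Python sketch of the four rules above. It is not how ELIZA was actually implemented. The `respond` function and `RULES` list are names I made up, the patterns are written in upper case on the assumption that the input has already been uppercased, and I pad the input with spaces so that words at the very start or end of a line can still match the space-delimited patterns.

```python
import re

# (pattern, response template) pairs, tried in order. Hypothetical names.
RULES = [
    (r".* I'M (DEPRESSED|SAD) .*", r"I AM SORRY TO HEAR YOU ARE \1"),
    (r".* I AM (DEPRESSED|SAD) .*", r"WHY DO YOU THINK YOU ARE \1"),
    (r".* ALL .*", r"IN WHAT WAY"),
    (r".* ALWAYS .*", r"CAN YOU THINK OF A SPECIFIC EXAMPLE"),
]

def respond(line):
    """Return the response of the first matching rule, or None."""
    # Assumption: uppercase first, then pad so edge words match ' WORD '.
    text = " " + line.upper() + " "
    for pattern, template in RULES:
        m = re.fullmatch(pattern, text)
        if m:
            # expand() fills \1 with whatever the capture group matched.
            return m.expand(template)
    return None

print(respond("Well, I'm sad today"))  # I AM SORRY TO HEAR YOU ARE SAD
print(respond("I am depressed"))       # WHY DO YOU THINK YOU ARE DEPRESSED
print(respond("They are all alike"))   # IN WHAT WAY
```

The order of the rules matters here, since the first pattern that matches wins.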