Category: postcards
-
Systems design concepts
Ashish Pratap Singh’s article “System Design was HARD until I Learned these 30 Concepts” is so well written and intuitive that I simply had to save it as a postcard for my future self. Some of these concepts I’ve come across directly through data science; others I have absorbed almost by osmosis over the years…
-
Limit theorems explained
Before we dive into the theorems let’s tackle a concept one often sees in statistics: the notion of independent, identically distributed (iid) random variables. Whether we’re drawing a sample from a population or conducting a series of experiments like coin flips, we can assess whether iid holds true or not as follows: Independent? Here we…
-
Regex basics
Regex comes up all the time in NLP, and it’s worth having an understanding of the basics. These days the quickest way to construct a regex is to say ‘Hey <favourite LLM>, make me a regex to do xyz’, and yet it is unsatisfying not to understand how the provided regex is constructed –…
-
Representing text by counting
Natural language processing algorithms work with numbers, not text. So how can we convert strings of text into numbers that are representative of the meaning of that text? Some of the simplest methods (which can be surprisingly effective for some applications) are those that count words in various ways. In this article I’ll unpack the…
-
Data structures for deep learning
In 2020 I completed the Udacity Deep Learning Nanodegree, which focuses on implementing a variety of deep learning architectures using PyTorch. At the outset, it’s pretty fundamental to understand the data structures you’ll be encountering as inputs to and outputs from your neural network architecture. What I noticed was that plenty of the issues encountered…
-
Tutorial: BigQuery arrays and structs
The first time I encountered the BigQuery export schema this year my heart sank: arrays and structs were not something covered in my SQL intro course! But having spent a few months extracting data like this I’ve come to appreciate the logic. These are all the ‘notes to self’ I wish I’d had at the…
-
Finding relationships between words
I’ve spent the past couple of weeks exploring how to find relationships between words with the skip-gram word2vec model so I was pretty fired up to share some of what I’d learned! Here are some of the intuitions I covered… What task have I been working on? I have a large number of news articles,…
-
Poisson vs Exponential distributions
These distributions are related yet different – here’s a comparison that hopefully clears up any confusion! Poisson: the number of events that occur in an interval of time, for example the number of Metrorail trains that arrive at the platform in an hour. Exponential: the time taken between 2 events occurring, for example the time between one…
-
Populations and samples
Populations: we can think of the population as the complete set of “things”, whatever the “things” are that are under consideration – for example, if we’re interested in studying the height of men in South Africa, then the population would be all adult men in South Africa. A population can be described by parameters. Here’s…
-
Central limit theorem – a worked example
Remember our formal definition: the CLT states that, provided enough samples are taken, the sampling distribution of the sample mean will be approximately normally distributed, regardless of the population distribution. In mathematical terms we therefore say that the mean of the sample means is equal to the population mean: With enough samples this also happens – the sample standard deviation…
