Tag: data science

  • Limit theorems explained

    Before we dive into the theorems, let’s tackle a concept one often sees in statistics: the notion of independent, identically distributed (iid) random variables. Whether we’re drawing a sample from a population or conducting a series of experiments like coin flips, we can assess whether the iid assumption holds as follows: Independent? Here we…

  • TensorFlow classification of 475 bird species

    For this project I followed the universal workflow of machine learning as described in Deep Learning with Python (1st edition) by François Chollet. It is a classic text which builds the student’s understanding of neural networks ‘brick by brick’ and was the first book that really gave me a good understanding of where to start with neural…

  • Representing text by counting

    Natural language processing algorithms work with numbers, not text. So how can we convert strings of text into numbers that are representative of the meaning of that text? Some of the simplest methods (which can be surprisingly effective for some applications) are those that count words in various ways. In this article I’ll unpack the…

  • Predicting churn with PySpark

    I decided to tackle the Expresso churn prediction challenge on the Zindi platform during the course of the Big data analysis module of my degree for a couple of reasons. The full project can be viewed in my GitHub repo. The Expresso brief: According to Zindi, “Expresso is an African telecommunications company that provides customers…

  • Using human-in-the-loop techniques

    Many machine-learning tasks rely on the availability of a labelled dataset for training and tuning. But how do we go about evaluation when the dataset we have is not labelled? This is exactly the situation I found myself facing during my final MSc project. I chose to experiment with building a knowledge graph from news…

  • Studying data science through the University of London

    I came across the University of London’s MSc Data Science program towards the end of 2020. At the time my Dad was fighting off Covid – and because I had also been exposed, I was quarantined with him and my Mom for three weeks while we waited to see whether we might also have contracted…

  • The data science “antilibrary”

    I first came across the notion of the “antilibrary” in Maria Popova’s beautiful post reflecting on “Why Unread Books Are More Valuable to Our Lives than Read Ones”. The term was coined by Nassim Nicholas Taleb (author of The Black Swan) who suggests that as your knowledge grows, so too should your accumulation of unread…

  • Populations and samples

    Populations: We can think of the population as the complete set of “things” under consideration, whatever those “things” are. For example, if we’re interested in studying the height of men in South Africa, then the population would be all adult men in South Africa. A population can be described by parameters. Here’s…