This week I did something a bit different and rather fun! My colleague Carel phoned to say he was bringing his 11-year-old daughter, Lisa-Marie, to work the next day and did I have anything interesting to share with her about the world of data science?
As it happens I’ve spent the past couple of weeks exploring how to find relationships between words with the skip-gram word2vec model, so I was pretty fired up to share some of what I’d learned! Here are some of the intuitions I covered…
What task have I been working on?
I have a large number of news articles, and I’m interested in finding which ones are on similar topics and what those topics are – obviously without actually reading them all (for those interested, technically I’m aiming to implement a word2vec skip-gram model using PyTorch).
Humans work with words, but machines want numbers
When we do this kind of work there are two main challenges to get started:
- To translate words into numbers (and back again so we can understand the results)
- To reduce the number of words we have to deal with
Take a look at our first sample sentence below, and a few things will stand out:
- We have Rugby (title case) and rugby (lower case) – these will be counted as two different words!
- We have funny punctuation stuck to words, like the quote marks on ‘Rainbow and Nation’ – where ideally we’d just like the words on their own…
Step 1 is to do some coding to clean up these words: typical tasks include converting all the words to lower case, removing punctuation that we don’t want, and converting punctuation that we do want to words like FULLSTOP.
Step 2 is to create a dictionary of words – so the machine can work in numbers, but we can still work in words and we can translate between the two. We can do this by extracting all the unique words from the text and numbering them. Have a look at the word rugby – each time it appears in our text it will be represented by the number 1.
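A sketch of how that dictionary might be built. Numbering by frequency is one common choice (here the most frequent word gets number 0 – whether you start at 0 or 1 is just a convention):

```python
from collections import Counter

def build_vocab(words):
    """Give every unique word a number; the most frequent word gets 0."""
    counts = Counter(words)
    word_to_int = {w: i for i, (w, _) in enumerate(counts.most_common())}
    int_to_word = {i: w for w, i in word_to_int.items()}
    return word_to_int, int_to_word

words = ["rugby", "is", "fun", "and", "rugby", "is", "great"]
word_to_int, int_to_word = build_vocab(words)
[word_to_int[w] for w in words]
# → [0, 1, 2, 3, 0, 1, 4]   (rugby becomes 0 every time it appears)
```

With both dictionaries we can translate words to numbers for the machine and numbers back to words for us.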
Step 3 is about reducing the number of words we have to deal with: 4,751,168 is a LOT of words! And many of them will not even add value – for example, the and of and his will probably be used thousands and thousands of times, and they won’t help us learn anything about the themes in the text. After we remove these very frequent words, we’re down to 1,408,614, which is a lot more manageable.
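One simple way to drop the very frequent words is to remove any word that takes up more than some share of the whole text. The cut-off below is made up for the example – real word2vec uses a fancier version of this idea called subsampling:

```python
from collections import Counter

def drop_frequent(words, max_share=0.001):
    """Drop any word that makes up more than max_share of the whole text."""
    counts = Counter(words)
    total = len(words)
    return [w for w in words if counts[w] / total <= max_share]

# tiny example with a very generous cut-off of 50%
words = ["the"] * 10 + ["rugby"]
drop_frequent(words, max_share=0.5)
# → ['rugby']   ("the" is 10 of the 11 words, so it gets dropped)
```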
So how does a machine learn?
Let’s have a look at our sentence again: if we give our model the word springbok it must try to guess what the surrounding words will be.
Initially, it’s literally just guessing, so it might be wildly off – but we know what the right answers are so we can correct it.
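Where do the right answers come from? From the text itself: for every word, we take the words around it as the answers to be guessed. A rough sketch (the window size – how many words on each side count as “surrounding” – is a choice we make):

```python
def skipgram_pairs(words, window=2):
    """Build (given word, word to guess) pairs from the text itself."""
    pairs = []
    for i, centre in enumerate(words):
        # every word within `window` positions counts as a right answer
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:                    # skip the centre word itself
                pairs.append((centre, words[j]))
    return pairs

skipgram_pairs(["the", "springbok", "scored"], window=1)
# → [('the', 'springbok'), ('springbok', 'the'),
#    ('springbok', 'scored'), ('scored', 'springbok')]
```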
In fact, the machine learning training process is quite a lot like training a dog – Carel coined the term ‘Digital Dog’ in the session – I love it!
So when the Digital Dog does what you want it to do then you reward it with a ‘well done – that was a good guess’, but when the Digital Dog doesn’t perform well you reprimand it and say ‘no – not like that, more like this’. And over time the Digital Dog learns how to guess better and better. And by learning to predict which words surround other words, our machine learning model also learns which words are closely related to other words.
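Behind the scenes, the reward and the reprimand are just numbers: the guess is scored with a loss, and the word vectors get nudged so the right answer becomes more likely next time. Here’s a toy sketch of one training step in plain NumPy – the real project uses PyTorch, and real word2vec uses shortcuts like negative sampling, so this full-softmax version is only to show the idea:

```python
import numpy as np

def train_step(centre, context, W_in, W_out, lr=0.1):
    """One round of guess -> score -> correct, for a single word pair."""
    h = W_in[centre].copy()           # the centre word's current vector
    scores = W_out @ h                # a score for every word in the vocabulary
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()              # turn scores into guess probabilities
    loss = -np.log(probs[context])    # big loss = bad guess
    grad = probs.copy()
    grad[context] -= 1.0              # the "no - more like this" signal
    W_in[centre] -= lr * (W_out.T @ grad)   # nudge the centre word's vector
    W_out -= lr * np.outer(grad, h)         # nudge the output vectors
    return loss

# tiny example: a 5-word vocabulary, 4 numbers per word
rng = np.random.default_rng(0)
W_in = rng.normal(0.0, 0.1, (5, 4))
W_out = rng.normal(0.0, 0.1, (5, 4))
losses = [train_step(1, 3, W_in, W_out) for _ in range(50)]
# the loss shrinks step by step as the Digital Dog learns this pair
```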
Watching the learning
In fact, it’s quite fun to watch the progress – here you can see it in action: every so often we pick out 16 random words and see how the guessing is going… For Epoch, think one big round of the guessing game – so this is the first of 8 big rounds. The first word is what was given, and the next 6 words are what was guessed as the surrounding words. You can see the guesses are terrible – there is very little that makes sense here:
Loss is another word for error – so the bigger the loss the worse the guessing is. But look how much we’ve improved by round 3 – the loss has gone down a lot, so our Digital Dog is starting to behave better. And the word associations are looking better, for example proteas is indeed associated with cricket and related cricket terms.
By round 8 our Digital Dog is extremely well-trained, the loss is low and the word associations are classy! Our model is even starting to pick up subtleties like the relationships between staff and coaching and work.
Our model stores these relationships as vectors (this is a linear algebra thing) and we can visualize these relationships quite nicely in 3D, using the TensorFlow embedding projector – here we can see words similar to Siya Kolisi, who was South Africa’s captain at the 2019 Rugby World Cup. Not a bad job, most of the words are rugby-themed, hooray :).
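How do we find “words similar to” a given word once we have vectors? A common trick is cosine similarity: words whose vectors point in nearly the same direction are related. A tiny sketch with made-up 2-number vectors (real models use hundreds of numbers per word):

```python
import numpy as np

def most_similar(word, word_to_int, int_to_word, vectors, topn=3):
    """Rank the other words by the cosine similarity of their vectors."""
    v = vectors[word_to_int[word]]
    sims = vectors @ v / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(v))
    ranked = np.argsort(-sims)        # best matches first
    return [int_to_word[i] for i in ranked if int_to_word[i] != word][:topn]

# made-up vectors just to show the idea
vectors = np.array([[1.0, 0.0],    # rugby
                    [0.9, 0.1],    # springbok - points nearly the same way
                    [0.0, 1.0]])   # cricket   - points a different way
word_to_int = {"rugby": 0, "springbok": 1, "cricket": 2}
int_to_word = {i: w for w, i in word_to_int.items()}
most_similar("rugby", word_to_int, int_to_word, vectors, topn=1)
# → ['springbok']
```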
Thanks Carel and Lisa-Marie for the chance to share the fun!