Finding relationships between words

I’ve spent the past couple of weeks exploring how to find relationships between words with the skip-gram word2vec model, so I was pretty fired up to share some of what I’d learned! Here are some of the intuitions I covered…

What task have I been working on?

I have a large number of news articles, and I’m interested in finding which ones are on similar topics and what those topics are – obviously without actually reading them all. (For those interested: technically, I’m aiming to implement a word2vec skip-gram model using PyTorch.)

Humans work with words, but machines want numbers

When we do this kind of work, there are two main challenges we face at the start:

To translate words into numbers (and back again so we can understand the results)
To reduce the number of words we have to deal with

Take a look at our first sample sentence below, and a few things will stand out:

We have Rugby (title case) and rugby (lower case) – these words will be counted as 2 different words!
We have funny punctuation, like ‘Rainbow and Nation’ – where ideally we’d just like the word on its own…

Step 1 is to do some coding to clean up these words: typical tasks include converting all the words to lower case, removing punctuation that we don’t want, and converting punctuation that we do want to words like FULLSTOP.

Screenshot 2019-12-12 at 21.33.50
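
To make Step 1 a bit more concrete, here is a minimal sketch of that kind of clean-up in plain Python. The sample sentence, the PUNCT_TOKENS mapping and the clean_text function are all made up for illustration (only FULLSTOP comes from the description above), so treat it as one possible approach rather than the exact code behind the screenshot.

```python
import re

# Hypothetical mapping: punctuation we want to keep, rewritten as words.
PUNCT_TOKENS = {
    ".": " FULLSTOP ",
    ",": " COMMA ",
    "?": " QUESTIONMARK ",
}

def clean_text(text):
    """Lower-case the text, turn selected punctuation into words, drop the rest."""
    text = text.lower()
    for symbol, token in PUNCT_TOKENS.items():
        text = text.replace(symbol, token)
    # Anything that is not a letter, digit or whitespace becomes a space.
    text = re.sub(r"[^a-zA-Z0-9\s]", " ", text)
    return text.split()

# Made-up sample sentence, not the one from the article data.
tokens = clean_text("Rugby fans love rugby, and the 'Rainbow Nation' cheered.")
print(tokens)
```

After this step, Rugby and rugby end up as the same token, and the quote marks around the made-up phrase are gone.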

Step 2 is to create a dictionary of words – so that the machine can work in numbers, but we can still work in words and we can translate between the two. We can do this by extracting all the unique words from the text and numbering them. Have a look at the word rugby – each time it appears in our text it will be represented by the number 1.
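
As a rough illustration of Step 2, the sketch below builds the two lookup tables (word to number, and number back to word) from a tiny made-up token list. The names build_vocab, word_to_int and int_to_word are my own, and numbering the most frequent word first is an assumption, just one common convention.

```python
from collections import Counter

def build_vocab(tokens):
    """Number the unique words, most frequent first (an assumed convention)."""
    counts = Counter(tokens)
    ordered = [word for word, _ in counts.most_common()]
    word_to_int = {word: index for index, word in enumerate(ordered)}
    int_to_word = {index: word for word, index in word_to_int.items()}
    return word_to_int, int_to_word

# Made-up tokens for illustration.
tokens = ["the", "springbok", "rugby", "team", "loves", "rugby"]
word_to_int, int_to_word = build_vocab(tokens)

encoded = [word_to_int[word] for word in tokens]   # words -> numbers
decoded = [int_to_word[index] for index in encoded]  # numbers -> words
print(encoded)
print(decoded)
```

Every occurrence of rugby gets the same number, and decoding the numbers gives us back the original words.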

Step 3 is about reducing the number of words we have to deal with: 4,751,168 is a LOT of words! And many of them will not even add value – for example, “the”, “of” and “his” will probably be used thousands of times, and they won’t help us learn anything about the themes in the text. After we remove these kinds of words, known as “stop words”, we’re down to 1,408,614, which is a lot more manageable.
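
Stop-word removal can be as simple as a filter against a fixed list. In the sketch below the STOP_WORDS set is a tiny hypothetical list and the tokens are made up; a real run would use a much longer list.

```python
# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "and", "of", "his", "a", "in", "to"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word list."""
    return [token for token in tokens if token not in stop_words]

# Made-up tokens for illustration.
tokens = ["the", "captain", "of", "the", "springbok", "team", "and", "his", "coach"]
filtered = remove_stop_words(tokens)
print(len(tokens), "->", len(filtered))
print(filtered)
```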

So how does a machine learn?

Screenshot 2019-12-12 at 22.02.01.png

By guessing!

Let’s have a look at our sentence again: if we give our model the word springbok, it must try to guess which words surround it.
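
The guessing game needs question-and-answer pairs: a centre word and each word that appears near it. Here is a minimal sketch of how those pairs could be generated. The skipgram_pairs function, the made-up tokens and the window size of 1 are assumptions for illustration only.

```python
def skipgram_pairs(tokens, window=1):
    """Return (centre word, surrounding word) pairs within the given window."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own neighbour
                pairs.append((centre, tokens[j]))
    return pairs

# Made-up tokens, already cleaned and with stop words removed.
tokens = ["springbok", "captain", "lifted", "trophy"]
pairs = skipgram_pairs(tokens, window=1)
for centre, context in pairs:
    print(centre, "->", context)
```

Each pair is one round of the game: given the word on the left, the right answer is the word on the right.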

Initially, it’s literally just guessing, so it might be wildly off – but we know what the right answers are so we can correct it.

In fact, the machine learning training process is quite a lot like training an imaginary digital dog. When the digital dog does what you want it to do then you reward it with a ‘well done – that was a good guess’, but when the digital dog doesn’t perform well you reprimand it and say ‘no – not like that, more like this’. And over time the digital dog learns how to guess better and better. And by learning to predict which words surround other words, our machine learning model also learns which words are closely related to other words.
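
For the curious, the guess-and-correct loop could look roughly like this in PyTorch. This is a toy sketch under my own assumptions, not the actual model: the SkipGram class, the 4-word vocabulary, the 8-number vectors, the Adam optimizer, the learning rate and the plain softmax (cross-entropy) loss are all illustrative choices.

```python
import torch
from torch import nn

class SkipGram(nn.Module):
    """Toy skip-gram: look up the centre word's vector, then score every word."""
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # one vector per word
        self.out = nn.Linear(embed_dim, vocab_size)       # scores for each word

    def forward(self, centres):
        return self.out(self.embed(centres))

torch.manual_seed(0)

# Made-up (centre, surrounding) pairs, already translated to numbers.
pairs = [(0, 1), (1, 0), (1, 2), (2, 1), (2, 3), (3, 2)]
centres = torch.tensor([centre for centre, _ in pairs])
contexts = torch.tensor([context for _, context in pairs])

model = SkipGram(vocab_size=4, embed_dim=8)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

losses = []
for step in range(50):
    optimizer.zero_grad()
    guesses = model(centres)            # the dog guesses
    loss = loss_fn(guesses, contexts)   # how wrong was it?
    loss.backward()                     # work out the correction
    optimizer.step()                    # "more like this"
    losses.append(loss.item())

print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}")
```

The rows of model.embed.weight are the word vectors the model learns along the way.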

Watching the learning

In fact, it’s quite fun to watch the progress – here you can see it in action. For Epoch, think ‘round’ – so this is showing us the first of 8 rounds of the guessing game… At the end of the epoch we pick 16 random words and view the top 6 guesses for the words that surround each of them. You can see the guesses are terrible – there is very little that makes sense here:

Screenshot 2019-12-12 at 22.23.24

Loss is another word for error – so the bigger the loss the worse the guessing is. But look how much we’ve improved by round 3 – the loss has gone down a lot, so our digital dog is starting to behave better. And the word associations are looking better, for example proteas is indeed associated with cricket and related cricket terms.

Screenshot 2019-12-12 at 22.27.28
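
To get a feel for how a loss number relates to the quality of a guess, here is a small worked example. It assumes the loss is cross-entropy (a common choice for this kind of model, though the screenshots don’t say which one is used), and the probabilities are invented for illustration.

```python
import math

def cross_entropy(predicted_probs, correct_word):
    """Loss for one guess: minus the log of the probability given to the right answer."""
    return -math.log(predicted_probs[correct_word])

# Invented probabilities for the word surrounding "proteas".
early_guess = {"cricket": 0.05, "banana": 0.60, "weather": 0.35}
later_guess = {"cricket": 0.80, "banana": 0.05, "weather": 0.15}

early_loss = cross_entropy(early_guess, "cricket")  # roughly 3.0
later_loss = cross_entropy(later_guess, "cricket")  # roughly 0.22
print(round(early_loss, 2), round(later_loss, 2))
```

The more probability the model puts on the right word, the closer the loss gets to zero.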

By round 8 our digital dog is extremely well-trained: the loss is low and the word associations are classy! Our model is even starting to pick up subtleties like the relationships between staff and coaching and work.

Screenshot 2019-12-12 at 22.31.20

Our model stores these relationships as vectors (in this case a vector of 300 numbers representing each word), and we can visualize these relationships quite nicely in 3D using the TensorFlow Embedding Projector – here we can see words similar to siya kolisi, who was our SA captain in the Rugby World Cup 2019. Not a bad job, most of the words are rugby-themed, hooray :).
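
One common way to ask “which words are similar to this one?” is to compare the word vectors using cosine similarity. The sketch below assumes that measure and uses invented 3-number vectors instead of the real 300-number ones; the cosine_similarity and most_similar helpers are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def most_similar(word, embeddings, top_n=2):
    """Rank every other word by cosine similarity to the given word."""
    target = embeddings[word]
    scores = [
        (other, cosine_similarity(target, vector))
        for other, vector in embeddings.items()
        if other != word
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scores[:top_n]]

# Invented toy vectors, not taken from the trained model.
embeddings = {
    "kolisi": [0.9, 0.8, 0.1],
    "springbok": [0.8, 0.9, 0.2],
    "rugby": [0.7, 0.6, 0.0],
    "weather": [0.1, 0.0, 0.9],
}
neighbours = most_similar("kolisi", embeddings, top_n=2)
print(neighbours)
```

With these toy numbers the rugby-flavoured words come out on top and weather is left behind.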