Sho't left to data science

Getting results vs Understanding

Jul 3, 2018


Alexander Pope is famously quoted as saying:

A little learning is a dangerous thing;
drink deep, or taste not the Pierian spring:
there shallow draughts intoxicate the brain,
and drinking largely sobers us again.

I’ve been thinking about these words over the past few days as I worked on my latest challenge: a text classifier built with my fledgling scikit-learn knowledge.

Here is my objective: IT end-users email the Support Desk with their IT requests and problems. A clerk captures each email when the call is logged, and it then lies waiting for someone with slightly more expert knowledge to read and understand it so that they can determine which Support Team should deal with it. My thinking was: why not build a text classifier that can “read” the mail and decide immediately which Support Team is needed?

I’ve been dreaming of this project for a while, so the moment I reached the point in Frank Kane’s machine learning course where sklearn.naive_bayes.MultinomialNB was covered, I broke off from my studies to tackle my project!
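For anyone who wants to picture it, a first attempt along these lines could be sketched roughly as follows. This is only an illustration: the emails and team names are invented, and the choice of CountVectorizer for turning text into word counts is an assumption rather than a record of the real project.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data: (email text, support team). Invented for illustration.
emails = [
    ("my laptop will not switch on", "hardware"),
    ("the printer is jammed again", "hardware"),
    ("my monitor keeps flickering", "hardware"),
    ("keyboard keys are stuck", "hardware"),
    ("i forgot my password", "accounts"),
    ("please reset my password", "accounts"),
    ("my account is locked out", "accounts"),
    ("i cannot log in to my account", "accounts"),
    ("the wifi keeps dropping", "network"),
    ("i cannot connect to the vpn", "network"),
    ("the internet is very slow today", "network"),
    ("no network connection in the boardroom", "network"),
]
texts = [text for text, _ in emails]
teams = [team for _, team in emails]

# Hold back a quarter of the emails so the model is scored on unseen text.
X_train, X_test, y_train, y_test = train_test_split(
    texts, teams, test_size=0.25, random_state=42, stratify=teams
)

# Turn each email into a vector of word counts (vocabulary learned from
# the training emails only).
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
X_test_counts = vectorizer.transform(X_test)

# Fit the naive Bayes classifier and score it on the held-back emails.
model = MultinomialNB()
model.fit(X_train_counts, y_train)
predicted = model.predict(X_test_counts)
accuracy = accuracy_score(y_test, predicted)
print(f"accuracy: {accuracy:.0%}")
```

With a dataset this tiny the printed accuracy means very little; the point is only the shape of the workflow: split, vectorise, fit, predict, score.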

My first attempt on my train/test split yielded a result of 65% accuracy. I was momentarily impressed until I realised that 35% of user incidents logged would go astray :). And so the quest for a higher and higher percentage started. Which brings me to my point about getting results vs understanding. I’ve learned a tremendous amount (at a high level anyway) in the process of fine-tuning my text classifier, including:

sklearn.ensemble.RandomForestClassifier
sklearn.pipeline.Pipeline
sklearn.linear_model.SGDClassifier
How more data doesn’t necessarily bring better results (I did not expect this: my classifier did better with 9 months’ worth of training data than with 12 months’?!)
How pre-cleaning the data also doesn’t necessarily bring better results (I also did not expect this: I spent a whole day learning how to do the things described in this ultimate guide – only to get a 4% drop in performance!)
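To make the first three items a little more concrete, here is a minimal sketch of how a Pipeline lets you swap one classifier for another without touching the rest of the code. The training emails, team names, the `build_pipeline` helper and the choice of TfidfVectorizer are all assumptions made for illustration, not the setup used in the real project.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def build_pipeline(classifier):
    """Hypothetical helper: chain text vectorisation and a classifier."""
    return Pipeline([("tfidf", TfidfVectorizer()), ("clf", classifier)])


# The three classifier families mentioned in the list above.
candidates = {
    "naive bayes": MultinomialNB(),
    "sgd": SGDClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

# Invented training emails and team labels.
train_texts = [
    "the printer is jammed again",
    "my laptop will not switch on",
    "my monitor keeps flickering",
    "please reset my password",
    "my account is locked out",
    "i forgot my password",
    "the wifi keeps dropping",
    "i cannot connect to the vpn",
    "the internet is very slow today",
]
train_teams = [
    "hardware", "hardware", "hardware",
    "accounts", "accounts", "accounts",
    "network", "network", "network",
]

new_email = ["the printer on the third floor is jammed"]

# Same data, same steps, different classifier each time round the loop.
predictions = {}
for name, classifier in candidates.items():
    pipeline = build_pipeline(classifier)
    pipeline.fit(train_texts, train_teams)
    predictions[name] = pipeline.predict(new_email)[0]
    print(name, "->", predictions[name])
```

The classifiers may well disagree on such a small sample, which is rather the point: the Pipeline makes it cheap to try them all, but it does not explain why one does better than another.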

With the help of my friend Google, I’ve nudged the accuracy of my predictor up to 87%, a truly triumphant moment last night. But imagine: 13% of incidents would still be getting lost in the system! And that’s when it hit me: I’ve become so obsessed with getting a better number that I’ve stopped learning and have taken to blindly “fiddling”, without understanding what I’m really doing! I do not believe that I can do any better than 87% without going back to basics and understanding how and why changing a parameter affects the outcome, what the differences really are between one classification method and another, and what is happening “under the hood”. It can feel easy to take some code from an article and bastardise it to one’s own purposes, but that isn’t what it’s about, right? So today I’m going to give up on the Holy Grail of percentages and get back to understanding. Hopefully the rest will follow… :).
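As a taste of what “under the hood” can mean, here is a stripped-down sketch of the idea behind a multinomial naive Bayes classifier in plain Python, with a smoothing parameter `alpha` playing the same role as the `alpha` argument of MultinomialNB. It is a simplification written for illustration (the `train` and `margin` functions and the example emails are invented), not scikit-learn’s actual implementation.

```python
import math
from collections import Counter, defaultdict


def train(examples, alpha=1.0):
    """Count words per team and return a function that scores new text.

    alpha is the additive smoothing amount and must be greater than zero.
    """
    word_counts = defaultdict(Counter)
    team_counts = Counter()
    vocabulary = set()
    for text, team in examples:
        words = text.lower().split()
        word_counts[team].update(words)
        team_counts[team] += 1
        vocabulary.update(words)

    def log_scores(text):
        scores = {}
        for team in team_counts:
            total_words = sum(word_counts[team].values())
            # Start from the log of how common this team is overall.
            score = math.log(team_counts[team] / len(examples))
            for word in text.lower().split():
                if word not in vocabulary:
                    continue  # skip words never seen during training
                # Smoothed estimate of P(word | team): alpha keeps a word
                # that never appeared for this team from scoring zero.
                smoothed = (word_counts[team][word] + alpha) / (
                    total_words + alpha * len(vocabulary)
                )
                score += math.log(smoothed)
            scores[team] = score
        return scores

    return log_scores


# Invented training emails.
examples = [
    ("printer jammed again", "hardware"),
    ("printer out of toner", "hardware"),
    ("password expired again", "accounts"),
    ("reset my password", "accounts"),
]


def margin(alpha, text="my printer is jammed"):
    """How far 'hardware' leads 'accounts' (in log terms) for a given alpha."""
    scores = train(examples, alpha=alpha)(text)
    return scores["hardware"] - scores["accounts"]


scores = train(examples)("my printer is jammed")
best_team = max(scores, key=scores.get)
print(best_team, margin(1.0), margin(0.01))
```

In this toy example a smaller `alpha` widens the gap between the two teams, because words a team has never used are punished more heavily. Seeing that in twenty-odd lines is exactly the kind of understanding that blind fiddling with the parameter never gave me.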
