Typically we have two sets of values and we want to find out whether they are related and, if so, how and by how much. Could height be indicative of weight? Could hours of practice be related to how many errors are made on a math test?

Covariance is a start: if this number is positive, then as one variable increases the other tends to increase too (e.g. heights and weights); if it’s negative, then as one variable increases the other tends to decrease (e.g. practice hours and math test errors, hopefully!). The problem with covariance, though, is that it isn’t normalized: its size depends on the units of both variables, and how often will we find two sets of values with the same unit of measure? So if we’re comparing heights and weights and we get a covariance of, say, 7, how should we evaluate it? The number is expressed in mixed units (height units times weight units), so it would change if we measured height in meters instead of centimeters, and on its own it can’t tell us whether the relationship is strong or weak.
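To make the units problem concrete, here is a minimal sketch of the sample covariance calculation in plain Python. The function name `covariance` and the height and weight figures are made up for illustration (they are not taken from the linked sample code), and the sketch assumes the sample version of the formula, which divides by n - 1.

```python
from statistics import mean


def covariance(xs, ys):
    """Sample covariance: sum of the products of deviations, divided by n - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    mean_x, mean_y = mean(xs), mean(ys)
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# Hypothetical data, invented purely for illustration.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 84]

# The same heights expressed in meters instead of centimeters.
heights_m = [h / 100 for h in heights_cm]

cov_cm = covariance(heights_cm, weights_kg)
cov_m = covariance(heights_m, weights_kg)

print("covariance with heights in cm:", round(cov_cm, 3))
print("covariance with heights in m: ", round(cov_m, 3))
```

Both results are positive, so the direction of the relationship is the same, but the second is 100 times smaller than the first simply because the unit changed. That is why the raw number is hard to interpret.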

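The correlation coefficient is a much more useful indicator: it normalizes the covariance by dividing it by the product of the two standard deviations, which gives you a unitless number between -1 and 1. A value of -1 is a perfect negative linear correlation (our math test example should come out negative, though rarely perfectly so), 1 is a perfect positive linear correlation (our height and weight example should come out positive), and 0 means no linear correlation at all.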
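As a rough sketch of that normalization, the function below computes the Pearson correlation coefficient in plain Python. The name `pearson_r` and both data sets are hypothetical, chosen only to mirror the two examples in this post.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    mean_x, mean_y = mean(xs), mean(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_xx = sum(a * a for a in dx)
    sum_yy = sum(b * b for b in dy)
    if sum_xx == 0 or sum_yy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    # The (n - 1) factors of the covariance and the standard deviations cancel out.
    return sum_xy / sqrt(sum_xx * sum_yy)


# Hypothetical data, invented purely for illustration.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 84]
practice_hours = [1, 2, 3, 4, 5]
test_errors = [9, 8, 6, 3, 2]

r_height_weight = pearson_r(heights_cm, weights_kg)
r_practice_errors = pearson_r(practice_hours, test_errors)

print("height vs weight:  ", round(r_height_weight, 3))
print("practice vs errors:", round(r_practice_errors, 3))
```

With this made-up data the first value lands close to 1 and the second close to -1. Converting the heights to meters would leave the first value unchanged, which is exactly what the normalization buys us.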

Having first established, by finding the correlation coefficient, that there is a reasonably strong linear relationship between the two sets of values, you’d want to find the equation of the line that best fits and describes that relationship. Why? Because then, given the height of any other random future person, we could predict their probable weight, and vice versa. The process of finding the equation for this line is what we call linear regression!
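A minimal sketch of finding that line, assuming the usual least-squares approach for a single predictor: the slope is the sum of the products of deviations divided by the sum of squared deviations of x, and the intercept makes the line pass through the two means. The names `fit_line` and `predict` and the data are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    mean_x, mean_y = mean(xs), mean(ys)
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    if sum_xx == 0:
        raise ValueError("cannot fit a line when every x value is the same")
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sum_xy / sum_xx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return slope, intercept


def predict(x, slope, intercept):
    """Read the predicted y value off the fitted line."""
    return slope * x + intercept


# Hypothetical data, invented purely for illustration.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [52, 58, 69, 77, 84]

slope, intercept = fit_line(heights_cm, weights_kg)
predicted_weight = predict(175, slope, intercept)

print("slope:", round(slope, 3), "intercept:", round(intercept, 3))
print("predicted weight for 175 cm:", round(predicted_weight, 2), "kg")
```

Note that this line is fitted to predict weight from height. To predict height from weight, it is usually better to fit a second line with the two lists swapped than to invert this one.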

A picture is worth a thousand words, so take a look at the sample Python code, with a more in-depth explanation of key concepts and calculations, in How it works – Covariance, Correlation & Linear Regression on GitHub.