Calculus is a big topic, but by and large, there are quite specific aspects of calculus that come into machine learning and in particular deep learning algorithms. This article is not intended to explain how and why things are as they are; rather it’s my own personal cheat sheet for when I need to remember enough calculus to understand what’s going on ‘under the hood’ of an algorithm. I didn’t grow up doing calculus, and it’s not something I do every day or even every month so I need this aid to memory!
With thanks to the wonderful explanations in Calculus Made Easy by Silvanus Phillips Thompson (also the source of the diagram above) – money well spent if you’re looking for something a little more in-depth on this topic.
Differential Calculus cuts a thing into small pieces to find how it changes.
You’ll see various notations containing either d or ∂.
For example: dx (derivative) or ∂x (partial derivative)
And these both just mean ‘a little bit of’ x.
We speak of finding the ‘derivative’ of x.
Integral Calculus joins (integrates) the small pieces together to find how much there is. [from Maths is fun]
You’ll see a long s-shaped symbol like ∫.
And this just means ‘the sum of’.
We speak of finding the ‘integral’ of x.
So if you see ∫dx it would just mean ‘the sum of all the little bits of x’ which would give you x!
With calculus, we are always going to be concerned with some function, for example, let’s consider a simple function like y = 2x + 1:
The value of y depends on the value of x: a change in the value of x brings about a change in the value of y. So in this scenario we would say that y is the dependent variable and x is the independent variable. You’ll also see functions expressed in different ways: y = 2x + 1 is known as an explicit function because the variable y is isolated. x = (y – 1) / 2 would also be an explicit function, as the variable x is isolated. However, if we re-worked it to y – 2x = 1, we’d have an implicit function because the variables are all mixed in with one another.
In differential calculus, we are always hunting for some sort of ratio: the proportion of dy to dx when both are very small. That is, the relationship between the 2 variables of a function: how much does y change when you change x? In other words, we are looking for the rate of change or, put another way, the slope. By looking at our graph we can see that the rate of change is 2: each time x is increased by 1, y increases by 2. Differentiation is going to help us find this value in much more complex scenarios.
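We can sanity-check that slope of 2 numerically. This is a minimal sketch (not from the original article) using a central difference, which approximates the rate of change by nudging x a tiny bit in each direction:

```python
# Numerically estimate the slope of y = 2x + 1 with a central difference.
def slope(f, x, h=1e-6):
    # (f(x + h) - f(x - h)) / 2h approximates the rate of change at x
    return (f(x + h) - f(x - h)) / (2 * h)

line = lambda x: 2 * x + 1

# The slope of this line is 2 no matter where we measure it.
assert abs(slope(line, 0.0) - 2) < 1e-6
assert abs(slope(line, 10.0) - 2) < 1e-6
```

Because the function is a straight line, the estimate is the same everywhere; for curves, the slope will differ from point to point.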
Let’s say we have a function y = f(x)… You’ll see its derivative written as dy/dx, df/dx, f′(x) or y′. Can you believe? All of these things are the same, so don’t be thrown off by the different notation conventions!
They all boil down to ‘the derivative of y with respect to x‘, i.e. if x changes a little bit, how much will y change?
The derivative of a constant is always 0, which makes sense if we think about a function like y = 9: if we draw this it will be a flat line – there is no x to affect y! For example:
Multiply by the value of the exponent (here 3) and subtract 1 from the exponent:
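The power rule can be checked numerically. A minimal sketch (my own, not from the article) comparing a central-difference estimate for x³ against the power-rule answer 3x²:

```python
# Check the power rule: d/dx of x^3 should be 3x^2.
def derivative(f, x, h=1e-5):
    # central-difference approximation of the derivative at x
    return (f(x + h) - f(x - h)) / (2 * h)

cube = lambda x: x ** 3
power_rule = lambda x: 3 * x ** 2   # multiply by the exponent, subtract 1

for x in [0.5, 1.0, 2.0]:
    assert abs(derivative(cube, x) - power_rule(x)) < 1e-6
```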
Isolate the constant (here 5), take the derivative of what’s left, and then resolve:
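A quick numerical sketch of this constant-multiple rule (an illustrative example of mine, using 5x³): the constant just rides along, so the derivative is 5 × 3x²:

```python
# Check the constant-multiple rule: d/dx of 5x^3 should be 5 * (3x^2).
def derivative(f, x, h=1e-5):
    # central-difference approximation of the derivative at x
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: 5 * x ** 3            # constant 5 times x^3
df = lambda x: 5 * (3 * x ** 2)     # isolate the 5, differentiate x^3, multiply back

assert abs(derivative(f, 2.0) - df(2.0)) < 1e-6
```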
If you need to find the derivative of the sum of 2 functions, you find the derivative of each and then add:
Here is an example:
Take the derivative of the first function – treating the second as a constant – then do the reverse and take the derivative of the second function – treating the first as a constant – and then add:
Here is a silly little example (silly because this is effectively x³):
It all simplifies down to 3x², which gives us the same result as if we’d used the power rule on x³!
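That silly example can be checked numerically too. A minimal sketch of the product rule applied to x · x²: f′(x)g(x) + f(x)g′(x) = 1·x² + x·2x = 3x²:

```python
# Check the product rule on x * x^2, which is effectively x^3.
def derivative(f, x, h=1e-5):
    # central-difference approximation of the derivative at x
    return (f(x + h) - f(x - h)) / (2 * h)

f = lambda x: x             # first function
g = lambda x: x ** 2        # second function
product = lambda x: f(x) * g(x)   # x * x^2 = x^3

x = 2.0
# product rule: f'(x)*g(x) + f(x)*g'(x) = 1*x^2 + x*2x = 3x^2
assert abs(derivative(product, x) - 3 * x ** 2) < 1e-6
```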
This is used when we have a function within a function, or put another way: the result of the inner function is then acted upon by the outer function. Maths is fun explains this very well. Let’s say we have 2 functions involving x:
A function of a function may be written in 2 ways:
In both cases it translates to ‘first perform function f on x, then perform function g on the result’ – in other words, start inside and work your way out! Using our values for f(x) and g(x) as defined above the final result for g(f(x)) will be:
So the chain rule just says ‘take the derivative of the outside function and multiply by the derivative of the inside function’, in other words:
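A numerical sketch of the chain rule, using two illustrative functions of my own choosing (not the ones from the original example): inner f(x) = x² + 1 and outer g(u) = u³, so the chain rule gives 3(x² + 1)² · 2x:

```python
# Check the chain rule on g(f(x)) with f(x) = x^2 + 1 and g(u) = u^3.
def derivative(f, x, h=1e-5):
    # central-difference approximation of the derivative at x
    return (f(x + h) - f(x - h)) / (2 * h)

inner = lambda x: x ** 2 + 1        # illustrative inner function f(x)
outer = lambda u: u ** 3            # illustrative outer function g(u)
composed = lambda x: outer(inner(x))

x = 1.5
# chain rule: g'(f(x)) * f'(x) = 3*(x^2 + 1)^2 * 2x
chain = 3 * (x ** 2 + 1) ** 2 * (2 * x)
assert abs(derivative(composed, x) - chain) < 1e-6
```

Note how the derivative of the outer function is evaluated at the result of the inner function before multiplying: start inside and work your way out.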
Maths is fun has a nice summary of some additional popular rules.
So far we’ve just been dealing with simple 1-variable examples, but life is not like that! Normally we may have several variables in play, for example this function:
So here we may want to find the partial derivative with respect to x (∂/∂x) AND the partial derivative with respect to y (∂/∂y). To do this, we treat y as a constant and find the derivative with respect to x, and then we treat x as a constant and find the derivative with respect to y.
For multiple variables we do exactly the same: hold all variables as constants except the one we are finding the partial derivative for.
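A minimal numerical sketch of partial derivatives, using an illustrative 2-variable function of my own, f(x, y) = x²y + y³ (not the function from the article). Each partial nudges only one variable while holding the other fixed:

```python
# Partial derivatives of f(x, y) = x^2 * y + y^3, checked numerically.
def partial_x(f, x, y, h=1e-5):
    # hold y constant, vary only x
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

def partial_y(f, x, y, h=1e-5):
    # hold x constant, vary only y
    return (f(x, y + h) - f(x, y - h)) / (2 * h)

f = lambda x, y: x ** 2 * y + y ** 3   # illustrative function

x, y = 2.0, 3.0
assert abs(partial_x(f, x, y) - 2 * x * y) < 1e-6             # df/dx = 2xy
assert abs(partial_y(f, x, y) - (x ** 2 + 3 * y ** 2)) < 1e-6 # df/dy = x^2 + 3y^2
```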
Integrals are also known as anti-derivatives, because integration effectively reverses differentiation. Remember, with differentiation we were cutting the whole up into little bits; with integration we are taking the little bits and assembling the whole!
Because we are reversing, we do the opposite of what we do in the power rule: add one to the exponent and then divide by that value (+ C, see next rule!):
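We can check this anti-power rule numerically by assembling the little bits ourselves. A sketch of mine using a midpoint Riemann sum: the area under x² from 0 to 2 should match x³/3 evaluated at the endpoints:

```python
# Check the anti-power rule: the integral of x^2 is x^3 / 3 (+ C).
def integrate(f, a, b, n=100_000):
    # midpoint Riemann sum: add up all the little bits f(x) * dx
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

f = lambda x: x ** 2
antideriv = lambda x: x ** 3 / 3    # add 1 to the exponent, divide by it

area = integrate(f, 0.0, 2.0)
assert abs(area - (antideriv(2.0) - antideriv(0.0))) < 1e-6
```

The constant C disappears here because a definite integral takes the difference of the anti-derivative at the two endpoints, so C cancels out.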
The constant of integration rule
The first rule we saw above for derivatives was that all constants become 0. In the process of reversing therefore, we can’t know whether the original had a constant or not so we add C as a placeholder for whatever constant may or may not have been there.
Lone constants rule
We re-instate our imaginary x⁰ (which is just 1, so it doesn’t change anything) and then proceed as usual with the anti-power and constant of integration rules:
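A quick numerical check of the lone-constants rule, again a sketch of mine: integrating the constant 5 over [0, 3] should give 5x evaluated at the endpoints, i.e. 15:

```python
# Check the lone-constants rule: the integral of 5 is 5x (+ C).
def integrate(f, a, b, n=100_000):
    # midpoint Riemann sum: add up all the little bits f(x) * dx
    dx = (b - a) / n
    return sum(f(a + (i + 0.5) * dx) for i in range(n)) * dx

c = 5
area = integrate(lambda x: c, 0.0, 3.0)
assert abs(area - c * 3.0) < 1e-9   # 5x evaluated from 0 to 3 is 15
```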