In this chapter we discuss the math behind deep learning. This topic is quite advanced and not necessarily required for practitioners. However, it is recommended reading if you are interested in understanding what is going on under the hood when you play with neural networks. We start with an historical introduction, and then we will review the high school concept of derivatives and gradients. We will also introduce the gradient descent and backpropagation algorithms commonly used to optimize deep learning networks.
The basics of continuous backpropagation were proposed by Henry J. Kelley  in 1960 using dynamic programming. Stuart Dreyfus proposed using the chain rule in 1962 . Paul Werbos was the first proposing to use backpropagation for neural nets in his 1974 PhD Thesis . However, it was only in 1986 that backpropagation gained success with the work of David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams published in Nature . Only in 1987, Yan LeCun described the modern version of backpropagation currently used for training neural networks .
The basic intuition of SGD was introduced by Robbins and Monro in 1951 in a context different from neural networks . Only in 2012 – or 52 years after the first time backpropagation was first introduced – AlexNet  achieved a top-5 error of 15.3% in the ImageNet 2012 Challenge using GPUs. According to The Economist , "Suddenly people started to pay attention, not just within the AI community but across the technology industry as a whole." Innovation in this field was not something that happened overnight. Instead it was a long walk lasting more than 50 years!
Before introducing backpropagation, we need to review some mathematical tools from calculus. Don't worry too much; we'll briefly review a few areas, all of which are commonly covered in high school-level mathematics.
Derivatives are a powerful mathematical tool. We are going to use derivatives and gradients for optimizing our network. Let's look at the definition. The derivative of a function y = f(x) of a variable x is a measure of the rate at which the value y of the function changes with respect to the change of the variable x. If x and y are real numbers, and if the graph of f is plotted against x, the derivative is the "slope" of this graph at each point.
If the function is linear, y = f(x) = ax + b, the slope is . This is a simple result of calculus that can be derived by considering that:
In Figure 1 we show the geometrical meaning of , and the angle between the linear function and the x-cartesian axis:
Figure 1: An example of linear function and rate of change
If the function is not linear then the intuition is to compute the rate of change as the mathematical limit value of the ratio of the differences as becomes infinitely small. Geometrically, this is the tangent line at (x, y = f(x)) as shown in Figure 2:
Figure 2: Rate of change for and tangential line as
Figure 3: and
Gradient is a generalization of the derivative for multiple variables. Note that the derivative of a function of a single variable is a scalar-valued function, whereas the gradient of a function of several variables is a vector-valued function. The gradient is denoted with an upside-down delta , and called del. or nabla from the Greek alphabet. This makes sense as delta indicates change in one variable, and the gradient is the change in all variables.
Suppose (for example, the space of real numbers with m dimensions) and f maps from to ; the gradient is defined as follows:
In math, a partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant.
- Points in the direction of greatest increase of a function.
- Is 0 at a local maximum or local minimum. This is because if it is 0, it cannot increase or decrease further.
The proof is left as an exercise to the interested reader. (Hint: consider Figure 2 and 3.)
If the gradient points in the direction of greatest increase for a function, then it is possible to move towards a local minimum for the function by simply moving in a direction opposite to the gradient. That's the key observation used for gradient descent algorithms, which will be used shortly. An example is provided in Figure 4:
Figure 4: Gradient descent for a function in three variables
This can be generalized beyond the scalar case. Suppose and with g which maps from to , and f, which maps from to , and with y = g(x) and z = f(y); then we have:
The generalized chain rule using partial derivatives will be used as a basic tool for the backpropagation algorithm when dealing with functions in multiple variables. Stop for a second and make sure that you fully understand it.
- Constant differentiation: c' = 0, with c constant
- Differentiation variable: , when deriving the differentiation variable
- Linear differentiation:
- Reciprocal differentiation:
- Exponential differentiation:
There are many books about matrix calculus. Here we focus only on only a few basic operations used for neural networks. Let us recall that a matrix m × n can be used to represent the weights wij with , associated with the arcs between two adjacent layers.
Note that by adjusting the weights we can control the "behavior" of the network and that a small change in a specific wij will be propagated through the network following its topology (see Figure 5, where the edges in bold are the ones impacted by the small change in a specific wij):
Figure 5: Propagating wij changes through the network via the edges in bold
Now that we have reviewed some basic concepts of calculus let's start applying them to deep learning. The first question is how to optimize activation functions. Well, I am pretty sure that you are thinking about computing the derivative, so let's do it!
In Chapter 1, Neural Network Foundations with TensorFlow 2.0, we have seen a few activation functions including sigmoid, tanh, and ReLU. In the following section we compute the derivative of these activation functions.
Figure 6: Sigmoid activation function
Therefore the derivative of can be computesd as a very simple form .
Figure 7: Tanh activation function
If you remember that and , then the derivative is computed as:
Therefore the derivative of tanh(z) can be computed as a very simple form: .
Note that ReLU is non-differentiable at zero. However, it is differentiable anywhere else, and the value of the derivative at zero can be arbitrarily chosen to be a 0 or 1, as demonstrated in Figure 8:
Figure 8: ReLU activation function
Now that we have computed the derivative of the activation functions, we can describe the backpropagation algorithm – the mathematical core of deep learning. Sometimes, backpropagation is called backprop for short.
Remember that a neural network can have multiple hidden layers, as well as one input layer and one output layer.
In addition to that, recall from Chapter 1, Neural Network Foundations with TensorFlow 2.0, that backpropagation can be described as a way of progressively correcting mistakes as soon as they are detected. In order to reduce the errors made by a neural network, we must train the network. The training needs a dataset including input values and the corresponding true output value. We want to use the network for predicting the output as close as possible to the true output value. The key intuition of the backpropagation algorithm is to update the weights of the connections based on the measured error at the output neuron(s). In the remainder of this section, we will explain how to formalize this intuition.
When backpropagation starts, all the weights have some random assignment. Then the net is activated for each input in the training set: values are propagated forward from the input stage through the hidden stages to the output stage where a prediction is made (note that we keep the following figure simple by only representing a few values with green dotted lines, but in reality all the values are propagated forward through the network):
Figure 9: Forward step in backpropagation
Since we know the true observed value in the training set, it is possible to calculate the error made in prediction.
The easiest way to think about backtracking is to propagate the error back (see Figure 10), using an appropriate optimizer algorithm such as a gradient descent to adjust the neural network weights with the goal of reducing the error (again for the sake of simplicity only a few error values are represented here):
Figure 10: Backward step in backpropagation
The process of forward propagation from input to output and backward propagation of errors is repeated several times until the error goes below a predefined threshold. The whole process is represented in Figure 11. A set of features is selected as input to a machine learning model, which produces predictions. The predictions are compared with the (true) label and the resulting loss function is minimized by the optimizer, which updates the weights of the model:
Figure 11: Forward propagation and backward propagation
Let's see in detail how the forward and backward steps are realized. It might be useful to have a look back at Figure 5 and recall that a small change in a specific wij will be propagated through the network following its topology (see Figure 5, where the edges in bold are the ones impacted by the small change in a specific weight).
During the forward steps the inputs are multiplied with the weights and then they all are summed together. Then the activation function is applied (see Figure 12). This step is repeated for each layer, one after another. The first layer takes the input features as input and it produces its output. Then, each subsequent layer takes as input the output of the previous layer:
Figure 12: Forward propagation
If we look at one single layer, mathematically we have two equations:
- The transfer equation: , where xi are the input values, wi are the weights, and b is the bias. In vector notation z = WTX. Note that b can be absorbed in the summatory by setting and x0 = 1.
- The activation function: , where is the chosen activation function.
An artificial neural network consists of an input layer I, an output layer O and any number of hidden layers Hi situated between the input and the output layers. For the sake of simplicity let's assume that there is only one hidden layer, since the results can be easily generalized.
As shown in Figure 12, The features xi from the input layer are multiplied by a set of fully-connected weights wij connecting the input layer to the hidden layer (see the left side of Figure 12). The weighted signals are summed together with the bias to calculate the result (see the center of Figure 12).
The result is passed through the activation function which leaves the hidden layer to the output layer (see the right side of Figure 12).
Summarizing, during the forward step we need to run the following operations:
- For each neuron in a layer, multiply each input by its corresponding weight.
- Then for each neuron in the layer, sum all input x weights together.
- Finally, for each neuron, apply the activation function on the result to compute the new output.
At the end of the forward step, we get a predicted vector yo from the output layer o given the input vector x presented at the input layer. Now the question is: how close is the predicted vector yo to the true value vector t?
That's where the backstep comes in.
For understanding how close the predicted vector yo is to the true value vector t we need a function that measures the error at the output layer o. That is the loss function defined earlier in the book. There are many choices for the loss function. For instance, we can define the Mean Squared Error (MSE) as follows:
Note that E is a quadratic function and, therefore, the difference is quadratically larger when t is far away from yo, and the sign is not important. Note that this quadratic error (loss) function is not the only one that we can use. Later in this chapter we will see how to deal with cross-entropy.
Now, remember that the key point is that during the training we want to adjust the weights of the network in order to minimize the final error. As discussed, we can move towards a local minimum by moving in the opposite direction to the gradient . Moving in the opposite direction to the gradient is the reason why this algorithm is called gradient descent. Therefore, it is reasonable to define the equation for updating the weight wij as follows:
For a function in multiple variables, the gradient is computed using partial derivatives. We introduce the hyperparameter – or, in ML lingo, the learning rate – to account for how large a step should be taken in the direction opposite to the gradient.
Considering the error, E, we have the equation:
The preceding equation is simply capturing the fact that a slight change will impact the final error, as seen in Figure 13:
Figure 13: A small change in wij will impact the final error, E
- zj is the input to node j for layer l
- is the activation function for node j in layer l (applied to zj)
- is the output of the activation of node j in layer l
- wij is the matrix of weights connecting the neuron i in layer l – 1 to the neuron j in layer l
- bj is the bias for unit j in layer l
- to is the target value for node o in the output layer
Now we need to compute the partial derivative for the error at the output layer when the weights change by . There are two different cases:
Case 1: Weight update equation for a neuron from hidden (or input) layer to output layer.
Case 2: weight update equation for a neuron from hidden (or input) layer to hidden layer.
We'll begin with case 1.
Here the summation disappears because when we take the partial derivative with respect to the j-th dimension, the only term not zero in the error is the j-th. Considering that differentiation is a linear operation and that – because the true t0 value does not depend on wjo – we have:
Applying the chain rule again and remembering that , we have:
Remembering that we have:
Again because when we take the partial derivative with respect to the j-th dimension, the only term not zero in the error is the j-th. By definition , so putting everything together we have:
The gradient of the error E with respect to the weights wj from the hidden layer j to the output layer o is therefore simply the product of three terms: the difference between the prediction yo and the true value to, the derivative of the output layer activation function, and the activation output yj of node j in the hidden layer. For simplicity we can also define and get:
In short, for case 1 the weight update equation for each of the hidden-output connections is:
If we want to explicitly compute the gradient with respect to the output layer biases, the steps to follow are similar to the one above with only one difference.
So in this case .
Next, we'll look at case 2.
In this case, we need to consider the equation for a neuron from a hidden layer (or the input layer) to a hidden layer. Figure 13 showed that there is an indirect relationship between the hidden layer weight change and the output error. This makes the computation of the gradient a bit more challenging. In this case, we need to consider the equation for a neuron from hidden layer i to hidden layer j. Applying the definition of E and differentiating we have:
In this case the sum will not disappear because the change of weights in the hidden layer is directly affecting the output. Substituting and applying the chain rule, we have:
The indirect relation between zo and the internal weights wij (Figure 13) is mathematically expressed by the expansion:
Remembering the definition we get:
This last substitution with vo is particularly interesting because it back-propagates the signal vo computed in the subsequent layer. The rate of change with respect to the rate of change of the weights wij is therefore the multiplication of four factors: the output activations yi from the layer below, the derivative of hidden layer activation function and the sum of the back-propagated signal vo previously computed in the subsequent layer weighted by . We can use this idea of backpropagating the error signal by defining and therefore . This suggests that in order to calculate the gradients at any layer l in a deep neural network, we can simply multiply the back propagated error signal vj and multiply it by the feed-forward signal yl-1 arriving at the layer l. Note that the math is a bit complex but the result is very simple indeed! The intuition is given in Figure 14. Given a function z = f(x, y) computed locally to a neuron with the inputs x and y, the gradients are backpropagated. Then, they combine via chain rule with the local gradients and for further backpropagation.
Here, L denotes the error from the generic previous layer:
Figure 14: An example of the math behind backpropagation
If we want to explicitly compute the gradient with respect to the output layer biases, it can be proven that . We leave this as an exercise for the reader.
We have arrived at the end of this section and all the mathematical tools are defined to make our final statement. The essence of backstep is nothing other than applying the weight update rules one layer after another, starting from the last output layer and moving back toward the first input layer. Difficult to derive, to be sure, but extremely easy to apply once defined. The whole forward-backward algorithm at the core of deep learning is then the following:
- Compute the feed-forward signals from the input to the output.
- Compute the output error E based on the predictions yo and the true value to.
- Backpropagate the error signals; multiply them with the weights in previous layers and with the gradients of the associated activation functions.
- Compute the gradients for all of the parameters based on the back-propagated error signal and the feed-forward signals from the inputs.
- Update the parameters using the computed gradients .
Note that the above algorithm will work for any choice of differentiable error function E and for any choice of differentiable activation function. The only requirement is that both must be differentiable.
Gradient descent with backpropagation is not guaranteed to find the global minimum of the loss function, but only a local minimum. However, this is not necessarily a problem observed in practical application.
Where c refers to one-hot encoded classes (or labels) whereas p refers to softmax-applied probabilities. Since cross entropy is applied to softmax-applied probabilities and to one-hot-encoded classes, we need to take into account the chain rule for computing the gradient with respect to the final weights, scorei. Mathematically, we have:
Computing each part separately, let's start from :
(Note that for a fixed all the terms in the sum are constant except the chosen one.)
Therefore, we have:
(Applying the partial derivative to the sum and considering that .)
Therefore, we have:
Now let's compute the other part where pi is the softmax function defined as.
The derivative is:
Combining the results, we have:
If we generalize the preceding discussion, then we can state that the problem of optimizing a neural network consists of adjusting the weights w of the network in such a way that the loss function is minimized. Conveniently, we can think about the loss function in the form of a sum, as in this form it's indeed representing all the loss functions commonly used:
In many cases, evaluating the above gradient might require expensive evaluation of the gradients from all summand functions. When the training set is very large, this can be extremely expensive. If we have three million samples, we have to loop through three million times or use the dot product. That's a lot! How can we simplify this? There are three types of gradient descent, each different in the way they handle the training dataset:
Batch gradient descent computes the change of error, but updates the whole model only once the entire dataset has been evaluated. Computationally it is very efficient, but it requires that the results for the whole dataset be held in the memory.
Instead of updating the model once the dataset has been evaluated, it does so after every single training example. The key idea is very simple: SGD samples a subset of summand functions at every step.
This is the method that is very frequently used in deep learning. MBGD (or mini-batch) combines BGD and the SGD in one single heuristic. The dataset is divided into small batches of about size bs, which is generally from 64 to 256. Then each of the batches are evaluated separately.
Note that bs is another hyperparameter to fine tune during training. MBGD lies between the extremes of BGD and SGD – by adjusting the batch size and the learning rate parameters, we sometimes find a solution that descends closer to the global minimum than can be achieved by either of the extremes.
In contrast with gradient descent, where the cost function minimized more smoothly, the mini-batch gradient has a bit more of a noisy and bumpy descent, but the cost function still trends downhill. The reason for the noise is that mini-batches are a sample of all the examples and this sampling can cause the loss function to oscillate.
In this section we want to give an intuition behind backpropagation and convnets. For the sake of simplicity we will focus on an example of convolution with input X of size 3 × 3, one single filter W of size 2 × 2 with no padding, stride 1, and no dilation (see Chapter 5, Advanced Convolutional Neural Networks). The generalization is left as an exercise.
The standard convolution operation is represented in Figure 15. Simply put, the convolutional operation is the forward pass:
Figure 15: Forward pass for our convnet toy example
Following the intuition of Figure 15, we can now focus our attention to the backward pass for the current layer. The key assumption is that we receive a backpropagated signal as input, and we need to compute and . This computation is left as an exercise but please note that each weight in the filter contributes to each pixel in the output map or, in other words, any change in a weight of a filter affects all the output pixels.
As you remember from Chapter 8, Recurrent Neural Networks, the basic equation for an RNN is , the final prediction is at step t, the correct value is yt, and the error E is the cross-entropy. Here U, V, W are learning parameters used for the RNNs' equations. These equations can be visualized as in Figure 16 where we unroll the recurrency. The core idea is that total error is just the sum of the errors at each time step.
If we used SGD, we need to sum the errors and the gradients at each timestep for one given training example:
Figure 16: Recurrent neural network unrolled with equations
We are not going to write all the tedious math behind all the gradients, but rather focus only on a few peculiar cases. For instance, with math computations similar to the one made in the previous chapters, it can be proven by using the chain rule that the gradient for V depends only on the value at the current timestep s3, y3 and :
However, has dependencies carried across timesteps because for instance depends on s2 which depends on W2 and s1. As a consequence, the gradient is a bit more complicated because we need to sum up the contributions of each time step:
In order to understand the preceding equation, you can think that we are using the standard backpropagation algorithm used for traditional feed-forward neural networks, but for RNNs we need to additionally add the gradients of W across timesteps. That's because we can effectively make the dependencies across time explicit by unrolling the RNN. This is the reason why backpropagation for RNNs is frequently called Backpropagation Through Time (BTT). The intuition is shown in Figure 17 where the backpropagated signals are represented:
Figure 17: RNN equations and back propagated signals
Then we notice that should be again computed with the chain rule producing a number of multiplications. In this case, we take the derivative of a vector function with respect to a vector, so we need a matrix whose elements are all the pointwise derivatives (in math, this matrix is called a Jacobian). Mathematically, it can be proven that:
Therefore, we have:
The multiplication in the preceding equation is particularly problematic since both the sigmoid and tanh get saturated at both ends and their derivative goes to 0. When this happens, they drive other gradients in previous layers towards 0. This makes the gradient vanish completely after a few time steps and the network stops learning from "far away."
Chapter 8, Recurrent Neural Networks, discussed how to use LSTMs and GRUs to deal with the problem of vanishing gradients and efficiently learn long-range dependencies. In a similar way, the gradient can explode when one single term in the multiplication of the Jacobian matrix becomes large. Chapter 8, Recurrent Neural Networks, discussed how to use gradient clipping to deal with this problem.
We now come to the conclusion of this journey, and you should now better understand how backpropagation works and how it is applied in neural networks for dense networks, CNNs, and RNNs. In the next section, we will discuss how TensorFlow computes gradients, and why this is useful for backpropagation.
TensorFlow can automatically calculate derivatives, a feature called Automatic Differentiation. This is achieved by using the chain rule. Every node in the computational graph (see Chapter 2, TensorFlow 1.x and 2.x) has an attached gradient operation for calculating the derivatives of input with respect to output. After that, the gradients with respect to parameters are automatically computed during backpropagation.
Automatic differentiation is a very important feature because you do not need to handcode new variations of backpropagation for each new model of a neural network. This allows quick iteration and running many experiments faster.
In this chapter we discussed the math behind deep learning. Put simply, a deep learning model computes a function given an input vector to produce the output. The interesting part is that we can literally have billions of parameters (weights) to be tuned. Backpropagation is a core mathematical algorithm used by deep learning for efficiently training artificial neural networks following a gradient descent approach that exploits the chain rule. The algorithm is based on two steps repeated alternatively: the forward step and the backstep.
During the forward step inputs are propagated through the network in order to predict outputs. These predictions might be different from the true values given to assess the quality of the network. In other words, there is an error and our goal is to minimize it. This is where the backstep plays a role, by adjusting the weights of the network to minimize the error.
The error is computed via loss functions such as MSE, or cross-entropy for noncontinuous values such as Booleans (Chapter 1, Neural Network Foundations with TensorFlow 2.0). A gradient-descent-optimization algorithm is used to adjust the weight of neurons by calculating the gradient of the loss function. Backpropagation computes the gradient, and gradient descent uses the gradients for training the model. Reduction in the error rate of predictions increases accuracy, allowing machine learning models to improve. Stochastic gradient descent is the simplest thing you could possibly do by taking one step in the direction of the gradient. This chapter does not cover the math behind other optimizers such as Adam and RMSProp (chapter 1). However, they involve using the first and the second moments of the gradients. The first moment involves the exponentially decaying average of the previous gradients and the second moment involves the exponentially decaying average of the previous squared gradients.
There are three big properties of your data that justify using deep learning, otherwise you might just use regular machine learning: (1) very-high-dimensional input (text, images, audio signals, videos, and temporal series are frequently a good example), (2) dealing with complex decision surfaces that cannot be approximated with a low-order polynomial function, and (3) having a large amount of training data available.
Deep learning models can be thought of as a computational graph made up by stacking together several basic components such as Dense (chapter 1), CNNs (chapters 4 and 5), Embeddings (chapter 6), RNNs(chapter 7), GANs (chapter 8), Autoencoders (chapter 9) and, sometimes, adopting shortcut connections such as "peephole", "skip", and "residual" because they help data flow a bit more smoothly (chapter 5 and 7). Each node in the graph takes tensors as input and produces tensors as output. As discussed, training happens by adjusting the weights in each node with backpropagation, where the key intuition is to reduce the error in the final output node(s) via gradient descent. GPUs and TPUs (chapter 16) can significantly accelerate the optimization process since it is essentially based on (hundreds of) millions of matrix computations.
There are a few other mathematical tools that might be helpful to improve your learning process. Regularization (L1, L2, Lasso, from chapter 1) can significantly improve the learning by keeping weights normalized. Batch normalization (chapter 1) helps to basically keep track of the mean and the standard deviation of your dataset across multiple deep layers. The key intuition is to have data resembling a normal distribution while it flows through the computational graph. Dropout (chapters 1, 4, and 5) helps by introducing some elements of redundancy in your computation; this prevents overfitting and allows better generalization.
This chapter has presented the mathematical foundation behind the intuition. As discussed, this topic is quite advanced and not necessarily required for practitioners. However, it is recommended reading if you are interested in understanding what is going on "under the hood" when you play with neural networks.
The next chapter will introduce the Tensor Processing Unit (TPU), a special chip developed at Google for ultra-fast execution of many mathematical operations described in this chapter.
- Kelley, Henry J. (1960). Gradient theory of optimal flight paths. ARS Journal. 30 (10): 947–954. Bibcode:1960ARSJ...30.1127B. doi:10.2514/8.5282.
- Dreyfus, Stuart (1962). The numerical solution of variational problems. Journal of Mathematical Analysis and Applications. 5 (1): 30–45. doi:10.1016/0022-247x(62)90004-5.
- Werbos, P. (1974). Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences. PhD thesis, Harvard University.
- Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986-10-09). Learning representations by back-propagating errors. Nature. 323 (6088): 533–536. Bibcode:1986Natur.323..533R. doi:10.1038/323533a0.
- LeCun, Y. (1987). Modèles Connexionnistes de l'apprentissage (Connectionist Learning Models), Ph.D. thesis, Universite' P. et M. Curie, 1987
- Herbert Robbins and Sutton Monro A Stochastic Approximation Method The Annals of Mathematical Statistics, Vol. 22, No. 3. (Sep., 1951), pp. 400-407.
- Krizhevsky, Alex; Sutskever, Ilya; Hinton, Geoffrey E. (June 2017). ImageNet classification with deep convolutional neural networks (PDF). Communications of the ACM. 60 (6): 84–90. doi:10.1145/3065386. ISSN 0001-0782.
- From not working to neural networking. The Economist. 25 June 2016