It works by propagating errors backward through the network, using the chain rule of calculus to compute gradients and then iteratively updating the weights and biases. Combined with optimization techniques like gradient descent, backpropagation enables the model to reduce loss across epochs and effectively learn complex patterns from data. In this post we worked through the backpropagation algorithm in detail for some simplified examples. The general idea is to compute the gradient by taking the partial derivatives of the loss function with respect to each parameter using the chain rule. The contribution a neuron in layer L receives from the neurons in layer L-1 is determined not just by the weights applied to L-1's output values, but by the actual (pre-weight) output values themselves.
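To make the chain-rule bookkeeping concrete, here is a minimal sketch for a two-layer network with a squared-error loss; the shapes and variable names (x, W1, W2, a1) are assumptions for illustration, not taken from the post's examples. Note that the gradient for W2 involves a1, the pre-weight outputs of the previous layer, which is exactly the point above.

```python
import numpy as np

# Tiny network x -> W1 -> ReLU -> W2 -> prediction, squared-error loss.
rng = np.random.default_rng(0)
x = rng.normal(size=3)          # input
y = np.array([1.0])             # target
W1 = rng.normal(size=(4, 3))    # layer L-1 weights
W2 = rng.normal(size=(1, 4))    # layer L weights

# Forward pass, keeping the intermediate values the chain rule will need.
z1 = W1 @ x                     # pre-activations of layer L-1
a1 = np.maximum(z1, 0)          # ReLU outputs of layer L-1
z2 = W2 @ a1                    # network output
loss = 0.5 * np.sum((z2 - y) ** 2)

# Backward pass: multiply the local partial derivatives together.
dL_dz2 = z2 - y                 # dL/dz2
dL_dW2 = np.outer(dL_dz2, a1)   # uses a1: the (pre-weight) outputs of layer L-1
dL_da1 = W2.T @ dL_dz2          # chain back through W2
dL_dz1 = dL_da1 * (z1 > 0)      # ReLU derivative is a 0/1 mask
dL_dW1 = np.outer(dL_dz1, x)    # gradient for the first layer's weights

print(dL_dW1.shape, dL_dW2.shape)   # (4, 3) (1, 4)
```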
Neural networks that have more than one layer, such as multilayer perceptrons (MLPs), on the other hand, must be trained using methods that can adjust the weights and biases in the hidden layers as well. Another interesting property, which we already pointed out, is that the backward network consists only of linear layers. This is true no matter what the forward network consists of (even if it is not a conventional neural network but some arbitrary computation graph). This happens because backprop implements the chain rule, and the chain rule is always a product of Jacobian matrices. More intuitively, each Jacobian is a locally linear approximation of the function computed at that node, so each one can be represented by a linear layer. Modern deep neural networks, often with dozens of hidden layers each containing many neurons, can comprise thousands, millions, or, in the case of most large language models (LLMs), billions of such adjustable parameters.
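As a quick sanity check on that claim, the sketch below (shapes and names are assumed for illustration, not from the post) fixes the forward activations of a small ReLU network and verifies that its backward map, the vector-Jacobian product, is linear in the incoming gradient even though the forward pass is nonlinear.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=3)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4))

z1 = W1 @ x
mask = (z1 > 0).astype(float)   # ReLU derivative, fixed by the forward pass

def vjp(g):
    """Backward pass: pull a gradient g at the output back to the input."""
    g = W2.T @ g        # linear layer
    g = g * mask        # ReLU backward: elementwise scaling, i.e. a diagonal linear map
    return W1.T @ g     # linear layer

# Linearity check: vjp(a*u + b*v) == a*vjp(u) + b*vjp(v)
u, v = rng.normal(size=2), rng.normal(size=2)
assert np.allclose(vjp(2.0 * u + 3.0 * v), 2.0 * vjp(u) + 3.0 * vjp(v))
```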
- Abstractly speaking, the purpose of backpropagation is to train a neural network to make better predictions through supervised learning.
- Changing a weight whose component in the negative gradient vector has a larger magnitude has a bigger effect on the cost (see the sketch after this list).
- The analogy is not perfect since the untrained network is not really “thinking” about a 2 when it sees this example; it’s more that the label on the training data is hardcoding what the network should be thinking about.
- If you’re not familiar with his channel, do yourself a favor and check it out (3B1B Channel).
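The bullet about gradient magnitudes can be illustrated with a single gradient-descent step on a made-up two-weight quadratic cost (not one of the post's examples): the weight with the larger gradient component moves further, and that movement accounts for most of the drop in cost.

```python
import numpy as np

def cost(w):
    return 0.5 * (4.0 * w[0] ** 2 + 0.1 * w[1] ** 2)  # w[0] matters much more

def grad(w):
    return np.array([4.0 * w[0], 0.1 * w[1]])

w = np.array([1.0, 1.0])
lr = 0.1
g = grad(w)
w_new = w - lr * g

print("gradient:", g)                  # |dC/dw0| >> |dC/dw1|
print("weight change:", w_new - w)     # w0 moves much more than w1
print("cost drop:", cost(w) - cost(w_new))
```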
We start at the error node and move back one node at a time, taking the partial derivative of the current node with respect to the node in the preceding layer. Each term is chained onto the preceding term to get the total effect; this is, of course, the chain rule. There are a few interesting things about this forward-backward network. One is that activations from the ReLU layers get transformed to become parameters of a linear layer in the backward network (see Equation 14.7).
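As a tiny illustration of that last point (variable names are assumed for illustration, not taken from the post or from Equation 14.7 itself), the ReLU derivative mask saved during the forward pass can be written as a diagonal matrix, i.e. the weights of a linear layer in the backward network.

```python
import numpy as np

rng = np.random.default_rng(2)
z1 = rng.normal(size=4)                  # pre-activations saved from the forward pass
D = np.diag((z1 > 0).astype(float))      # backward "layer" whose parameters come from the activations

g_next = rng.normal(size=4)              # gradient arriving from the node one step back
g_prev = D @ g_next                      # the backward step is just a matrix multiply
print(g_prev)
```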
Error and Loss
If you’re beginning with neural networks and/or need a refresher on forward propagation, activation functions, and the like, see the 3B1B video in