5. Automatic Differentiation#

Recall from Section 2.4 that calculating derivatives is the crucial step in all of the optimization algorithms that we will use to train deep networks. While the calculations are straightforward, working them out by hand can be tedious and error-prone, and this problem only grows as our models become more complex.

Fortunately, all modern deep learning frameworks take this work off our plates by offering automatic differentiation (often shortened to autograd). As we pass data through each successive function, the framework builds a computational graph that tracks how each value depends on the others. To calculate derivatives, automatic differentiation works backwards through this graph, applying the chain rule at each step. The computational algorithm for applying the chain rule in this fashion is called backpropagation.
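
As a tiny illustration of the rule that backpropagation applies mechanically, consider a composition of two functions and its hand-derived derivative. The names below (inner, outer, composed) are placeholders invented for this sketch, not part of any framework.

inner(x) = x^2                         # inner function, with derivative 2x
outer(u) = sin(u)                      # outer function, with derivative cos(u)
composed(x) = outer(inner(x))          # a two-node computational graph
composed_grad(x) = cos(inner(x)) * 2x  # chain rule by hand: outer'(inner(x)) * inner'(x)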

While autograd libraries have become a hot topic over the past decade, they have a long history. In fact, the earliest references to autograd date back over half a century (Wengert, 1964). The core ideas behind modern backpropagation date to a PhD thesis from 1980 (Speelpenning, 1980) and were further developed in the late 1980s (Griewank, 1989). While backpropagation has become the default method for computing gradients, it is not the only option. For instance, the Julia ecosystem also supports forward-mode automatic differentiation (Revels et al., 2016). Before exploring these methods, let's first master the basics of the Zygote package.

5.1. A Simple Function#

To start, we assign x an initial value.

x = collect(1.0:4.0)
4-element Vector{Float64}:
 1.0
 2.0
 3.0
 4.0

We now define a function of x.

using LinearAlgebra

y(x) = 2x⋅x
y (generic function with 1 method)

We can now take the gradient of y with respect to x by using the ' operator, which Zygote overloads so that y' denotes the derivative of y; broadcasting it over x evaluates the derivative at each element.

using Zygote

grad = y'.(x)
4-element Vector{Float64}:
  4.0
  8.0
 12.0
 16.0

We can now verify that the automatic gradient computation and the expected result are identical.

grad == 4x
true
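
The broadcast above differentiates y one element at a time. As a sketch of the equivalent whole-vector call, using the gradient function that Zygote also exports (and which we rely on again later), we could instead write:

first(gradient(y, x)) == 4x   # the gradient of 2x⋅x with respect to the whole vector x is 4x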

Now let’s calculate another function of x and take its gradient.

y(x) = sum(x)
y'.(x)
4-element Vector{Float64}:
 1.0
 1.0
 1.0
 1.0

5.2. Backward for Non-Scalar Variables#

When y is a vector, the most natural interpretation of the derivative of y with respect to a vector x is a matrix called the Jacobian that contains the partial derivatives of each component of y with respect to each component of x. Likewise, for higher-order y and x, the differentiation result could be an even higher-order tensor.

While Jacobians do show up in some advanced machine learning techniques, more commonly we want to sum up the gradients of each component of y with respect to the full vector x, yielding a vector of the same shape as x. For example, we often have a vector representing the value of our loss function calculated separately for each example among a batch of training examples. Here, we just want to sum up the gradients computed individually for each example.

y(x) = x.*x
y'.(x)
4-element Vector{Float64}:
 2.0
 4.0
 6.0
 8.0
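
To make the summing explicit, we can take the gradient of the summed output with respect to the whole vector in a single call. This sketch uses Zygote's gradient function with the definitions above:

first(gradient(x -> sum(y(x)), x)) == 2x   # sum the components of y, then differentiate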

5.3. Detaching Computation#

Sometimes, we wish to move some calculations outside of the recorded computational graph. For example, say that we use the input to create some auxiliary intermediate terms for which we do not want to compute a gradient. In this case, we need to detach the respective computational graph from the final result. The following toy example makes this clearer: suppose we have z = x * y and y = x * x, but we want to focus on the direct influence of x on z rather than the influence conveyed via y. In this case, we can create a new variable u that takes the same value as y but whose provenance (how it was created) has been wiped out. Thus u has no ancestors in the graph and gradients do not flow through u to x. For example, taking the gradient of z = x * u will yield the result u (not 3 * x * x, as you might have expected since z = x * x * x). Because u is computed outside the differentiated function below, Zygote treats it as a constant. We can take the gradient of z with respect to x by using the gradient function.

y(x) = x.*x
u = y(x)
z(x) = x.*u

first(gradient(x -> sum(z(x)), x)) == u
true

Note that while this procedure keeps gradients from flowing through u to x in the computation of z, it does not affect y itself: we can still calculate the gradient of y with respect to x.

y'.(x) == 2x
true
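
For comparison, if we skip the detaching step and let z depend on x through y as well, gradients flow along both paths and we recover the full chain-rule result. Here z_cubed is a hypothetical helper name used only for this sketch:

z_cubed(x) = x .* y(x)                                   # no detaching: elementwise x * x * x
first(gradient(x -> sum(z_cubed(x)), x)) == 3 .* x.^2    # the gradient flows through y as well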

5.4. Gradients and Julia Control Flow#

So far we have reviewed cases where the path from input to output was well defined via a function such as z = x * x * x. Programming offers us a lot more freedom in how we compute results. For instance, we can make them depend on auxiliary variables or condition choices on intermediate results. One benefit of using automatic differentiation is that even if building the computational graph of a function requires passing through a maze of Julia control flow (e.g., conditionals, loops, and arbitrary function calls), we can still calculate the gradient of the resulting variable. To illustrate this, consider the following code snippet, where the number of iterations of the while loop and the branch taken by the conditional both depend on the value of the input a.

function f(a)
    b = 2a
    while norm(b) < 1000
        b = 2b
    end
    c = sum(b) > 0 ? b : 100b
    return c
end
f (generic function with 1 method)

Below, we call this function, passing in a random value as input. Since the input is a random variable, we do not know what form the computational graph will take. Note, however, that f is piecewise linear in its input: it returns c * a for some constant c whose value depends on a. Consequently, f(a) / a recovers that constant and must match the gradient.

a = randn()
f'(a) == f(a)/a
true

Dynamic control flow is very common in deep learning. For instance, when processing text, the computational graph depends on the length of the input. In these cases, automatic differentiation becomes vital for statistical modeling since it is impossible to compute the gradient a priori.
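
As a small sketch of this idea (running_product is a hypothetical helper written for this example, not part of Zygote), the number of loop iterations below depends on how long the input is, yet the gradient function still works:

function running_product(x)
    p = one(eltype(x))     # accumulator for the product of all entries
    for xi in x            # the loop length depends on the input
        p *= xi
    end
    return p
end

first(gradient(running_product, [1.0, 2.0, 3.0]))   # [6.0, 3.0, 2.0], i.e. each partial is p / xᵢ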