16  Automatic Differentiation

Alec Loudenback

One machine can do the work of fifty ordinary men. No machine can do the work of one extraordinary man. - Elbert Hubbard, Thousand and One Epigrams (1911)

16.1 Chapter Overview

Harnessing the chain rule to compute derivatives not just of simple functions, but of complex programs.

16.2 Motivation for (Automatic) Derivatives

Derivatives are one of the most useful analytical tools we have. Determining the rate of change with respect to an input is effectively sensitivity testing. Knowing the derivative lets you optimize things faster (see Chapter 17). You can test properties and implications (monotonicity, maxima/minima). These applications make derivatives foundational for machine learning, generative models, scientific computing, and more.

We will work up concepts on computing derivatives, from the most basic (finite differentiation) to automatic differentiation (forward mode and then reverse mode). These tools can be used in many modeling applications. Automatic Differentiation (“AD” or “autodiff”) refers to innovative techniques to get computers to perform differentiation of complex algorithms (code) that would be intractable for a human to perform manually.

16.3 Finite Differentiation

Finite differentiation evaluates a function \(f\) at a value \(x_0\) and then at a nearby value \(x_0+\epsilon\). The line drawn through these two points estimates the line tangent to \(f\) at \(x_0\) - the derivative has been found by approximation. That is, we are looking to approximate the derivative using the property:

\[ f'(x_0) = \lim_{\epsilon \to 0} \frac{f(x_0 + \epsilon) - f(x_0)}{\epsilon} \]

We can approximate the result by simply choosing a small \(\epsilon\).

There are also several flavors of finite differentiation to be aware of:

  • forward difference is as defined in the above equation, where \(\epsilon\) is added to \(x_0\)
  • reverse difference is as defined in the above equation, where \(\epsilon\) is subtracted from \(x_0\)
  • central difference is where we evaluate at \(x_0 + \epsilon\) and \(x_0 - \epsilon\), take the difference, and divide by \(2\epsilon\) (see the sketch following this list)
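
To make these concrete, here is a minimal sketch of the three estimators in Julia (the helper names are ours for illustration, not from any library):

forward_diff(f, x, ϵ) = (f(x + ϵ) - f(x)) / ϵ
backward_diff(f, x, ϵ) = (f(x) - f(x - ϵ)) / ϵ
central_diff(f, x, ϵ) = (f(x + ϵ) - f(x - ϵ)) / 2ϵ

# For example, the derivative of x^2 at x = 3 is 6:
central_diff(x -> x^2, 3.0, 1e-6)   # ≈ 6.0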

The benefit of the central difference is that it limits issues around minima and maxima, where the slope at a trough or peak would appear much steeper than it really is when using the forward or reverse difference. Here’s a picture of this:

One benefit of the central difference method is that it is often more accurate than the forward or reverse differences. However, it comes at the cost of an additional function evaluation in many circumstances. Take, for example, the process of optimizing a function to find a maximum or minimum.

Maxima-finding algorithms usually involve guessing an initial point, evaluating the function at that point, and determining the derivative of the function at that point. Both pieces of information are used to update the guess to one that is closer to the solution; this approach is used in many optimization algorithms such as Newton’s Method. With the central difference, each step requires evaluating the function three times: at \(x\), \(x+\epsilon\), and \(x-\epsilon\). With the forward or reverse difference, you can reuse the function evaluation at the current guess \(x\) as one of the components in the estimate of the derivative, thereby saving an evaluation of the function on each iteration, as in the sketch below.
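
Here is a minimal sketch of that reuse, assuming we are searching for a root of \(f\) with Newton’s method and a forward-difference estimate of the slope (the function and step size here are illustrative only):

function newton_forward_diff(f, x; ϵ=1e-8, iterations=25)
    fx = f(x)
    for _ in 1:iterations
        slope = (f(x + ϵ) - fx) / ϵ   # reuse fx rather than re-evaluating f(x)
        x -= fx / slope               # update the guess using the value and the slope
        fx = f(x)                     # this evaluation is reused on the next iteration
    end
    return x
end

newton_forward_diff(x -> x^2 - 2, 1.0)   # ≈ √2 ≈ 1.41421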

There are additional challenges with the finite differentiation method. In practice, we are often interested in much more complex functions than \(x^2\). For example, we may actually be interested in the sum of a series that is many elements long or that contains more complex operations than basic algebra. In the prior example, the \(\epsilon\) was set unusually wide for demonstration purposes. As \(\epsilon\) grows smaller, the accuracy of all three finite difference methods generally increases. However, that’s not always the case, due both to the complexity of the function you may be trying to differentiate and to the numerical inaccuracies of floating point math.

To demonstrate, here is a more complex example using an arbitrary function

\[ f(x) = \exp(x) \]

For this example we’ll show the absolute error of the central difference estimate calculated at different values of \(\epsilon\):

using CairoMakie

f(x) = exp(x)
ϵ = 10 .^ (range(-16, stop=0, length=100))
x0 = 1
estimate = @. (f(x0 + ϵ) - f(x0 - ϵ)) / 2ϵ
actual = f(x0)  # (1)

fig = Figure()
ax = Axis(fig[1, 1], xscale=log10, yscale=log10, xlabel="ϵ", ylabel="absolute error")
scatter!(ax, ϵ, abs.(estimate .- actual))
fig

1. The derivative of \(f(x) = \exp(x)\) is itself; that is, \(f'(x) = f(x)\) in this special case.
Figure 16.1: A log-log plot showing the absolute error of the finite differences. Further to the left, roundoff error dominates while further to the right, truncation error dominates.
Note

The @. in the code example above is a macro that applies broadcasting to every function call and operator to its right. @. (f(x0 + ϵ) - f(x0 - ϵ)) / 2ϵ is the same as (f.(x0 .+ ϵ) .- f.(x0 .- ϵ)) ./ (2 .* ϵ).

A few observations:

  1. At virtually every value of ϵ we observe some error from the true derivative.
  2. That error is the sum of two parts:
    1. Truncation error, which is inherent in using a given non-zero value for ϵ rather than taking the analytic limit as \(\epsilon \to 0\). The larger \(\epsilon\) is, the larger the truncation error.
    2. Roundoff error, which arises due to the limited precision of floating point math. The smaller \(\epsilon\) is, the more this error is amplified by subtracting two nearly equal function values.

The implication of this is that we often need to be careful about the choice of ϵ, as the optimal choice will vary depending on the function and the point at which we are attempting to evaluate the derivative. This presents a number of practical difficulties in various algorithms, though there are common rules of thumb, sketched below.
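
A common rule of thumb from numerical analysis (an assumption about reasonably well-scaled problems, not a guarantee) is to take the step size near the square root of machine epsilon for the forward and reverse differences, and near its cube root for the central difference:

# Rule-of-thumb step sizes for Float64 inputs of order one:
sqrt(eps(Float64))   # ≈ 1.5e-8, for forward/reverse differences
cbrt(eps(Float64))   # ≈ 6.1e-6, for central differences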

Additionally, when computing the finite difference we must evaluate the function multiple times to determine a single estimate of the derivative. When performing something like optimization, the process typically involves iteratively making many guesses requiring many evaluations of the approximate derivative. Further, the efficiency of the algorithm usually depends on the accuracy of computing the derivative!

Despite these accuracy and computational drawbacks, finite differences can be very useful in many circumstances. However, a more appealing alternative approach is covered next.

16.4 Automatic Differentiation

Automatic differentiation is essentially the practice of defining, algorithmically, what the derivative of a function should be. We are able to do this through a creative application of the chain rule. Recall that the chain rule allows us to compute the derivative of a composite function using the derivatives of the component functions:

\[ h(x)=f(g(x)) \] \[ h'(x) = f'(g(x)) g'(x) \]

Using this rule, we can define how elementary operations behave when differentiated. Combined with the fact that most computer code is built up from a collection of elementary operations, we can get a very long way in differentiating complex functions.
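
For instance, for \(h(x) = \exp(\sin(x))\), the chain rule combines two elementary derivatives that we already know:

\[ h(x) = \exp(\sin(x)) \] \[ h'(x) = \exp(\sin(x)) \cdot \cos(x) \]

Neither piece requires anything beyond the derivative rules for \(\exp\) and \(\sin\); the chain rule tells us how to compose them.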

16.4.1 Dual Numbers

To understand where we are going, let’s remind ourselves about complex numbers. Complex numbers are of the form below, which has a real part (\(r\)) and an imaginary part (\(iq\)):

\[ r+iq \]

By definition we say that \(i^2 = -1\). This is useful because it allows us to perform certain types of operations (e.g. finding a square root of a negative number) that are otherwise unsolvable with just the real numbers¹. After defining how the normal algebraic operations (addition, multiplication, etc.) work for imaginary numbers, we are able to utilize them for a variety of practical mathematical tasks.

What is meant by extending the algebraic operations to imaginary numbers? For example, we must state how addition should work for complex numbers:

\[ (r+iq) + (s+iu) = (r+s) + i(q+u) \]

In a similar fashion to extending the real numbers (\(\mathbb{R}\)) with an imaginary part, for automatic differentiation we will extend them with a dual part. A dual number is one of the form:

\[ a + \epsilon b \]

where \(\epsilon^2 = 0\) and \(\epsilon \neq 0\) by definition. While \(a\) represents the function value, \(b\) carries its derivative.
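
To see why \(b\) ends up carrying the derivative, consider a Taylor expansion of a smooth function \(f\) about \(a\). Because \(\epsilon^2 = 0\), every term beyond the first-order one vanishes:

\[ f(a + \epsilon b) = f(a) + f'(a) \epsilon b + \frac{f''(a)}{2} (\epsilon b)^2 + \dots = f(a) + \epsilon f'(a) b \]

An example should make this clearer. First, let’s define a DualNumber: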

struct DualNumber{T,U}                                   # (1)
    a::T
    b::U
    function DualNumber(a::T, b::U=zero(a)) where {T,U}  # (2)
        return new{T,U}(a, b)
    end
end

1. We define this type parametrically to handle all sorts of <:Real types and to allow a and b to vary in type in case a mathematical operation causes a type change (e.g. as in the case of integers becoming a floating point number like 10/4 == 2.5).
2. In the constructor, we set the default value of b to be zero(a). zero(a) is a generic way to create a value equal to zero with the same type as the argument a, e.g. zero(12.0) == 0.0 and zero(12) == 0.

Now let’s define how dual numbers work under addition. The mathematical rule is:

\[ (a+\epsilon b)+(c+\epsilon d)=(a+c)+\epsilon(b+d) \]

We then need to define how it works for the combinations of numbers that we might receive as arguments to our function (this is an example where multiple dispatch greatly simplifies the code compared to object oriented single dispatch!):

Base.:+(d::DualNumber, e::DualNumber) = DualNumber(d.a + e.a, d.b + e.b)
Base.:+(d::DualNumber, x) = DualNumber(d.a + x, d.b)
Base.:+(x, d::DualNumber) = d + x

And here’s how we would get the derivative of a very simple function:

f1(x) = 5 + x

f1(DualNumber(10, 1))
DualNumber{Int64, Int64}(15, 1)

That’s not super interesting though - the derivative of f1 is just 1 and we supplied that in the construction of DualNumber. We did at least prove that we can add the 10 and 5!

Let’s make this more interesting by also defining the multiplication operation on dual numbers. We’ll follow the product rule:

\[ (u \times v)' = u' \times v + u \times v' \]

Base.:*(d::DualNumber, e::DualNumber) = DualNumber(d.a * e.a, d.b * e.a + d.a * e.b)
Base.:*(x, d::DualNumber) = DualNumber(d.a * x, d.b * x)
Base.:*(d::DualNumber, x) = x * d

Now what if we evaluate this function:

f2(x) = 5 + 3x

f2(DualNumber(10, 1))
DualNumber{Int64, Int64}(35, 3)

We have found that the second component is 3, which is indeed the derivative of \(5+3x\) with respect to \(x\). And in the first part we have the value of f2 evaluated at 10.

Note

When calculating the derivative, why do we start with 1 in the dual part of the number? Because the derivative of a variable with respect to itself is 1. From this unit starting point, each operation applied accumulates its contribution to the derivative in the \(b\) part of \(a + \epsilon b\), as the next example shows.
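
For instance, with the addition and multiplication rules defined so far, multiplying a seeded dual number by itself recovers the derivative of \(x^2\) (here evaluated at \(x = 3\)):

x = DualNumber(3, 1)
x * x
DualNumber{Int64, Int64}(9, 6)

The first component is the value \(3^2 = 9\) and the second is the derivative \(2x = 6\).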

We can also define this for things like transcendental functions:

Base.exp(d::DualNumber) = DualNumber(exp(d.a), exp(d.a) * d.b)
Base.sin(d::DualNumber) = DualNumber(sin(d.a), cos(d.a) * d.b)
Base.cos(d::DualNumber) = DualNumber(cos(d.a), -sin(d.a) * d.b)
exp(DualNumber(1, 1))
DualNumber{Float64, Float64}(2.718281828459045, 2.718281828459045)

sin(DualNumber(0, 1))
DualNumber{Float64, Float64}(0.0, 1.0)
cos(DualNumber(0, 1))
DualNumber{Float64, Float64}(1.0, -0.0)

And finally, to put it all together in a more usable wrapper, we can define a function which will calculate the derivative of another function at a certain point. This function applies f to an initialized DualNumber and then returns the \(b\) component from the result:

derivative(f, x) = f(DualNumber(x, one(x))).b
derivative (generic function with 1 method)

And then evaluating it on a more complex function like \(f(x) = 5e^{\sin(x)}+3x\) at \(x = 0\): analytically the derivative is \(5\cos(x)e^{\sin(x)} + 3\), which equals \(8\) at zero and matches what we calculate next:

f3(x) = 5 * exp(sin(x)) + 3x
derivative(f3, 0)
8.0

We have demonstrated that, through the clever use of dual numbers and the chain rule, complex expressions can be automatically differentiated by a computer exactly, limited only by the same machine precision that applies to our primary function of interest.

Libraries exist (such as ChainRules.jl) which define large numbers of predefined rules for many more operations, even beyond basic algebraic functions. This allows complex programs to be differentiated automatically.

16.5 Performance of Automatic Differentiation

Recall that in the finite difference method, we generally had to evaluate the function two or three times to approximate the derivative. Here we have a single function call that provides both the value and the derivative at that value. How does this compare performance-wise to simply evaluating the function a single time? Let’s check how long it takes to compute a Float64 versus a DualNumber:

using BenchmarkTools
@btime f3(rand())
  7.125 ns (0 allocations: 0 bytes)
7.134593806013554
@btime f3(DualNumber(rand(), 1))
  11.302 ns (0 allocations: 0 bytes)
DualNumber{Float64, Float64}(6.934105133990432, 9.096900017054564)

In performing this computation, the compiler has been able to optimize it such that we are effectively able to compute the function and its derivative at less than two times the cost of the base function evaluation. As the function gets more complex the overhead does increase, but automatic differentiation is still generally preferred over finite differentiation. This advantage becomes more pronounced as we contemplate derivatives with respect to many variables at once or higher-order derivatives.

Note

In fact, it’s largely the advances in the application of automatic differentiation that have led to the explosion of machine learning and artificial intelligence techniques in the 2010s/2020s. The “learning” process relies on solving for parameter weights and would be too computationally expensive using finite differences.

These applications of AD in specialized C++ libraries underpin libraries like PyTorch, TensorFlow, and Keras. These libraries specialize in allowing AD on a limited subset of operations. Julia’s available AD libraries are more general and can be applied to many more scenarios.

16.6 Automatic Differentiation in Practice

We have, of course, not defined an exhaustive list of operations, covering only +, *, exp, sin, and cos. There are only a few more arithmetic operations (-, /) and transcendental functions (log, more trigonometric functions, etc.) to define before we would have a very robust set of algebraic operations for our DualNumber. In fact, it’s possible to go even further and define the behavior of differentiation through conditional expressions and iteration in order to differentiate fairly complex functions, or to extend the mechanism to partial derivatives and higher-order derivatives as well.
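
As a flavor of how little remains, here is a sketch of subtraction and division rules for our DualNumber, following the usual difference and quotient rules (these definitions are our own illustration, not taken from a library):

Base.:-(d::DualNumber, e::DualNumber) = DualNumber(d.a - e.a, d.b - e.b)
Base.:-(d::DualNumber, x) = DualNumber(d.a - x, d.b)
Base.:-(x, d::DualNumber) = DualNumber(x - d.a, -d.b)
Base.:/(d::DualNumber, e::DualNumber) = DualNumber(d.a / e.a, (d.b * e.a - d.a * e.b) / e.a^2)

In practice, rather than hand-rolling every rule, we can lean on a mature forward-mode package such as ForwardDiff.jl. The example below prices a European call option using the Black-Scholes formula and then asks for its derivatives with respect to each input: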

import Distributions
import ForwardDiff

N(x) = Distributions.cdf(Distributions.Normal(), x)

function d1(S, K, τ, r, σ, q)
    return (log(S / K) + (r - q + σ^2 / 2) * τ) / (σ * √(τ))
end

function d2(S, K, τ, r, σ, q)
    return d1(S, K, τ, r, σ, q) - σ * √(τ)
end

"""
    eurocall(parameters)

Calculate the Black-Scholes implied option price for a European call where `parameters` is a vector with the following six elements:

- `S` is the current asset price
- `K` is the strike or exercise price
- `τ` is the time remaining to maturity (can be typed with \\tau[tab])
- `r` is the continuously compounded risk free rate
- `σ` is the (implied) volatility (can be typed with \\sigma[tab])
- `q` is the continuously paid dividend rate
"""
function eurocall(parameters)
    S, K, τ, r, σ, q = parameters  # (1)
    iszero(τ) && return max(zero(S), S - K)
    d₁ = d1(S, K, τ, r, σ, q)
    d₂ = d2(S, K, τ, r, σ, q)
    return (N(d₁) * S * exp(τ * (r - q)) - N(d₂) * K) * exp(-r * τ)
end

1. We put the various variables inside a single parameters vector so that we can make a single gradient call instead of a separate derivative call for each parameter.
S = 1.0
K = 1.0
τ = 30 / 365
r = 0.05
σ = 0.2
q = 0.0
params = [S, K, τ, r, σ, q]
eurocall(params)
0.02493376819403728
Tip

Some terminology in differentiation:

  • Scalar-valued function: A function whose output is a single scalar.
  • Vector-valued (array-valued) function: A function whose output is a vector or array.
  • Derivative (or partial derivative): The (instantaneous) rate of change of the output with respect to one input variable. In a multivariate context, these are partial derivatives, e.g., \(\frac{\partial}{\partial x} f(x,y,z)\).
  • Gradient: For a scalar-valued function of several variables, the gradient is the vector of all first partial derivatives, e.g. \(\nabla f(x,y,z) = \bigl[\frac{\partial f}{\partial x}, \frac{\partial f}{\partial y}, \frac{\partial f}{\partial z}\bigr]^T\).
  • Jacobian: For a vector-valued function, the Jacobian is the matrix of first partial derivatives. If \(f: \mathbb{R}^n \to \mathbb{R}^m\), its Jacobian is an \(m \times n\) matrix.
  • Hessian: For a scalar-valued function of several variables, the Hessian is the matrix of all second partial derivatives (i.e., it’s an \(n \times n\) matrix).

With the above code, now we can get the partial derivatives with respect to each parameter. The first, third, fourth, fifth, and sixth correspond to the common “greeks” delta, theta, rho, vega, and epsilon respectively. The second term is the partial derivative with respect to the strike price:

ForwardDiff.gradient(eurocall, params)
6-element Vector{Float64}:
  0.5399635456230838
 -0.5150297774290467
  0.16420676980838977
  0.042331214583209334
  0.11379886104405816
 -0.04438056539367815

We can also get the second order greeks with another call. This includes many uncommon second order partial derivatives; the popular gamma, for example, is in the [1,1] position:

ForwardDiff.hessian(eurocall, params)
6×6 Matrix{Float64}:
  6.92276    -6.92276    0.242297   0.568994   -0.0853491   -0.613375
 -6.92276     6.92276   -0.07809   -0.526663    0.199148     0.568994
  0.242297   -0.07809   -0.846846   0.521448    0.685306    -0.559878
  0.568994   -0.526663   0.521448   0.0432874  -0.0163683   -0.0467667
 -0.0853491   0.199148   0.685306  -0.0163683   0.00245525   0.007015
 -0.613375    0.568994  -0.559878  -0.0467667   0.007015     0.0504144

16.6.1 Performance

Earlier we assessed the performance impact of computing derivatives using DualNumber on a very basic function. What about if we take a more realistic example like eurocall? We observe approximately a 13x slowdown (about 252 ns versus 19 ns in the runs below) when computing all of the first order derivatives, which isn’t bad considering we are computing six outputs at once!

@btime eurocall($params)
  18.973 ns (0 allocations: 0 bytes)
0.02493376819403728
let
    g = similar(params)  # (1)
    @btime ForwardDiff.gradient!($g, eurocall, $params)
end

1. To avoid benchmarking the memory allocation for a new array, we pre-allocate the array to store the result and then call gradient! to fill in g with each result.

  251.703 ns (3 allocations: 704 bytes)
6-element Vector{Float64}:
  0.5399635456230838
 -0.5150297774290467
  0.16420676980838977
  0.042331214583209334
  0.11379886104405816
 -0.04438056539367815

16.7 Forward Mode and Reverse Mode

The approach to AD outlined above is called forward mode AD, where the derivative is carried forward through the computation and accumulated at each step. The alternative is to first evaluate the function and then work backwards, accumulating the partial derivatives, in what’s called reverse mode AD.

Reverse mode requires more book-keeping: intermediate values must be recorded during the forward evaluation so that the partial derivatives can be carried backwards afterwards, in contrast to the single forward pass of the DualNumber approach. A small hand-worked example follows.
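
Here is a minimal hand-worked sketch of what a reverse pass looks like for \(y = x \sin(x)\) at \(x = 2\); a real reverse-mode library records this “tape” of intermediate values and performs the backward sweep automatically:

# Forward sweep: evaluate and record intermediate values.
x = 2.0
a = sin(x)   # intermediate
y = a * x    # output

# Reverse sweep: seed ȳ = 1 and push adjoints back through each operation.
ȳ = 1.0
ā = ȳ * x                  # ∂y/∂a, from y = a * x
x̄ = ȳ * a + ā * cos(x)     # direct path through y = a * x plus the path through a = sin(x)
x̄ ≈ sin(2.0) + 2.0 * cos(2.0)   # matches the analytic derivative of x*sin(x)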

16.8 Practical tips for Automatic Differentiation

Here are a few practical tips to keep in mind.

16.8.1 Choosing between Reverse Mode and Forward Mode

Forward mode is more efficient when the number of outputs is much larger than the number of inputs. When the number of inputs is much larger than the number of outputs, reverse mode will generally be more efficient. Examples of the number of inputs being larger than the number of outputs include a statistical analysis where many features are used to predict a limited number of outcome variables, or a complex model with a large number of parameters.

16.8.2 Mutation

Automatic differentiation works through most code, but a particularly tricky part to get right is when values within arrays are mutated (changed). It’s possible to handle, but it may require a bit more boilerplate to set up. As of 2024, Enzyme.jl has the best support for functions with mutation inside of them.

16.8.3 Custom Rules

Custom rules for new or unusual functions can be defined, but this is an area best explored equipped with a bit of calculus and a deeper understanding of both forward mode and reverse mode. ChainRules.jl provides an interface for defining additional rules that hook into the AD infrastructure in Julia, as well as good documentation on how to extend the rules for your own functions. A sketch of what such a rule looks like follows.
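
Here is a minimal sketch using the ChainRulesCore.jl interface (frule for forward mode, rrule for reverse mode). The function my_square is a hypothetical example of ours, and the exact interface should be confirmed against the ChainRules documentation:

using ChainRulesCore

my_square(x) = x^2

# Forward-mode rule: given the input tangent Δx, return the value and its tangent.
function ChainRulesCore.frule((_, Δx), ::typeof(my_square), x)
    return my_square(x), 2x * Δx
end

# Reverse-mode rule: return the value and a pullback that maps the output adjoint ȳ
# back to adjoints of the inputs (NoTangent() stands in for the function itself).
function ChainRulesCore.rrule(::typeof(my_square), x)
    my_square_pullback(ȳ) = (NoTangent(), 2x * ȳ)
    return my_square(x), my_square_pullback
end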

16.8.4 Available Libraries in Julia

  • ForwardDiff.jl provides robust forward-mode AD.
  • Zygote.jl is a reverse-mode package with the innovations of being able to differentiate structs in addition to arrays and scalars.
  • Enzyme.jl is a newer package which allows for both forward and reverse mode, but has the advantage of supporting array mutation. Additionally, Enzyme works at the level of LLVM code (an intermediate level between high level Julia code and machine code) which allows for different, sometimes better, optimizations.
  • DifferentiationInterface.jl is a wrapper library providing a consistent API while being able to exchange different backends.

In the authors’ experience, DifferentiationInterface.jl is a good starting point, diving into the specific libraries only if certain features are needed. A sketch of its usage follows.
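
As a rough sketch of what that looks like for our eurocall example (the exact function names and argument order are an assumption to verify against the DifferentiationInterface.jl documentation):

using DifferentiationInterface        # backend-agnostic API (assumed: gradient(f, backend, x))
using ADTypes: AutoForwardDiff        # backend specifier types live in ADTypes.jl
import ForwardDiff                    # the backend that actually performs the differentiation

# Same gradient as before, but requested through the common interface;
# swapping in another backend only requires changing AutoForwardDiff().
DifferentiationInterface.gradient(eurocall, AutoForwardDiff(), params)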


  1. Richard Feynman has a wonderful, short lecture on algebra here: https://www.feynmanlectures.caltech.edu/I_22.html↩︎