Automatic differentiation (AD) is a technique for computing (transposed) derivatives of functions implemented by computer programs, essentially by applying the chain-rule across program code. It is typically the method of choice for computing derivatives in machine learning and scientific computing because of its efficiency and numerical stability.
AD works by calculating the (transposed) derivative of a composite program in terms of the (transposed) derivatives of the parts, by using the chain-rule. The distinction between derivatives and transposed derivatives leads to the main distinction in automatic differentiation modes:
forward mode AD: this implements the tangent bundle functor $T$ and calculates derivatives by a forward pass to calculate function values (primals), followed by another forward pass to calculate derivative values (tangents); these two forward passes can be interleaved into a single forward pass for efficiency.
reverse mode AD: this implements the cotangent bundle functor $T^*$ and calculates transposed derivatives by a forward pass to calculate function values (primals), followed by a reverse pass (inverting the control flow of the original program) to calculate transposed derivative values (cotangents); these two passes cannot easily be interleaved due to their differing direction.
When calculating a derivative of a program that implements a function $f:R^n\to R^m$, reverse mode tends to be the more efficient algorithm if $n \gg m$ and forward mode tends to be more efficient if $n\ll m$. Seeing that many tasks in machine learning and statistics require the calculation of derivatives (for use in gradient-based optimization or Monte-Carlo sampling) of functions $f:R^n\to R$ (e.g. probability density functions) for $n$ very large, reverse mode AD tends to be the most popular algorithm.
Let us fix some class of categories with stuff $S$. We will call its members $S$-categories. For example, for our purposes, $S$ might include properties like
and structure like
The idea behind CHAD will be to view forward and reverse AD as the unique structure (stuff) preserving functor (homomorphism of $S$-categories) from the initial $S$-category $Syn$ to two suitably chosen $S$-categories $\Sigma_{CSyn}LSyn$ and $\Sigma_{CSyn}LSyn^{op}$.
Consider the initial $S$-category $Syn$ (put differently, the S-properties category that is freely generated from $S$-structure). We can think of this category as a programming language: its objects are types and its morphisms are programs modulo beta-equivalence and eta-equivalence. In fact, for a wide class of programming languages, we can find a suitable choice of stuff $S$, such that the programming language arises as the initial $S$-category. We will refer to this category as the source language of our AD transformation: its morphisms are the programs we want to differentiate.
For example, if we choose the property part of $S$ to consist of Cartesian closure and the structure part of $S$ consists of designated objects $R$ and $Z$ and morphisms $sin$, $cos$, $(+)$, and $(*)$, then $Syn$ is the simply typed lambda-calculus with base types $R$ and $Z$ and the primitive operations $sin$, $cos$, $(+)$, and $(*)$.
Given a strictly indexed category $L:C^{op}\to Cat$, we can form its Grothendieck construction $\Sigma_C L$, which is a split fibration over $C$. Similarly, we can take the opposite category $L^{op}:C^{op}\to Cat$ of $L$ and form $\Sigma_C L^{op}$ from that.
For most natural choices of stuff $S$, we can find elegant sufficient conditions on $C$ and $L$ that guarantee that $\Sigma_C L$ and $\Sigma_C L^{op}$ are both $S$-categories. Let us call a strictly indexed category $L:C^{op}\to Cat$ satisfying these conditions a $S'$-category (where we think of $S'$ again as stuff). We give the corresponding $S'$ for some examples for different choices of $S$:
Let us write $LSyn:CSyn^{op}\to Cat$ for the initial $S'$-category. We think of this category as the target language of automatic differentiation, in the sense that the forward/reverse derivatives of programs (morphisms) in the source language $Syn$ with consist of an associated primal program that is a morphism in $CSyn$ and an associated tangent/cotangent program that is a morphism in $LSyn$/$LSyn^{op}$. As a programming language, $LSyn$ is a linear dependent type theory over the Cartesian type theory $CSyn$.
Indeed, as for any $S'$-category $L:C^{op}\to Cat$, $\Sigma_C L$ and $\Sigma_C L^{op}$ are $S$-categories, it follows that $\Sigma_{CSyn} LSyn$ and $\Sigma_{CSyn} LSyn^{op}$ are, in particular, $S$-categories. Seeing that $Syn$ is the initial $S$-category, we obtain unique morphisms of $S$-categories:
The category $Set$ of sets and functions gives another example of an $S$-category, for a lot of choices of $S$ (if we choose the sets $[[R]]$ and functions $[[op]]$ that we would like to denote with the types $R$ and operations $op$ in the structure part of $S$). Moreover, the strictly indexed category $Fam(CMon):Set^{op}\to Cat$ of families of commutative monoids tends to give an example of an $S'$-category. By initiality of $Syn$ and $LSyn:CSyn^{op}\to Cat$, we obtain
In particular, we see that source language programs $t\in Syn(A, B)$ get interpreted as a function $[[t]]\in Set([[A]], [[B]])$. Similarly, programs $t\in CSyn(A, B)$ in the target language with Cartesian type get interpreted as a function $[[t]]\in Set([[A]], [[B]])$. Finally, programs $t\in LSyn(A)(B,C)$ in the target language with linear type get interpreted as a function $[[t]] \in Fam(CMon)([[A]])([[B]],[[C]])$: families of monoid homomorphisms.
We say that CHAD calculates the correct derivative (resp. transposed derivative) of a program $s$ if the semantics $[[Df(s)]]$ of the program $Df(s)$ (resp. $Dr(s)$) equals the pair of the semantics $[[s]]$ of $s$ and the derivative $T[[s]]$ (resp. transposed derivative $T^*[[s]]$) of the semantics $[[s]]$ of $s$. CHAD is correct in the sense that it calculates the correct (transposed) derivative of any composite (possibly higher-order) program between first-order types (meaning: types built using only positive type formers), provided that it calculates the correct (transposed) derivatives of all primitive operations like $(*)$ that we used to generate the source language. That is, CHAD is a valid way for compositionally calculating (transposed) derivatives of composite computer programs, as long as we correctly implement the derivatives for all primitive operations (basic mathematical functions like multiplication, addition, sine, cosine) in the language.
We can prove this by a standard logical relations argument, relating smooth curves to their primal and (co)tangent curves. Viewed more abstractly, the proof follows automatically because the Artin gluing along a representable functor (like the hom out of the real numbers) of an $S$-category is itself again an $S$-category, for common most choices of $S$.
Forward mode automatic differentiation was introduced by Robert Edwin Wengert in
An early description of reverse mode automatic differentiation can be found in
but it was already described earlier by others such as Seppo Linnainmaa:
A categorical analysis of (non-interleaved) forward and reverse AD is given by
Matthijs Vákár, CHAD: Combinatory Homomorphic Automatic Differentiation. 2021. arXiv preprint 2103.15776 Haskell implementationunded by European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 895827)
Fernando Lucatelli Nunes?, Matthijs Vákár, CHAD for Expressive Total Languages. 2021. arXiv preprint 2110.00446unded by European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 895827)
A categorical analysis of (interleaved) forward mode AD for calculating higher order derivatives is given by