Automatic differentiation (autodiff)

An artificial neural network (ANN) is usually a function of input X and some parameters b,

f(X,b).

Given X we observe Y as output of the function or mechanism f.

The training of the ANN would involve adjusting b such that f(X,b) is as close to Y as possible by some measure, called “loss”. For example, below is a loss,

l(X,Y,b)=|f(X,b)-Y|,

where X, Y are given. b can be adjusted to make l smaller.

We would compute the mathematical derivatives

∂l/∂b

and move b against the direction of ∂l/∂b to make l smaller.
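
As a toy illustration (not code from micrograd; the numbers, the form of f, and the learning rate are made up), take f(X,b) = X*b with scalar X and b:

    import numpy as np

    X, Y = 2.0, 10.0    # given input and observed output
    b = 1.0             # initial guess for the parameter
    lr = 0.1            # step size ("learning rate")

    for _ in range(20):
        l = abs(X * b - Y)              # loss l(X,Y,b) = |f(X,b) - Y|
        dl_db = np.sign(X * b - Y) * X  # derivative of the loss with respect to b
        b = b - lr * dl_db              # move b against the direction of the derivative

    print(b)    # close to 5, where f(X,b) = Y and the loss is 0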

The capability to automatically differentiate (autodiff) a complex function with respect to its parameters is essential to machine learning libraries: for example Google’s TensorFlow, Meta’s PyTorch, JAX, Apple’s emerging MLX, and micrograd, developed by us at Brief Solutions Ltd.

micrograd autodiff library

Our repository is at https://github.com/brief-ds/micrograd. The README serves as self-contained documentation.

micrograd was started by Andrej Karpathy. The initial version, now the code under tag scalar, works only on scalar values. We extended it to work on vectors, matrices (2-dimensional) and tensors of arbitrary dimension.

The philosophy of micrograd

micrograd separates the symbolic differentiation and numerical calculation:

  1. micrograd does the differentiation, a manipulation of symbols;
  2. the actual numerical calculation is delegated to a numerical library such as NumPy, as sketched below.
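
As a minimal sketch of how this division looks in use (the exact calls are assumptions based on Karpathy’s original micrograd API: Value wrapping a NumPy array, arithmetic operators, and a backward() method, alongside the sum() method shown later):

    import numpy as np
    from micrograd.engine import Value

    x = Value(np.array([1.0, 2.0, 3.0]))  # the numbers live in a NumPy array
    y = (x * x).sum()                     # micrograd records the operations: square, then sum
    y.backward()                          # micrograd differentiates by walking the recorded graph

    print(y.data)   # expected 14.0, computed by NumPy
    print(x.grad)   # expected [2. 4. 6.]: d(sum of x_i^2)/dx_i = 2*x_i, again a NumPy array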

When a machine learning library re-implements the mathematical functions itself, the results may differ across libraries: for example, arctanh(x) in TensorFlow vs NumPy when x is close to 1 or -1.
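
One way to see such a discrepancy for yourself, if TensorFlow is installed (whether and by how much the printed values differ depends on the installed versions):

    import numpy as np
    import tensorflow as tf

    x = np.float32(1 - 1e-7)            # close to the pole of arctanh at 1
    print(np.arctanh(x))                # NumPy's result
    print(tf.math.atanh(x).numpy())     # TensorFlow's result; the last digits may differ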

Today NumPy’s install size is about 100 megabytes. If in the future a more compact and more performant numerical library appears, we will switch to it. The clean division of labour in the design makes that possible.

micrograd can be taught to high schoolers

The core file micrograd/engine.py is less than 500 lines, 10,000+ times smaller than full-featured libraries.

Each mathematical operation is defined in 10-20 lines, for example the sum operation in micrograd/engine.py:


    def sum(self, axis=None):
        ...         # 8 lines of pre-processing

        out = Value(_sum(self.data, axis=axis), (self,), 'sum')

        def _forward(**kwds):
            out.data = _sum(self.data, axis=axis)
        out._forward = _forward

        def _backward():
            # expand out.grad to same number of dimensions
            # as self.data, self.grad
            _out_grad = expand_dims(out.grad, _axis)

            # ... expand further to same shape as self.data
            self.grad += broadcast_to(_out_grad, self.shape)
        out._backward = _backward

        return out

where

  • the _forward() function evaluates the sum, and
  • the _backward() function differentiates the sum with respect to the elements over which the sum was calculated, broadcasting out.grad back to the shape of self.data (a plain-NumPy illustration follows).
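
Since the derivative of a sum with respect to each summed element is 1, _backward only has to route out.grad back to every element’s position. Below is a plain-NumPy illustration of that expand-and-broadcast step, independent of micrograd and with made-up shapes:

    import numpy as np

    data = np.arange(6.0).reshape(2, 3)   # stands in for self.data
    out_grad = np.array([1.0, 2.0])       # stands in for out.grad after summing over axis=1

    # expand out.grad to the same number of dimensions as data ...
    expanded = np.expand_dims(out_grad, 1)         # shape (2, 1)
    # ... then broadcast to the same shape as data
    grad = np.broadcast_to(expanded, data.shape)   # shape (2, 3)

    print(grad)
    # [[1. 1. 1.]
    #  [2. 2. 2.]]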

micrograd can be inspected with Python’s built-in profiler

Timing code to find where it spends its time is called “profiling”. Complex machine learning libraries require additional, purpose-written code to inspect themselves. Because micrograd is pure Python, one may profile it with the cProfile module built into Python.

python3 -m cProfile -s tottime <program_using_micrograd>

We rewrote the model behind https://tsterm.com using micrograd and profiled it.

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     3258    2.749    0.001    2.814    0.001 numeric.py:1002(tensordot)
     ...
     1440    1.009    0.001    1.266    0.001 engine.py:91(_backward)
     ...

cProfile’s output clearly ranks each forward or backward function of the mathematical operators by the total time, under the tottime column. On one run, the most costly was the tensordot operation (tensor multiplication), followed by the differentiation of the element-wise multiplication.

micrograd is comparable in performance

micrograd turns out not to lose out on performance. We benchmarked the model behind https://tsterm.com written with different libraries; the shorter the run time, the better.

Hardware               Operating System     TensorFlow   micrograd
x86_64 (AMD EPYC)      Amazon Linux 2       10s          12s
Aarch64 (Graviton3)    Ubuntu 24.04 LTS     13s          12s
Aarch64 (Graviton4)    Ubuntu 24.04 LTS     11s          11s

The model performs quantile regression on 600 megabytes of data in memory. The data type was float32.
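
The benchmarked model itself is not reproduced here. As background, quantile regression is commonly trained by minimising the pinball (quantile) loss; below is a plain-NumPy sketch of that loss, with an illustrative function name and made-up numbers:

    import numpy as np

    def pinball_loss(y_true, y_pred, q):
        # for quantile level q in (0, 1):
        #   q * (y - prediction)       where the prediction is below the observation,
        #   (1 - q) * (prediction - y) where it is above
        diff = y_true - y_pred
        return np.mean(np.maximum(q * diff, (q - 1.0) * diff))

    # q = 0.5 gives half the mean absolute error, i.e. median regression
    print(pinball_loss(np.array([1.0, 2.0, 3.0]), np.array([1.5, 1.5, 1.5]), 0.5))  # about 0.417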

micrograd is most widely deployable

The more complex a machine learning library is, the more restricted its deployability tends to be. For example, micrograd still runs on a machine with Alpine Linux, as it depends only on Python and NumPy, while the other libraries are not available there.

micrograd is easiest to maintain and extend

As we saw above, tensordot was the most costly operation. If you have an idea to accelerate a particular kind of tensordot, go into micrograd/engine.py, and add a few lines:


    def my_tensordot(self, other):

        out = ...        # a Value for the result, with (self, other) as its children

        def _forward(**kwds):
            pass         # recompute out.data from self.data and other.data
        out._forward = _forward

        def _backward():
            pass         # accumulate into self.grad and other.grad from out.grad
        out._backward = _backward

        return out

Any new operator can be defined likewise. That’s a snap!
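
As a concrete illustration of the pattern, here is a sketch of what an element-wise natural-logarithm operator could look like, mirroring the sum() excerpt above. The name log, the use of np.log (assuming NumPy is imported as np in engine.py), and the exact Value(...) call are illustrative assumptions, not code from the library:

    def log(self):
        out = Value(np.log(self.data), (self,), 'log')

        def _forward(**kwds):
            out.data = np.log(self.data)
        out._forward = _forward

        def _backward():
            # d(log x)/dx = 1/x element-wise; out.grad scales it via the chain rule
            self.grad += out.grad / self.data
        out._backward = _backward

        return out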

References

Introduction to Derivatives, Math is Fun, https://www.mathsisfun.com/calculus/derivatives-introduction.html

Differentiation, BBC Bitesize, https://www.bbc.co.uk/bitesize/guides/zyj77ty/

Dive into MLX, Pranay Saha, https://medium.com/@pranaysaha/dive-into-mlx-performance-flexibility-for-apple-silicon-651d79080c4c

How Fast is MLX?, Tristan Bilot, https://towardsdatascience.com/how-fast-is-mlx-a-comprehensive-benchmark-on-8-apple-silicon-chips-and-4-cuda-gpus-378a0ae356a0/