Language models

Andrej Karpathy wrote a self-contained, roughly 200-line microgpt.py in Python that trains, and runs inference with, the GPT-2 model, the second generation of the model behind ChatGPT.

The program reads people’s names from an input.txt file, then generates likely names from the statistical patterns in them. Each name is called a document. Each letter in a name is called a token. The program trains a model of the likelihood of one token given the preceding ones within a document,

P(token[j] given all tokens before token[j])

If input.txt contains a name “michaela”, the trained model should assign a higher probability to “h” after the preceding “mic” than to, say, “z” after “mic”.

The likelihood of a document is

P(token[0]) x P(token[1] given all tokens before it)
x P(token[2] given all tokens before it) x ...

The model loss is a metric inversely related to the modeled likelihood of a document. The program’s job is to minimise the loss, or equivalently to maximise the modeled likelihood of the documents, by adjusting the model’s coefficients.
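As a worked example with made-up numbers: if the model assigned the probabilities below to the four letters of a name “emma”, the likelihood of the document would be their product, and a natural loss is the average negative log probability, which is smallest exactly when the likelihood is largest. A minimal sketch:

import math

# hypothetical per-token probabilities the model might assign to "emma"
# (made-up numbers, purely for illustration)
probs = [0.06,   # P("e")
         0.20,   # P("m" given "e")
         0.35,   # P("m" given "em")
         0.30]   # P("a" given "emm")

likelihood = math.prod(probs)                          # P(document)
loss = -sum(math.log(p) for p in probs) / len(probs)   # average negative log probability
print(likelihood, loss)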

The four experiments

Karpathy’s version works only with scalar values, updating every model coefficient one by one. We modified the program in a few stages and timed each. Across setups, the speed-up is roughly 20 times when we use the vector extension on the hardware architecture, and a further 5 times when we change the language model to a simpler recurrent net that fires neurons sparsely, while reaching a comparable optimised loss and generating text of comparable quality to the original model.

We replaced the machine learning library with a vector-capable, roughly 500-line micrograd that does automatic differentiation.
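To give a flavour of what automatic differentiation means here, the core idea of reverse-mode autodiff can be sketched on scalars in a few dozen lines. This is only an illustrative toy, not the actual micrograd code, which also works on whole arrays:

class Scalar:
    # toy reverse-mode autodiff on scalars (illustration only)
    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None

    def __add__(self, other):
        out = Scalar(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Scalar(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # topological order, then propagate gradients from the output back to the inputs
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for c in v._children:
                    build(c)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

a, b = Scalar(2.0), Scalar(3.0)
y = a * b + a
y.backward()
print(a.grad, b.grad)   # 4.0 and 2.0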

| version | description | number of lines (the fewer the simpler) | run time (the lower the better) | optimised loss (the lower the better) |
|---|---|---|---|---|
| scalar | the original version, updating model coefficients one by one | | 160s | 2.66 |
| vector | uses the vector extension on the hardware architecture, updating a whole array of model coefficients at once | 164 | 7s | 2.63 |
| rnn | changes the model to a vanilla recurrent net, and uses the vector extension | 118 | 1.2s | 2.35 |
| rnn_att | as the rnn version, but at each step fires neurons sparsely rather than all | 120 | 1.3s | 2.30 |

The code for the four experiments is in the scalar, vector, rnn and rnn_att branches respectively. To work with microgpt, check out the att branch of the micrograd machine learning library and pip install it.

mkdir src; cd src
git clone https://github.com/jli05/microgpt
git clone https://github.com/brief-ds/micrograd
cd microgpt
python3 -m venv venv
. venv/bin/activate
cd ../micrograd && git checkout att && pip3 install . && cd ../microgpt

git checkout scalar
time python3 microgpt.py

git checkout vector
time python3 microgpt.py

# ...

Vector extensions

Various hardware architectures support a vector extension, for example AVX for x86_64, SVE for ARM, and the V extension for RISC-V.

The micrograd machine learning library delegates all numerical computation on arrays to NumPy. Simple functions on NumPy arrays leverage the vector extension on the hardware architecture. Matrix-vector multiplication, matrix-matrix multiplication and the like are further delegated by NumPy to the Basic Linear Algebra Subprograms (BLAS).

The Basic Linear Algebra Subprograms (BLAS) are compiled from C and assembly into machine code, which usually makes full use of any vector extension available.
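As a rough illustration (not code from microgpt): the same dot product written as a Python-level loop over coefficients and as a single NumPy call gives the same result, but the latter is dispatched to vectorised, compiled code and is typically far faster:

import numpy as np

a = np.random.rand(100_000)
b = np.random.rand(100_000)

# scalar style: a Python-level loop over individual coefficients
s = 0.0
for x, y in zip(a, b):
    s += x * y

# vector style: one NumPy call, dispatched to vectorised / BLAS code
s_vec = a @ b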

We also compiled a version of microgpt rewritten in C, comparable to the original “scalar” version that still updates the model coefficients one by one. Its run time was 24s. Set against the performance table above, this shows it is fine to write a compute-heavy program in Python, as long as hardware acceleration is leveraged under the hood.

Vanilla recurrent net

The attention mechanism in the Transformer model works well, but its compute is quadratic in the number of tokens. In a roadmap to a simple attention mechanism we proposed a recurrent net that sparsely selects which coordinates to update, to similar effect. If the number of selected coordinates is capped, the compute is linear in the number of tokens.
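To see where the quadratic cost comes from, here is a rough NumPy sketch (not the microgpt code) of single-head attention over n tokens, which builds an n-by-n score matrix, next to a recurrent update that touches each token once:

import numpy as np

n, d = 8, 16                        # n tokens, d-dimensional states
Q = np.random.rand(n, d)
K = np.random.rand(n, d)
V = np.random.rand(n, d)

# attention: the n x n score matrix makes the compute quadratic in n
# (causal masking omitted for brevity)
scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
attended = weights @ V

# recurrent net: one fixed-size state update per token, linear in n
W = np.random.rand(d, d)
h = np.zeros(d)
for t in range(n):
    h = np.maximum(h @ W + V[t], 0.0)   # relu update per token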

Before training, we set up the model symbolically: we specify how the functions chain together, so that the output of one is the input of another. No actual values are passed into the model yet. The token_id, pos_id and target_id below are just placeholders.

# recurrent state, initialised to zeros
h = Value(zeros(n_state,))
losses = []
avg_loss = []
for j in range(block_size):
    # placeholders for the j-th input token, its position and its target token
    token_id = Args(0, name=f'token{j}')
    pos_id = Args(0, name=f'pos{j}')
    target_id = Args(0, name=f'target{j}')

    # recurrent update: mix the previous state with the token and position embeddings
    h = (h @ state_dict['m']
         + state_dict['wte'].attend(token_id) @ state_dict['token_proj']
         + state_dict['wpe'].attend(pos_id) @ state_dict['pos_proj']).relu()

    # loss at step j: negative log probability of the target token
    logits = h @ state_dict['lm_head']
    loss = - logits.softmax().attend(target_id).log()
    losses.append(loss)
    # running average loss over tokens 0..j
    avg_loss.append(concatenate(losses, axis=0).mean())

At training time, loop over the documents. Say the current document has n tokens. First fill a dict with the input values,

    # keys match the placeholder names used when setting up the model;
    # the target for token j is the next token in the document
    io_dict = {}
    for j in range(n):
        io_dict[f'token{j}'] = tokens[j]
        io_dict[f'target{j}'] = tokens[j + 1]
        io_dict[f'pos{j}'] = j

As avg_loss[n - 1] is the average loss over tokens 0 to n - 1, n tokens in total, differentiate avg_loss[n - 1] with respect to the model parameters, then adjust the model parameters once to reduce the average loss.

    # compute the average loss
    avg_loss[n - 1].forward(**io_dict)

    # differentiate the average loss
    # with respect to the model parameters
    avg_loss[n - 1].backward()

    # adjust the model parameters
    # to minimise the average loss
    optimiser.step()

The complete code is in the rnn branch.

Sparsely fire neurons in the recurrent net

At each step, rather than firing all neurons, we can fire just some of them to reduce the compute. So instead of

    h = (h @ M + ...).relu()

we can do

    h = (h + h.attend(args) @ M.attend(args) + ...).relu()

Generally there has to be a model that decides args at each step. But if args is simply the top k indices of h, no separate model needs to be specified for args. In the last experiment, rnn_att, the recurrent model is set up symbolically as

for j in range(block_size):
    ...

    args = h.topk(n_att)
    h = (h + h.attend(args) @ state_dict['m'].attend(args)
         + state_dict['wte'].attend(token_id) @ state_dict['token_proj']
         + state_dict['wpe'].attend(pos_id) @ state_dict['pos_proj']).relu()

    ...

The remaining code stays largely the same as the rnn version.
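In plain NumPy, one plausible reading of the sparse update is to pick the top k coordinates of h and use only the corresponding rows of the mixing matrix, so a step costs on the order of k x n operations instead of n x n. A hypothetical sketch, since the exact attend/topk semantics belong to micrograd:

import numpy as np

def sparse_step(h, M, k):
    # hypothetical reading of h + h.attend(args) @ M.attend(args):
    # only the k largest coordinates of h contribute to the update
    args = np.argsort(h)[-k:]           # top-k indices of h
    update = h[args] @ M[args]          # use only the selected rows of M: O(k * n)
    return np.maximum(h + update, 0.0)  # relu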

Optimisation

The original version used Adam. For the other three, we used SGD with momentum, and tuned the learning rate to be slightly different across the versions.
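For reference, SGD with momentum keeps a velocity per parameter and reuses it across updates; a minimal sketch, assuming parameters and gradients are NumPy arrays (the actual optimiser in micrograd may differ in details such as the sign convention or decay):

import numpy as np

def sgd_momentum_step(params, grads, velocities, lr=0.01, momentum=0.9):
    # one update: fold the gradient into the velocity, then move the parameters
    for p, g, v in zip(params, grads, velocities):
        v *= momentum
        v += g
        p -= lr * v   # in-place update of the parameter array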

References

microgpt blog post, https://karpathy.github.io/2026/02/12/microgpt/

Karpathy’s microgpt, only working on scalar values, https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95

our microgpt, with four versions scalar, vector, rnn and rnn_att in the respective branches, https://github.com/jli05/microgpt

our tensor-capable micrograd, whose att branch powers the study in this post, https://github.com/brief-ds/micrograd

Advanced Vector Extensions (AVX) for Intel and Advanced Micro Devices, https://en.wikipedia.org/wiki/Advanced_Vector_Extensions

ARM SVE (Scalable Vector Extension), /2025/10/15/sve.html

RISC-V64 V extension for vector operations, /2025/10/31/rv64-v.html

Basic Linear Algebra Subprograms (BLAS), https://en.wikipedia.org/wiki/Basic_Linear_Algebra_Subprograms

A roadmap to a simple attention mechanism, /2026/02/10/roadmap-att.html