Four experiments on GPT-2
Andrej Karpathy published microgpt.py, a self-contained, roughly 200-line Python script that trains and runs inference with GPT-2, the second generation of the GPT series that later led to ChatGPT.

Karpathy's version works only with scalar values, handling every coefficient of the artificial neural network one by one. We modified it in several ways and timed each variant:
| version | description | number of lines (the fewer, the simpler) | run time (the lower, the better) | optimised loss (the lower, the better) |
|---|---|---|---|---|
| scalar | the original version, does the math one coefficient at a time | | 160s | 2.66 |
| vector | uses the microprocessor's vector extension to do the math concurrently | 164 | 7s | 2.63 |
| rnn | switches to a vanilla recurrent network, using the vector extension on the hardware | 118 | 1.2s | 2.35 |
| rnn_att | as above, but at each step sparsely selects coordinates of the state vector to transform | 119 | 1.2s | 2.44 |
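To see why the vector version is so much faster, here is a minimal sketch (not the actual microgpt code) contrasting scalar, one-coefficient-at-a-time arithmetic with the equivalent vectorized operation in NumPy:

```python
import time
import numpy as np

n = 1000
W = np.random.randn(n, n)
x = np.random.randn(n)

# Scalar style: each output coefficient computed element by element,
# as a pure-Python loop over individual multiplications and additions.
t0 = time.time()
y_scalar = [sum(W[i, j] * x[j] for j in range(n)) for i in range(n)]
t_scalar = time.time() - t0

# Vector style: the same matrix-vector product in a single call,
# which dispatches to hardware vector (SIMD) instructions.
t0 = time.time()
y_vector = W @ x
t_vector = time.time() - t0

assert np.allclose(y_scalar, y_vector)
print(f"scalar: {t_scalar:.3f}s, vector: {t_vector:.3f}s")
```

The two styles compute identical results; only the dispatch differs, which is where the roughly 20x speedup in the table comes from.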
The code for the four experiments is in the scalar, vector, rnn and rnn_att branches respectively. To work with it, clone the tensor-capable micrograd repository and check out its att branch:
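The rnn and rnn_att variants can be illustrated with a hypothetical sketch of one recurrent step. Here we assume the sparse selection picks the k largest-magnitude coordinates of the state; the exact selection rule in the repository may differ:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16          # state dimension (illustrative)
k = 4           # number of coordinates updated per step (illustrative)
W = rng.standard_normal((d, d)) / np.sqrt(d)
U = rng.standard_normal((d, d)) / np.sqrt(d)

def rnn_step(h, x):
    # Vanilla RNN: transform the full state vector at every step.
    return np.tanh(W @ h + U @ x)

def rnn_att_step(h, x):
    # Sparse variant: transform only k selected coordinates of the state,
    # here chosen as the k largest in magnitude (an assumption).
    idx = np.argsort(np.abs(h))[-k:]
    h_new = h.copy()
    h_new[idx] = np.tanh(W[idx] @ h + U[idx] @ x)
    return h_new

h = np.zeros(d)
x = rng.standard_normal(d)
h_full = rnn_step(h, x)
h_sparse = rnn_att_step(h, x)
print(np.count_nonzero(h_sparse != h))  # at most k coordinates changed
```

The sparse step touches only k rows of each weight matrix, so per-step cost shrinks while the rest of the state is carried over unchanged.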
```shell
mkdir src; cd src
git clone https://github.com/jli05/microgpt
git clone https://github.com/brief-ds/micrograd
cd microgpt
python3 -m venv venv
. venv/bin/activate
cd ../micrograd && git checkout att && pip3 install . && cd ../microgpt
git checkout scalar
time python3 microgpt.py
git checkout vector
time python3 microgpt.py
# ...
```
References
microgpt blog post, https://karpathy.github.io/2026/02/12/microgpt/
Karpathy’s microgpt, working only on scalar values, https://gist.github.com/karpathy/8627fe009c40f57531cb18360106ce95
our microgpt, with the four versions (scalar, vector, rnn, rnn_att) in the respective branches, https://github.com/jli05/microgpt
our tensor-capable micrograd, whose att branch powers the study in this post, https://github.com/brief-ds/micrograd