Background

The attention mechanism in neural network models underlies the phenomenal success of AI. The attention mechanism as commonly implemented today is computationally costly. Given n tokens, each with a value vector of size d, the total cost of transforming these tokens' values with "attention" is

O(n²d),

quadratic in terms of n.
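To make the quadratic cost concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The shapes and seed are illustrative; the point is the (n, n) score matrix, whose construction dominates both compute and memory.

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention.
    Q, K, V have shape (n, d). The score matrix is (n, n),
    so compute and memory both scale as O(n^2 d)."""
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)                  # (n, n): the quadratic term
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V                             # (n, d)

n, d = 6, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
out = attention(X, X, X)   # self-attention over the token values
```

Doubling n quadruples the work spent on `scores`, which is the cost the rest of this proposal aims to avoid.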

ChatGPT prompt: "why is the current transformer deep learning model computationally costly? could you give mathematical notions and equations to illustrate?"

Aim of this project

We ask the question: Can selective sparse recurrent computation provide competitive predictive performance with significantly improved efficiency for long-context or streaming tasks, especially on low-power hardware targets?

Specifically, we will take a vanilla recurrent neural network and impose an attention vector on each state vector.

A Turing machine has a "head" that, at any point during its execution, is positioned over one cell of the memory tape. This "head" can be likened to attention. A recurrent neural net can simulate any Turing machine.

As one reads on, the model will be specified as a net of McCulloch-Pitts neurons with both inhibition levels and stimulus levels.

Method

In a recurrent neural net, the state vector plays the role of the memory tape of a Turing machine. The state vector changes as time proceeds. Denote the state vector at any time by X, a row vector of size m. Rather than multiplying the entire X by a matrix, we will attend over and transform only certain elements of X,

X + X[args] @ M[args]

before a non-linear transform on the result yields the next state vector, where @ denotes matrix multiplication.

If the number of args is capped, the above operation costs O(m). Over n tokens, it is

O(nm),

linear in terms of n.

If each row of M is sparse, X[args] @ M[args] will be a sparse vector, and X only needs a sparse increment for the next state vector, reducing the total compute further. Note that in the human brain, each neuron is on average connected to very few others: fewer than 10⁻⁵ of all neurons.
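A sketch of the sparse case, storing each attended row of M as a small index-to-weight mapping (the sizes and weights are made up for illustration):

```python
import numpy as np

m = 10
rng = np.random.default_rng(2)
X = rng.normal(size=m)
# each attended row of M has very few non-zeros, mimicking sparse connectivity
rows = {2: {5: 0.7, 9: -0.3},   # row index -> {column: weight}
        6: {0: 1.1}}
args = list(rows)

# the increment X[args] @ M[args] touches only the non-zero entries
increment = np.zeros(m)
for i in args:
    for j, w in rows[i].items():
        increment[j] += X[i] * w
X_next = np.tanh(X + increment)
```

The work done is proportional to the number of non-zeros in the attended rows, not to m².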

We will then carry out extensive training following the training details provided in

for various AI tasks, and report results.

the library that does autodifferentiation

Note that once the args are determined, by whatever means, the mathematical derivative of args with respect to any variable is usually zero. Backpropagation stops at args. This fact may allow us to be daring and chain (relate) coordinates across steps in flexible ways.
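The reason the derivative vanishes is that index selection is piecewise constant: a small perturbation of X leaves the selected indices unchanged. A tiny check (values chosen for illustration):

```python
import numpy as np

X = np.array([0.9, 0.1, 0.5, 0.3])

def topk_args(v, k=2):
    # indices of the k largest entries
    return tuple(np.argsort(-v)[:k])

args = topk_args(X)
# perturb each coordinate slightly: the selected indices do not move,
# so d(args)/dX = 0 almost everywhere and backpropagation stops here
eps = 1e-6
for i in range(len(X)):
    Xp = X.copy()
    Xp[i] += eps
    assert topk_args(Xp) == args
```

The derivative is non-zero only on the measure-zero set where two entries tie, which gradient descent almost never visits.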

We will use a tensor-capable micrograd for autodifferentiation, about 500 lines of Python whose only dependency is NumPy. On its att branch, attending over X is written X.attend(args). The attend() call takes care of backpropagating into the original X, so mathematical derivatives are computed with respect to X and the variables X depends on.

batched training

If each input instance occupies one row at a given time, the attention on it can differ from the attention on a separate input instance. How do we handle multiple training instances?

Note for one instance X and its attention indices args,

X[args] @ M[args]

is either a zero-row matrix (if args is empty) or a one-row matrix, so we can still go instance by instance, compute the above for the current instance, and vertically stack the results into a matrix. The new matrix will have no more rows than the preceding matrix of training instances.
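A sketch of this instance-by-instance stacking, with made-up per-instance attention indices; the middle instance has empty args, so its row disappears from the stacked result:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 6
M = rng.normal(size=(m, m))
batch = rng.normal(size=(3, m))   # 3 training instances, one per row
# per-instance attention indices; the second instance attends to nothing
all_args = [np.array([0, 2]), np.array([], dtype=int), np.array([4])]

stacked = []
for x, args in zip(batch, all_args):
    if args.size:                  # empty args yields a zero-row matrix: skip
        stacked.append(x[args] @ M[args])
out = np.vstack(stacked)           # fewer rows than the input batch
```

Here `out` has 2 rows where `batch` had 3, matching the claim that the stacked matrix never grows.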

While a human is conscious, typically only one object is attended at a time. When asleep, the human may recall and process many instances in parallel. But if something during the day left a strong impression, it is possible that after one step of attention only one instance receives (strong) attention, and the matrix of training instances collapses to a single row.

how is the attention determined?

If it is simply the indices of the top k values, no model is needed; otherwise a model has to be specified. For example, we extend the model by allowing a vector of inhibition levels B alongside the vector of stimulus levels X,

args = (X - B).topk(k)
B = f(X[args], B[args])
X = g(X[args], B[args])

The functions f and g determine B and X at the next step, and need to be trained.

Each operation here costs O(m), if the size of args is capped.
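One step of this update can be sketched as follows. Two assumptions are made for illustration: f and g act only on the attended coordinates (the equations above leave this open), and the placeholder forms of f and g stand in for what would be trained functions.

```python
import numpy as np

def topk(v, k):
    # indices of the k largest entries; O(m) via argpartition
    return np.argpartition(-v, k)[:k]

def step(X, B, k, f, g):
    """One step of the stimulus/inhibition update (sketch).
    Assumes f and g update only the attended coordinates,
    with B updated before X as in the listed equations."""
    args = topk(X - B, k)
    B, X = B.copy(), X.copy()
    B[args] = f(X[args], B[args])
    X[args] = g(X[args], B[args])
    return X, B, args

# placeholder trainable functions, fixed here for illustration
f = lambda x, b: b + 0.1 * x       # inhibition builds up where stimulated
g = lambda x, b: np.tanh(x - b)    # stimulus damped by inhibition

rng = np.random.default_rng(4)
X, B = rng.normal(size=8), np.zeros(8)
X, B, args = step(X, B, k=3, f=f, g=g)
```

Each call touches only k coordinates plus one O(m) top-k pass, matching the stated O(m) per-step cost.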

information in the stimulus vector X

It can fall into several cases,

  • at some coordinates, external information from the senses: vision, hearing, etc. These coordinates are fixed, like the fixed addresses of input/output ports in computer architecture
  • at some coordinates, the results of internal processing, such as logical reasoning or dreaming
  • the remaining coordinates can be called the long-term memory, with information infrequently updated
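The layout above amounts to fixed index ranges within X. A sketch with hypothetical sizes (the names and boundaries are invented for illustration):

```python
import numpy as np

m = 1024
# fixed coordinate layout of the stimulus vector X (illustrative sizes)
SENSES   = slice(0, 64)      # external input, like fixed I/O port addresses
INTERNAL = slice(64, 192)    # results of internal processing
LONGTERM = slice(192, m)     # long-term memory, infrequently updated

X = np.zeros(m)
# a new sensory frame arrives: only the sense coordinates are written
X[SENSES] = np.random.default_rng(5).normal(size=64)
```

Because the regions are fixed, reading or writing any of them is a contiguous slice rather than a search.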

a toy example

We made a few variations on Andrej Karpathy's GPT-2 model microgpt.py. The last variation was a recurrent net that fires neurons sparsely. It ran about 5 times faster than the vectorised Transformer model, yielding comparable optimised loss and generating text of comparable quality.

version | description | number of lines (the fewer the simpler) | run time (the lower the better) | optimised loss (the lower the better)
--- | --- | --- | --- | ---
scalar | the original version, updating model coefficients one by one | | 160s | 2.66
vector | uses the vector extension of the hardware architecture, updating a whole array of model coefficients at once | 164 | 7s | 2.63
rnn | changes the model to a vanilla recurrent net, and uses the vector extension | 118 | 1.2s | 2.35
rnn_att | as the rnn version, but at each step fires neurons sparsely rather than all | 120 | 1.3s | 2.30

Expectation of outcome

We will collect enough data to see if selective sparse recurrent computation can provide competitive predictive performance with significantly improved efficiency for long-context or streaming tasks, especially on low-power hardware targets.

State Space Models model the evolution of the state with linear maps. Over all steps, the total compute is only linear in the number of tokens. Mamba and DeltaNet are two examples among others; they differ in how they scale the current state vector and how they compute the new incremental information.

model | state update equation
--- | ---
Mamba | h_t = a_t ⊙ h_{t-1} + B_t x_t
DeltaNet | h_t = h_{t-1} + g_t ⊙ (v_t − h_{t-1})

where ⊙ is element-wise multiplication, and h_t is the state at step t.
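Both updates cost a fixed amount per token, so the total over n tokens is linear in n. A single step of each, sketched in NumPy with made-up sizes and gate values:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4
h = rng.normal(size=d)             # previous state h_{t-1}
x, v = rng.normal(size=d), rng.normal(size=d)
a, g = rng.uniform(size=d), rng.uniform(size=d)   # per-step gates in [0, 1)
B = rng.normal(size=(d, d))

h_mamba = a * h + B @ x            # Mamba:    h_t = a_t ⊙ h_{t-1} + B_t x_t
h_delta = h + g * (v - h)          # DeltaNet: h_t = h_{t-1} + g_t ⊙ (v_t − h_{t-1})
```

The DeltaNet form is an element-wise interpolation: rewriting it as (1 − g_t) ⊙ h_{t-1} + g_t ⊙ v_t shows the gate trades the old state against the new value.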

References

  • Turing, A. M. (1936). “On Computable Numbers, with an Application to the Entscheidungsproblem.” Proceedings of the London Mathematical Society. 2. 42 (published 1937): 230–265. doi:10.1112/plms/s2-42.1.230.
  • Siegelmann, H. T.; Sontag, E. D. (1995). “On the Computational Power of Neural Nets”. Journal of Computer and System Sciences. 50 (1): 132–150. https://doi.org/10.1006/jcss.1995.1013
  • McCulloch, W S.; Pitts, W (1943-12-01). “A logical calculus of the ideas immanent in nervous activity”. The Bulletin of Mathematical Biophysics. 5 (4): 115–133. doi:10.1007/BF02478259. ISSN 1522-9602.
  • Walter Pitts’ bibliography (accessed 10 February 2026). https://home.csulb.edu/~cwallis/artificialn/walter_pitts.html https://jli05.github.io/2024/04/05/Walter-Pitts-bibliography.html
  • Wikipedia, The Free Encyclopedia, s.v. “Neuron,” “Connectivity,” (accessed 10 February 2026), https://en.wikipedia.org/wiki/Neuron#Connectivity
  • Brief Solutions Ltd (2026). micrograd, att branch. A tiny autograd engine. https://github.com/brief-ds/micrograd/tree/att
  • Lee, J (2026). Four experiments on GPT-2. https://www.brief-ds.com/2026/03/16/gpt2.html
  • Gu, A, Dao, T (2024). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. https://arxiv.org/abs/2312.00752
  • Yang, S, et al (2025). Parallelizing Linear Transformers with the Delta Rule over Sequence Length. https://arxiv.org/abs/2406.06484