Background

The attention mechanism in neural network models underlies the phenomenal success of AI. The attention mechanism as commonly implemented today is computationally costly. Given n tokens, each with a value vector of size d, the total cost of transforming these tokens' values with "attention" is

O(n²d),

quadratic in terms of n.
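To make the quadratic cost concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention. The shapes and seed are illustrative; the point is the (n, n) score matrix, whose construction dominates both compute and memory.

```python
import numpy as np

def attention(Q, K, V):
    """Single-head scaled dot-product attention.
    Q, K, V have shape (n, d). The score matrix is (n, n),
    so compute and memory both scale as O(n^2 d)."""
    d = Q.shape[1]
    scores = Q @ K.T / np.sqrt(d)                  # (n, n): the quadratic term
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ V                             # (n, d)

n, d = 6, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
out = attention(X, X, X)   # self-attention over the token values
```

Doubling n quadruples the work spent on `scores`, which is the cost the rest of this proposal aims to avoid.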

ChatGPT prompt: "why is the current transformer deep learning model computationally costly? could you give mathematical notions and equations to illustrate?"

Aim of this project

We ask the question: Can selective sparse recurrent computation provide competitive predictive performance with significantly improved efficiency for long-context or streaming tasks, especially on low-power hardware targets?

Specifically, we will take a vanilla recurrent neural network and impose an attention vector on each state vector.

A Turing machine has a "head" that, at any point during its execution, is positioned over one cell of the memory tape. This "head" can be likened to attention. A recurrent neural net can simulate any Turing machine.

As one reads on, the model will be specified as a net of McCulloch-Pitts neurons with both inhibition levels and stimulus levels.

Method

In a recurrent neural net, the state vector plays the role of the memory tape of a Turing machine. The state vector changes as time proceeds. Denote the state vector at any time by X, a row vector of size m. Rather than multiplying the entire X by a matrix, we will attend over and transform only certain elements of X,

X + X[args] @ M[args]

before a non-linear transform on the result yields the next state vector, where @ denotes matrix multiplication.

If the number of args is capped, the above operation costs O(m). Over n tokens, it is

O(nm),

linear in terms of n.

If each row of M is sparse, X[args] @ M[args] will be a sparse vector, and X only needs a sparse increment for the next state vector, reducing the total compute further. Note that in the human brain, each neuron is on average connected to very few others: fewer than 10⁻⁵ of all neurons.
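A sketch of the sparse case, storing each attended row of M as a small index-to-weight mapping (the sizes and weights are made up for illustration):

```python
import numpy as np

m = 10
rng = np.random.default_rng(2)
X = rng.normal(size=m)
# each attended row of M has very few non-zeros, mimicking sparse connectivity
rows = {2: {5: 0.7, 9: -0.3},   # row index -> {column: weight}
        6: {0: 1.1}}
args = list(rows)

# the increment X[args] @ M[args] touches only the non-zero entries
increment = np.zeros(m)
for i in args:
    for j, w in rows[i].items():
        increment[j] += X[i] * w
X_next = np.tanh(X + increment)
```

The work done is proportional to the number of non-zeros in the attended rows, not to m².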

We will then carry out extensive training following the training details provided in

for various AI tasks, and report results.

the library that does autodifferentiation

Note that once the args are determined, by whatever means, the mathematical derivative of args with respect to any variable is usually zero. Backpropagation stops at args. This fact may allow us to be daring and chain (relate) coordinates across steps in flexible ways.
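The reason the derivative vanishes is that index selection is piecewise constant: a small perturbation of X leaves the selected indices unchanged. A tiny check (values chosen for illustration):

```python
import numpy as np

X = np.array([0.9, 0.1, 0.5, 0.3])

def topk_args(v, k=2):
    # indices of the k largest entries
    return tuple(np.argsort(-v)[:k])

args = topk_args(X)
# perturb each coordinate slightly: the selected indices do not move,
# so d(args)/dX = 0 almost everywhere and backpropagation stops here
eps = 1e-6
for i in range(len(X)):
    Xp = X.copy()
    Xp[i] += eps
    assert topk_args(Xp) == args
```

The derivative is non-zero only on the measure-zero set where two entries tie, which gradient descent almost never visits.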

We will use a tensor-capable micrograd for autodifferentiation, about 500 lines of Python whose only dependency is NumPy. On its att branch, attending over X is written X.attend(args). The attend() call takes care of backpropagating into the original X, so mathematical derivatives are computed with respect to X and the variables X depends on.

batched training

If each input instance occupies one row at a given time, the attention on it can differ from the attention on a separate input instance. How do we handle multiple training instances?

Note for one instance X and its attention indices args,

X[args] @ M[args]

is either a zero-row matrix (if args is empty) or a one-row matrix, so we can still go instance by instance, compute the above for the current instance, and vertically stack the results into a matrix. The new matrix will have no more rows than the preceding matrix of training instances.
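A sketch of this instance-by-instance stacking, with made-up per-instance attention indices; the middle instance has empty args, so its row disappears from the stacked result:

```python
import numpy as np

rng = np.random.default_rng(3)
m = 6
M = rng.normal(size=(m, m))
batch = rng.normal(size=(3, m))   # 3 training instances, one per row
# per-instance attention indices; the second instance attends to nothing
all_args = [np.array([0, 2]), np.array([], dtype=int), np.array([4])]

stacked = []
for x, args in zip(batch, all_args):
    if args.size:                  # empty args yields a zero-row matrix: skip
        stacked.append(x[args] @ M[args])
out = np.vstack(stacked)           # fewer rows than the input batch
```

Here `out` has 2 rows where `batch` had 3, matching the claim that the stacked matrix never grows.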

While a human is conscious, typically only one object is attended at a time. When asleep, the human may recall and process many instances in parallel. But if something during the day left a strong impression, it is possible that after one step of attention only one instance receives (strong) attention, and the matrix of training instances collapses to a single row.

how is the attention determined?

If it is simply the indices of the top k values, no model is needed; otherwise a model has to be specified. For example, we extend the model by allowing a vector of inhibition levels B alongside the vector of stimulus levels X,

args = (X - B).topk(k)
B = f(X[args], B[args])
X = g(X[args], B[args])

The functions f and g determine B and X at the next step, and need to be trained.

Each operation here costs O(m), if the size of args is capped.
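One step of this update can be sketched as follows. Two assumptions are made for illustration: f and g act only on the attended coordinates (the equations above leave this open), and the placeholder forms of f and g stand in for what would be trained functions.

```python
import numpy as np

def topk(v, k):
    # indices of the k largest entries; O(m) via argpartition
    return np.argpartition(-v, k)[:k]

def step(X, B, k, f, g):
    """One step of the stimulus/inhibition update (sketch).
    Assumes f and g update only the attended coordinates,
    with B updated before X as in the listed equations."""
    args = topk(X - B, k)
    B, X = B.copy(), X.copy()
    B[args] = f(X[args], B[args])
    X[args] = g(X[args], B[args])
    return X, B, args

# placeholder trainable functions, fixed here for illustration
f = lambda x, b: b + 0.1 * x       # inhibition builds up where stimulated
g = lambda x, b: np.tanh(x - b)    # stimulus damped by inhibition

rng = np.random.default_rng(4)
X, B = rng.normal(size=8), np.zeros(8)
X, B, args = step(X, B, k=3, f=f, g=g)
```

Each call touches only k coordinates plus one O(m) top-k pass, matching the stated O(m) per-step cost.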

information in the stimulus vector X

It can fall into several cases,

  • at some coordinates, external information from the senses: vision, hearing, etc. These coordinates are fixed, like the fixed addresses of input/output ports in computer architecture
  • at some coordinates, the results of internal processing, such as logical reasoning or dreaming
  • the remaining coordinates can be called the long-term memory, with information infrequently updated
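The layout above amounts to fixed index ranges within X. A sketch with hypothetical sizes (the names and boundaries are invented for illustration):

```python
import numpy as np

m = 1024
# fixed coordinate layout of the stimulus vector X (illustrative sizes)
SENSES   = slice(0, 64)      # external input, like fixed I/O port addresses
INTERNAL = slice(64, 192)    # results of internal processing
LONGTERM = slice(192, m)     # long-term memory, infrequently updated

X = np.zeros(m)
# a new sensory frame arrives: only the sense coordinates are written
X[SENSES] = np.random.default_rng(5).normal(size=64)
```

Because the regions are fixed, reading or writing any of them is a contiguous slice rather than a search.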

a toy example

We made a few variations on Andrej Karpathy's GPT-2 model microgpt.py. The last variation was a recurrent net that fires neurons sparsely. It ran about 5 times faster than the vectorised Transformer model, yielding comparable optimised loss and generating text of comparable quality.

version | description | number of lines (the fewer the simpler) | run time (the lower the better) | optimised loss (the lower the better)
--- | --- | --- | --- | ---
scalar | the original version, updating model coefficients one by one | | 160s | 2.66
vector | uses the vector extension of the hardware architecture, updating a whole array of model coefficients at once | 164 | 7s | 2.63
rnn | changes the model to a vanilla recurrent net, and uses the vector extension | 118 | 1.2s | 2.35
rnn_att | as the rnn version, but at each step fires neurons sparsely rather than all | 120 | 1.3s | 2.30

Expectation of outcome

We will collect enough data to see if selective sparse recurrent computation can provide competitive predictive performance with significantly improved efficiency for long-context or streaming tasks, especially on low-power hardware targets.

State Space Models model the evolution of the state with linear maps. Over all steps, the total compute is only linear in the number of tokens. Mamba and DeltaNet are two examples among others; they differ in how they scale the current state vector and how they compute the new incremental information.

model | state update equation
--- | ---
Mamba | h_t = a_t ⊙ h_{t-1} + B_t x_t
DeltaNet | h_t = h_{t-1} + g_t ⊙ (v_t − h_{t-1})

where ⊙ is element-wise multiplication, and h_t is the state at step t.
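Both updates cost a fixed amount per token, so the total over n tokens is linear in n. A single step of each, sketched in NumPy with made-up sizes and gate values:

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4
h = rng.normal(size=d)             # previous state h_{t-1}
x, v = rng.normal(size=d), rng.normal(size=d)
a, g = rng.uniform(size=d), rng.uniform(size=d)   # per-step gates in [0, 1)
B = rng.normal(size=(d, d))

h_mamba = a * h + B @ x            # Mamba:    h_t = a_t ⊙ h_{t-1} + B_t x_t
h_delta = h + g * (v - h)          # DeltaNet: h_t = h_{t-1} + g_t ⊙ (v_t − h_{t-1})
```

The DeltaNet form is an element-wise interpolation: rewriting it as (1 − g_t) ⊙ h_{t-1} + g_t ⊙ v_t shows the gate trades the old state against the new value.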

References

  • Turing, A. M. (1936). “On Computable Numbers, with an Application to the Entscheidungsproblem.” Proceedings of the London Mathematical Society. 2. 42 (published 1937): 230–265. doi:10.1112/plms/s2-42.1.230.
  • Siegelmann, H. T.; Sontag, E. D. (1995). “On the Computational Power of Neural Nets”. Journal of Computer and System Sciences. 50 (1): 132–150. https://doi.org/10.1006/jcss.1995.1013
  • McCulloch, W S.; Pitts, W (1943-12-01). “A logical calculus of the ideas immanent in nervous activity”. The Bulletin of Mathematical Biophysics. 5 (4): 115–133. doi:10.1007/BF02478259. ISSN 1522-9602.
  • Walter Pitts’ bibliography (accessed 10 February 2026). https://home.csulb.edu/~cwallis/artificialn/walter_pitts.html https://jli05.github.io/2024/04/05/Walter-Pitts-bibliography.html
  • Wikipedia, The Free Encyclopedia, s.v. “Neuron,” “Connectivity,” (accessed 10 February 2026), https://en.wikipedia.org/wiki/Neuron#Connectivity
  • Brief Solutions Ltd (2026). micrograd, att branch. A tiny autograd engine. https://github.com/brief-ds/micrograd/tree/att
  • Lee, J (2026). Four experiments on GPT-2. https://www.brief-ds.com/2026/03/16/gpt2.html
  • Gu, A, Dao, T (2024). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. https://arxiv.org/abs/2312.00752
  • Yang, S, et al (2025). Parallelizing Linear Transformers with the Delta Rule over Sequence Length. https://arxiv.org/abs/2406.06484