<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://www.brief-ds.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://www.brief-ds.com/" rel="alternate" type="text/html" /><updated>2026-03-09T11:08:10+00:00</updated><id>https://www.brief-ds.com/feed.xml</id><title type="html">Brief Solutions Ltd</title><subtitle>We enable intelligent computing at one watt.</subtitle><entry><title type="html">A roadmap to a simple attention mechanism</title><link href="https://www.brief-ds.com/2026/02/10/roadmap-att.html" rel="alternate" type="text/html" title="A roadmap to a simple attention mechanism" /><published>2026-02-10T00:00:00+00:00</published><updated>2026-02-10T00:00:00+00:00</updated><id>https://www.brief-ds.com/2026/02/10/roadmap-att</id><content type="html" xml:base="https://www.brief-ds.com/2026/02/10/roadmap-att.html"><![CDATA[<h2 id="background">Background</h2>
<p>The attention mechanism in neural network models underlies the phenomenal success of AI. The attention mechanism as commonly implemented today is computationally costly. Given <math><mi>n</mi></math> tokens, each with a value of size <math><mi>d</mi></math>, the total computation cost is</p>

<math display="block">
<mi>O</mi><mo>(</mo><msup><mi>n</mi><mn>2</mn></msup><mi>d</mi><mo>)</mo>
</math>

<p>to transform these tokens’ values with “attention”, quadratic in terms of <math><mi>n</mi></math>.</p>

<p>We put this question to ChatGPT: “why the current transformer deep learning model is computationally costly? could you give mathematical notions and equations to illustrate?”</p>

<h2 id="aim-of-this-project">Aim of this project</h2>
<p>We ask whether a model with a less costly attention mechanism can behave as intelligently. In particular, we will take a vanilla recurrent neural network and impose an attention vector on each state vector.</p>

<p>The <a href="https://en.wikipedia.org/wiki/Turing_machine">Turing machine</a> has a “head” that, at any point during its execution, is positioned over a cell on the memory tape. This “head” can be likened to attention. A recurrent neural net can <a href="https://www.sciencedirect.com/science/article/pii/S0022000085710136">simulate any Turing machine</a>.</p>

<h2 id="method">Method</h2>
<p>In a recurrent neural net, the state vector plays the role of a Turing machine’s memory tape. The state vector evolves over time. Denote the state vector at any time by <code class="language-plaintext highlighter-rouge">X</code>, a row vector of size <math><mi>m</mi></math>. Rather than multiplying the entire <code class="language-plaintext highlighter-rouge">X</code> by a matrix, we will attend over and transform only certain elements in <code class="language-plaintext highlighter-rouge">X</code>,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>X + X[args] * M
</code></pre></div></div>

<p>then apply a non-linear function on the above result for the next state vector.</p>
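<p>A minimal sketch of one such recurrent step, assuming NumPy; the shapes of <code class="language-plaintext highlighter-rouge">args</code> and <code class="language-plaintext highlighter-rouge">M</code> and the choice of tanh are illustrative, not the project’s final design:</p>

```python
import numpy as np

def step(X, args, M):
    # attend over and transform only the selected elements of X;
    # X[args] has len(args) entries, M has len(args) rows and m columns
    update = X[args] @ M          # costs O(len(args) * m), not O(m * m)
    return np.tanh(X + update)    # non-linear function for the next state

m = 8
X = np.linspace(-1.0, 1.0, m)         # illustrative state vector
args = np.array([1, 3, 5])            # capped number of attended positions
M = np.full((len(args), m), 0.1)      # illustrative transformation matrix
X_next = step(X, args, M)
```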

<p>The transformation matrix <code class="language-plaintext highlighter-rouge">M</code> will have far fewer rows. If the number of <code class="language-plaintext highlighter-rouge">args</code> is capped, the above operation costs <math><mi>O</mi><mo>(</mo><mi>m</mi><mo>)</mo></math>. Over <math><mi>n</mi></math> tokens, it is</p>

<math display="block">
<mi>O</mi><mo>(</mo><mi>n</mi><mi>m</mi><mo>)</mo><mtext>,</mtext>
</math>

<p>linear in terms of <math><mi>n</mi></math>.</p>

<p>If each row of <code class="language-plaintext highlighter-rouge">M</code> is sparse, the product <code class="language-plaintext highlighter-rouge">X[args] * M</code> will be a sparse vector. <code class="language-plaintext highlighter-rouge">X</code> will only have to be sparsely incremented for the next state vector, reducing the total compute further. Note that in the <a href="https://en.wikipedia.org/wiki/Neuron#Connectivity">human brain</a>, each neuron is on average connected with very few others: fewer than <math><msup><mn>10</mn><mn>-5</mn></msup></math> of all.</p>

<p>In most machine learning libraries in Python, <code class="language-plaintext highlighter-rouge">X[args]</code> copies the selected elements into a new array. Backpropagation (the calculation of mathematical derivatives) will then be with respect to this new array, not the original <code class="language-plaintext highlighter-rouge">X</code>.</p>
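<p>For example, with NumPy, whose fancy indexing has exactly this copy semantics:</p>

```python
import numpy as np

X = np.arange(5.0)        # [0., 1., 2., 3., 4.]
args = [1, 3]
Y = X[args]               # fancy indexing returns a new array (a copy)
Y[0] = 100.0
print(X[1])               # prints 1.0: the original X is untouched
```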

<p>We have rewritten the autodifferentiation core of a machine learning library as a 500-line Python library, <a href="https://www.brief-ds.com/2025/09/25/tensorflow-mlx.html">micrograd</a>, which opens up the interface of each op(erator)’s forward and backward propagation for implementation. For example, attending over selected elements is</p>

<p><a href="https://github.com/brief-ds/micrograd/commit/61db262bbb2409974dd2615113dc443ec072e1f4">https://github.com/brief-ds/micrograd/commit/61db262bbb2409974dd2615113dc443ec072e1f4</a>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">attend</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">args</span><span class="p">):</span>
        <span class="n">out</span> <span class="o">=</span> <span class="n">Value</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">args</span><span class="p">],</span> <span class="p">(</span><span class="bp">self</span><span class="p">,),</span> <span class="s">'attend'</span><span class="p">)</span>

        <span class="k">def</span> <span class="nf">_forward</span><span class="p">(</span><span class="o">**</span><span class="n">kwds</span><span class="p">):</span>
            <span class="c1"># make a copy of attended data
</span>            <span class="n">out</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="n">args</span><span class="p">]</span>
        <span class="n">out</span><span class="p">.</span><span class="n">_forward</span> <span class="o">=</span> <span class="n">_forward</span>

        <span class="k">def</span> <span class="nf">_backward</span><span class="p">():</span>
            <span class="c1"># mathematical derivatives get propagated into
</span>            <span class="c1"># selected coordinates of the original vector
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">grad</span><span class="p">[</span><span class="n">args</span><span class="p">]</span> <span class="o">+=</span> <span class="n">out</span><span class="p">.</span><span class="n">grad</span>
        <span class="n">out</span><span class="p">.</span><span class="n">_backward</span> <span class="o">=</span> <span class="n">_backward</span>

        <span class="k">return</span> <span class="n">out</span>

</code></pre></div></div>

<p>We will then do extensive training, following the training details provided in</p>

<ul>
  <li>nanochat  <a href="https://github.com/karpathy/nanochat/">https://github.com/karpathy/nanochat/</a></li>
  <li>xLSTM     <a href="https://arxiv.org/abs/2510.02228">https://arxiv.org/abs/2510.02228</a></li>
</ul>

<p>for various AI tasks and report results.</p>

<h2 id="expectation-of-outcome">Expectation of outcome</h2>
<p>We will collect enough data to see if a simpler model can behave as intelligently as today’s popular chatbots.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Background The attention mechanism in neural network models underlies the phenomenal success of AI. The commonly implemented attention mechanism today is computationally costly. Given n tokens, and one value’s size d, the total computation cost is]]></summary></entry><entry><title type="html">Apollo Guidance Computer fixed-point arithmetic</title><link href="https://www.brief-ds.com/2025/12/16/AGC.html" rel="alternate" type="text/html" title="Apollo Guidance Computer fixed-point arithmetic" /><published>2025-12-16T00:00:00+00:00</published><updated>2025-12-16T00:00:00+00:00</updated><id>https://www.brief-ds.com/2025/12/16/AGC</id><content type="html" xml:base="https://www.brief-ds.com/2025/12/16/AGC.html"><![CDATA[<h2 id="introduction-to-the-apollo-guidance-computer">Introduction to the Apollo Guidance Computer</h2>
<p>The Apollo Guidance Computer was a digital computer produced for the Apollo program. The <a href="https://en.wikipedia.org/wiki/Apollo_command_and_service_module">command module</a> flew to lunar orbit. The <a href="https://en.wikipedia.org/wiki/Apollo_Lunar_Module">lunar module</a> then descended to the Moon and made its way back to the command module.</p>

<p>The <a href="https://ibiblio.org/apollo/">Virtual AGC</a> project was started in 2003 by volunteers and took about 20 years to restore the software, line by line, into digital form. The amount of effort involved was beyond description. One may clone the repository with all the code and documents:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/virtualagc/virtualagc.git
</code></pre></div></div>

<p>The <a href="https://ibiblio.org/apollo/Colossus.html">Colossus</a> software runs on the command module and the <a href="https://ibiblio.org/apollo/Luminary.html">Luminary</a> software runs on the lunar module. There is also the significant <a href="https://ibiblio.org/apollo/yaAGS.html">Abort Guidance System</a>, a backup computer system made by a different company and a different team of people, using only hardware technology common at the time.</p>

<blockquote>
  <p>The design principles developed for the AGC by MIT Instrumentation Laboratory, directed in late 1960s by Charles Draper, became foundational to software engineering—particularly for the design of more reliable systems that relied on asynchronous software, priority scheduling, testing, and human-in-the-loop decision capability.</p>
</blockquote>

<p>The processing logic runs at roughly the same speed as memory access. The clock that drives internal operations runs at 1.024 MHz. It takes 12 cycles to read from and write back to memory: 12 / (1.024×10^6/s) ≈ 11.7 microseconds. An assembly instruction takes one to several multiples of 11.7 microseconds to execute.</p>

<p>The basic unit of storage is a 15-bit word. One word is single precision; two words are double precision. Multiplying two single-precision integers yields a double-precision integer.</p>

<p>An interpreter is also implemented. The interpreter language can simulate a stack, do high-precision arithmetic and vector operations, essential to navigation math.</p>

<p>The operating system is real-time, consisting of one co-operative scheduler and one interrupt-driven pre-emptive scheduler. The pre-emptive scheduler famously cleaned low-priority routines out of memory when the processing logic was too busy.</p>

<h2 id="fixed-point-arithmetic">Fixed-point arithmetic</h2>
<p>The Apollo Guidance Computer only does integer multiplication and division. So how were <a href="https://en.wikipedia.org/wiki/Trigonometry">trigonometric functions</a> computed, which are essential to space flight and usually require <a href="https://www.mathisfun.com/numbers/real-numbers.html">real numbers</a>?</p>

<p>It approximates them with polynomials, as in a <a href="https://en.wikipedia.org/wiki/Taylor_series">Taylor series expansion</a>.</p>

<math display="block">
<mi>f</mi><mo>(</mo><mi>x</mi><mo>)</mo>
<mo>=</mo>
<msub><mi>a</mi><mn>0</mn></msub>
<mo>+</mo>
<msub><mi>a</mi><mn>1</mn></msub><mi>x</mi>
<mo>+</mo>
<mo>&#8230;</mo>
<mo>+</mo>
<msub><mi>a</mi><mi>n</mi></msub><msup><mi>x</mi><mi>n</mi></msup>
</math>

<p>For precise approximation formulas, search for documents entitled  “Guidance Equations” in the <a href="https://www.ibiblio.org/apollo/links.html">AGC Document Library</a> for various Apollo flights.</p>
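<p>Such a polynomial is cheap to evaluate iteratively. A Python sketch using Horner’s scheme; the coefficients below are illustrative (a truncated Taylor series for sin), not the flight coefficients:</p>

```python
def poly(x, coeffs):
    # Horner's scheme: a0 + x*(a1 + x*(a2 + ...))
    acc = 0.0
    for a in reversed(coeffs):
        acc = a + x * acc
    return acc

# illustrative coefficients: sin(x) ≈ x - x^3/6 + x^5/120
sin_coeffs = [0.0, 1.0, 0.0, -1.0/6, 0.0, 1.0/120]
approx = poly(0.5, sin_coeffs)   # close to sin(0.5) ≈ 0.4794
```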

<p>As a prelude to the next section, we introduce <a href="https://en.wikipedia.org/wiki/Fixed-point_arithmetic">fixed-point arithmetic</a>. Note that in the decimal system, for two integers,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>3 x 5 = 15
</code></pre></div></div>

<p>At the same time, two fractional numbers</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0.3 x 0.5 = 3 / 10 * 5 / 10 = 15 / 100 = 0.15
</code></pre></div></div>

<p>So, to compute the product of two fractional numbers with the same number of digits after the decimal point, we can take the fractional parts, do the arithmetic as with integers, and finally put the decimal point at the front of the integer result.</p>
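<p>The same trick in binary underlies the AGC’s arithmetic. A Python sketch with 14 fractional bits per word (an assumption mirroring the AGC’s 15-bit word with sign; ones’-complement sign handling is ignored):</p>

```python
FRAC_BITS = 14   # assumed: 14 magnitude bits per single-precision word

def to_fixed(f):
    # represent a fraction in [-1, 1) as an integer
    return round(f * (1 << FRAC_BITS))

def from_fixed(i):
    return i / (1 << FRAC_BITS)

def fixed_mul(a, b):
    # the full product has 2*FRAC_BITS fractional bits,
    # so shift right to renormalize
    return (a * b) >> FRAC_BITS

x = to_fixed(0.3)
y = to_fixed(0.5)
z = fixed_mul(x, y)
print(from_fixed(z))   # ≈ 0.15
```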

<h3 id="double-precision-multiplication">Double-precision multiplication</h3>
<p>How does it calculate a term <math><msub><mi>a</mi><mi>k</mi></msub><msup><mi>x</mi><mi>k</mi></msup></math> in the above polynomial? <math><mi>x</mi></math> would be a fractional number. The programmer has to remember that the decimal point is before the fractional part. The fractional part is stored as one or more 15-bit integers.</p>

<p>Below we illustrate with the code of Luminary099, the version that flew Apollo 11, the first crewed Moon-landing mission. The routine is <code class="language-plaintext highlighter-rouge">DMPSUB</code>, called by <code class="language-plaintext highlighter-rouge">POLY</code> for polynomial evaluation. Go to <a href="https://www.ibiblio.org/apollo/Luminary.html">https://www.ibiblio.org/apollo/Luminary.html</a>, in the version table find “Apollo 11”, version “099/1”, and click on the “syntax-highlighted, hyperlinked HTML”. On the page for version 099, scroll down, locate the symbol <code class="language-plaintext highlighter-rouge">DMPSUB</code> in the SymbolTable, and click on it.</p>

<p><code class="language-plaintext highlighter-rouge">DMPSUB</code> multiplies two double-precision fractional numbers and keeps the three most significant words as the result. The programmer has to treat the result as a fractional number, with a decimal point before all the words.</p>

<p>Denote the two multiplicands as x and y. The fractional part of x is stored in two words, x_major and x_minor; likewise for y. Just as in the multi-digit multiplication we learnt in primary school, we multiply different word pairs from the two multiplicands and sum the intermediate results.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>                     x_major  x_minor
                     y_major  y_minor
--------------------------------------
                     x_minor x y_minor
           x_major x y_minor   zeros
           y_major x x_minor   zeros
 x_major x y_major    zeros    zeros
</code></pre></div></div>
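<p>The word-pair scheme above can be sketched in Python; the base <code class="language-plaintext highlighter-rouge">B</code> and the word splitting are illustrative (ignoring the AGC’s ones’-complement sign handling):</p>

```python
B = 1 << 14   # word base; illustrative stand-in for a 15-bit AGC word
              # (14 magnitude bits; sign handling is ignored)

def dp_mul(x_major, x_minor, y_major, y_minor):
    # schoolbook scheme: sum the four word-pair products at their positions
    total = (x_minor * y_minor
             + (x_major * y_minor + x_minor * y_major) * B
             + (x_major * y_major) * B * B)
    # split into four words; like DMPSUB, keep the most significant three
    words = [(total >> (14 * k)) & (B - 1) for k in range(4)]
    return words[3], words[2], words[1]   # least significant word discarded

w3, w2, w1 = dp_mul(3, 5, 7, 11)
assert (w3 * B + w2) * B + w1 == ((3 * B + 5) * (7 * B + 11)) >> 14
```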

<p>In the code below, ADDRWD+0 is x_major, ADDRWD+1 is x_minor. MPAC+0 is y_major, MPAC+1 is y_minor. The result will be stored in MPAC, MPAC+1, MPAC+2, discarding the least significant word.</p>

<p>On the Apollo Guidance Computer, there is one register A as the accumulator, storing the current temporary result. Multiplication <code class="language-plaintext highlighter-rouge">MP</code> places the more significant part of the result in register A, and the less significant part in register L.</p>

<p>Refer to the code for the original comment. The comment below is added for this blog post.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>DMPSUB    INDEX   ADDRWD
          CA      1                # load ADDRWD+1 or x_minor into register A
          TS      MPAC     +2      # store register A to MPAC+2, x_minor at MPAC+2 now
          CAF     ZERO             # zero in A
          XCH     MPAC     +1      # exchange register A with MPAC+1, y_minor into A, zero into MPAC+1
          TS      MPTEMP           # store register A to MPTEMP, y_minor at MPTEMP now
          EXTEND
          MP      MPAC     +2      # multiply register A with MPAC+2, i.e. y_minor with x_minor

          XCH     MPAC     +2      # the most significant part of result in A, after exchanging, MPAC+2 holds the most significant part of the result, A holds x_minor
          EXTEND
          MP      MPAC             # multiply register A with MPAC, i.e. x_minor with y_major
          DAS     MPAC     +1      # add the two-word result to MPAC+1 and MPAC+2

          INDEX   ADDRWD
          CA      0                # x_major in register A
          XCH     MPTEMP           # exchange register A with MPTEMP, y_minor in A, x_major in MPTEMP
DMPSUB2   EXTEND
          MP      MPTEMP           # multiply register A with MPTEMP, i.e. y_minor with x_major
          DAS     MPAC     +1      # add the two-word result to MPAC+1 and MPAC+2

          XCH     MPAC             # exchange register A with MPAC+0, y_major in A, 0, 1 or -1 in MPAC
          EXTEND
          MP      MPTEMP           # multiply register A with MPTEMP, i.e. y_major with x_major
          DAS     MPAC             # add the two-word results to MPAC and MPAC+1
          TC      Q                # transfer control to the return address in register Q
</code></pre></div></div>

<p>If one clicks on the <code class="language-plaintext highlighter-rouge">ADDRWD</code>, <code class="language-plaintext highlighter-rouge">MPAC</code> in the original code, one will be redirected to their declarations in the erasable memory assignment file. Their addresses are fixed (no dynamic allocation of memory) but the content in them can change.</p>

<p>For more detail on each instruction and their effect, consult the <a href="https://ibiblio.org/apollo/assembly_language_manual.html">Assembly Language Manual</a> and the discussion at <a href="https://github.com/virtualagc/virtualagc/discussions/1262">https://github.com/virtualagc/virtualagc/discussions/1262</a>.</p>

<h2 id="references">References</h2>
<p>Apollo Guidance Computer, <a href="https://en.wikipedia.org/wiki/Apollo_Guidance_Computer">https://en.wikipedia.org/wiki/Apollo_Guidance_Computer</a></p>

<p>Luminary software for the Lunar Module, <a href="https://ibiblio.org/apollo/Luminary.html">https://ibiblio.org/apollo/Luminary.html</a></p>

<p>Virtual AGC Assembly-Language Manual, <a href="https://ibiblio.org/apollo/assembly_language_manual.html">https://ibiblio.org/apollo/assembly_language_manual.html</a></p>

<p>Virtual AGC Document Library, <a href="https://www.ibiblio.org/apollo/links.html">https://www.ibiblio.org/apollo/links.html</a></p>

<p>Eldon C. Hall (1996). Journey to the Moon: The History of the Apollo Guidance Computer</p>

<p>Hidden Figures, 2016 film.</p>

<p>Michael Steil, Christian Hessmann, the ultimate Apollo Guidance Computer talk, <a href="https://media.ccc.de/v/34c3-9064-the_ultimate_apollo_guidance_computer_talk">https://media.ccc.de/v/34c3-9064-the_ultimate_apollo_guidance_computer_talk</a></p>

<p>Steve Baines, the curious design of the Apollo Guidance Computer, <a href="https://media.ccc.de/v/emf2022-105-the-curious-design-of-the-apollo-guidance-computer">https://media.ccc.de/v/emf2022-105-the-curious-design-of-the-apollo-guidance-computer</a></p>

<p>Charles Averill, a brief analysis of the Apollo Guidance Computer, <a href="https://arxiv.org/abs/2201.08230">https://arxiv.org/abs/2201.08230</a></p>

<p>Mike Kohn, Apollo Guidance Computer in an FPGA, <a href="https://www.mikekohn.net/micro/apollo11_fpga.php">https://www.mikekohn.net/micro/apollo11_fpga.php</a></p>]]></content><author><name></name></author><summary type="html"><![CDATA[Introduction to the Apollo Guidance Computer Apollo Guidance Computer was a digital computer produced for the Apollo program. The command module flies to the moon orbit. The lunar module then descends to the moon and makes way back to the command module.]]></summary></entry><entry><title type="html">RISC-V64 V extension for vector operations</title><link href="https://www.brief-ds.com/2025/10/31/rv64-v.html" rel="alternate" type="text/html" title="RISC-V64 V extension for vector operations" /><published>2025-10-31T00:00:00+00:00</published><updated>2025-10-31T00:00:00+00:00</updated><id>https://www.brief-ds.com/2025/10/31/rv64-v</id><content type="html" xml:base="https://www.brief-ds.com/2025/10/31/rv64-v.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>As in the <a href="/2025/08/11/first-riscv64-program.html">first post about RISC-V</a>, RISC-V divides its instruction set into base sets and extensions. The V extension is for vector operations, the workhorse of machine learning. It is very similar to ARM64 SVE (Scalable Vector Extension), with one significant difference. Readers are encouraged to read the <a href="/2025/10/15/sve.html">counterpart post about ARM64 SVE</a>.</p>

<p>Again, the main reference is the RISC-V <a href="https://github.com/riscv/riscv-isa-manual"><strong>instruction set manual</strong></a>, whose README provides a link to the latest typeset <a href="https://riscv.github.io/riscv-isa-manual/snapshot/unprivileged/">specification of unprivileged instructions</a>. Section 30 is on the V extension. Appendix C provides multiple example V extension programs.</p>

<h2 id="risc-v64-v-extension">RISC-V64 V extension</h2>
<p>The V extension requires a 64-bit integer operation base set: RV64E or RV64I.</p>

<p>The V extension defines 32 registers: v0-v31. The length of a vector register is defined by VLEN: a vector register can hold four 32-bit numbers when VLEN = 128, or eight of them when VLEN = 256. A vector register can hold either multiple integers, or multiple floating-point numbers.</p>

<p>As with ARM64 SVE, the vector operation is agnostic of the length of a mathematical vector: the number of elements (numbers) in it. We just load as many of them as we can from the mathematical vector into the vector register(s), compute on the vector register(s), then proceed to the next batch of elements (numbers) in the mathematical vector, until we run out of elements to compute on.</p>
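<p>This strip-mining loop can be sketched in Python; <code class="language-plaintext highlighter-rouge">vlen_elems</code> is a hypothetical stand-in for the element count that <code class="language-plaintext highlighter-rouge">vsetvli</code> would report:</p>

```python
def vmul(x, y, vlen_elems=4):
    # strip-mining: process up to vlen_elems elements per "vector
    # instruction", until we run out of elements to compute on
    z = [0] * len(x)
    i = 0
    while i < len(x):
        vl = min(vlen_elems, len(x) - i)   # like vsetvli t0, a0, ...
        for k in range(vl):                # one simulated vector op
            z[i + k] = x[i + k] * y[i + k]
        i += vl                            # advance the pointers
    return z

print(vmul([1, 3, 4, 5, 6, 7], [5, 6, 7, 8, 9, 10]))
# → [5, 18, 28, 40, 54, 70]
```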

<h2 id="example">Example</h2>
<p>As in the <a href="/2025/08/11/first-riscv64-program.html">introductory post</a>, we first build the development toolchain on an x86-64 host: assembler, compiler, linker etc. for bare-metal programming. This means that when we compile programs with this toolchain, the final executable needs minimal support from the operating system (to manage memory etc.), or from a simulator interpreting the RISC-V instructions on x86-64.</p>

<p>Below we multiply two vectors element-wise. If the <code class="language-plaintext highlighter-rouge">vmul</code> routine were more complex, stack operations to <a href="/2025/08/11/first-riscv64-program.html">preserve integer registers</a> and <a href="https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf">floating-point registers</a> might be necessary.</p>

<p><code class="language-plaintext highlighter-rouge">vmul.s</code>:</p>

<pre><code class="language-asm">.global vmul

.text
vmul:
    # vmul(size_t n, int *x, int *y, int *z)
    #
    # multiplies two vectors element-wise.
    #
    # input: n, x, y
    # output: z
    #
    # n in register a0
    # x in register a1
    # y in register a2
    # z in register a3

vmul_loop:
    beq a0, zero, vmul_return     # if a0 == 0, go to vmul_return

    # given the remaining number of elements to process in a0,
    # return the maximum number of elements that can be handled
    # by one vector instruction in t0.
    #
    # e32: each element (number) in the mathematical vector
    #      is 32-bit.
    #
    # m1:  grouping of 1 vector register.
    #
    # ta:  tail agnostic, when we are computing on the "tail"
    #      of the input vector, the last few elements in it.
    #      "ta" is usually a good setting. Leave it.
    #
    # ma:  mask agnostic, when a mask vector is present to indicate
    #      which elements in the vector register enter computation,
    #      which do not. "ma" is usually a good setting. Leave it.
    #
    # If the configuration is e32 and m1, and one vector register is 128-bit,
    #
    #   t0 = 4    if a0 &gt;= 4
    #   t0 = a0   otherwise.
    #
    # If the configuration is e32 and m4, and one vector register is 128-bit,
    #
    #   t0 = 16   if a0 &gt;= 16
    #   t0 = a0   otherwise.
    #
    vsetvli t0, a0, e32, m1, ta, ma

    vle32.v v0, (a1)        # load from memory into vector register
    vle32.v v1, (a2)
    vmul.vv v0, v0, v1      # integer multiplication element-wise
    vse32.v v0, (a3)        # store from vector register into memory

    sub a0, a0, t0          # having processed t0 numbers,
                            # decrement the number of elements to process

    slli t0, t0, 2          # t0 32-bit numbers take t0 x 4 bytes
                            # set t0 to t0 &lt;&lt; 2 = t0 * 4

    add a1, a1, t0          # move the pointers to point to the next
    add a2, a2, t0          # element (32-bit number) to load or save
    add a3, a3, t0

    j vmul_loop             # jump to vmul_loop

vmul_return:
    mv a0, zero             # set return value
    ret
</code></pre>

<p><code class="language-plaintext highlighter-rouge">main.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">&lt;stddef.h&gt;</span><span class="cp">
#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">vmul</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">n</span><span class="p">,</span> <span class="k">const</span> <span class="kt">int</span> <span class="o">*</span><span class="n">x</span><span class="p">,</span> <span class="k">const</span> <span class="kt">int</span> <span class="o">*</span><span class="n">y</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="n">z</span><span class="p">);</span>

<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
  <span class="kt">size_t</span> <span class="n">n</span> <span class="o">=</span> <span class="mi">6</span><span class="p">;</span>
  <span class="kt">int</span> <span class="n">x</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">1</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">};</span>
  <span class="kt">int</span> <span class="n">y</span><span class="p">[]</span> <span class="o">=</span> <span class="p">{</span><span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">10</span><span class="p">};</span>
  <span class="kt">int</span> <span class="n">z</span><span class="p">[</span><span class="n">n</span><span class="p">];</span>

  <span class="kt">int</span> <span class="n">k</span> <span class="o">=</span> <span class="n">vmul</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="n">z</span><span class="p">);</span>

  <span class="k">for</span> <span class="p">(</span><span class="kt">size_t</span> <span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">n</span><span class="p">;</span> <span class="o">++</span><span class="n">i</span><span class="p">)</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%i "</span><span class="p">,</span> <span class="n">z</span><span class="p">[</span><span class="n">i</span><span class="p">]);</span>
  <span class="n">printf</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>

  <span class="k">return</span> <span class="n">k</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>To make the executable file,</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>/opt/rv64gcv/bin/riscv64-unknown-elf-as <span class="nt">--march</span><span class="o">=</span>rv64gcv <span class="nt">-o</span> vmul.o vmul.s
<span class="nv">$ </span>/opt/rv64gcv/bin/riscv64-unknown-elf-gcc <span class="nt">-o</span> main vmul.o main.c
</code></pre></div></div>

<p>As we built it with the toolchain for bare-metal programming, we can directly run it,</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./main
5 18 28 40 54 70
</code></pre></div></div>

<p>It can equally be run with a simulator that interprets RISC-V instructions on x86-64. Under Ubuntu Linux,</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>apt <span class="nb">install </span>qemu-user
<span class="nv">$ </span>qemu-riscv64 <span class="nt">-cpu</span> rv64,v<span class="o">=</span>on,vext_spec<span class="o">=</span>v1.0,vlen<span class="o">=</span>128 main
5 18 28 40 54 70
</code></pre></div></div>

<p>To adapt <code class="language-plaintext highlighter-rouge">vmul.s</code> for floating-point multiplication, change one single line to</p>

<pre><code class="language-asm">    vfmul.vv v0, v0, v1
</code></pre>

<p>and change accordingly <code class="language-plaintext highlighter-rouge">main.c</code>.</p>

<h2 id="the-flexibility-of-the-v-extension">The flexibility of the V extension</h2>
<p>In ARM64 SVE, one vector instruction largely operates on a single vector register. In the RISC-V64 V extension, one may group multiple vector registers for one vector instruction.</p>

<p>The <code class="language-plaintext highlighter-rouge">m1</code> label in the <code class="language-plaintext highlighter-rouge">vsetvli</code> instruction specifies that one vector instruction is carried out on one vector register only.</p>

<pre><code class="language-asm">    # given the remaining number of elements to process in a0,
    # return the maximum number of elements that can be handled
    # by one vector instruction in t0.
    #
    vsetvli t0, a0, e32, m1, ta, ma
</code></pre>

<p>But we can make one vector instruction work on more than one vector register, and thus more 32-bit elements, as reflected in <code class="language-plaintext highlighter-rouge">t0</code>. Note in the comment,</p>

<pre><code class="language-asm">    # If the configuration is e32 and m1, and one vector register is 128-bit,
    #
    #   t0 = 4    if a0 &gt;= 4
    #   t0 = a0   otherwise.
    #
    # If the configuration is e32 and m4, and one vector register is 128-bit,
    #
    #   t0 = 16   if a0 &gt;= 16
    #   t0 = a0   otherwise.
    #
</code></pre>

<p>In the V extension, version 1.0, one may group a maximum of 8 vector registers together (Section 30.6.1 <code class="language-plaintext highlighter-rouge">vtype</code> encoding), in which case,</p>

<pre><code class="language-asm">    # If the configuration is e32 and m8, and one vector register is 128-bit,
    #
    #   t0 = 32   if a0 &gt;= 32
    #   t0 = a0   otherwise.
    #
</code></pre>

<p>If one vector instruction can work on more elements, the loop will require fewer steps to fully process the input.</p>

<p>When four vector registers are grouped together by <code class="language-plaintext highlighter-rouge">m4</code>, <code class="language-plaintext highlighter-rouge">v0</code> will refer to the content in the original <code class="language-plaintext highlighter-rouge">v0</code>, <code class="language-plaintext highlighter-rouge">v1</code>, <code class="language-plaintext highlighter-rouge">v2</code>, <code class="language-plaintext highlighter-rouge">v3</code>; <code class="language-plaintext highlighter-rouge">v4</code> to the content in the original <code class="language-plaintext highlighter-rouge">v4</code>, <code class="language-plaintext highlighter-rouge">v5</code>, <code class="language-plaintext highlighter-rouge">v6</code>, <code class="language-plaintext highlighter-rouge">v7</code>; so on and so forth. One can only access the vector registers via <code class="language-plaintext highlighter-rouge">v0</code>, <code class="language-plaintext highlighter-rouge">v4</code>, <code class="language-plaintext highlighter-rouge">v8</code>, …, up to <code class="language-plaintext highlighter-rouge">v28</code> in the program.</p>

<p>Likewise, if eight vector registers are grouped by <code class="language-plaintext highlighter-rouge">m8</code>, the programmer can only access <code class="language-plaintext highlighter-rouge">v0</code>, <code class="language-plaintext highlighter-rouge">v8</code>, <code class="language-plaintext highlighter-rouge">v16</code>, <code class="language-plaintext highlighter-rouge">v24</code> in the program.</p>
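<p>The grouping rules can be sketched in a few lines of Python. This is an illustrative model only: the 128-bit register width and the <code class="language-plaintext highlighter-rouge">vsetvli</code> behaviour below restate the examples in this post, not the full specification.</p>

```python
# A rough model of how `vsetvli` picks the number of elements (vl)
# and which register names remain addressable under grouping.
# VLEN = 128 bits is an assumption for illustration.

def vsetvli(avl, sew=32, lmul=1, vlen=128):
    """Return vl, the number of elements one vector instruction handles."""
    vlmax = (vlen // sew) * lmul   # max elements across the register group
    return min(avl, vlmax)

def accessible_registers(lmul):
    """Register names a program may use when lmul registers are grouped."""
    return [f"v{i}" for i in range(0, 32, lmul)]

# e32, m1, 128-bit registers: at most 4 elements per instruction
assert vsetvli(avl=9, lmul=1) == 4
# e32, m4: at most 16 elements; with only 9 left, vl = 9
assert vsetvli(avl=9, lmul=4) == 9
# m4 grouping leaves only v0, v4, v8, ..., v28 addressable
assert accessible_registers(4) == ["v0", "v4", "v8", "v12",
                                   "v16", "v20", "v24", "v28"]
```

<p>With <code class="language-plaintext highlighter-rouge">m8</code> the same model yields vl up to 32 and leaves only v0, v8, v16, v24 addressable, matching the comments above.</p>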

<p>Therefore, if we group four vector registers by <code class="language-plaintext highlighter-rouge">m4</code> to increase the number of elements processed by one vector instruction, three lines in <code class="language-plaintext highlighter-rouge">vmul.s</code> need to be changed:</p>

<pre><code class="language-asm">    vsetvli t0, a0, e32, m4, ta, ma
    ...
    vle32.v v4, (a2)
    vmul.vv v0, v0, v4      # integer multiplication element-wise
</code></pre>

<p>The number of loop iterations is reduced fourfold. Code calling the <code class="language-plaintext highlighter-rouge">vmul</code> routine, such as <code class="language-plaintext highlighter-rouge">main.c</code>, remains unchanged.</p>

<h2 id="the-hardware-design">The hardware design</h2>
<p>ETH Zürich and the Barcelona Supercomputing Centre have developed hardware designs for the V extension. Andes Technology, SiFive and a few other companies produce commercial designs.</p>

<p>The <a href="https://www.pulp-platform.org/">Parallel Ultra Low Power platform</a> is a collaboration between ETH Zürich and the University of Bologna that develops open-source, scalable and energy-efficient RISC-V hardware and software. The “Training” and “Conference talks” sections are particularly informative, covering much more than just the V extension. For example, the talk “Understanding performance numbers in Integrated Circuit Design” was an excellent exposition.</p>

<p>The current designs for the V extension generally impose a fixed maximum width for concurrent data, as ARM64 SVE does. For example, if one vector register is 128-bit and the maximum width for concurrent data is 256-bit, then two vector registers’ worth of data is the maximum per pass. If the programmer specifies <code class="language-plaintext highlighter-rouge">m4</code>, grouping four vector registers, the hardware will actually process the content of the first two vector registers, then that of the next two, sequentially.</p>

<p>It seems that if the hardware is built from simple units, each of which can be adaptively turned on and off, it could exploit the full flexibility that the V extension specification allows.</p>

<p>The latest designs, with a throughput of 80 32-bit numbers (80 × 32 = 2560 bits) per clock cycle, already run at below 1 watt (<a href="https://dl.acm.org/doi/full/10.1145/3575861">Vitruvius+, Barcelona Supercomputing Centre</a>). A complete ARM64 processor, such as Apple’s M series, usually runs at around 5 watts under load. The throughput of <a href="https://en.wikipedia.org/wiki/AWS_Graviton">Graviton 3 or 4</a>, the ARM64 processors designed by Amazon, is 16 32-bit numbers (16 × 32 = 4 × 128 = 512 bits) per clock cycle for the SVE (Scalable Vector Extension).</p>
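<p>A quick back-of-the-envelope check of the throughput figures quoted above:</p>

```python
# Bits moved per clock cycle by each design, from the numbers in the text.
vitruvius_bits = 80 * 32   # 80 32-bit lanes per cycle (Vitruvius+)
graviton_bits = 16 * 32    # 16 32-bit lanes, i.e. 4 x 128-bit SVE

assert vitruvius_bits == 2560
assert graviton_bits == 4 * 128 == 512
assert vitruvius_bits // graviton_bits == 5   # 5x the data per cycle
```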

<h2 id="openblas-kernels-with-risc-v64-v-extension">OpenBLAS kernels with RISC-V64 V extension</h2>
<p>In <a href="https://github.com/OpenMathLib/OpenBLAS">OpenBLAS</a> the linear algebra library, for example, for the general matrix-vector multiplication (gemv) routine, the files with <code class="language-plaintext highlighter-rouge">_rvv</code> or <code class="language-plaintext highlighter-rouge">_vector</code> in the file name use the RISC-V64 V extension:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kernel/riscv64/gemv_n.c
kernel/riscv64/gemv_n_rvv.c
kernel/riscv64/gemv_n_vector.c
kernel/riscv64/gemv_t.c
kernel/riscv64/gemv_t_rvv.c
kernel/riscv64/gemv_t_vector.c
kernel/riscv64/zgemv_n.c
kernel/riscv64/zgemv_n_rvv.c
kernel/riscv64/zgemv_n_vector.c
kernel/riscv64/zgemv_t.c
kernel/riscv64/zgemv_t_rvv.c
kernel/riscv64/zgemv_t_vector.c
</code></pre></div></div>

<h2 id="references">References</h2>
<p>RISC-V instruction set manual, <a href="https://github.com/riscv/riscv-isa-manual">https://github.com/riscv/riscv-isa-manual</a>, the main reference for RISC-V</p>

<p>… from the above GitHub repository, a link to the typeset specification of unprivileged instructions, <a href="https://riscv.github.io/riscv-isa-manual/snapshot/unprivileged/">https://riscv.github.io/riscv-isa-manual/snapshot/unprivileged/</a>, with Section 30 on V extension, Appendix C providing multiple example V extension programs</p>

<p>Introduction to RISC-V, its integer registers, <a href="/2025/08/11/first-riscv64-program.html">/2025/08/11/first-riscv64-program.html</a></p>

<p>RISC-V calling convention, its floating-point registers, <a href="https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf">https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf</a></p>

<p>RISC-V announces ratification of the RVA23 profile standard, October 2024, <a href="https://riscv.org/blog/risc-v-announces-ratification-of-the-rva23-profile-standard/">https://riscv.org/blog/risc-v-announces-ratification-of-the-rva23-profile-standard/</a></p>

<p>RVA23 profiles, <a href="https://github.com/riscv/riscv-profiles/blob/main/src/rva23-profile.adoc">https://github.com/riscv/riscv-profiles/blob/main/src/rva23-profile.adoc</a></p>

<p>ARM64 SVE (Scalable Vector Extension), <a href="/2025/10/15/sve.html">/2025/10/15/sve.html</a></p>

<p>PULP (Parallel Ultra Low Power) Platform, <a href="https://www.pulp-platform.org/">https://www.pulp-platform.org/</a></p>

<p>PULP (Parallel Ultra Low Power) training, <a href="https://www.pulp-platform.org/pulp_training.html">https://www.pulp-platform.org/pulp_training.html</a></p>

<p>PULP (Parallel Ultra Low Power) conference and workshop materials, <a href="https://www.pulp-platform.org/conferences.html">https://www.pulp-platform.org/conferences.html</a></p>

<p>…, on which for example “Understanding performance numbers in Integrated Circuit Design” was an excellent exposition.</p>

<p>PULP (Parallel Ultra Low Power) 10-year conference, <a href="https://pulp-platform.org/10years/">https://pulp-platform.org/10years/</a></p>

<p>Vitruvius+: an Efficient RISC-V Decoupled Vector Coprocessor for High Performance Computing Applications, <a href="https://dl.acm.org/doi/full/10.1145/3575861">https://dl.acm.org/doi/full/10.1145/3575861</a>, the design is already upgraded for the V extension, version 1.0 according to the author</p>

<p>AWS Graviton, ARM64 processor designed by Amazon with support for the SVE (Scalable Vector Extension), <a href="https://en.wikipedia.org/wiki/AWS_Graviton">https://en.wikipedia.org/wiki/AWS_Graviton</a></p>

<p>OpenBLAS, <a href="https://github.com/OpenMathLib/OpenBLAS">https://github.com/OpenMathLib/OpenBLAS</a></p>]]></content><author><name></name></author><summary type="html"><![CDATA[Introduction As in the first post about RISC-V, RISC-V would divide instruction set groups into base sets and extensions. V extension is for vector operations, workhorse of machine learning. It is very similar to ARM64 SVE (Scalable Vector Extension) with one significant difference. Readers are encouraged to read the counterpart about ARM64 SVE.]]></summary></entry><entry><title type="html">ARM64 SVE (Scalable Vector Extension)</title><link href="https://www.brief-ds.com/2025/10/15/sve.html" rel="alternate" type="text/html" title="ARM64 SVE (Scalable Vector Extension)" /><published>2025-10-15T00:00:00+00:00</published><updated>2025-10-15T00:00:00+00:00</updated><id>https://www.brief-ds.com/2025/10/15/sve</id><content type="html" xml:base="https://www.brief-ds.com/2025/10/15/sve.html"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Traditional computer architectures store one number in one register and perform operations on single numbers. For artificial neural networks, vector operations are unavoidable, e.g. multiplying a 50-element vector by another 50-element vector (one “element” is one number).</p>

<h2 id="scalable-vector-extension">Scalable Vector Extension</h2>
<p>SVE introduces vector registers Z0-Z31. The length of a vector register is set by the chip designer: a 128-bit vector register can hold four 32-bit numbers, a 256-bit one eight of them. A vector register Zn can hold either multiple integers or multiple floating-point numbers, unlike the distinction between Xn for integers and Dn for floating-point numbers.</p>

<p>Suppose that on an architecture with 256-bit vector registers, we are going to compute the element-wise multiplication of an 18-element vector with another 18-element vector, each element being a 32-bit number.</p>

<p>With SVE, we will</p>

<ol>
  <li>load the first eight elements of both vectors (stored in the memory) into two vector registers (on the CPU), multiply them element-wise, write out the result;</li>
  <li>load the next eight elements into the vector registers, do the same;</li>
  <li>in the last iteration, load the remaining two elements of both vectors, and do the job.</li>
</ol>

<p>A key concept is the <strong>predicate</strong>, which indicates, element by element, whether a lane of the vector register takes part in the computation. In the first and second iterations above, the predicate would be all on</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1  1  1  1  1  1  1  1
</code></pre></div></div>

<p>but in the last iteration, it would be on only for the first two elements, so the element-wise multiplication proceeds correctly.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1  1  0  0  0  0  0  0
</code></pre></div></div>
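<p>The predicated loop can be simulated in NumPy. This is a sketch of the idea, not SVE itself: 256-bit registers are assumed, so each iteration covers eight 32-bit elements, and the predicate masks off lanes past the end of the 18-element inputs.</p>

```python
import numpy as np

def sve_style_mul(a, b, lanes=8):
    """Element-wise multiply, processed in predicated chunks of `lanes`."""
    c = np.zeros_like(a)
    i = 0
    while i < len(a):
        # whilelo-style predicate: on while the global index is below len(a)
        pred = (i + np.arange(lanes)) < len(a)
        active = pred.sum()
        c[i:i + active] = a[i:i + active] * b[i:i + active]
        i += lanes                       # advance by one register's worth
    return c

a = np.arange(1, 19, dtype=np.int32)     # 18 elements: 8 + 8 + 2
b = np.arange(18, 0, -1, dtype=np.int32)
assert (sve_style_mul(a, b) == a * b).all()
```

<p>The third iteration runs with only two active lanes, mirroring the 1 1 0 0 0 0 0 0 predicate shown above.</p>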

<h2 id="example">Example</h2>
<p>If the <code class="language-plaintext highlighter-rouge">vmul</code> routine were more complex, stack operations to <a href="/2025/07/14/arm64-ldr-and-str.html">preserve integer registers</a> and <a href="/2025/08/12/arm64-floating-point.html">floating point registers</a> may be necessary.</p>

<p><code class="language-plaintext highlighter-rouge">vmul.s</code>:</p>

<pre><code class="language-asm">
.global vmul

.text
vmul:
    // vmul(n, a, b, c)
    //  input:  n in x0, pointer a in x1, pointer b in x2
    //  output: pointer c in x3
    // 
    // a, b, c point to arrays of 32-bit numbers.
    // The three arrays will be of the same data type: either int32
    //  or float32, and of the same length n.

    // count of "words" in one vector register
    // a word is 32-bit
    // if one vector register is 256-bit, x4 = 8;
    // if one vector register is 128-bit, x4 = 4.
    cntw x4

    // index of the current number to be processed
    //  in any input array
    mov x5, #0

vmul_loop:
    // p0 the "predicate", array of length x4.
    // 
    // for each 0 &lt;= i &lt; x4,
    //   p0[i] = 1 if x5 + i &lt; x0,
    //   p0[i] = 0 otherwise.
    whilelo p0.s, x5, x0

    // if no 1 in the predicate, go to vmul_return
    b.none vmul_return

    // load into the vector register z1 numbers in the input array
    // while the predicate is on. Load zero if the predicate is off.
    //
    // the memory address of the first number to be processed:
    //   x1 + x5 * 4 = x1 + (x5 &lt;&lt; 2)
    // as each number is 32-bit, or 4 bytes in this program.
    ld1w z1.s, p0/z, [x1, x5, lsl #2]       // lsl: logical shift left
    ld1w z2.s, p0/z, [x2, x5, lsl #2]

    // multiply two vector registers
    mul z1.s, p0/m, z1.s, z2.s

    // store the result back into the main memory
    st1w z1.s, p0, [x3, x5, lsl #2]

    // now that x4 numbers have been processed,
    // increment the index of the current number
    add x5, x5, x4

    // branch (jump) to the vmul_loop label
    b vmul_loop

vmul_return:
    ret
</code></pre>

<p><code class="language-plaintext highlighter-rouge">vmul.py</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
<span class="kn">import</span> <span class="nn">ctypes</span>
<span class="kn">from</span> <span class="nn">numpy</span> <span class="kn">import</span> <span class="n">array</span><span class="p">,</span> <span class="n">empty_like</span>

<span class="n">a</span> <span class="o">=</span> <span class="n">array</span><span class="p">([</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="p">]).</span><span class="n">astype</span><span class="p">(</span><span class="s">'int32'</span><span class="p">)</span>
<span class="n">b</span> <span class="o">=</span> <span class="n">array</span><span class="p">([</span><span class="mi">9</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">]).</span><span class="n">astype</span><span class="p">(</span><span class="s">'int32'</span><span class="p">)</span>
<span class="n">c</span> <span class="o">=</span> <span class="n">empty_like</span><span class="p">(</span><span class="n">a</span><span class="p">)</span>

<span class="c1"># locate the vmul routine
</span><span class="n">lib</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">CDLL</span><span class="p">(</span><span class="s">'./vmul.so'</span><span class="p">)</span>
<span class="n">vmul</span> <span class="o">=</span> <span class="n">lib</span><span class="p">.</span><span class="n">vmul</span>

<span class="c1"># vmul's function signature, akin to C
# the first three are input arguments, the fourth is output
</span><span class="n">int_ptr</span> <span class="o">=</span> <span class="n">ctypes</span><span class="p">.</span><span class="n">POINTER</span><span class="p">(</span><span class="n">ctypes</span><span class="p">.</span><span class="n">c_int</span><span class="p">)</span>
<span class="n">vmul</span><span class="p">.</span><span class="n">argtypes</span> <span class="o">=</span> <span class="p">[</span><span class="n">ctypes</span><span class="p">.</span><span class="n">c_size_t</span><span class="p">,</span> <span class="n">int_ptr</span><span class="p">,</span> <span class="n">int_ptr</span><span class="p">,</span> <span class="n">int_ptr</span><span class="p">]</span>

<span class="c1"># call vmul
</span><span class="n">vmul</span><span class="p">(</span><span class="n">a</span><span class="p">.</span><span class="n">size</span><span class="p">,</span> <span class="n">a</span><span class="p">.</span><span class="n">ctypes</span><span class="p">.</span><span class="n">data_as</span><span class="p">(</span><span class="n">int_ptr</span><span class="p">),</span>
     <span class="n">b</span><span class="p">.</span><span class="n">ctypes</span><span class="p">.</span><span class="n">data_as</span><span class="p">(</span><span class="n">int_ptr</span><span class="p">),</span> <span class="n">c</span><span class="p">.</span><span class="n">ctypes</span><span class="p">.</span><span class="n">data_as</span><span class="p">(</span><span class="n">int_ptr</span><span class="p">))</span>

<span class="k">print</span><span class="p">(</span><span class="n">c</span><span class="p">)</span>
</code></pre></div></div>

<p>It takes an AArch64 machine with SVE to run. I started a c7g instance in AWS (Graviton3) and installed Linux. In the shell,</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>as <span class="nt">-march</span><span class="o">=</span>armv8-a+sve <span class="nt">-o</span> vmul.o vmul.s
<span class="nv">$ </span>gcc <span class="nt">-shared</span> <span class="nt">-o</span> vmul.so <span class="nt">-fPIC</span> vmul.o
<span class="nv">$ </span>python3 vmul.py
<span class="o">[</span> 9 16 21 24 25 24 21 16  9]
</code></pre></div></div>

<p>The above example runs on integers. To adapt it for floating-point numbers, simply change <code class="language-plaintext highlighter-rouge">mul</code> to <code class="language-plaintext highlighter-rouge">fmul</code> in the assembly code, and make a minimal change in the Python code. Interested readers are encouraged to figure it out.</p>
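<p>As a sketch, the Python-side changes for float32 would look as follows. The <code class="language-plaintext highlighter-rouge">ctypes</code> calls are commented out because they assume a <code class="language-plaintext highlighter-rouge">vmul.so</code> rebuilt with <code class="language-plaintext highlighter-rouge">fmul</code> on an SVE machine; a NumPy reference of the expected result is included instead.</p>

```python
import ctypes
from numpy import array, empty_like

# The only differences from vmul.py: dtype and pointer type.
a = array([1, 2, 3, 4, 5, 6, 7, 8, 9]).astype('float32')
b = array([9, 8, 7, 6, 5, 4, 3, 2, 1]).astype('float32')
c = empty_like(a)

float_ptr = ctypes.POINTER(ctypes.c_float)   # c_float instead of c_int

# On an SVE machine with the fmul build of vmul.so:
# lib = ctypes.CDLL('./vmul.so')
# lib.vmul.argtypes = [ctypes.c_size_t, float_ptr, float_ptr, float_ptr]
# lib.vmul(a.size, a.ctypes.data_as(float_ptr),
#          b.ctypes.data_as(float_ptr), c.ctypes.data_as(float_ptr))

# NumPy reference of what the routine should write into c:
expected = a * b
```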

<h2 id="sve-throughput">SVE throughput</h2>
<p>In the wikipedia article regarding <a href="https://en.wikipedia.org/wiki/AWS_Graviton">Graviton</a>, the ARM64 processors designed by Amazon with support for SVE, Graviton3 is marked “2x256 SVE”, Graviton4 “4x128 SVE”. What does it mean?</p>

<ul>
  <li>Graviton3’s SVE is configured with 256-bit vector registers and can perform 2 vector computations per processor clock cycle, so the throughput is 256 × 2 = 512 bits per clock cycle;</li>
  <li>Graviton4’s SVE is configured with narrower 128-bit vector registers but can perform 4 computations per clock cycle, so the throughput is still 128 × 4 = 512 bits per clock cycle.</li>
</ul>

<h2 id="openblas-kernels-with-sve">OpenBLAS kernels with SVE</h2>
<p>In <a href="https://github.com/OpenMathLib/OpenBLAS">OpenBLAS</a> the linear algebra library, taking the general matrix-vector multiplication (gemv) routine for an example, the files with <code class="language-plaintext highlighter-rouge">_sve</code> in the file name use the SVE:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>kernel/arm64/bgemv_n_sve_v3x4.c
kernel/arm64/gemv_n.S
kernel/arm64/gemv_n_sve.c
kernel/arm64/gemv_n_sve_v1x3.c
kernel/arm64/gemv_n_sve_v4x3.c
kernel/arm64/gemv_t.S
kernel/arm64/gemv_t_sve.c
kernel/arm64/gemv_t_sve_v1x3.c
kernel/arm64/gemv_t_sve_v4x3.c
kernel/arm64/sbgemv_n_neon.c
kernel/arm64/sbgemv_t_bfdot.c
kernel/arm64/sgemv_n_neon.c
kernel/arm64/zgemv_n.S
kernel/arm64/zgemv_t.S
</code></pre></div></div>

<h2 id="references">References</h2>
<p>AArch64 registers for integers, <a href="/2025/07/14/arm64-ldr-and-str.html">/2025/07/14/arm64-ldr-and-str.html</a></p>

<p>AArch64 registers for floating-point numbers, <a href="/2025/08/12/arm64-floating-point.html">/2025/08/12/arm64-floating-point.html</a></p>

<p>SVE architecture, <a href="https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions">https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions</a></p>

<p>Stony Brook University AArch64 and SVE tutorials, <a href="https://www.stonybrook.edu/commcms/ookami/support/index_links_and_docs.php">https://www.stonybrook.edu/commcms/ookami/support/index_links_and_docs.php</a></p>

<p>A case study in vectorising using SVE, <a href="https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/a-case-study-in-vectorizing-haccmk-using-sve">https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/a-case-study-in-vectorizing-haccmk-using-sve</a>, which mentioned SVE throughput</p>

<p>AWS Graviton, ARM64 processors designed by Amazon with support for SVE (Scalable Vector Extension), <a href="https://en.wikipedia.org/wiki/AWS_Graviton">https://en.wikipedia.org/wiki/AWS_Graviton</a></p>

<p>OpenBLAS, <a href="https://github.com/OpenMathLib/OpenBLAS">https://github.com/OpenMathLib/OpenBLAS</a></p>]]></content><author><name></name></author><summary type="html"><![CDATA[Introduction Traditional computer architecture stores one number in one register, and performs operations on them. For artificial neural network, it’s inevitable to do vector operations, eg. multiplying a 50-element vector by another 50-element vector (one “element” is one number).]]></summary></entry><entry><title type="html">Install Apple’s MLX machine learning library</title><link href="https://www.brief-ds.com/2025/09/26/install-mlx.html" rel="alternate" type="text/html" title="Install Apple’s MLX machine learning library" /><published>2025-09-26T00:00:00+00:00</published><updated>2025-09-26T00:00:00+00:00</updated><id>https://www.brief-ds.com/2025/09/26/install-mlx</id><content type="html" xml:base="https://www.brief-ds.com/2025/09/26/install-mlx.html"><![CDATA[<p><a href="https://mlx-framework.org">MLX</a> is an Apple’s project to build a machine learning library, for Apple Silicon and ARM. <a href="https://en.wikipedia.org/wiki/AArch64">AArch64</a> architecture is necessary. To install on Ubuntu 24.04 LTS,</p>

<p>Install openblas,</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>libopenblas-dev
</code></pre></div></div>

<p>Install LAPACK. Not sure which package exactly is needed, I installed them all,</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>liblapack-dev liblapack64-dev liblapacke-dev
</code></pre></div></div>

<p>Install nanobind,</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt <span class="nb">install </span>nanobind-dev
</code></pre></div></div>

<p>Then clone MLX and build it,</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/ml-explore/mlx.git
<span class="nb">cd </span>mlx
python3 <span class="nt">-m</span> venv venv
<span class="nb">.</span> venv/bin/activate
pip3 <span class="nb">install</span> <span class="nb">.</span>
</code></pre></div></div>

<p>The install size is 23 megabytes,</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">du</span> <span class="nt">-h</span> <span class="nt">-d</span> 1 /home/ubuntu/mlx/venv/lib/python3.12/site-packages
..
23M	/home/ubuntu/mlx/venv/lib/python3.12/site-packages/mlx
..
</code></pre></div></div>

<h2 id="references">References</h2>
<p>Dive into MLX, Pranay Saha, <a href="https://medium.com/@pranaysaha/dive-into-mlx-performance-flexibility-for-apple-silicon-651d79080c4c">https://medium.com/@pranaysaha/dive-into-mlx-performance-flexibility-for-apple-silicon-651d79080c4c</a></p>

<p>How Fast is MLX?, Tristan Bilot, <a href="https://towardsdatascience.com/how-fast-is-mlx-a-comprehensive-benchmark-on-8-apple-silicon-chips-and-4-cuda-gpus-378a0ae356a0/">https://towardsdatascience.com/how-fast-is-mlx-a-comprehensive-benchmark-on-8-apple-silicon-chips-and-4-cuda-gpus-378a0ae356a0/</a></p>]]></content><author><name></name></author><summary type="html"><![CDATA[MLX is an Apple’s project to build a machine learning library, for Apple Silicon and ARM. AArch64 architecture is necessary. To install on Ubuntu 24.04 LTS,]]></summary></entry><entry><title type="html">TensorFlow, Apple’s MLX and our micrograd</title><link href="https://www.brief-ds.com/2025/09/25/tensorflow-mlx.html" rel="alternate" type="text/html" title="TensorFlow, Apple’s MLX and our micrograd" /><published>2025-09-25T00:00:00+00:00</published><updated>2025-09-25T00:00:00+00:00</updated><id>https://www.brief-ds.com/2025/09/25/tensorflow-mlx</id><content type="html" xml:base="https://www.brief-ds.com/2025/09/25/tensorflow-mlx.html"><![CDATA[<h2 id="automatic-differentiation-autodiff">Automatic differentiation (autodiff)</h2>
<p>An artificial neural network (ANN) is usually a function of input <math><mi>X</mi></math> and some parameters <math><mi>b</mi></math>,</p>

<math display="block">
<mi>f</mi><mo>(</mo><mi>X</mi><mo>,</mo><mi>b</mi><mo>)</mo><mtext>.</mtext>
</math>

<p>Given <math><mi>X</mi></math>, we observe <math><mi>Y</mi></math> as the output of the function or mechanism <math><mi>f</mi></math>.</p>

<p>The training of the ANN would involve adjusting <math><mi>b</mi></math> such that <math><mi>f</mi><mo>(</mo><mi>X</mi><mo>,</mo><mi>b</mi><mo>)</mo></math> is as close to <math><mi>Y</mi></math> as possible by some measure, called “loss”. For example, below is a loss,</p>

<math display="block">
<mi>l</mi><mo>(</mo><mi>X</mi><mo>,</mo><mi>Y</mi><mo>,</mo><mi>b</mi><mo>)</mo><mo>=</mo><mrow><mo>|</mo><mi>f</mi><mo>(</mo><mi>X</mi><mo>,</mo><mi>b</mi><mo>)</mo><mo>-</mo><mi>Y</mi><mo>|</mo></mrow><mtext>,</mtext>
</math>

<p>where <math><mi>X</mi></math>, <math><mi>Y</mi></math> are given. <math><mi>b</mi></math> can be adjusted to make <math><mi>l</mi></math> smaller.</p>

<p>We would compute the <a href="https://www.mathsisfun.com/calculus/derivatives-introduction.html">mathematical derivatives</a></p>

<math display="block">
<mfrac>
<mrow><mo>&part;</mo><mi>l</mi></mrow>
<mrow><mo>&part;</mo><mi>b</mi></mrow>
</mfrac>
</math>

<p>and move <math><mi>b</mi></math> against the direction of <math><mfrac><mrow><mo>&part;</mo><mi>l</mi></mrow><mrow><mo>&part;</mo><mi>b</mi></mrow></mfrac></math> to make <math><mi>l</mi></math> smaller.</p>
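<p>A toy gradient-descent loop illustrates the idea. As an assumption for simplicity, the model here is the scalar f(X, b) = X · b with a squared-error loss, whose derivative is smooth; it is not the absolute-value loss above.</p>

```python
def step(X, Y, b, lr=0.01):
    """One gradient-descent update of b for l = (X*b - Y)^2."""
    grad = 2 * (X * b - Y) * X   # dl/db by the chain rule
    return b - lr * grad         # move b against the gradient

X, Y = 2.0, 6.0                  # the loss is minimised at b = 3
b = 0.0
for _ in range(200):
    b = step(X, Y, b)

assert abs(b - 3.0) < 1e-6       # b has converged to 3
```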

<p>The capability to automatically perform mathematical differentiation (autodiff) of a complex function with respect to its parameters is essential to machine learning libraries: for example Google’s TensorFlow, Meta’s PyTorch, <a href="https://jax.dev">JAX</a>, Apple’s emergent <a href="https://mlx-framework.org">MLX</a>, and <a href="https://github.com/brief-ds/micrograd">micrograd</a> developed by us.</p>

<h2 id="micrograd-autodiff-library">micrograd autodiff library</h2>
<p>The repository is at <a href="https://github.com/brief-ds/micrograd">https://github.com/brief-ds/micrograd</a>. micrograd was started by Andrej Karpathy. <a href="https://github.com/brief-ds/micrograd/tree/scalar">The initial version</a> works only on scalar values. We extended it to work with vectors, including matrices (2-dimensional) and arbitrary-dimensional tensors.</p>

<p>The project is pure Python with no C code. Its core is just one 500-line Python file <a href="https://github.com/brief-ds/micrograd/blob/master/micrograd/engine.py">micrograd/engine.py</a>, ludicrously simple, and of toy size.</p>

<table>
  <thead>
    <tr>
      <th>Library</th>
      <th>Install size</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>micrograd</td>
      <td>20 kilobytes</td>
    </tr>
    <tr>
      <td>MLX</td>
      <td>23 megabytes</td>
    </tr>
    <tr>
      <td>PyTorch</td>
      <td>700 megabytes</td>
    </tr>
    <tr>
      <td>TensorFlow</td>
      <td>1,700 megabytes</td>
    </tr>
  </tbody>
</table>

<p>For numerical evaluation, micrograd depends on an external library rather than re-inventing any wheels; today that library is NumPy. If we account for NumPy, the total install size of micrograd is in the same ballpark as that of MLX. micrograd could, however, switch to any more compact numerical library.</p>

<h3 id="micrograd-is-both-kid--and-researcher-friendly">micrograd is both kid- and researcher-friendly</h3>
<p>The core file <a href="https://github.com/brief-ds/micrograd/blob/master/micrograd/engine.py">micrograd/engine.py</a> is no more than 500 lines. Each mathematical operator is defined in 10-20 lines, for example the sum operation in <a href="https://github.com/brief-ds/micrograd/blob/master/micrograd/engine.py">micrograd/engine.py</a>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
    <span class="k">def</span> <span class="nf">sum</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="p">...</span>         <span class="c1"># 8 lines of pre-processing
</span>
        <span class="n">out</span> <span class="o">=</span> <span class="n">Value</span><span class="p">(</span><span class="n">_sum</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="n">axis</span><span class="p">),</span> <span class="p">(</span><span class="bp">self</span><span class="p">,),</span> <span class="s">'sum'</span><span class="p">)</span>

        <span class="k">def</span> <span class="nf">_forward</span><span class="p">(</span><span class="o">**</span><span class="n">kwds</span><span class="p">):</span>
            <span class="n">out</span><span class="p">.</span><span class="n">data</span> <span class="o">=</span> <span class="n">_sum</span><span class="p">(</span><span class="bp">self</span><span class="p">.</span><span class="n">data</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="n">axis</span><span class="p">)</span>
        <span class="n">out</span><span class="p">.</span><span class="n">_forward</span> <span class="o">=</span> <span class="n">_forward</span>

        <span class="k">def</span> <span class="nf">_backward</span><span class="p">():</span>
            <span class="c1"># expand out.grad to same number of dimensions
</span>            <span class="c1"># as self.data, self.grad
</span>            <span class="n">_out_grad</span> <span class="o">=</span> <span class="n">expand_dims</span><span class="p">(</span><span class="n">out</span><span class="p">.</span><span class="n">grad</span><span class="p">,</span> <span class="n">_axis</span><span class="p">)</span>

            <span class="c1"># ... expand further to same shape as self.data
</span>            <span class="bp">self</span><span class="p">.</span><span class="n">grad</span> <span class="o">+=</span> <span class="n">broadcast_to</span><span class="p">(</span><span class="n">_out_grad</span><span class="p">,</span> <span class="bp">self</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
        <span class="n">out</span><span class="p">.</span><span class="n">_backward</span> <span class="o">=</span> <span class="n">_backward</span>

        <span class="k">return</span> <span class="n">out</span>

</code></pre></div></div>

<p>where</p>
<ul>
  <li>the <code class="language-plaintext highlighter-rouge">_forward()</code> function evaluates the sum, and</li>
  <li>the <code class="language-plaintext highlighter-rouge">_backward()</code> function differentiates the sum with respect to the elements, over which the sum was calculated.</li>
</ul>

<h3 id="micrograd-can-be-uniquely-inspected-with-pythons-built-in-profiler">micrograd can be uniquely inspected with Python’s built-in profiler</h3>
<p>Timing code is called “profiling”. Complex machine learning libraries would require additional, purpose-written code to inspect themselves. Because micrograd is pure Python, one can time it with the cProfile module built into Python:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python3 <span class="nt">-m</span> cProfile <span class="nt">-s</span> tottime &lt;program_using_micrograd&gt;
</code></pre></div></div>

<p>We rewrote the model behind <a href="https://tsterm.com">https://tsterm.com</a> using micrograd and profiled it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     3258    2.749    0.001    2.814    0.001 numeric.py:1002(tensordot)
     ...
     1440    1.009    0.001    1.266    0.001 engine.py:91(_backward)
     ...
</code></pre></div></div>

<p>cProfile’s output clearly ranks each forward or backward function of the mathematical operators by the total time, under the <code class="language-plaintext highlighter-rouge">tottime</code> column. On one run, the most costly was the tensordot operation (tensor multiplication), followed by the differentiation of the element-wise multiplication.</p>

<h3 id="micrograd-is-comparable-in-performance">micrograd is comparable in performance</h3>
<p>micrograd turns out not to be at a disadvantage. We benchmarked the model behind <a href="https://tsterm.com">https://tsterm.com</a> written with different libraries. The shorter the run time, the better.</p>

<table>
  <thead>
    <tr>
      <th>Hardware</th>
      <th>Operating System</th>
      <th>TensorFlow</th>
      <th>MLX</th>
      <th>micrograd</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>x86_64 (AMD EPYC)</td>
      <td>Amazon Linux 2</td>
      <td>10s</td>
      <td> </td>
      <td>12s</td>
    </tr>
    <tr>
      <td>AArch64 (Graviton3)</td>
      <td>Ubuntu 24.04 LTS</td>
      <td>13s</td>
      <td>12s</td>
      <td>12s</td>
    </tr>
    <tr>
      <td>AArch64 (Graviton4)</td>
      <td>Ubuntu 24.04 LTS</td>
      <td>11s</td>
      <td>11s</td>
      <td>11s</td>
    </tr>
  </tbody>
</table>

<p>The model performs quantile regression on 600 megabytes of in-memory data. The data type is float32.</p>

<p>MLX is only for <a href="https://en.wikipedia.org/wiki/AArch64">AArch64</a> and cannot run on other hardware. Note that in MLX <a href="https://ml-explore.github.io/mlx/build/html/usage/lazy_evaluation.html">numerical evaluation is lazy</a>: issue <code class="language-plaintext highlighter-rouge">mlx.core.eval()</code> once to initialise the ANN parameters after defining them, then once in each training step to actually update them.</p>

<p>We can see that on x86_64, TensorFlow wins; on AArch64, MLX and micrograd are about on par, with MLX typically leading micrograd by 0.1-0.2s.</p>

<h3 id="micrograd-can-be-easily-extended">micrograd can be easily extended</h3>
<p>To add a new mathematical operator, just go into <a href="https://github.com/brief-ds/micrograd/blob/master/micrograd/engine.py"><code class="language-plaintext highlighter-rouge">micrograd/engine.py</code></a>, and add a few lines, for example:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>
    <span class="k">def</span> <span class="nf">non_linear_op</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>

        <span class="n">out</span> <span class="o">=</span> <span class="p">...</span>

        <span class="k">def</span> <span class="nf">_forward</span><span class="p">():</span>
            <span class="k">pass</span>
        <span class="n">out</span><span class="p">.</span><span class="n">_forward</span> <span class="o">=</span> <span class="n">_forward</span>

        <span class="k">def</span> <span class="nf">_backward</span><span class="p">():</span>
            <span class="k">pass</span>
        <span class="n">out</span><span class="p">.</span><span class="n">_backward</span> <span class="o">=</span> <span class="n">_backward</span>

        <span class="k">return</span> <span class="n">out</span>

</code></pre></div></div>

<h2 id="micrograd-opens-up-more-opportunities">micrograd opens up more opportunities</h2>
<p>micrograd is plainly written, but is competitive in performance. With just Python’s built-in tool, we can understand the performance of each component in the ANN. The learning curve is close to zero.</p>

<p>micrograd would open up more opportunities by allowing finer control in the <code class="language-plaintext highlighter-rouge">_forward()</code> and <code class="language-plaintext highlighter-rouge">_backward()</code> functions, such as selectively updating rows of the weight matrix, as with “attention”.</p>

<h2 id="references">References</h2>
<p>Introduction to Derivatives, Math is Fun, <a href="https://www.mathsisfun.com/calculus/derivatives-introduction.html">https://www.mathsisfun.com/calculus/derivatives-introduction.html</a></p>

<p>Differentiation, BBC Bitsize, <a href="https://www.bbc.co.uk/bitesize/guides/zyj77ty/">https://www.bbc.co.uk/bitesize/guides/zyj77ty/</a></p>

<p>Install Apple’s MLX machine learning library, <a href="/2025/09/26/install-mlx.html">/2025/09/26/install-mlx.html</a></p>

<p>Lazy Evaluation, MLX, <a href="https://ml-explore.github.io/mlx/build/html/usage/lazy_evaluation.html">https://ml-explore.github.io/mlx/build/html/usage/lazy_evaluation.html</a></p>

<p>Dive into MLX, Pranay Saha, <a href="https://medium.com/@pranaysaha/dive-into-mlx-performance-flexibility-for-apple-silicon-651d79080c4c">https://medium.com/@pranaysaha/dive-into-mlx-performance-flexibility-for-apple-silicon-651d79080c4c</a></p>

<p>How Fast is MLX?, Tristan Bilot, <a href="https://towardsdatascience.com/how-fast-is-mlx-a-comprehensive-benchmark-on-8-apple-silicon-chips-and-4-cuda-gpus-378a0ae356a0/">https://towardsdatascience.com/how-fast-is-mlx-a-comprehensive-benchmark-on-8-apple-silicon-chips-and-4-cuda-gpus-378a0ae356a0/</a></p>]]></content><author><name></name></author><summary type="html"><![CDATA[Automatic differentiation (autodiff) An artificial neural network (ANN) is usually a function of input X and some parameters b,]]></summary></entry><entry><title type="html">Project Agora update, Bank for International Settlements, Sep 2025</title><link href="https://www.brief-ds.com/2025/09/04/agora.html" rel="alternate" type="text/html" title="Project Agora update, Bank for International Settlements, Sep 2025" /><published>2025-09-04T00:00:00+00:00</published><updated>2025-09-04T00:00:00+00:00</updated><id>https://www.brief-ds.com/2025/09/04/agora</id><content type="html" xml:base="https://www.brief-ds.com/2025/09/04/agora.html"><![CDATA[<p><a href="https://www.bis.org/about/bisih/topics/fmis/agora.htm">Project Agora</a> is a digital cross-border payment project. Central bank coins (asset of commercial bank) and deposits (liability of commercial bank) will be sent cross-border.</p>

<p>On a unified ledger with smart contract (ability to execute code) functionality, the exchange between payer bank - correspondent banks - payee bank will be finalised at once.</p>

<p>Participants include commercial banks and central banks across the US, UK, Switzerland, Europe, Japan, Korea and Mexico, as well as Mastercard and Visa.</p>

<p>As of Sep 2025, no decision has been made on whether to issue a <a href="https://www.bankofengland.co.uk/the-digital-pound">digital pound</a>.</p>

<p><img src="/assets/2025-09-agora/agora-participants.png" alt="Agora participants" /></p>

<p><img src="/assets/2025-09-agora/money-forms.png" alt="Forms of money" /></p>

<h2 id="references">References</h2>
<p>Project Agora, <a href="https://www.bis.org/about/bisih/topics/fmis/agora.htm">https://www.bis.org/about/bisih/topics/fmis/agora.htm</a></p>

<p>Digital pound, <a href="https://www.bankofengland.co.uk/the-digital-pound">https://www.bankofengland.co.uk/the-digital-pound</a></p>

<p>Bank of England innovation in money and payments conference, Sep 2025, <a href="https://www.bankofengland.co.uk/events/2025/september/boe-innovation-money-and-payments">https://www.bankofengland.co.uk/events/2025/september/boe-innovation-money-and-payments</a></p>]]></content><author><name></name></author><summary type="html"><![CDATA[Project Agora is a digital cross-border payment project. Central bank coins (asset of commercial bank) and deposits (liability of commercial bank) will be sent cross-border.]]></summary></entry><entry><title type="html">ARM64 floating point arithmetic</title><link href="https://www.brief-ds.com/2025/08/12/arm64-floating-point.html" rel="alternate" type="text/html" title="ARM64 floating point arithmetic" /><published>2025-08-12T00:00:00+00:00</published><updated>2025-08-12T00:00:00+00:00</updated><id>https://www.brief-ds.com/2025/08/12/arm64-floating-point</id><content type="html" xml:base="https://www.brief-ds.com/2025/08/12/arm64-floating-point.html"><![CDATA[<h2 id="arm64-floating-point-registers">ARM64 floating point registers</h2>
<p>AArch64 has 32 floating-point registers, each 128 bits wide, named Q0-Q31. If one needs less precision, for example 32-bit, one can access the lower part of each register under a different name. For example,</p>

<table>
  <tbody>
    <tr>
      <td>Q0</td>
      <td>128-bit</td>
    </tr>
    <tr>
      <td>D0</td>
      <td>the lowest 64-bit part of Q0</td>
    </tr>
    <tr>
      <td>S0</td>
      <td>the lowest 32-bit part of Q0</td>
    </tr>
    <tr>
      <td>H0</td>
      <td>the lowest 16-bit part of Q0</td>
    </tr>
    <tr>
      <td>B0</td>
      <td>the lowest 8-bit part of Q0</td>
    </tr>
  </tbody>
</table>

<p>Below is a graphical illustration,</p>

<p><img src="/assets/arm64/a64_fp_registers.png" alt="a64 floating point registers" /></p>

<h2 id="floating-point-registers-preserved-by-the-callee-function">Floating point registers preserved by the callee function</h2>
<p>Certain registers need to stay the same at the exit as at the entry of a function/subroutine being called (callee).</p>

<p>The first eight registers, v0-v7, are used to pass argument values into a subroutine and to return result values from a function. They may also be used to hold intermediate values within a routine (but, in general, only between subroutine calls).</p>

<p>Registers v8-v15 must be preserved by a callee across subroutine calls; the remaining registers (v0-v7, v16-v31) do not need to be preserved (or should be preserved by the caller). Additionally, only the bottom 64 bits of each value stored in v8-v15 need to be preserved; it is the responsibility of the caller to preserve larger values. <a href="https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst#simd-and-floating-point-registers">Procedure Call Standard for Aarch64 6.1.2 SIMD and Floating-Point registers</a></p>

<h2 id="project-overview">Project Overview</h2>
<p>In previous posts, the result, an integer, was returned via the Linux <code class="language-plaintext highlighter-rouge">exit()</code> call. But in this post the result is a floating-point value, so we change the game plan – we’ll use the C function <code class="language-plaintext highlighter-rouge">printf()</code> to print it out.</p>

<p>We first write <code class="language-plaintext highlighter-rouge">add.h</code> and <code class="language-plaintext highlighter-rouge">main.c</code>:</p>

<p><code class="language-plaintext highlighter-rouge">add.h</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="nf">add</span><span class="p">(</span><span class="kt">float</span> <span class="n">a</span><span class="p">,</span> <span class="kt">float</span> <span class="n">b</span><span class="p">);</span>
</code></pre></div></div>

<p><code class="language-plaintext highlighter-rouge">main.c</code>:</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf">"add.h"</span><span class="cp">
#include</span> <span class="cpf">&lt;stdio.h&gt;</span><span class="cp">
</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span> <span class="p">{</span>
    <span class="n">printf</span><span class="p">(</span><span class="s">"%f</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">add</span><span class="p">(</span><span class="mi">3</span><span class="p">.</span><span class="mi">4</span><span class="p">,</span> <span class="o">-</span><span class="mi">2</span><span class="p">.</span><span class="mi">7</span><span class="p">));</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>what remains is to write an <code class="language-plaintext highlighter-rouge">add.s</code>, assemble and link it with <code class="language-plaintext highlighter-rouge">main.c</code> to produce the executable file.</p>

<h2 id="the-add-function-in-assembly">The <code class="language-plaintext highlighter-rouge">add</code> function in assembly</h2>
<p>Recall that all constants are preceded by <code class="language-plaintext highlighter-rouge">#</code>. <code class="language-plaintext highlighter-rouge">#0x10</code> is the hexadecimal 10, equivalent to decimal 16. <code class="language-plaintext highlighter-rouge">#12</code> is just the decimal 12.</p>

<p>The C <code class="language-plaintext highlighter-rouge">float</code> type is typically 32-bit. In C, the interface of <code class="language-plaintext highlighter-rouge">add()</code> was declared as</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">float</span> <span class="nf">add</span><span class="p">(</span><span class="kt">float</span> <span class="n">a</span><span class="p">,</span> <span class="kt">float</span> <span class="n">b</span><span class="p">);</span>
</code></pre></div></div>

<p>At the entry of the <code class="language-plaintext highlighter-rouge">add</code> routine in the assembly, the input parameters a and b are in the 32-bit registers s0 and s1. The result will be returned in s0. None of these registers has to be preserved by the callee.</p>

<p>Refer to the <a href="/2025/08/04/arm64-func.html">ARM64 function call</a> post for the use of the stack.</p>

<p><code class="language-plaintext highlighter-rouge">add.s</code>:</p>

<pre><code class="language-asm">.global add

.text
add:
    sub sp, sp, #0x10       // set sp to sp - 16 bytes
    str lr, [sp, #8]        // store link register lr at sp + 8 bytes
    str fp, [sp]            // store frame pointer fp at sp
    mov fp, sp              // set frame pointer fp to sp

    fadd s0, s0, s1         // floating-point add
                            // store result in s0

    ldr fp, [sp]            // restore frame pointer fp from sp
    ldr lr, [sp, #8]        // restore link register lr from sp + 8 bytes
    add sp, sp, #0x10       // set sp to sp + 16 bytes
    ret
</code></pre>

<h2 id="run-the-project">Run the project</h2>
<p>We assemble the <code class="language-plaintext highlighter-rouge">add.s</code> into a binary file <code class="language-plaintext highlighter-rouge">add.o</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>as <span class="nt">-o</span> add.o add.s
</code></pre></div></div>

<p>then compile the remaining C files, and link with <code class="language-plaintext highlighter-rouge">add.o</code> to produce the final executable file <code class="language-plaintext highlighter-rouge">main</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>gcc <span class="nt">-o</span> main main.c add.o
</code></pre></div></div>

<p>We run it:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./main
0.700000
</code></pre></div></div>

<h2 id="references">References</h2>
<p>Arm Compiler armasm User Guide. On <a href="https://developer.arm.com">https://developer.arm.com</a>, search for “armasm user guide”. In the result list, find the latest version of “Arm Compiler armasm User Guide”.</p>

<p>Aarch64 registers, <a href="https://developer.arm.com/documentation/102374/0102/Registers-in-AArch64---general-purpose-registers">https://developer.arm.com/documentation/102374/0102/Registers-in-AArch64---general-purpose-registers</a></p>

<p>Procedure Call Standard for Aarch64, <a href="https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst">https://github.com/ARM-software/abi-aa/blob/main/aapcs64/aapcs64.rst</a></p>

<p>ARM64 load/save register instructions, <a href="/2025/07/14/arm64-ldr-and-str.html">/2025/07/14/arm64-ldr-and-str.html</a></p>

<p>ARM64 function calls, <a href="/2025/08/04/arm64-func.html">/2025/08/04/arm64-func.html</a>.</p>

<p>Introduction to Aarch64 architecture, 8. The Stack, <a href="https://hrishim.github.io/llvl_prog1_book/stack.html">https://hrishim.github.io/llvl_prog1_book/stack.html</a>.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[ARM64 floating point registers Aarch64 has 32 floating point registers. Each is 128-bit under name Q0-Q31. If one needs less precision for example 32-bit, one can access the lowest 32-bit part of each register under a different name. For example,]]></summary></entry><entry><title type="html">First RISC-V64 program</title><link href="https://www.brief-ds.com/2025/08/11/first-riscv64-program.html" rel="alternate" type="text/html" title="First RISC-V64 program" /><published>2025-08-11T00:00:00+00:00</published><updated>2025-08-11T00:00:00+00:00</updated><id>https://www.brief-ds.com/2025/08/11/first-riscv64-program</id><content type="html" xml:base="https://www.brief-ds.com/2025/08/11/first-riscv64-program.html"><![CDATA[<p>RISC-V64, as ARM64, is another simple <a href="https://en.wikipedia.org/wiki/Reduced_instruction_set_computer">reduced instruction set computer</a> architecture. RISC-V started from the University of California, Berkeley and is owned by a non-for-profit <a href="https://riscv.org/">consortium</a>. RISC-V is not one single instruction set, but a family of base sets and extensions, e.g, “I” for the base integer instruction set, “V” for the extension for vector operations. Any base set or extension may undergo revisions, but it tends to be overall stable. Manufacturers can combine a base set with any extensions for a chip, for example “E” only for microcontrollers, but “IFD” for integer, single-width floating point and double-width floating point arithmetic.</p>

<p>The main reference is the <strong>instruction set manual</strong> at <a href="https://github.com/riscv/riscv-isa-manual">https://github.com/riscv/riscv-isa-manual</a>, whose README links to the HTML snapshots of <a href="https://riscv.github.io/riscv-isa-manual/snapshot/unprivileged/">user-level (unprivileged) instruction sets</a> and <a href="https://riscv.github.io/riscv-isa-manual/snapshot/privileged/">privileged instruction sets</a>.</p>

<p>We will install the RISC-V development toolchain on an x86_64 host and write the first RISC-V64 assembly program on this non-RISC-V-native host. A toolchain is the collection of assembler, compiler and linker that translate assembly and C programs into executable binaries.</p>

<h2 id="bare-metal-development-toolchain">Bare metal development toolchain</h2>
<p>Bare metal means executable files produced by the toolchain will be able to run with no support from the underlying operating system (for example, to translate memory addresses) or any RISC-V64 simulator.</p>

<p>Clone the repo at <a href="https://github.com/riscv-collab/riscv-gnu-toolchain">https://github.com/riscv-collab/riscv-gnu-toolchain</a>,</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git clone https://github.com/riscv-collab/riscv-gnu-toolchain
</code></pre></div></div>

<p>and follow the instructions under the “Installation (Newlib)” section to make the default target.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">mkdir</span> /opt
./configure <span class="nt">--prefix</span><span class="o">=</span>/opt/rv64gcv <span class="nt">--with-arch</span><span class="o">=</span>rv64gcv <span class="nt">--with-abi</span><span class="o">=</span>lp64d
make <span class="nt">-j</span> 2
</code></pre></div></div>

<p>After the make, 18G of disk space is taken on an Ubuntu x86_64 system. Once the toolchain is made, in the cloud you can switch the virtual machine to a less powerful instance type, with at least 2G of memory.</p>

<p>If you follow the instructions under the “Installation (Linux)” section instead, making the Linux/GNU toolchain will take longer and use more disk space. Executable files produced by that toolchain are also trickier to run, depending on support from the underlying operating system and a RISC-V simulator.</p>

<h2 id="arch-architecture-and-abi-application-binary-interface-for-risc-v64"><code class="language-plaintext highlighter-rouge">arch</code> (architecture) and <code class="language-plaintext highlighter-rouge">abi</code> (application binary interface) for RISC-V64</h2>
<p>The <code class="language-plaintext highlighter-rouge">--with-arch=rv64gcv</code> parameter configures the architecture: the “G” represents the “I” base set and the “M”, “A”, “F”, “D”, “Zicsr” and “Zifencei” extensions, enough for a general purpose computer. <a href="https://riscv.github.io/riscv-isa-manual/snapshot/unprivileged/">The RISC-V Instruction Set Manual Volume I: Unprivileged Architecture</a> 36.3 Instruction-Set Extension Names</p>

<table>
  <thead>
    <tr>
<th>acronym</th>
      <th>instruction set</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>I</td>
      <td>integer</td>
    </tr>
    <tr>
      <td>M</td>
      <td>integer multiplication/division</td>
    </tr>
    <tr>
      <td>A</td>
      <td>atomic instructions</td>
    </tr>
    <tr>
      <td>Zicsr</td>
      <td>extension for control and status register instructions</td>
    </tr>
    <tr>
      <td>F</td>
      <td>single-precision floating point, depends on “Zicsr”</td>
    </tr>
    <tr>
      <td>D</td>
      <td>double-precision floating point, depends on “F”</td>
    </tr>
    <tr>
      <td>Zifencei</td>
      <td>extension for instruction-fetch fence</td>
    </tr>
    <tr>
      <td>C</td>
      <td>compressed 16-bit instructions (instead of 32-bit)</td>
    </tr>
    <tr>
      <td>V</td>
      <td>vector</td>
    </tr>
  </tbody>
</table>

<p><code class="language-plaintext highlighter-rouge">abi</code>, or application binary interface, specifies how parameters are passed in a function call, for example, whether it takes one 64-bit register or two 32-bit registers to send one 64-bit number. For more detail, refer to <a href="https://www.sifive.com/blog/all-aboard-part-1-compiler-args">The -march, -mabi, and -mtune arguments to RISC-V Compilers</a>, part of a blog series.</p>

<p>This whole SiFive blog series is worth a read:</p>

<ul>
  <li>As each RISC-V instruction is encoded in 32 bits, it takes two instructions to load a 32-bit address;</li>
  <li>it also influences the choice of “memory model”: the range of memory a program has access to.</li>
</ul>

<h2 id="risc-v64-registers">RISC-V64 registers</h2>
<p>A register is a location that stores a number on a computer architecture. For either the 32-bit base integer instruction set RV32I or the 64-bit counterpart RV64I, XLEN characterises the width in bits of an integer value stored in a register:</p>

<p>for RV32I, XLEN = 32; for RV64I, XLEN = 64.</p>

<p>Apart from the different XLEN, RV32I and RV64I define the same 32 general-purpose registers: x0-x31. The value in x0 is always 0, even if an instruction tries to write another value. There is an extra unprivileged program counter register pc that points at the current instruction in memory.</p>

<p>By convention, x1 is used to store the return address that the current function shall return to, x2 as the stack pointer, etc. For this convention, the ABI (application binary interface) gives mnemonic names: ra for x1, sp for x2, etc, which are listed below.</p>

<table>
  <thead>
    <tr>
      <th>Register</th>
      <th>ABI mnemonic</th>
      <th>Use by convention</th>
      <th>Preserved by the callee</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>x0</td>
      <td>zero</td>
      <td>hardwired to 0, ignore writes</td>
      <td>n/a</td>
    </tr>
    <tr>
      <td>x1</td>
      <td>ra</td>
      <td>return address</td>
      <td>no</td>
    </tr>
    <tr>
      <td>x2</td>
      <td>sp</td>
      <td>stack pointer</td>
      <td>yes</td>
    </tr>
    <tr>
      <td>x3</td>
      <td>gp</td>
      <td>global pointer</td>
      <td>n/a</td>
    </tr>
    <tr>
      <td>x4</td>
      <td>tp</td>
      <td>thread pointer</td>
      <td>n/a</td>
    </tr>
    <tr>
      <td>x5</td>
      <td>t0</td>
      <td>temporary register 0</td>
      <td>no</td>
    </tr>
    <tr>
      <td>x6</td>
      <td>t1</td>
      <td>temporary register 1</td>
      <td>no</td>
    </tr>
    <tr>
      <td>x7</td>
      <td>t2</td>
      <td>temporary register 2</td>
      <td>no</td>
    </tr>
    <tr>
      <td>x8</td>
      <td>s0 or fp</td>
      <td>saved register 0 or frame pointer</td>
      <td>yes</td>
    </tr>
    <tr>
      <td>x9</td>
      <td>s1</td>
      <td>saved register 1</td>
      <td>yes</td>
    </tr>
    <tr>
      <td>x10</td>
      <td>a0</td>
      <td>return value or function argument 0</td>
      <td>no</td>
    </tr>
    <tr>
      <td>x11</td>
      <td>a1</td>
      <td>return value or function argument 1</td>
      <td>no</td>
    </tr>
    <tr>
      <td>x12</td>
      <td>a2</td>
      <td>function argument 2</td>
      <td>no</td>
    </tr>
    <tr>
      <td>x13</td>
      <td>a3</td>
      <td>function argument 3</td>
      <td>no</td>
    </tr>
    <tr>
      <td>x14</td>
      <td>a4</td>
      <td>function argument 4</td>
      <td>no</td>
    </tr>
    <tr>
      <td>x15</td>
      <td>a5</td>
      <td>function argument 5</td>
      <td>no</td>
    </tr>
    <tr>
      <td>x16</td>
      <td>a6</td>
      <td>function argument 6</td>
      <td>no</td>
    </tr>
    <tr>
      <td>x17</td>
      <td>a7</td>
      <td>function argument 7</td>
      <td>no</td>
    </tr>
    <tr>
      <td>x18</td>
      <td>s2</td>
      <td>saved register 2</td>
      <td>yes</td>
    </tr>
    <tr>
      <td>x19</td>
      <td>s3</td>
      <td>saved register 3</td>
      <td>yes</td>
    </tr>
    <tr>
      <td>x20</td>
      <td>s4</td>
      <td>saved register 4</td>
      <td>yes</td>
    </tr>
    <tr>
      <td>x21</td>
      <td>s5</td>
      <td>saved register 5</td>
      <td>yes</td>
    </tr>
    <tr>
      <td>x22</td>
      <td>s6</td>
      <td>saved register 6</td>
      <td>yes</td>
    </tr>
    <tr>
      <td>x23</td>
      <td>s7</td>
      <td>saved register 7</td>
      <td>yes</td>
    </tr>
    <tr>
      <td>x24</td>
      <td>s8</td>
      <td>saved register 8</td>
      <td>yes</td>
    </tr>
    <tr>
      <td>x25</td>
      <td>s9</td>
      <td>saved register 9</td>
      <td>yes</td>
    </tr>
    <tr>
      <td>x26</td>
      <td>s10</td>
      <td>saved register 10</td>
      <td>yes</td>
    </tr>
    <tr>
      <td>x27</td>
      <td>s11</td>
      <td>saved register 11</td>
      <td>yes</td>
    </tr>
    <tr>
      <td>x28</td>
      <td>t3</td>
      <td>temporary register 3</td>
      <td>no</td>
    </tr>
    <tr>
      <td>x29</td>
      <td>t4</td>
      <td>temporary register 4</td>
      <td>no</td>
    </tr>
    <tr>
      <td>x30</td>
      <td>t5</td>
      <td>temporary register 5</td>
      <td>no</td>
    </tr>
    <tr>
      <td>x31</td>
      <td>t6</td>
      <td>temporary register 6</td>
      <td>no</td>
    </tr>
    <tr>
      <td>pc</td>
      <td>(none)</td>
      <td>program counter</td>
      <td>n/a</td>
    </tr>
  </tbody>
</table>

<p>The current function (callee) has to make sure that at its exit s0-s11 and sp hold the same values as at its entry. In the case of nested function calls, ra is normally saved and restored by each parent function.</p>

<p>RV32E and RV64E are for microcontroller use. The only difference between RV32I and RV32E, or between RV64I and RV64E, is that the E variants have only 16 registers. By the convention above, x0-x15 serve the most important purposes: return address, stack pointer, etc. x16-x31 only increase the number of function arguments, saved registers and temporary registers.</p>

<p>One can write an assembly program with either the general names x0-x31 or the mnemonic names. The disassembler <code class="language-plaintext highlighter-rouge">objdump</code> normally displays the mnemonic names, but can output the general names with the option <code class="language-plaintext highlighter-rouge">-M numeric</code>.</p>

<h2 id="the-first-risc-v64-assembly-program">The first RISC-V64 assembly program</h2>
<p>We copy the <code class="language-plaintext highlighter-rouge">hello.s</code> in <a href="https://www.youtube.com/watch?v=0IeOaiKszLk">RISC-V Assembly Hello World (part 1)</a> by LaurieWired, which simply calls the Linux <code class="language-plaintext highlighter-rouge">exit()</code>.</p>

<pre><code class="language-asm">.global _start

.text
_start:
   li a0, 2      # exit status for Linux exit()
   li a7, 93     # function number for Linux exit()
   ecall
</code></pre>

<p>We run the toolchain to produce the executable <code class="language-plaintext highlighter-rouge">hello</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>/opt/rv64gcv/bin/riscv64-unknown-elf-as <span class="nt">-o</span> hello.o hello.s
<span class="nv">$ </span>/opt/rv64gcv/bin/riscv64-unknown-elf-ld <span class="nt">-o</span> hello hello.o
</code></pre></div></div>

<p>As we are using the bare metal toolchain, we’ll find <code class="language-plaintext highlighter-rouge">hello</code> can run by itself,</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>./hello<span class="p">;</span> <span class="nb">echo</span> <span class="nv">$?</span>
2
</code></pre></div></div>

<p>Alternatively,</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>/opt/rv64gcv/bin/riscv64-unknown-elf-run hello<span class="p">;</span> <span class="nb">echo</span> <span class="nv">$?</span>
2
</code></pre></div></div>

<h2 id="references">References</h2>
<p>RISC-V Instruction Set Manual, <a href="https://github.com/riscv/riscv-isa-manual">https://github.com/riscv/riscv-isa-manual</a></p>

<p>The RISC-V Instruction Set Manual Volume I: Unprivileged Architecture, <a href="https://riscv.github.io/riscv-isa-manual/snapshot/unprivileged/">https://riscv.github.io/riscv-isa-manual/snapshot/unprivileged/</a></p>

<p>RISC-V GNU toolchain, <a href="https://github.com/riscv-collab/riscv-gnu-toolchain">https://github.com/riscv-collab/riscv-gnu-toolchain</a></p>

<p>Spike RISC-V ISA Simulator, <a href="https://github.com/riscv-software-src/riscv-isa-sim">https://github.com/riscv-software-src/riscv-isa-sim</a></p>

<p>RISC-V Proxy Kernel and Boot Loader, <a href="https://github.com/riscv-software-src/riscv-pk">https://github.com/riscv-software-src/riscv-pk</a>, which translates I/O calls to the host computer.</p>

<p>Setup Riscv-GNU-TOOLCHAIN and SPIKE for the Vector Extension, Syed Hassan Ul Haq, <a href="https://medium.com/@ulhaqhassan1/setup-riscv-gnu-toolchain-and-spike-for-the-vector-extnesion-a10ede8b1857">https://medium.com/@ulhaqhassan1/setup-riscv-gnu-toolchain-and-spike-for-the-vector-extnesion-a10ede8b1857</a></p>

<p>Setting up RISC-V toolchain and simulator, Mohamed A. Bamakhrama, <a href="https://gist.github.com/mohamed/a6e406e086e6c9cc0ded222d23bcb0a6">https://gist.github.com/mohamed/a6e406e086e6c9cc0ded222d23bcb0a6</a></p>

<p>All Aboard, Part 1: The -march, -mabi, and -mtune arguments to RISC-V Compilers, part of a SiFive blog series, <a href="https://www.sifive.com/blog/all-aboard-part-1-compiler-args">https://www.sifive.com/blog/all-aboard-part-1-compiler-args</a></p>

<p>RISC-V64 assembly language setup and first steps, Russ Ross, <a href="https://www.youtube.com/watch?v=5g8M85r8Au8">https://www.youtube.com/watch?v=5g8M85r8Au8</a></p>

<p>RISC-V64 assembly language, Digital Design and Computer Architecture Chapter 6 Architecture, Sarah L Harris and David Harris, <a href="https://www.youtube.com/playlist?list=PLcbc_WBQVCytATU2xxAqcFkynK8hRflSz">https://www.youtube.com/playlist?list=PLcbc_WBQVCytATU2xxAqcFkynK8hRflSz</a></p>

<p>RISC-V Linux syscall table, Juraj Borza, <a href="https://jborza.com/post/2021-05-11-riscv-linux-syscalls/">https://jborza.com/post/2021-05-11-riscv-linux-syscalls/</a></p>

<p>RISC-V Assembly Hello World (part 1), LaurieWired, <a href="https://www.youtube.com/watch?v=0IeOaiKszLk">https://www.youtube.com/watch?v=0IeOaiKszLk</a></p>

<p>RISC-V Assembly Programmer’s Handbook, <a href="https://github.com/riscv-non-isa/riscv-asm-manual">https://github.com/riscv-non-isa/riscv-asm-manual</a></p>

<p>RISC-V Assembly Cheat Sheet, <a href="https://projectf.io/posts/riscv-cheat-sheet/">https://projectf.io/posts/riscv-cheat-sheet/</a></p>

<p>Porting software to RISC-V, <a href="https://training.linuxfoundation.org/training/porting-software-to-risc-v-lfd114/">https://training.linuxfoundation.org/training/porting-software-to-risc-v-lfd114/</a></p>]]></content><author><name></name></author><summary type="html"><![CDATA[RISC-V64, as ARM64, is another simple reduced instruction set computer architecture. RISC-V started from the University of California, Berkeley and is owned by a non-for-profit consortium. RISC-V is not one single instruction set, but a family of base sets and extensions, e.g, “I” for the base integer instruction set, “V” for the extension for vector operations. Any base set or extension may undergo revisions, but it tends to be overall stable. Manufacturers can combine a base set with any extensions for a chip, for example “E” only for microcontrollers, but “IFD” for integer, single-width floating point and double-width floating point arithmetic.]]></summary></entry><entry><title type="html">ARM64 function calls</title><link href="https://www.brief-ds.com/2025/08/04/arm64-func.html" rel="alternate" type="text/html" title="ARM64 function calls" /><published>2025-08-04T00:00:00+00:00</published><updated>2025-08-04T00:00:00+00:00</updated><id>https://www.brief-ds.com/2025/08/04/arm64-func</id><content type="html" xml:base="https://www.brief-ds.com/2025/08/04/arm64-func.html"><![CDATA[<p>This article accompanies Lesson 10 Function Calls and Lesson 11 Stack Operations of the <a href="https://www.youtube.com/playlist?list=PLn_It163He32Ujm-l_czgEBhbJjOUgFhg">ARM64 assembly tutorial</a> of LaurieWired.</p>

<p>Refer to</p>

<ul>
  <li><a href="/2025/07/13/first-arm64-code.html">The first ARM64 assembly program</a> for how to call Linux <code class="language-plaintext highlighter-rouge">exit()</code> to end execution;</li>
  <li><a href="/2025/07/14/arm64-ldr-and-str.html">ARM64 load/store register instructions</a> for an introduction to ARM64 registers.</li>
</ul>

<h2 id="the-stack">The stack</h2>
<p>When one function calls another, data in the calling function is typically pushed onto the <a href="https://en.wikipedia.org/wiki/Stack_(abstract_data_type)">stack</a>.</p>

<p>For example, suppose that inside function <code class="language-plaintext highlighter-rouge">f</code>, prior to calling function <code class="language-plaintext highlighter-rouge">g</code>, the stack pointer register sp holds the value 0x1000, and the memory at address 0x1000 holds some value or nothing. We usually say sp is pointing at address 0x1000,</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sp -&gt; 0x1000     [empty or some value]
</code></pre></div></div>

<p>Suppose function <code class="language-plaintext highlighter-rouge">f</code> has two 64-bit registers whose values it wishes to keep a copy of, so that it still has access to them when function <code class="language-plaintext highlighter-rouge">g</code> returns. Conceptually, it will decrement sp by 0x8 twice, and store the two 64-bit values at the new addresses, each occupying 8 bytes.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>      0x1000    [empty or some value]
      0x0ff8    [value2 of f]
sp -&gt; 0x0ff0    [value1 of f]
</code></pre></div></div>

<p>The function <code class="language-plaintext highlighter-rouge">g</code> can continue decrementing sp and store more content at addresses lower than the 0x0ff0 that sp points at. But by the time it returns to <code class="language-plaintext highlighter-rouge">f</code>, sp must have been incremented back up to 0x0ff0.</p>

<p>When execution reenters <code class="language-plaintext highlighter-rouge">f</code>, conceptually, <code class="language-plaintext highlighter-rouge">f</code> can load the content sp points at back into a register, increment sp by 0x8, and repeat, until the memory looks like</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sp -&gt; 0x1000     [empty or some value]
</code></pre></div></div>

<h2 id="arm64-stack-is-16-byte-aligned">ARM64 stack is 16-byte aligned</h2>
<p>ARM64’s stack is <a href="https://hrishim.github.io/llvl_prog1_book/stack.html">16-byte aligned</a>, meaning the value stored in sp, interpreted as an address, must be a multiple of 16, or 0x10.</p>

<p>So in the above example, <em>practically</em> we cannot do</p>

<pre><code class="language-asm">sub sp, sp, #0x8
str x2, [sp]
sub sp, sp, #0x8
str x1, [sp]
</code></pre>

<p>because after the first decrementation, sp would be pointing at 0x0ff8, not a multiple of 0x10. Rather, we decrement sp by 0x10 in one go,</p>

<pre><code class="language-asm">sub sp, sp, #0x10
str x2, [sp, #0x8]
str x1, [sp]
</code></pre>

<p>These lines can be combined into</p>

<pre><code class="language-asm">stp x1, x2, [sp, #-0x10]!
</code></pre>

<p>Refer to <a href="https://devblogs.microsoft.com/oldnewthing/20220728-00/?p=106912">Aarch64 addressing mode</a> for an explanation of the above line.</p>
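<p>The reverse operation, restoring the pair and incrementing sp in one instruction, uses post-index addressing. As a sketch mirroring the <code class="language-plaintext highlighter-rouge">stp</code> above,</p>

<pre><code class="language-asm">ldp x1, x2, [sp], #0x10    // load x1 from sp, x2 from sp + 8,
                           // then set sp to sp + 0x10
</code></pre>

<p>Pre-index (<code class="language-plaintext highlighter-rouge">[sp, #-0x10]!</code>) adjusts sp before the store; post-index (<code class="language-plaintext highlighter-rouge">[sp], #0x10</code>) adjusts it after the load, which is exactly the push/pop pairing a stack needs.</p>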

<p>If there is only one 64-bit register to save, one still has to observe the 16-byte alignment, effectively leaving some memory empty. For example,</p>

<pre><code class="language-asm">sub sp, sp, #0x10
str x1, [sp]
</code></pre>

<p>The 8 bytes at addresses sp + 8 through sp + 15 would be left empty.</p>

<h2 id="an-example-with-stack-operation">An example with stack operation</h2>
<p><code class="language-plaintext highlighter-rouge">func.s</code>:</p>

<pre><code class="language-asm">.global _start

_start:
    mov x0, #1
    mov x1, #2

    stp x0, x1, [sp, #-16]!      // store pair x0 x1
                                 // store x0 to sp - 16
                                 // store x1 to sp - 8
                                 // and set sp to sp - 16
    bl add_nums
    ldp x0, x1, [sp], #16        // load pair x0 x1
                                 // load x0 from sp
                                 // load x1 from sp + 8
                                 // and set sp to sp + 16

    mov x8, #0x5d
    svc #0

add_nums:
    add x0, x0, x1
    ret
</code></pre>

<p>After the <code class="language-plaintext highlighter-rouge">add_nums</code> call, x0 holds 3, but the <code class="language-plaintext highlighter-rouge">ldp</code> instruction resets it to the old value 1.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>as <span class="nt">-o</span> func.o func.s
<span class="nv">$ </span>gcc <span class="nt">-o</span> func func.o <span class="nt">-nostdlib</span> <span class="nt">-static</span>
<span class="nv">$ </span>./func<span class="p">;</span> <span class="nb">echo</span> <span class="nv">$?</span>
1
</code></pre></div></div>
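<p>As a small variation, not part of the original lesson, if we wanted the exit status to be the sum instead, we could copy the result out of x0 before restoring the pair:</p>

<pre><code class="language-asm">    bl add_nums
    mov x2, x0                   // copy the sum out of x0 first
    ldp x0, x1, [sp], #16        // restores the old x0 = 1, x1 = 2
    mov x0, x2                   // exit status = the sum
    mov x8, #0x5d                // Linux exit() syscall
    svc #0
</code></pre>

<p>Running <code class="language-plaintext highlighter-rouge">./func; echo $?</code> with this change would print 3.</p>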

<h2 id="registers-that-must-be-preserved-by-the-callee-function">Registers that must be preserved by the callee function</h2>
<p>When one function calls another, the called function (the callee) must preserve the contents of registers x19-x29 and sp, such that just <em>before</em> <code class="language-plaintext highlighter-rouge">ret</code> from the callee, the contents of x19-x29 and sp are the same as they were at its entry.</p>

<p>A function that itself makes further calls normally also saves the link register lr, which stores the return address, together with the frame pointer fp.</p>
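<p>As a sketch (the function name is hypothetical), a callee that wants to use x19 as scratch space would save and restore it itself, keeping the stack 16-byte aligned:</p>

<pre><code class="language-asm">my_func:
    str x19, [sp, #-0x10]!    // save the callee-saved x19; 8 bytes left empty
    mov x19, #42              // now x19 is free to use inside the function
    ldr x19, [sp], #0x10      // restore x19 before returning
    ret
</code></pre>

<p>Registers x0-x18, by contrast, may be freely overwritten by the callee, so a caller that needs them must save them itself, as <code class="language-plaintext highlighter-rouge">func.s</code> above does.</p>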

<h2 id="generic-prologue-and-clean-up-code-around-function-call">Generic prologue and clean-up code around function call</h2>
<p><a href="https://devblogs.microsoft.com/oldnewthing/20220829-00/?p=107066">Aarch64 code walkthrough</a> gives, as an example, generic prologue and clean-up code placed around the body of a function.</p>

<p>Prologue:</p>

<pre><code class="language-asm">stp     x19, x20, [sp,#-0x20]!     // x19-x29 must be preserved
str     x21, [sp,#0x10]
stp     fp, lr, [sp,#-0x10]!
mov     fp, sp
</code></pre>

<p>Clean-up:</p>

<pre><code class="language-asm">ldp     fp, lr, [sp], #0x10
ldr     x21, [sp, #0x10]
ldp     x19, x20, [sp], #0x20
ret
</code></pre>

<p>In the above example, upon entry into a function, before any other code of it executes, the prologue pushes the stack to look like</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[empty 64-bit]
x21
x20
x19
lr
fp                &lt;--- the new stack pointer sp
</code></pre></div></div>

<p>and be unwound by the clean-up code.</p>

<h2 id="references">References</h2>
<p>ARM64 assembly tutorial, LaurieWired, <a href="https://www.youtube.com/playlist?list=PLn_It163He32Ujm-l_czgEBhbJjOUgFhg">https://www.youtube.com/playlist?list=PLn_It163He32Ujm-l_czgEBhbJjOUgFhg</a>.</p>

<p>Introduction to Aarch64 architecture, 8. The Stack, <a href="https://hrishim.github.io/llvl_prog1_book/stack.html">https://hrishim.github.io/llvl_prog1_book/stack.html</a>.</p>

<p>Aarch64 part 3: addressing mode, <a href="https://devblogs.microsoft.com/oldnewthing/20220728-00/?p=106912">https://devblogs.microsoft.com/oldnewthing/20220728-00/?p=106912</a>.</p>

<p>Aarch64 part 24: code walkthrough, <a href="https://devblogs.microsoft.com/oldnewthing/20220829-00/?p=107066">https://devblogs.microsoft.com/oldnewthing/20220829-00/?p=107066</a></p>

<p>First ARM64 assembly program, <a href="/2025/07/13/first-arm64-code.html">/2025/07/13/first-arm64-code.html</a>.</p>

<p>ARM64 load/store register instructions, <a href="/2025/07/14/arm64-ldr-and-str.html">/2025/07/14/arm64-ldr-and-str.html</a>.</p>

<p>Arm Compiler armasm User Guide. On <a href="https://developer.arm.com">https://developer.arm.com</a>, search for “armasm user guide”. In the result list, find the latest version of “Arm Compiler armasm User Guide”.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[This article accompanies Lesson 10 Function Calls and Lesson 11 Stack Operations of the ARM64 assembly tutorial of LaurieWired.]]></summary></entry></feed>