ARM64 SVE (Scalable Vector Extension)
Introduction
A traditional computer architecture stores one number in one register and operates on registers one at a time. For artificial neural networks, vector operations are inevitable, e.g. multiplying a 50-element vector by another 50-element vector element-wise (one “element” is one number).
Scalable Vector Extension
SVE for short, it introduces vector registers Z0-Z31. The length of a vector register is set by the chip designer: a 128-bit vector register can hold four 32-bit numbers, a 256-bit vector register eight of them. A vector register Zn can hold either multiple integers or multiple floating-point numbers, unlike the scalar registers, where Xn holds integers and Dn holds floating-point numbers.
Suppose that, on an architecture with 256-bit vector registers, we want to compute the element-wise product of an 18-element vector and another 18-element vector, each element being a 32-bit number.
With SVE, we will
- load the first eight elements of both vectors (stored in the memory) into two vector registers (on the CPU), multiply them element-wise, write out the result;
- load the next eight elements into the vector registers, do the same;
- in the last iteration, load the remaining two elements of both vectors, and do the same.
A key concept is the predicate, which indicates, for each element of the vector register, whether that element takes part in the computation. In the first and second iterations above, the predicate is all on:
1 1 1 1 1 1 1 1
but in the last iteration it is on only for the first two elements, so the element-wise multiplication still proceeds correctly:
1 1 0 0 0 0 0 0
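The strip-mined loop and the predicate can be modelled in plain Python with NumPy. This is a software sketch of the hardware behaviour, not real SVE; VL here stands in for the number of 32-bit elements per vector register:

```python
import numpy as np

def vmul_model(a, b, VL=8):
    """Element-wise multiply two n-element arrays, VL elements per
    iteration, with a predicate mask -- mimicking an SVE loop."""
    n = len(a)
    c = np.zeros(n, dtype=a.dtype)
    i = 0
    while True:
        # the predicate: p[j] = 1 iff i + j < n
        p = (np.arange(VL) + i) < n
        if not p.any():          # no active element: the loop is done
            break
        active = int(p.sum())    # number of elements left to process
        # "load": active lanes come from memory, inactive lanes are zeroed
        za = np.zeros(VL, dtype=a.dtype)
        zb = np.zeros(VL, dtype=a.dtype)
        za[:active] = a[i:i + active]
        zb[:active] = b[i:i + active]
        # "multiply" under the predicate, then "store" only active lanes
        c[i:i + active] = (za * zb)[:active]
        i += VL
    return c

a = np.arange(1, 19, dtype=np.int32)     # 18 elements, as in the text
b = a[::-1].copy()
print(vmul_model(a, b))
```

The exit-when-no-lane-is-active structure is the same shape as the assembly example below, where the loop terminates as soon as the predicate is empty.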
Example
If the vmul routine were more complex, stack operations to preserve integer and floating-point registers might be necessary.
vmul.s:
.global vmul
.text
vmul:
// vmul(n, a, b, c)
// input: n in x0, pointer a in x1, pointer b in x2
// output: pointer c in x3
//
// a, b, c point to arrays of 32-bit numbers.
// The three arrays will be of the same data type: either int32
// or float32, and of the same length n.
// count of "words" in one vector register
// a word is 32-bit
// if one vector register is 256-bit, x4 = 8;
// if one vector register is 128-bit, x4 = 4.
cntw x4
// index of the current number to be processed
// in any input array
mov x5, #0
vmul_loop:
// p0 the "predicate", array of length x4.
//
// for each 0 <= i < x4,
// p0[i] = 1 if x5 + i < x0,
// p0[i] = 0 otherwise.
whilelo p0.s, x5, x0
// if no 1 in the predicate, go to vmul_return
b.none vmul_return
// load into vector register z1 the numbers from the input array
// where the predicate is on; load zero where the predicate is off.
//
// the memory address of the first number to be processed:
// x1 + x5 * 4 = x1 + (x5 << 2)
// as each number is 32-bit, or 4 bytes in this program.
ld1w z1.s, p0/z, [x1, x5, lsl #2] // lsl: logical shift left
ld1w z2.s, p0/z, [x2, x5, lsl #2]
// multiply two vector registers
mul z1.s, p0/m, z1.s, z2.s
// store the result back into the main memory
st1w z1.s, p0, [x3, x5, lsl #2]
// now that x4 numbers have been processed,
// increment the index of the current number
add x5, x5, x4
// branch (jump) to the vmul_loop label
b vmul_loop
vmul_return:
ret
vmul.py:
import ctypes
from numpy import array, empty_like
a = array([1, 2, 3, 4, 5, 6, 7, 8, 9]).astype('int32')
b = array([9, 8, 7, 6, 5, 4, 3, 2, 1]).astype('int32')
c = empty_like(a)
# locate the vmul routine
lib = ctypes.CDLL('./vmul.so')
vmul = lib.vmul
# vmul's function signature, akin to C
# the first three are input arguments, the fourth is output
int_ptr = ctypes.POINTER(ctypes.c_int)
vmul.argtypes = [ctypes.c_size_t, int_ptr, int_ptr, int_ptr]
# call vmul
vmul(a.size, a.ctypes.data_as(int_ptr),
b.ctypes.data_as(int_ptr), c.ctypes.data_as(int_ptr))
print(c)
It takes an AArch64 machine with SVE to run. I started a c7g (Graviton3) instance in AWS running Linux. In the shell,
$ as -march=armv8-a+sve -o vmul.o vmul.s
$ gcc -shared -o vmul.so -fPIC vmul.o
$ python3 vmul.py
[ 9 16 21 24 25 24 21 16 9]
The above example runs on integers. To adapt it for floating-point numbers, change mul to fmul in the assembly code, and make a minimal change in the Python code; interested readers are encouraged to work it out.
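For reference, a sketch of the Python-side changes: float32 arrays and a c_float pointer type, with the ctypes call keeping the same shape as in vmul.py. The expected result is computed here with a plain NumPy product, which the fmul build of vmul should match element-wise:

```python
import ctypes
import numpy as np

# change 1: float32 arrays instead of int32
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='float32')
b = np.array([9, 8, 7, 6, 5, 4, 3, 2, 1], dtype='float32')

# change 2: a c_float pointer type instead of c_int
float_ptr = ctypes.POINTER(ctypes.c_float)

# what the fmul build of vmul should produce, as a NumPy reference
expected = a * b
print(expected)
```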
SVE throughput
In the Wikipedia article on AWS Graviton, the ARM64 processors designed by Amazon with SVE support, Graviton3 is marked “2x256 SVE” and Graviton4 “4x128 SVE”. What does that mean?
- Graviton3's SVE is configured with 256-bit vector registers, and it can execute 2 SVE operations per processor clock cycle, so the throughput is 256x2 = 512 bits per clock cycle;
- Graviton4's SVE is configured with narrower 128-bit vector registers, but it can execute 4 SVE operations per clock cycle, so the throughput is still 128x4 = 512 bits per clock cycle.
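The arithmetic can be written out as a back-of-the-envelope sketch; the vector widths and per-cycle operation counts are from the Wikipedia article cited below:

```python
# per-cycle SVE throughput = vector width (bits) x operations per cycle
graviton3 = 256 * 2   # 2 operations on 256-bit vectors
graviton4 = 128 * 4   # 4 operations on 128-bit vectors
print(graviton3, graviton4)              # 512 512 bits per cycle either way

# in terms of 32-bit words, the element size used in vmul above:
print(graviton3 // 32, graviton4 // 32)  # 16 16 words per cycle
```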
OpenBLAS kernels with SVE
In OpenBLAS, the linear algebra library, taking the general matrix-vector multiplication (gemv) routines as an example, the files with _sve in their names use SVE:
kernel/arm64/bgemv_n_sve_v3x4.c
kernel/arm64/gemv_n.S
kernel/arm64/gemv_n_sve.c
kernel/arm64/gemv_n_sve_v1x3.c
kernel/arm64/gemv_n_sve_v4x3.c
kernel/arm64/gemv_t.S
kernel/arm64/gemv_t_sve.c
kernel/arm64/gemv_t_sve_v1x3.c
kernel/arm64/gemv_t_sve_v4x3.c
kernel/arm64/sbgemv_n_neon.c
kernel/arm64/sbgemv_t_bfdot.c
kernel/arm64/sgemv_n_neon.c
kernel/arm64/zgemv_n.S
kernel/arm64/zgemv_t.S
References
AArch64 registers for integers, /2025/07/14/arm64-ldr-and-str.html
AArch64 registers for floating-point numbers, /2025/08/12/arm64-floating-point.html
SVE architecture, https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions
Stony Brook University AArch64 and SVE tutorials, https://www.stonybrook.edu/commcms/ookami/support/index_links_and_docs.php
A case study in vectorising using SVE, https://community.arm.com/arm-community-blogs/b/servers-and-cloud-computing-blog/posts/a-case-study-in-vectorizing-haccmk-using-sve, which mentions SVE throughput
AWS Graviton, ARM64 processors designed by Amazon with support for SVE (Scalable Vector Extension), https://en.wikipedia.org/wiki/AWS_Graviton
OpenBLAS, https://github.com/OpenMathLib/OpenBLAS