Introduction

As in the first post about RISC-V, recall that RISC-V divides its instruction set into base sets and extensions. The V extension is for vector operations, the workhorse of machine learning. It is very similar to ARM64 SVE (Scalable Vector Extension), with one significant difference discussed below. Readers are encouraged to read the counterpart post about ARM64 SVE.

Again, the main reference is the RISC-V instruction set manual, whose README links to the latest typeset specification of the unprivileged instructions. Section 30 covers the V extension, and Appendix C provides multiple example V extension programs.

RISC-V64 V extension

The V extension is for vector operations. It requires a 64-bit integer base instruction set: RV64I or RV64E.

The V extension defines 32 registers: v0-v31. The length of a vector register is defined by VLEN: a vector register can hold four 32-bit numbers when VLEN = 128, or eight of them when VLEN = 256. A vector register can hold either multiple integers, or multiple floating-point numbers.

As with ARM64 SVE, the vector operation is agnostic of the length of a mathematical vector: the number of elements (numbers) in it. We just load as many of them as we can from the mathematical vector into the vector register(s), compute on the vector register(s), then proceed to the next batch of elements (numbers) in the mathematical vector, until we run out of elements to compute on.
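
As a rough illustration, the same batch-by-batch structure written in plain C might look like the following. Here elements_per_batch stands in for the value a vsetvli instruction returns, and the inner loop stands in for one vector instruction; this is a sketch of the control flow, not the code a compiler would emit.

#include <stddef.h>

/* Multiply two mathematical vectors x and y of length n element-wise into z,
 * one batch at a time. elements_per_batch plays the role of the value
 * returned by vsetvli: the most elements one vector instruction can cover,
 * fewer on the final (tail) batch. */
void vmul_sketch(size_t n, const int *x, const int *y, int *z,
                 size_t elements_per_batch) {
  while (n > 0) {
    size_t vl = n < elements_per_batch ? n : elements_per_batch;
    for (size_t i = 0; i < vl; ++i)  /* one vector instruction's worth */
      z[i] = x[i] * y[i];
    n -= vl;                         /* fewer elements remain to process */
    x += vl;                         /* advance to the next batch */
    y += vl;
    z += vl;
  }
}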

Example

As in the introductory post, we first build the development toolchain on an x86-64 host: assembler, compiler, linker etc. for bare-metal programming. Programs compiled with this toolchain need only minimal support from an operating system (to manage memory etc), or from a simulator that interprets the RISC-V instructions on x86-64.

Below we multiply two vectors element-wise. If the vmul routine were more complex, stack operations to preserve integer and floating-point registers might be necessary.

vmul.s:

.global vmul

.text
vmul:
    # vmul(size_t n, int *x, int *y, int *z)
    #
    # multiplies two vectors element-wise.
    #
    # input: n, x, y
    # output: z
    #
    # n in register a0
    # x in register a1
    # y in register a2
    # z in register a3

vmul_loop:
    beq a0, zero, vmul_return     # if a0 == 0, go to vmul_return

    # given the remaining number of elements to process in a0,
    # return the maximum number of elements that can be handled
    # by one vector instruction in t0.
    #
    # e32: each element (number) in the mathematical vector
    #      is 32-bit.
    #
    # m1:  grouping of 1 vector register.
    #
    # ta:  tail agnostic, when we are computing on the "tail"
    #      of the input vector, the last few elements in it.
    #      "ta" is usually a good setting. Leave it.
    #
    # ma:  mask agnostic, when a mask vector is present to indicate
    #      which elements in the vector register enter computation,
    #      which do not. "ma" is usually a good setting. Leave it.
    #
    # If the configuration is e32 and m1, and one vector register is 128-bit,
    #
    #   t0 = 4    if a0 >= 4
    #   t0 = a0   otherwise.
    #
    # If the configuration is e32 and m4, and one vector register is 128-bit,
    #
    #   t0 = 16   if a0 >= 16
    #   t0 = a0   otherwise.
    #
    vsetvli t0, a0, e32, m1, ta, ma

    vle32.v v0, (a1)        # load from memory into vector register
    vle32.v v1, (a2)
    vmul.vv v0, v0, v1      # integer multiplication element-wise
    vse32.v v0, (a3)        # store from vector register into memory

    sub a0, a0, t0          # having processed t0 numbers,
                            # decrement the number of elements to process

    slli t0, t0, 2          # t0 32-bit numbers take t0 x 4 bytes
                            # set t0 to t0 << 2 = t0 * 4

    add a1, a1, t0          # move the pointers to point to the next
    add a2, a2, t0          # element (32-bit number) to load or save
    add a3, a3, t0

    j vmul_loop             # jump to vmul_loop

vmul_return:
    mv a0, zero             # set return value
    ret

main.c:

#include <stddef.h>
#include <stdio.h>

int vmul(size_t n, const int *x, const int *y, int *z);

int main() {
  size_t n = 6;
  int x[] = {1, 3, 4, 5, 6, 7};
  int y[] = {5, 6, 7, 8, 9, 10};
  int z[n];

  int k = vmul(n, x, y, z);

  for (size_t i = 0; i < n; ++i)
    printf("%i ", z[i]);
  printf("\n");

  return k;
}

To make the executable file,

$ /opt/rv64gcv/bin/riscv64-unknown-elf-as --march=rv64gcv -o vmul.o vmul.s
$ /opt/rv64gcv/bin/riscv64-unknown-elf-gcc -o main vmul.o main.c

As we built it with the toolchain for bare-metal programming, we can directly run it,

$ ./main
5 18 28 40 54 70

It can equally be run with a simulator that interprets RISC-V instructions on x86-64. Under Ubuntu Linux,

$ sudo apt install qemu-user
$ qemu-riscv64 -cpu rv64,v=on,vext_spec=v1.0,vlen=128 main
5 18 28 40 54 70

To adapt vmul.s for floating-point multiplication, change a single line to

    vfmul.vv v0, v0, v1

and change main.c accordingly.
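
For reference, a correspondingly adapted main.c might look like the following; it assumes the routine keeps the name vmul but now takes float pointers (a sketch):

#include <stddef.h>
#include <stdio.h>

/* the same assembly routine, now multiplying single-precision floats */
int vmul(size_t n, const float *x, const float *y, float *z);

int main() {
  size_t n = 6;
  float x[] = {1, 3, 4, 5, 6, 7};
  float y[] = {5, 6, 7, 8, 9, 10};
  float z[n];

  int k = vmul(n, x, y, z);

  for (size_t i = 0; i < n; ++i)
    printf("%g ", z[i]);
  printf("\n");

  return k;
}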

The flexibility of the V extension

In ARM64 SVE, one vector instruction is, by and large, carried out over a single vector register. In the RISC-V64 V extension, one may group multiple vector registers together for one vector instruction.

The m1 setting in the vsetvli instruction specifies that one vector instruction is carried out on one vector register only.

    # given the remaining number of elements to process in a0,
    # return the maximum number of elements that can be handled
    # by one vector instruction in t0.
    #
    vsetvli t0, a0, e32, m1, ta, ma

But we can make one vector instruction work on more than one vector register, and thus on more 32-bit elements, as reported in t0. Note in the comment,

    # If the configuration is e32 and m1, and one vector register is 128-bit,
    #
    #   t0 = 4    if a0 >= 4
    #   t0 = a0   otherwise.
    #
    # If the configuration is e32 and m4, and one vector register is 128-bit,
    #
    #   t0 = 16   if a0 >= 16
    #   t0 = a0   otherwise.
    #

In the V extension, version 1.0, one may group at most eight vector registers together (Section 30.6.1, vtype encoding), in which case,

    # If the configuration is e32 and m8, and one vector register is 128-bit,
    #
    #   t0 = 32   if a0 >= 32
    #   t0 = a0   otherwise.
    #

If one vector instruction can work on more elements, the loop requires fewer iterations to fully process the input. For example, with 128-bit vector registers and n = 1000 32-bit elements, the m1 configuration takes 250 iterations of vmul_loop, while m4 takes 63.

When four vector registers are grouped together by m4, v0 refers to the combined content of the original v0, v1, v2, v3; v4 to that of the original v4, v5, v6, v7; and so forth. The program can then only name the vector registers v0, v4, v8, …, up to v28.

Likewise, if eight vector registers are grouped by m8, the program can only name v0, v8, v16 and v24.

Therefore, if we group four vector registers by m4 to increase the number of elements handled by one vector instruction, three lines in vmul.s need to be changed:

    vsetvli t0, a0, e32, m4, ta, ma
    ...
    vle32.v v4, (a2)
    vmul.vv v0, v0, v4      # integer multiplication element-wise

The number of loop iterations drops roughly fourfold. Callers of the vmul routine, such as main.c, remain unchanged.
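
The same m4-grouped loop can also be written in C with the vector intrinsics of riscv_vector.h instead of assembly. The sketch below assumes a recent GCC or Clang that provides the ratified intrinsics under the __riscv_ prefix; it illustrates the register grouping and is not a drop-in replacement for vmul.s.

#include <stddef.h>
#include <riscv_vector.h>

/* Element-wise integer multiplication with four vector registers grouped (m4).
 * __riscv_vsetvl_e32m4 plays the role of vsetvli t0, a0, e32, m4, ta, ma. */
void vmul_m4(size_t n, const int *x, const int *y, int *z) {
  while (n > 0) {
    size_t vl = __riscv_vsetvl_e32m4(n);                /* elements this batch */
    vint32m4_t vx = __riscv_vle32_v_i32m4(x, vl);       /* vle32.v v0, (a1)    */
    vint32m4_t vy = __riscv_vle32_v_i32m4(y, vl);       /* vle32.v v4, (a2)    */
    vint32m4_t vz = __riscv_vmul_vv_i32m4(vx, vy, vl);  /* vmul.vv v0, v0, v4  */
    __riscv_vse32_v_i32m4(z, vz, vl);                   /* vse32.v v0, (a3)    */
    n -= vl;
    x += vl;
    y += vl;
    z += vl;
  }
}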

The hardware design

ETH Zürich and the Barcelona Supercomputing Centre work on hardware designs implementing the V extension. Andes Technology, SiFive and a few other companies produce commercial designs.

The Parallel Ultra Low Power (PULP) platform is a collaboration between ETH Zürich and the University of Bologna that develops open-source, scalable and energy-efficient RISC-V hardware and software. Its “Training” and “Conference talks” sections are particularly informative, covering much more than just the V extension. For example, the talk “Understanding performance numbers in Integrated Circuit Design” is an excellent exposition.

Current designs for the V extension generally impose a fixed maximum width of data processed concurrently, as with ARM64 SVE. For example, if one vector register is 128-bit and the maximum concurrent width is 256-bit, then two vector registers' worth of data is the most the hardware handles at once. If the programmer specifies m4, grouping four vector registers, the hardware actually processes the content of the first two vector registers, then that of the next two, sequentially.

It seems that hardware built from simple units, each of which can be adaptively turned on or off, could make the most of the V extension, exploiting the flexibility the specification allows.

The latest designs, such as Vitruvius+ from the Barcelona Supercomputing Centre, already run at below 1 Watt with a throughput of 80 32-bit numbers (80 x 32 = 2560 bits). A complete ARM64 or Apple M microprocessor usually runs at around 5 Watts under load. For comparison, the SVE (Scalable Vector Extension) throughput of Graviton 3 or 4, the ARM64 processors designed by Amazon, is 16 32-bit numbers (16 x 32 = 4 x 128 = 512), or 512 bits, per clock cycle.

OpenBLAS kernels with RISC-V64 V extension

In OpenBLAS, the linear algebra library, the kernel files with _rvv or _vector in the file name use the RISC-V64 V extension. For the general matrix-vector multiplication (gemv) routine, for example:

kernel/riscv64/gemv_n.c
kernel/riscv64/gemv_n_rvv.c
kernel/riscv64/gemv_n_vector.c
kernel/riscv64/gemv_t.c
kernel/riscv64/gemv_t_rvv.c
kernel/riscv64/gemv_t_vector.c
kernel/riscv64/zgemv_n.c
kernel/riscv64/zgemv_n_rvv.c
kernel/riscv64/zgemv_n_vector.c
kernel/riscv64/zgemv_t.c
kernel/riscv64/zgemv_t_rvv.c
kernel/riscv64/zgemv_t_vector.c

References

RISC-V instruction set manual, https://github.com/riscv/riscv-isa-manual, the main reference for RISC-V

… from the above GitHub repository, a link to the typeset specification of unprivileged instructions, https://riscv.github.io/riscv-isa-manual/snapshot/unprivileged/, with Section 30 on V extension, Appendix C providing multiple example V extension programs

Introduction to RISC-V, its integer registers, /2025/08/11/first-riscv64-program.html

RISC-V calling convention, its floating-point registers, https://riscv.org/wp-content/uploads/2024/12/riscv-calling.pdf

RISC-V announces ratification of the RVA23 profile standard, October 2024, https://riscv.org/blog/risc-v-announces-ratification-of-the-rva23-profile-standard/

RVA23 profiles, https://github.com/riscv/riscv-profiles/blob/main/src/rva23-profile.adoc

ARM64 SVE (Scalable Vector Extension), /2025/10/15/sve.html

PULP (Parallel Ultra Low Power) Platform, https://www.pulp-platform.org/

PULP (Parallel Ultra Low Power) training, https://www.pulp-platform.org/pulp_training.html

PULP (Parallel Ultra Low Power) conference and workshop materials, https://www.pulp-platform.org/conferences.html

…, on which for example “Understanding performance numbers in Integrated Circuit Design” was an excellent exposition.

PULP (Parallel Ultra Low Power) 10-year conference, https://pulp-platform.org/10years/

Vitruvius+: an Efficient RISC-V Decoupled Vector Coprocessor for High Performance Computing Applications, https://dl.acm.org/doi/full/10.1145/3575861, the design is already upgraded for the V extension, version 1.0 according to the author

AWS Graviton, ARM64 processor designed by Amazon with support for the SVE (Scalable Vector Extension), https://en.wikipedia.org/wiki/AWS_Graviton

OpenBLAS, https://github.com/OpenMathLib/OpenBLAS