simd crash course

intro to simd

simd is a subset of instructions that most cpus support, whose purpose is speeding up data processing.

simd stands for "single instruction, multiple data". as the name indicates, this is for algorithms that want to use the same operation on several inputs at once, which cpus handle more efficiently than processing one input per instruction.

not all scalar operations have simd counterparts, and vice versa.
additionally, the instructions that can be vectorized depend on the target platform, but most platforms support variants of the basic arithmetic operations: addition, subtraction and multiplication.

adding two slices of numbers

a basic example is adding one slice of numbers to another. a rust implementation could look like this:

use std::iter;

const N: usize = 1024;

fn add(out: &mut [f64; N], a: &[f64; N], b: &[f64; N]) {
    for (out, (a, b)) in iter::zip(out, iter::zip(a, b)) {
        *out = *a + *b;
    }
}

non-vectorized version

let's take a look at what it compiles down to (on x86_64). first, we'll ask the compiler to optimize for code size (opt-level = "s") so that there's less assembly to look at, then we'll move on to the perf-optimized version (opt-level = 3).

add:
  xor eax, eax

.LBB6_1: ; loop start
  movsd xmm0, qword ptr [rsi + 8*rax]
  addsd xmm0, qword ptr [rdx + 8*rax]
  movsd qword ptr [rdi + 8*rax], xmm0
  inc rax
  cmp rax, 1024
  jne .LBB6_1

  ret

we're mostly interested in what happens inside the loop, i.e., between .LBB6_1 and ret.

the cpu first loads (movsd → (mov)e (s)calar (d)ouble precision float) a float from memory ([rsi + 8*rax]) into xmm0, a floating point register.

  movsd xmm0, qword ptr [rsi + 8*rax]

it adds (addsd → (add) assign (s)calar (d)ouble) another float from memory ([rdx + 8*rax]) to xmm0.

  addsd xmm0, qword ptr [rdx + 8*rax]

then it stores the result to memory.

  movsd qword ptr [rdi + 8*rax], xmm0

then finally, it increments the loop counter and checks whether it should continue looping or exit instead.

  inc rax
  cmp rax, 1024
  jne .LBB6_1

now what happens if we ask the compiler to optimize for performance instead?

compiler-vectorized version

this is the assembly it spits out

add:
  xor eax, eax
.LBB0_1: ; loop start
  movupd xmm0, xmmword ptr [rsi + 8*rax]
  movupd xmm1, xmmword ptr [rsi + 8*rax + 16]
  movupd xmm2, xmmword ptr [rdx + 8*rax]
  addpd xmm2, xmm0
  movupd xmm0, xmmword ptr [rdx + 8*rax + 16]
  addpd xmm0, xmm1
  movupd xmmword ptr [rdi + 8*rax], xmm2
  movupd xmmword ptr [rdi + 8*rax + 16], xmm0
  movupd xmm0, xmmword ptr [rsi + 8*rax + 32]
  movupd xmm1, xmmword ptr [rsi + 8*rax + 48]
  movupd xmm2, xmmword ptr [rdx + 8*rax + 32]
  addpd xmm2, xmm0
  movupd xmm0, xmmword ptr [rdx + 8*rax + 48]
  addpd xmm0, xmm1
  movupd xmmword ptr [rdi + 8*rax + 32], xmm2
  movupd xmmword ptr [rdi + 8*rax + 48], xmm0
  add rax, 8
  cmp rax, 1024
  jne .LBB0_1
  ret

this is slightly more daunting than the previous example, because the compiler automatically simd-ifies the code for us, but we can still process it piece by piece.
for reference, the input a is stored in the rsi register, the input b is stored in rdx, and the output is in rdi. rax holds the loop counter.

the cpu loads multiple values from a into a floating point register xmm0: movupd → (mov)e (u)naligned (p)acked (d)ouble. in intel syntax, the source operand(s) go on the right, while the destination operand goes on the left (right after the instruction name).
an xmm register can hold 128 bits of data, which means it loads two f64 values in this case.

  movupd xmm0, xmmword ptr [rsi + 8*rax]

it then loads the next values from a to xmm1 in anticipation of the next unrolled iteration, which isn't too relevant here.

it then loads the values from b to xmm2, and adds xmm0 to them: addpd → (add) assign (p)acked (d)ouble.

  movupd xmm2, xmmword ptr [rdx + 8*rax]
  addpd xmm2, xmm0

it loads the next values from b in anticipation of the next unrolled iteration, which again isn't too interesting for now.

then finally, it stores the result to out.

  movupd xmmword ptr [rdi + 8*rax], xmm2

the interleaved instructions come from different iterations of the original loop, which the compiler unrolls into a single big loop iteration, then advances the loop counter by the number of processed elements (4 unrolled iterations × 2 floats per register = 8 elements). a rough rust-level picture of one unrolled iteration is sketched after the snippet below.

  add rax, 8
  cmp rax, 1024
  jne .LBB0_1
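
here's that rust-level sketch. it's only an illustration of the access pattern per unrolled iteration, not what the compiler literally generates; add_unrolled is just a hypothetical name, and it assumes (as here) that the length is a multiple of 8.

fn add_unrolled(out: &mut [f64; N], a: &[f64; N], b: &[f64; N]) {
    // N = 1024 is a multiple of 8, so chunks_exact covers every element
    for ((out, a), b) in out
        .chunks_exact_mut(8)
        .zip(a.chunks_exact(8))
        .zip(b.chunks_exact(8))
    {
        // each 8-element chunk corresponds to one pass through the unrolled
        // loop body above: four 2-wide loads from a, four from b,
        // four addpd additions and four stores
        for i in 0..8 {
            out[i] = a[i] + b[i];
        }
    }
}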

hand-vectorized version

instead of using direct intrinsics, we'll take the easy way out and use a simd library that abstracts away the architectural differences. faer uses pulp as a backend so we'll be looking at what the code from before would look like when written with it.

use std::iter;
use pulp::{Arch, Simd, WithSimd};

pub fn add_simd(out: &mut [f64], a: &[f64], b: &[f64]) {
    // packing our function parameters in a struct that we implement
    // a trait for is a workaround for the lack of generic closures
    // in rust
    struct Impl<'a> {
        out: &'a mut [f64],
        a: &'a [f64],
        b: &'a [f64],
    }

    // this is essentially a generic closure, `for<S: Simd> FnOnce(S) -> Output`
    impl WithSimd for Impl<'_> {
        type Output = ();

        // inlining the top level function is required for the simd intrinsics themselves
        // to be inlined, which is crucial for performance
        #[inline(always)]
        fn with_simd<S: Simd>(self, simd: S) -> Self::Output {
            // unpack the function parameters that we packed in the struct
            let Self { out, a, b } = self;

            // split the slices into vector chunks and scalar chunks
            let (out0, out1) = S::as_mut_simd_f64s(out);
            let (a0, a1) = S::as_simd_f64s(a);
            let (b0, b1) = S::as_simd_f64s(b);

            // iterate over the vector chunks
            for (out, (a, b)) in iter::zip(out0, iter::zip(a0, b0)) {
                // unfortunately can't be just `*a + *b`
                // because the `simd` struct needs to be passed in as well for the api to be sound
                *out = simd.add_f64s(*a, *b);
            }

            // iterate over the scalar chunks
            for (out, (a, b)) in iter::zip(out1, iter::zip(a1, b1)) {
                // sum up the values into the output array as usual
                *out = *a + *b;
            }
        }
    }

    Arch::new().dispatch(Impl { out, a, b })
}

it's more complex than the scalar implementation, but most of the code is boilerplate to set up the actual simd code. the interesting part can be reduced to this:

let (out0, out1) = S::as_mut_simd_f64s(out);
let (a0, a1) = S::as_simd_f64s(a);
let (b0, b1) = S::as_simd_f64s(b);

for (out, (a, b)) in iter::zip(out0, iter::zip(a0, b0)) {
    *out = simd.add_f64s(*a, *b);
}

for (out, (a, b)) in iter::zip(out1, iter::zip(a1, b1)) {
    *out = *a + *b;
}

since a simd register (also called a vector) can hold multiple elements, we need to split the input slices into a vectorizable chunk and a non-vectorizable remainder. S::as_{mut_}simd_f64s handles this for us: with 2-lane registers, for example, a slice of length 1027 splits into 513 full vectors plus one leftover scalar.

then we iterate over both chunks, using simd operations (simd.add_f64s(*a, *b)) for the former, and scalar operations for the latter (*a + *b).

outside the core implementation, we call Arch::new() to detect the best instruction set that the cpu supports at runtime, and dispatch our implementation to that version of the code.

in my case, this ends up selecting the avx version, where avx is an extension to the base simd instruction set that doubles the vector size to 256 bits. so we now process 4 f64 elements per instruction instead of 2.
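
as a quick sanity check, calling it could look something like this (a hypothetical usage sketch, with equal-length slices):

fn main() {
    let a = vec![1.0_f64; 1024];
    let b = vec![2.0_f64; 1024];
    let mut out = vec![0.0_f64; 1024];

    // detects the available instruction set at runtime and dispatches
    add_simd(&mut out, &a, &b);

    assert!(out.iter().all(|&x| x == 3.0));
}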

benchmarks

plot showing the benchmark results. the hand-vectorized version wins until size 2048, at which point the compiler-optimized version catches up. the scalar version is much slower than both.

computing the sum of a slice

now imagine we have a function like

const N: usize = 1024;

fn sum(input: &[f64; N]) -> f64 {
    input.iter().sum()
}

let's throw it at the compiler and see if it can work its magic again...

.LCPI0_0:
  .quad 0x8000000000000000
sum:
  movsd xmm0, qword ptr [rip + .LCPI0_0]
  xor eax, eax
.LBB0_1: ; loop start
  addsd xmm0, qword ptr [rdi + 8*rax]
  addsd xmm0, qword ptr [rdi + 8*rax + 8]
  addsd xmm0, qword ptr [rdi + 8*rax + 16]
  addsd xmm0, qword ptr [rdi + 8*rax + 24]
  addsd xmm0, qword ptr [rdi + 8*rax + 32]
  addsd xmm0, qword ptr [rdi + 8*rax + 40]
  addsd xmm0, qword ptr [rdi + 8*rax + 48]
  addsd xmm0, qword ptr [rdi + 8*rax + 56]
  add rax, 8
  cmp rax, 1024
  jne .LBB0_1
  ret

wait, what? it's using nothing but scalar addsd instructions. what happened to all the vectorization stuff?

turns out the compiler is actually forbidden from vectorizing this code: floating point addition is non-associative, so the compiler can't reorder the operations in a way that lets it vectorize the computation without potentially changing the result. it falls back to computing the sum sequentially, element by element.
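
to see why the order matters, here's a small example where regrouping the additions changes the result:

let (a, b, c) = (1.0e16_f64, -1.0e16_f64, 1.0_f64);

// summing left to right: (a + b) is exactly 0.0, then + c gives 1.0
assert_eq!((a + b) + c, 1.0);

// grouping differently: (b + c) rounds back to -1.0e16, so the total is 0.0
assert_eq!(a + (b + c), 0.0);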

this means we have to take matters into our own hands. let's write the pulp-accelerated version

use pulp::{Arch, Simd, WithSimd};

pub fn sum_simd(input: &[f64]) -> f64 {
    struct Impl<'a> {
        input: &'a [f64],
    }

    impl WithSimd for Impl<'_> {
        type Output = f64;

        #[inline(always)]
        fn with_simd<S: Simd>(self, simd: S) -> Self::Output {
            let Self { input } = self;

            let (input0, input1) = S::as_simd_f64s(input);

            // creates a vector register full of zeros
            let mut sum = simd.splat_f64s(0.0);

            for input in input0 {
                sum = simd.add_f64s(sum, *input);
            }

            let mut sum = simd.reduce_sum_f64s(sum);

            for input in input1 {
                sum = sum + input;
            }
            sum
        }
    }

    Arch::new().dispatch(Impl { input })
}
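
since the simd version groups the additions differently from input.iter().sum(), the two results can differ by a few ulps. a quick (hypothetical) check:

let data: Vec<f64> = (0..1024).map(|i| i as f64 * 0.1).collect();

let simd_sum = sum_simd(&data);
let scalar_sum: f64 = data.iter().sum();

// same mathematical sum, but the additions are grouped differently,
// so the rounding can differ slightly
assert!((simd_sum - scalar_sum).abs() <= 1e-9 * scalar_sum.abs());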

the compiler had more tricks up its sleeve last time though: in addition to vectorizing the code, it also unrolled the loop. unrolling with several independent accumulators breaks the dependency on a single sum register, so multiple additions can be in flight at once. let's try that for our code

use pulp::{Arch, Simd, WithSimd};

pub fn sum_simd_unroll4(input: &[f64]) -> f64 {
    struct Impl<'a> {
        input: &'a [f64],
    }

    impl WithSimd for Impl<'_> {
        type Output = f64;

        #[inline(always)]
        fn with_simd<S: Simd>(self, simd: S) -> Self::Output {
            let Self { input } = self;

            let (input0, input1) = S::as_simd_f64s(input);
            let (input04, input01) = pulp::as_arrays::<4, _>(input0);

            // create four independent vector accumulators, each full of zeros
            let mut sum0 = simd.splat_f64s(0.0);
            let mut sum1 = simd.splat_f64s(0.0);
            let mut sum2 = simd.splat_f64s(0.0);
            let mut sum3 = simd.splat_f64s(0.0);

            for [input0, input1, input2, input3] in input04 {
                sum0 = simd.add_f64s(sum0, *input0);
                sum1 = simd.add_f64s(sum1, *input1);
                sum2 = simd.add_f64s(sum2, *input2);
                sum3 = simd.add_f64s(sum3, *input3);
            }

            sum0 = simd.add_f64s(sum0, sum1);
            sum2 = simd.add_f64s(sum2, sum3);

            sum0 = simd.add_f64s(sum0, sum2);

            for input in input01 {
                sum0 = simd.add_f64s(sum0, *input);
            }

            let mut sum = simd.reduce_sum_f64s(sum0);

            for input in input1 {
                sum = sum + input;
            }
            sum
        }
    }

    Arch::new().dispatch(Impl { input })
}

benchmarks

plot showing the benchmark results. the hand-vectorized version is much faster than the scalar one.

almost a 20x speedup!

generic programming

let's say we want a version of the same code as before that works for both f64 and f32. pulp doesn't yet expose this functionality directly, so faer builds an abstraction that lets us be generic over the scalar type.

additionally, it relies on a niche rust technique called lifetime branding, which lets us index into a slice without bounds checks in a sound manner. lifetime branding is not necessarily more efficient than the iterator model with std::iter::zip, but it can make the code easier to write and read.
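
as a rough illustration of the branding idea (a simplified sketch of the general technique, not faer's actual api): an index gets bounds-checked once, when it's created, and an invariant 'brand lifetime ties it to the slice it was checked against, so later accesses can soundly skip the check.

use std::marker::PhantomData;

// an invariant lifetime brand; fn(&'brand ()) -> &'brand () makes 'brand invariant
#[derive(Copy, Clone)]
struct Brand<'brand>(PhantomData<fn(&'brand ()) -> &'brand ()>);

struct BrandedSlice<'a, 'brand> {
    slice: &'a [f64],
    brand: Brand<'brand>,
}

// an index that has already been checked against the branded slice's length
#[derive(Copy, Clone)]
struct Idx<'brand> {
    idx: usize,
    brand: Brand<'brand>,
}

impl<'a, 'brand> BrandedSlice<'a, 'brand> {
    // the bounds check happens once, when the index is created
    fn idx(&self, idx: usize) -> Option<Idx<'brand>> {
        (idx < self.slice.len()).then(|| Idx { idx, brand: self.brand })
    }

    // accessing through a branded index needs no bounds check: the brand
    // guarantees it was checked against this slice's length
    fn get(&self, idx: Idx<'brand>) -> f64 {
        unsafe { *self.slice.get_unchecked(idx.idx) }
    }
}

// every call mints a fresh, unnameable brand via the higher-ranked closure,
// so an Idx can never be mixed up with a differently-branded slice
fn with_branded<'a, R>(
    slice: &'a [f64],
    f: impl for<'brand> FnOnce(BrandedSlice<'a, 'brand>) -> R,
) -> R {
    f(BrandedSlice { slice, brand: Brand(PhantomData) })
}

inside a tight simd loop, the checks can then be hoisted out: validate the indices once up front, and the inner loop reads and writes through them with no per-access branches.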