simd crash course
intro to simd
simd is a subset of instructions, supported by most cpus, whose purpose is to speed up data processing.
simd stands for "single instruction, multiple data". as the name indicates, it's meant for algorithms that apply the same operation to several inputs at once, which cpus can handle more efficiently than processing one input per instruction.
not all scalar operations have simd counterparts, and vice versa.
additionally, the instructions that can be vectorized depend on the target platform, but most platforms support variants of the basic arithmetic operations:
addition, subtraction and multiplication.
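as a mental model, here's what a single 4-lane simd addition computes, written as ordinary scalar rust (just the semantics, not actual simd code):

let a = [1.0_f32, 2.0, 3.0, 4.0];
let b = [10.0_f32, 20.0, 30.0, 40.0];
let mut c = [0.0_f32; 4];

// a 4-lane simd add does the work of these four scalar additions in one instruction
for i in 0..4 {
    c[i] = a[i] + b[i];
}
assert_eq!(c, [11.0, 22.0, 33.0, 44.0]);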
adding two slices of numbers
a basic example is adding one slice of numbers to another. a rust implementation could look like this:
use std::iter;

const N: usize = 1024;

fn add(out: &mut [f64; N], a: &[f64; N], b: &[f64; N]) {
    for (out, (a, b)) in iter::zip(out, iter::zip(a, b)) {
        *out = *a + *b;
    }
}
non-vectorized version
let's take a look at what it compiles down to (on x86_64). at first, we can ask the compiler to optimize for code size so that we have less assembly to look at, then we can move on to the perf-optimized version.
add:
xor eax, eax
.LBB6_1: ; loop start
movsd xmm0, qword ptr [rsi + 8*rax]
addsd xmm0, qword ptr [rdx + 8*rax]
movsd qword ptr [rdi + 8*rax], xmm0
inc rax
cmp rax, 1024
jne .LBB6_1
ret
we're mostly interested in what happens inside the loop, i.e., between .LBB6_1 and ret.
the cpu first loads (movsd → (mov)e (s)calar (d)ouble-precision float) a float from memory ([rsi + 8*rax]) into xmm0, a floating point register.
movsd xmm0, qword ptr [rsi + 8*rax]
it adds (addsd → (add) assign (s)calar (d)ouble) another float from memory ([rdx + 8*rax]) to xmm0.
addsd xmm0, qword ptr [rdx + 8*rax]
then it stores the result to memory.
movsd qword ptr [rdi + 8*rax], xmm0
then finally, it increments the loop counter and checks whether it should continue looping or exit instead.
inc rax
cmp rax, 1024
jne .LBB6_1
now what happens if we ask the compiler to optimize for performance instead?
compiler-vectorized version
this is the assembly it spits out:
add:
xor eax, eax
.LBB0_1: ; loop start
movupd xmm0, xmmword ptr [rsi + 8*rax]
movupd xmm1, xmmword ptr [rsi + 8*rax + 16]
movupd xmm2, xmmword ptr [rdx + 8*rax]
addpd xmm2, xmm0
movupd xmm0, xmmword ptr [rdx + 8*rax + 16]
addpd xmm0, xmm1
movupd xmmword ptr [rdi + 8*rax], xmm2
movupd xmmword ptr [rdi + 8*rax + 16], xmm0
movupd xmm0, xmmword ptr [rsi + 8*rax + 32]
movupd xmm1, xmmword ptr [rsi + 8*rax + 48]
movupd xmm2, xmmword ptr [rdx + 8*rax + 32]
addpd xmm2, xmm0
movupd xmm0, xmmword ptr [rdx + 8*rax + 48]
addpd xmm0, xmm1
movupd xmmword ptr [rdi + 8*rax + 32], xmm2
movupd xmmword ptr [rdi + 8*rax + 48], xmm0
add rax, 8
cmp rax, 1024
jne .LBB0_1
ret
this is slightly more daunting than the previous example, because the compiler automatically simd-ifies the code for us, but we can still process it piece by piece.
for reference, the input a is stored in the rsi register, the input b is stored in rdx, and the output is in rdi. rax holds the loop counter.
the cpu loads multiple values from a into a floating point register xmm0: movupd → (mov)e (u)naligned (p)acked (d)ouble. the source operand(s) in intel syntax go on the right, while the destination operand goes on the left (right after the instruction name); for example, addpd xmm2, xmm0 computes xmm2 = xmm2 + xmm0. an xmm register can hold 128 bits of data, which means it loads two f64 values in this case.
movupd xmm0, xmmword ptr [rsi + 8*rax]
it then loads the next values from a into xmm1 in anticipation of the next iteration, which isn't too relevant here.
it then loads the values from b to xmm2, and adds xmm0 to them: addpd → (add) assign (p)acked (d)ouble.
movupd xmm2, xmmword ptr [rdx + 8*rax]
addpd xmm2, xmm0
it loads the next values from b in anticipation of the next iteration, which again isn't too interesting for now. then finally, it stores the result to out.
movupd xmmword ptr [rdi + 8*rax], xmm2
the interleaved instructions come from different iterations, which the compiler unrolls into a single big loop iteration; it then advances the loop counter by the number of processed elements (8 = 4 unrolled iterations × 2 floats per register).
add rax, 8
cmp rax, 1024
jne .LBB0_1
hand-vectorized version
instead of using direct intrinsics, we'll take the easy way out and use a simd library that abstracts away the architectural differences. faer uses pulp as a backend, so we'll be looking at what the code from before would look like when written with it.
use pulp::{Arch, Simd, WithSimd};
use std::iter;

pub fn add_simd(out: &mut [f64], a: &[f64], b: &[f64]) {
    // packing our function parameters in a struct that we implement
    // a trait for is a workaround for the lack of generic closures
    // in rust
    struct Impl<'a> {
        out: &'a mut [f64],
        a: &'a [f64],
        b: &'a [f64],
    }

    // this is essentially a generic closure, `for<S: Simd> FnOnce(S) -> Output`
    impl WithSimd for Impl<'_> {
        type Output = ();

        // inlining the top level function is required for the simd intrinsics themselves
        // to be inlined, which is crucial for performance
        #[inline(always)]
        fn with_simd<S: Simd>(self, simd: S) -> Self::Output {
            // unpack the function parameters that we packed in the struct
            let Self { out, a, b } = self;

            // split the slices into vector chunks and scalar chunks
            let (out0, out1) = S::as_mut_simd_f64s(out);
            let (a0, a1) = S::as_simd_f64s(a);
            let (b0, b1) = S::as_simd_f64s(b);

            // iterate over the vector chunks
            for (out, (a, b)) in iter::zip(out0, iter::zip(a0, b0)) {
                // unfortunately can't be just `*a + *b`, because the `simd` struct
                // needs to be passed in as well for the api to be sound
                *out = simd.add_f64s(*a, *b);
            }

            // iterate over the scalar chunks
            for (out, (a, b)) in iter::zip(out1, iter::zip(a1, b1)) {
                // sum up the values into the output array as usual
                *out = *a + *b;
            }
        }
    }

    Arch::new().dispatch(Impl { out, a, b })
}
it's more complex than the scalar implementation, but most of the code is boilerplate to set up the actual simd code. the interesting part can be reduced to this:
let (out0, out1) = S::as_mut_simd_f64s(out);
let (a0, a1) = S::as_simd_f64s(a);
let (b0, b1) = S::as_simd_f64s(b);

for (out, (a, b)) in iter::zip(out0, iter::zip(a0, b0)) {
    *out = simd.add_f64s(*a, *b);
}
for (out, (a, b)) in iter::zip(out1, iter::zip(a1, b1)) {
    *out = *a + *b;
}
since a simd register (also called a vector) can hold multiple elements, we need to split the input slices into a vectorizable chunk and a non-vectorizable chunk. S::as_{mut_}simd_f64s handles this for us.
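as a rough analogy (not what pulp actually does internally, just an illustration that assumes 2 lanes per vector), the split behaves like chunks_exact and its remainder in plain rust:

let data = [1.0_f64; 1027];
let chunks = data.chunks_exact(2);

// 513 two-element "vectors", plus 1 leftover scalar element
assert_eq!(chunks.len(), 513);
assert_eq!(chunks.remainder().len(), 1);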
then we iterate over both chunks, using simd operations (simd.add_f64s(*a, *b)) for the former, and scalar operations (*a + *b) for the latter.
outside the core implementation, we call Arch::new() to detect the best instruction set that the cpu supports at runtime, and dispatch our implementation to that version of the code. in my case, this ends up selecting the avx version, where avx is an extension to the base simd instruction set that doubles the vector size to 256 bits. so we now process 4 elements per instruction instead of 2.
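here's a quick usage sketch for the function above; the length is deliberately not a multiple of the vector width, so both the vector loop and the scalar loop get exercised:

let a = vec![1.0_f64; 1027];
let b = vec![2.0_f64; 1027];
let mut out = vec![0.0_f64; 1027];

add_simd(&mut out, &a, &b);
assert!(out.iter().all(|&x| x == 3.0));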
benchmarks
computing the sum of a slice
now imagine we have a function like
const N: usize = 1024;

fn sum(input: &[f64; N]) -> f64 {
    input.iter().sum()
}
let's throw it at the compiler and see if it can work its magic on it again...
.LCPI0_0:
.quad 0x8000000000000000
sum:
movsd xmm0, qword ptr [rip + .LCPI0_0]
xor eax, eax
.LBB0_1: ; loop start
addsd xmm0, qword ptr [rdi + 8*rax]
addsd xmm0, qword ptr [rdi + 8*rax + 8]
addsd xmm0, qword ptr [rdi + 8*rax + 16]
addsd xmm0, qword ptr [rdi + 8*rax + 24]
addsd xmm0, qword ptr [rdi + 8*rax + 32]
addsd xmm0, qword ptr [rdi + 8*rax + 40]
addsd xmm0, qword ptr [rdi + 8*rax + 48]
addsd xmm0, qword ptr [rdi + 8*rax + 56]
add rax, 8
cmp rax, 1024
jne .LBB0_1
ret
wait, what? it's using nothing but scalar addsd instructions. what happened to all the vectorization stuff?
turns out the compiler is actually forbidden from vectorizing this code. because floating point arithmetic is non-associative, the compiler can't reorder the operations in a way that lets it vectorize the computation, so it falls back to computing it sequentially, element by element.
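to see why reordering isn't allowed, here's a tiny example where the grouping of additions changes the result, because each addition rounds before the next one happens:

let left_to_right: f64 = (0.1 + 0.2) + 0.3;
let reordered: f64 = 0.1 + (0.2 + 0.3);

// the two groupings round differently, so the results aren't bit-identical
assert_ne!(left_to_right, reordered);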
this means we have to take matters into our own hands. let's write the pulp-accelerated version:
use pulp::{Arch, Simd, WithSimd};

pub fn sum_simd(input: &[f64]) -> f64 {
    struct Impl<'a> {
        input: &'a [f64],
    }

    impl WithSimd for Impl<'_> {
        type Output = f64;

        #[inline(always)]
        fn with_simd<S: Simd>(self, simd: S) -> Self::Output {
            let Self { input } = self;

            // split the slice into vector chunks and scalar chunks
            let (input0, input1) = S::as_simd_f64s(input);

            // creates a vector register full of zeros
            let mut sum = simd.splat_f64s(0.0);

            // accumulate the vector chunks lane-wise
            for input in input0 {
                sum = simd.add_f64s(sum, *input);
            }

            // reduce the lanes of the vector accumulator to a single scalar
            let mut sum = simd.reduce_sum_f64s(sum);

            // then add the leftover scalar elements
            for input in input1 {
                sum = sum + input;
            }

            sum
        }
    }

    Arch::new().dispatch(Impl { input })
}
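note that because the vectorized version adds the elements in a different order, its result can differ from the scalar sum in the last few bits. a quick sanity check might look like this (the tolerance is an arbitrary choice for illustration):

let data: Vec<f64> = (0..1024).map(|i| i as f64 * 0.1).collect();

let scalar: f64 = data.iter().sum();
let vectorized = sum_simd(&data);

// same mathematical sum, potentially different rounding
assert!((scalar - vectorized).abs() <= 1e-9 * scalar.abs());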
the compiler had more tricks up its sleeve the last time though: in addition to vectorizing the code, it also unrolled the loop. let's try that for our code, using several independent accumulators so that consecutive additions don't all have to wait on a single dependency chain:
use pulp::{Arch, Simd, WithSimd};

pub fn sum_simd_unroll4(input: &[f64]) -> f64 {
    struct Impl<'a> {
        input: &'a [f64],
    }

    impl WithSimd for Impl<'_> {
        type Output = f64;

        #[inline(always)]
        fn with_simd<S: Simd>(self, simd: S) -> Self::Output {
            let Self { input } = self;

            // split the slice into vector chunks and scalar chunks
            let (input0, input1) = S::as_simd_f64s(input);

            // group the vector chunks into arrays of 4 for unrolling;
            // `input01` holds the vector chunks that don't fit in a group of 4
            let (input04, input01) = pulp::as_arrays::<4, _>(input0);

            // creates a vector register full of zeros
            let mut sum0 = simd.splat_f64s(0.0);
            let mut sum1 = simd.splat_f64s(0.0);
            let mut sum2 = simd.splat_f64s(0.0);
            let mut sum3 = simd.splat_f64s(0.0);

            // four independent accumulators
            for [input0, input1, input2, input3] in input04 {
                sum0 = simd.add_f64s(sum0, *input0);
                sum1 = simd.add_f64s(sum1, *input1);
                sum2 = simd.add_f64s(sum2, *input2);
                sum3 = simd.add_f64s(sum3, *input3);
            }

            // combine the four accumulators into one
            sum0 = simd.add_f64s(sum0, sum1);
            sum2 = simd.add_f64s(sum2, sum3);
            sum0 = simd.add_f64s(sum0, sum2);

            // handle the vector chunks left over from the unrolling
            for input in input01 {
                sum0 = simd.add_f64s(sum0, *input);
            }

            // reduce the lanes of the vector accumulator to a single scalar
            let mut sum = simd.reduce_sum_f64s(sum0);

            // then add the leftover scalar elements
            for input in input1 {
                sum = sum + input;
            }

            sum
        }
    }

    Arch::new().dispatch(Impl { input })
}
benchmarks
almost a 20x speedup!
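the exact numbers depend on the cpu and on which instruction set gets selected at runtime. for reference, a minimal timing harness might look like the sketch below (std::time is used to keep it short; a proper benchmark would use a framework like criterion):

use std::time::Instant;

fn time(name: &str, mut f: impl FnMut() -> f64) {
    let iters = 100_000u32;
    let mut checksum = 0.0;

    // warm up
    for _ in 0..100 {
        checksum += f();
    }

    let start = Instant::now();
    for _ in 0..iters {
        checksum += f();
    }

    // printing the checksum keeps the compiler from optimizing the work away
    println!("{name}: {:?} per call (checksum {checksum})", start.elapsed() / iters);
}

fn main() {
    let data = vec![1.0_f64; 1024];

    time("scalar", || data.iter().sum());
    time("simd", || sum_simd(&data));
    time("simd unroll4", || sum_simd_unroll4(&data));
}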
generic programming
let's say we want an equivalent of the code from before that works for both f64 and f32.
pulp doesn't yet expose this functionality, so faer builds an abstraction that lets us be generic over the scalar type.
additionally, it relies on a niche rust technique called lifetime branding, which allows us to index into a slice without bounds checks in a sound manner.
lifetime branding is not necessarily more efficient than the iterator model with std::iter::zip, but it can make the code easier to write and read.
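faer's actual abstraction is more involved, but here's a minimal sketch of the branding idea itself: a fresh, invariant lifetime ties each index to the slice it was bounds-checked against, so the unchecked access inside get can't be fed an index that was checked against a different slice. (the names here are made up for the example.)

use core::marker::PhantomData;

// an invariant lifetime marker: `'brand` can't be shrunk or grown,
// so two different brands can never be confused with one another
#[derive(Clone, Copy)]
struct Brand<'brand>(PhantomData<fn(&'brand ()) -> &'brand ()>);

struct BrandedSlice<'brand, 'a, T> {
    slice: &'a [T],
    brand: Brand<'brand>,
}

// an index that has already been bounds-checked against the slice
// carrying the same brand
#[derive(Clone, Copy)]
struct BrandedIdx<'brand> {
    idx: usize,
    brand: Brand<'brand>,
}

impl<'brand, 'a, T> BrandedSlice<'brand, 'a, T> {
    // the only way to create a branded index is to bounds-check it here
    fn idx(&self, idx: usize) -> Option<BrandedIdx<'brand>> {
        (idx < self.slice.len()).then(|| BrandedIdx { idx, brand: self.brand })
    }

    // no bounds check needed: the brand proves the index was already
    // checked against this slice's length
    fn get(&self, idx: BrandedIdx<'brand>) -> &'a T {
        unsafe { self.slice.get_unchecked(idx.idx) }
    }
}

// the closure must accept *any* brand, so each call gets a fresh one
// that can't be mixed up with the brand from another call
fn with_branded<'a, T, R>(
    slice: &'a [T],
    f: impl for<'brand> FnOnce(BrandedSlice<'brand, 'a, T>) -> R,
) -> R {
    f(BrandedSlice { slice, brand: Brand(PhantomData) })
}

fn first_or_zero(data: &[f64]) -> f64 {
    with_branded(data, |data| match data.idx(0) {
        Some(i) => *data.get(i),
        None => 0.0,
    })
}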