At different levels of abstraction, we have three ways to make use of the SIMD capabilities provided by the hardware:
Inline assembly instructions
Intrinsics
Compiler auto-vectorization
Assembly Instructions
Assembly instructions are the ultimate weapon. Every CPU manufacturer provides tedious but comprehensive manuals for developers to read. Assembly instructions are closest to the hardware, thus providing the most flexibility and control. But nothing is free: the cost is poor portability, since assembly syntax and calling conventions differ across architectures.
As a result, hand-written assembly is often avoided. If it interests you anyway, x86inc.asm[1] is a good starting point.
Intrinsics
To spare developers from writing assembly code directly, hardware and compiler vendors expose higher-level data types and functions as compiler built-in intrinsics, hiding some of the complexity. As part of the standard toolchain, they ship with modern compilers such as GCC, Clang, and MSVC. However, the intrinsics discussed here are x86-specific, and extremely precise low-level control may still be out of reach.
They are ready to use after simply including the appropriate header files. Here is a sample code snippet adding two vectors with SIMD intrinsics, each of which holds four single-precision float numbers.
Streaming SIMD Extensions (SSE) of the x86 architecture was designed by Intel and introduced in 1999 with the Pentium III series. SSE provides XMM registers, each 128 bits wide: 8 of them (XMM0-XMM7) in 32-bit mode, extended to 16 (XMM0-XMM15) in x86-64. SSE adds three typedefs, __m128, __m128d and __m128i, for holding float, double (d) and integer (i) data respectively. Depending on the size of the data type T, the number of elements that can be processed simultaneously is 128 divided by the bit width of T. For example, SSE instructions can produce results for 4 float numbers (32 bits each) or 2 double numbers (64 bits each) at a time.
As the successor and enhancement to SSE, AVX has 16 registers, named YMM0-YMM15, each of which is 256 bits long. Later, in July 2013, AVX-512 was announced, extending the 256-bit AVX SIMD instructions: it has 32 512-bit registers, ZMM0-ZMM31.
To stay compatible with legacy instructions, the XMM, YMM and ZMM registers overlap: each XMM register is treated as the lower half of the corresponding YMM register. This can introduce performance issues when mixing SSE and AVX code. The same applies to the ZMM registers of AVX-512. Read more in Advanced Vector Extensions.
SIMD Registers
Across different generations of SIMD instructions, there are common naming conventions for the intrinsic data types and functions discussed below.
SIMD register types are named __m<bit_width><data_type>. As an example, a __m256d struct holds four 64-bit double numbers.
The data_type suffix represents different operand types:
d: double precision floating point
i: integer
no suffix: single precision floating point
Intrinsic Functions
Intrinsic functions are named _mm<bit_width>_<operation>_<suffix>:
The first one or two letters of the suffix indicate the data format:
p: packed data
ep: extended packed data
s: scalar data
The remaining letters in the suffix indicate the data type the instruction operates on:
s: single-precision floating point
d: double-precision floating point
i: signed integer of 8, 16, 32, 64 or 128 bits
u: unsigned integer of 8, 16, 32, 64 or 128 bits
As an example, the function call __m256d _mm256_add_pd(__m256d a, __m256d b) adds two vectors of 4 double-precision numbers and returns a vector of 4 double-precision numbers. Here are lists of x86 intrinsics:
To enable these SIMD instructions, the OS, compiler, and hardware all need to provide support. Almost all modern OSes (Linux 5.3+, Windows 7+) and compilers (GCC 8+, Clang 6.0+, MSVC of Visual Studio 2017+) already support AVX-512.
Auto vectorization
Modern compilers offer auto-vectorization capabilities for loops. Developers write code as before, and the compiler decides whether it can leverage low-level SIMD instructions to optimize execution. By default, optimization flags like -O2 or above allow the compiler to seek such vectorization opportunities.
If the loops can be turned into SIMD instructions on your platform, g++ will generate logs reporting the result (for example via the -fopt-info-vec flag).