At different levels of abstraction, we have three ways to make use of the SIMD capabilities provided by the hardware:
Inline assembly instructions
Intrinsics
Compiler auto-vectorization
Assembly Instructions
Assembly instructions are the ultimate weapon. Every CPU manufacturer provides tedious but comprehensive manuals for developers to read. Assembly instructions are closest to the hardware, thus providing the most flexibility and control. But nothing is free: the cost is poor portability, since assembly syntax and calling conventions differ across architectures.
As a result, hand-written assembly is often avoided. If it interests you anyway, x86inc.asm[1] is a good starting point.
Intrinsics
To spare developers from writing assembly code directly, hardware and compiler vendors expose higher-level data types and functions as compiler built-in intrinsics, hiding some of the complexity. As part of the standard toolchain, they ship with modern compilers such as GCC, Clang, and MSVC. However, the intrinsics discussed here are x86-specific, and extremely precise low-level control may still be out of reach.
They are ready to use after simply including the appropriate header files. Here is a sample code snippet adding two vectors with SIMD intrinsics, each of which holds four single-precision float numbers.
Streaming SIMD Extensions (SSE) of the x86 architecture was designed by Intel and introduced in 1999 with the Pentium III series. SSE provides XMM registers, each 128 bits wide: 8 of them (XMM0-XMM7) in 32-bit mode, extended to 16 (XMM0-XMM15) in x86-64. SSE adds three typedefs, __m128, __m128d and __m128i, for holding float, double (d) and integer (i) data respectively. Depending on the size of the data type T, the number of elements that can be processed simultaneously is 128 divided by the bit width of T. For example, SSE instructions can produce results for 4 float numbers (32 bits each) or 2 double numbers (64 bits each) at a time.
As the successor and enhancement to SSE, AVX has 16 registers, named YMM0-YMM15, each of which is 256 bits long. Later, in July 2013, AVX-512 was announced, extending the 256-bit AVX SIMD instructions: it has 32 512-bit registers, ZMM0-ZMM31.
To stay compatible with legacy instructions, the XMM, YMM and ZMM registers overlap: each XMM register is treated as the lower half of the corresponding YMM register. This can introduce performance issues when mixing SSE and AVX code. The same applies to the ZMM registers of AVX-512. Read more in Advanced Vector Extensions.
SIMD Registers
Across different generations of SIMD instructions, there are common naming conventions for the intrinsic data types and functions discussed below.
SIMD register types are named __m<bit_width><data_type>. As an example, a __m256d struct holds four 64-bit double numbers.
The data_type suffix represents different operand types:
d: double precision floating point
i: integer
no suffix: single precision floating point
Intrinsic Functions
Intrinsic functions are named _mm<bit_width>_<operation>_<suffix>:
The first one or two letters of the suffix indicate the data format:
p: packed data
ep: extended packed data
s: scalar data
The remaining letters in the suffix indicate the data type the instruction operates on:
s: single-precision floating point
d: double-precision floating point
i: signed integer of 8, 16, 32, 64 or 128 bits
u: unsigned integer of 8, 16, 32, 64 or 128 bits
As an example, the function call __m256d _mm256_add_pd(__m256d a, __m256d b) adds two vectors of 4 double-precision numbers and returns a vector of 4 double-precision numbers. Here are lists of x86 intrinsics:
To enable these SIMD instructions, the OS, compiler, and hardware all need to provide support. Almost all modern OSes (Linux 5.3+, Windows 7+) and compilers (GCC 8+, Clang 6.0+, MSVC of Visual Studio 2017+) already support AVX-512.
Auto vectorization
Modern compilers offer auto-vectorization capabilities for loops. Developers write code as before, and the compiler decides whether it can leverage low-level SIMD instructions to optimize execution. By default, optimization flags like -O2 or above allow the compiler to seek such vectorization opportunities.
If the loops can be turned into SIMD instructions on your platform, g++ will generate logs reporting the result (for example via the -fopt-info-vec flag).