Quant and AVX
AVX registers have a fixed width. A 256-bit AVX2 register holds at most 8 floating point numbers (32 bits each) or 32 int8 numbers at a time. Quantization therefore offers an opportunity to process more numbers per instruction, since it shortens each number, e.g., from a 32-bit floating point value to an 8-bit integer. However, there are still some imperfect aspects.
For 32-bit numbers, whether floating point or integer, an AVX2 FMA instruction can handle 8 of them at once. But 8-bit integers overflow very easily, so larger 16-bit or even 32-bit numbers have to be used to store the intermediate and final results. The following is the process an Intel Xeon Scalable processor uses to handle 8-bit integers with FMA. It requires three dependent instructions:
VPMADDUBSW. u8 * s8 -> s16 multiply. Multiplies 8-bit unsigned integers by 8-bit signed integers and sums each adjacent pair of products into a saturated 16-bit result.
VPMADDWD. s16 -> s32 widen. Multiplies each 16-bit result by a vector of ones and sums adjacent pairs into 32-bit numbers.
VPADDD. s32 -> s32 add. Accumulates the 32-bit results into the accumulator.
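As a rough scalar sketch of what this three-instruction sequence computes per 32-bit accumulator lane (the names `sat16` and `dot_u8s8_lane` are illustrative helpers, not real intrinsics):

```c
#include <stdint.h>

/* Saturate a 32-bit value to the signed 16-bit range, as VPMADDUBSW does. */
static int16_t sat16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* One 32-bit accumulator lane covers 4 bytes of each input vector. */
static int32_t dot_u8s8_lane(const uint8_t a[4], const int8_t b[4], int32_t acc) {
    /* VPMADDUBSW: u8*s8 products, adjacent pairs summed with s16 saturation */
    int16_t t0 = sat16((int32_t)a[0] * b[0] + (int32_t)a[1] * b[1]);
    int16_t t1 = sat16((int32_t)a[2] * b[2] + (int32_t)a[3] * b[3]);
    /* VPMADDWD with a vector of ones: widen the s16 pair and sum into s32 */
    int32_t s = (int32_t)t0 + (int32_t)t1;
    /* VPADDD: add into the running 32-bit accumulator */
    return acc + s;
}
```

For example, with a = {1, 2, 3, 4}, b = {10, -10, 5, 5}, and acc = 0, the lane computes 1*10 - 2*10 + 3*5 + 4*5 = 25.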

This allows 4x more input than fp32 at the cost of 3x as many instructions, i.e., only about a 33.3% theoretical compute gain, with 1/4 the memory requirement. The reduced memory traffic and the higher frequency available at lower precision make it faster still.
For 16-bit input, FMA requires 2 dependent instructions, so 2x more input at the cost of 2x as many instructions results in no compute gain:
VPMADDWD. s16 * s16 -> s32 multiply. Multiplies 16-bit signed integers and sums adjacent pairs of products into 32-bit results.
VPADDD. s32 -> s32 add. Accumulates the results.
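In the same scalar-sketch style (again a hypothetical helper name, not a real intrinsic), the 16-bit path per 32-bit accumulator lane is:

```c
#include <stdint.h>

/* One 32-bit accumulator lane covers 2 s16 elements of each input. */
static int32_t dot_s16_lane(const int16_t a[2], const int16_t b[2], int32_t acc) {
    /* VPMADDWD: s16*s16 products summed pairwise into s32 */
    int32_t s = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    /* VPADDD: accumulate */
    return acc + s;
}
```

With a = {1000, -2000}, b = {3, 4}, and acc = 10, the lane computes 3000 - 8000 + 10 = -4990.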

Once again, do not take it for granted that 8-bit data is 4x faster to handle than 32-bit, or that 16-bit is 2x faster; without instruction fusion, the extra instructions eat most of the gain. This is a common misconception worth correcting.
AVX512 VNNI
Recently, low precision 8-bit and 16-bit data types have been used successfully for deep learning inference. Many tight neural network loops repeatedly multiply two 8-bit or 16-bit values and accumulate the result into a 32-bit accumulator. With AVX-512 Foundation, this takes two or three instructions. Speeding up this kind of operation is the major motivation behind the VNNI extension.
The AVX512 VNNI x86 extension extends AVX-512 Foundation by introducing four new instructions for accelerating inner convolutional neural network loops.
VPDPBUSD fuses VPMADDUBSW, VPMADDWD, and VPADDD
VPDPWSSD fuses VPMADDWD and VPADDD
The other two new instructions, VPDPBUSDS and VPDPWSSDS, are identical to the above two except that they saturate the accumulated result instead of wrapping on overflow.
AVX512 VNNI enables 8-bit multiplies with 32-bit accumulation in a single instruction: the VPMADDUBSW, VPMADDWD, and VPADDD instructions are fused into the VPDPBUSD instruction (u8 x s8 -> s32). This allows 4x more input than fp32, a theoretical peak of 4x the throughput, and 1/4 the memory requirement.
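A scalar sketch of one VPDPBUSD lane (hypothetical name): the four u8*s8 products are accumulated directly into the 32-bit lane, with no intermediate 16-bit saturation step as in the three-instruction sequence, so results can differ when an intermediate sum would have saturated:

```c
#include <stdint.h>

/* One VPDPBUSD lane: 4 u8*s8 products accumulated directly into s32. */
static int32_t vpdpbusd_lane(const uint8_t a[4], const int8_t b[4], int32_t acc) {
    for (int i = 0; i < 4; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```

With a = {255, 255, 0, 0} and b = {127, 127, 0, 0}, this lane yields 64770, whereas the VPMADDUBSW-based sequence would have saturated the intermediate 16-bit sum to 32767.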

For 16-bit quantization, the same applies: the VPMADDWD and VPADDD instructions are fused into the VPDPWSSD instruction (s16 x s16 -> s32). This allows 2x more input than fp32, 2x the throughput, and 1/2 the memory requirement.
AVX512 VNNI CPU support
Only a few Intel processors released after 2019 support VNNI; check https://en.wikipedia.org/wiki/AVX-512 for details. The 10th Gen Core i3/i5/i7 processors, to be released in 2020, will also support VNNI.
Tips
The theoretical maximum FLOPS of a CPU can only be approached with SIMD instructions.
Using SIMD instructions can speed up neural network inference significantly, mainly in convolution and matrix multiplication operations.
Use the newest SIMD instruction set available (AVX512 > AVX2 > AVX > SSE) and don't mix them.
Quantization won't help much without AVX512 VNNI support.
VNNI is not widely available for now (2020 Q1), but we expect it to be next year (2021).