Quant for GEMM

If A and B are M*K and K*N matrices of floating-point numbers, the product C=A*B is an M*N matrix of floating-point numbers. The problem here is how we can benefit from quantization, which computes with low-precision integers, and still recover the original C from the integer result.

Matrix Multiplication Quant

During the quantization step, the elements of A and B are quantized to integers, producing new matrices [1]:

$$
A_q : A^{(i,k)} = S_a(A_q^{(i,k)}-Z_a),\quad i \in (0,M),\ k \in (0,K) \\
B_q : B^{(k,j)} = S_b(B_q^{(k,j)}-Z_b),\quad k \in (0,K),\ j \in (0,N)
$$
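This affine mapping can be sketched in a few lines of NumPy (a minimal per-tensor sketch; the function names and the unsigned 8-bit range are illustrative assumptions, not from [1]):

```python
import numpy as np

def quantize_tensor(x, n_bits=8):
    # Affine (asymmetric) quantization: x ≈ S * (x_q - Z)
    qmin, qmax = 0, 2 ** n_bits - 1
    S = (x.max() - x.min()) / (qmax - qmin)   # scale
    Z = int(round(qmin - x.min() / S))        # zero-point
    x_q = np.clip(np.round(x / S + Z), qmin, qmax).astype(np.uint8)
    return x_q, S, Z

def dequantize_tensor(x_q, S, Z):
    # Map integers back to floats with the same scale and zero-point
    return S * (x_q.astype(np.float32) - Z)

x = np.array([-1.0, 0.0, 0.5, 1.0], dtype=np.float32)
x_q, S, Z = quantize_tensor(x)
x_hat = dequantize_tensor(x_q, S, Z)
# Round-trip error is bounded by roughly one quantization step S
```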

The original problem of calculating float matrix C can be reduced to the integer matrix multiplication problem given by the following equation:

$$
C^{(i,j)} = \sum_{k=0}^{K-1} A^{(i,k)} B^{(k,j)} = S_aS_b\sum_{k=0}^{K-1} (A_q^{(i,k)}-Z_a)(B_q^{(k,j)}-Z_b) \\
= S_aS_b\left(\sum_{k=0}^{K-1} A_q^{(i,k)}B_q^{(k,j)} - Z_a\sum_{k=0}^{K-1} B_q^{(k,j)} - Z_b\sum_{k=0}^{K-1} A_q^{(i,k)} + KZ_aZ_b\right) \\
= S_aS_b\left(C_q^{(i,j)} - Z_a\sum_{k=0}^{K-1} B_q^{(k,j)} - Z_b\sum_{k=0}^{K-1} A_q^{(i,k)} + KZ_aZ_b\right)
$$

As seen, each value C[i][j] is recovered from three terms computed from the quantized matrices:

  • the i-th element of the vector accumulated over Aq's columns (the row sums of Aq)

  • the j-th element of the vector accumulated over Bq's rows (the column sums of Bq)

  • Cq[i][j] of the quantized matrices' product
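Putting the three terms together, the de-quantization step can be sketched as follows (a NumPy sketch; the function name is illustrative, and the per-tensor scales and zero-points are assumed to be known):

```python
import numpy as np

def dequantize_gemm(A_q, B_q, S_a, Z_a, S_b, Z_b):
    """Recover float C = A @ B from the integer-only product."""
    K = A_q.shape[1]
    A_q32, B_q32 = A_q.astype(np.int32), B_q.astype(np.int32)
    C_q = A_q32 @ B_q32                        # pure integer GEMM
    a_sums = A_q32.sum(axis=1, keepdims=True)  # row sums of A_q (accumulated over k)
    b_sums = B_q32.sum(axis=0, keepdims=True)  # column sums of B_q (accumulated over k)
    return S_a * S_b * (C_q - Z_a * b_sums - Z_b * a_sums + K * Z_a * Z_b)

# A = 0.5 * (A_q - 2) = [[1, 2], [0, 3]], B = 1.0 * (B_q - 3) = [[0, 2], [4, -2]]
A_q = np.array([[4, 6], [2, 8]], dtype=np.uint8)
B_q = np.array([[3, 5], [7, 1]], dtype=np.uint8)
C = dequantize_gemm(A_q, B_q, 0.5, 2, 1.0, 3)
# Here C equals the exact float product [[8, -2], [12, -6]]
```

Because A and B happen to be exactly representable under the chosen scales, the recovered C matches the float product exactly; in general it differs by the quantization error.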

De-quant for Matrix Multiplication

Example: Feed Forward NN

In the hidden layer of a neural network, a GEMM is often followed by an activation function. There are two approaches to applying quantization here.

$$
z = wx + b \\
\text{ReLU: } y = a(z) = \max(z, 0)
$$

The first choice is weights-only quantization. It requires no validation or calibration data and is straightforward to implement using the algorithm described above.
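A weights-only quantized layer might look like the following sketch (a hypothetical helper, not from [1]; weights are stored as uint8 and de-quantized on the fly, while activations stay in float):

```python
import numpy as np

def weights_only_linear(x, w_q, S_w, Z_w, b):
    # Weights stored as uint8; de-quantize on the fly, compute in float
    w = S_w * (w_q.astype(np.float32) - Z_w)
    z = w @ x + b
    return np.maximum(z, 0.0)  # ReLU activation

# w = 0.5 * (w_q - 2) = [[1, 2]]
w_q = np.array([[4, 6]], dtype=np.uint8)
y = weights_only_linear(np.array([2.0, 2.0]), w_q, 0.5, 2, np.array([-5.0]))
# z = 1*2 + 2*2 - 5 = 1, so y = [1.0]
```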

The other choice is weights-and-activations quantization. Since activations must also be quantized, calibration data is needed to estimate their dynamic ranges; typically, about 100 mini-batches are sufficient for the activation range estimates to converge. It is also possible to merge the de-quantization step of the GEMM with the quantization step of the ReLU.
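The merge works because the combined scale S_a·S_b is positive, so ReLU commutes with it and can be applied directly to the int32 accumulator before re-quantization. A sketch (the output scale S_y and zero-point Z_y are assumed to come from calibration; the function name is illustrative):

```python
import numpy as np

def fused_relu_requantize(C_q_adj, S_a, S_b, S_y, Z_y, n_bits=8):
    # C_q_adj: int32 GEMM accumulator with the zero-point correction
    # terms already applied, so that C = S_a * S_b * C_q_adj.
    acc = np.maximum(C_q_adj, 0)                   # ReLU in the integer domain
    y_q = np.round(acc * (S_a * S_b / S_y)) + Z_y  # re-quantize to the output scale
    return np.clip(y_q, 0, 2 ** n_bits - 1).astype(np.uint8)

y_q = fused_relu_requantize(np.array([[-5, 3]], dtype=np.int32), 0.1, 0.1, 0.01, 0)
# -5 is clamped to 0; 3 * (0.1 * 0.1 / 0.01) = 3, so y_q = [[0, 3]]
```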

Quantization for GEMM and ReLU

Performance Consideration

The cost of matrix multiplication under uniform quantization depends on the CPU's integer vector-multiplication capability. From the next section, on using AVX to accelerate lower-numerical-precision inference, we know that 8-bit integer matrix multiplication can be 4 times faster than 32-bit float matrix multiplication when AVX512 VNNI is available, making quantization well worth trying. Without AVX512 VNNI support, the gain drops to only about 33%; in that case quantization is much less attractive once the extra quantization and de-quantization cost is considered.

References

[1] Jacob, Benoit, et al. "Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
