Quant Basics
Quantization has a long history in mathematics and digital signal processing. It maps an input value from a large set (often a continuous set) to a value in a smaller set (often a countable and finite set). Processing data in the quantized domain brings certain benefits:
Increased throughput. Quantized values require less space to store, so more data can be processed and transferred in the same period of time.
Speed. Quantized values may be processed more quickly if the system has hardware or instructions optimized for them.
One of the most common everyday applications of quantization is PCM encoding during a phone call. The voice signal is quantized to 8 bits before it is sent and de-quantized (recovered) before the receiver hears it. If a voice on the phone sounds different from that of a person standing in front of you, you are noticing, at least in part, the error or information loss introduced by the quantization process.

Quantization Methods
There are many algorithms for converting a continuous floating-point value to a discrete integer value. For example, PCM telephony uses A-law companding because it suits the wide dynamic range of voice signals.
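To make the A-law idea concrete, here is a minimal Python sketch of the continuous A-law compression curve and its inverse. It assumes samples normalized to [-1, 1] and the commonly cited parameter A = 87.6. The function names are illustrative, and this is not the bit-exact segmented codec used in real telephony equipment.

```python
import math

A = 87.6  # assumed A-law parameter (the value commonly cited for telephony)


def a_law_compress(x: float, a: float = A) -> float:
    """Map a sample in [-1, 1] through the continuous A-law curve."""
    ax = abs(x)
    k = 1.0 + math.log(a)
    if ax < 1.0 / a:
        y = a * ax / k                      # linear segment for small signals
    else:
        y = (1.0 + math.log(a * ax)) / k    # logarithmic segment
    return math.copysign(y, x)


def a_law_expand(y: float, a: float = A) -> float:
    """Inverse of a_law_compress."""
    ay = abs(y)
    k = 1.0 + math.log(a)
    if ay < 1.0 / k:
        x = ay * k / a
    else:
        x = math.exp(ay * k - 1.0) / a
    return math.copysign(x, y)


if __name__ == "__main__":
    # Quiet samples are boosted before uniform 8-bit quantization,
    # so they receive more of the available integer codes.
    for sample in (0.001, 0.01, 0.1, 1.0):
        print(sample, a_law_compress(sample))
```

Compressing first and then quantizing uniformly gives small amplitudes finer steps than large ones, which is why this kind of non-uniform scheme fits voice signals.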
For tensor computations in neural networks, uniform quantization is the most common choice because of its simplicity and efficiency. This matches the priority of applying quantization here, which is the pursuit of high speed and throughput.
Uniform Quantization
Consider a floating-point variable with range (x_min, x_max). It will be quantized to the range (0, n_levels−1), where n_levels=256 for 8 bits of precision. We need to derive two parameters, scale (S) and zero-point (Z), which map the floating-point value to an integer [qdcn].
The scale specifies the step size of the quantizer, and floating-point zero maps to the zero-point.
The zero-point is an integer, so floating-point zero is represented exactly, with no quantization error.
For one-sided distributions, therefore, the range (x_min, x_max) is relaxed to include zero. For example, a floating-point variable with the range (2.1, 3.5) will be relaxed to the range (0, 3.5) and then quantized.
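The parameter selection described above can be sketched in Python as follows. This is one plausible way to derive S and Z, assuming the scale is the relaxed range divided by n_levels−1 and the zero-point is the rounded, clamped integer that floating-point zero maps to. The function name choose_qparams is hypothetical.

```python
def choose_qparams(x_min: float, x_max: float, n_levels: int = 256):
    """Return (scale, zero_point) mapping [x_min, x_max] onto 0..n_levels-1."""
    # Relax the range so that it always contains floating-point zero.
    x_min = min(x_min, 0.0)
    x_max = max(x_max, 0.0)
    if x_max == x_min:
        # Degenerate all-zero range: any positive scale works (assumption).
        return 1.0, 0
    scale = (x_max - x_min) / (n_levels - 1)
    # The zero-point is the integer that floating-point zero maps to.
    zero_point = round(-x_min / scale)
    zero_point = min(n_levels - 1, max(0, zero_point))
    return scale, zero_point


if __name__ == "__main__":
    # One-sided range (2.1, 3.5) is relaxed to (0, 3.5).
    print(choose_qparams(2.1, 3.5))
    # Two-sided range: zero lands somewhere inside the integer range.
    print(choose_qparams(-1.0, 3.0))
```

Because the zero-point is rounded to an integer, the representable range may shift by a fraction of one step relative to (x_min, x_max); that is the price paid for representing zero exactly.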
Once the quantization parameters are chosen, quantization and de-quantization proceed as follows:
x_int = round(x / S) + Z
x_Q = clamp(0, n_levels−1, x_int)
x_float = (x_Q − Z) · S
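A minimal Python sketch of these two steps is given here, assuming the usual affine scheme: divide by the scale, round, add the zero-point, and clamp to the integer range; de-quantization subtracts the zero-point and multiplies by the scale. The function names and the example parameters (scale 4/255, zero-point 64, roughly the range (-1, 3)) are illustrative assumptions.

```python
def quantize(x: float, scale: float, zero_point: int, n_levels: int = 256) -> int:
    """Map a float to an integer code in 0..n_levels-1."""
    # Note: Python's round() rounds exact halves to the nearest even integer.
    q = round(x / scale) + zero_point
    return min(n_levels - 1, max(0, q))  # clamp out-of-range values


def dequantize(q: int, scale: float, zero_point: int) -> float:
    """Recover an approximate float from an integer code."""
    return (q - zero_point) * scale


if __name__ == "__main__":
    scale, zero_point = 4.0 / 255, 64  # assumed example parameters
    for x in (-5.0, -1.0, 0.0, 1.234, 3.0, 10.0):
        q = quantize(x, scale, zero_point)
        print(x, q, dequantize(q, scale, zero_point))
```

For inputs inside the representable range, the round trip differs from the original value by at most half a step (scale / 2); inputs outside the range saturate at the nearest end.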