Example: Quant with ONNX
Quantize script A
ONNX provides scripts to quantize models. Please check here for details.
Quantize script B
There's another set of scripts for quantizing ONNX models. It lives inside the Nuphar execution provider and supports only weight-only quantization.
Use model_quantizer.py to quantize MatMul operations. RNN/LSTM/GRU operations can first be expanded by model_editor.py so that the MatMul operations inside them can be detected by model_quantizer.py.
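To make "weight-only quantization" concrete, here is a minimal sketch of symmetric per-tensor int8 weight quantization, the general scheme such a script would apply to a MatMul weight. The function names and the toy weight matrix are illustrative, not taken from model_quantizer.py itself.

```python
# Hypothetical sketch of symmetric, weight-only int8 quantization.
# Only the weights are quantized; activations stay in float.

def quantize_weights(w):
    """Per-tensor symmetric quantization: map floats to int8 via one scale."""
    amax = max(abs(v) for row in w for v in row)
    scale = amax / 127.0 if amax else 1.0
    wq = [[max(-127, min(127, round(v / scale))) for v in row] for row in w]
    return wq, scale

def dequantize_weights(wq, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [[v * scale for v in row] for row in wq]

w = [[0.5, -1.27], [0.02, 1.0]]
wq, s = quantize_weights(w)
w_hat = dequantize_weights(wq, s)
# Reconstruction error is bounded by half a quantization step (scale / 2).
err = max(abs(a - b) for ra, rb in zip(w, w_hat) for a, b in zip(ra, rb))
```

At inference time the int8 weights are dequantized (or folded into an integer MatMul plus a rescale), trading a small reconstruction error for a 4x smaller weight footprint.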
Here’s an example:

LSTM op in the raw ONNX model:

Expanded by model_editor.py. The MatMul op computes W*X, and the Scan op holds a subgraph that iterates over the remaining ops inside the LSTM.

MatMul op quantized by model_quantizer.py. Note that MatMul ops inside the Scan subgraph are also quantized, but they can't be visualized in the Netron tool.
To understand the quantized graph, please check the formulas in the previous sections.
In the quantization script, the implementation is simplified as follows:

The quantized graph reflects the formulas above exactly.
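One simplification worth spelling out (my reading of the pattern, not a quote from the script): with a per-column weight scale, dequantizing W and then running MatMul gives the same result as running the integer MatMul first and scaling each output column afterward, i.e. X @ (Wq * s) == (X @ Wq) * s. This identity is what lets the graph keep a cheap integer MatMul followed by a single rescale. The values below are illustrative.

```python
# Sketch of the dequantize/MatMul reordering identity behind the quantized graph.

def matmul(a, b):
    """Plain float MatMul for small nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

x = [[1.0, 2.0], [3.0, 4.0]]        # float activations (not quantized)
wq = [[10, -20], [30, 40]]          # int8 weight values (illustrative)
s = [0.01, 0.02]                    # one scale per output column

# Path 1: dequantize W first, then MatMul (the textbook formula).
w_deq = [[v * s[j] for j, v in enumerate(row)] for row in wq]
y1 = matmul(x, w_deq)

# Path 2: integer-valued MatMul first, then scale each output column.
y_int = matmul(x, wq)
y2 = [[v * s[j] for j, v in enumerate(row)] for row in y_int]

assert all(abs(a - b) < 1e-9
           for r1, r2 in zip(y1, y2) for a, b in zip(r1, r2))
```

Because the scale broadcasts over whole output columns, it commutes with the row-times-column dot products, so the two paths agree up to float rounding.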
Performance Analysis
From the previous section on using AVX to accelerate lower-precision inference, we know the maximum speedup from int8 quantization is 33% without AVX-512 VNNI support. ONNX Runtime claims a 20% speedup with this quantization script and the Nuphar execution provider, but I observed a performance regression with the default ONNX Runtime provider.
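As a back-of-envelope check on that 33% ceiling (my own instruction count, assuming AVX-512 register widths, not a figure from the source): without VNNI, an int8 dot product takes three instructions (VPMADDUBSW, VPMADDWD, VPADDD) per 512-bit register of 64 int8 lanes, while fp32 needs one FMA per register of 16 lanes:

```latex
\text{speedup} \approx \frac{64\ \text{int8 lanes} / 3\ \text{instructions}}
                            {16\ \text{fp32 lanes} / 1\ \text{instruction}}
                     = \frac{64/3}{16} = \frac{4}{3} \approx 1.33
```

VNNI's VPDPBUSD fuses those three instructions into one, which is why the theoretical ceiling rises to 4x on VNNI-capable hardware.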
There are two obvious ways to improve script B. The first is to fuse the quantize and dequantize ops; the second is to support quantizing both weights and activations. Both appear to be implemented in script A, but it seems to have some bugs and needs further investigation.