BERT
ONNX Runtime (ORT) has done a lot of work to accelerate BERT inference. Most of the techniques introduced in this section can be found in this ONNX blog post.
Op Fusion
ORT fuses a few key Ops in the Transformer, including Gelu, LayerNormalization, and Attention (multi-head attention, to be precise).
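As a sketch of what Gelu fusion buys: an exported graph typically expresses GELU as a chain of small ops (Div, Erf, Add, Mul, Mul), and the fusion pass pattern-matches that subgraph and collapses it into one Gelu node. A minimal NumPy illustration that the two forms compute the same thing (function names here are ours, not ORT's):

```python
import math
import numpy as np

erf = np.vectorize(math.erf)

def gelu_unfused(x):
    # The subgraph as it appears in an exported graph:
    t = x / math.sqrt(2.0)   # Div
    t = erf(t)               # Erf
    t = t + 1.0              # Add
    t = x * t                # Mul
    return t * 0.5           # Mul

def gelu_fused(x):
    # What a single fused Gelu kernel computes: one op, same math.
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

x = np.linspace(-3.0, 3.0, 7)
assert np.allclose(gelu_unfused(x), gelu_fused(x))
```

Fusing removes the per-op dispatch and intermediate tensor traffic between those five nodes.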
Attention Op implementation
Inside the ORT Attention Op, the three matrix multiplications that produce Q, K, and V are merged into one. The columns of Q, K, and V are then partitioned by the number of self-attention heads, since the heads can naturally be computed in parallel.
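A NumPy sketch of those two ideas, with toy shapes of our own choosing: the three projection weights are concatenated so one GEMM replaces three, and the result is partitioned into per-head slices.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, num_heads = 4, 8, 2
head_dim = hidden // num_heads

x = rng.standard_normal((seq_len, hidden))
wq, wk, wv = (rng.standard_normal((hidden, hidden)) for _ in range(3))

# Unfused graph: three separate projections.
q, k, v = x @ wq, x @ wk, x @ wv

# Fused: concatenate the weights once, do a single GEMM, then slice.
w_qkv = np.concatenate([wq, wk, wv], axis=1)      # (hidden, 3 * hidden)
q2, k2, v2 = np.split(x @ w_qkv, 3, axis=1)
assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)

# Partition the columns by head; each head's slice is independent,
# so the heads can be scheduled in parallel.
q_heads = q2.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
print(q_heads.shape)  # (2, 4, 4): one (seq_len, head_dim) block per head
```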
Performance on ONNX Runtime
We tested the optimization methods from the ONNX blog, and here are the results.
Test Settings
Model: Bert-base
Sequence length: 128
CPU: Intel Xeon W-2123
ONNX Runtime: 1.2.0, built with -march=native
| Model version | Threads | Latency (ms) |
| --- | --- | --- |
| Raw ONNX Bert-base | 1 | 294 |
| Raw ONNX Bert-base | 8 | 111 |
| Fuse Gelu, LayerNormalization | 1 | 273 |
| Fuse Gelu, LayerNormalization | 8 | 91 |
| Fuse Gelu, LayerNormalization, Embedding | 1 | 286 |
| Fuse Gelu, LayerNormalization, Embedding | 8 | 95 |
| Fuse Gelu, SkipLayerNormalization, Embedding, Attention | 1 | 297 |
| Fuse Gelu, SkipLayerNormalization, Embedding, Attention | 8 | 75 |
From the above results and the Attention Op source code, we can see that ONNX Runtime parallelizes the computation better with the fused Ops, but fusion also causes some latency regression in the single-thread case.
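The SkipLayerNormalization rows above refer to fusing the residual (skip) addition with the LayerNormalization that follows it. A NumPy sketch of what that fused op computes, under our own naming and a default epsilon we chose for illustration:

```python
import numpy as np

def skip_layer_norm(x, skip, gamma, beta, eps=1e-12):
    """Fused residual-add + LayerNorm: one kernel instead of Add -> LayerNorm."""
    h = x + skip                                   # the skip connection
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 4))
skip = rng.standard_normal((2, 4))
out = skip_layer_norm(x, skip, np.ones(4), np.zeros(4))
# Each row of the output is normalized to ~zero mean.
assert np.allclose(out.mean(axis=-1), 0.0, atol=1e-6)
```

Fusing the addition into the normalization kernel avoids writing the intermediate sum back to memory between the two ops.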
Dynamic input length
ONNX supports dynamic input lengths. This helps a lot in our experiments, since shorter sequences require less computation. How much it helps depends on your actual input lengths; here is the result from Xin Liu's experiment on our 3-layer Transformer LU model:

Figure: fixed length (length=32) vs. dynamic length. The Y-axis shows latency in microseconds.
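One simple way to exploit dynamic input lengths (a hedged sketch; the padding id and helper name are our assumptions, not from the experiment) is to trim a padded batch down to its longest real sequence before running the session, so a dynamic-axis model only computes over tokens that actually exist:

```python
import numpy as np

PAD_ID = 0  # assumed padding token id

def trim_to_max_length(input_ids, attention_mask):
    """Drop all-padding columns so a dynamic-axis ONNX model does less work."""
    max_len = int(attention_mask.sum(axis=1).max())
    return input_ids[:, :max_len], attention_mask[:, :max_len]

ids = np.array([[101, 7, 8, 102, PAD_ID, PAD_ID],
                [101, 9, 102, PAD_ID, PAD_ID, PAD_ID]])
mask = (ids != PAD_ID).astype(np.int64)
ids, mask = trim_to_max_length(ids, mask)
print(ids.shape)  # (2, 4): two all-padding columns removed
```

The trimmed arrays can then be fed to the session as usual; the saving grows as the gap between the padded length and the typical real length widens.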