BERT

ONNX Runtime (ORT) has done a lot of work to accelerate Bert inference. Most of the techniques we introduce in this section can be found in this ONNX blog.

Op Fusion

ORT fuses a few key Ops in the transformer architecture, including Gelu, LayerNormalization, and Attention (actually multi-head attention).
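As an illustration of what such a fusion means, here is a minimal NumPy sketch (not ORT's actual kernel) of SkipLayerNormalization, which folds the residual Add and the LayerNormalization that follows it into a single op, saving a graph node and, in a real kernel, a round trip to memory:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Plain LayerNormalization over the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def skip_layer_norm(x, skip, gamma, beta, eps=1e-5):
    """Fused residual add + LayerNorm, replacing the Add -> LayerNormalization pair."""
    return layer_norm(x + skip, gamma, beta, eps)

rng = np.random.default_rng(0)
x, skip = rng.standard_normal((2, 8, 768))   # hidden size 768 as in Bert-base
gamma, beta = np.ones(768), np.zeros(768)

# The fused op matches the unfused Add -> LayerNorm sequence exactly.
fused = skip_layer_norm(x, skip, gamma, beta)
unfused = layer_norm(x + skip, gamma, beta)
assert np.allclose(fused, unfused)
```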

Attention Op implementation

Inside the ORT Attention Op, the three matrix multiplications that produce Q, K, and V are merged into one. The columns of Q, K, and V are then partitioned by the number of self-attention heads, because the heads can naturally be computed in parallel.
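A NumPy sketch of the two ideas, using Bert-base shapes for illustration (ORT implements this in a C++ kernel; the code below is only a model of the math):

```python
import numpy as np

seq_len, hidden, num_heads = 128, 768, 12
head_dim = hidden // num_heads                      # 64 columns per head
rng = np.random.default_rng(0)

x = rng.standard_normal((seq_len, hidden)).astype(np.float32)
wq, wk, wv = (rng.standard_normal((hidden, hidden)).astype(np.float32)
              for _ in range(3))

# Unfused: three separate GEMMs for Q, K, V.
q, k, v = x @ wq, x @ wk, x @ wv

# Fused: concatenate the weights column-wise, run one GEMM, split afterwards.
w_qkv = np.concatenate([wq, wk, wv], axis=1)        # (hidden, 3 * hidden)
q2, k2, v2 = np.split(x @ w_qkv, 3, axis=1)
assert np.allclose(q, q2, atol=1e-3)

# Partition the columns of Q by head; each head attends independently,
# so the per-head attention computations can run in parallel.
q_heads = q.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
print(q_heads.shape)  # (12, 128, 64)
```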

Performance on ONNX Runtime

We tested the optimization methods from the ONNX blog; here are the results.

Test Settings

| Setting | Value |
| --- | --- |
| Model | Bert-base |
| Sequence length | 128 |
| CPU | Intel Xeon W2123 |
| ONNX Runtime | 1.2.0 built with march=native |

| Model version | Thread | Latency (ms) |
| --- | --- | --- |
| Raw ONNX Bert-base | 1 | 294 |
| Raw ONNX Bert-base | 8 | 111 |
| Fuse Gelu, LayerNormalization | 1 | 273 |
| Fuse Gelu, LayerNormalization | 8 | 91 |
| Fuse Gelu, LayerNormalization, Embedding | 1 | 286 |
| Fuse Gelu, LayerNormalization, Embedding | 8 | 95 |
| Fuse Gelu, SkipLayerNormalization, Embedding, Attention | 1 | 297 |
| Fuse Gelu, SkipLayerNormalization, Embedding, Attention | 8 | 75 |

From the above results and the Attention Op source code, we can see that ONNX Runtime parallelizes the computation better with the fused Ops, but fusion also brings some regression in the single-thread case.

Dynamic input length

ONNX supports dynamic input lengths. This helped a lot in our experiments, since shorter sequences require less computation. How much it helps depends on your actual input lengths; here is the result from Xin Liu's experiment on our 3-layer transformer LU model:

Figure (Dynamic Input Length): fixed length (length=32) vs. dynamic length; the y-axis shows latency in microseconds.
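A back-of-envelope multiply-add count (our own rough model, counting matmuls only, not a figure from the experiment) shows why padding every input to a fixed length wastes compute:

```python
def bert_layer_flops(seq_len, hidden=768):
    """Rough multiply-add count for one transformer encoder layer."""
    qkv_and_out = 4 * seq_len * hidden * hidden   # Q, K, V and output projections
    attention = 2 * seq_len * seq_len * hidden    # scores + weighted sum of V
    ffn = 8 * seq_len * hidden * hidden           # two GEMMs with 4x expansion
    return qkv_and_out + attention + ffn

# Projections and FFN scale linearly with sequence length, attention
# quadratically, so padding a length-32 input to 128 costs over 4x more.
ratio = bert_layer_flops(128) / bert_layer_flops(32)
print(round(ratio, 2))  # 4.08
```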
