BERT
ONNX Runtime (ORT) has done a lot of work to accelerate BERT inference. Most of the techniques introduced in this section can be found in this ONNX blog post.
Op Fusion
ORT fuses a few key Ops in the Transformer, including Gelu, LayerNormalization, and Attention (multi-head attention, to be precise).
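As a sketch of what Gelu fusion buys: an exported graph typically expresses GELU as a chain of small ops (Div, Erf, Add, Mul, Mul), and the fusion pass pattern-matches that subgraph and collapses it into one Gelu node. A minimal NumPy illustration that the two forms compute the same thing (function names here are ours, not ORT's):

```python
import math
import numpy as np

erf = np.vectorize(math.erf)

def gelu_unfused(x):
    # The subgraph as it appears in an exported graph:
    t = x / math.sqrt(2.0)   # Div
    t = erf(t)               # Erf
    t = t + 1.0              # Add
    t = x * t                # Mul
    return t * 0.5           # Mul

def gelu_fused(x):
    # What a single fused Gelu kernel computes: one op, same math.
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

x = np.linspace(-3.0, 3.0, 7)
assert np.allclose(gelu_unfused(x), gelu_fused(x))
```

Fusing removes the per-op dispatch and intermediate tensor traffic between those five nodes.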
Attention Op implementation
Inside the ORT Attention Op, the three matrix multiplications that produce Q, K, and V are merged into one. The columns of Q, K, and V are then partitioned by the number of self-attention heads, since the heads can naturally be computed in parallel.
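A NumPy sketch of those two ideas, with toy shapes of our own choosing: the three projection weights are concatenated so one GEMM replaces three, and the result is partitioned into per-head slices.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, hidden, num_heads = 4, 8, 2
head_dim = hidden // num_heads

x = rng.standard_normal((seq_len, hidden))
wq, wk, wv = (rng.standard_normal((hidden, hidden)) for _ in range(3))

# Unfused graph: three separate projections.
q, k, v = x @ wq, x @ wk, x @ wv

# Fused: concatenate the weights once, do a single GEMM, then slice.
w_qkv = np.concatenate([wq, wk, wv], axis=1)      # (hidden, 3 * hidden)
q2, k2, v2 = np.split(x @ w_qkv, 3, axis=1)
assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)

# Partition the columns by head; each head's slice is independent,
# so the heads can be scheduled in parallel.
q_heads = q2.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
print(q_heads.shape)  # (2, 4, 4): one (seq_len, head_dim) block per head
```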
Performance on ONNX Runtime
We tested the optimization methods from the ONNX blog, and here are the results.
Test Settings
Model: Bert-base
Sequence length: 128
CPU: Intel Xeon W-2123
ONNX Runtime: 1.2.0, built with -march=native
| Model version | Threads | Latency (ms) |
| --- | --- | --- |
| Raw ONNX Bert-base | 1 | 294 |
| Raw ONNX Bert-base | 8 | 111 |
| Fuse Gelu, LayerNormalization | 1 | 273 |
| Fuse Gelu, LayerNormalization | 8 | 91 |
| Fuse Gelu, LayerNormalization, Embedding | 1 | 286 |
| Fuse Gelu, LayerNormalization, Embedding | 8 | 95 |
| Fuse Gelu, SkipLayerNormalization, Embedding, Attention | 1 | 297 |
| Fuse Gelu, SkipLayerNormalization, Embedding, Attention | 8 | 75 |
From the above results and the Attention Op source code, we can see that ONNX Runtime parallelizes the computation better with the fused Ops, but fusion also causes some latency regression in the single-thread case.
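The SkipLayerNormalization rows above refer to fusing the residual (skip) addition with the LayerNormalization that follows it. A NumPy sketch of what that fused op computes, under our own naming and a default epsilon we chose for illustration:

```python
import numpy as np

def skip_layer_norm(x, skip, gamma, beta, eps=1e-12):
    """Fused residual-add + LayerNorm: one kernel instead of Add -> LayerNorm."""
    h = x + skip                                   # the skip connection
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 4))
skip = rng.standard_normal((2, 4))
out = skip_layer_norm(x, skip, np.ones(4), np.zeros(4))
# Each row of the output is normalized to ~zero mean.
assert np.allclose(out.mean(axis=-1), 0.0, atol=1e-6)
```

Fusing the addition into the normalization kernel avoids writing the intermediate sum back to memory between the two ops.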
Dynamic input length
ONNX supports dynamic input lengths. This helps a lot in our experiments, since shorter sequences require less computation. How much it helps depends on your actual input lengths; here is the result from Xin Liu's experiment on our 3-layer Transformer LU model:

Figure: fixed length (length=32) vs. dynamic length. The Y-axis shows latency in microseconds.
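One simple way to exploit dynamic input lengths (a hedged sketch; the padding id and helper name are our assumptions, not from the experiment) is to trim a padded batch down to its longest real sequence before running the session, so a dynamic-axis model only computes over tokens that actually exist:

```python
import numpy as np

PAD_ID = 0  # assumed padding token id

def trim_to_max_length(input_ids, attention_mask):
    """Drop all-padding columns so a dynamic-axis ONNX model does less work."""
    max_len = int(attention_mask.sum(axis=1).max())
    return input_ids[:, :max_len], attention_mask[:, :max_len]

ids = np.array([[101, 7, 8, 102, PAD_ID, PAD_ID],
                [101, 9, 102, PAD_ID, PAD_ID, PAD_ID]])
mask = (ids != PAD_ID).astype(np.int64)
ids, mask = trim_to_max_length(ids, mask)
print(ids.shape)  # (2, 4): two all-padding columns removed
```

The trimmed arrays can then be fed to the session as usual; the saving grows as the gap between the padded length and the typical real length widens.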