Graph Level Optimization
Graph-level optimization in TVM is similar to that in other frameworks such as TensorFlow and ONNX.
Operator fusion
Operator fusion is a simple but important optimization in deep learning inference: it fuses multiple small operations into a single kernel, for example:
Conv2D + BiasAdd + [Activation]
MatMul + BiasAdd + [Activation]
Fusing ops together provides several performance advantages [1]:
Improves temporal and spatial locality of data access. For example, MatMul is computed block-wise, and the bias and activation function can be applied while the data is still "hot" in cache.
Completely eliminates op scheduling overhead (a big win for cheap ops)
Increases opportunities for ILP, vectorization etc.
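To make the locality argument concrete, here is a plain-Python sketch (illustrative only, not TVM code) of the MatMul + BiasAdd + ReLU chain computed unfused versus fused. The unfused version materializes and re-reads a full intermediate matrix at every step; the fused version applies bias and activation while each accumulator is still in registers/cache.

```python
def matmul_bias_relu_unfused(A, B, bias):
    n, k, m = len(A), len(B), len(B[0])
    # Step 1: MatMul writes a full intermediate matrix to memory.
    C = [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
         for i in range(n)]
    # Step 2: BiasAdd re-reads the whole intermediate.
    C = [[C[i][j] + bias[j] for j in range(m)] for i in range(n)]
    # Step 3: ReLU re-reads it yet again.
    return [[max(C[i][j], 0.0) for j in range(m)] for i in range(n)]

def matmul_bias_relu_fused(A, B, bias):
    n, k, m = len(A), len(B), len(B[0])
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            # Bias and activation applied while acc is still "hot".
            out[i][j] = max(acc + bias[j], 0.0)
    return out
```

Both functions compute the same result; the fused loop also avoids launching three separate kernels, which is where the scheduling-overhead win comes from.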
TVM classifies graph operators into four categories [2]:
injective (one-to-one map, e.g., add)
reduction (e.g., sum)
complex-out-fusable (can fuse element-wise map to output, e.g., conv2d)
opaque (cannot be fused, e.g., sort).
TVM defines fusion rules based on the above operator categories:
Multiple injective operators can be fused into another injective operator.
A reduction operator can be fused with input injective operators (e.g., fuse scale and sum).
Operators such as conv2d are complex-out-fusable, and injective operators can be fused to their output.
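The three rules can be sketched as a toy pass over a linear chain of operators (the `CATEGORY` table and greedy grouping below are hypothetical simplifications, not TVM's actual fusion pass, which works on a dataflow graph):

```python
INJECTIVE, REDUCTION, COMPLEX_OUT_FUSABLE, OPAQUE = range(4)

# Hypothetical category table for a few example ops.
CATEGORY = {
    "conv2d": COMPLEX_OUT_FUSABLE,
    "bias_add": INJECTIVE,
    "relu": INJECTIVE,
    "sum": REDUCTION,
    "sort": OPAQUE,
}

def fuse_chain(ops):
    """Greedily group a chain of ops into fused segments."""
    groups = []
    for op in ops:
        cat = CATEGORY[op]
        can_fuse = False
        if groups:
            if cat == INJECTIVE:
                # Rules 1 & 3: an injective op fuses after another
                # injective op or onto a complex-out-fusable output.
                can_fuse = CATEGORY[groups[-1][-1]] in (
                    INJECTIVE, COMPLEX_OUT_FUSABLE)
            elif cat == REDUCTION:
                # Rule 2: a reduction fuses with injective inputs.
                can_fuse = all(CATEGORY[o] == INJECTIVE
                               for o in groups[-1])
        if can_fuse:
            groups[-1].append(op)
        else:
            groups.append([op])  # opaque ops always stand alone
    return groups
```

For example, `fuse_chain(["conv2d", "bias_add", "relu", "sort"])` yields one fused conv2d + bias_add + relu segment plus a standalone sort, mirroring the Conv2D + BiasAdd + Activation pattern above.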
Here is a comparison between fused and non-fused operations from TVM, measured on an NVIDIA Titan X: ![With and without fusion [2]](https://nightr.gitbook.io/nnopt/~gitbook/image?url=https%3A%2F%2F369426842-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-legacy-files%2Fo%2Fassets%252F-M5GTf1_2Rw6OsaXDnMj%252Fsync%252F077dc91cc736d44c090aa0b1431077ccf98e71a5.png%3Fgeneration%3D1592554672511237%26alt%3Dmedia&width=300&dpr=3&quality=100&sign=65d954a4&sv=2)
Other rule-based graph optimizations
There are many rule-based optimization methods that can be applied, for example [1]:
Arithmetic simplifications
Flattening: a+b+c+d => AddN(a, b, c, d)
Hoisting: AddN(x * a, b * x, x * c) => x * AddN(a, b, c)
Simplification to reduce number of nodes:
Numeric: x+x+x => 3*x
Logic: !(x > y) => x <= y
Broadcast minimization
Example: (matrix1 + scalar1) + (matrix2 + scalar2) => (matrix1 + matrix2) + (scalar1 + scalar2)
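Two of the arithmetic rules above (flattening and the numeric x+x+x => 3*x simplification) can be sketched on a toy tuple-based expression representation (a hypothetical mini-IR, not any framework's real rewriter):

```python
from collections import Counter

def flatten_add(expr):
    """Flattening: ('add', ('add', a, b), c) -> [a, b, c]."""
    if isinstance(expr, tuple) and expr[0] == "add":
        return flatten_add(expr[1]) + flatten_add(expr[2])
    return [expr]

def simplify_sum(expr):
    """Numeric simplification: x+x+x -> ('mul', 3, 'x')."""
    counts = Counter(flatten_add(expr))
    terms = [t if n == 1 else ("mul", n, t) for t, n in counts.items()]
    # A single remaining term needs no AddN wrapper.
    return terms[0] if len(terms) == 1 else ("addn", *terms)
```

For instance, x + x + x written as nested binary adds, `("add", ("add", "x", "x"), "x")`, simplifies to `("mul", 3, "x")` — three additions replaced by one multiply.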
Layout optimization
Example: Use NCHW layout for Conv2D
![NCHW layout for Conv2D [1]](https://nightr.gitbook.io/nnopt/~gitbook/image?url=https%3A%2F%2F369426842-files.gitbook.io%2F%7E%2Ffiles%2Fv0%2Fb%2Fgitbook-legacy-files%2Fo%2Fassets%252F-M5GTf1_2Rw6OsaXDnMj%252Fsync%252Fcd70ad428d7eec3e8260ffcb0721fb23348b409e.png%3Fgeneration%3D1592554672688707%26alt%3Dmedia&width=300&dpr=3&quality=100&sign=9bb7cc31&sv=2)
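A layout optimization is just a reindexing of the tensor data. As a minimal plain-Python sketch (nested lists standing in for a real tensor), here is the NHWC-to-NCHW transformation that such a pass would insert before a Conv2D expecting NCHW input:

```python
def nhwc_to_nchw(t):
    """Transpose a nested-list tensor from [N][H][W][C] to [N][C][H][W]."""
    n, h, w, c = len(t), len(t[0]), len(t[0][0]), len(t[0][0][0])
    return [[[[t[ni][hi][wi][ci] for wi in range(w)]
              for hi in range(h)]
             for ci in range(c)]
            for ni in range(n)]
```

After the transform, `out[n][c][h][w] == t[n][h][w][c]`: the channel dimension becomes contiguous per spatial plane, which is the access pattern the NCHW Conv2D kernels assume.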
Just like other popular frameworks, TVM also applies lots of rule-based graph optimizations.
Graph tuner
AutoTVM searches for the optimal schedule for each individual operator, but locally optimal schedules may still compose into a suboptimal graph. TVM therefore introduces the graph tuner to perform graph-level tuning.
Given a set of schedule candidates, the graph tuner benchmarks all possible layout transformations in the graph, then combines the schedule and layout-transformation execution times to find the optimal combination of schedules. [3]
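The core idea can be sketched as a dynamic program over a linear chain of ops (a heavily simplified toy, not TVM's actual graph tuner, which handles general graphs; all cost numbers are supplied by the caller, e.g. from benchmarks):

```python
def tune_chain(op_times, transform_time):
    """Pick one (schedule, layout) candidate per op in a chain.

    op_times[i][k]       = measured time of op i with candidate k
    transform_time(j, k) = cost of converting candidate j's output
                           layout to candidate k's input layout
    Returns (best_total_time, chosen candidate index per op).
    """
    n = len(op_times)
    best = list(op_times[0])          # best[k]: cheapest finish at op 0, cand k
    choice = [[None] * len(op_times[0])]
    for i in range(1, n):
        new_best, prev_pick = [], []
        for k, t in enumerate(op_times[i]):
            costs = [best[j] + transform_time(j, k) + t
                     for j in range(len(best))]
            jmin = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[jmin])
            prev_pick.append(jmin)
        best, choice = new_best, choice + [prev_pick]
    # Backtrack the chosen candidate for each op.
    k = min(range(len(best)), key=best.__getitem__)
    total, picks = best[k], [k]
    for i in range(n - 1, 0, -1):
        k = choice[i][k]
        picks.append(k)
    return total, picks[::-1]
```

For example, with two ops that each prefer a different layout, the tuner weighs each op's fastest candidate against the cost of the layout conversion between them and picks the globally cheapest combination.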
TBD: need more investigation on graph tuner.
Reference
[1] TensorFlow Graph Optimizations
[2] Tianqi Chen, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, 2018. USENIX Association.