TVM's operator implementation adopts the decoupled compute/schedule principle from Halide.
TVM introduces a tensor expression language to support automatic code generation. The following code shows an example tensor expression that computes a transposed matrix multiplication:
As shown in the earlier GEMM chapter, this naïve implementation is not efficient; we can optimize it by improving data locality and by applying hardware intrinsics such as SIMD. From the Halide chapter, we know that writing optimal code by hand is complex, and that Halide provides a set of APIs to decouple the computation definition from the schedule. TVM adopts the same idea and provides similar APIs for the same purpose.
Just like Halide, TVM's schedule APIs include split, tile, compute_at, etc. Check TVM's tutorial here for a detailed introduction. One major difference from Halide is the tensorize API: TVM lets us define custom tensor intrinsics in Python code, so we can easily plug in different intrinsic functions for different hardware. Check here for details.
With these schedule APIs, we can optimize the matrix multiplication as shown in the code below. The full sample code can be found here.
```python
s = te.create_schedule(C.op)
# Tile and split the computation into smaller blocks
xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], 32, 32)
ko, ki = s[C].split(k, factor=4)
s[C].reorder(xo, yo, ko, ki, xi, yi)
# Vectorize the inner yi dimension to use SIMD intrinsics
s[C].vectorize(yi)
```
AutoTVM
Like Halide, TVM also supports tuning schedule parameters automatically. There are two steps: first define the search space, then search through that space automatically.
Define search space
In the previous matrix multiplication sample, we fixed the tile factor at 32, but that may not be optimal. Instead, we can define a search space over candidate factors.
Search through the space
TVM provides different methods to search the space, such as random search and grid search. These are good enough for small spaces, but for larger spaces machine-learning-based methods can be much more efficient.
To guide the search, TVM uses an ML model that takes the lowered loop program as input and predicts its running time on a given hardware back-end. The model is trained on runtime measurements collected during exploration and does not require the user to supply detailed hardware information. It uses a rank objective to predict the relative order of runtime costs (whether A runs faster than B) [1].
In practice, TVM provides a gradient tree boosting model. On average it makes a prediction in 0.67 ms, thousands of times faster than running a real measurement [1].
Check here for a sample of tuning an end-to-end model automatically.
References
[1] Tianqi Chen, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), Carlsbad, CA, 2018. USENIX Association.