Halide is a programming language designed to make it easier to write high-performance image and array processing code on modern machines [halide]. Halide is a Domain Specific Language embedded in C++. This means you write C++ code that builds an in-memory representation of a Halide pipeline using Halide's C++ API.
Halide was designed for image processing pipeline initially. Because it's all about tensor computation, the ideas can also be applied to Deep Learning inference. In fact, TVM uses HalideIR as data structure for arithmetic simplification and low level lowering. So it'll be very inspiring to understand Halide's design for Deep Learning inference acceleration.
Halide's motivation
Challenges
For image processing or Deep Learning inference, the performance difference between a naïve implementation of a given pipeline and a highly optimized one is frequently an order of magnitude or more.
With current tools, often the only way to approach peak performance is by hand-writing parallel, vectorized, tiled, and globally fused code in low-level C, CUDA, intrinsic, and assembly. Writing and tuning them requires immense effort by expert programmers, and the end result is not portable to different architectures, nor composable with other algorithms, without sacrificing much of this painstakingly earned performance.
Libraries of optimized subroutines do not solve the problem, either, since many critical optimizations involve fusion for producer-consumer locality across stages. [1]
A 3x3 blur filter example
This example was introduced in [1] and we will use it in following sections.
Though the original implementation is very simple, optimizing it with hand-write codes can be complex and require expertise. Below codes show a 10x faster implementation which takes locality, SIMD, parallel into consideration. Here we just want you to have an intuition about how complex the optimization can be, you don't have to understand the code details for now.
voidbox_filter_3x3(const Image &in, Image &blury){ __m128i one_third =_mm_set1_epi16(21846);#pragma ompparallelforfor(int yTile =0; yTile <in.height(); yTile +=32){ __m128i a, b, c, sum, avg; __m128i blurx[(256/8)*(32+2)];// allocate tile blurx array for(int xTile =0; xTile <in.width(); xTile +=256){ __m128i *blurxPtr = blurx;for(int y =-1; y <32+1; y++){constuint16_t*inPtr =&(in[yTile+y][xTile]);for(int x =0; x <256; x +=8){ a =_mm_loadu_si128((__m128i*)(inPtr-1)); b =_mm_loadu_si128((__m128i*)(inPtr+1)); c =_mm_load_si128((__m128i*)(inPtr)); sum =_mm_add_epi16(_mm_add_epi16(a, b), c); avg =_mm_mulhi_epi16(sum, one_third);_mm_store_si128(blurxPtr++, avg); inPtr +=8;}} blurxPtr = blurx;for(int y =0; y <32; y++){ __m128i *outPtr =(__m128i *)(&(blury[yTile+y][xTile]));for(int x =0; x <256; x +=8){ a =_mm_load_si128(blurxPtr+(2*256)/8); b =_mm_load_si128(blurxPtr+256/8); c =_mm_load_si128(blurxPtr++); sum =_mm_add_epi16(_mm_add_epi16(a, b), c); avg =_mm_mulhi_epi16(sum, one_third);_mm_store_si128(outPtr++, avg);}}}}}
Goal of Halide
Halide aims to decouple the algorithm definition from its execution strategy. By providing portable and composable abstraction, programmers can try out different optimizations easily, Halide can also automates the search for optimized mappings of the resulting pipelines to parallel machines and complex memory hierarchies.
We'll show you how Halide solve this challenge in following sections.
References
[1] Ragan-Kelley, Jonathan, et al. "Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines." Acm Sigplan Notices 48.6 (2013): 519-530.