What is Halide

Halide is a programming language designed to make it easier to write high-performance image and array processing code on modern machines [halide]. Halide is a Domain Specific Language embedded in C++. This means you write C++ code that builds an in-memory representation of a Halide pipeline using Halide's C++ API.

Halide was designed for image processing pipeline initially. Because it's all about tensor computation, the ideas can also be applied to Deep Learning inference. In fact, TVM uses HalideIR as data structure for arithmetic simplification and low level lowering. So it'll be very inspiring to understand Halide's design for Deep Learning inference acceleration.

Halide's motivation

Challenges

For image processing or Deep Learning inference, the performance difference between a naïve implementation of a given pipeline and a highly optimized one is frequently an order of magnitude or more.

With current tools, often the only way to approach peak performance is by hand-writing parallel, vectorized, tiled, and globally fused code in low-level C, CUDA, intrinsic, and assembly. Writing and tuning them requires immense effort by expert programmers, and the end result is not portable to different architectures, nor composable with other algorithms, without sacrificing much of this painstakingly earned performance.

Libraries of optimized subroutines do not solve the problem, either, since many critical optimizations involve fusion for producer-consumer locality across stages. [1]

A 3x3 blur filter example

This example was introduced in [1] and we will use it in following sections.

void box_filter_3x3(const Image &in, Image &blury) {
    Image blurx(in.width(), in.height());  // allocate blurx array
    blurx(x, y) = (in(x-1, y) + in(x, y) + in(x+1, y))/3;
    blury(x, y) = (blurx(x, y-1) + blurx(x, y) + blurx(x, y+1))/3;
}

Though the original implementation is very simple, optimizing it with hand-write codes can be complex and require expertise. Below codes show a 10x faster implementation which takes locality, SIMD, parallel into consideration. Here we just want you to have an intuition about how complex the optimization can be, you don't have to understand the code details for now.

void box_filter_3x3(const Image &in, Image &blury) {
    __m128i one_third = _mm_set1_epi16(21846); 
    #pragma omp parallel for
    for (int yTile = 0; yTile < in.height(); yTile += 32) { 
        __m128i a, b, c, sum, avg; 
        __m128i blurx[(256/8)*(32+2)]; // allocate tile blurx array 
        for (int xTile = 0; xTile < in.width(); xTile += 256) { 
        __m128i *blurxPtr = blurx; 
        for (int y = -1; y < 32+1; y++) { 
            const uint16_t *inPtr = &(in[yTile+y][xTile]); 
            for (int x = 0; x < 256; x += 8) {           
            a = _mm_loadu_si128((__m128i*)(inPtr-1)); 
            b = _mm_loadu_si128((__m128i*)(inPtr+1)); 
            c = _mm_load_si128((__m128i*)(inPtr)); 
            sum = _mm_add_epi16(_mm_add_epi16(a, b), c); 
            avg = _mm_mulhi_epi16(sum, one_third); 
            _mm_store_si128(blurxPtr++, avg); 
            inPtr += 8; 
        }} 
        blurxPtr = blurx; 
        for (int y = 0; y < 32; y++) { 
            __m128i *outPtr = (__m128i *)(&(blury[yTile+y][xTile])); 
            for (int x = 0; x < 256; x += 8) { 
            a = _mm_load_si128(blurxPtr+(2*256)/8); 
            b = _mm_load_si128(blurxPtr+256/8); 
            c = _mm_load_si128(blurxPtr++); 
            sum = _mm_add_epi16(_mm_add_epi16(a, b), c); 
            avg = _mm_mulhi_epi16(sum, one_third); 
            _mm_store_si128(outPtr++, avg); 
}}}}}

Goal of Halide

Halide aims to decouple the algorithm definition from its execution strategy. By providing portable and composable abstraction, programmers can try out different optimizations easily, Halide can also automates the search for optimized mappings of the resulting pipelines to parallel machines and complex memory hierarchies.

We'll show you how Halide solve this challenge in following sections.

References

[1] Ragan-Kelley, Jonathan, et al. "Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines." Acm Sigplan Notices 48.6 (2013): 519-530.

PreviousBERT NextSchedule

Last updated 5 years ago

hashtagHalide's motivation

hashtagChallenges

hashtagA 3x3 blur filter example

hashtagGoal of Halide

hashtagReferences

Halide's motivation

Challenges

A 3x3 blur filter example

Goal of Halide

References