Schedule Trade-offs

The trade-offs between producer-consumer locality, redundant recomputation and parallelism is critical to computation performance.

Producer-consumer locality can be measured by operations that can occur between producer computing a value and consumer reading it back.
Redundant recomputation measures if we recompute a value which has been computed before.
Parallelism measures the degree of parallelism available, by counting how many threads or simd lanes could be kept busy doing useful work.

We will use the 3x3 blur example mentioned in last section to illustrate this trade-offs. We will show different implementations of this example, and discuss the trade-offs of each.

Breadth-first

This is the most common strategy in hand-written pipelines, and what results from composing library routines together: each stage executes breadth-first across its input before passing its entire output to the next stage. There is abundant parallelism, since all the required points in each stage can be computed and stored independently, but there is little producer-consumer locality, since all the values of blurx must be computed and stored before the first one is used by out.

alloc blurx[2048][3072]
for each y in 0..2048:
    for each x in 0..3072:
        blurx[y][x] = (in[y][x-1] + in[y][x] + in[y][x+1])/3
alloc out[2046][3072]
for each y in 1..2047:
    for each x in 0..3072:
        out[y][x] = (blurx[y-1][x] + blurx[y][x] + blurx[y+1][x])/3

Full fusion

Each pixel can be computed independently, providing the same abundant data parallelism from the breadth-first strategy. The distance from producer to consumer is small, maximizing locality. But because shared values in blurx are not reused across iterations, this strategy performs redundant work.

alloc out[2046][3072]
for each y in 1..2047:
    for each x in 0..3072:
        alloc blurx[-1..1]
        for each i in -1..1:
            blurx[i]= (in[y-1+i][x-1]+in[y-1+i][x]+in[y-1+i][x+1])/3
        out[y][x] = (blurx[0] + blurx[1] + blurx[2])/3

Sliding window

This interleaves the computation over a sliding window, with out trailing blurx by the stencil radius (one scanline). It wastes no work, computing each point in blurx exactly once, and the maximum distance between a value being produced in blurx and consumed in out is proportional to the stencil height (three scanlines), not the entire image. But to achieve this, it has introduced a dependence between the loop iterations: a given iteration of out depends on the last three outer loop iterations of blurx. This only works if these loops are evaluated sequentially. Interleaving the stages while producing each value only once requires tightly synchronizing the order of computation, sacrificing parallelism.

alloc out[2046][3072]
alloc blurx[3][3072]
for each y in -1..2047:
    for each x in 0..3072:
        blurx[(y+1)%3][x]=(in[y+1][x-1]+in[y+1][x]+in[y+1][x+1])/3
        if y < 1: continue
        out[y][x] = (blurx[(y-1)%3][x]
        + blurx[ y % 3 ][x]
        + blurx[(y+1)%3][x])/3

Faster implementations

Above three implementations breadth-first, full fusion and sliding window go worst on producer-consumer locality, redundant work, parallelism respectively. Faster implementations always have better balance on these 3 factors, like two implementations below:

Tiled

alloc out[2046][3072]
for each ty in 0..2048/32:
    for each tx in 0..3072/32:
        alloc blurx[-1..33][32]
        for y in -1..33:
            for x in 0..32:
                blurx[y][x] = (in[ty*32+y][tx*32+x-1]
                + in[ty*32+y][tx*32+x]
                + in[ty*32+y][tx*32+x+1])/3
        for y in 0..32:
            for x in 0..32:
                out[ty*32+y][tx*32+x] = (blurx[y-1][x]
                + blurx[y ][x]
                + blurx[y+1][x])/3

Sliding in tiles

alloc out[2046][3072]
for each ty in 0..2048/8:
    alloc blurx[-1..1][3072]
    for y in -2..8:
        for x in 0..3072:
            blurx[(y+1)%3][x] = (in[ty*8+y+1][tx*32+x-1]
            + in[ty*8+y+1][tx*32+x]
            + in[ty*8+y+1][tx*32+x+1])/3
        if y < 0: continue
        for x in 0..3072:
            out[ty*8+y][x] = (blurx[(y-1)%3][x]
            + blurx[ y % 3][x]
            + blurx[(y+1)%3][x])/3

Measure the trade-offs

Each implementations above make different trade-offs between locality, redundant recomputation, and parallelism. Here we quantify these effects for our two-stage blur pipeline. The span measures the degree of parallelism available, by counting how many threads or simd lanes could be kept busy doing useful work. The Max. reuse distance measures locality, by counting the maximum number of operations that can occur between computing a value and reading it back. Work amplification measures redundant work, by comparing the number of arithmetic operations done to the breadth-first case. Each of the first three strategies represent an extreme point of the choice space, and is weak in one regard.

The fastest schedules are mixed strategies, such as the tiled ones in the last two rows.

References

[1] Ragan-Kelley, Jonathan, et al. "Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines." Acm Sigplan Notices 48.6 (2013): 519-530.

PreviousSchedule NextCompiler

Last updated 5 years ago

hashtagBreadth-first

hashtagFull fusion

hashtagSliding window

hashtagFaster implementations

hashtagTiled

hashtagSliding in tiles

hashtagMeasure the trade-offs

hashtagReferences