Blocking
The CPU cache is an important topic for any computation task, and GEMM is no exception. Whenever the CPU requests a block of data in RAM, it doesn't talk to RAM directly. Instead, the block is loaded into the cache, and the CPU then reads it from there. Reading or writing the cache is significantly faster than accessing RAM. These are the numbers Intel lists for a Pentium M processor, measured in CPU cycles[1]:
| To Where | Cycles |
| --- | --- |
| Register | ≤ 1 |
| L1d | ∼ 3 |
| L2 | ∼ 14 |
| Main Memory | ∼ 240 |
For GEMM, if the matrices A, B and C are too large, it is impossible to cache them completely. To make the best use of the cache, A and B need to be divided into submatrices whose products accumulate into the result matrix C. This technique is called blocking. From an algorithmic perspective, the idea is classic divide and conquer; at the same time, it improves cache locality.

Computational Intensity
There is a nice and clear proof from Demmel/Yelick[3] of why blocking speeds up general matrix multiplication (GEMM). To simplify the proof, we ignore the actual RAM/L1/L2/L3 hierarchy and assume only two levels of storage: memory (slow) and cache (fast). The cost of accessing data already in cache is assumed to be zero. We also define the following notation:
s = number of elements (floating-point numbers) moved between cache and memory.
ts = time per memory operation, including moving data to cache and processing.
f = number of arithmetic operations on cached data, including additions and multiplications.
tf = time per arithmetic operation; it is far less than ts.
q = f/s, the average number of flops per memory access.
The actual time spent on matrix multiplication (MM) is the sum of the computation and data-fetch costs. Among these, tf and ts are fixed numbers determined by the system architecture. Given that the dimensions of A, B and C are m*k, k*n and m*n, the total number of arithmetic operations of MM is also fixed, at 2*m*n*k (m*n*k additions and m*n*k multiplications). The only quantity left that we can change is q.
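With the notation above, the total time can be written out explicitly (this is the standard two-level model; the second form just factors out the pure-compute lower bound):

```latex
T = f \cdot t_f + s \cdot t_s
  = f \cdot t_f \left( 1 + \frac{t_s}{t_f} \cdot \frac{1}{q} \right)
```

Since ts is much larger than tf, only a large q can bring T close to its minimum f*tf.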
The q is often referred to as arithmetic or computational intensity[5]. It is an important measure of an algorithm's efficiency. A small q means the throughput of data movement between memory and cache is low, so the CPU waits for data and is not fully utilized. A big q lowers the total time toward its minimal value f*tf, which means the data bandwidth is large enough relative to the CPU's computing power and is not the bottleneck. As a result, a good GEMM implementation is one that minimizes the cost of moving data between slow memory and fast cache. This is reflected in a high computational intensity q.
Naive GEMM Implementation
In a naive GEMM implementation, we multiply one row vector from A by one column vector from B to get one element of C.
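As a concrete reference, the naive triple loop can be sketched in C as follows (row-major storage assumed; the function name `gemm_naive` is illustrative):

```c
#include <stddef.h>

// Naive GEMM: C[m x n] += A[m x k] * B[k x n], row-major storage.
// Each element of C is the dot product of one row of A and one column of B.
void gemm_naive(size_t m, size_t n, size_t k,
                const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            double acc = C[i * n + j];
            for (size_t p = 0; p < k; p++) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = acc;
        }
    }
}
```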
The computational intensity of this implementation is around 2, which is a rather low constant.
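To see where the constant 2 comes from, take square n*n matrices and assume the cache can hold a row of A but not all of B; then every column of B must be re-read for each row of C:

```latex
f = 2n^3, \qquad
s \approx \underbrace{n^3}_{B} + \underbrace{n^2}_{A} + \underbrace{2n^2}_{C}, \qquad
q = \frac{2n^3}{n^3 + 3n^2} \approx 2
```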
Blocking Accelerates GEMM
With blocking, A, B and C are sliced into blocks of sizes mc*kc, kc*nr and mc*nr respectively, along both the row and column dimensions. Suppose the cache is large enough to hold one block each from A, B and C at the same time; the GEMM implementation and its data movements are as follows.
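A minimal C sketch of such a blocked implementation, assuming row-major storage (the block sizes MC/KC/NR below are illustrative constants, not tuned values):

```c
#include <stddef.h>

#define MIN(a, b) ((a) < (b) ? (a) : (b))

// Illustrative block sizes; real libraries tune these to the cache hierarchy.
enum { MC = 64, KC = 64, NR = 64 };

// Blocked GEMM: C[m x n] += A[m x k] * B[k x n], row-major storage.
void gemm_blocked(size_t m, size_t n, size_t k,
                  const double *A, const double *B, double *C)
{
    for (size_t ib = 0; ib < m; ib += MC) {         // block row of A and C
        for (size_t jb = 0; jb < n; jb += NR) {     // block column of B and C
            for (size_t kb = 0; kb < k; kb += KC) { // reduction dimension
                // Multiply one MC x KC block of A by one KC x NR block of B,
                // accumulating into the MC x NR block of C. All three blocks
                // are assumed to fit in cache together; MIN handles the
                // ragged edges when dimensions are not multiples of the
                // block sizes.
                for (size_t i = ib; i < MIN(ib + MC, m); i++) {
                    for (size_t j = jb; j < MIN(jb + NR, n); j++) {
                        double acc = C[i * n + j];
                        for (size_t p = kb; p < MIN(kb + KC, k); p++) {
                            acc += A[i * k + p] * B[p * n + j];
                        }
                        C[i * n + j] = acc;
                    }
                }
            }
        }
    }
}
```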
With the above implementation, here is how many times each block of A, B and C is loaded or saved. Blocks of A and B are only loaded; blocks of C are loaded and saved the same number of times.
The total number of data movements between memory and cache is then given by the following equation. Clearly, increasing the block sizes nr and mc reduces data movement.
Accordingly, the computational intensity is as follows. If the blocks are chosen to be square matrices (b = nr = mc) and the block size is much smaller than the matrix dimension k, the computational intensity reduces to the block size b.
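A hedged reconstruction of these counts: each block of A is loaded once per block column of C (n/nr times), each block of B once per block row of C (m/mc times), and each element of C is moved twice (1 load + 1 save), giving

```latex
s = \underbrace{mk \cdot \frac{n}{n_r}}_{A}
  + \underbrace{kn \cdot \frac{m}{m_c}}_{B}
  + \underbrace{2mn}_{C}, \qquad
q = \frac{2mnk}{s}
  = \frac{2}{\dfrac{1}{n_r} + \dfrac{1}{m_c} + \dfrac{2}{k}}
  \approx b \quad (b = n_r = m_c \ll k)
```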
The bigger the block size, the higher the computational intensity. This tells us clearly that blocking improves the speed of GEMM. As an edge case, if the fast cache is large enough to store A, B and C completely, then every time we request an element it is already in cache. Each element of A and B is loaded exactly once, and each element of C is moved twice (1 load + 1 save). The computational intensity then reaches its peak value.
The blocking technique is also called tiling or loop tiling, and a block is then named a tile. It hides the cost of data movement from memory to cache by increasing the block size in GEMM between large matrices. In most libraries' implementations, q is designed to be more than 1000.
Blocking Strategies
There are multiple approaches to get the final GEMM result. We could choose to iterate over blocks of A first, blocks of B first, or both at the same time. Anatomy of High-Performance Matrix Multiplication discusses all of them[2]. Here are illustrations for two of them: GEBP+GEPM and GEPDOT+GEMP.
We call each submatrix a block and each row or column made of multiple blocks, a panel. Here is a visual representation.

In GEBP(Block x Panel)+GEPM(Panel x Matrix), we iterate through blocks of A first. Each block of A and the corresponding row panel of B contribute to a row panel of C(GEBP). Each row panel of A and the whole B contribute to a row panel of C(GEPM).

In GEPDOT(Panel x Panel)+GEMP(Matrix x Panel), we iterate through blocks of A and B at the same time. Each row panel of A and column panel of B contribute to a block of C(GEPDOT). The whole A and a column panel of B contribute to a column panel of C(GEMP).

There isn't an obviously faster approach. For GEBP+GEPM, A is visited only once, while B and C are visited multiple times. For GEPDOT+GEMP, both A and B are visited multiple times, but C is visited only once. In what follows, we pick GEBP+GEPM for discussion, one of the most commonly used strategies in vector-acceleration libraries.
Given a specific strategy, the block sizes chosen for A, B and C (mc, kc, nr) also have a significant impact on performance. To avoid frequent reads from RAM and improve the cache hit rate, the general principles are:
During the calculation of each block of the result C (mc*nr), guarantee that the block from A (mc*kc), the block from B (kc*nr) and the block of C (mc*nr) fit into the cache completely.
A should stay in cache until it is no longer needed; that is, A must not be evicted while the entire row panel of B is processed.
However, selecting the best block sizes is non-trivial. Take the constraints on the value of mc*kc as an example. The cost of updating a row panel of C is as follows. From the last equation, the larger mc*kc is, the lower the possibility that data transfer becomes the bottleneck relative to the required CPU flops. But the cache size is limited, which implies mc*kc cannot be too large. At the same time, we also need to reserve space for the blocks from B and C.
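As a toy illustration of the capacity constraint above, one can check whether a candidate (mc, kc, nr) fits a cache budget; the 32 KiB figure below is a hypothetical L1d size, not a probed value:

```c
#include <stddef.h>

// Hypothetical cache budget (e.g. a 32 KiB L1d); real code would query
// the hardware or tune empirically.
enum { CACHE_BYTES = 32 * 1024 };

// The A block (mc x kc), B block (kc x nr) and C block (mc x nr) of
// doubles must fit in the cache together.
int blocks_fit(size_t mc, size_t kc, size_t nr)
{
    size_t bytes = sizeof(double) * (mc * kc + kc * nr + mc * nr);
    return bytes <= CACHE_BYTES;
}
```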
The selection of cache layers for the blocks of A, B and C is another problem. L1 is faster than L2 but smaller. Is L1 the best choice for the block of A, given that it is needed while we iterate over all blocks in a panel of B? If L2 is chosen for A, is the data transfer rate between L2 and L1 fast enough to guarantee that it won't be the bottleneck? The answers to those questions are not obvious.
We won't go further on this topic. To summarize, the optimal strategy and its settings for block sizes and cache layers deserve more advanced analysis or even empirical testing. They differ across hardware and even under different system workloads. But the guideline for blocking is clear:
choose a blocking strategy
find the optimal settings of blocking sizes
decide the right cache layers for blocks
References
[1] What Every Programmer Should Know About Memory.
[2] Goto, Kazushige, and Robert A. van de Geijn. "Anatomy of High-Performance Matrix Multiplication." ACM Transactions on Mathematical Software (TOMS) 34.3 (2008): 1-25.
[3] Optimizing Cache Performance in Matrix Multiplication.
[4] Tiling matrix-matrix multiply, code tuning.
[5] Estimation of the Computational Intensity.