Packing
In blocking, the matrices A, B and C are divided into blocks, and the result C is accumulated through a series of products between blocks of A and B. The remaining problem is how to multiply one pair of blocks from A and B efficiently (the kernel). That is the focus of this section.
Before we start, a few important facts about the data exchange pattern between RAM, cache and the CPU are worth noting.
Firstly, data transfer between RAM and the different cache layers happens at the granularity of a cache line, which is 64 bytes on most systems. Whenever the CPU requests one byte of memory, the data around it is loaded as well.
Secondly, when a cache line is loaded, cache lines near it may be loaded too: the hardware prefetcher fetches lines it predicts will be needed soon, based on the access pattern it observes.
All these facts lead to the same conclusion: it is better to access memory contiguously. As a result, the elements of the blocks from A and B have to be re-ordered. This technique is called packing.
As an example, the matrix A below (the entire A) is divided into 12 (3x4) blocks in the blocking step; each block is a 4x4 submatrix. Since the entire A is stored in row-major order, the numbers within a block do not occupy a contiguous memory area. After packing, the numbers in a block are rearranged to be contiguous in memory before any substantial computation on it.

There are two levels, or steps, of packing. They are usually done at the same time, but serve different purposes:
Make each block occupy a contiguous region of memory. This improves cache locality.
Further reorder the numbers within a block. This finer layout is required by specific SIMD intrinsics.
To keep the notation simple, A and B refer to blocks of A and B, rather than the entire matrices, in what follows.
Using FMA
As an example, assume we have 128-bit SIMD registers, each holding up to four 32-bit floating-point numbers or integers, and that all matrices are stored in row-major order. The block size is set to 4x4. Real-world block sizes are larger, but a small block gives a clearer picture to zoom in on; the idea is the same for bigger block sizes.

The FMA instruction multiplies two vectors element-wise and adds the result, again element-wise, to a third vector of the same length. In this case, the first float of A (shown in the bold box) is broadcast four times to fill a 128-bit register. The first 4 values of row 1 (blue) of B occupy another 128-bit register. The two registers are multiplied and the result is accumulated into a third 128-bit register holding the first row of C. Then the next value down that column of A is broadcast and multiplied by the same 4 floats of B, accumulating into the next row of C. This continues until the whole first column of A has been read, at which point we move on to the next row of B.

Efficient use of FMA instructions may require a data-layout transformation of A, B and C so that contiguous bits can be read. The procedure above:
reads A element by element, in column-major order.
reads B row by row.
When A, B and C are all stored in row-major order, not all of these accesses are contiguous. To use the cache more efficiently, the data layout should be rearranged, differently from what we had at the beginning of the section:
A should be packed into column-major order.
B and C need no change: their row-major layout is already contiguous and cache-friendly.
