FLOPS
CPI and Latency
One valuable measure of how a computer system performs is the average number of CPU cycles needed for an instruction to complete, which indicates how much latency is in the system. Metrics related to this are CPI, throughput, and latency.
Cycles per instruction (CPI) is the number of CPU cycles taken for one instruction to complete. A single machine instruction may take one or more CPU cycles, but the average, or effective, CPI is the total number of CPU cycles used divided by the total number of instructions executed within a period of time. Modern CPUs are pipelined: instructions flow through the CPU in stages, which lets the CPU start a new instruction on each clock cycle before the current one has completed. Throughput, the reciprocal of CPI, is how often an operation can start when no data dependencies force it to wait (and it is not I/O bound).
Latency is the number of CPU clock cycles it takes for an instruction's result to become available for use by another instruction. An instruction with a latency of 6 clocks therefore has its result available for another instruction 6 clocks after it starts executing[1]. Latency is measured for a single instruction and is not averaged over time, so CPI and latency are not the same concept.
Throughput (or CPI) and latency are two separate metrics that are often provided together as the basis for instruction performance on a microprocessor. As an example, here are the latency and CPI numbers of the SIMD instruction _mm512_fmadd_pd from Intel's specification. For Skylake, the CPI is 0.5, which means that on average two _mm512_fmadd_pd instructions can complete per clock cycle. The latency of 4 means it takes 4 clock cycles for the instruction's result to become available.
| Architecture | Latency | CPI |
| --- | --- | --- |
| Icelake | 4 | 1 |
| Skylake | 4 | 0.5 |
| Knights Landing | 6 | 0.5 |
FLOPS
Scientific computing tasks involve intensive floating point calculations. Floating point operations per second (FLOPS, flops) thus becomes an important and comprehensive measurement of system performance. Take the dot product of two vectors as an example. If a and b are vectors of length 1,000, the total number of floating point operations is 2,000 (1,000 multiplications plus 1,000 additions to accumulate the results). Provided that the task finishes in half a second, the FLOPS of the task is 4,000 FLOPS.
Another relevant term, FLOPs (plural of FLOP), is used in the context of floating point operations as well. It is the total amount of floating point operations carried out by the computer hardware. In short, FLOPS measures speed while FLOPs measures an absolute amount. For the vector product example above, its FLOPs and FLOPS are 2,000 and 4,000 respectively.
Theoretical FLOPS of CPU
Regarding SIMD instructions, the maximum FLOPS of a CPU can be calculated by:

FLOPS = number of cores × clock frequency × vector elements per instruction ÷ CPI
Specifically, with FMA (2 FLOPs per instruction) it will be:

FLOPS = number of cores × clock frequency × vector elements per instruction × 2 ÷ CPI
Let's look at an example. Here are the specifications for the Intel Xeon W2123 CPU. We want its FLOPS for multiplication/addition operations on double-precision floating point numbers.
Intel Xeon W2123 Specification:

| Spec | Value |
| --- | --- |
| Num of cores | 4 |
| Base speed | 3.6 GHz |
| AVX 512 | Yes |
| FMA | Yes |
The following capabilities of the CPU can be leveraged to maximize the FLOPS:

- With AVX 512, a vector of 8 double elements can be processed per instruction.
- With the FMA instruction _mm512_fmadd_pd, 2 FLOPs (1 multiply and 1 add) can be done per instruction.
Given that the CPI of the FMA instruction is 0.5, the final theoretical FLOPS number is derived:

4 cores × 3.6 GHz × 8 doubles per vector × 2 FLOPs per FMA ÷ 0.5 CPI = 460.8 GFLOPS
Benchmark FLOPS of CPU
This git repo provides FLOPS benchmark tools that run on both Windows and Linux for a variety of SIMD instructions.
As an example, here is a reference benchmark result for the above Intel Xeon W2123 CPU on Windows. We can see in the last line that the GFLOPS of double-precision FMA (with _mm512_fmadd_pd) is about 429, which is very close to the theoretical result of 460.8. Achieving the theoretical value in real code requires very careful tuning, near-zero cache misses, and no bottlenecks on anything else.
Note that the CPU has 4 physical cores/8 logical cores. We need to set the thread number to 8 to achieve 4 times the single-thread FLOPS. If we set the thread number to 4, the FLOPS drops, because the threads may be scheduled onto the same physical cores.
FLOPS of Popular CPUs
Here is a list of single-core FLOPs per cycle for a few CPU families.
- Intel Haswell/Broadwell/Skylake/Kaby Lake/Coffee/... (AVX+FMA3):
  - 16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
  - 32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
- Intel Skylake-X/Skylake-EP/Cascade Lake/etc. (AVX512F) with 1 FMA unit: some Xeon Bronze/Silver
  - 16 DP FLOPs/cycle: one 8-wide FMA (fused multiply-add) instruction
  - 32 SP FLOPs/cycle: one 16-wide FMA (fused multiply-add) instruction
- Intel Skylake-X/Skylake-EP/Cascade Lake/etc. (AVX512F) with 2 FMA units: Xeon Gold/Platinum, and i7/i9 high-end desktop (HEDT) chips
  - 32 DP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
  - 64 SP FLOPs/cycle: two 16-wide FMA (fused multiply-add) instructions
Benchmark FLOPS of Code
The perf tool (also known as perf_events) on Linux can be used to profile the FLOPs of a program. Here is a naive implementation of matrix multiplication c = a*b.
The total FLOPs of the task is about 53.6G, and it completes in about 12.6 seconds. Dividing the two gives about 4.25 GFLOPS, which is almost the same as the empirical result given by perf.
The above code is compiled with the -O3 option, with which GCC optimizes the floating point operations using SIMD (SSE) instructions. If we disable the optimization, the GFLOPS drops significantly.
References
1. Measuring Instruction Latency and Throughput [link][cache]
2. Instruction tables: lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD, and VIA CPUs. [link]