FLOPS
CPI and Latency
One valuable measure of how a computer system performs is the average number of CPU cycles needed for an instruction to complete, which indicates how much latency is in the system. Metrics related to this are CPI, throughput, and latency.
Cycles per instruction (CPI) is the number of CPU cycles taken for one instruction to complete. A single machine instruction may take one or more CPU cycles, but the average, or effective, CPI is the total number of CPU cycles used divided by the total number of instructions executed within a period of time. Modern CPUs are pipelined: instructions flow through the CPU in stages, which lets the CPU start a new instruction on each clock cycle before the current one has completed. Throughput, the reciprocal of CPI, is how often an operation can start when no data dependencies force it to wait (and it is not I/O bound).
Latency is the number of CPU clock cycles it takes for an instruction's result to become available for use by another instruction. An instruction with a latency of 6 clocks therefore has its result available for another instruction 6 clocks after it starts executing[1]. Latency is measured for a single instruction and is not averaged over time, so CPI and latency are not the same concept.
Throughput (or CPI) and latency are two separate metrics that are often provided together as the basis for instruction performance on a microprocessor. As an example, here are the latency and CPI numbers of the SIMD instruction _mm512_fmadd_pd from Intel's specification. For Skylake, the CPI is 0.5, which means that on average two _mm512_fmadd_pd instructions can complete per clock cycle. The latency of 4 means it takes 4 clock cycles for the instruction's result to become available.
| Architecture | Latency | CPI |
| --- | --- | --- |
| Icelake | 4 | 1 |
| Skylake | 4 | 0.5 |
| Knights Landing | 6 | 0.5 |
FLOPS
Scientific computing tasks involve intensive floating point calculations. Floating point operations per second (FLOPS, flops) thus becomes an important and comprehensive measurement of system performance. Take the dot product of two vectors as an example. If a and b are vectors of length 1,000, the total number of floating point operations is 2,000 (1,000 multiplications plus 1,000 additions to accumulate the results). Provided that the task finishes in half a second, the FLOPS of the task is 4,000 FLOPS.
Another relevant term, FLOPs (plural of FLOP), is used in the context of floating point operations as well. It is the total amount of floating point operations carried out by the computer hardware. In short, FLOPS measures speed while FLOPs measures an absolute amount. For the vector product example above, its FLOPs and FLOPS are 2,000 and 4,000 respectively.
Theoretical FLOPS of CPU
Regarding SIMD instructions, the maximum FLOPS of a CPU can be calculated by:

FLOPS = number of cores × clock frequency × vector elements per instruction ÷ CPI
Specifically, with FMA (2 FLOPs per instruction) it will be:

FLOPS = number of cores × clock frequency × vector elements per instruction × 2 ÷ CPI
Let's look at an example. Here are the specifications for the Intel Xeon W2123 CPU. We want its FLOPS for multiplication/addition operations on double-precision floating point numbers.
Intel Xeon W2123 Specification:

| Spec | Value |
| --- | --- |
| Num of cores | 4 |
| Base speed | 3.6 GHz |
| AVX 512 | Yes |
| FMA | Yes |
The following capabilities of the CPU can be leveraged to maximize the FLOPS:

- With AVX 512, a vector of 8 double elements can be processed per instruction.
- With the FMA instruction _mm512_fmadd_pd, 2 FLOPs (1 multiply and 1 add) can be done per instruction.
Given that the CPI of the FMA instruction is 0.5, the final theoretical FLOPS number is derived:

4 cores × 3.6 GHz × 8 doubles per vector × 2 FLOPs per FMA ÷ 0.5 CPI = 460.8 GFLOPS
Benchmark FLOPS of CPU
This git repo provides FLOPS benchmark tools that run on both Windows and Linux for a variety of SIMD instructions.
As an example, here is a reference benchmark result for the above Intel Xeon W2123 CPU on Windows. We can see in the last line that the GFLOPS of double-precision FMA (with _mm512_fmadd_pd) is about 429, which is very close to the theoretical result of 460.8. Achieving the theoretical value in real code requires very careful tuning, near-zero cache misses, and no bottlenecks on anything else.
Note that the CPU has 4 physical cores/8 logical cores. We need to set the thread number to 8 to achieve 4 times the single-thread FLOPS. If we set the thread number to 4, the FLOPS drops, because the threads may be scheduled onto the same physical cores.
FLOPS of Popular CPUs
Here is a list of single-core FLOPs per cycle for a few CPU families.
- Intel Haswell/Broadwell/Skylake/Kaby Lake/Coffee/... (AVX+FMA3):
  - 16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
  - 32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
- Intel Skylake-X/Skylake-EP/Cascade Lake/etc. (AVX512F) with 1 FMA unit: some Xeon Bronze/Silver
  - 16 DP FLOPs/cycle: one 8-wide FMA (fused multiply-add) instruction
  - 32 SP FLOPs/cycle: one 16-wide FMA (fused multiply-add) instruction
- Intel Skylake-X/Skylake-EP/Cascade Lake/etc. (AVX512F) with 2 FMA units: Xeon Gold/Platinum, and i7/i9 high-end desktop (HEDT) chips
  - 32 DP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
  - 64 SP FLOPs/cycle: two 16-wide FMA (fused multiply-add) instructions
Benchmark FLOPS of Code
The perf tool (also known as perf_events) on Linux can be used to profile the FLOPs of a program. Here is a naive implementation of matrix multiplication c = a*b.
The total FLOPs of the task is about 53.6G, and it completes in about 12.6 seconds. Dividing the two gives about 4.25 GFLOPS, which is almost the same as the empirical result given by perf.
The above code is compiled with the -O3 option, with which GCC optimizes the floating point operations using SIMD (SSE) instructions. If we disable the optimization, the GFLOPS drops significantly.
References
1. Measuring Instruction Latency and Throughput [link][cache]
2. Instruction tables: lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD, and VIA CPUs. [link]