GEMM FPGA Benchmark Results

The benchmark results are grouped by benchmark version, since internal changes to the benchmark code can lead to different performance results. All measurements were done with single-precision floating-point matrices of size 4096x4096, which corresponds to 64 MiB of data per matrix. If this size was not evenly divisible by the number of kernel replications, the matrix size was reduced further so that every kernel replication received an equal load. Each measurement was executed 10 times, and the best result is published.
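The matrix footprint and the best-of-10 reporting described above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's own code: the function names are hypothetical, and the FLOP count of 2·n³ per multiplication is an assumption that the text above does not state.

```python
def matrix_mebibytes(n: int, bytes_per_element: int = 4) -> float:
    """Size of one n x n matrix in MiB (4 bytes per single-precision element)."""
    return n * n * bytes_per_element / 2**20


def best_gflops(n: int, runtimes_s: list[float]) -> float:
    """GFLOP/s of the fastest run among the given runtimes (in seconds).

    Assumption: one GEMM run is counted as 2 * n^3 floating-point operations.
    """
    flops = 2.0 * n**3
    return flops / min(runtimes_s) / 1e9


if __name__ == "__main__":
    print(matrix_mebibytes(4096))                 # 64.0
    # Hypothetical runtimes, only to show how the best run is selected.
    print(best_gflops(4096, [0.20, 0.10, 0.15]))
```

Taking the minimum runtime corresponds to publishing the best of the repeated runs, so outliers caused by system noise do not lower the reported figure.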

The results and the configurations used are given in Table 8 and are also available as CSV.

Table 8 GEMM FPGA Benchmark Results

| Version | 1.4 | 1.0 | 1.0 | 1.0 | 1.0 |
|---|---|---|---|---|---|
| FPGA board | BittWare 520N | BittWare 520N | Alveo U280 | Alveo U280 | PAC D5005 |
| FPGA | Intel Stratix 10 GX2800 | Intel Stratix 10 GX2800 | Xilinx XCU280 | Xilinx XCU280 | Intel Stratix 10 SX |
| Memory Type | DDR | DDR | DDR | HBM2 | SVM |
| SDK | 21.2.0 | 19.4.0 | 2019.2 | 2019.2 | 19.4.0 |
| BSP/Shell | 20.4.0_hpc | 19.2.0_hpc | 2019.2.3 | 2019.2.3 | 18.1.2_svm |
| CPU | AMD EPYC Milan 7763 | Intel Xeon Gold 6148 | Intel Xeon Gold 6148 | Intel Xeon Gold 6148 | Intel Xeon Gold 6148 |
| System | Noctua 2 | Noctua 1 | Noctua 1 | Noctua 1 | |
| BLOCK_SIZE | 512 | 512 | 256 | 256 | 512 |
| GEMM_SIZE | 8 | 8 | 8 | 8 | 8 |
| GLOBAL_MEM_UNROLL | 8 | 16 | 16 | 16 | 16 |
| DATA_TYPE | float | float | float | float | float |
| NUM_REPLICATIONS | 5 | 5 | 3 | 3 | 5 |
| LUT | 310564 | 275754 | 568558 | 499002 | 299427 |
| LUT percent | 33 | 36.0 | 51.87 | 42.64 | 33.0 |
| Register | 793535 | 861277 | 441602 | 920127 | 829802 |
| Register percent | | 36.0 | 19.43 | 38.7 | 33.0 |
| BRAM | 8321 | 8860 | 666 | 666 | 9041 |
| BRAM percent | 71 | 76.0 | 43.11 | 36.71 | 77.0 |
| DSP | 3318 | 3398 | 7683 | 7683 | 3398 |
| DSP percent | 58 | 59.0 | 85.23 | 85.18 | 59.0 |
| Frequency (MHz) | 273.07 | 160.42 | 100.00 | 236.00 | 225.00 |
| GFLOPs | 1232.50 | 708.95 | 266.91 | 603.86 | 739.59 |
| GFLOPs norm | 90.27 | 88.39 | 85.29 | 88.97 | 65.74 |
| Error | 9.15527e-5 | 6.0e-7 | 2.0e-6 | 2.0e-6 | 6.0e-7 |
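
The table does not define "GFLOPs norm". One reading that fits the data is the measured GFLOPs per kernel replication, scaled to a 100 MHz clock. The sketch below assumes that definition; the function name and the formula are inferred from the numbers, not taken from the benchmark documentation. Under this assumption the formula reproduces the tabulated values for the two BittWare 520N columns and the PAC D5005 column. For the two Alveo U280 columns it yields 88.97 (DDR) and 85.29 (HBM2), which is the reverse of the order listed in the table.

```python
def normalized_gflops(gflops: float, frequency_mhz: float, num_replications: int) -> float:
    """GFLOPs per kernel replication at 100 MHz (assumed definition of 'GFLOPs norm')."""
    return gflops * 100.0 / (frequency_mhz * num_replications)


# (GFLOPs, Frequency in MHz, NUM_REPLICATIONS) copied from Table 8.
columns = {
    "1.4 BittWare 520N (DDR)": (1232.50, 273.07, 5),
    "1.0 BittWare 520N (DDR)": (708.95, 160.42, 5),
    "1.0 Alveo U280 (DDR)": (266.91, 100.00, 3),
    "1.0 Alveo U280 (HBM2)": (603.86, 236.00, 3),
    "1.0 PAC D5005 (SVM)": (739.59, 225.00, 5),
}

if __name__ == "__main__":
    for name, (gflops, freq, repl) in columns.items():
        print(f"{name}: {normalized_gflops(gflops, freq, repl):.2f}")
```

A normalization of this kind makes designs comparable independently of clock frequency and replication count, which would explain why the SVM-based PAC D5005 design stands out with a clearly lower value than the other columns.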