NVIDIA A100-SXM4-40GB (sm_80, 39.586 GiB)

The benchmarks below were run on one of the NVIDIA DGX A100 systems at the Paderborn Center for Parallel Computing (PC2).

For comparison, see the official NVIDIA A100 datasheet.

Peakflops

CUDA cores

julia> theoretical_peakflops_gpu(; dtype=Float32, tensorcores=false);
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float32
 └ max: 19.5

julia> theoretical_peakflops_gpu(; dtype=Float64, tensorcores=false);
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float64
 └ max: 9.7
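
The theoretical numbers follow directly from the device properties: peak FLOP/s = number of CUDA cores × boost clock × 2, since each core can retire one fused multiply-add (two FLOPs) per cycle. Below is a minimal sketch of that calculation with CUDA.jl; the CUDA.DEVICE_ATTRIBUTE_* constants and the per-SM core counts for sm_80 (64 FP32 cores, 32 FP64 units) are assumptions here, not necessarily how GPUInspector.jl computes it.

using CUDA

dev   = device()
n_sm  = attribute(dev, CUDA.DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT)  # 108 SMs on the A100
clock = attribute(dev, CUDA.DEVICE_ATTRIBUTE_CLOCK_RATE) * 1e3      # kHz -> Hz (1.41 GHz boost)

# sm_80: 64 FP32 cores and 32 FP64 units per SM, 2 FLOPs per FMA (assumed)
peak_fp32 = n_sm * 64 * clock * 2 / 1e12   # ~19.5 TFLOP/s
peak_fp64 = n_sm * 32 * clock * 2 / 1e12   # ~9.7 TFLOP/s
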
julia> peakflops_gpu(; dtype=Float32, tensorcores=false);
Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float32
 └ max: 19.1

julia> peakflops_gpu(; dtype=Float64, tensorcores=false);
Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float64
 └ max: 9.6

julia> peakflops_gpu(; dtype=Float16, tensorcores=false);
Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float16
 └ max: 12.8

Tensor cores

julia> theoretical_peakflops_gpu(; dtype=Int8, tensorcores=true);
Theoretical Peakflops (TOP/s):
 ├ tensorcores: true
 ├ dtype: Int8
 └ max: 623.7

julia> theoretical_peakflops_gpu(; dtype=Float16, tensorcores=true);
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: true
 ├ dtype: Float16
 └ max: 311.9

julia> theoretical_peakflops_gpu(; dtype=Float32, tensorcores=true);
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: true
 ├ dtype: Float32
 └ max: 155.9
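
The tensor-core estimates scale analogously: on the A100 each SM has four third-generation tensor cores, each sustaining 256 FP16 FMAs (512 FLOPs) per cycle; Int8 doubles and TF32 halves that throughput. A rough back-of-the-envelope check using these hardware figures (not the package's implementation):

n_sm, clock = 108, 1.41e9                 # A100 SM count and boost clock in Hz
fp16_flop_per_sm_cycle = 4 * 256 * 2      # 4 tensor cores x 256 FMA x 2 FLOP (assumed)

peak_fp16 = n_sm * fp16_flop_per_sm_cycle * clock / 1e12   # ~312 TFLOP/s
peak_int8 = 2 * peak_fp16                                  # ~624 TOP/s
peak_tf32 = peak_fp16 / 2                                  # ~156 TFLOP/s
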
julia> peakflops_gpu(; dtype=Int8, tensorcores=true); # as of writing, only works with CUDA.jl#master
Peakflops (TOP/s):
 ├ tensorcores: true
 ├ dtype: Int8
 └ max: 620.11

julia> peakflops_gpu(; dtype=Float16, tensorcores=true);
Peakflops (TFLOP/s):
 ├ tensorcores: true
 ├ dtype: Float16
 └ max: 311.2

julia> peakflops_gpu(; dtype=:TensorFloat32, tensorcores=true); # as of writing, only works with Julia >= 1.8.0 and CUDA.jl PR 1419
Peakflops (TFLOP/s):
 ├ tensorcores: true
 ├ dtype: TensorFloat32
 └ max: 155.55
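
These measured tensor-core rates can be sanity-checked without GPUInspector.jl by timing a large half-precision matrix multiplication, since CUBLAS maps Float16 GEMM onto the tensor cores on sm_80. A sketch, with the matrix size and the 2n^3 FLOP count as the only assumptions:

using CUDA, LinearAlgebra

n = 8192
A = CuArray(rand(Float16, n, n))
B = CuArray(rand(Float16, n, n))
C = CUDA.zeros(Float16, n, n)

mul!(C, A, B)                          # warm-up / compilation
t = CUDA.@elapsed mul!(C, A, B)        # event-based GPU timing
println("~", 2 * n^3 / t / 1e12, " TFLOP/s")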

Memory bandwidth

julia> theoretical_memory_bandwidth();
Theoretical Maximal Memory Bandwidth (GiB/s):
 └ max: 1448.4
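
The theoretical figure is memory clock × bus width × 2, the factor of two accounting for HBM2's double data rate. A sketch of the same calculation from the device attributes (the CUDA.DEVICE_ATTRIBUTE_* names are assumed):

using CUDA

dev       = device()
mem_clock = attribute(dev, CUDA.DEVICE_ATTRIBUTE_MEMORY_CLOCK_RATE) * 1e3       # kHz -> Hz (1215 MHz)
bus_bytes = attribute(dev, CUDA.DEVICE_ATTRIBUTE_GLOBAL_MEMORY_BUS_WIDTH) / 8   # bits -> bytes (5120-bit)

bw = 2 * mem_clock * bus_bytes   # ~1555 GB/s
bw / 2^30                        # ~1448 GiB/s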

julia> memory_bandwidth();
Memory Bandwidth (GiB/s):
 └ max: 1220.7

In base-10 units, the measured bandwidth corresponds to roughly 1.3 TB/s:

julia> GiB(1220.7) |> change_base
~1310.72 GB

julia> memory_bandwidth_saxpy();
Memory Bandwidth (GiB/s):
 └ max: 1192.09
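
memory_bandwidth_saxpy derives the bandwidth from a SAXPY kernel, which moves three Float32 values per element (two reads, one write). A minimal broadcast-based sketch of the same idea; the vector length is an arbitrary choice:

using CUDA

n = 2^28                                  # 1 GiB per Float32 vector
a = 3.1f0
x = CUDA.rand(Float32, n)
y = CUDA.rand(Float32, n)

saxpy!(y, a, x) = (y .= a .* x .+ y)

saxpy!(y, a, x)                           # warm-up / compilation
t = CUDA.@elapsed saxpy!(y, a, x)
println("~", 3 * n * sizeof(Float32) / t / 2^30, " GiB/s")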

Host-to-device bandwidth

julia> host2device_bandwidth()
Host <-> Device Bandwidth (GiB/s):
 └ max: 11.84

Host (pinned) <-> Device Bandwidth (GiB/s):
 └ max: 24.33
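
The gap between the two numbers comes from page-locking ("pinning") the host buffer, which lets the DMA engines transfer directly instead of staging through a bounce buffer. A sketch of such a measurement; CUDA.pin is assumed to be available (it requires a reasonably recent CUDA.jl):

using CUDA

n   = 2^28
h_x = rand(Float32, n)                    # pageable host buffer (1 GiB)
d_x = CUDA.zeros(Float32, n)

copyto!(d_x, h_x)                         # warm-up
t = @elapsed CUDA.@sync copyto!(d_x, h_x)
println("pageable H2D ~", sizeof(h_x) / t / 2^30, " GiB/s")

CUDA.pin(h_x)                             # page-lock the buffer (assumption: CUDA.pin)
t = @elapsed CUDA.@sync copyto!(d_x, h_x)
println("pinned   H2D ~", sizeof(h_x) / t / 2^30, " GiB/s")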

Peer-to-peer bandwidth

julia> p2p_bandwidth();
Bandwidth (GiB/s):
 ├ max: 247.32
 ├ min: 173.5
 ├ avg: 229.63
 └ std_dev: 31.67

julia> p2p_bandwidth_all()
8×8 Matrix{Union{Nothing, Float64}}:
    nothing  245.706     241.075     244.467     246.434     242.229     245.085     245.033
 239.046        nothing  241.776     243.853     241.626     245.136     244.467     240.379
 246.957     242.633        nothing  242.937     245.291     248.114     239.193     242.684
 244.724     241.375     244.211        nothing  245.861     238.117     245.085     242.28
 241.576     246.329     242.582     245.602        nothing  246.59      240.677     243.343
 247.114     240.18      245.965     244.006     236.616        nothing  242.28      244.673
 243.802     242.028     248.326     239.933     244.365     245.033        nothing  245.498
 245.136     246.904     239.488     243.343     244.057     240.627     243.445        nothing
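
p2p_bandwidth times a buffer copy from one GPU to another, and p2p_bandwidth_all repeats this for every ordered device pair (the diagonal is nothing). A minimal sketch of a single GPU 0 -> GPU 1 measurement, assuming that copyto! between CuArrays on different devices goes through the peer/NVLink path via unified addressing:

using CUDA

n = 2^28                                  # 1 GiB payload
device!(0); src = CUDA.rand(Float32, n)   # buffer on GPU 0
device!(1); dst = CUDA.zeros(Float32, n)  # buffer on GPU 1

copyto!(dst, src)                         # warm-up
t = @elapsed CUDA.@sync copyto!(dst, src)
println("GPU 0 -> GPU 1 ~", sizeof(src) / t / 2^30, " GiB/s")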

GPU information

julia> CUDA.versioninfo()
CUDA toolkit 11.5, local installation
NVIDIA driver 495.29.5, for CUDA 11.5
CUDA driver 11.5

Libraries: 
- CUBLAS: 11.7.3
- CURAND: 10.2.6
- CUFFT: 10.6.0
- CUSOLVER: 11.2.1
- CUSPARSE: 11.7.0
- CUPTI: 16.0.0
- NVML: 11.0.0+495.29.5
- CUDNN: missing
- CUTENSOR: missing

Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false

8 devices:
  0: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  1: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  2: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  3: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  4: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  5: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  6: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
  7: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)

julia> gpuinfo()
Device: NVIDIA A100-SXM4-40GB (CuDevice(0))
Total amount of global memory: 42.5 GB
Number of CUDA cores: 6912
Number of multiprocessors: 108 (64 CUDA cores each)
GPU max. clock rate: 1410 Mhz
Memory clock rate: 1215 Mhz
Memory bus width: 5120-bit
L2 cache size: 41.9 MB
Max. texture dimension sizes (1D): 131072
Max. texture dimension sizes (2D): 131072, 65536
Max. texture dimension sizes (3D): 16384, 16384, 16384
Max. layered 1D texture size: 32768 (2048 layers)
Max. layered 2D texture size: 32768, 32768 (2048 layers)
Total amount of constant memory: 65.5 kB
Total amount of shared memory per block: 49.2 kB
Total number of registers available per block: 65536
Warp size: 32
Max. number of threads per multiprocessor: 2048
Max. number of threads per block: 1024
Max. dimension size of a thread block (x,y,z): 1024, 1024, 64
Max. dimension size of a grid size (x,y,z): 2147483647, 65535, 65535
Texture alignment: 512.0 B
Maximum memory pitch: 2.1 GB
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing host memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for surfaces: Yes
Device has ECC support: Yes
Device supports Unified Addressing (UVA): Yes
Device supports managed memory: Yes
Device supports compute preemption: Yes
Supports cooperative kernel launch: Yes
Supports multi-device co-op kernel launch: Yes
Device PCI domain ID / bus ID / device ID: 0 / 7 / 0
Compute mode: Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)
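
Most of the values printed by gpuinfo can also be queried directly through CUDA.jl, which is convenient in scripts; a few examples (the CUDA.DEVICE_ATTRIBUTE_* names are assumed):

using CUDA

dev = CuDevice(0)
CUDA.name(dev)                                                # "NVIDIA A100-SXM4-40GB"
CUDA.capability(dev)                                          # v"8.0" (sm_80)
CUDA.totalmem(dev) / 2^30                                     # ~39.6 GiB
attribute(dev, CUDA.DEVICE_ATTRIBUTE_MULTIPROCESSOR_COUNT)    # 108
attribute(dev, CUDA.DEVICE_ATTRIBUTE_CLOCK_RATE) / 1e6        # 1.41 GHz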

julia> gpuinfo_p2p_access()
P2P Access Supported:
8×8 Matrix{Bool}:
 0  1  1  1  1  1  1  1
 1  0  1  1  1  1  1  1
 1  1  0  1  1  1  1  1
 1  1  1  0  1  1  1  1
 1  1  1  1  0  1  1  1
 1  1  1  1  1  0  1  1
 1  1  1  1  1  1  0  1
 1  1  1  1  1  1  1  0

P2P Atomic Supported:
8×8 Matrix{Bool}:
 0  1  1  1  1  1  1  1
 1  0  1  1  1  1  1  1
 1  1  0  1  1  1  1  1
 1  1  1  0  1  1  1  1
 1  1  1  1  0  1  1  1
 1  1  1  1  1  0  1  1
 1  1  1  1  1  1  0  1
 1  1  1  1  1  1  1  0