NVIDIA A100-SXM4-40GB (sm_80, 39.586 GiB)
Benchmarks were run on a DGX-A100 at PC2.
For comparison: Datasheet by NVIDIA.
Peakflops
CUDA cores
julia> theoretical_peakflops_gpu(; dtype=Float32, tensorcores=false);
Theoretical Peakflops (TFLOP/s):
├ tensorcores: false
├ dtype: Float32
└ max: 19.5
julia> theoretical_peakflops_gpu(; dtype=Float64, tensorcores=false);
Theoretical Peakflops (TFLOP/s):
├ tensorcores: false
├ dtype: Float64
└ max: 9.7
julia> peakflops_gpu(; dtype=Float32, tensorcores=false);
Peakflops (TFLOP/s):
├ tensorcores: false
├ dtype: Float32
└ max: 19.1
julia> peakflops_gpu(; dtype=Float64, tensorcores=false);
Peakflops (TFLOP/s):
├ tensorcores: false
├ dtype: Float64
└ max: 9.6
julia> peakflops_gpu(; dtype=Float16, tensorcores=false);
Peakflops (TFLOP/s):
├ tensorcores: false
├ dtype: Float16
└ max: 12.8
Tensor cores
julia> theoretical_peakflops_gpu(; dtype=Int8, tensorcores=true);
Theoretical Peakflops (TOP/s):
├ tensorcores: true
├ dtype: Int8
└ max: 623.7
julia> theoretical_peakflops_gpu(; dtype=Float16, tensorcores=true);
Theoretical Peakflops (TFLOP/s):
├ tensorcores: true
├ dtype: Float16
└ max: 311.9
julia> theoretical_peakflops_gpu(; dtype=Float32, tensorcores=true);
Theoretical Peakflops (TFLOP/s):
├ tensorcores: true
├ dtype: Float32
└ max: 155.9
julia> peakflops_gpu(; dtype=Int8, tensorcores=true); # as of writing, only works with CUDA.jl#master
Peakflops (TOP/s):
├ tensorcores: true
├ dtype: Int8
└ max: 620.11
julia> peakflops_gpu(; dtype=Float16, tensorcores=true);
Peakflops (TFLOP/s):
├ tensorcores: true
├ dtype: Float16
└ max: 311.2
julia> peakflops_gpu(; dtype=:TensorFloat32, tensorcores=true); # as of writing, only works with Julia >= 1.8.0 and CUDA.jl PR 1419
Peakflops (TFLOP/s):
├ tensorcores: true
├ dtype: TensorFloat32
└ max: 155.55
Memory bandwidth
julia> theoretical_memory_bandwidth();
Theoretical Maximal Memory Bandwidth (GiB/s):
└ max: 1448.4
julia> memory_bandwidth();
Memory Bandwidth (GiB/s):
└ max: 1220.7
julia> GiB(1220.7) |> change_base
~1310.72 GB
julia> memory_bandwidth_saxpy();
Memory Bandwidth (GiB/s):
└ max: 1192.09
Host-to-device bandwidth
julia> host2device_bandwidth()
Host <-> Device Bandwidth (GiB/s):
└ max: 11.84
Host (pinned) <-> Device Bandwidth (GiB/s):
└ max: 24.33
Peer-to-peer bandwidth
julia> p2p_bandwidth();
Bandwidth (GiB/s):
├ max: 247.32
├ min: 173.5
├ avg: 229.63
└ std_dev: 31.67
julia> p2p_bandwidth_all()
8×8 Matrix{Union{Nothing, Float64}}:
nothing 245.706 241.075 244.467 246.434 242.229 245.085 245.033
239.046 nothing 241.776 243.853 241.626 245.136 244.467 240.379
246.957 242.633 nothing 242.937 245.291 248.114 239.193 242.684
244.724 241.375 244.211 nothing 245.861 238.117 245.085 242.28
241.576 246.329 242.582 245.602 nothing 246.59 240.677 243.343
247.114 240.18 245.965 244.006 236.616 nothing 242.28 244.673
243.802 242.028 248.326 239.933 244.365 245.033 nothing 245.498
245.136 246.904 239.488 243.343 244.057 240.627 243.445 nothing
GPU information
julia> CUDA.versioninfo()
CUDA toolkit 11.5, local installation
NVIDIA driver 495.29.5, for CUDA 11.5
CUDA driver 11.5
Libraries:
- CUBLAS: 11.7.3
- CURAND: 10.2.6
- CUFFT: 10.6.0
- CUSOLVER: 11.2.1
- CUSPARSE: 11.7.0
- CUPTI: 16.0.0
- NVML: 11.0.0+495.29.5
- CUDNN: missing
- CUTENSOR: missing
Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false
8 devices:
0: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
1: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
2: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
3: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
4: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
5: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
6: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
7: NVIDIA A100-SXM4-40GB (sm_80, 39.583 GiB / 39.586 GiB available)
julia> gpuinfo()
Device: NVIDIA A100-SXM4-40GB (CuDevice(0))
Total amount of global memory: 42.5 GB
Number of CUDA cores: 6912
Number of multiprocessors: 108 (64 CUDA cores each)
GPU max. clock rate: 1410 Mhz
Memory clock rate: 1215 Mhz
Memory bus width: 5120-bit
L2 cache size: 41.9 MB
Max. texture dimension sizes (1D): 131072
Max. texture dimension sizes (2D): 131072, 65536
Max. texture dimension sizes (3D): 16384, 16384, 16384
Max. layered 1D texture size: 32768 (2048 layers)
Max. layered 2D texture size: 32768, 32768 (2048 layers)
Total amount of constant memory: 65.5 kB
Total amount of shared memory per block: 49.2 kB
Total number of registers available per block: 65536
Warp size: 32
Max. number of threads per multiprocessor: 2048
Max. number of threads per block: 1024
Max. dimension size of a thread block (x,y,z): 1024, 1024, 64
Max. dimension size of a grid size (x,y,z): 2147483647, 65535, 65535
Texture alignment: 512.0 B
Maximum memory pitch: 2.1 GB
Concurrent copy and kernel execution: Yes with 3 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing host memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for surfaces: Yes
Device has ECC support: Yes
Device supports Unified Addressing (UVA): Yes
Device supports managed memory: Yes
Device supports compute preemption: Yes
Supports cooperative kernel launch: Yes
Supports multi-device co-op kernel launch: Yes
Device PCI domain ID / bus ID / device ID: 0 / 7 / 0
Compute mode: Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)
julia> gpuinfo_p2p_access()
P2P Access Supported:
8×8 Matrix{Bool}:
0 1 1 1 1 1 1 1
1 0 1 1 1 1 1 1
1 1 0 1 1 1 1 1
1 1 1 0 1 1 1 1
1 1 1 1 0 1 1 1
1 1 1 1 1 0 1 1
1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 0
P2P Atomic Supported:
8×8 Matrix{Bool}:
0 1 1 1 1 1 1 1
1 0 1 1 1 1 1 1
1 1 0 1 1 1 1 1
1 1 1 0 1 1 1 1
1 1 1 1 0 1 1 1
1 1 1 1 1 0 1 1
1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 0