Tesla V100-SXM2-32GB (sm_70, 32.00 GiB)

Benchmarks were run on the GPU cluster of the physics faculty of University of Bielefeld.

For comparison: Datasheet by NVIDIA.

Peakflops

Theoretical

julia> theoretical_peakflops_gpu(; dtype=Float32, tensorcores=false);
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float32
 └ max: 15.7

julia> theoretical_peakflops_gpu(; dtype=Float64, tensorcores=false);
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float64
 └ max: 7.8

julia> theoretical_peakflops_gpu(; dtype=Float16, tensorcores=true);
Theoretical Peakflops (TFLOP/s):
 ├ tensorcores: true
 ├ dtype: Float16
 └ max: 125.3

Empirical

julia> peakflops_gpu(; dtype=Float32, tensorcores=false);
Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float32
 └ max: 15.5

julia> peakflops_gpu(; dtype=Float64, tensorcores=false);
Peakflops (TFLOP/s):
 ├ tensorcores: false
 ├ dtype: Float64
 └ max: 7.7

julia> peakflops_gpu(; dtype=Float16, tensorcores=true);
Peakflops (TFLOP/s):
 ├ tensorcores: true
 ├ dtype: Float16
 └ max: 116.4

Memory bandwidth

julia> theoretical_memory_bandwidth();
Theoretical Maximal Memory Bandwidth (GiB/s):
 └ max: 836.4

julia> memory_bandwidth();
Memory Bandwidth (GiB/s):
 └ max: 722.31

julia> GiB(722.31) |> change_base
~775.57 GB

julia> memory_bandwidth(GiB(1.4)) |> GiB |> change_base
Memory Bandwidth (GiB/s):
 └ max: 740.02
~794.59 GB

julia> memory_bandwidth_saxpy(; size=2^20*200) |> GiB |> change_base
Memory Bandwidth (GiB/s):
 └ max: 754.39
~810.02 GB

Host-to-device bandwidth

julia> host2device_bandwidth()
Host <-> Device Bandwidth (GiB/s):
 └ max: 4.46

Host (pinned) <-> Device Bandwidth (GiB/s):
 └ max: 12.14

Peer-to-peer bandwidth

julia> p2p_bandwidth();
Bandwidth (GiB/s):
 ├ max: 22.46
 ├ min: 0.21
 ├ avg: 17.99
 └ std_dev: 9.94

julia> p2p_bandwidth_all()
8_8 Matrix{Union{Nothing, Float64}}:
   nothing  22.5406    45.0071    22.5367    44.9879     8.11161    8.0912     8.09942
 22.4844      nothing  22.5393    44.9706     8.11274   45.0036     8.08732    8.10026
 44.7958    22.5406      nothing  45.0158     8.10381    8.11014   22.5406     7.39972
 22.5301    45.0088    45.0001      nothing   7.428      8.10646    8.08204   22.5401
 44.9897     7.14454    7.14674    7.14906     nothing  22.5371    44.9723    22.538
  6.63411   44.9966     7.14542    7.14467   22.5371      nothing  22.5362    44.9949
  7.75585    7.75844   22.5375     8.10314   44.9966    22.5353      nothing  44.9984
  7.75575    7.75911    8.10116   22.5371    22.5367    44.9966    44.9914      nothing

GPU information

julia> CUDA.versioninfo()
CUDA toolkit 11.6, local installation
NVIDIA driver 510.47.3, for CUDA 11.6
CUDA driver 11.6

Libraries: 
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.1
- CUSOLVER: 11.3.3
- CUSPARSE: 11.7.2
- CUPTI: 16.0.0
- NVML: 11.0.0+510.47.3
- CUDNN: missing
- CUTENSOR: missing

Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80

Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false

8 devices:
  0: Tesla V100-SXM2-32GB (sm_70, 31.449 GiB / 32.000 GiB available)
  1: Tesla V100-SXM2-32GB (sm_70, 31.745 GiB / 32.000 GiB available)
  2: Tesla V100-SXM2-32GB (sm_70, 31.745 GiB / 32.000 GiB available)
  3: Tesla V100-SXM2-32GB (sm_70, 31.745 GiB / 32.000 GiB available)
  4: Tesla V100-SXM2-32GB (sm_70, 31.745 GiB / 32.000 GiB available)
  5: Tesla V100-SXM2-32GB (sm_70, 31.745 GiB / 32.000 GiB available)
  6: Tesla V100-SXM2-32GB (sm_70, 31.745 GiB / 32.000 GiB available)
  7: Tesla V100-SXM2-32GB (sm_70, 31.745 GiB / 32.000 GiB available)

julia> gpuinfo()
Device: Tesla V100-SXM2-32GB (CuDevice(0))
Total amount of global memory: 31.749 GiB
Number of CUDA cores: 5120
Number of multiprocessors: 80 (64 CUDA cores each)
GPU max. clock rate: 1530 MHz
Memory clock rate: 877 MHz
Memory bus width: 4096-bit
L2 cache size: 6.000 MiB
Max. texture dimension sizes (1D): 131072
Max. texture dimension sizes (2D): 131072, 65536
Max. texture dimension sizes (3D): 16384, 16384, 16384
Max. layered 1D texture size: 32768 (2048 layers)
Max. layered 2D texture size: 32768, 32768 (2048 layers)
Total amount of constant memory: 64.000 KiB
Total amount of shared memory per block: 48.000 KiB
Total number of registers available per block: 65536
Warp size: 32
Max. number of threads per multiprocessor: 2048
Max. number of threads per block: 1024
Max. dimension size of a thread block (x,y,z): 1024, 1024, 64
Max. dimension size of a grid size (x,y,z): 2147483647, 65535, 65535
Texture alignment: 512 bytes
Maximum memory pitch: 2.000 GiB
Concurrent copy and kernel execution: Yes with 6 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing host memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for surfaces: Yes
Device has ECC support: Yes
Device supports Unified Addressing (UVA): Yes
Device supports managed memory: Yes
Device supports compute preemption: Yes
Supports cooperative kernel launch: Yes
Supports multi-device co-op kernel launch: Yes
Device PCI domain ID / bus ID / device ID: 0 / 31 / 0
Compute mode: Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)

julia> gpuinfo_p2p_access()
P2P Access Supported:
8_8 Matrix{Bool}:
 0  1  1  1  1  1  1  1
 1  0  1  1  1  1  1  1
 1  1  0  1  1  1  1  1
 1  1  1  0  1  1  1  1
 1  1  1  1  0  1  1  1
 1  1  1  1  1  0  1  1
 1  1  1  1  1  1  0  1
 1  1  1  1  1  1  1  0

P2P Atomic Supported:
8_8 Matrix{Bool}:
 0  1  1  1  1  0  0  0
 1  0  1  1  0  1  0  0
 1  1  0  1  0  0  1  0
 1  1  1  0  0  0  0  1
 1  0  0  0  0  1  1  1
 0  1  0  0  1  0  1  1
 0  0  1  0  1  1  0  1
 0  0  0  1  1  1  1  0