Tesla V100-SXM2-32GB (sm_70, 32.00 GiB)
Benchmarks were run on the GPU cluster of the Faculty of Physics at Bielefeld University.
For comparison, see the datasheet by NVIDIA.
Peakflops
Theoretical
julia> theoretical_peakflops_gpu(; dtype=Float32, tensorcores=false);
Theoretical Peakflops (TFLOP/s):
├ tensorcores: false
├ dtype: Float32
└ max: 15.7
julia> theoretical_peakflops_gpu(; dtype=Float64, tensorcores=false);
Theoretical Peakflops (TFLOP/s):
├ tensorcores: false
├ dtype: Float64
└ max: 7.8
julia> theoretical_peakflops_gpu(; dtype=Float16, tensorcores=true);
Theoretical Peakflops (TFLOP/s):
├ tensorcores: true
├ dtype: Float16
└ max: 125.3
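The theoretical numbers above can be reproduced by hand. The following is a minimal sketch, not necessarily how `theoretical_peakflops_gpu` computes them internally. The SM count, the 64 CUDA cores per SM, and the clock rate come from the `gpuinfo()` output further down in this document. The FP64 unit count and the tensor core figures are assumptions taken from NVIDIA's published V100 architecture description, and the helper name `peak_tflops` is hypothetical.

```julia
# Values reported by gpuinfo() in this document
n_sms    = 80        # number of multiprocessors
clock_hz = 1530e6    # GPU max. clock rate (1530 MHz)
fp32_per_sm = 64     # CUDA cores per multiprocessor

# Assumptions (NVIDIA's V100 architecture figures, not from gpuinfo())
fp64_per_sm        = 32   # FP64 units per multiprocessor
tensorcores_per_sm = 8    # tensor cores per multiprocessor
fma_per_tensorcore = 64   # FMAs per tensor core per clock (4x4x4 tile)

# Each fused multiply-add (FMA) counts as 2 floating-point operations.
# peak_tflops is a hypothetical helper name used only for this sketch.
peak_tflops(units_per_sm, fma_per_unit=1) =
    n_sms * units_per_sm * fma_per_unit * 2 * clock_hz / 1e12

fp32 = peak_tflops(fp32_per_sm)
fp64 = peak_tflops(fp64_per_sm)
fp16_tc = peak_tflops(tensorcores_per_sm, fma_per_tensorcore)

println("Float32:               ", round(fp32; digits=1), " TFLOP/s")
println("Float64:               ", round(fp64; digits=1), " TFLOP/s")
println("Float16 (tensorcores): ", round(fp16_tc; digits=1), " TFLOP/s")
```

Rounded to one decimal place, these give 15.7, 7.8, and 125.3 TFLOP/s, matching the reported values.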
Empirical
julia> peakflops_gpu(; dtype=Float32, tensorcores=false);
Peakflops (TFLOP/s):
├ tensorcores: false
├ dtype: Float32
└ max: 15.5
julia> peakflops_gpu(; dtype=Float64, tensorcores=false);
Peakflops (TFLOP/s):
├ tensorcores: false
├ dtype: Float64
└ max: 7.7
julia> peakflops_gpu(; dtype=Float16, tensorcores=true);
Peakflops (TFLOP/s):
├ tensorcores: true
├ dtype: Float16
└ max: 116.4
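To put the empirical results in context, one can compute the fraction of the theoretical peak that was reached. This is a small sketch that uses only the numbers printed in the two sections above; the variable names are illustrative.

```julia
# (theoretical, empirical) peak in TFLOP/s, copied from the outputs above
results = [
    ("Float32",               15.7,  15.5),
    ("Float64",                7.8,   7.7),
    ("Float16 (tensorcores)", 125.3, 116.4),
]

# Fraction of theoretical peak reached, in percent
efficiency = [(name, 100 * emp / theo) for (name, theo, emp) in results]

for (name, pct) in efficiency
    println(rpad(name, 24), round(pct; digits=1), " %")
end
```

With the values above, the Float32 and Float64 runs reach roughly 98.7 % of the theoretical peak, while the tensor core run reaches roughly 92.9 %.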
Memory bandwidth
julia> theoretical_memory_bandwidth();
Theoretical Maximal Memory Bandwidth (GiB/s):
└ max: 836.4
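The theoretical value is consistent with a simple estimate from the memory clock rate and bus width reported by `gpuinfo()` further down. The sketch below assumes double data rate memory (two transfers per clock), which is an assumption on our part and not something printed in this document. It may differ from how `theoretical_memory_bandwidth` is implemented internally.

```julia
# Values reported by gpuinfo() in this document
memory_clock_hz = 877e6   # memory clock rate (877 MHz)
bus_width_bits  = 4096    # memory bus width

# Assumption: double data rate, i.e. two transfers per clock cycle
transfers_per_clock = 2

bytes_per_s = memory_clock_hz * transfers_per_clock * bus_width_bits / 8

bw_gib = bytes_per_s / 2^30   # binary units (GiB/s)
bw_gb  = bytes_per_s / 1e9    # decimal units (GB/s)

println(round(bw_gib; digits=1), " GiB/s")
println(round(bw_gb; digits=1), " GB/s")
```

This yields about 836.4 GiB/s, or about 898.0 GB/s in decimal units.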
julia> memory_bandwidth();
Memory Bandwidth (GiB/s):
└ max: 722.31
julia> GiB(722.31) |> change_base
~775.57 GB
julia> memory_bandwidth(GiB(1.4)) |> GiB |> change_base
Memory Bandwidth (GiB/s):
└ max: 740.02
~794.59 GB
julia> memory_bandwidth_saxpy(; size=2^20*200) |> GiB |> change_base
Memory Bandwidth (GiB/s):
└ max: 754.39
~810.02 GB
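The values prefixed with `~` above are the same measurements expressed in decimal units (1 GB = 10^9 bytes) instead of binary units (1 GiB = 2^30 bytes). A minimal sketch of that conversion, using a hypothetical helper `gib_to_gb` rather than the `change_base` function itself:

```julia
# Hypothetical helper: convert a value in GiB (2^30 bytes) to GB (10^9 bytes)
gib_to_gb(x) = x * 2^30 / 10^9

for gib in (722.31, 740.02, 754.39)
    println(gib, " GiB/s is about ", round(gib_to_gb(gib); digits=2), " GB/s")
end
```

The factor is 2^30 / 10^9, roughly 1.074, so the decimal figure is always about 7.4 % larger than the binary one.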
Host-to-device bandwidth
julia> host2device_bandwidth()
Host <-> Device Bandwidth (GiB/s):
└ max: 4.46
Host (pinned) <-> Device Bandwidth (GiB/s):
└ max: 12.14
Peer-to-peer bandwidth
julia> p2p_bandwidth();
Bandwidth (GiB/s):
├ max: 22.46
├ min: 0.21
├ avg: 17.99
└ std_dev: 9.94
julia> p2p_bandwidth_all()
8×8 Matrix{Union{Nothing, Float64}}:
nothing  22.5406  45.0071  22.5367  44.9879  8.11161  8.0912   8.09942
22.4844  nothing  22.5393  44.9706  8.11274  45.0036  8.08732  8.10026
44.7958  22.5406  nothing  45.0158  8.10381  8.11014  22.5406  7.39972
22.5301  45.0088  45.0001  nothing  7.428    8.10646  8.08204  22.5401
44.9897  7.14454  7.14674  7.14906  nothing  22.5371  44.9723  22.538
6.63411  44.9966  7.14542  7.14467  22.5371  nothing  22.5362  44.9949
7.75585  7.75844  22.5375  8.10314  44.9966  22.5353  nothing  44.9984
7.75575  7.75911  8.10116  22.5371  22.5367  44.9966  44.9914  nothing
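The pairwise bandwidths fall into three visibly distinct tiers: about 45 GiB/s, about 22.5 GiB/s, and about 7 to 8 GiB/s. The sketch below groups the measurements accordingly. The thresholds (40 and 20 GiB/s) are chosen by eye, and the tier names reflect our assumption that the tiers correspond to two NVLink links, one NVLink link, and no direct NVLink connection; the benchmark output itself does not state this.

```julia
# Pairwise P2P bandwidths in GiB/s, copied from the p2p_bandwidth_all() output
bw = [
    nothing 22.5406 45.0071 22.5367 44.9879 8.11161 8.0912  8.09942
    22.4844 nothing 22.5393 44.9706 8.11274 45.0036 8.08732 8.10026
    44.7958 22.5406 nothing 45.0158 8.10381 8.11014 22.5406 7.39972
    22.5301 45.0088 45.0001 nothing 7.428   8.10646 8.08204 22.5401
    44.9897 7.14454 7.14674 7.14906 nothing 22.5371 44.9723 22.538
    6.63411 44.9966 7.14542 7.14467 22.5371 nothing 22.5362 44.9949
    7.75585 7.75844 22.5375 8.10314 44.9966 22.5353 nothing 44.9984
    7.75575 7.75911 8.10116 22.5371 22.5367 44.9966 44.9914 nothing
]

# Thresholds are assumptions picked by inspecting the data
classify(x) = x >= 40 ? :double_link :
              x >= 20 ? :single_link : :no_direct_link

# Drop the diagonal (nothing) and classify every directed device pair
vals  = [x for x in bw if x !== nothing]
tiers = classify.(vals)

tier_counts = Dict(t => count(==(t), tiers)
                   for t in (:double_link, :single_link, :no_direct_link))

for t in (:double_link, :single_link, :no_direct_link)
    println(rpad(string(t), 16), tier_counts[t], " directed pairs")
end
```

Of the 56 directed pairs, 16 fall into the fastest tier, 16 into the middle tier, and 24 into the slowest tier.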
GPU information
julia> CUDA.versioninfo()
CUDA toolkit 11.6, local installation
NVIDIA driver 510.47.3, for CUDA 11.6
CUDA driver 11.6
Libraries:
- CUBLAS: 11.8.1
- CURAND: 10.2.9
- CUFFT: 10.7.1
- CUSOLVER: 11.3.3
- CUSPARSE: 11.7.2
- CUPTI: 16.0.0
- NVML: 11.0.0+510.47.3
- CUDNN: missing
- CUTENSOR: missing
Toolchain:
- Julia: 1.7.1
- LLVM: 12.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80
Environment:
- JULIA_CUDA_USE_BINARYBUILDER: false
8 devices:
0: Tesla V100-SXM2-32GB (sm_70, 31.449 GiB / 32.000 GiB available)
1: Tesla V100-SXM2-32GB (sm_70, 31.745 GiB / 32.000 GiB available)
2: Tesla V100-SXM2-32GB (sm_70, 31.745 GiB / 32.000 GiB available)
3: Tesla V100-SXM2-32GB (sm_70, 31.745 GiB / 32.000 GiB available)
4: Tesla V100-SXM2-32GB (sm_70, 31.745 GiB / 32.000 GiB available)
5: Tesla V100-SXM2-32GB (sm_70, 31.745 GiB / 32.000 GiB available)
6: Tesla V100-SXM2-32GB (sm_70, 31.745 GiB / 32.000 GiB available)
7: Tesla V100-SXM2-32GB (sm_70, 31.745 GiB / 32.000 GiB available)
julia> gpuinfo()
Device: Tesla V100-SXM2-32GB (CuDevice(0))
Total amount of global memory: 31.749 GiB
Number of CUDA cores: 5120
Number of multiprocessors: 80 (64 CUDA cores each)
GPU max. clock rate: 1530 MHz
Memory clock rate: 877 MHz
Memory bus width: 4096-bit
L2 cache size: 6.000 MiB
Max. texture dimension sizes (1D): 131072
Max. texture dimension sizes (2D): 131072, 65536
Max. texture dimension sizes (3D): 16384, 16384, 16384
Max. layered 1D texture size: 32768 (2048 layers)
Max. layered 2D texture size: 32768, 32768 (2048 layers)
Total amount of constant memory: 64.000 KiB
Total amount of shared memory per block: 48.000 KiB
Total number of registers available per block: 65536
Warp size: 32
Max. number of threads per multiprocessor: 2048
Max. number of threads per block: 1024
Max. dimension size of a thread block (x,y,z): 1024, 1024, 64
Max. dimension size of a grid size (x,y,z): 2147483647, 65535, 65535
Texture alignment: 512 bytes
Maximum memory pitch: 2.000 GiB
Concurrent copy and kernel execution: Yes with 6 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing host memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for surfaces: Yes
Device has ECC support: Yes
Device supports Unified Addressing (UVA): Yes
Device supports managed memory: Yes
Device supports compute preemption: Yes
Supports cooperative kernel launch: Yes
Supports multi-device co-op kernel launch: Yes
Device PCI domain ID / bus ID / device ID: 0 / 31 / 0
Compute mode: Default (multiple host threads can use ::cudaSetDevice() with device simultaneously)
julia> gpuinfo_p2p_access()
P2P Access Supported:
8×8 Matrix{Bool}:
0 1 1 1 1 1 1 1
1 0 1 1 1 1 1 1
1 1 0 1 1 1 1 1
1 1 1 0 1 1 1 1
1 1 1 1 0 1 1 1
1 1 1 1 1 0 1 1
1 1 1 1 1 1 0 1
1 1 1 1 1 1 1 0
P2P Atomic Supported:
8×8 Matrix{Bool}:
0 1 1 1 1 0 0 0
1 0 1 1 0 1 0 0
1 1 0 1 0 0 1 0
1 1 1 0 0 0 0 1
1 0 0 0 0 1 1 1
0 1 0 0 1 0 1 1
0 0 1 0 1 1 0 1
0 0 0 1 1 1 1 0
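While P2P access is reported for every device pair, P2P atomics are only reported for a subset. As a quick way to summarize the atomic matrix, the following sketch checks that it is symmetric and counts the peers per device. It only uses the values printed above; the variable names are illustrative.

```julia
# "P2P Atomic Supported" matrix, copied from the gpuinfo_p2p_access() output
atomic = Bool[
    0 1 1 1 1 0 0 0
    1 0 1 1 0 1 0 0
    1 1 0 1 0 0 1 0
    1 1 1 0 0 0 0 1
    1 0 0 0 0 1 1 1
    0 1 0 0 1 0 1 1
    0 0 1 0 1 1 0 1
    0 0 0 1 1 1 1 0
]

# Support is mutual if the matrix equals its transpose
is_mutual = atomic == transpose(atomic)

# Number of peers with atomic support, per device (row sums)
peers_per_device = vec(sum(atomic; dims=2))

println("mutual support:   ", is_mutual)
println("peers per device: ", peers_per_device)
```

In this output, atomic support is mutual and every device has exactly four peers with atomic support. These are the same pairs that reach roughly 22.5 GiB/s or more in the `p2p_bandwidth_all()` output.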