Peakflops
Index
GPUInspector.kernel_fma
GPUInspector.peakflops_gpu
GPUInspector.peakflops_gpu_fmas
GPUInspector.peakflops_gpu_matmul
GPUInspector.peakflops_gpu_matmul_graphs
GPUInspector.peakflops_gpu_matmul_scaling
GPUInspector.peakflops_gpu_wmmas
GPUInspector.theoretical_peakflops_gpu
References
GPUInspector.peakflops_gpu — Method
peakflops_gpu(; tensorcores=hastensorcores(), kwargs...)
Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform
_kernel_fma_nfmas() * size many FMAs on CUDA cores (if tensorcores == false)
_kernel_wmma_nwmmas() many WMMAs on Tensor Cores (if tensorcores == true)
For more keyword argument options see peakflops_gpu_fmas and peakflops_gpu_wmmas.
GPUInspector.theoretical_peakflops_gpu — Method
Estimates the theoretical peak performance of a CUDA device in TFLOP/s.
Keyword arguments:
tensorcores (default: hastensorcores()): toggle usage of Tensor Cores. If false, CUDA cores will be used.
verbose (default: true): toggle printing of information.
device (default: device()): CUDA device to be analyzed.
dtype (default: tensorcores ? Float16 : Float32): element type of the matrices.
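As a rough illustration of what a theoretical peak estimate involves, the sketch below multiplies core count, clock rate, and FLOPs per core per cycle. The function name theoretical_tflops, its parameters, and the example numbers are hypothetical; the text above does not state which formula or device attributes theoretical_peakflops_gpu actually uses.

```julia
# Hypothetical helper, not part of GPUInspector.
# Assumes each core retires one FMA per cycle and that one FMA counts as 2 FLOPs.
function theoretical_tflops(ncores::Integer, clock_ghz::Real; flops_per_cycle::Integer=2)
    gflops = ncores * clock_ghz * flops_per_cycle  # cores * GHz * FLOPs/cycle -> GFLOP/s
    return gflops / 1000                           # GFLOP/s -> TFLOP/s
end

# Illustrative inputs only: 6912 cores running at 1.41 GHz.
println(theoretical_tflops(6912, 1.41))
```

A measured value from one of the peakflops functions can then be compared against such an estimate to see how close a kernel gets to the hardware limit.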
GPUInspector.peakflops_gpu_matmul — Method
peakflops_gpu_matmul(; device, dtype=Float32, size=2^14, nmatmuls=5, nbench=5, verbose=true)
Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform nmatmuls many (in-place) matrix-matrix multiplications.
Keyword arguments:
device (default: CUDA.device()): CUDA device to be used.
dtype (default: Float32): element type of the matrices.
size (default: 2^14): matrices will have dimensions (size, size).
nmatmuls (default: 5): number of matmuls that will make up the kernel to be timed.
nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
verbose (default: true): toggle printing.
See also: peakflops_gpu_matmul_scaling, peakflops_gpu_matmul_graphs.
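The measurement idea described above (time nmatmuls in-place matmuls, keep the best of nbench samples, convert to TFLOP/s) can be sketched on the CPU with the standard library. This is only an analogy and not the package's implementation: the names matmul_flops and cpu_matmul_tflops are hypothetical, and counting 2 * size^3 FLOPs per matmul is an assumed convention.

```julia
using LinearAlgebra

# Assumed convention: a dense (n x n) * (n x n) product costs 2 * n^3 FLOPs
# (one multiply and one add per inner-product term).
matmul_flops(n, nmatmuls) = 2.0 * n^3 * nmatmuls

# Hypothetical CPU stand-in; keyword names mirror the documented ones.
function cpu_matmul_tflops(; size=256, nmatmuls=5, nbench=5, dtype=Float32)
    A = rand(dtype, size, size)
    B = rand(dtype, size, size)
    C = zeros(dtype, size, size)
    mul!(C, A, B)                      # warm-up so compilation is not timed
    t = minimum(1:nbench) do _         # best (shortest) of nbench samples
        @elapsed begin
            for _ in 1:nmatmuls
                mul!(C, A, B)          # in-place: result is written into C
            end
        end
    end
    return matmul_flops(size, nmatmuls) / t / 1e12
end

println(cpu_matmul_tflops(size=128))
```

Taking the minimum time rather than the mean is the usual choice for peak estimates, since slower samples mostly reflect interference rather than the hardware limit.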
GPUInspector.peakflops_gpu_matmul_graphs — Method
Same as peakflops_gpu_matmul but uses CUDA's graph API to define and launch the kernel.
See also: peakflops_gpu_matmul_scaling.
GPUInspector.peakflops_gpu_matmul_scaling — Method
peakflops_gpu_matmul_scaling(peakflops_func = peakflops_gpu_matmul; verbose=true) -> sizes, flops
Assesses the scaling of the given peakflops_func function (defaults to peakflops_gpu_matmul) with increasing matrix size. If verbose=true (default), displays a Unicode plot. Returns the considered sizes and the measured TFLOP/s. For further options, see peakflops_gpu_matmul.
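The shape of such a scaling study can be sketched without a GPU: evaluate a peakflops-style function for a range of matrix sizes and return both vectors. Everything below is a hypothetical stand-in (cpu_tflops, scaling_sketch, and the chosen sizes are assumptions), intended only to mirror the documented `-> sizes, flops` return form.

```julia
using LinearAlgebra

# Hypothetical CPU stand-in for a peakflops function:
# best-of-nbench TFLOP/s for a single (size x size) Float32 matmul.
function cpu_tflops(; size, nbench=3)
    A = rand(Float32, size, size)
    B = rand(Float32, size, size)
    C = similar(A)
    mul!(C, A, B)                                    # warm-up
    t = minimum(_ -> (@elapsed mul!(C, A, B)), 1:nbench)
    return 2.0 * size^3 / t / 1e12                   # assumes 2 * size^3 FLOPs per matmul
end

# Call peakflops_func once per size and return (sizes, flops).
function scaling_sketch(peakflops_func=cpu_tflops; sizes=2 .^ (6:9))
    flops = [peakflops_func(; size=s) for s in sizes]
    return collect(sizes), flops
end

sizes, flops = scaling_sketch()
for (s, f) in zip(sizes, flops)
    println("size = ", s, ": ", f, " TFLOP/s")
end
```

Small matrices typically cannot saturate the hardware, so the measured rate is expected to rise with size before it levels off.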
GPUInspector.kernel_fma — Method
Dummy kernel doing _kernel_fma_nfmas() many FMAs (default: 100_000).
GPUInspector.peakflops_gpu_fmas — Method
peakflops_gpu_fmas(; size::Integer=5_000_000, dtype=Float32, nbench=5, nkernel=5, device=CUDA.device(), verbose=true)
Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_fma_nfmas() * size many FMAs on CUDA cores.
Keyword arguments:
device (default: CUDA.device()): CUDA device to be used.
dtype (default: Float32): element type of the vectors.
size (default: 5_000_000): length of vectors.
nkernel (default: 5): number of kernel calls that make up one benchmarking sample.
nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
verbose (default: true): toggle printing.
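To make the FMA-based estimate concrete, the sketch below shows a chain of dependent FMAs on the CPU and the arithmetic that turns an elapsed time into TFLOP/s. The names fma_chain and fma_tflops are hypothetical, and two points are assumptions rather than documented behavior: that one FMA counts as 2 FLOPs, and that a sample covers nkernel kernel calls, each performing nfmas * size FMAs.

```julia
# CPU analogue of a dummy FMA kernel: n dependent fused multiply-adds.
# Base.fma(a, b, c) computes a * b + c with a single rounding.
function fma_chain(a, b, c, n)
    for _ in 1:n
        c = fma(a, b, c)
    end
    return c
end

# Hypothetical conversion from a sample time t (in seconds) to TFLOP/s.
# Defaults mirror the documented values: 100_000 FMAs per element,
# vectors of length 5_000_000, and 5 kernel calls per sample.
function fma_tflops(t; nfmas=100_000, size=5_000_000, nkernel=5)
    flops = 2 * nfmas * size * nkernel   # assumed: 2 FLOPs per FMA
    return flops / t / 1e12
end

println(fma_chain(1.0f0, 1.0f0, 0.0f0, 10))
println(fma_tflops(0.5))
```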
GPUInspector.peakflops_gpu_wmmas — Method
peakflops_gpu_wmmas()
Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_wmma_nwmmas() many WMMAs on Tensor Cores.
Keyword arguments:
device (default: CUDA.device()): CUDA device to be used.
dtype (default: Float16): element type of the matrices. We currently only support Float16 (Int8, :TensorFloat32, :BFloat16, and Float64 might or might not work).
nkernel (default: 10): number of kernel calls that make up one benchmarking sample.
nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
threads (default: max. threads per block): how many threads to use per block (part of the kernel launch configuration).
blocks (default: 2048): how many blocks to use (part of the kernel launch configuration).
verbose (default: true): toggle printing.