Peakflops

Index

GPUInspector.kernel_fma
GPUInspector.peakflops_gpu
GPUInspector.peakflops_gpu_fmas
GPUInspector.peakflops_gpu_matmul
GPUInspector.peakflops_gpu_matmul_graphs
GPUInspector.peakflops_gpu_matmul_scaling
GPUInspector.peakflops_gpu_wmmas
GPUInspector.theoretical_peakflops_gpu

References

GPUInspector.peakflops_gpu — Method

peakflops_gpu(; tensorcores=hastensorcores(), kwargs...)

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform one of the following:

_kernel_fma_nfmas() * size many FMAs on CUDA cores (if tensorcores == false)
_kernel_wmma_nwmmas() many WMMAs on Tensor Cores (if tensorcores == true)

For more keyword argument options see peakflops_gpu_fmas and peakflops_gpu_wmmas.
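A minimal usage sketch, built only from the signature documented above. It assumes a working CUDA device and an installed GPUInspector; depending on the package version, loading CUDA.jl alongside it may also be required. The return value is not documented here, so the sketch relies on the printed output only.

```julia
using GPUInspector

# Benchmark the CUDA cores (FMA-based measurement).
peakflops_gpu(; tensorcores=false)

# Benchmark the Tensor Cores (WMMA-based measurement), but only if the
# current device reports having them.
if GPUInspector.hastensorcores()
    peakflops_gpu(; tensorcores=true)
end
```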

GPUInspector.theoretical_peakflops_gpu — Method

Estimates the theoretical peak performance of a CUDA device in TFLOP/s.

Keyword arguments:

tensorcores (default: hastensorcores()): toggle usage of Tensor Cores. If false, CUDA cores will be used.
verbose (default: true): toggle printing of information
device (default: device()): CUDA device to be analyzed
dtype (default: tensorcores ? Float16 : Float32): element type of the matrices
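The docstring does not say how the theoretical value is derived. As a rough sketch of the usual back-of-the-envelope estimate for CUDA cores, assuming each core retires one FMA (counted as 2 FLOPs) per clock cycle, the peak follows from core count and clock rate. The helper name and the device numbers below are hypothetical and are not part of GPUInspector.

```julia
# Hypothetical helper, NOT part of GPUInspector.
# Assumption: every core completes `flops_per_cycle` floating-point
# operations per clock cycle (2 for one fused multiply-add).
function theoretical_tflops(ncores::Integer, clock_ghz::Real; flops_per_cycle::Integer=2)
    flops_per_second = ncores * clock_ghz * 1e9 * flops_per_cycle
    return flops_per_second / 1e12  # convert FLOP/s to TFLOP/s
end

# Illustrative numbers only: 6912 cores at a 1.41 GHz clock.
println(theoretical_tflops(6912, 1.41), " TFLOP/s")
```

Measured values from the benchmarks on this page would normally be expected to fall somewhat below such an estimate.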

GPUInspector.peakflops_gpu_matmul — Method

peakflops_gpu_matmul(; device, dtype=Float32, size=2^14, nmatmuls=5, nbench=5, verbose=true)

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform nmatmuls many (in-place) matrix-matrix multiplications.

Keyword arguments:

device (default: CUDA.device()): CUDA device to be used.
dtype (default: Float32): element type of the matrices.
size (default: 2^14): matrices will have dimensions (size, size).
nmatmuls (default: 5): number of matmuls that will make up the kernel to be timed.
nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
verbose (default: true): toggle printing.
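To make the relation between the measured time and the reported number concrete, here is a small sketch of the conversion. It assumes the conventional count of 2·size³ FLOPs per dense square matrix product; the exact accounting inside the package is not stated here, and the helper name is hypothetical.

```julia
# Hypothetical helper, NOT part of GPUInspector.
# Assumption: one (n x n) * (n x n) product costs 2n^3 FLOPs
# (n^3 multiplications plus roughly n^3 additions).
function matmul_tflops(n::Integer, nmatmuls::Integer, seconds::Real)
    flops = 2 * float(n)^3 * nmatmuls  # float() avoids Int overflow for large n
    return flops / seconds / 1e12      # FLOP/s converted to TFLOP/s
end

# With the documented defaults (size = 2^14, nmatmuls = 5), a best
# sample time of 1 second would correspond to roughly 44 TFLOP/s.
println(matmul_tflops(2^14, 5, 1.0))
```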

GPUInspector.peakflops_gpu_matmul_graphs — Method

Same as peakflops_gpu_matmul but uses CUDA's graph API to define and launch the kernel.

GPUInspector.peakflops_gpu_matmul_scaling — Method

peakflops_gpu_matmul_scaling(peakflops_func = peakflops_gpu_matmul; verbose=true) -> sizes, flops

Assesses how the given peakflops_func (defaults to peakflops_gpu_matmul) scales with increasing matrix size. If verbose=true (default), displays a unicode plot. Returns the considered sizes and the corresponding TFLOP/s values. For further options, see peakflops_gpu_matmul.

GPUInspector.kernel_fma — Method

Dummy kernel doing _kernel_fma_nfmas() many FMAs (default: 100_000).

GPUInspector.peakflops_gpu_fmas — Method

peakflops_gpu_fmas(; size::Integer=5_000_000, dtype=Float32, nbench=5, nkernel=5, device=CUDA.device(), verbose=true)

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_fma_nfmas() * size many FMAs on CUDA cores.

Keyword arguments:

device (default: CUDA.device()): CUDA device to be used.
dtype (default: Float32): element type of the vectors.
size (default: 5_000_000): length of vectors.
nkernel (default: 5): number of kernel calls that make up one benchmarking sample.
nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
verbose (default: true): toggle printing.
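The following sketch illustrates how an FMA count translates into TFLOP/s. It assumes that one FMA counts as 2 FLOPs (one multiply, one add) and that the nkernel kernel calls of a sample are timed together; both are assumptions about the accounting, and the helper name is hypothetical.

```julia
# Hypothetical helper, NOT part of GPUInspector.
# Assumptions: 1 FMA = 2 FLOPs, and a sample consists of `nkernel`
# kernel calls that each perform `nfmas * size` FMAs.
function fma_tflops(nfmas::Integer, size::Integer, nkernel::Integer, seconds::Real)
    flops = 2 * float(nfmas) * size * nkernel
    return flops / seconds / 1e12
end

# Documented defaults: 100_000 FMAs per element, size = 5_000_000, nkernel = 5.
# A sample time of 1 second would then correspond to 5 TFLOP/s.
println(fma_tflops(100_000, 5_000_000, 5, 1.0))
```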

GPUInspector.peakflops_gpu_wmmas — Method

peakflops_gpu_wmmas()

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_wmma_nwmmas() many WMMAs on Tensor Cores.

Keyword arguments:

device (default: CUDA.device()): CUDA device to be used.
dtype (default: Float16): element type of the matrices. We currently only support Float16 (Int8, :TensorFloat32, :BFloat16, and Float64 might or might not work).
nkernel (default: 10): number of kernel calls that make up one benchmarking sample.
nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
threads (default: max. threads per block): how many threads to use per block (part of the kernel launch configuration).
blocks (default: 2048): how many blocks to use (part of the kernel launch configuration).
verbose (default: true): toggle printing.
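For orientation, a single WMMA operation computes D = A * B + C on small matrix fragments. The sketch below counts the FLOPs of one such operation, assuming the 16×16×16 fragment shape commonly used for Float16; whether the benchmark kernel uses exactly this shape is an assumption, and the helper name is hypothetical.

```julia
# Hypothetical helper, NOT part of GPUInspector.
# One WMMA on an (m x k) * (k x n) fragment pair performs m*n*k FMAs,
# i.e. 2*m*n*k FLOPs when an FMA is counted as two operations.
wmma_flops(m::Integer, n::Integer, k::Integer) = 2 * m * n * k

# Assumed fragment shape 16 x 16 x 16: 8192 FLOPs per WMMA.
println(wmma_flops(16, 16, 16))
```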