Peakflops

References

GPUInspector.peakflops_gpu - Method
peakflops_gpu(; tensorcores=hastensorcores(), kwargs...)

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform either:

  • _kernel_fma_nfmas() * size many FMAs on CUDA cores (if tensorcores == false)
  • _kernel_wmma_nwmmas() many WMMAs on Tensor Cores (if tensorcores == true)

For more keyword argument options see peakflops_gpu_fmas and peakflops_gpu_wmmas.
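
To make the two code paths concrete, the following sketch shows how a measured runtime could be converted to TFLOP/s in each case. It is not GPUInspector's implementation: the helper names `fma_tflops` and `wmma_tflops`, the example counts, and the 16x16x16 WMMA tile shape are assumptions made for illustration. The only facts relied on are that one FMA counts as two floating-point operations and that an (m, n, k) matrix multiply-accumulate performs m * n * k FMAs.

```julia
# One fused multiply-add (a * b + c) counts as two floating-point operations.
const FLOPS_PER_FMA = 2

# TFLOP/s for `nfmas * size` FMAs completed in `seconds`.
# `nfmas` stands in for the value of `_kernel_fma_nfmas()` (hypothetical here).
fma_tflops(nfmas, size, seconds) = FLOPS_PER_FMA * nfmas * size / seconds * 1e-12

# TFLOP/s for `nwmmas` matrix multiply-accumulates of shape (m, n, k) in `seconds`.
# The default 16x16x16 tile is an assumption, not taken from the package.
wmma_tflops(nwmmas, seconds; m=16, n=16, k=16) =
    FLOPS_PER_FMA * m * n * k * nwmmas / seconds * 1e-12

# Hypothetical numbers: 100_000 FMAs per element, 5_000_000 elements, 0.1 s.
# 2 * 100_000 * 5_000_000 = 1e12 FLOPs in 0.1 s, i.e. about 10 TFLOP/s.
println(fma_tflops(100_000, 5_000_000, 0.1), " TFLOP/s")
```

The actual functions take the best of several timed samples before applying a conversion of this kind, as described under `nbench` below.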

GPUInspector.theoretical_peakflops_gpu - Method

Estimates the theoretical peak performance of a CUDA device in TFLOP/s.

Keyword arguments:

  • tensorcores (default: hastensorcores()): toggle usage of Tensor Cores. If false, CUDA cores will be used.
  • verbose (default: true): toggle printing of information
  • device (default: device()): CUDA device to be analyzed
  • dtype (default: tensorcores ? Float16 : Float32): element type of the matrices

GPUInspector.peakflops_gpu_matmul - Method
peakflops_gpu_matmul(; device, dtype=Float32, size=2^14, nmatmuls=5, nbench=5, verbose=true)

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform nmatmuls many (in-place) matrix-matrix multiplications.

Keyword arguments:

  • device (default: CUDA.device()): CUDA device to be used.
  • dtype (default: Float32): element type of the matrices.
  • size (default: 2^14): matrices will have dimensions (size, size).
  • nmatmuls (default: 5): number of matmuls that will make up the kernel to be timed.
  • nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
  • verbose (default: true): toggle printing.

See also: peakflops_gpu_matmul_scaling, peakflops_gpu_matmul_graphs.
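
The measurement scheme described above (time `nmatmuls` in-place multiplications, repeat `nbench` times, keep the best sample) can be sketched on the CPU as follows. This is a stand-in for illustration, not the package's GPU code: the function names are hypothetical, `mul!` from `LinearAlgebra` replaces the GPU matmul, and counting 2n^3 FLOPs per (n, n) matmul is the common convention, assumed here rather than confirmed from the source.

```julia
using LinearAlgebra  # provides mul! for in-place matrix multiplication

# FLOP count of `nmatmuls` dense (n, n) matmuls, using the 2n^3 convention.
matmul_flops(n, nmatmuls) = 2 * n^3 * nmatmuls

# Convert a FLOP count and a runtime in seconds to TFLOP/s.
tflops(flops, seconds) = flops / seconds * 1e-12

# CPU stand-in for the GPU benchmark: time `nmatmuls` in-place matmuls,
# repeat `nbench` times, and keep the fastest sample.
function cpu_peakflops_matmul(; dtype=Float32, size=256, nmatmuls=5, nbench=5)
    A = rand(dtype, size, size)
    B = rand(dtype, size, size)
    C = zeros(dtype, size, size)
    mul!(C, A, B)  # warm-up run so that compilation is not timed
    t_best = Inf
    for _ in 1:nbench
        t = @elapsed for _ in 1:nmatmuls
            mul!(C, A, B)
        end
        t_best = min(t_best, t)
    end
    return tflops(matmul_flops(size, nmatmuls), t_best)
end

println(cpu_peakflops_matmul(), " TFLOP/s (CPU stand-in)")
```

Taking the minimum time rather than the mean reduces the influence of one-off delays, which is why the best measurement is the one reported.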

GPUInspector.peakflops_gpu_matmul_scaling - Method
peakflops_gpu_matmul_scaling(peakflops_func = peakflops_gpu_matmul; verbose=true) -> sizes, flops

Assesses how the given peakflops_func (defaults to peakflops_gpu_matmul) scales with increasing matrix size. If verbose=true (default), displays a Unicode plot. Returns the considered sizes and the corresponding TFLOP/s values. For further options, see peakflops_gpu_matmul.
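
A size sweep of this kind might look like the sketch below. It runs on the CPU so that it works without a GPU, and everything in it is an assumption for illustration: the names `cpu_matmul_tflops` and `matmul_scaling`, the `sizes` keyword, and the chosen sizes are hypothetical and do not reflect which sizes the package actually benchmarks.

```julia
using LinearAlgebra  # provides mul!

# CPU stand-in for a peakflops function: best-of-`nbench` TFLOP/s for one matmul.
function cpu_matmul_tflops(; size, dtype=Float32, nbench=3)
    A = rand(dtype, size, size)
    B = rand(dtype, size, size)
    C = zeros(dtype, size, size)
    mul!(C, A, B)  # warm-up run so that compilation is not timed
    t = minimum(@elapsed(mul!(C, A, B)) for _ in 1:nbench)
    return 2 * size^3 / t * 1e-12  # 2n^3 FLOPs per matmul (common convention)
end

# Evaluate `peakflops_func` for each matrix size and return both vectors.
function matmul_scaling(peakflops_func=cpu_matmul_tflops; sizes=2 .^ (5:8))
    flops = [peakflops_func(; size=s) for s in sizes]
    return collect(sizes), flops
end

sizes, flops = matmul_scaling()
for (s, f) in zip(sizes, flops)
    println("size = ", s, "  ->  ", round(f; digits=4), " TFLOP/s")
end
```

Small matrices typically cannot saturate the hardware, so the measured rate usually rises with size before levelling off; a sweep like this shows where that plateau begins.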

GPUInspector.peakflops_gpu_fmas - Method
peakflops_gpu_fmas(; size::Integer=5_000_000, dtype=Float32, nbench=5, nkernel=5, device=CUDA.device(), verbose=true)

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_fma_nfmas() * size many FMAs on CUDA cores.

Keyword arguments:

  • device (default: CUDA.device()): CUDA device to be used.
  • dtype (default: Float32): element type of the vectors.
  • size (default: 5_000_000): length of vectors.
  • nkernel (default: 5): number of kernel calls that make up one benchmarking sample.
  • nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
  • verbose (default: true): toggle printing.

GPUInspector.peakflops_gpu_wmmas - Method
peakflops_gpu_wmmas()

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_wmma_nwmmas() many WMMAs on Tensor Cores.

Keyword arguments:

  • device (default: CUDA.device()): CUDA device to be used.
  • dtype (default: Float16): element type of the matrices. We currently only support Float16 (Int8, :TensorFloat32, :BFloat16, and Float64 might or might not work).
  • nkernel (default: 10): number of kernel calls that make up one benchmarking sample.
  • nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
  • threads (default: max. threads per block): how many threads to use per block (part of the kernel launch configuration).
  • blocks (default: 2048): how many blocks to use (part of the kernel launch configuration).
  • verbose (default: true): toggle printing.