CUDA Extension

References

CUDAExt.get_temperature (Function)
get_temperature(device=CUDA.device())

Get current temperature of the given CUDA device in degrees Celsius.
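A minimal usage sketch (assuming CUDA.jl is loaded so the extension is active, and that get_temperature is reachable unqualified; qualify it with the module otherwise):

```julia
using GPUInspector, CUDA  # loading CUDA activates the extension

# Temperature of the currently active device, in °C.
t = get_temperature()
println("GPU temperature: $t °C")

# Temperature of an explicitly chosen device.
t0 = get_temperature(CUDA.device())
```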

source
CUDAExt.get_temperatures (Function)
get_temperatures(devices=CUDA.devices())

Get current temperatures of the given CUDA devices in degrees Celsius.

source
CUDAExt.gpuid (Function)

Get the GPU index of the given device.

Note: GPU indices start at zero.

source
CUDAExt._peakflops_gpu_fmas (Method)
_peakflops_gpu_fmas(; size::Integer=5_000_000, dtype=Float32, nbench=5, nkernel=5, device=CUDA.device(), verbose=true)

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_fma_nfmas() * size many FMAs on CUDA cores.

Keyword arguments:

  • device (default: CUDA.device()): CUDA device to be used.
  • dtype (default: Float32): element type of the matrices.
  • size (default: 5_000_000): length of vectors.
  • nkernel (default: 5): number of kernel calls that make up one benchmarking sample.
  • nbench (default: 5): number of measurements to be performed the best of which is used for the TFLOP/s computation.
  • verbose (default: true): toggle printing.
  • io (default: stdout): set the stream where the results should be printed.
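Since _peakflops_gpu_fmas is internal to the extension (note the leading underscore), one way to reach it is through Base.get_extension. This is a hedged sketch: the extension module name :CUDAExt is assumed, and the function is not part of the public API, so this access path may change:

```julia
using GPUInspector, CUDA

# Look up the extension module and call the internal benchmark directly
# (assumes the extension is named CUDAExt and a functional CUDA device).
ext = Base.get_extension(GPUInspector, :CUDAExt)
ext._peakflops_gpu_fmas(; dtype=Float32, nbench=3, verbose=true)
```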
source
CUDAExt._peakflops_gpu_wmmas (Method)
_peakflops_gpu_wmmas()

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_wmma_nwmmas() many WMMAs on Tensor Cores.

Keyword arguments:

  • device (default: CUDA.device()): CUDA device to be used.
  • dtype (default: Float16): element type of the matrices. Currently, only Float16 is fully supported; Int8, :TensorFloat32, :BFloat16, and Float64 might or might not work.
  • nkernel (default: 10): number of kernel calls that make up one benchmarking sample.
  • nbench (default: 5): number of measurements to be performed the best of which is used for the TFLOP/s computation.
  • threads (default: max. threads per block): how many threads to use per block (part of the kernel launch configuration).
  • blocks (default: 2048): how many blocks to use (part of the kernel launch configuration).
  • verbose (default: true): toggle printing.
  • io (default: stdout): set the stream where the results should be printed.
source
CUDAExt.peakflops_gpu_matmul (Method)
peakflops_gpu_matmul(; device, dtype=Float32, size=2^14, nmatmuls=5, nbench=5, verbose=true)

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform nmatmuls many (in-place) matrix-matrix multiplications.

Keyword arguments:

  • device (default: CUDA.device()): CUDA device to be used.
  • dtype (default: Float32): element type of the matrices.
  • size (default: 2^14): matrices will have dimensions (size, size).
  • nmatmuls (default: 5): number of matmuls that will make up the kernel to be timed.
  • nbench (default: 5): number of measurements to be performed the best of which is used for the TFLOP/s computation.
  • verbose (default: true): toggle printing.
  • io (default: stdout): set the stream where the results should be printed.

See also: peakflops_gpu_matmul_scaling, peakflops_gpu_matmul_graphs.
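A usage sketch (assuming peakflops_gpu_matmul is reachable unqualified once CUDA.jl is loaded; requires a functional CUDA device):

```julia
using GPUInspector, CUDA

# Default settings: best of 5 samples, each timing 5 in-place matmuls
# of 2^14 × 2^14 Float32 matrices.
peakflops_gpu_matmul(; device=CUDA.device())

# Half-precision variant with smaller matrices (faster, less memory).
peakflops_gpu_matmul(; device=CUDA.device(), dtype=Float16, size=2^13)
```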

source
CUDAExt.StressTestBatched (Type)

GPU stress test (matrix multiplications) in which we try to run for a given time period while keeping the CUDA stream continuously busy with matmuls at any point in time. Concretely, we submit batches of matmuls and record a CUDA event halfway through each batch. On the host, after submitting a batch, we synchronize on, i.e. non-blockingly wait for, the CUDA event and, if we haven't exceeded the desired duration already, submit another batch.
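The batching scheme can be sketched roughly as follows. This is a simplified illustration with hypothetical names, not the actual StressTestBatched implementation, and it requires a functional CUDA device:

```julia
using CUDA, LinearAlgebra

# Sketch of the batched stress loop: keep the stream saturated with matmuls,
# using a CUDA event recorded mid-batch to pace batch submission from the host.
function batched_matmul_stress(duration_s; N=2048, batchsize=100)
    A = CUDA.rand(Float32, N, N)
    B = CUDA.rand(Float32, N, N)
    C = CUDA.zeros(Float32, N, N)
    ev = CuEvent()
    t0 = time()
    while time() - t0 < duration_s
        for i in 1:batchsize
            mul!(C, A, B)                          # queued asynchronously on the stream
            i == batchsize ÷ 2 && CUDA.record(ev)  # mark the middle of the batch
        end
        while !CUDA.isdone(ev)                     # non-blocking wait on the event
            yield()
        end
    end
    CUDA.synchronize()                             # drain the remaining queued matmuls
end
```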

source
CUDAExt.StressTestEnforced (Type)

GPU stress test (matrix multiplications) in which we run almost precisely for a given time period (duration is enforced).

source
CUDAExt.StressTestFixedIter (Type)

GPU stress test (matrix multiplications) in which we run for a given number of iterations, or try to run for a given time period (with potentially high uncertainty!). In the latter case, we estimate how long a synced matmul takes and set niter accordingly.

source
CUDAExt.StressTestStoreResults (Type)

GPU stress test (matrix multiplications) in which we store all matmul results and try to run as many iterations as possible for a certain memory limit (default: 90% of free memory).

This stress test is somewhat inspired by gpu-burn by Ville Timonen.

source
CUDAExt.alloc_mem (Method)
alloc_mem(memsize::UnitPrefixedBytes; devs=(CUDA.device(),), dtype=Float32)

Allocates memory on the devices whose IDs are provided via devs. Returns a vector of memory handles (i.e. CuArrays).

Examples:

alloc_mem(MiB(1024)) # allocate on the currently active device
alloc_mem(B(40_000_000); devs=(0,1)) # allocate on GPU0 and GPU1
source
CUDAExt.toggle_tensorcoremath (Function)
toggle_tensorcoremath([enable::Bool; verbose=true])

Switches CUDA.math_mode between CUDA.FAST_MATH (enable=true) and CUDA.DEFAULT_MATH (enable=false). For matmuls of CuArray{Float32}s, this should enable or disable the use of tensor cores, respectively. Of course, this only works on supported devices and CUDA versions.

If no argument is provided, this function toggles between the two math modes.
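A usage sketch (assuming toggle_tensorcoremath is reachable unqualified once CUDA.jl is loaded):

```julia
using GPUInspector, CUDA

toggle_tensorcoremath(true)   # CUDA.math_mode == CUDA.FAST_MATH: tensor cores allowed
toggle_tensorcoremath(false)  # CUDA.math_mode == CUDA.DEFAULT_MATH: tensor cores not used
toggle_tensorcoremath()       # no argument: flip whatever the current mode is
```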

source
GPUInspector.gpuinfo (Method)
gpuinfo(deviceid::Integer)

Print out detailed information about the NVIDIA GPU with the given deviceid.

Heavily inspired by the CUDA sample "deviceQueryDrv.cpp".

(This method is from the CUDA backend.)

source
GPUInspector.monitoring_stop (Method)
monitoring_stop(; verbose=true) -> results

Stops the monitoring and returns the collected results. Specifically, results is a named tuple with the following keys:

  • time: the (relative) times at which we measured
  • temperature, power, compute, mem

(This method is from the CUDA backend.)
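A typical workflow sketch. The companion monitoring_start and the stresstest workload are assumed here from GPUInspector's public API; their exact signatures may differ, so treat this as an illustration rather than a verbatim recipe:

```julia
using GPUInspector, CUDA

monitoring_start()                      # begin sampling in the background
stresstest(CUDA.device(); duration=10)  # any GPU workload to be monitored
results = monitoring_stop()

results.time         # relative times at which samples were taken
results.temperature  # temperature samples (one series per device)
```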

source