CUDA Extension
Index
CUDAExt._kernel_fma
CUDAExt._peakflops_gpu_fmas
CUDAExt._peakflops_gpu_wmmas
CUDAExt.alloc_mem
CUDAExt.clear_all_gpus_memory
CUDAExt.clear_gpu_memory
CUDAExt.get_gpu_utilization
CUDAExt.get_gpu_utilizations
CUDAExt.get_power_usage
CUDAExt.get_power_usages
CUDAExt.get_temperature
CUDAExt.get_temperatures
CUDAExt.gpuid
CUDAExt.hastensorcores
CUDAExt.peakflops_gpu_matmul
CUDAExt.peakflops_gpu_matmul_graphs
CUDAExt.peakflops_gpu_matmul_scaling
CUDAExt.toggle_tensorcoremath
GPUInspector.gpuinfo
GPUInspector.memory_bandwidth_saxpy
GPUInspector.monitoring_stop
CUDAExt.StressTestBatched
CUDAExt.StressTestEnforced
CUDAExt.StressTestFixedIter
CUDAExt.StressTestStoreResults
References
CUDAExt.get_gpu_utilization
— Function
get_gpu_utilization(device=CUDA.device())
Get the current utilization of the given CUDA device in percent.
CUDAExt.get_gpu_utilizations
— Function
get_gpu_utilizations(devices=CUDA.devices())
Get the current utilization of the given CUDA devices in percent.
CUDAExt.get_power_usage
— Method
get_power_usage(device=CUDA.device())
Get current power usage of the given CUDA device in Watts.
CUDAExt.get_power_usages
— Function
get_power_usages(devices=CUDA.devices())
Get current power usage of the given CUDA devices in Watts.
CUDAExt.get_temperature
— Function
get_temperature(device=CUDA.device())
Get current temperature of the given CUDA device in degrees Celsius.
CUDAExt.get_temperatures
— Function
get_temperatures(devices=CUDA.devices())
Get current temperature of the given CUDA devices in degrees Celsius.
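The telemetry query functions above can be combined into a quick health check. A minimal usage sketch, assuming a CUDA-capable system with NVML available (the printed values are illustrative, not fixed outputs):

```julia
using GPUInspector, CUDA

# Query the currently active device:
util = get_gpu_utilization()   # e.g. 87 (percent)
pow  = get_power_usage()       # e.g. 143.5 (Watts)

# Query all visible devices at once:
temps = get_temperatures()     # one value per device, in degrees Celsius
println("utilization: $util %, power: $pow W, temperatures: $temps °C")
```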
CUDAExt.gpuid
— Function
Get the GPU index of the given device.
Note: GPU indices start at zero.
CUDAExt._kernel_fma
— Method
Dummy kernel performing _kernel_fma_nfmas() many FMAs (default: 100_000).
CUDAExt._peakflops_gpu_fmas
— Method
_peakflops_gpu_fmas(; size::Integer=5_000_000, dtype=Float32, nbench=5, nkernel=5, device=CUDA.device(), verbose=true)
Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_fma_nfmas() * size many FMAs on CUDA cores.
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float32): element type of the matrices.
- size (default: 5_000_000): length of vectors.
- nkernel (default: 5): number of kernel calls that make up one benchmarking sample.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
CUDAExt._peakflops_gpu_wmmas
— Method
_peakflops_gpu_wmmas()
Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_wmma_nwmmas() many WMMAs on Tensor Cores.
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float16): element type of the matrices. We currently only support Float16 (Int8, :TensorFloat32, :BFloat16, and Float64 might or might not work).
- nkernel (default: 10): number of kernel calls that make up one benchmarking sample.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
- threads (default: max. threads per block): how many threads to use per block (part of the kernel launch configuration).
- blocks (default: 2048): how many blocks to use (part of the kernel launch configuration).
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
CUDAExt.peakflops_gpu_matmul
— Method
peakflops_gpu_matmul(; device, dtype=Float32, size=2^14, nmatmuls=5, nbench=5, verbose=true)
Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform nmatmuls many (in-place) matrix-matrix multiplications.
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float32): element type of the matrices.
- size (default: 2^14): matrices will have dimensions (size, size).
- nmatmuls (default: 5): number of matmuls that will make up the kernel to be timed.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
See also: peakflops_gpu_matmul_scaling, peakflops_gpu_matmul_graphs.
CUDAExt.peakflops_gpu_matmul_graphs
— Method
Same as peakflops_gpu_matmul but uses CUDA's graph API to define and launch the kernel.
See also: peakflops_gpu_matmul_scaling.
CUDAExt.peakflops_gpu_matmul_scaling
— Method
peakflops_gpu_matmul_scaling(peakflops_func=peakflops_gpu_matmul; verbose=true) -> sizes, flops
Asserts the scaling of the given peakflops_func (defaults to peakflops_gpu_matmul) with increasing matrix size. If verbose=true (default), displays a unicode plot. Returns the considered sizes and the corresponding TFLOP/s. For further options, see peakflops_gpu_matmul.
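A minimal usage sketch for the peakflops family, assuming a CUDA-capable system (results depend on the GPU and are not fixed):

```julia
using GPUInspector, CUDA

# Estimate peak TFLOP/s from repeated in-place matmuls on the current device:
peakflops_gpu_matmul(; dtype=Float32, size=2^14)

# Assess how the estimate scales with matrix size (prints a unicode plot
# when verbose=true) and capture the raw data:
sizes, flops = peakflops_gpu_matmul_scaling(peakflops_gpu_matmul; verbose=true)
```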
CUDAExt.StressTestBatched
— Type
GPU stress test (matrix multiplications) in which we try to run for a given time period. We try to keep the CUDA stream continuously busy with matmuls at any point in time. Concretely, we submit batches of matmuls and, after half of them, we record a CUDA event. On the host, after submitting a batch, we (non-blockingly) synchronize on, i.e. wait for, the CUDA event and, if we haven't exceeded the desired duration already, submit another batch.
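The batch-submission pattern described above can be sketched with CUDA.jl events. This is an illustrative reconstruction, not the package's actual implementation; the function name and parameters are made up for the example:

```julia
using CUDA, LinearAlgebra

# Sketch: keep the stream busy with batches of matmuls, recording an event
# halfway through each batch and waiting on it before submitting the next.
function stress_batched_sketch(duration_s; batchsize = 100, N = 2048)
    A, B = CUDA.rand(Float32, N, N), CUDA.rand(Float32, N, N)
    C = similar(A)
    ev = CuEvent(CUDA.EVENT_DISABLE_TIMING)
    t0 = time()
    while time() - t0 < duration_s
        for i in 1:batchsize
            mul!(C, A, B)                          # asynchronous submit to the stream
            i == batchsize ÷ 2 && CUDA.record(ev)  # mark the batch's midpoint
        end
        CUDA.synchronize(ev)  # wait until at least half the batch has executed
    end
    CUDA.synchronize()  # drain the remaining submitted work
end
```

Waiting on the midpoint event (rather than the whole batch) keeps a backlog of queued matmuls on the stream, so the GPU never idles between batches.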
CUDAExt.StressTestEnforced
— Type
GPU stress test (matrix multiplications) in which we run almost precisely for a given time period (the duration is enforced).
CUDAExt.StressTestFixedIter
— Type
GPU stress test (matrix multiplications) in which we run for a given number of iterations, or try to run for a given time period (with potentially high uncertainty!). In the latter case, we estimate how long a synced matmul takes and set niter accordingly.
CUDAExt.StressTestStoreResults
— Type
GPU stress test (matrix multiplications) in which we store all matmul results and try to run as many iterations as possible for a certain memory limit (default: 90% of the free memory).
This stress test is somewhat inspired by gpu-burn by Ville Timonen.
CUDAExt.alloc_mem
— Method
alloc_mem(memsize::UnitPrefixedBytes; devs=(CUDA.device(),), dtype=Float32)
Allocates memory on the devices whose IDs are provided via devs. Returns a vector of memory handles (i.e. CuArrays).
Examples:
alloc_mem(MiB(1024))                 # allocate on the currently active device
alloc_mem(B(40_000_000); devs=(0,1)) # allocate on GPU 0 and GPU 1
CUDAExt.clear_all_gpus_memory
— Function
Reclaim the unused memory of all available GPUs.
CUDAExt.clear_gpu_memory
— Function
Reclaim the unused memory of the currently active GPU (i.e. device()).
CUDAExt.hastensorcores
— Function
Checks whether the given CuDevice has Tensor Cores.
CUDAExt.toggle_tensorcoremath
— Function
toggle_tensorcoremath([enable::Bool]; verbose=true)
Switches the CUDA.math_mode between CUDA.FAST_MATH (enable=true) and CUDA.DEFAULT_MATH (enable=false). For matmuls of CuArray{Float32}s, this should have the effect of enabling or disabling the use of tensor cores. Of course, this only works on supported devices and CUDA versions.
If no argument is provided, this function toggles between the two math modes.
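A short usage sketch, assuming a device and CUDA version that support TF32 tensor-core math:

```julia
using GPUInspector, CUDA

toggle_tensorcoremath(true)   # CUDA.FAST_MATH: Float32 matmuls may use tensor cores
toggle_tensorcoremath(false)  # CUDA.DEFAULT_MATH: back to regular CUDA cores
toggle_tensorcoremath()       # no argument: flip between the two math modes
```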
GPUInspector.memory_bandwidth_saxpy
— Method
Extra keyword arguments:
- cublas (default: true): toggle between CUDA.axpy! and a custom _saxpy_gpu_kernel!.
(This method is from the CUDA backend.)
GPUInspector.gpuinfo
— Method
gpuinfo(deviceid::Integer)
Print out detailed information about the NVIDIA GPU with the given deviceid
.
Heavily inspired by the CUDA sample "deviceQueryDrv.cpp".
(This method is from the CUDA backend.)
GPUInspector.monitoring_stop
— Method
monitoring_stop(; verbose=true) -> results
Stops the GPU monitoring and returns the measured results. Specifically, results is a named tuple with the following keys:
- time: the (relative) times at which we measured
- temperature
- power
- compute
- mem
(This method is from the CUDA backend.)
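A hedged usage sketch, assuming GPUInspector provides a matching monitoring_start (the exact start call and its signature may differ between versions):

```julia
using GPUInspector, CUDA

monitoring_start()            # assumed counterpart that begins sampling
# ... run the GPU workload to be monitored ...
results = monitoring_stop()

results.time                  # relative measurement times
results.temperature           # per-sample temperatures
results.power                 # per-sample power draw
```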