CUDA Extension
Index
CUDAExt._kernel_fma
CUDAExt._peakflops_gpu_fmas
CUDAExt._peakflops_gpu_wmmas
CUDAExt.alloc_mem
CUDAExt.clear_all_gpus_memory
CUDAExt.clear_gpu_memory
CUDAExt.get_gpu_utilization
CUDAExt.get_gpu_utilizations
CUDAExt.get_power_usage
CUDAExt.get_power_usages
CUDAExt.get_temperature
CUDAExt.get_temperatures
CUDAExt.gpuid
CUDAExt.hastensorcores
CUDAExt.peakflops_gpu_matmul
CUDAExt.peakflops_gpu_matmul_graphs
CUDAExt.peakflops_gpu_matmul_scaling
CUDAExt.toggle_tensorcoremath
GPUInspector.gpuinfo
GPUInspector.memory_bandwidth_saxpy
GPUInspector.monitoring_stop
CUDAExt.StressTestBatched
CUDAExt.StressTestEnforced
CUDAExt.StressTestFixedIter
CUDAExt.StressTestStoreResults
References
CUDAExt.get_gpu_utilization — Function

get_gpu_utilization(device=CUDA.device())

Get the current utilization of the given CUDA device in percent.
CUDAExt.get_gpu_utilizations — Function

get_gpu_utilizations(devices=CUDA.devices())

Get the current utilization of the given CUDA devices in percent.
CUDAExt.get_power_usage — Method

get_power_usage(device=CUDA.device())

Get the current power usage of the given CUDA device in Watts.
CUDAExt.get_power_usages — Function

get_power_usages(devices=CUDA.devices())

Get the current power usage of the given CUDA devices in Watts.
CUDAExt.get_temperature — Function

get_temperature(device=CUDA.device())

Get the current temperature of the given CUDA device in degrees Celsius.
CUDAExt.get_temperatures — Function

get_temperatures(devices=CUDA.devices())

Get the current temperature of the given CUDA devices in degrees Celsius.
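A minimal usage sketch for these query functions (an illustration, not part of the docstrings: it assumes CUDA.jl and GPUInspector are installed, and reaches the extension module via Base.get_extension, which requires Julia >= 1.9):

using GPUInspector, CUDA  # loading both packages activates the CUDA extension
CUDAExt = Base.get_extension(GPUInspector, :CUDAExt)
dev = CUDA.device()                      # currently active device
util = CUDAExt.get_gpu_utilization(dev)  # percent
watts = CUDAExt.get_power_usage(dev)     # Watts
temp = CUDAExt.get_temperature(dev)      # degrees Celsius
println("utilization = $(util)%, power = $(watts) W, temperature = $(temp) °C")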
CUDAExt.gpuid — Function

Get the GPU index of the given device.

Note: GPU indices start at zero.
CUDAExt._kernel_fma — Method

Dummy kernel doing _kernel_fma_nfmas() many FMAs (default: 100_000).
CUDAExt._peakflops_gpu_fmas — Method

_peakflops_gpu_fmas(; size::Integer=5_000_000, dtype=Float32, nbench=5, nkernel=5, device=CUDA.device(), verbose=true)

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_fma_nfmas() * size many FMAs on CUDA cores.
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float32): element type of the vectors.
- size (default: 5_000_000): length of the vectors.
- nkernel (default: 5): number of kernel calls that make up one benchmarking sample.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
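Since _peakflops_gpu_fmas is an internal helper it is not exported; a hedged sketch of calling it through the extension module (assumes Julia >= 1.9 for Base.get_extension):

using GPUInspector, CUDA
CUDAExt = Base.get_extension(GPUInspector, :CUDAExt)
# quick, low-accuracy run: smaller and fewer samples than the defaults
CUDAExt._peakflops_gpu_fmas(; size=1_000_000, dtype=Float32, nbench=3, nkernel=3)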
CUDAExt._peakflops_gpu_wmmas — Method

_peakflops_gpu_wmmas()

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_wmma_nwmmas() many WMMAs on Tensor Cores.
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float16): element type of the matrices. We currently only support Float16 (Int8, :TensorFloat32, :BFloat16, and Float64 might or might not work).
- nkernel (default: 10): number of kernel calls that make up one benchmarking sample.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
- threads (default: max. threads per block): how many threads to use per block (part of the kernel launch configuration).
- blocks (default: 2048): how many blocks to use (part of the kernel launch configuration).
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
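The Tensor Core variant can be called analogously (a sketch; it assumes a device that actually has Tensor Cores, cf. hastensorcores below):

using GPUInspector, CUDA
CUDAExt = Base.get_extension(GPUInspector, :CUDAExt)
if CUDAExt.hastensorcores(CUDA.device())
    CUDAExt._peakflops_gpu_wmmas(; dtype=Float16, nbench=3)  # WMMAs on Tensor Cores
end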
CUDAExt.peakflops_gpu_matmul — Method

peakflops_gpu_matmul(; device, dtype=Float32, size=2^14, nmatmuls=5, nbench=5, verbose=true)

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform nmatmuls many (in-place) matrix-matrix multiplications.
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float32): element type of the matrices.
- size (default: 2^14): matrices will have dimensions (size, size).
- nmatmuls (default: 5): number of matmuls that will make up the kernel to be timed.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
See also: peakflops_gpu_matmul_scaling, peakflops_gpu_matmul_graphs.
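A hedged usage sketch (keyword names as documented above; a smaller size keeps memory use and runtime moderate):

using GPUInspector, CUDA
CUDAExt = Base.get_extension(GPUInspector, :CUDAExt)
# times nmatmuls in-place multiplications of (size, size) Float32 matrices
CUDAExt.peakflops_gpu_matmul(; device=CUDA.device(), dtype=Float32, size=2^12, nmatmuls=5, nbench=5)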
CUDAExt.peakflops_gpu_matmul_graphs — Method

Same as peakflops_gpu_matmul but uses CUDA's graph API to define and launch the kernel.
See also: peakflops_gpu_matmul_scaling.
CUDAExt.peakflops_gpu_matmul_scaling — Method

peakflops_gpu_matmul_scaling(peakflops_func = peakflops_gpu_matmul; verbose=true) -> sizes, flops

Assesses the scaling of the given peakflops_func (defaults to peakflops_gpu_matmul) with increasing matrix size. If verbose=true (default), displays a unicode plot. Returns the considered sizes and the corresponding TFLOP/s values. For further options, see peakflops_gpu_matmul.
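For instance (a sketch), one can suppress the unicode plot and inspect the raw scaling data directly:

using GPUInspector, CUDA
CUDAExt = Base.get_extension(GPUInspector, :CUDAExt)
sizes, flops = CUDAExt.peakflops_gpu_matmul_scaling(CUDAExt.peakflops_gpu_matmul; verbose=false)
for (s, f) in zip(sizes, flops)
    println("matrix size $s x $s: $f TFLOP/s")
end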
CUDAExt.StressTestBatched — Type

GPU stress test (matrix multiplications) in which we try to run for a given time period. We try to keep the CUDA stream continuously busy with matmuls at any point in time. Concretely, we submit batches of matmuls and record a CUDA event after half of them. On the host, after submitting a batch, we (non-blockingly) synchronize on, i.e. wait for, the CUDA event and, if we haven't exceeded the desired duration already, submit another batch. (A minimal sketch of this pattern is shown after the stress-test entries below.)
CUDAExt.StressTestEnforced — Type

GPU stress test (matrix multiplications) in which we run almost precisely for a given time period (the duration is enforced).
CUDAExt.StressTestFixedIter — Type

GPU stress test (matrix multiplications) in which we run for a given number of iterations, or try to run for a given time period (with potentially high uncertainty!). In the latter case, we estimate how long a synced matmul takes and set niter accordingly.
CUDAExt.StressTestStoreResults — Type

GPU stress test (matrix multiplications) in which we store all matmul results and try to run as many iterations as possible within a given memory limit (default: 90% of the free memory).
This stress test is somewhat inspired by gpu-burn by Ville Timonen.
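To make the batched-submission pattern behind StressTestBatched concrete, here is a minimal sketch using plain CUDA.jl primitives (this illustrates the idea only; it is not the package's implementation):

using CUDA, LinearAlgebra

function batched_matmul_stress(; duration=10.0, N=2048, batchsize=10)
    A = CUDA.rand(Float32, N, N); B = CUDA.rand(Float32, N, N); C = similar(A)
    ev = CuEvent()
    t0 = time()
    while time() - t0 < duration
        for i in 1:batchsize
            mul!(C, A, B)                          # asynchronous matmul on the stream
            i == batchsize ÷ 2 && CUDA.record(ev)  # record an event after half the batch
        end
        CUDA.synchronize(ev)  # wait for the stream to reach the event, then resubmit
    end
    CUDA.synchronize()        # drain the remaining work
end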
CUDAExt.alloc_mem — Method

alloc_mem(memsize::UnitPrefixedBytes; devs=(CUDA.device(),), dtype=Float32)

Allocates memory on the devices whose IDs are provided via devs. Returns a vector of memory handles (i.e. CuArrays).
Examples:
alloc_mem(MiB(1024)) # allocate on the currently active device
alloc_mem(B(40_000_000); devs=(0, 1)) # allocate on GPU0 and GPU1

CUDAExt.clear_all_gpus_memory — Function

Reclaim the unused memory of all available GPUs.
CUDAExt.clear_gpu_memory — Function

Reclaim the unused memory of the currently active GPU (i.e. device()).
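Taken together, allocating and reclaiming might look like this (a sketch; MiB is one of GPUInspector's UnitPrefixedBytes helpers used above):

using GPUInspector, CUDA
CUDAExt = Base.get_extension(GPUInspector, :CUDAExt)
handles = CUDAExt.alloc_mem(MiB(512))  # hold 512 MiB on the active device
handles = nothing                      # drop the references ...
CUDAExt.clear_gpu_memory()             # ... and reclaim the unused memory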
CUDAExt.hastensorcores — Function

Checks whether the given CuDevice has Tensor Cores.
CUDAExt.toggle_tensorcoremath — Function

toggle_tensorcoremath([enable::Bool]; verbose=true)

Switches CUDA.math_mode between CUDA.FAST_MATH (enable=true) and CUDA.DEFAULT_MATH (enable=false). For matmuls of CuArray{Float32}s, this should have the effect of enabling or disabling the use of Tensor Cores, respectively. Of course, this only works on supported devices and CUDA versions.

If no argument is provided, this function toggles between the two math modes.
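A sketch of toggling and inspecting the math mode (assumes CUDA.math_mode() as the getter):

using GPUInspector, CUDA
CUDAExt = Base.get_extension(GPUInspector, :CUDAExt)
CUDAExt.toggle_tensorcoremath(true)   # switch to CUDA.FAST_MATH
@show CUDA.math_mode()
CUDAExt.toggle_tensorcoremath(false)  # back to CUDA.DEFAULT_MATH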
GPUInspector.memory_bandwidth_saxpy — Method

Extra keyword arguments:

- cublas (default: true): toggle between CUDA.axpy! and a custom _saxpy_gpu_kernel!.
(This method is from the CUDA backend.)
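For instance (a sketch), one can compare the CUBLAS path against the custom kernel:

using GPUInspector, CUDA
GPUInspector.memory_bandwidth_saxpy(; cublas=true)   # via CUDA.axpy!
GPUInspector.memory_bandwidth_saxpy(; cublas=false)  # via the custom _saxpy_gpu_kernel!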
GPUInspector.gpuinfo — Method

gpuinfo(deviceid::Integer)

Print out detailed information about the NVIDIA GPU with the given deviceid.
Heavily inspired by the CUDA sample "deviceQueryDrv.cpp".
(This method is from the CUDA backend.)
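Usage is straightforward (GPU indices start at zero, as noted above):

using GPUInspector, CUDA
gpuinfo(0)  # detailed report for the first GPU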
GPUInspector.monitoring_stop — Method

monitoring_stop(; verbose=true) -> results

Stops the GPU monitoring and returns the measurement results. Specifically, results is a named tuple with the following keys:

- time: the (relative) times at which we measured
- temperature, power, compute, mem
(This method is from the CUDA backend.)
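A hedged sketch of a full monitoring session (it assumes a companion monitoring_start function, which is not documented in this section):

using GPUInspector, CUDA
monitoring_start()         # assumption: begins background monitoring
# ... run some GPU workload here ...
results = monitoring_stop()
@show results.time         # measurement timestamps
@show results.temperature  # one series per monitored quantity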