CUDA Extension
Index
CUDAExt._kernel_fma
CUDAExt._peakflops_gpu_fmas
CUDAExt._peakflops_gpu_wmmas
CUDAExt.alloc_mem
CUDAExt.clear_all_gpus_memory
CUDAExt.clear_gpu_memory
CUDAExt.get_gpu_utilization
CUDAExt.get_gpu_utilizations
CUDAExt.get_power_usage
CUDAExt.get_power_usages
CUDAExt.get_temperature
CUDAExt.get_temperatures
CUDAExt.gpuid
CUDAExt.hastensorcores
CUDAExt.peakflops_gpu_matmul
CUDAExt.peakflops_gpu_matmul_graphs
CUDAExt.peakflops_gpu_matmul_scaling
CUDAExt.toggle_tensorcoremath
GPUInspector.gpuinfo
GPUInspector.memory_bandwidth_saxpy
GPUInspector.monitoring_stop
CUDAExt.StressTestBatched
CUDAExt.StressTestEnforced
CUDAExt.StressTestFixedIter
CUDAExt.StressTestStoreResults
References
CUDAExt.get_gpu_utilization — Function

get_gpu_utilization(device=CUDA.device())

Get the current utilization of the given CUDA device in percent.
CUDAExt.get_gpu_utilizations — Function

get_gpu_utilizations(devices=CUDA.devices())

Get the current utilization of the given CUDA devices in percent.
CUDAExt.get_power_usage — Method

get_power_usage(device=CUDA.device())

Get the current power usage of the given CUDA device in Watts.
CUDAExt.get_power_usages — Function

get_power_usages(devices=CUDA.devices())

Get the current power usage of the given CUDA devices in Watts.
CUDAExt.get_temperature — Function

get_temperature(device=CUDA.device())

Get the current temperature of the given CUDA device in degrees Celsius.
CUDAExt.get_temperatures — Function

get_temperatures(devices=CUDA.devices())

Get the current temperature of the given CUDA devices in degrees Celsius.
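A minimal usage sketch for these query functions (an illustration, not part of the docstrings: it assumes CUDA.jl and GPUInspector are installed, and reaches the extension module via Base.get_extension, which requires Julia >= 1.9):

using GPUInspector, CUDA  # loading both packages activates the CUDA extension
CUDAExt = Base.get_extension(GPUInspector, :CUDAExt)
dev = CUDA.device()                      # currently active device
util = CUDAExt.get_gpu_utilization(dev)  # percent
watts = CUDAExt.get_power_usage(dev)     # Watts
temp = CUDAExt.get_temperature(dev)      # degrees Celsius
println("utilization = $(util)%, power = $(watts) W, temperature = $(temp) °C")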
CUDAExt.gpuid — Function

Get the GPU index of the given device.

Note: GPU indices start at zero.
CUDAExt._kernel_fma — Method

Dummy kernel doing _kernel_fma_nfmas() many FMAs (default: 100_000).
CUDAExt._peakflops_gpu_fmas — Method

_peakflops_gpu_fmas(; size::Integer=5_000_000, dtype=Float32, nbench=5, nkernel=5, device=CUDA.device(), verbose=true)

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_fma_nfmas() * size many FMAs on CUDA cores.
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float32): element type of the vectors.
- size (default: 5_000_000): length of the vectors.
- nkernel (default: 5): number of kernel calls that make up one benchmarking sample.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
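Since _peakflops_gpu_fmas is an internal helper it is not exported; a hedged sketch of calling it through the extension module (assumes Julia >= 1.9 for Base.get_extension):

using GPUInspector, CUDA
CUDAExt = Base.get_extension(GPUInspector, :CUDAExt)
# quick, low-accuracy run: smaller and fewer samples than the defaults
CUDAExt._peakflops_gpu_fmas(; size=1_000_000, dtype=Float32, nbench=3, nkernel=3)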
CUDAExt._peakflops_gpu_wmmas — Method

_peakflops_gpu_wmmas()

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform _kernel_wmma_nwmmas() many WMMAs on Tensor Cores.
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float16): element type of the matrices. We currently only support Float16 (Int8, :TensorFloat32, :BFloat16, and Float64 might or might not work).
- nkernel (default: 10): number of kernel calls that make up one benchmarking sample.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
- threads (default: max. threads per block): how many threads to use per block (part of the kernel launch configuration).
- blocks (default: 2048): how many blocks to use (part of the kernel launch configuration).
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
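The Tensor Core variant can be called analogously (a sketch; it assumes a device that actually has Tensor Cores, cf. hastensorcores below):

using GPUInspector, CUDA
CUDAExt = Base.get_extension(GPUInspector, :CUDAExt)
if CUDAExt.hastensorcores(CUDA.device())
    CUDAExt._peakflops_gpu_wmmas(; dtype=Float16, nbench=3)  # WMMAs on Tensor Cores
end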
CUDAExt.peakflops_gpu_matmul — Method

peakflops_gpu_matmul(; device, dtype=Float32, size=2^14, nmatmuls=5, nbench=5, verbose=true)

Tries to estimate the peak performance of a GPU in TFLOP/s by measuring the time it takes to perform nmatmuls many (in-place) matrix-matrix multiplications.
Keyword arguments:
- device (default: CUDA.device()): CUDA device to be used.
- dtype (default: Float32): element type of the matrices.
- size (default: 2^14): matrices will have dimensions (size, size).
- nmatmuls (default: 5): number of matmuls that will make up the kernel to be timed.
- nbench (default: 5): number of measurements to be performed, the best of which is used for the TFLOP/s computation.
- verbose (default: true): toggle printing.
- io (default: stdout): set the stream where the results should be printed.
See also: peakflops_gpu_matmul_scaling, peakflops_gpu_matmul_graphs.
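A hedged usage sketch (keyword names as documented above; a smaller size keeps memory use and runtime moderate):

using GPUInspector, CUDA
CUDAExt = Base.get_extension(GPUInspector, :CUDAExt)
# times nmatmuls in-place multiplications of (size, size) Float32 matrices
CUDAExt.peakflops_gpu_matmul(; device=CUDA.device(), dtype=Float32, size=2^12, nmatmuls=5, nbench=5)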
CUDAExt.peakflops_gpu_matmul_graphs — Method

Same as peakflops_gpu_matmul but uses CUDA's graph API to define and launch the kernel.
See also: peakflops_gpu_matmul_scaling.
CUDAExt.peakflops_gpu_matmul_scaling — Method

peakflops_gpu_matmul_scaling(peakflops_func = peakflops_gpu_matmul; verbose=true) -> sizes, flops

Assesses the scaling of the given peakflops_func (defaults to peakflops_gpu_matmul) with increasing matrix size. If verbose=true (default), displays a unicode plot. Returns the considered sizes and the corresponding TFLOP/s values. For further options, see peakflops_gpu_matmul.
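For instance (a sketch), one can suppress the unicode plot and inspect the raw scaling data directly:

using GPUInspector, CUDA
CUDAExt = Base.get_extension(GPUInspector, :CUDAExt)
sizes, flops = CUDAExt.peakflops_gpu_matmul_scaling(CUDAExt.peakflops_gpu_matmul; verbose=false)
for (s, f) in zip(sizes, flops)
    println("matrix size $s x $s: $f TFLOP/s")
end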
CUDAExt.StressTestBatched — Type

GPU stress test (matrix multiplications) in which we try to run for a given time period. We try to keep the CUDA stream continuously busy with matmuls at any point in time. Concretely, we submit batches of matmuls and record a CUDA event after half of them. On the host, after submitting a batch, we (non-blockingly) synchronize on, i.e. wait for, the CUDA event and, if we haven't exceeded the desired duration already, submit another batch. (A minimal sketch of this pattern is shown after the stress-test entries below.)
CUDAExt.StressTestEnforced — Type

GPU stress test (matrix multiplications) in which we run almost precisely for a given time period (the duration is enforced).
CUDAExt.StressTestFixedIter — Type

GPU stress test (matrix multiplications) in which we run for a given number of iterations, or try to run for a given time period (with potentially high uncertainty!). In the latter case, we estimate how long a synced matmul takes and set niter accordingly.
CUDAExt.StressTestStoreResults — Type

GPU stress test (matrix multiplications) in which we store all matmul results and try to run as many iterations as possible within a given memory limit (default: 90% of the free memory).
This stress test is somewhat inspired by gpu-burn by Ville Timonen.
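To make the batched-submission pattern behind StressTestBatched concrete, here is a minimal sketch using plain CUDA.jl primitives (this illustrates the idea only; it is not the package's implementation):

using CUDA, LinearAlgebra

function batched_matmul_stress(; duration=10.0, N=2048, batchsize=10)
    A = CUDA.rand(Float32, N, N); B = CUDA.rand(Float32, N, N); C = similar(A)
    ev = CuEvent()
    t0 = time()
    while time() - t0 < duration
        for i in 1:batchsize
            mul!(C, A, B)                          # asynchronous matmul on the stream
            i == batchsize ÷ 2 && CUDA.record(ev)  # record an event after half the batch
        end
        CUDA.synchronize(ev)  # wait for the stream to reach the event, then resubmit
    end
    CUDA.synchronize()        # drain the remaining work
end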
CUDAExt.alloc_mem — Method

alloc_mem(memsize::UnitPrefixedBytes; devs=(CUDA.device(),), dtype=Float32)

Allocates memory on the devices whose IDs are provided via devs. Returns a vector of memory handles (i.e. CuArrays).
Examples:
alloc_mem(MiB(1024)) # allocate on the currently active device
alloc_mem(B(40_000_000); devs=(0, 1)) # allocate on GPU0 and GPU1

CUDAExt.clear_all_gpus_memory — Function

Reclaim the unused memory of all available GPUs.
CUDAExt.clear_gpu_memory — Function

Reclaim the unused memory of the currently active GPU (i.e. device()).
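Taken together, allocating and reclaiming might look like this (a sketch; MiB is one of GPUInspector's UnitPrefixedBytes helpers used above):

using GPUInspector, CUDA
CUDAExt = Base.get_extension(GPUInspector, :CUDAExt)
handles = CUDAExt.alloc_mem(MiB(512))  # hold 512 MiB on the active device
handles = nothing                      # drop the references ...
CUDAExt.clear_gpu_memory()             # ... and reclaim the unused memory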
CUDAExt.hastensorcores — Function

Checks whether the given CuDevice has Tensor Cores.
CUDAExt.toggle_tensorcoremath — Function

toggle_tensorcoremath([enable::Bool]; verbose=true)

Switches CUDA.math_mode between CUDA.FAST_MATH (enable=true) and CUDA.DEFAULT_MATH (enable=false). For matmuls of CuArray{Float32}s, this should have the effect of enabling or disabling the use of Tensor Cores, respectively. Of course, this only works on supported devices and CUDA versions.

If no argument is provided, this function toggles between the two math modes.
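A sketch of toggling and inspecting the math mode (assumes CUDA.math_mode() as the getter):

using GPUInspector, CUDA
CUDAExt = Base.get_extension(GPUInspector, :CUDAExt)
CUDAExt.toggle_tensorcoremath(true)   # switch to CUDA.FAST_MATH
@show CUDA.math_mode()
CUDAExt.toggle_tensorcoremath(false)  # back to CUDA.DEFAULT_MATH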
GPUInspector.memory_bandwidth_saxpy — Method

Extra keyword arguments:

- cublas (default: true): toggle between CUDA.axpy! and a custom _saxpy_gpu_kernel!.
(This method is from the CUDA backend.)
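For instance (a sketch), one can compare the CUBLAS path against the custom kernel:

using GPUInspector, CUDA
GPUInspector.memory_bandwidth_saxpy(; cublas=true)   # via CUDA.axpy!
GPUInspector.memory_bandwidth_saxpy(; cublas=false)  # via the custom _saxpy_gpu_kernel!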
GPUInspector.gpuinfo — Method

gpuinfo(deviceid::Integer)

Print out detailed information about the NVIDIA GPU with the given deviceid.
Heavily inspired by the CUDA sample "deviceQueryDrv.cpp".
(This method is from the CUDA backend.)
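Usage is straightforward (GPU indices start at zero, as noted above):

using GPUInspector, CUDA
gpuinfo(0)  # detailed report for the first GPU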
GPUInspector.monitoring_stop — Method

monitoring_stop(; verbose=true) -> results

Stops the GPU monitoring and returns the measurement results. Specifically, results is a named tuple with the following keys:

- time: the (relative) times at which we measured
- temperature, power, compute, mem
(This method is from the CUDA backend.)
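A hedged sketch of a full monitoring session (it assumes a companion monitoring_start function, which is not documented in this section):

using GPUInspector, CUDA
monitoring_start()         # assumption: begins background monitoring
# ... run some GPU workload here ...
results = monitoring_stop()
@show results.time         # measurement timestamps
@show results.temperature  # one series per monitored quantity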