ACCELERATING PYTORCH NETWORKS WITH NATIVE CUDA GRAPHS SUPPORT | MICHAEL CARILLI
PyTorch networks running on high-throughput NVIDIA GPUs are frequently limited by CPU overhead. A CUDA graph is a record of GPU operations that can be replayed with a single call, which incurs much less CPU overhead than dispatching ops one at a time. PyTorch 1.10 introduces a Python API for CUDA graphs. By creating a graph once and then replaying it in a training or inference loop, you can accelerate a network whose runtime is at least partially CPU-bound. Graphed regions of the network can also interoperate seamlessly with eagerly executed regions. In this talk, Michael Carilli (Senior DL Framework Engineer, NVIDIA) details the benefits and limitations of CUDA graphs, the speedups observed on real workloads (in some cases over 2x), and the usage of the new API.
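As a rough illustration of the create-once, replay-many pattern described above, here is a minimal inference sketch. It assumes PyTorch 1.10 or later and uses the `torch.cuda.CUDAGraph` class, the `torch.cuda.graph` context manager, and `CUDAGraph.replay()`. The helper names (`build_graphed_forward`, `demo`), the toy model, and the tensor shapes are hypothetical, chosen only for illustration; this is not necessarily the code shown in the talk.

```python
import torch


def build_graphed_forward(model, example_input, warmup_iters=3):
    """Capture model(example_input) in a CUDA graph and return a replay function.

    Hypothetical helper for illustration; assumes a fixed input shape.
    """
    # A replayed graph always reads and writes the same memory, so keep
    # "static" tensors and copy fresh data into them before each replay.
    static_input = example_input.clone()

    # Warm up on a side stream before capture so lazy initialization
    # (for example, library workspaces) happens outside the graph.
    side_stream = torch.cuda.Stream()
    side_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(side_stream), torch.no_grad():
        for _ in range(warmup_iters):
            model(static_input)
    torch.cuda.current_stream().wait_stream(side_stream)

    # Capture: the kernels launched inside this block are recorded once.
    graph = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(graph):
        static_output = model(static_input)

    def replay(new_input):
        static_input.copy_(new_input)  # refill the captured input buffer
        graph.replay()                 # one call relaunches all recorded kernels
        return static_output.clone()   # copy out before the next replay overwrites it

    return replay


def demo():
    """Compare graphed and eager outputs; returns None when no GPU is present."""
    if not torch.cuda.is_available():
        return None
    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4)
    ).cuda().eval()
    graphed_forward = build_graphed_forward(model, torch.randn(8, 16, device="cuda"))

    batch = torch.randn(8, 16, device="cuda")
    with torch.no_grad():
        eager_out = model(batch)
    graph_out = graphed_forward(batch)
    return torch.allclose(eager_out, graph_out, atol=1e-4)


if __name__ == "__main__":
    print(demo())
```

The static input and output buffers reflect one of the limitations the abstract alludes to: a captured graph replays the same kernels on the same memory addresses, so new data has to be copied into the captured tensors, and shapes cannot change between replays.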