TUTEL-MoE-STACK OPTIMIZATION FOR MODERN DISTRIBUTED TRAINING | RAFAEL SALAS & YIFAN XIONG

Mixture-of-Experts (MoE) is a sparsely activated deep learning model architecture whose compute cost grows sublinearly with its parameter count, making it one of the few scalable approaches for training trillion-parameter-scale deep learning models. This talk presents Tutel, an open-source project built on the PyTorch framework. Tutel is actively developed by Microsoft and has been integrated into Microsoft’s DeepSpeed project as well as Meta’s Fairseq project. It currently supports both CUDA and ROCm, and it leverages the All-to-All communication improvements from Microsoft’s MSCCL library. Tutel aims to improve end-to-end MoE performance on the Azure platform for large-scale deep learning training. We demonstrate a number of Tutel results on the Microsoft Azure NDv4 platform: a 7.49x speedup for a single MoE layer, a 1.75x speedup on 64 VMs over the default Fairseq implementation, and a 40% end-to-end speedup on 64 VMs for Meta’s GPT-3 MoE. We encourage the PyTorch developer community to explore Tutel for scaling their respective MoE models!
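To make the "sparsely activated" idea concrete, the toy sketch below shows a top-k gated MoE layer in plain PyTorch: each token is routed to only k of E experts, so adding experts grows the parameter count while per-token compute stays roughly constant. This is an illustrative sketch only, not Tutel's implementation; the class name ToyTopKMoE and all of its parameters are hypothetical, and Tutel replaces this naive routing loop with optimized kernels and All-to-All dispatch.

```python
# Illustrative sketch (not Tutel's implementation): a toy top-k gated MoE layer.
# Each token activates only k of E experts, which is why MoE compute is
# sublinear in the total parameter count.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyTopKMoE(nn.Module):
    def __init__(self, model_dim: int, hidden_dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(model_dim, num_experts)  # router producing per-expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(model_dim, hidden_dim), nn.ReLU(),
                          nn.Linear(hidden_dim, model_dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, model_dim)
        scores = F.softmax(self.gate(x), dim=-1)              # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)   # route each token to k experts
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                 # tokens assigned to expert e in this slot
                if mask.any():
                    out[mask] += topk_scores[mask, slot:slot + 1] * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = ToyTopKMoE(model_dim=64, hidden_dim=256, num_experts=8, k=2)
    tokens = torch.randn(32, 64)
    print(layer(tokens).shape)  # torch.Size([32, 64])
```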
