Supercharge Your Model Training With MosaicML Composer: Community & Partner Talks at PT Conf. 2022
Hey folks, nice to meet everyone, both in person and online. My name is Hagai. I lead engineering at MosaicML, and I'm going to spend the next 10 minutes or so introducing you to MosaicML Composer and how you can use it to supercharge your PyTorch training. Today, thanks to the amazing AI community and thanks to fantastic tools like PyTorch, AI is everywhere, and it's powering our day-to-day lives in many ways, from personalized recommendations through voice assistants and up to novel applications such as autonomous vehicles and AI-powered drug discovery. Now, as AI supports a growing number of use cases, it needs to become more intelligent. And this growth in intelligence is achieved through growth in both model complexity and the scale of the training data. The chart shown here shows how model sizes have been trending for state-of-the-art language models over the past four years or so. If you take a look at the trend line, and mind you the y-axis is log scale, it's very clear that you're looking at exponential growth in model sizes. And the reality is that this growth is not showing signs of slowing down just yet. Now, MosaicML was founded a couple of years ago to address this very challenge of growing AI complexity. Our mission is making machine learning training efficient for everyone, and it's a very relevant challenge given the growth in complexity we just discussed. This is Team MosaicML in our San Francisco Dogpatch office. It's an incredible team, I'm honored to be part of it, and a lot of what I'm going to talk to you about is actually the work of this fantastic team. Now, let's cut to the chase and introduce MosaicML Composer. Composer is a Python library built on top of PyTorch, and its purpose is training neural networks. It enables high accuracy, faster training, and overall reduces the cost of training your model.
The library was developed by machine learning developers, for machine learning developers, and so it's packed with useful features. This session is only 10 minutes, pretty short, so I'm going to focus on three main aspects of Composer: the trainer API, training optimizations, and lastly streaming data loading. Starting with the trainer: the trainer encapsulates PyTorch's training loop, and it also bakes in support for things like multi-GPU and multi-node training. It also allows you to customize your training using callbacks, or hooks, that get called at different stages of training. We can actually see what this means in practice. This is a very simple example of using the Composer trainer. Step one is creating a model class that implements the ComposerModel interface; you can start by implementing just two functions, forward and loss, and you're basically good to go. This example simply wraps the standard ResNet-18 from torchvision. Next, you instantiate the Composer Trainer, and you pass in the model you just created, the standard PyTorch data loaders, and a PyTorch optimizer. You specify the number of epochs and the device to run on, and then you just call fit, and that's it. You let Composer handle the training loop and all of that complexity. Now you can kick back and allow Composer and your GPUs to do the heavy lifting. Metrics are logged during training and you can check them in your console, or, since Composer integrates really well with tools like Weights & Biases or Comet ML, you can check your metrics there. And once training is done, you can look up your torchmetrics in the trainer's state object, as you can see here. Sweet. Moving on to applying training optimizations.
Now, Composer packs more than 20 built-in optimizations to speed up training, and what's awesome is that these optimizations, which are fairly complex algorithmically, are very easy to apply, and you can even compose multiple optimizations together. Let's see what this looks like. You first instantiate the optimizations you wish to apply. All of Composer's optimizations reside under composer.algorithms, and in this code example we instantiate progressive resizing, BlurPool, and label smoothing. Just these three, if you use them, will actually give you really good speedups with models such as ResNet. There are dedicated docs within Composer for each one of these optimizations, pointing at the research papers and giving a lot of insight into how each algorithm actually works. Now, just like before, you instantiate the Trainer object, and this time you also pass in the list of algorithms you instantiated in the previous step, and that's it. You call trainer.fit, and Composer handles all the nitty-gritty details of applying the optimizations during training. And last but not least, streaming data loading. This capability enables you to stream your training data from the cloud, freeing you from the need to download and manage data on local storage. For those who have ever trained large-scale language models, you know what a painful and time-consuming process that is, and it's exactly what streaming data loading saves you from. And what's really cool is that this feature is implemented as a drop-in replacement for PyTorch's IterableDataset. Let's see what this looks like. Step one is converting the dataset into a supported format that allows the streaming engine to index the data and stream it over; you can use the MDSWriter class to generate the streaming dataset. Next, you upload the dataset to your favorite cloud storage. This example shows AWS S3, but of course you can use whatever cloud storage you like.
The streaming engine knows how to read from most cloud vendors. Once you've done this, you instantiate the standard PyTorch DataLoader and provide an instance of StreamingDataset, which itself extends PyTorch's IterableDataset. And last but not least, you instantiate the Trainer, provide the data loader instance, call fit, and kick back while training data is streamed from the cloud. Once you've created that streaming dataset in your cloud bucket, you can stream it for multiple training jobs without the need to always download the data locally and manage it there. Sweet. So with all of that, how fast can you actually go with Composer? I mentioned the optimizations earlier. This is a mosaic of the different optimizations that are supported. We sometimes call these optimizations speedup methods; they're available in Composer today, and more optimizations are regularly implemented. Many of these are highly complex algorithmically, and again, the nice thing about Composer is that it encapsulates all of that complexity, so you don't have to worry about it and can very easily apply it to make sure your model trains fast and efficiently. Let's look at some concrete results of what was achieved with these optimizations. Using Composer to train ResNet-50 with a set of Composer optimizations delivered a 4.4x training speedup running on 8x A100 GPUs. This result was actually submitted to MLPerf and was faster than NVIDIA's own highly optimized training, which was also submitted to MLPerf. Moving to NLP, applying Composer optimizations sped up BERT-Large training by 2.7x, again on 8x A100 GPUs. This result was also submitted to MLPerf and was faster than all other BERT-Large submissions on 8x A100 GPUs; in fact, it was more than two times faster than the second-best result.
So these results are impressive, and just think about how quickly you can iterate on exploring new models and new ideas by using Composer. If you're curious to learn more, check out Composer on GitHub, take a look at our docs, join the Composer community, and let us know what you think. We would really love to hear from you, and we're really keen on helping people adopt Composer and benefit from it. Thanks for tuning in. I really enjoyed this session and hope you did as well. Thank you.