Practical Guide on PyTorch Inference Using AWS Inferentia: PyTorch Conference 2022 Poster

Hello, I'm Keita Watanabe — if you take a look at the slide, you can see my name there. It seems we have two titles, but please ignore "Practical Guide on AWS Inferentia," because in this session I'm not going to talk about Inferentia; that was already covered in the poster session. This session was originally supposed to be given by our colleague Yorosh, but because of an unfortunate accident he is not available, so I'll be presenting today on his behalf.

Before getting to the actual topic, let me briefly introduce my team. The AWS Frameworks team supports self-managed machine learning workloads, meaning workloads that don't use a managed solution such as SageMaker. As a software stack we mostly use Kubernetes (EKS), Slurm, ParallelCluster, and AWS Batch. The primary workload is LLMs, but we are not limited to that; machine learning in general is in our scope.

Today I'm going to talk about our latest option for distributed, large-scale model training, but before jumping into that, let me briefly cover large language model trends. As all of you already know, and as you have seen in the sessions here at NeurIPS, the number of parameters in machine learning models has been increasing over time, especially in recent years. This presents new challenges: communication through NCCL or Gloo, memory limitations, and the need to explore heterogeneous architectures, which in turn raises the need to support different data types.

With that said, to address these challenges we recently released AWS Trainium, a cost-efficient option for LLM training. Trainium is the second-generation AWS-developed, in-house chip for machine learning training. It delivers 1.5 times the performance of P4d instances for popular NLP models and up to four times the network bandwidth of the same instances, and as a result, cost-wise, up to a 50% reduction in the cost of training LLMs.

So far we have three instance types that carry Trainium. The smallest is trn1.2xlarge, which has one Trainium chip, and the largest is trn1.32xlarge, which is equipped with 16 Trainium chips. At this event, which is happening at the same time as NeurIPS, we launched trn1n.32xlarge, which is essentially trn1.32xlarge but with double the instance network bandwidth.

Trainium has native support for a wide range of data types: FP32, TF32, BF16, FP16, UINT8, and configurable FP8. As a result, compared with P4d instances we achieved a large improvement in performance: in one comparison Trainium is 1.4 times faster than P4d, and for 32-bit floating-point formats we see 2.5 times and up to 5 times more performance capability.

Part of what enables this is a feature called stochastic rounding. With the usual rounding, a value like 1.2 is always rounded to 1. With stochastic rounding, on the other hand, the value is rounded probabilistically according to its fractional part: in this case, 2 times out of 10 it rounds to 2, and the rest of the time to 1. Because the rounding is unbiased on average, we can achieve better performance with faster training time. An illustrative sketch of the idea follows.
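Here is a small illustrative sketch of stochastic rounding (my own example, not Trainium's hardware implementation): a value is rounded up with probability equal to its fractional part, so 1.2 rounds to 2 about 2 times out of 10 and to 1 the rest of the time, which keeps the result unbiased in expectation.

```python
import torch

def stochastic_round(x: torch.Tensor) -> torch.Tensor:
    """Round each element up with probability equal to its fractional part."""
    floor = torch.floor(x)
    frac = x - floor
    return floor + (torch.rand_like(x) < frac).to(x.dtype)

# 1.2 rounds to 2 roughly 20% of the time and to 1 otherwise,
# so the mean stays close to 1.2 instead of collapsing to 1.0.
samples = stochastic_round(torch.full((10_000,), 1.2))
print(samples.mean())  # ~1.2, unlike deterministic round-to-nearest (1.0)
```

This is why low-precision training benefits: small updates that deterministic rounding would discard still contribute on average.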
Lastly, let me explain the PyTorch integration. PyTorch support for Trainium is enabled by the software stack shown here. In the first layer we have PyTorch itself, which provides lazy tensors for deferred execution and compilation. The second layer is the JIT cache, which hides compilation overhead and eases the development of machine learning models. PyTorch/XLA converts PyTorch operations into the corresponding XLA operations, and XLA is the compiler-based linear-algebra execution engine. All of this together enables distributed training. On Trainium we support mixed precision with FP16 and BF16. DDP is supported, and if you want to go further with multi-instance training, FSDP is also supported through the Neuron SDK, which is the SDK we use to train models on Trainium.

This is the minimal code we can use to train large models; here we are using Hugging Face Transformers. If you have experience training models with Hugging Face and PyTorch, you might notice that there is not much difference from what you do in usual model training on GPUs. You only need to do a few things. The first is to import torch_xla, as shown at the top; then, as usual, you instantiate your PyTorch model and prepare the training loop. In the training loop there is a probably unfamiliar call, xm.mark_step(), which tells XLA to compile the graph and optimize its performance. During training you run the backward pass and the optimizer step to update your network, as usual. All of this fits in the short example on the slide (a rough sketch is included at the end of this transcript).

That's essentially the brief explanation, but if you want to train a PyTorch model in a native way, without Hugging Face, you can do that as well. In that case you need to instantiate the XLA device and transfer your PyTorch model and the tensors used in the training steps onto that device, just like what you do when training with GPU devices.

Sorry, it doesn't go to the next slide. Excuse me? No questions? Okay, no questions. So, yeah, that's about it. Thanks so much.
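For reference, below is a minimal sketch of the kind of training flow described above: a Hugging Face model trained through torch-xla (the interface the Neuron SDK builds on) on the Trainium XLA device. The model name, optimizer settings, and the assumption that the DataLoader yields tokenized batches with labels are illustrative, not the exact code from the talk.

```python
import torch
import torch_xla.core.xla_model as xm
from transformers import AutoModelForSequenceClassification

# xm.xla_device() returns the XLA device backing the Trainium chip,
# analogous to torch.device("cuda") on a GPU instance.
device = xm.xla_device()

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
model.to(device)  # transfer the model to the XLA device, as you would to a GPU

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def train_epoch(loader):
    """`loader` is assumed to yield dicts of tokenized tensors including `labels`."""
    model.train()
    for batch in loader:
        # Transfer the batch tensors to the XLA device.
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        # mark_step() ends the lazily traced graph so XLA compiles and executes it;
        # the compiled graph is cached, hiding compilation cost on later steps.
        xm.mark_step()
```

For multi-worker data-parallel runs you would typically replace the plain optimizer.step() with xm.optimizer_step(optimizer) (or use the DDP/FSDP support mentioned above) so that gradients are reduced across workers.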
