Scaling inference on CPUs with TorchServe
Hi everyone, I’m Manjin, an AI Frameworks Engineer at Intel. In this session I will talk about scaling inference on CPUs with TorchServe. There are five main topics on today’s agenda. First, I will give an overview of TorchServe. Second, CPU performance tuning principles for getting strong out-of-box CPU performance. Third, Intel Extension for PyTorch. Fourth, TorchServe with Intel Extension for PyTorch. And finally, we’ll conclude with a summary. Let’s get started. What is TorchServe? TorchServe is a performant, flexible, and easy-to-use tool for serving and scaling PyTorch models in production for inference. Model serving is the process of integrating trained deep learning models into a larger system in production, and TorchServe allows users to easily deploy their trained deep learning models. TorchServe supports both PyTorch eager-mode models and TorchScripted models. TorchServe provides a number of default model handlers that take care of the model initialization, preprocessing, inference, and post-processing steps for various domains, including image classification, image segmentation, object detection, and text classification. Users can also write their own custom handlers for their own custom models. Now let’s discuss general CPU performance tuning principles to boost out-of-box TorchServe performance. There are two main principles that I will discuss. First, avoiding logical cores generally improves performance for deep learning workloads. The majority of time in deep learning inference or training is spent on millions of repeated GEMM (general matrix multiply) operations, and most GEMM operators benefit from not using hyperthreading. Therefore, we generally recommend avoiding logical cores by setting thread affinity to physical cores only. Second, multi-socket systems have non-uniform memory access (NUMA).
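The handler steps just described (initialization, preprocessing, inference, post-processing) can be sketched as a minimal, self-contained class. This is a hypothetical illustration of the structure only; a real TorchServe custom handler would subclass `BaseHandler` from `ts.torch_handler.base_handler`, and the stub model and method bodies here are our own simplifications so the sketch runs stand-alone.

```python
# Minimal sketch of the handler structure TorchServe handlers follow:
# initialize -> preprocess -> inference -> postprocess.
# The model here is a plain callable stub, not a real PyTorch model.
class SketchHandler:
    def initialize(self, model):
        # In TorchServe this would load the serialized model from the model archive.
        self.model = model

    def preprocess(self, data):
        # e.g. decode and normalize an image; here we just scale raw byte values.
        return [x / 255.0 for x in data]

    def inference(self, inputs):
        # Run the model on each preprocessed input.
        return [self.model(x) for x in inputs]

    def postprocess(self, outputs):
        # e.g. map logits to labels; here we just round the outputs.
        return [round(y, 3) for y in outputs]

    def handle(self, data):
        # The full request pipeline, in order.
        return self.postprocess(self.inference(self.preprocess(data)))
```

A default handler supplies sensible versions of these steps per domain; a custom handler overrides whichever steps its model needs.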
If a process is not NUMA-aware, slow remote memory is frequently accessed when threads migrate across sockets. We address this problem by setting thread affinity to a specific socket. Keeping these principles in mind, proper CPU runtime configuration can significantly boost out-of-box performance. Before we discuss further, let’s first note how users can obtain their CPU information. CPU information can be retrieved with the lscpu command on Linux. The following is an example of lscpu output. There are two sockets, and each socket has 28 physical cores. Since hyperthreading is enabled, each core can run two threads; in other words, each socket has another 28 logical cores. Therefore, there are 112 CPU cores on the server. The first principle is to avoid logical cores. GEMM operators run on fused multiply-add (FMA) or dot-product execution units, which are shared by hyperthreading cores. With hyperthreading enabled, OpenMP threads will contend for the same GEMM execution units. When two logical threads run GEMM at the same time, they share the same core resources, causing front-end bound stalls, such that the overhead from being front-end bound is greater than the gain from running both logical threads at the same time. Instead, if we use one thread per physical core, we avoid this type of contention. Notice on the left, we avoid logical cores by setting thread affinity to physical cores 0-55. The second principle: we recommend binding a process to a specific socket to avoid slow remote memory access. Each socket has its own local memory, and the sockets are connected to each other, which allows each socket to access the local memory of the other socket, called remote memory. However, it’s better to utilize local memory and avoid remote memory access, which can be around 2x slower. Notice on the left, we pin threads to the physical cores on the first socket, 0-27, to maintain locality of memory access. Let’s apply these principles to deploying ResNet-50 with TorchServe.
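The two principles above can be sketched in a few lines, assuming the example topology just described (2 sockets × 28 physical cores, hyperthreading on, 112 CPUs total) and the common Linux numbering where physical cores are enumerated first (socket 0: 0-27, socket 1: 28-55) and logical siblings follow (56-111). The helper name and the default of 28 cores per socket are our own assumptions for this example machine.

```python
# Hypothetical helper for the example server described above:
# returns the physical core IDs belonging to one NUMA node,
# assuming physical cores are numbered before logical siblings.
def physical_cores(node_id, cores_per_socket=28):
    start = node_id * cores_per_socket
    return list(range(start, start + cores_per_socket))

# Applying both principles on Linux would then look like:
#   os.sched_setaffinity(0, physical_cores(0))        # pin to socket 0 only
#   torch.set_num_threads(len(physical_cores(0)))     # one thread per physical core
```

On a real machine, the numbering should be confirmed against lscpu (or /sys/devices/system/cpu) rather than assumed, since core enumeration varies across systems.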
On top is out-of-box TorchServe single-worker inference without core pinning. When the number of threads is not manually set, PyTorch by default sets the number of threads to the number of physical cores, 56 in this case. We notice one main worker thread is launched, and it then launches 56 threads on all cores, including logical cores. Furthermore, since we did not pin threads to cores of a specific socket, the operating system periodically schedules threads on cores located on different sockets. Comparing local versus remote memory access over time, we can verify remote memory accesses, which result in suboptimal performance. On the bottom is the same setting but with core pinning. We observe that one main worker thread is launched, and it then launches threads on all physical cores of the first NUMA node, 0-27, and we can verify that now almost all memory accesses are local. Now let’s have a look at multi-worker inference. On top is out-of-box TorchServe multi-worker inference with four workers, without core pinning. We notice each of the four main worker threads launches 56 threads, launching a total of 56 × 4 threads, which is twice the total number of cores. Therefore, cores are guaranteed to be heavily overlapped, with high logical core utilization and multiple workers using the same cores at the same time. On the bottom is the same setting but with core pinning. To avoid core overlap among the workers, we can distribute the physical cores equally across the workers and bind each worker to its share. In this case, we bind worker 0 to cores 0-13, worker 1 to cores 14-27, and so on. Doing so makes each worker use its assigned resources as efficiently as possible and minimizes resource contention among the workers. We’ve discussed general principles for CPU runtime configuration. Now I will give an overview of Intel Extension for PyTorch.
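The worker-to-core assignment just described can be sketched as an even partition of the physical cores. The helper name is ours and this is only an illustration of the scheme; in practice TorchServe’s launcher computes and applies this assignment automatically.

```python
# Sketch of the core-pinning scheme above: split the physical cores
# evenly across workers so that no two workers' core sets overlap.
def partition_cores(cores, num_workers):
    per_worker = len(cores) // num_workers
    return [cores[i * per_worker:(i + 1) * per_worker]
            for i in range(num_workers)]

# 56 physical cores, 4 workers:
# worker 0 -> cores 0-13, worker 1 -> 14-27, worker 2 -> 28-41, worker 3 -> 42-55.
assignments = partition_cores(list(range(56)), 4)
```

Each worker would then be pinned (e.g. via sched_setaffinity or numactl) to its own 14-core slice, so the workers never contend for the same execution units.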
Intel Extension for PyTorch is a Python package that extends PyTorch with up-to-date feature optimizations for an extra performance boost on Intel hardware. There are three major optimization techniques in Intel Extension for PyTorch: operator, graph, and runtime optimizations. At the operator level, these include vectorization and parallelization to maximize CPU efficiency, low-precision data types including BF16 and INT8, and data layout optimization for better cache locality. At the graph level, they include graph optimization: while PyTorch by default runs in eager mode, TorchScripted models can be converted into a graph, to which constant folding and operator fusion are applied for reduced compute and better cache locality. At runtime, Intel Extension for PyTorch provides a runtime extension to accelerate even further, including fine-grained thread affinity control and a launcher for automatically setting the optimal CPU configuration. Eventually, all these features and optimizations are targeted for upstreaming to stock PyTorch. Compared to stock PyTorch, Intel Extension for PyTorch optimizations are more aggressive and have a broader scope. Intel Extension for PyTorch provides simple front-end Python APIs for users to easily optimize their model and get these performance optimizations. For FP32 and BF16, users simply need to import the Intel Extension for PyTorch package and apply ipex.optimize to their model object to apply the optimizations. Likewise, Intel Extension for PyTorch’s INT8 quantization API has the same look and feel as that of PyTorch. Now we will introduce TorchServe with Intel Extension for PyTorch. Intel Extension for PyTorch has already been integrated into TorchServe to give out-of-box performance boosts with an ease-of-use API. Simply add a few lines to config.properties, which is the file that TorchServe uses to store configurations, setting ipex_enable to true to enable Intel Extension for PyTorch.
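The front-end API just described comes down to a single ipex.optimize call on the model. Below is a hedged sketch, wrapped so it also runs where the package is not installed; the wrapper name is ours, and for BF16 the real call would pass dtype=torch.bfloat16.

```python
def optimize_if_available(model, dtype=None):
    """Apply ipex.optimize(model) when Intel Extension for PyTorch is
    installed; otherwise return the model unchanged. The fallback is
    our own convenience for this sketch, not part of the IPEX API."""
    try:
        import intel_extension_for_pytorch as ipex  # optional dependency
    except ImportError:
        return model  # IPEX not installed: run the stock model
    # For BF16, a real call would be ipex.optimize(model.eval(), dtype=torch.bfloat16)
    return ipex.optimize(model.eval(), dtype=dtype)
```

In a typical script, the model is set to eval mode and passed through this one call before inference; everything else stays standard PyTorch.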
For the CPU launcher, set cpu_launcher_enable to true to enable the launcher. The launcher automates setting the most optimal CPU configuration. Users can also specify launcher arguments via cpu_launcher_args. In this example, we specify node_id 0 to use the physical cores of the first NUMA node, applying the CPU tuning principles that we’ve discussed. We measured the performance boost of TorchServe with Intel Extension for PyTorch on ResNet-50 and BERT. We observed a 1.21x throughput speedup for BERT and a 2.4x throughput speedup for ResNet-50 with IPEX enabled. Now that we’ve discussed Intel Extension for PyTorch with TorchServe, we’ll conclude with a summary. In this session, we have walked through some general CPU performance tuning principles to boost out-of-box CPU performance. We have also introduced Intel Extension for PyTorch and its integration into TorchServe with an ease-of-use API, showcasing up to a 7.71x throughput speedup for ResNet-50 and a 2.2x throughput speedup for BERT. And finally, please check our GitHub for TorchServe and Intel Extension for PyTorch, and the related blogs that dive deep into today’s topics.
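Putting the flags from this session together, the config.properties entries would look roughly like the following (flag values as described above; a real deployment should confirm them against the TorchServe documentation):

```properties
# TorchServe config.properties: enable IPEX and the CPU launcher
ipex_enable=true
cpu_launcher_enable=true
# Pin workers to the physical cores of the first NUMA node
cpu_launcher_args=--node_id 0
```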