11-785 Spring 2023 Recitation 0C: Introduction to PyTorch
Welcome to Recitation 0C, Introduction to PyTorch. Today, we will explore the various facilities of this framework for training a neural network. Prior knowledge of Python and object-oriented programming is the only prerequisite for this recitation. If you need a refresher, please watch Paul's recitation on Python before this one.

PyTorch is one of the most popular deep learning frameworks. While each framework comes with its own learning curve, mastering one makes it easy to learn others, as most of them use the same underlying concepts and ideas. The popularity of PyTorch is due to the great flexibility it provides for research and prototyping, as compared to frameworks such as TensorFlow, JAX, Caffe, etc. On the flip side, TensorFlow is known for being production-grade.

If you're using Google Colab, PyTorch comes pre-installed. For your own machine, you can install the latest version of PyTorch using either pip or conda. Based on your graphics card, you might have to install an older version; check this link for the corresponding pip or conda commands. If your device lacks a graphics card, you can also install the CPU-only version. However, using Google Colab or a cloud VM with a GPU is recommended in that case. Other recitations in this series cover Google Colab, Amazon Web Services, and Google Cloud.

Now, here's a list of libraries we need to install. While NumPy itself is not required, an understanding of NumPy n-dimensional arrays and the operations that can be performed on them transfers well to PyTorch tensors. If you're new to NumPy, make sure to check the earlier recitation on the fundamentals of NumPy. Matplotlib is for visualizing the dataset and the predictions of the network, while torchsummaryX is for exploring model architectures. If you are using Google Colab and this code snippet shows the device as CPU, go to Runtime, select Change Runtime Type, choose GPU as the Hardware Accelerator, and click Save. This entire notebook can be executed on the free version of Google Colab.

Our objective is to model the Boolean operator exclusive OR (XOR) using a multi-layer perceptron. This is an interesting problem because a single neuron cannot learn this function. However, three neurons put together in this fashion can model it exactly. Here, the first layer of two neurons is called the hidden layer, and you will learn more about this in lecture one. This recitation will give a bird's-eye view of many concepts that will be explored in detail over the semester.

Now, to begin our deep learning project, we need a labeled dataset. As this is a simple example, we can generate a synthetic dataset very quickly. PyTorch provides us with two classes in this regard: Dataset and DataLoader. Dataset is an abstract class that we implement for our custom data. Unless we are using the built-in datasets provided by PyTorch, we almost always end up writing a custom Dataset for every deep learning project. On the other hand, we usually do not have to write a custom data loader; in most cases, we can instantiate an object of the default DataLoader class with our custom Dataset as input. The role of a DataLoader is to make dataset retrieval easy, so that a model can access the samples and their corresponding labels in batches. In the XORDataset class, the functions __len__ and __getitem__ come from the abstract Dataset class. As the names suggest, they return the total size of the dataset and individual samples by index, respectively.
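Here is a minimal sketch of what such a custom Dataset might look like. The class name XORDataset matches the recitation, but the body, argument names, and noise level are illustrative assumptions, not the notebook's verbatim code:

```python
import torch
from torch.utils.data import Dataset

class XORDataset(Dataset):
    """Synthetic XOR data: noisy corner points of the unit square."""

    def __init__(self, size, std=0.1):
        super().__init__()
        # Each coordinate is 0 or 1: randint samples integers in [low, high),
        # and dtype converts them to floats
        data = torch.randint(low=0, high=2, size=(size, 2), dtype=torch.float32)
        # Label each point with the XOR of its (noise-free) coordinates
        labels = (data.sum(dim=1) == 1).float()
        # Add Gaussian noise (mean 0, std as given) to spread the points
        # into continuous space; labels were assigned before the noise
        data = data + std * torch.randn(data.shape)
        self.data, self.labels = data, labels

    def __len__(self):
        # Total number of samples in the dataset
        return self.data.shape[0]

    def __getitem__(self, idx):
        # Return a single (input, label) pair by index
        return self.data[idx], self.labels[idx]
```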
Here we generate the inputs for XOR in continuous space. Before going further, let us double-click on the tensor operations used here. PyTorch tensors are quite similar to NumPy n-dimensional arrays, with two major differences. One, they can leverage GPUs and other hardware accelerators for performing a large number of calculations in parallel. And two, they are optimized for something known as automatic differentiation. We will come back to the second point multiple times during this recitation without going into too much detail today; we have a future recitation specifically dedicated to computing derivatives and automatic differentiation using PyTorch Autograd.

randint creates a tensor of the desired size where each entry is a uniformly generated integer between low and high. Here, the low value is included and high is excluded; hence, all values in the data tensor are either 0 or 1. The dtype parameter converts these integer values into floats. randn, on the other hand, creates a tensor of the same dimensions as the data tensor, with each value drawn from a normal distribution with mean 0 and standard deviation 1.

Here are a few more ways of creating tensors with commonly used values. ones and zeros create tensors with all values as 1 and 0, respectively. eye creates an identity matrix. rand populates its values from a uniform distribution between 0 and 1, where 0 is included and 1 is excluded. arange works quite like the range function of Python, in that it generates a one-dimensional tensor from 0 to the number parameter minus 1. The appendix covers more ways of creating tensors and various elementwise operations that can be performed on them. A few of these creation functions are sketched below.

Now, let us visualize the dataset we just created. Note that we assign output labels to the generated input points before adding Gaussian noise. So, if the standard deviation is too large, we are likely to end up with points that are way too far from their cluster center and hence appear to be misclassified. Here, we also see a conversion between PyTorch tensors and NumPy ndarrays. data.cpu() moves a tensor to CPU memory. As we discussed before, PyTorch tensors can live on both CPU and GPU, while NumPy can only work with the CPU. Sometimes, tensors also have to be detached from the computation graph before being converted to NumPy, like this. We will cover this in more detail in a short while. Converting from NumPy to tensor is pretty straightforward. Here, the returned tensor and the input NumPy ndarray point to the same memory, and future modifications of one will automatically reflect in the other.

Now, with the dataset created, let us instantiate the DataLoader object. Here, batch_size denotes the number of samples to be clubbed together into a single batch. The higher the batch size, the greater the memory requirement and the lower the compute time for one pass through the entire dataset. Shuffling the dataset is recommended during training, as there is no special significance to the order of samples in the dataset. num_workers creates additional worker subprocesses for loading the dataset; again, it reduces compute time at the cost of memory overhead. pin_memory speeds up the transfer of data from CPU to GPU memory. drop_last can be used to force all batches to be of the same size. This can be used in training, but make sure you set shuffle equal to True, or else you will be missing the same few samples every time from the training set. Both the conversions and the DataLoader setup are sketched below.
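As a quick reference, the creation functions just described might be exercised like this (the shapes are arbitrary examples):

```python
import torch

x = torch.randint(low=0, high=2, size=(4, 2))  # integers in [0, 2): every entry is 0 or 1
n = torch.randn(4, 2)   # values drawn from Normal(mean=0, std=1)
o = torch.ones(2, 3)    # 2x3 tensor of all ones
z = torch.zeros(2, 3)   # 2x3 tensor of all zeros
i = torch.eye(3)        # 3x3 identity matrix
u = torch.rand(2, 3)    # uniform values in [0, 1): 0 included, 1 excluded
a = torch.arange(5)     # tensor([0, 1, 2, 3, 4]), like Python's range
```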
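And a sketch of the NumPy conversions and the DataLoader setup; it reuses the XORDataset sketched earlier, and the specific argument values are illustrative, not prescribed:

```python
import numpy as np
import torch
from torch.utils.data import DataLoader

# Tensor -> NumPy: move to CPU first (and detach if it is on the graph)
t = torch.rand(3)
arr = t.detach().cpu().numpy()

# NumPy -> tensor: from_numpy shares memory with the source array,
# so modifying one modifies the other
arr2 = np.zeros(3)
t2 = torch.from_numpy(arr2)
arr2[0] = 7.0            # t2[0] is now 7.0 as well

dataset = XORDataset(size=1000)   # the custom Dataset sketched earlier
loader = DataLoader(
    dataset,
    batch_size=128,      # samples clubbed together into one batch
    shuffle=True,        # reshuffle the sample order every epoch
    num_workers=2,       # worker subprocesses loading data in parallel
    pin_memory=True,     # faster CPU-to-GPU transfers
    drop_last=True,      # drop the final short batch so all batches match in size
)
```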
Alright, let us move on to the main part of our project: the architecture of our network. We intentionally kept it simple. As discussed before, a single linear classifier cannot learn the XOR function; hence, we are using two linear layers here. This 2 corresponds to the dimension of the input, as XOR is a binary operator. This 2 and this 2 correspond to the dimension of the hidden layer; the output features of the previous layer must always be equal to the input features of the next layer. Finally, this 1 corresponds to the dimension of the final output. Overall, this architecture is quite like the multi-layer perceptron we saw earlier. We also added a tanh activation between the two linear layers; wait for the lectures to learn what happens if you do not have an activation between linear layers.

Sequential ensures that the output of one function in the list is fed as the input to the next function. This simplifies the forward function to a single line. Note that we didn't write any backward function. This is because PyTorch automatically takes care of backpropagation for any custom network architecture, using the computation graph for automatic differentiation. Our custom network architecture can also be spread across multiple classes, and PyTorch will still handle backpropagation for us.

To understand the model, we can either directly print it as a string or loop through its named parameters. Here, requires_grad=True shows that the tensor is attached to the computation graph, and device='cuda:0' shows that the tensor is on the first GPU. If your machine has multiple GPUs, you might see cuda:1, cuda:2, cuda:3, etc. To convert such a tensor to a NumPy ndarray, we must first detach it and move it to the CPU, or else we'll get a runtime error. We can also print a summary of the model by sending in an example batch of data. Note that both the model and the data should be on the same device, the GPU in this case, or else we get a runtime error. Here, 8 denotes the batch size and 2 the number of input features.
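A minimal sketch of such a model, assuming the 2-2-1 layer sizes described above (the class name is illustrative):

```python
import torch.nn as nn

class XORModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Sequential feeds each module's output into the next module
        self.layers = nn.Sequential(
            nn.Linear(2, 2),   # 2 input features -> 2 hidden neurons
            nn.Tanh(),         # activation between the linear layers
            nn.Linear(2, 1),   # 2 hidden neurons -> 1 output
        )

    def forward(self, x):
        # A single line; no backward method is needed, since autograd
        # derives it from the computation graph
        return self.layers(x)
```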
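Inspecting the model might then look like this. The torchsummaryX call is sketched from its documented interface and left commented out; the batch size of 8 is just the example from above:

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = XORModel().to(device)

print(model)   # prints the module structure as a string

for name, param in model.named_parameters():
    # Each parameter shows requires_grad=True (attached to the computation
    # graph) and its device, e.g. cuda:0 for the first GPU
    print(name, param.shape, param.requires_grad, param.device)

# Converting a parameter to NumPy: detach from the graph, then move to CPU
w = next(model.parameters()).detach().cpu().numpy()

# Model summary from an example batch: model and data on the same device;
# 8 is the batch size, 2 the number of input features
# from torchsummaryX import summary
# summary(model, torch.zeros(8, 2, device=device))
```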
With the dataset and model ready, we need to decide on the loss function and optimizer. The choices made here might look arbitrary at this stage; that's fine. Over the semester, you will study many loss functions and optimizers in great detail and will develop an intuitive understanding of what to use where. As the output of an XOR is either 0 or 1, we are working on a binary classification problem. For every data point, our network will predict a label that will either match the correct label as per the XOR output or not. We need a loss function to assign a numerical value to all wrong classifications put together. A naive solution would be to simply count all the misclassifications. However, we also want to iteratively train the network to reduce these misclassifications as much as possible. Hence, we need the loss function to be differentiable with respect to the parameters of the network. This will help us to iteratively move towards the desired parameter values that result in minimum loss. The computation graph keeps track of these derivatives, while it is the job of the optimizer to update the network parameters based on the computed gradients. Binary cross entropy is a common loss function for binary classification problems. You can check out this video for a short explanation, or wait for the lectures to cover cross entropy in detail. Always check the documentation of the loss function you intend to use, as some also implement an activation function together with the loss. This is usually done for greater numerical stability or efficiency.

BCEWithLogitsLoss implements a sigmoid function along with the binary cross entropy loss. Hence, we do not use any activation after the last linear layer. For the optimizer, we are using stochastic gradient descent. Now, a word of caution here, especially if you are training a model on GPUs: make sure to move your model to the GPU before you initialize the optimizer with the model parameters, or else no learning will happen.

We also generate larger training and validation datasets and the corresponding data loaders. We train the model on the training data and use the validation data for manually tuning the hyperparameters. We usually also have a test dataset that is completely unseen by the model, for benchmarking its performance against other models. For this simple problem, we will forgo the test dataset and benchmarking.

Let's move on to the training loop. To make the code modular, it's a good practice to write a training function and a validation function, each covering one epoch, and then call them in a training loop; all three are sketched below. Here, we set the model to train mode and iterate through each batch in the train data loader. The train mode is needed because certain torch.nn modules, such as BatchNorm, Dropout, etc., have different forward steps for training and evaluation. As the model is on the GPU, we also need to move the data to the GPU. Our final layer had out_features equal to 1; hence, we have a single output for each sample in the batch. We use squeeze to reduce this two-dimensional tensor into a one-dimensional one. Next, we calculate the loss by comparing the model predictions with the true labels. The returned loss is a zero-dimensional tensor and can be converted to a scalar by calling the item function.

Now comes backward propagation. By default, PyTorch accumulates gradients over subsequent backward passes, with every call adding on to the existing gradients instead of overwriting them. This is desirable in some specific cases where we want to compute the gradient of a loss summed over multiple batches. For every other case, we must zero the gradients before performing backward propagation for each batch. Then, based on the gradients, the optimizer updates the model parameters, and finally, we calculate the accuracy of our predictions to return the training loss and accuracy.

Just like we used squeeze above to remove a tensor dimension with only a single element, we can also unsqueeze to add dimensions. We can unsqueeze along any dimension, and unsqueeze as many times as needed. However, we can only squeeze dimensions with a single element; attempting to squeeze along dimensions with multiple elements has no effect (see the short demo below). You should also go through the appendix for more such tensor manipulations, like flatten, permute, etc.

Now, the evaluation function. We set the model to eval mode and iterate through each batch in the validation data loader. Again, we move the data to the GPU, as the model is on the GPU. Note the use of torch.inference_mode; this is new. In some other places, you might see torch.no_grad. Both are used to locally disable gradients, as you don't need them during evaluation or testing. While we calculate the validation loss to compare it with the training loss, we don't use it to train the model. Finally, we return the validation loss and accuracy.

With the training and evaluation functions written, the actual training loop looks very simple. We arbitrarily decided to run the training for 100 epochs. This is a hyperparameter, and you may need to run for more epochs or implement early stopping, depending upon your problem.
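Here is a sketch of the setup and a per-epoch training function under the assumptions above; the learning rate and the name train_one_epoch are illustrative:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = XORModel().to(device)        # move to GPU *before* creating the optimizer
criterion = nn.BCEWithLogitsLoss()   # sigmoid + binary cross entropy, numerically stable
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def train_one_epoch(model, loader, criterion, optimizer, device):
    model.train()                    # train mode: matters for BatchNorm, Dropout, etc.
    total_loss, correct, count = 0.0, 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)   # data must be on the model's device
        logits = model(x).squeeze(1)        # (batch, 1) -> (batch,)
        loss = criterion(logits, y)         # compare predictions with true labels
        optimizer.zero_grad()               # clear gradients accumulated by earlier batches
        loss.backward()                     # backpropagation via the computation graph
        optimizer.step()                    # update parameters from the computed gradients
        total_loss += loss.item()           # .item() turns a 0-dim tensor into a scalar
        preds = (logits > 0).float()        # logit > 0 is equivalent to sigmoid > 0.5
        correct += (preds == y).sum().item()
        count += y.shape[0]
    return total_loss / len(loader), correct / count
```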
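A matching evaluation function and the outer loop, again as a sketch; train_loader and val_loader stand in for the training and validation DataLoaders built earlier:

```python
import torch

@torch.inference_mode()   # locally disables gradients, like torch.no_grad
def evaluate(model, loader, criterion, device):
    model.eval()          # eval mode for BatchNorm, Dropout, etc.
    total_loss, correct, count = 0.0, 0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        logits = model(x).squeeze(1)
        total_loss += criterion(logits, y).item()  # tracked for comparison, never backpropagated
        preds = (logits > 0).float()
        correct += (preds == y).sum().item()
        count += y.shape[0]
    return total_loss / len(loader), correct / count

for epoch in range(100):  # 100 epochs, chosen arbitrarily
    train_loss, train_acc = train_one_epoch(model, train_loader, criterion, optimizer, device)
    val_loss, val_acc = evaluate(model, val_loader, criterion, device)
    print(f"epoch {epoch}: train acc {train_acc:.3f}, val acc {val_acc:.3f}")
```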
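And the squeeze/unsqueeze behaviour described above, in a few lines:

```python
import torch

t = torch.zeros(8, 1)
print(t.squeeze(1).shape)    # torch.Size([8]): the size-1 dimension is removed
print(t.unsqueeze(0).shape)  # torch.Size([1, 8, 1]): new size-1 dimension in front
print(t.unsqueeze(2).shape)  # torch.Size([8, 1, 1]): works along any dimension
print(t.squeeze(0).shape)    # torch.Size([8, 1]): dim 0 has 8 elements, so no effect
```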
Also, while we are using a fixed learning rate in this simple problem, you may need a scheduler to reduce the learning rate as your model improves in your projects. Finally, this model trains very quickly; for more complex problems, you should consider using tqdm for showing progress bars, and print intermediate results.

With everything done, let's visualize our model's performance. If all went well, the model should reach close to 100% accuracy. If not, train it for more epochs or make the network architecture deeper or wider. In the figure, you can see prominent decision boundaries parallel to one diagonal. Your decision boundaries could also form along the other diagonal, as seen in the slide. A challenge question: can you explain why the decision boundaries sometimes form along one diagonal and sometimes the other? Can you force them to go one way or the other? For more complex network architectures, you will see curves around the clusters instead of straight lines. Why do you think that's the case? Another challenge is to train a neural network to model more complex decision boundaries, like the pentagon you will be seeing in the first lecture. Again, you will have to generate synthetic labeled data using a custom Dataset class and wrap a DataLoader around it. I suggest sampling from a uniform distribution and labeling the points before adding a small Gaussian noise. Figure out what network architecture works best for you. As it is still a binary classification problem, you can continue using the same loss function and optimizer.

Finally, you may have to save your fully trained model, or sometimes save a checkpoint after every few epochs and load it in case your training gets interrupted. Because, you know, life happens. Even though the extension here doesn't matter, the common PyTorch convention is to use .pt or .pth. You can even store the optimizer's and scheduler's state dictionaries along with the model. This is recommended for intermediate checkpoints, so that you can resume training exactly as you left it before the interruption. An important point to note here for Google Colab users: by default, Colab stores the checkpoint in the hosted runtime, and this storage gets deleted if the runtime gets disconnected or if Colab crashes for some reason. To ensure you don't lose hours of training, you should always save checkpoints to Google Drive. This code snippet mounts Google Drive after requesting user authentication; subsequently, you can use this path to save model checkpoints into the Colab Notebooks folder in your Google Drive. Both checkpointing and the Drive mount are sketched after the closing remarks below.

The recitation ends here, but I strongly recommend you go through the following appendix to improve your understanding of how PyTorch tensors work. We will also provide a cheat sheet for PyTorch tensors on the course site for easy reference. Thank you for your time. If you have any questions, please reach out to the course staff on Piazza. We will meet again in the next recitation after a couple of lectures and explore the training of MLPs in much more detail. See you there.
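A sketch of checkpoint saving and resuming, with a hypothetical file name, reusing the model, optimizer, and epoch names from the training sketch:

```python
import torch

# Save a checkpoint; .pt / .pth is the convention, though the extension doesn't matter
torch.save({
    "epoch": epoch,
    "model_state_dict": model.state_dict(),
    "optimizer_state_dict": optimizer.state_dict(),
    # "scheduler_state_dict": scheduler.state_dict(),   # if you use a scheduler
}, "checkpoint.pth")

# Load it to resume training exactly where you left off
checkpoint = torch.load("checkpoint.pth")
model.load_state_dict(checkpoint["model_state_dict"])
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
start_epoch = checkpoint["epoch"] + 1
```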
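And for Colab users, mounting Google Drive and saving there; the destination folder mirrors the usual Colab Notebooks path but is otherwise an example:

```python
import torch
from google.colab import drive

drive.mount("/content/drive")   # prompts for user authentication

# Save checkpoints under Drive so they survive runtime disconnects and crashes
ckpt_path = "/content/drive/MyDrive/Colab Notebooks/checkpoint.pth"
torch.save(model.state_dict(), ckpt_path)
```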