Encoding a Feature Vector for PyTorch Deep Learning (4.1)
Welcome to Applications of Deep Neural Networks and PyTorch with Washington University. In this part, we're going to take a look at how to encode tabular data for PyTorch. Tabular data is data that easily fits into something like Microsoft Excel. It's not the slickest application of deep neural networks, which are often doing computer vision with images, video, and the like, but a lot of data is in that form. It's also not an area where deep learning is always the best choice; you may want to look at something like XGBoost or a support vector machine when deciding what best fits tabular data, but deep learning can certainly deal with it.

I have a link to this Jupyter Notebook in the description, but I'm going to go ahead and open it in Colab so that we can actually execute it. I'm going to run this little part here just to start it up and initialize. We're going to make use of this simple data set that I created. It has an ID column and it has a product column, and the product column is what we're going to try to predict. That column is categorical, so we're trying to predict a categorical value. We'll see that we could also try to predict income if we wanted to deal with something more like a regression-type neural network. You usually want to strip IDs, because the ID field should not give you any information to help you predict; it really just represents the order of the rows coming in. This is not a time series, so the order shouldn't matter here.

I am going to run this part, and what I've done now is generated the dummy variables for the job column. As we saw before, there are two ways you can really do dummy variables: you can make the all-zeros row count for something, or, as I'm doing here, use the very simple case where one value amongst all the other zeros is going to be 1, and that is the category it represents. So if there's a 1 here, this row's job would have been vv.
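The dummy-variable step described above can be sketched like this. This is a minimal, hypothetical stand-in for the course data set (the column names `id`, `job`, and `income` and the job codes are assumptions for illustration), using pandas' `get_dummies`:

```python
import pandas as pd

# Hypothetical stand-in rows for the course data set.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "job": ["vv", "kd", "vv"],
    "income": [50000.0, None, 62000.0],
})

# One-hot encode the job column: each category becomes its own 0/1 column,
# with exactly one 1 per row marking that row's category.
dummies = pd.get_dummies(df["job"], prefix="job")
print(dummies)
```

Each row of `dummies` has a single 1; the column it falls in tells you the original category, so a 1 under `job_vv` means that row's job was vv.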
So now we're going to drop the job field and concatenate these dummies into the master data set, and you can see the data set now has these job columns in it. Always make sure you drop the original categorical field, because it's not numeric; it's going to give you trouble when you try to convert this into a NumPy array and train. Here we're going to do exactly the same thing with area, so you can see the area dummies are now in there, and we also dropped the original area column.

Income does have some missing values, so we're going to get the median of the incomes and fill in the missing ones. Use the median so that outliers do not mess it up. If Elon Musk were in here, his income would just blow everything away. I would certainly want my income to be missing and filled in with the average income with Elon in the data set; it would definitely result in a raise for me.

Now we can look at the columns that we have; we see all of these jobs and areas in there as well. We're going to get the columns that we want for the X, which are the predictors. To do that, we're going to drop product and we're going to drop ID. If we're predicting product, we don't want to use product to predict itself; that would be target leakage. So we'll just go ahead and do that. These are the columns that we're basing it on, and now we can convert the X and Y for a classification.

We also need to do something with the Y. The Y is the products. We don't want to convert the Y to dummies, and this may seem a little inconsistent, because on the X side we're converting categoricals to dummies, but on the Y side we're going to convert them to an index. There are thirty-some of those, so the first one will be the integer 0, then 1, 2, 3, as we go up through them. That's the way PyTorch by default likes to have classification targets represented. Platforms like Keras actually want to use dummy variables for the target, so that is one of the differences between them.
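The whole preparation pipeline just described can be sketched end to end. Again, the column names and values are a hypothetical stand-in for the course data set, not the real file; the steps (dummy-encode the categoricals, drop the originals, median-fill income, drop `id` and the target from X, and integer-encode the target with scikit-learn's `LabelEncoder`) mirror the walkthrough above:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical stand-in rows for the course data set.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "job": ["vv", "kd", "vv", "pe"],
    "area": ["a", "b", "a", "c"],
    "income": [50000.0, None, 62000.0, 58000.0],
    "product": ["b", "c", "b", "a"],
})

# Replace each categorical predictor with its dummy columns,
# dropping the original non-numeric column as we go.
for col in ["job", "area"]:
    df = pd.concat(
        [df.drop(col, axis=1), pd.get_dummies(df[col], prefix=col)],
        axis=1,
    )

# Fill missing incomes with the median, which is robust to outliers.
df["income"] = df["income"].fillna(df["income"].median())

# Predictors: everything except the id and the target itself
# (using the target would be target leakage).
x_columns = df.columns.drop(["id", "product"])
x = df[x_columns].values.astype("float32")

# Target: encode the product categories as integers 0, 1, 2, ...
# which is how PyTorch expects classification targets.
y = LabelEncoder().fit_transform(df["product"])
print(x.shape, y)
```

From here, `x` and `y` are plain NumPy arrays that can be wrapped in PyTorch tensors for training.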
And I just use a LabelEncoder to do that, just like we saw before, and now we can see our X and our Y. Now, that was classification. We might also want to do this as regression: if income were what we were trying to predict, we could get the Y for income and it would just be the values. The .values there converts the column to a NumPy array.

Thanks for watching this video. Please like and subscribe and click the bell icon so that you don't miss anything in this course. And thank you to all the Patreon and YouTube members for your support; it's very much appreciated.
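For the regression variant mentioned above, the target is just the raw income column rather than an encoded index. A minimal sketch, again with hypothetical stand-in values:

```python
import pandas as pd

# Hypothetical stand-in rows; for regression the target is raw income.
df = pd.DataFrame({"income": [50000.0, 62000.0, 58000.0]})

# .values converts the pandas Series to a NumPy array,
# ready to be wrapped in a PyTorch tensor.
y = df["income"].values
print(y)
```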