# Stanford CS221 AI Lecture 2: Machine Learning 1 – Linear Classifiers, SGD (Stochastic Gradient Descent)

## In this AI video ...

Okay, so let’s get started with the actual technical content. So remember from last time, we gave an overview of the class. We talked about different types of models that we’re going to explore. Reflex models, state-based models, variable-based models, and logic models, which we’ll see throughout the course. But underlying all of this is machine learning. Because machine learning is what allows you to take data and tune the parameters of the models so you don’t have to work as hard designing the model. So in this lecture, I’m going to start with the simplest of the models, of reflex-based models, and show how machine learning can be applied to these type of models. And throughout the class, we’re going to talk about different types of models and how learning will help with those as well. So there’s going to be three parts. Classification regression, loss minimization, which is basically a set of objective function of how you want to train your machine learning model, and then sarcastic gradient descent, which is an algorithm that allows you to actually do the work. So let’s start with perhaps the most cliched example of machine learning. So you have, we don’t want to do spam classification. So the input is x, an email message. And you want to know whether an email message is spam or not spam. So we’re going to denote the output of the classifier to be y, which is in this case, either spam or not spam. And our goal is to produce a predictor. So a predictor in general is going to be a function that maps some input x to some output y. In this case, it’s going to take an email message and map it to whether the email message is spam or not. So there’s many types of prediction problems. Binary classification is the simplest one where the output is one of two possibilities, either yes or no. And we’re going to usually denote this as plus one or minus one. Sometimes you also see one in zero. There’s regression where you’re trying to predict a numerical value. For example, let’s say housing price. There’s a multi-class classification where y is not just two items, but possibly 100 items, maybe cat, dog, truck, tree, and different categories. There’s ranking where the output is permutation of the input. This could be useful, for example, if the input is a set of articles or products or webpages and you want to rank them in some order to show to a user. Such a prediction is where y, the output is an object that is much more complicated. Perhaps it’s a whole sentence or even an image. And it’s something that you have to kind of construct. You have to build the same from scratch. It’s not just a labeling. And there’s many more types of prediction problems. But underlying all of this, whenever someone says I’m going to do machine learning, the first question is asking, okay, what’s the data? Because without data, there’s no learning. So we’re going to call an example, x, y, pair, is something that specifies what the output should be when the input is x. And a training data or a set of examples, the training set, is going to be simply a list or a multi-set of examples. So you can think about this as a partial specification of behavior. So remember, we’re trying to design a system that has certain types of behaviors. I’m going to show you examples of what that system should do. If I have some email message that has c just to 21, then it’s not spam, but if it has lots of dollar signs, then it might be spam. And so remember, this is not a full specification of behavior. These 10 examples or even a million examples might not tell you what exactly this function is supposed to do. It’s just examples of what the function could do on those particular examples. OK, so once you have this data, so we’re going to use detrain to denote the data set. Remember, it’s a set of input output pairs. We’re going to push this into a learning algorithm or a learner. And what is the learning algorithm going to produce? It’s going to produce a predictor. So predictors are f. And the predictor remember is what? Actually, itself a function that takes an input x and maps it to an output y. So there’s kind of two levels here. And you can understand this in terms of the modeling inference learning paradigm. So modeling is about the question of what should the types of predictors f you should consider are. Infraints is about how do you compute y given x? And learning is about how you take data and produce a predictor so that you can do inference. OK, any questions about this so far? So this is pretty high level in abstract and generic right now. And this is kind of on purpose because I want to highlight how general machine learning is before going into the specifics of linear predictors. So this is an abstract framework. OK, so let’s dig in a little bit to this actual problem. So just to simplify the email problem, let’s consider a task of predicting whether a string is an email address or not. So the input is a string. And the output is a binary classification problem. It’s either one if it’s an email or minus one if it’s not. That’s where you want. So the first step of doing linear prediction is known as feature extraction. And the question you should ask yourself is what properties of the input x might be relevant for predicting the output of y? So I say it really highlights might be. At this point you’re not trying to encode the actual set of rules that solves the problem. That would involve no learning. That would just be trying to do it directly. But instead for learning, you’re kind of taking a more of a backseat and you’re saying, well, here’s some hints that could help you. So formally a feature extractor takes an input and outputs a set of feature name, feature value pairs. Right? So I’ll go through an example here. So if I have abc.gmail.com, what are the properties that might be useful for determining whether a string is an email address or not? Well you might consider the length of the string if it’s greater than 10. Maybe long strings are less likely to be email addresses than shorter ones. And here the feature name is length greater than 10. So that’s just kind of the label of that feature. And the value of that feature is one representing its true. So it will be zero if it’s false. Here’s another feature, the fraction of alpha numeric characters. So that happens to be 0.85, which is the number. There might be features that test for particular letters. For example, it does a container at sign. Well that as a feature value of one, because there isn’t at sign, ends with dot com is one, ends with dot org is zero because that’s not true. So, and you could have many, many more features. And we’ll talk more about features next time. But the point is that you have a set of properties. You’re kind of distilling down this input, which is, could be a string, it could be an image, or it could be something more complicated, into kind of a ground up fashion that later will see how machine learning algorithm can take advantage of. So you have this feature vector, which has is a list of feature values, and they’re associated names or labels. But later we’ll see that the names don’t matter to the learning algorithm. So actually what you should also think about the feature vector is simply a list of numbers. And just kind of on the side, make a note that, oh, this position number three corresponds to contains at and so on. So I’ve distilled the email address, abc.gmail.com, into the list of numbers, 0 or 1, 0.8, 5, 1, 1, 0. So that’s the feature extraction. It’s kind of distilling, complex objects into lists of numbers, which we’ll see is what the kind of the lingua franca of these machine learning algorithms says. Okay, so I’m going to write some concepts on the board. There’s going to be a bunch of concepts that I’m going to introduce, and I’ll just keep them up on the board for reference. So feature vector is a kind of important notion, and it’s a noted fee of x, an input. So fee itself, sometimes you think about, you call it the feature map, which takes an input and returns a vector, and this notation means that returns in general a d-dimensional vector, so a list of d numbers. And the components of this feature vector, we can write on as phi1, phi2, all the way to phi d of x. Okay, so this notation is convenient, because we’re going to start shifting, and we’re going to be able to bring our focus from thinking about the features as properties of input to features as kind of mathematical objects. So in particular, phi of x is a point in a high-dimensional space. So if you had two features, that would be a point in two-dimensional space, but in general you might have a million features, so that’s a feature, it’s a point in a million-dimensional space. So it might be hard to think about that space, but we’ll see how we can deal with that later in a bit. So that’s a feature vector, you take an input and return a list of numbers. And now the second piece is a weight vector. So let me write down a weight vector. So a weight vector is going to be known as w, and this is also a list of d numbers, it’s a point in a d-dimensional space, but we’re going to interpret it differently as we’ll see later. So the way to think about a weight vector is that for each feature j, so for example, frack of alpha, we’re going to have a real number wj that represents the contribution of that feature to the prediction. So this contribution is 0.6. So what does this 0.6 mean? So the way to think about this is that you have your weight vector, and you have a feature vector of a particular input, and the score of your prediction is going to be the dot product between the weight vector and the feature vector. So that’s written w dot a phi of x, which is written out as basically looking at all the features and multiplying the feature value times the weight of that feature and summing up all those numbers. So for this example, it would be minus 1.2, that’s the weight of the first feature times 1, that’s the feature value plus 0.6 times 0.85, and so on. And they’ll you get this number, 4.51, which happens to be the score for this example. Question? So the feature extraction with phi of x is that supposed to be like an automated process or does it require manual extraction or specification of features? Yeah, so the question is, is the feature extraction manual or automatic? So phi is going to be implemented as a function, like in code, right? You’re going to write this function manually, but the function itself is run automatically on examples. Later we’ll see how you can actually learn features as well. So you can slowly start to do less of the manual effort, but we’re going to hold off them there next time for that. Question? So we’re talking about weight meaning, and I know that in certain types of directions, like the weights mean a percentage change at this variable, each two percentage change of the outcome. It doesn’t mean this. Yeah, so the question is about interpretation of weights. Those weights can have a more precise meaning. You can try to read the T-leaves, but I don’t think there is maybe a, in general, mathematically precise thing you can say about the meaning of individual weights. But intuitively, and the intuition is important, is that you should think about each feature as a little person that’s going to make a vote on this prediction, right? So you’re voting either plus, yay or nay. And the weight of a particular feature is specifies both the direction of the, whether if positive weight means that that little person is voting positive and negative weight means that it’s moving negative. And the magnitude of that weight is how strongly that little person feels about the prediction. So contains add is three, because maybe like at signs generally do occur in you know addresses, but you know the fraction of alpha in numeric characters it’s you know less. So at that level you can have some intuition, but the precise numbers and y is 0.6 versus 0.5, that’s, you can’t really say much about that. Yeah, another question. Does a lot of, is this a sync dot product for the e-print networks? They can be like four weight vectors afterwards. It’s still like, like just points. So right now we’re focusing on linear classifiers. So the question is what happens if you have a neural net with more layers? There’s going to be more dot products, but there’s also good, it’s not just adding more features. There’s going to be other components which we’ll get to in a later lecture. Yeah. Yeah. Yeah. So the question is do the weights have to add up to something? Because short answer is no, there’s obviously restricted settings where you might want to normalize the weights or something, but we’re not going to consider that right now. Later we’ll see that the magnitude of weight does tell you something. Okay, so just to summarize, it’s important to note that the weight vector, there’s only one weight vector, right? You have to find one set of parameters for everybody, but the feature vector is per example. So for every employee you get a new feature vector, and the dot product of those two weight accommodation features is a score. So now let’s try to put the pieces together and define of the actual predictor. So remember we have this box with an F in it which takes X and returns Y. So what is inside that box? And I’ve hopefully given you some intuition, let me go to a board and write a few more things. So the score, remember is w dot fx, and this is just going to be a number. And the predictor, so linear predictor, actually let me call this linear, to be more precise, as a linear classifier, not just a predictor. Classifier is just a predictor that does classification. So a linear classifier denoted f of w. So f is what we’re going to use to denote predictors. W just means that this predictor depends on a particular set of weights. And this predictor is going to look at the score and return the sign of that score. So what is the sign? The sign looks at the score and says is it a positive a number? If it’s positive then we’re going to return plus one. If it’s a negative a number I’m going to return minus one. And if it’s zero then, I don’t care. You can return plus one if you want. It doesn’t matter. So what this is doing, remember the score is either is a real number. So it’s either going to be like kind of leaning towards large value, large positive values or leaning towards large negative values. And the sign basically says, OK, you’ve got to commit. Are you, which side are you on? Are you on the positive side or are you on the negative side and just kind of discretizes it? That’s what the sign does. OK. OK, so let’s look at a simple example. Because I think a lot of what I’ve seen before is kind of more of the formal machinery behind and the math behind how it works. But it’s really useful to have some geometric intuition. Because then you can draw some pictures. OK, so let’s consider this. So we have a wave vector which is 2 minus 1 and a feature vector which is 2 0 and another feature vector which is 0 2 and 2 4. OK, so there’s only two dimensions, so I can try to draw them on a board. So let’s try to do that. OK, so here is a two-dimensional plot. And let’s draw the wave vector first. OK, so the wave vector is going to be at 2 minus 1. OK, so that’s this point. And the way to think about the wave vector is not that point, but actually the vector going from the origin to that point. For reasons I’ll become clear later. OK, so that’s the way to. OK, and then what about the other points? So we have 2 0 0 2. So 2 0 is here. 0 2 is here. And 2 4 is here. OK, so we have three points here. OK, so how do I think about what? What this wave vector is doing? So just for reference, remember the classifier is looking at the sign of w dot of phi of x. OK, so let’s try to do a classification on these three points. OK, so w is, let me write it off for me. So 2 1 and this is 0 2. So what’s the score when I do w dot phi of x here? It’s 4, right? Because this is 2 0, 2 4. So this is just a dot product that’s 4. And take the sign. What’s the sign of 4? 1. OK, so that means I’m going to label this point as a positive, right? Positive point. OK, what about 0 2? Actually, sorry, this is just to be a minus 1, right? OK, this is 2 minus 1. OK, so if I take the dot product between this, I get minus 2. And then the sign of minus 2 is minus 1. OK, so that’s minus. And what about this one? So what’s the dot product there? It’s going to be 0, OK? So this classifier will classify this point as a positive. This is a negative and this one, I don’t know. OK, so we can fill in more points. But does anyone see kind of maybe a more general pattern? I don’t want to have to fill in the entire board with the classifications. Yeah? So orthogonal or a thing to the right of it is positive or a thing to the left of it is negative. Yeah, so let’s try to draw the orthogonal. Oh, this needs to go to that line. OK. OK, so let’s draw the orthogonal. So this is a right angle. And what that gentleman said is that the points, any point over here, because it has a cute angle with w, is going to be classified as positive. So all of this stuff is positive, positive, positive, positive, positive. And everything over here, because it has a obtuse angle with w, it’s going to be negative. So everything over here is negative. And then everything on this line is going to be 0, OK? So I don’t know. OK, and this line is called the decision boundary, which is a concept not just for linear classifiers, but whenever you have any sort of classifier, the decision-zero boundary is the separation between the regions of the space where the classification is positive versus negative. And in this case, it separates, because we have linear classifiers, the decision boundary is straight. And we’re just separating the space into two halves. If you were in three dimensions, this vector would still be just a vector. But this decision boundary would be a plane. So you can think about it as coming out of the board, if you want. But I’m not going to try to draw that. And that’s kind of the geometric interpretation of how linear classifiers work here. Question? Yeah. It seems like your weight could be in magnitude. Like, we have one plus three at the moment. Yeah. This is the boundary’s gradient. Yeah, so that’s a good point. So the observation is that no matter if you scale this weight by two, it’s actually going to still have the same decision boundary. So the magnitude of the weight doesn’t matter. It’s the direction that matters. So this is true for just making a prediction. When we look at learning, the magnitude of the weight will matter because we’re going to consider other more nuanced loss functions. OK, so let’s move on. Any questions about linear predictors? So so far, what we’ve done is we haven’t done any learning. Right, if you’ve noticed, we’ve just simply defined the set of predictors that we’re interested in. So we have feature vector. We have wave vectors multiplying together, get a score. And then you can send them through a sign function and you get these linear classifiers. Right, there’s no specification of data yet. OK, so now let’s actually turn to do some learning. So remember this framework, learning needs to take some data and return a predictor. And our predictors are specified by a weight vector. So you can equivalent when I think about the learning algorithm as outputting a weight vector, if you want, for linear classifiers. And let’s unpack the learner. So the learning algorithm is going to be based on optimization, which we started reviewing last lecture, which separates what you want to compute from how you want to compute it. So we’re going to first define an optimization problem, which specifies what properties we want a classifier to have in terms of the data. And then we’re going to figure out how to actually optimize this. And this modularity is actually really, really powerful. And it allows people to go ahead and work on different types of criteria for and different types of models separately from the people who actually develop general-purpose algorithms. And this has served to the fewer the machine learning quite well. OK, so let’s start with optimization problem. So there’s an important concept called a loss function. And this is a super general idea that’s used in the machine learning and statistics. So a loss function takes a particular example, x, y, and a weight vector, and returns a number. And this number represents how unhappy we would be if we use the predictor given by w to make a prediction on x when the correct output is y. So it’s a little bit of a mouthful. But this basically is trying to characterize, if you had a me a classifier, and I go onto this example and try to classify it, is it going to get it right or is it going to get it wrong? So high loss is bad. You don’t want to lose. And low loss is good. So normally, zero loss is the best you can kind of hope for. OK, so let’s do a figure out the loss function for binary classification here. So just some notation. The correct label is denoted y. And the predicted label, remember, is the score sent through the sine function, and that’s going to give you some particular label. And let’s look at this example. So w equals 2 minus 1, phi of x equals 2, 0, and y equals minus 1. So we already defined the score as, on example, is w dot phi of x, which is how confident we are particularly my plus 1. That’s the way to interpret this. So what’s the score for this particular example again? It’s 4, which means that we’re kind of positive that it’s a plus 1. Yeah, question. I was wondering, is the loss function generally one dimensional or the output of the loss function is even? Yeah, so the question is whether the output of loss function is usually a single number or not. In most cases, it is. For basically, all practical cases, you should think about the loss functions operating at a single number. The inputs can be a crazy high dimensional. Yeah? What is it? Not one dimensional. There are cases where you might have multiple objectives that you’re trying to optimize at once. But in this class, it’s always going to be one dimensional. Maybe you care about both time and space or accuracy, but robustness or something. Sometimes you have multi-objective optimization. But that’s way beyond the scope of this class. OK, so we have a score. And now we’re going to define margin. So let me, OK, so let’s actually do this. So we’re talking about classification. I’m going to sneak regression in a bit. So score is w, dot, f of x. This is how confident we are about plus 1. And the margin is the score times y. And this relies on y being plus 1 or minus 1. So this might seem a little bit mysterious, but let’s try to decipher that here. So in this example, the score is 4. So what’s the margin? You multiply by minus 1, so the margin is minus 4. And the margin’s interpretation is how correct we are. So imagine the correct answer is, if the score and the margin have the same sign, then you’re going to get positive numbers. And then the more confident you are, then the more correct you are. But if y is minus 1 and the score is positive, then the margin is going to be negative, which means that you’re going to be confidently wrong, which is bad. OK, so just to see if we kind of understand what’s going on. So when does a binary classifier make a mistake on a given example? So I’m going to ask for a show of hands. How many people think it’s when the margin is less than 0? OK, I guess we can kind of stop there. I used to do these online quizzes where it was anonymous, but we’re not doing that this year. OK, so yes, the margin is less than 0. When the margin is less than 0, that means y and the score are different signs, which means that you’re making a mistake. OK, so now we have the notion of a margin. Let’s define something called the 0, 1 loss. And it’s called 0, 1 because it returns either a 0, 1. OK, very creative. So the loss function is simply, did you make a mistake or not? So this notation, let’s try to decipher a bit. So if f of x here is a prediction when the input is x, and not equal to y is saying, did you make a mistake? So that’s think about as a Boolean. And this one bracket is just notation. It’s called an indicator function that takes a condition and returns either 1 or 0. So if the condition is true, then it’s going to return a 1. And the condition is follows it returns a 0. So all this is doing is basically returning a 1 if you made a mistake and 0 if you did a make a mistake. And we can write that as follows. We can write that as the margin less or equal to 0, because on the previous slide, if the margin is less or equal to 0, then we’ve made a mistake and we should incur a loss of 1. And if the margin is greater than 0, then we did make a mistake and we should incur a loss of 0. All right, so it will be useful to draw these loss functions pictorially like this. So on the x axis here, we’re going to show the margin. Remember the margin is how correct you are. And on the y axis, we’re going to show the loss function, which is how much you’re going to suffer for it. So remember, if the margin is positive, that means you’re getting a right, which means that the loss is 0. But if the margin is less than 0, that means you’re getting it wrong and the loss is 1. So this is a 0 and loss. That’s the visual that you should have in mind when you think about 0 and loss. Yeah. OK, like less than 0, because we are not to find the element of r to 0 rather that the plus part is correct. Yeah, so there’s this kind of boundary condition of what happens exactly at 0 that I’m trying to sweep under the rug because it’s not terribly important. Here, it’s less we go 0 to be kind of on the safe side. So if you don’t know, you’re also going to get it wrong. Otherwise, you could always just return 0 and then do that. That’s you don’t want that. So is it any questions about a binary classification so far? So we’ve set up these linear predictors. And I’ve defined the 0 and loss as a way to capture how unhappy we would be if we had a classifier that was operating on a particular data point. Next, why? So I’m going to go on a little bit of a digression and talk about linear regression. And the reason I’m doing this is that loss minimization is such a powerful and general framework. And it transcends all of these linear classifiers regressions and ups. So I want to emphasize the overall story. So I’m going to give you a bunch of different examples, classification regression side by side. So we can actually see how they compare. And hopefully, the common denominator will emerge more clearly from that. So we talked a little bit about linear regression in the last lecture. So linear regression, in some sense, is simpler than classification. Because if you have a linear predictor, and you get the score, w dot phi of x, it’s already a real number. So in linear regression, you simply return that real number, and you call that your prediction. OK, so now let’s move towards defining a loss function. So there’s going to be a concept that’s going to be useful. It’s called the residual, which is, again, trying to capture how wrong you are. So here is a particular linear predictor, linear regressor. And it’s making predictions all along for different values of x. And here’s a data point of phi of x, y. So the residual is the difference between the true value y and the predictive value y. And in particular, it’s the amount by which the prediction is overshooting the target. So this is a difference. And if you square the difference, you get something called the square loss. So this is something we mentioned in last lecture. Residual can be either negative or positive, but errors, either if you’re very positive or very negative, that’s bad. And squaring it mixes so that you’re going to suffer equally for errors in both directions. So the square loss is the residual square. So let’s do a simple example. So here we have our wave vector 2 minus 1. The feature vector is 2, 0. What’s the score? It’s 4, y is minus 1. So the residual is 4 minus minus 1, which is 5. And 5 squared is 25. So the square loss on this particular example is 25. OK, so let’s plot this. So just like we did it for a 0 and loss, so let’s see what this loss function looks like. So the horizontal axis here, instead of being the margin, is going to be this quantity for regression called the residual. It’s going to be the difference between the prediction and the true target. And I’m going to plot the loss function. And this loxed function is just the squared function. So if the residual is 0, then the loss is 0. If as the residual grows in either direction, then I’m going to pay something for it. And it’s a quadratic penalty, which means that it actually grows pretty fast. So if the residual is 10, then paying 100. OK, so that’s the squared loss. There’s also another loss I’ll throw in here called the absolute deviation loss. And this might actually be the loss that if you didn’t know about regression, you might immediately come to. It’s basically the absolute difference between the prediction and the actual true target. Turns out the squared loss, there’s a kind of a longer discussion about which loss function makes sense. The salient points here are that the absolute deviation loss is kind of has this kink here. And so it’s not smooth. Sometimes it makes it harder to optimize. But the squared loss also has this kind of thing that blows up, which means that it really doesn’t like having outliers or really large values, because you’re going to pay a lot for it. But at this level, just think about this as different losses. There’s also something called the hubo loss, which kind of combines both of these is smooth and also grows linearly and subtratically. So we have both classification and regression. We can define margins and residuals. We get either different loss functions out of it. And now we want to minimize the loss. So it turns out that for one example, this is really easy. So if I told you, OK, how do I minimize the loss here? Well, OK, it’s zero done. So that’s not super interesting. And this corresponds to the fact that if you have a class firing you’re just trying to fit one point, it’s really not that hard. So that’s kind of not the point. The point of machine learning is that you have to fit all of them. Remember, you only get one way vector. You have all these examples. You have a million examples. And you want to find one way vector that kind of balances errors across all of them. And in general, you might not be able to achieve loss of zero. So tough luck. Life is hard. So you have to make trade-offs. Which examples are you going to sacrifice for the good of other examples? And this is actually a lot of where issues around fairness of machine learning actually come in. Because in cases where you can’t actually make a prediction, that’s equally good for everyone. How do you actually responsibly make these trade-offs? But that’s a broader topic. Let’s just focus on trade-offs defined by the simple sum over all the loss examples. So let’s just say we want to minimize the average loss over all the examples. So once we have these loss functions, if you average over the training set, you get something which we’re going to call the train loss. And that’s a function of W. So loss is, on a particular example, train loss is on the entire data set. So any questions about this so far? So there is this discussion about which regression loss to use, which I’m going to skip. You can read it in the notes if you’re interested. The punchline is that if you want things that look like the mean square loss, if you want things that look like the median, use absolute deviation loss. But I’ll skip that for now. Yeah. You look for a thing here. We’re going to start thinking of regression in terms of loss minimization. So regression, at least squares regression, is from the early 1800s. So it’s been around for, you know, you could call it the first machine learning that was ever done if you want. I guess the loss minimization framework is it’s hard to pinpoint a particular point in time. It’s kind of not a terribly, you know, it’s not like an innovation in some sense. It’s just more of a, at least right now, it’s kind of a pedagogical tool to organize all the different methods that exist. Yeah. So you’re training on median, median, median. Do you mean that like the, is that particular trend said to median, what is the most high-cycle single-seeing loss confidence, whereas like with loss absolute deviation, what do you mean? Yeah. So I didn’t want to get into this example. But briefly, if you have three points that you can’t exactly fit perfectly, you, if you use absolute deviation, then you’re going to find a median value. You’re going to basically predict the median value. And if you use the square loss, you’re going to predict the mean value. But I’m happy to talk offline if you want. OK, so what we’ve talked about so far is we have these wonderful linear predictors, which are driven by feature vectors and wave vectors. And now we can define a bunch of different loss functions that capture how we care about regression and classification. And now let’s try to actually do some real machine learning. How do you actually optimize these objectives? So remember, the learner is going, so now we’ve talked about the optimization problem, which is minimizing the training loss. We’ll come back to that next lecture. And then now we’re going to talk about optimization algorithm. So what is the optimization problem? Now, remember last time we said, OK, let’s just abstract away from the details a little bit. Let’s not worry about if it’s the square loss or some other loss. Let’s just think about as a kind of abstract function. So one dimension, the training loss might look something like this. You have a single weight, and for each weight, you have a number of which is your loss on your training examples. And you want to find this point. So in two dimensions, it looks something like this. And let me try to actually draw this, because I think it will be useful in a bit. Just let me put all this up. OK, so in two dimensions, what optimization looks like is as follows. So I’m now plotting w1 and w2, which are the two components of this two-dimensional weight vector. For every point, I have a weight vector, and that value is going to be the loss of the training loss. And it’s pretty standard in these settings to draw what are called level curves. So let’s do this. So each curve here is a ring of points where the function value is identical. So if you look at terrain maps, those are level curves. So you know, I’m talking about. So this is the minimum. And as you grow out, you get larger and larger. OK, I’ll keep on doing this for a moment. OK. All right? And the goal is to find the minimum. All right, so how are we going to do this? So yeah, question. Assuming that there is a single minimum. Yeah, why am I assuming there is a single minimum? In general, for arbitrary loss functions, there’s no necessary a single minimum. I’m just doing this for simplicity. It turns out to be true for many of these linear classifiers. OK, so last time we talked about gradient descent. And the idea behind gradient descent is that, well, I don’t know where this is. So let’s just start at 0 as good as any place. And what I’m going to do at 0 is I’m going to compute the gradient. So the gradient is this vector that’s perpendicular to the level curves. So the gradient is going to point in this direction that says, hey, in this direction is where the function is increasing the most dramatically. And gradient descent goes in the opposite direction, because we remember we want to minimize loss. So I’m going to go here. And now I’m hopefully reduced my function value, not necessarily, but we hope that’s a case. Now we compute the gradient again. The gradient says, maybe it’s pointing this way. So I go in that direction. And maybe now it’s pointing this way. And I keep on going. This is a little bit made up, but hopefully eventually I get to the origin. And I’m kind of simplifying things quite a bit here. So there’s a whole field of optimization that studies exactly what kind of functions you can optimize and how gradient descent when it works and when it doesn’t. I’m just going to go through the mechanics now and defer the formal proofs of when this actually works until later. So that’s kind of the schema of how gradient descent works. So in code, this looks like this. So initialize that 0. And then loop in some number of iterations, which for simplicity, just think there’s a fixed number of iterations. And then I’m going to pick up my weights, compute the gradient, move in the opposite direction, and then there’s going to be a step size that tells me how fast I want to make progress. And we’ll come back to what the step size does later. So let’s specialize it to a least squares regression. So we kind of did this last week, but just to kind of review. So the training loss for least squares of regression is this. So remember, it’s the average over the loss of individual examples. And the loss of a particular example is the residual squared. So that’s this expression. And then all we have to do is compute the gradient. And if you remember your calculus, it’s just use a chain rule. So this 2 comes down here. You have the residual times the derivative of what’s inside here. And the gradient with respect to W is a phi of x. So last time we did this in Python in one dimension. So in one dimension, hopefully all of you should feel comfortable doing this because this is just kind of basic calculus. Here we have W is a vector. So we’re not taking derivatives, but we’re taking gradients. So there’s some things to be wary of. But in this case, it’s often kind of useful to double check that, well, the gradient version actually matches the single-dimensional version as well. Because last time, remember, we have the x out here. And one thing to note here is that there’s the prediction minus target. That’s the residual. So the gradient is driven by this quantity. So if the prediction equals the target, what’s the gradient? It’s going to be 0, which is kind of what you want. If you’re already getting the answer correct, then you shouldn’t want to move your weights. So often, we can do things in the abstract, and everything will work. But it’s often a good idea to write down some objective functions, take the gradient, and see if gradient descent on using these gradients that you computed is kind of a sensible thing. Because there’s many layers you can understand and get intuition for this stuff at the abstract level, optimization, or kind of at the algorithmic level. You pick up an example. Is it sensible to update when the prediction equals the target? OK, so let’s take the code that we have from last time. And I’m going to expand on it a little bit, and hopefully set the stage for doing stochastic gradient. OK, so last time, we had gradient descent. So remember, last time we defined a set of points, we defined the function, which is the train loss here. We defined the derivative of the function, and then we had gradient descent. OK, so I’m going to do a little bit of house cleaning. I’m just going to make this a little bit more explicit what this algorithm is. Green descent depends on a function, a derivative of a function, and let’s say the dimensionality. And I can call this gradient with f, df, and in this case, it’s d, where d equals 2. OK, you know what I want to kind of separate? This is the kind of algorithms, and this is modeling. So this is what we want to compute, and this is how we compute it. OK, and this code should still work. All right, so what I’m going to do now is upgrade this to vectors. So remember, the x here is just a number, right? But we want to support vectors. So in Python, I’m going to import numpy, so which is this nice vector and matrix library. And I’m going to make some arrays here, and this is just going to be a one-dimensional array, so it’s not that exciting. So this w.x becomes the actual dot, I need to call. And I think nw needs to be np.0’s d. OK, all right, so that should still run, actually, sorry, this one-dimensional. OK, so remember, last time we ran this program, and it starts out with some weights, and then it converges to 0.8, and the function value keeps on going down. All right, so let’s try to, it’s really hard to kind of see whether this algorithm is doing any interesting, because we only have two points. It’s kind of trivial. So how do we go about, because I’m going to also implement the Cassegrine descent, how do we have kind of a test case to see if this algorithm is working? So there’s kind of this technique, which I really like, which is to call generate artificial data. And the idea is that, what is learning? You’re learning as you’re taking a data set, and you’re trying to find the weights that best fit that data set. But in general, if I generate some arbitrary, if I download a data set, I have no idea what the core-in-prope right answer is. So there’s a technique where I go backwards. I say, OK, let’s decide what the right answer is. So let’s say the right answer is 1, 2, 3, 4, 5. So it’s a five-dimensional problem. And I’m going to generate some data based on that so that this wave vector is kind of good for that data. I’m going to skip all my breaks in this lecture. So I’m going to generate a bunch of points. Let’s generate 10,000 points. The nice thing about artificial data is you can generate as much as you want. It was a question, yeah. It’s true W. Oh, so true W just means like the correct, the ground truth, the W. But true W, true output. So W is a wave vector. So this is going backwards. Remember, I want to fit the wave vector, but I’m just saying this is the right answer. So I want to make sure that the algorithm actually recovers this wave. OK, so I’m going to generate some random data. So there’s a nice function, random dot random n, which generates a random, you know, a dimensional vector, and why I’m going to set what should I set Y to? Which side of W does that sound? Yeah, so I’m going to do regression. So I want to do true W dot X, right? So I mean, if you think about it, if I took this data, and I found the true W is the right thing that will get zero loss here. But I’m going to make life a little bit more interesting. I’m going to add some noise. So let’s print out what that looks like. Also, I should add it to my data set. So I’m pretty dead. OK, so this is my data set. I mean, can’t really tell what’s going on, but you can look at the code and you can assure yourself that this data has structure in it. OK, so let’s get rid of this print statement. And let’s train and see what happens. So let’s, OK, oh, one thing I forgot to do. So if you notice that the objective functions that I’ve written down, they haven’t divided by the number of data points. I want the average loss, not the sum. It turns out that if you have the sum, then things get really big and blow up. So let me just normalize that. OK, so let me activate that. OK, so it’s training. It’s training. Actually, so let me do more iterations. So I did 100 iterations, let’s do 1000 iterations. So the function value is going down. That’s always something good to check. And you can see the weights are slowly getting to what appears to be 1, 2, 3, 4, 5. So this is not a hard proof, but it’s evidence that this learning algorithm is actually doing the right thing. OK, so now let’s see if I add more points. So I now have 100,000 points. Now, obviously, it gets slower. And hopefully, you get one day, but I mean, it’s going to kill it. Any questions about my terminal got screwed up? So what I do here, I define. Lost functions took their derivatives. The gradient descent is what we implemented last time. And the only thing different I did, this time is generating a data set so I can kind of check whether gradient descent is working. Yeah, question? The gradient is just as a lot. It’s just as much as it was to learn. Is that what you need to see algorithm to learn from both predictions versus like the predictions? The question is whether the fact that the gradient is residual allows the algorithm to learn from under over predictions. Yeah, so the gradient is, if you think about it, yeah, that’s good intuition. So if you look at, if you’re over predicting, that means the gradient is kind of assumed that this is like one. So that means this is going to be positive, which means that, hey, if you up that weight, you’re going to over predict more and more and curve more lost. So my subtracting the gradient, you’re kind of pushing the weight down in the other direction. And same for when you’re under predicting. Yeah, so that’s good intuition to have. Yeah. What is the effect of the noise? The effect of noise, it makes the problem a little bit harder so that it takes more examples to learn. If you shut off the noise, then we can try it. I haven’t ever done this before. But presumably, you’ll learn faster, but maybe not. The noise isn’t that much. But OK, so let’s say you have 500,000 examples. That’s quite a few examples. And now this algorithm runs pretty slowly. And in modern machine learning, you have millions, 100 millions of examples. So gradient descent is going to be pretty slow. So how can we speed things up? Hello, that. And what’s the problem here? Well, if you look at what the algorithm is doing, it’s iterating. And each iteration, it’s computing the gradient of the training loss. And the training loss is just average of all the points, which means that you have to go through all the points, and you compute the gradient of the loss, and you add everything up. That’s what is expensive in a text time. So you might wonder, well, how can you avoid this? I mean, if you want to do gradient descent, you have to go through all your points. And the key insight behind the CASI gradient descent is that, well, maybe you don’t have to do that. So maybe here’s some intuition. So what is this gradient? So this gradient is actually the sum of all the gradients from all the examples in your training set. So we have 500,000 points adding to that. So actually, what this gradient is, it’s actually some of different things, which are maybe point in in slightly different directions, which are all average out to this direction. So maybe you can actually not average all of them, but you can average just a couple, or maybe even in extreme case, you can just take one of them and just march in that direction. So here’s the idea behind the CASI gradient descent. So instead of doing gradient descent, we’re going to change algorithm to say for each example in the training set. I’m just going to pick it up and just update. It’s instead of sitting down and looking all over all the training examples and thinking really hard, I’m just going to pick up one training example and update right away. So the key idea here is it’s not about quality. It’s about quantity. Maybe not the world’s best life lesson, but it seems to work in here. And then there’s also this question of what should the step size be? And generally in CASI gradient descent, it’s actually even a bit more important because when you’re updating on each individual example, you’re getting kind of noisy estimates of the actual gradient. And people often ask me like, oh, how should I set my step size? And the answer is there is no formula. There are formulas, but there’s no definitive answer. Here’s some general guidance. So if step size is small, so I really close to 0, that means you’re taking tiny steps. That means that it’ll take longer to get where you want to go. But you’re kind of proceeding cautiously, so it’s less likely you’re going to, if you mess up and go in the run direction, you’re not going to go too far in the run direction. Conversely, if you have AWS really, really large, then you, as I go, race car, you kind of drive really fast, but you might just kind of bounce around a lot. So pictorially, what kind of this looks like is that here’s maybe a moderate step size, but if you’re taking steps, really big steps, you might go over here and then you jump around, and then maybe you end up in the right place, but maybe sometimes you can actually get flung off out of orbit and diverse infinity, which is a bad situation. So there’s many ways to set the step size. You can set it to a constant. You could usually have to tune it, or you can set it to be decreasing the intuition being that as you optimize and get closer to the optimum, you kind of want to slow down. If you’re coming on a freeway, you’re driving really fast, but once you get to your house, you probably don’t want to be driving 60 miles an hour. OK, so actually I didn’t implement the cacid gradient. So let me do that. So let’s try to get cacid gradient up and going there. OK, so the interface to cacid gradient changes. So in gradients, all we need is a function, and it just kind of computes the sum of all the training examples. So in cacid gradient, I’m just going to just know SF or cacid gradient. I’m going to take an index i, and I’m going to update on the i-point only. So I’m going to only compute the loss on the i-point. And same for the its derivative. I’m going to look at the i-point, and just compute the gradient on that i-point. And this should be called SDF. OK, so now instead of doing gradient descent, let’s do cacid gradient descent. And I’m going to pass an sdf, sdf, d, and the number of points, because I need to know how many points there are now. Copy gradient descent is basically kind of the same function. I’m just going to stick another forward in there. So cacid gradient descent, it’s going to take the cacid functions cacid gradient, the dimension out in n. So now before I was just going through number of iterations, and now I’m not going to try to compute the value of all the training examples. I’m going to loop over all the points, and I’m going to call just evaluate the function at that point i. And compute the gradient at that point i instead of the entire data set. And then everything else is the same. I mean, one other thing I’ll do here is that I’ll use a different step size schedule, so one divided by number of updates. So I want it so that the number of the step sizes kind of decrease over time. OK, so I start with 8 equals 1, and then it’s half, and then it’s a third, and it’s a fourth, and it keeps on going down. Sometimes you can put a square root, and that’s more typical in some cases. But I’m not going to worry about the details too much. Question? The point i shows you now that we’ve got to give you this value range. Yeah, so the question is, the words cacid means that there should be some randomness here. And technically speaking, the cacid gradient descent is where you’re sampling a random point, and then you’re updating on it. I’m cheating a little bit because I’m iterating over all the points. In practice, if you have a lot of points and you randomize the order, it’s similar. But there is a kind of a technical difference that I’m trying to hide. OK, so this is cacid gradient descent. Do iterate, go over all the points, and just update. So let’s see if this works. I don’t think that worked. Maybe let’s see what happened here. I did try it on 100,000 points. Maybe that works. I don’t know if that didn’t work either. So I’m printing this out at the end of each iteration. So that should be fine. Really, this should work. So gradient descent was working, right? Maybe I’ll try, it’s probably not the best idea to be debugging this live. Let’s make sure gradient descent works. So that was working, right? So it’s a cacid gradient descent. I mean, it’s really fast and converges. But it doesn’t converge to the right answer. Yeah, but that should get recommended to 1. So that it might be true. OK, so I do have a version of this code that does work. So what am I doing here? That’s different. OK, have some water. Maybe I need some water. OK. So this version works. Yeah. Yeah, that’s probably good. That’s a good call. Yeah. OK. All right, now it works. Thank you. So yeah, yeah, this is a good lesson. Is that when you’re dividing, this needs to be 1. Actually, in Python 3, this is not a problem. But I’m so on Python 2 for some reason. But this should be 1.0 divided by num updates. Otherwise, I was getting. OK, so why is it faster? Yeah. OK. OK, let’s go back to 500,000. OK, so one full sweep of the data is the same amount of time. But you notice that immediately already converges to 1, 2, 3, 4, 5. So this is way, way faster than gradient descent. Remember, just to compare, gradient descent is you run it. And after one step, it’s not even close. Yeah. Where is gradient descent? What noise levels do you have to have until gradient descent becomes better? So it is true that if you have more noise, then gradient descent might be, the castor gradient descent can be unstable. There might be ways to mitigate that with step size choices. But yeah, probably you have to add a lot of noise for the castor gradient to be really bad. I mean, this isn’t in some sense. If you take a step back and think about what’s going on in this problem, it’s a five-dimensional problem. There’s only five numbers. And I’m feeding it half a million data points. There’s not that much to learn here. And so there’s a lot of redundancy in the data set. And generally, actually, this is true. I go into a large data set. There’s going to be a lot of redundancy. So going through all the data and then trying to make a informed decision is pretty wasteful where sometimes you can just kind of get a representative sample from one example or more as common to do like kind of many batches where you maybe grab 100 examples and you update on that, which is, so there’s a way to be somewhere between the castor gradient descent. OK, let me move on. OK, summary so far, we have linear predictors, which are based on scores. So linear predictors include both classifiers and regressors. We can do loss minimization. And if we implement it correctly, we can do SGD. OK, so I’m kind of switching things. I hope you’re following along. I introduce binary classification. And then I did all the optimization for linear regression. So now let’s go back to classification and see if we could do stochastic gradient descent here. OK, so for classification, remember we decided that the zero on loss is the thing we want. We want to minimize the number of mistakes. Who can argue with that? So remember what does zero on loss look like? It looks like this. So what happens if I try to run stochastic gradient descent on this? I mean, I can run the code. But yeah, it won’t work. And why won’t it work? Yeah, so two popular answers are, it’s not differentiable. That’s one problem. But I think the deeper problem is that what is the gradient? Zero. It’s like zero basically everywhere, except for this point, which doesn’t really matter. So as we learned, if you try to update with a gradient of zero, then you won’t move your weights. So gradient descent will not work on the zero on loss. So that’s kind of unfortunate. So how should we fix this problem? Yeah. Do you have a little bit of a gradient? Yeah, let’s make the gradient not zero. Let’s skew things. So there’s one loss, which I’m going to introduce called the hinge loss, which does exactly that. So let me write the hinge loss down. And the hinge loss is basically is zero here when the margin is greater equal 1. And rises linearly. So if you’ve gotten it correct by a margin of 1, so you’re kind of pretty safely on the side of, I’m getting it correct, then we won’t charge you anything. But as soon as you start dipping into this area, we’re going to charge you kind of a linear amount. And your loss is going to grow linearly. So there’s some reasons why this is a good idea. So it upper bounds the zero on loss. It has a property called known as convexity, which means that if you actually run the gradient descent, you’re actually going to convert it to the global optimum. I’m not going to get into that. And so that’s a hinge loss. So what remains to be done is to compute the gradient of this hinge loss. So how do you compute this gradient? So in some sense, it’s a trick question, because the gradient doesn’t exist, because it’s not differentiable everywhere. But we’re going to pretend that little point doesn’t exist. So what is this hinge loss? The hinge loss is actually two functions. There’s a zero function here. And then there’s this 1 minus x function. So what am I plotting here? I’m plotting the margin and the loss. So this is the zero function. And this is 1 minus w dot p of xy. And the hinge loss is just the maximum of these two functions. So at every point, I’m just taking the top function. So that’s how I’m able to trace out this curve. All right, so if I want to take the gradient of this function, you can try to do the math. Well, let’s think through it. What should the gradient be? Well, we’re here. What should the gradient be? It’s zero. And if I’m here, what should the gradient be? It should be the whatever the gradient of this function is. So in general, when you have a gradient of this kind of max, you have to break it up into cases. And depending on where you are, you have a different case. So loss is equal to if I’m over here, and what’s the condition for being over here? If the margin is greater than 1, right? And then otherwise, I’m going to take the gradient of this with respect to w, which is going to be minus p of xy, otherwise. OK, so again, we can try to interpret the gradient of the hinge loss. So remember you’re using the caster gradient descent. You have a wave vector, and you’re going to pick up an example, and you say, oh, let’s compute the gradient, move away from it. So if you’re getting an example right, then the gradient zero, don’t move, which is the right thing to do. And otherwise, you’re going to move in the direction because you’re minus, minus of p of xy, which imprints this example into your wave vector. And you can formally show that. It actually increases your margin after you do this. Yeah. What’s the significance of the margin being 1? This is a little bit arbitrary. You just kind of send you a non-zero value. And in support vector machines, you set it to 1, and then you have regularization on the weights, and that gives you some interpretation. So I don’t have time to go over that right now, but I have to feel free to ask me later. There’s another loss function. Do you have a question? Yeah. But why do we choose the margin as a loss function, as opposed to the squared error or another rate? Yeah. So why do you choose the margin? So in classification, we’re going to look at the margin, because that tells you how confidently you’re predicting correctly. In regression, you’re going to look at residuals and square losses. So it depends on what problem you’re trying to solve. Just really quickly, some of you might have heard of logistic regression. Logistic regression is this yellow loss function. So the point of this is saying that this loss minimization framework is really general. And a lot of things that you might have heard of, least squares, logistic regression, are kind of special cases of this. So if you kind of master how to do lots of minimization, you kind of can do it all. OK. So summary, basically what’s on the board here. If you’re doing classification, you take the score, which comes from w.fiavx, and you drive it into the sign, and then you get either plus 1 minus 1. Regression, you just use a score. Now to train, you have to assess how well you’re doing. Classification, there’s a notion of a margin in regression. It’s the residual. And then you can define loss functions. And here, we only talk about five loss functions, but there’s many others, especially for a kind of structure prediction or ranking problems. There’s all sorts of different loss functions. But they’re kind of based on these simple ideas of where you have a hinge that upper bounds the 0, 1 if you’re doing classification and some sort of square like error for regression. And then once you have your loss functions, provide it’s not a 0, 1. You can optimize it using SGD, which turns out to be a lot faster than gradient descent. OK. So next time, we’re going to talk about fiavx, which we’ve kind of left as someone just hands it to you. And then we’re also going to talk about what is a really true objective of machine learning? Is it really to optimize the training costs? OK. Until next time.