# Stanford CS221 AI Lecture 4: Machine Learning 3 – Generalization, K-means

## In this AI video ...

We're going to go through the backpropagation example, which I went through very briefly in the last lecture, talk about nearest neighbors, which I did within one minute. And we're also going to talk about scikit-learn, which is this really useful tool for doing machine learning and might be useful for your final projects. Good to know. So please come to section. All right, let's do a little bit of review of where we are. We've been talking about machine learning, in particular supervised learning, where we start with feature extraction: we take examples and convert them into a feature vector, which is more amenable to a learning algorithm. We can take either linear predictors or neural networks, which give us scores. The score is either defined via a simple dot product between the weight vector and the feature vector, or some more fancy nonlinear combination. At the end of the day, we have these model families that give us score functions, which can then be used for classification or regression. We also talked about loss functions as a way to assess the quality of a particular predictor. In linear classification, we had the zero-one loss and the hinge loss as examples of loss functions we might care about. The training loss is an average over the losses on individual examples. And to optimize all of this, we can use the stochastic gradient descent algorithm, which takes an example (x, y), computes the gradient on that particular example, and then just updates the weights based on that. So hopefully this should all be review. Now I'm going to ask the following question. Let's be a little bit philosophical here. What is the true objective of machine learning? How many of you think it is to minimize the error on the training set? Show of hands. No one? This is what we've been talking about, right? We've been talking about minimizing error on training sets. Okay, well, maybe that's not right then.
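The review above (linear scores via a dot product, the hinge loss, and a stochastic gradient update) can be sketched in a few lines. This is a minimal illustration, not the course's actual code; the tiny dataset and the step size eta are made up.

```python
def dot(w, phi):
    # Score of a linear predictor: dot product of weights and features.
    return sum(wi * pi for wi, pi in zip(w, phi))

def hinge_loss(w, phi, y):
    # Hinge loss on a single example: max(0, 1 - margin).
    return max(0.0, 1.0 - dot(w, phi) * y)

def sgd_step(w, phi, y, eta=0.1):
    # Subgradient of the hinge loss w.r.t. w is -y * phi when margin < 1,
    # so the update adds eta * y * phi; otherwise the weights are unchanged.
    if dot(w, phi) * y < 1:
        return [wi + eta * y * pi for wi, pi in zip(w, phi)]
    return w

# One pass of SGD over a tiny hypothetical dataset.
data = [([1.0, 2.0], 1), ([2.0, -1.0], -1)]
w = [0.0, 0.0]
for phi, y in data:
    w = sgd_step(w, phi, y)
```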
What about minimizing training error with regularization? Because regularization is probably a good idea. How many of you think that's the goal? What about minimizing error on the test set? Okay, seems like it's closer, right? Test accuracies are things you maybe care about. What about minimizing error on unseen future examples? Okay, so the majority of you think that's the right answer. And what about learning about machines? Who doesn't want to learn about machines? That's actually the true objective. Now, the correct answer is minimizing error on unseen future examples. I think all of you have the intuition that we are learning on data, but what we really care about is how this predictor performs in the future. Because we're going to deploy this in a system, and the future is going to be unseen. Okay, so then how do we think about all these other things: the training set, regularization, and the test set? That's something we'll come back to later. Okay, so there are two topics today. I want to talk about generalization, which I think is a pretty subtle but important thing to keep in mind when you're doing machine learning. And then we're going to switch gears and talk about unsupervised learning, where we don't have labels, but we can still do something. So we've been talking about training loss, right? We've made a big deal about how you write down what you want and then you optimize. So the question is: is this training loss a good objective function? Well, let's take this literally. Suppose we really wanted to just minimize the training loss. What would we do? Well, here's an algorithm for you. You just store your training examples. Then you define your predictor as follows: if you see that particular example in your training set, then you just output the output that you saw in the training set.
And otherwise you're just going to segfault. You're just going to crash, right? So this is great. It minimizes the training error perfectly. It gets zero loss, assuming your training examples don't have conflicts. But you're all laughing because this is clearly a bad idea. Somehow purely following this minimize-the-training-error objective is not really the right thing. So this is a very extreme example of overfitting. Overfitting is this phenomenon you see where you have some data, usually with some noise, and you are trying to fit a predictor, but you're really trying too hard. So if you're fitting this green squiggly line, you're overfitting the data and getting zero training error, but you're kind of missing the big picture, which is this black curve. Or in regression, some of you have probably seen examples where you have a bunch of points, usually with noise, and if you really try hard to fit the points you're going to get zero error, but you're missing the general trend. And overfitting can really bite you if you're not careful. So let's try to formalize this a little bit more. How do we assess whether a predictor is good? Because if we can't measure it, we can't really optimize it. Okay, so the key idea is that we really care about error on unseen future examples. This is great as an aspiration to write down, but the question is how do we actually optimize it? Because it's the future, and it's also unseen, so by definition we can't get a handle on it. So typically what people do is the next best thing, which is to gather a test set, which is supposed to be representative of the types of things you would see. And then you guard it carefully and make sure you don't touch it too much. Because what happens if you start looking at the test set? The worst case is you train on the test set, right?
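The memorize-everything predictor from earlier can be written down directly. A sketch (the example data is hypothetical): it gets zero training error by construction, but fails on anything unseen.

```python
def train_rote(examples):
    # Store the training examples in a lookup table.
    table = dict(examples)

    def predict(x):
        if x in table:
            return table[x]  # seen at training time: echo the stored label
        raise ValueError("unseen input")  # otherwise, give up (the "crash")

    return predict

# Zero training error on these two examples, useless everywhere else.
f = train_rote([("cat", 1), ("dog", -1)])
```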
Then the test set being a surrogate for unseen future examples just completely goes away. And even if you just start looking at it and really trying to optimize it, you can get into this overfitting regime. So we really have to be careful, and I want to emphasize that the test set is really a surrogate for what you actually care about. So don't blindly try to make test accuracy numbers go up at all costs. Okay, but for now, let's assume we have a test set that we have to work with. So there's this really peculiar thing about machine learning, which is this leap of faith: the training algorithm is only operating on the training set, and then all of a sudden you go to these unseen examples or the test set and you're expected to do well. So why would you expect that to happen? As I alluded to on the first day of class, there are some actually pretty deep mathematical reasons for why this might happen. But rather than getting into the math, I just want to give you a maybe intuitive picture of how to think about this gap. So remember, we have this picture of all predictors. These are all the functions that you could possibly want in your wildest dreams. And when you define a feature extractor or a neural network architecture or any sort of structure, you basically say: hey, I'm only interested in this set of functions, not all functions. And learning is trying to find some element of the class of functions that you set out to find. So there's a decomposition which is useful. Let's take this point, g. So g is going to be the best function in this class, the best predictor that you could possibly get. If some oracle came and set your neural network's weights to something, how well could you do? Okay? So now there are two gaps here.
One is the approximation error. The approximation error is the difference between f star, which is the true predictor, the thing that always gets the right answer, and g, which is the best thing in your class. So this really measures how good your hypothesis class is. Remember, last time we said that we want the hypothesis class to be expressive. If you only have linear functions and your data looks sinusoidal, then that's not expressive enough to capture the data. The second part is estimation error. This is the difference between the best thing in your hypothesis class and the function you actually find. And this measures how good the learned predictor is, relative to the potential of the hypothesis class. You define this hypothesis class as: here are the things I'm willing to consider. But at the end of the day, based on a finite amount of data, you can't get to g. You can only estimate; you do learning and you get to some f hat. So in more mathematical terms, if you look at the error of the thing your learning algorithm actually returns minus the error of the best thing possible, which in many cases is zero, then this can be written as follows. All I'm doing is subtracting error of g and adding error of g, so this is the same quantity as before. But then I can look at the two terms. The estimation error is the difference in error between the thing your learning algorithm produces and the best thing in the class, g, and the approximation error is the difference between the error of g and the error of f star. Okay? So this is useful as a way to conceptualize the different trade-offs. So just to explore this a little bit, suppose I increase the hypothesis class size. I add more features, or I increase the dimensionality of my neural network. What happens?
So the approximation error will go down. And why is that? Because we're taking a minimum over a larger set. g is always the minimum possible error over the hypothesis class, and if I make that class larger, I have more possibilities for driving the error down. Okay? So the approximation error is going to go down, but the estimation error is going to go up as I make my hypothesis class more expressive. And that's because it's harder to estimate something more complex. I'm leaving this a bit vague right now; there's a mathematical way to formalize it, which you can ask me about offline. So you can see there's a tension here. You really want to make your hypothesis class large so you can drive down the approximation error, but you don't want to make it so large that it becomes impossible to estimate. Okay. So now we have this abstract framework. What are some knobs we can tune? How do we control the size of the hypothesis class? We're going to talk about two generic strategies. Strategy one is dimensionality. Remember, for linear classifiers, a predictor is specified by a weight vector. This is D numbers, and we can change D. We can make D smaller by removing features; we can make it larger by adding features. Pictorially, you can think of reducing D as reducing the dimensionality of your hypothesis class. If you're in three dimensions, you have three numbers, three degrees of freedom, this kind of ball. And if you remove one of the dimensions, now you have a circle in two dimensions. So concretely, what this means is, and this is a little bit heuristic, you can manually add features if they seem to be helping and remove features if they don't help. So you can modulate the dimensionality of your weight vector.
Or there are also automatic feature selection methods, such as boosting or L1 regularization, which are outside the scope of this class. Maybe if you take a machine learning class, you'll learn more about this stuff. But the main point is that by setting the number of features you have, you can vary the expressive power of your hypothesis class. Okay, so the second strategy is looking at the norm, or the length, of the weight vector. This one is maybe a little less obvious. Again, for linear predictors, the weight vector is just a d-dimensional vector, and you look at how long this vector is. Pictorially, the length looks like this: each w can be thought of as a point, and this circle contains all the weight vectors up to a certain length. By making that length smaller, you're now considering a smaller set of weight vectors. Okay, so at that level it's perhaps intuitive. What does this actually look like? Suppose we're doing one-dimensional linear regression. Here's the board, and we're looking at x and y. In one dimension, w is just a single number, and that number represents the slope of the line. So let's draw some slopes here. By saying that the weight has a smaller magnitude, you're basically saying the slope is smaller, closer to zero. So let's say this is slope equals one; then anything less than plus one and greater than minus one is fair game. And if you reduce this to half, now you're looking at a smaller window here. And if you keep on reducing it, you're basically converging to essentially very flat, constant functions. Okay? So you can understand this in two ways.
One is just that the total number of possible weight vectors you're considering is shrinking, because you're adding a constraint that they have to be smaller. From this picture, you can also think about it as making the function smoother. A flat function is kind of the smoothest function; it doesn't vary too much. And a complicated function is one that can jump up very steeply, and for quadratic functions it can also come down really quickly, so you get very wiggly functions. Those tend to be more complicated. Okay? Any questions about this so far? Yeah? [Student] What if there are structures within the data set? By trying not to overfit, aren't we shrinking ourselves to a particular set of distributions? We say, okay, this data must have come from something normal, something reasonable. But by saying that, we're not really capturing the full scope of our data set. I'm not sure. So let's see. The question is: if there's particular structure inside your data set, for example if some things are sparse or low rank or something, how do you capture that with regularization? [Student] Right, and if you have, say, relations between your parameters, would regularization impede some of those relations? Yeah. So all of this is very generic, right? You're not making any assumptions about what the classifier is or what the features are. These are kind of big hammers that you can just apply.
If you have models with more structure or domain knowledge, for example Bayesian networks, which we'll see later in the class, then there's much more you can do. These are just two techniques as a generic way of controlling for overfitting. Yeah? [Student] Just to make sure I'm understanding correctly: this approach puts constraints on the magnitude of each element in the vector w, versus the other one, which actually removes elements from the vector w? Yeah. So let's look at w here. Let's say you're in three dimensions, so w is (w1, w2, w3). The first method just says: okay, let's kill some of these elements and make the vector smaller. This one says, formally, you look at the squared values of each of these and take the square root; that's what the norm is. So it's saying that each of these should be small according to this particular metric. Yeah. So that's what I'm going to get to. This was just giving you intuition in terms of hypothesis classes and why you want them to be small. How do you actually implement this? There are several ways to do it, but the most popular way is to add regularization. And by regularization, what I mean is: take your original objective function, which is the training loss of w, and add this penalty term. Lambda is called the regularization strength; it's just a positive number, let's say one. And this is the squared length of w. So what this is doing is saying: okay, optimizer, you should really try to make the training loss small, but you should also try to make this penalty small as well. Okay.
And if you study convex optimization, there's a duality between this, which is called the penalized or Lagrangian form, where you add a penalty on the weight vector to the objective, and the constrained form, where you say: minimize the training loss subject to the norm of w being less than some value. But the penalized form is the more typical one you're going to see in practice. Okay. So here's the objective function. Great. How do I optimize it? Yeah. Okay, so it's important that these be the same w and you're optimizing the sum. The optimizer is going to make these trade-offs: if it says, okay, I can drive the training loss down, but this penalty is shooting up, then that's not good, and it'll try to balance the two. It's basically saying: try to fit the data, but not at the expense of having huge weight vectors. Another way to say it, think about Occam's razor: if there's a simple way to fit your data, then you should just do that instead of finding some really complicated weight vector that fits your data. So prefer simple solutions. Okay. Once you have this objective, we have a standard crank we can turn to turn this into an algorithm: you can just do gradient descent. If you take the derivative of this, then you have the gradient of the training loss plus lambda w, which is the gradient of the penalty term. So you can understand this as doing gradient descent as before, except now you're also shrinking the weights towards zero by lambda. Lambda is the regularization strength: if it's large, that means you're really pushing down on the magnitude of the weights. So the optimizer is basically going to say: hey, I'm going to try to step in the direction that makes the training loss small, but then I'm also going to push the weights towards zero.
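The regularized gradient update just described can be sketched as follows. The names eta and lam and the numbers are illustrative, and grad_loss stands in for the gradient of the training loss on the current example.

```python
def regularized_step(w, grad_loss, eta=0.1, lam=1.0):
    # L2-regularized gradient descent update:
    #   w <- w - eta * (grad_loss + lam * w)
    # The lam * w term shrinks the weights towards zero (weight decay).
    return [wi - eta * (gi + lam * wi) for wi, gi in zip(w, grad_loss)]

w = [2.0, -4.0]
# Even with a zero training-loss gradient, the weights decay towards zero.
w = regularized_step(w, [0.0, 0.0])
```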
In the neural nets literature, this is also known as weight decay. In optimization and statistics, it's known as L2 regularization, because this is the Euclidean or 2-norm. Okay, so here is another strategy which intuitively gets at the same idea, but is in some sense more crude. It's called early stopping, and the idea is very simple: you just stop early. Instead of training for 100 iterations, you just train for 50. Okay? So why does this work? The intuition is that if you start with the weights at zero, that's the smallest you can make the norm of w. Every time you update on the training set, generally the norm goes up. There's no guarantee that it will always go up, but generally this is what happens. So if you stop early, you're giving less of an opportunity for the norm to grow. Fewer updates generally translates to a lower norm. You can also make this formal mathematically, but the connection is not as tight as with the explicit regularization from the previous slide. Okay, so the lesson here is: try to minimize the training error, but don't try too hard. Yeah, question? [Student] Does this depend on how we initialize the weights? The question is: does this depend on how we initialize the weights? Most of the time you're going to initialize with some sort of baseline weights, either zero or, for neural nets, maybe random vectors around zero, but they're pretty small weights. And usually the weights grow outward, from small to large. There are other cases, like pre-training, where you have a pre-trained model, you start with some weights and do gradient descent from there, and then you're basically saying: don't go too far from your initialization. Yeah. [Student] It seems like this makes us focus on the norm of the weights, but actually we want to minimize the training loss. So why are we not really focusing on the training loss?
Right, so the question is: why aren't we focusing on minimizing the training loss? It's always going to be a combination. The optimizer is still trying to push down on the training loss by taking these gradient updates. Notice that the gradient with respect to the regularizer actually doesn't come in here; it comes in implicitly through the fact that you're stopping early. But it's always a balance between minimizing the training loss and making sure your classifier's weights don't get too complicated. Yeah. [Student] How do you know what value of lambda or t to set? So the question is how you decide the value of t here and the value of lambda. These are called hyperparameters, and I'll talk a little more about that later. Okay, so here's the general philosophy you should have in machine learning. You should try to minimize the training error, because really that's the only thing you can do; that's the data you have. But you should try to do so in a way that keeps your hypothesis class small. So try to minimize the training error, but don't try too hard. I guess that's the lesson here. Okay, so now going back to the earlier question: if you noticed, throughout this presentation there are all sorts of properties of the learning algorithm, which features you have, which regularization parameter, the number of iterations, the step size for gradient descent. These are all considered hyperparameters. So far, they're just magical values that are given to the learning algorithm, and the learning algorithm runs with them. But someone has to set them, and how do you set them? You can ask me; I don't know the answer to that. Okay, so here's an idea. Let's choose the hyperparameters to minimize the training error. How many of you think that's a good idea? Okay, not too many. So why is this a bad idea? Yeah, you can overfit, right?
So suppose you took lambda and said: hey, let's choose the lambda that minimizes the training error. The learning algorithm says: well, okay, I want to make this go down. What's standing in the way? This penalty term. So let's just set lambda to zero, and then I don't have to worry about it. So it's cheating, in a way. And likewise, early stopping would say: don't stop, just keep on going, because you're always going to drive the training error lower and lower. Okay, so that's not good. So how about choosing hyperparameters to minimize the test error? You might say, yeah, that's a good idea. Not so good, it turns out. So why? This is again stressing the point that the test error is not the thing you care about. What happens when you try to use the test set this way? It becomes an unreliable estimate of the actual unseen error, because if you're tuning hyperparameters on the test set, it becomes less and less unseen. Yeah. [Student] Couldn't you do cross-validation? Yeah, so we could do cross-validation, which I'll describe in a second. Okay, so I want to emphasize this point. When you're doing your final project, you have your test set sitting there, and you should not be fiddling with it too much, or else it becomes less reliable. Okay, so you can't use the test set. So what do you do? Here's the idea behind a validation set. You take your training set and sacrifice some amount of it, maybe 10%, maybe 20%, and you use it to estimate the test error. This is the validation set. The test set is off to the side, locked in the safe; you're not going to touch it. And then you just tune hyperparameters on the validation set and use that to guide your model development. Yeah. [Student] Is the proportion itself a hyperparameter?
No. Usually you don't tune that. How you choose it is a balance: you want the validation set to be large enough that it gives you reliable estimates, but you also want to use most of your data for training. Yeah. [Student] So how do you choose the hyperparameters? The answer is you try particular values. For example, try lambda equals 0.01, 0.1, 1, run your algorithm, look at your validation error, and just choose the one that has the lowest. It's pretty crude, but yeah. [Student] Is there a way to choose things like the learning rate without just doing a search, trying this one, then that one? So, is there a better way to search for hyperparameters? You can do a grid search; generally, even random sampling is fine. There are fancier things based on Bayesian optimization, which might give you some benefits, but the jury's still out on that, and they're more complicated. You can also use better learning algorithms which are less sensitive to the step size, so you don't have to nail it exactly; you don't want a situation where 0.1 works but 0.11 doesn't. But the high-level answer is that there's no real principled formula where you just evaluate lambda equals something and you're done. This is the dirty side of machine learning: there's always this tuning that needs to happen to get good results. Yeah, question over there. [Student] Is this process usually automated, or is it manual? So the question is: is this process automated? Increasingly, it's becoming much more automated. But it requires a fair amount of compute, right? Because usually, if you have a large dataset, even training one model might take a while, and now you're talking about training, let's say, 100 models.
So it can be very expensive, and there are things you can do to make it faster. But in general, I would advise that you don't hyperparameter-tune blindly, especially when you're learning the ropes. Doing it manually and getting intuition for what an effective step size is, is still valuable to have. Once you get the hang of it, then maybe you can automate, but I wouldn't try to automate too early. Yeah. [Student] If small changes in hyperparameters lead to very big changes in prediction accuracy, is that considered a big indicator of the model not being very robust? Yeah, it means your model is probably not as robust. And sometimes you don't change the hyperparameters at all and you still get varying model performance. So you should always check that first, because there could be inherent randomness, especially if you're doing neural networks, or you could get stuck in a local optimum. There are all sorts of things that can happen. Okay, final question. [Student] How do we find the optimal hyperparameter? So you basically have a for loop that says: for lambda in, you know, 0.01, 0.1, 1, and whatever values, for t equals something, you train on all the training examples minus the validation set, then you test the model on the validation set, you get a number, and you just use whichever setting gives you the lowest number. I'm sorry? [Student] Do we need precise values, or just the right range? Yeah, usually you just have to be in the ballpark. You don't have to get, like, 99 versus 100.
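The for loop just described, train with each candidate hyperparameter value and keep the one with the lowest validation error, can be sketched like this. Here train_and_eval is a hypothetical stand-in for "train with this lambda on the training split and return the validation error".

```python
def grid_search(lambdas, train_and_eval):
    # Try each candidate value; keep whichever gives the lowest
    # validation error.
    best_lam, best_err = None, float("inf")
    for lam in lambdas:
        err = train_and_eval(lam)
        if err < best_err:
            best_lam, best_err = lam, err
    return best_lam, best_err

# Toy stand-in whose validation error happens to bottom out at lambda = 0.1.
best, err = grid_search([0.01, 0.1, 1.0], lambda lam: (lam - 0.1) ** 2)
```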
The thing I would advise is to think in terms of orders of magnitude, because if it really matters down to the precise number, then you probably have other things to worry about. Okay, let's move on. So what I'm going to do now is go through a sample problem, because I think the theory and the practice of machine learning are actually quite different in terms of the types of things you have to think about. So here's a simplified named-entity recognition problem. Named-entity recognition is this widely popular task in NLP where you're trying to find names of people, locations, and organizations. The input is a string which potentially contains a name, with the left and right context words. And the goal is to predict whether this x contains a person, which is plus one, or not, which is minus one. Okay, so here's the recipe for success. When you're doing your final project or something, you get a data set. If it hasn't already been split, split it into training, validation, and test, and lock the test set away. Then, first, I would try to look at the data to get some intuition. Always remember, you want to make sure that you understand your data. Don't just immediately start coding up the most fancy algorithm you can think of. And then you repeat: you implement some features, maybe change the architecture of your network, you set some hyperparameters, you run the learning algorithm, and then you look at the training and validation error rates to see how they're doing, whether you're underfitting or overfitting. In some cases you can look at the weights of your classifier, though for neural nets that might be a little harder, and then I would recommend looking at the predictions of your model.
I always try to log as much information as you can, so that you can go back and understand what the model is trying to do. Then you brainstorm some improvements, and you do this until you're either happy or you run out of time, and then you run it on the final test set and you get your final error rates, which you put in your report. Okay, so let's go through an example of what this might look like. This is going to be based on the code base for the sentiment homework. Okay, so here's where we're starting. We're reading a training set; let's look at it. There are 7,000 lines here. Each line contains the label, which is minus one or plus one, along with the input, which is going to be, remember, the left context, the actual entity, and the right context. All right, so you also have a development or validation set. What this code does is learn a predictor: it takes the training set and a feature extractor, which we're going to fill out, and it outputs both the weights and some error analysis, which we can use to look at the predictions. And finally there's this test step, which I'm not going to do for now. Okay, so the first thing is to define this feature extractor. The feature extractor is phi of x, and we're going to use the sparse map representation of features. For phi, there's this really nice utility data structure called defaultdict. It's like a map, but if you access an element that isn't there, it returns zero. So phi here starts empty, and then we return phi. This is the simplest feature vector you can come up with: the dimensionality is zero, because you have no features. But we can run this and see how we do. Okay, so let's run this.
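The empty feature extractor from this step, using defaultdict so that absent features read as zero, looks roughly like this. A sketch, not the homework's exact code.

```python
from collections import defaultdict

def extract_features(x):
    # Sparse feature vector: missing keys read as 0.0, so an empty
    # defaultdict is the "zero features" baseline from the walkthrough.
    phi = defaultdict(float)
    return phi

phi = extract_features("took Mauritius into")
```

With no features, the dot product with any weight vector is zero, so learning has nothing to update, exactly the behavior seen in the run below.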
Okay, so over a number of iterations you can see that learning isn't doing anything, because there are no weights to update. But it doesn't crash, which is good. I'm getting 72% error, which is pretty bad, but I haven't really done anything, so that's to be expected. Okay, where did my window go? Okay, so now let's start defining some features.

Remember what x is: something like "took Mauritius into". There's this entity with left and right context, so let's break this up. I'm going to write tokens = x.split(), which gives me a list of tokens, and then define left, entity, right: tokens[0] is the left context, which is going to be "took"; tokens[1:-1] is everything up to the last token; and tokens[-1] is the last token. Okay, so now I can define a feature template. Remember, a nice way to go about this is to define a feature template, so I can just say "entity is ___"; that's how I would have written it as a feature template, and in code this is actually pretty transparent. It says I'm defining a feature which is going to be 1 for this feature template: "entity is" some value. I plug in a value, I get a particular feature name, and I set that feature name to have the value 1. Okay, so let's run this. Oops, entity is a list, so I'm just going to turn it into a string.

Okay, so now what happens? The training error is pretty low; I'm basically fitting the training set pretty well. But remember, I don't care about the training error, I care about the test error. Just one note: it says "test" here, but it's really the validation set, which I should probably change; it's just whatever non-training set you passed in. Okay, so this is still at 20% error, which is not great.
Okay, so at this point, remember, I want to go back and get some intuition for what's going on, so let's look at the weights. This is the weight vector that's learned: for this feature the weight is 1, and all of these are 1, and these correspond to the person names that have been seen at training time, because whenever you see a person name, giving that feature weight 1 gets that training example right. And if you look at the bottom, these are the entities which are not people. Nice. So this is a sanity check that it's doing what it's supposed to do; with these kinds of really interpretable features, you can almost compute in your head what the weights should be. Yeah, question: so I essentially have one feature for every entity? Yes, roughly one per entity; there are about 3,900 features. We're going to change that; we're not done.

Okay, so the other thing we want to look at is the error analysis. This shows an example, "Romero": the ground truth is positive, but we predicted -1. And why did we predict -1? Because this feature has weight zero. Why does it have weight zero? Because we never saw this name at training time. But you do get some right: we saw "Senate" at training time, and we rightly predicted that it was -1. Looking at these errors, though, it seems like we should add more features. If you look at this example, maybe the context helps, right? Because if you have "governor ___", then you probably know it's a person, because only people can be governors. So let's add a feature, "left is ___", and for symmetry I'll just add "right is ___" as well. This defines indicator features on the context; in this case they would be "took" and "into".
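The three feature templates so far, the entity plus the left and right context, might be sketched like this (function name and the exact feature-name strings are illustrative, not necessarily the homework's):

```python
from collections import defaultdict

def extract_features(x):
    # x looks like "took Mauritius into": left context, entity, right context.
    phi = defaultdict(float)
    tokens = x.split()
    left, entity, right = tokens[0], tokens[1:-1], tokens[-1]
    # Template 1: the whole entity string (a list of tokens, so join it first).
    phi["entity is " + " ".join(entity)] = 1.0
    # Templates 2-3: indicator features on the context words.
    phi["left is " + left] = 1.0
    phi["right is " + right] = 1.0
    return phi
```

Each template stamps out one indicator feature per value it sees, which is exactly the sparse "feature name set to 1" pattern described above.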
Okay, so now three feature templates. Let's go and train this model, and now I'm down to about eleven percent, so I'm making some progress. Oops, let's look at the error analysis. Okay, so now I'm getting this one correct; let's look at what else I'm getting wrong. "Smith is blamed", "Felix Mantilla": again, it hasn't seen this exact string (actually, maybe it did see this string before, but it still got it wrong). I think there's a general intuition, though, that even if you've never seen "Felix Mantilla", if you see "Felix" followed by something, chances are it's probably a person. Not always, but as we noted before, features are not meant to be deterministic rules; they're pieces of information which are useful. So let's go over here and define a feature for every possible word in the entity. Remember, the entity is a list of tokens which occur between left and right, and I'm going to say "entity contains word ___". Okay, so now let's run this again, and now I'm down to six percent error, which is a lot better. If you look at the error analysis, the Felix example I now get right.

So what else can I do? The general strategy I'm following here, which is not necessarily always the right one, is to start with very specific features and then try to generalize as you go. So how can I generalize this more? If you look at a word like "Kurdistan": if your word ends in "-stan", then maybe it's less likely to be a person. I actually don't know, but maybe suffixes and prefixes are helpful too. So I'm going to add features, "entity contains prefix ___" and "entity contains suffix ___", where, heuristically, I just take the first four characters and the last four characters of each word. Then run this again, and now I'm down to four percent error. I'm probably going to stop at this point. You can actually run on the test set, and we get four percent error there as well.

Yeah, I guess this was all planned out so that the test error would go down, but actually, more often than not, you add a feature that you really, really think should help, and it doesn't help, for whatever reason. Most of the time it doesn't move, but sometimes it can go up, if you add a really bad feature. The more features you add, generally, the lower the training error goes: all the algorithm knows is to drive the training error down, so it doesn't know when a feature doesn't generalize. So this was definitely the happy path. When you go and actually do machine learning, more often than not the test error will not go down, so don't get too frustrated; just keep on trying.

Question: are you expected to keep optimizing after, say, five percent error? It really depends; there's a limit to every data set. Data sets have noise, and you should definitely not optimize below the noise limit. One thing you might imagine is that you have an oracle, for example human agreement: if your data set is annotated by humans, and the humans disagree with each other, say, three percent of the time, then as a general rule you can't really do better than three percent error.

Okay, any other questions? Yeah: you do all your training, you're happy, and then you see the test error, say in a real application at the end, and you find it's not good. So what happens if you try the test set and it's not good? Well, there are many things that could have happened. One is that your test set might actually be different for some reason.
Maybe it was collected on a different day, and your performance just doesn't hold up on that test set. In that case, well, that's your test error. Remember, the test error, if you didn't look at the test set, is an honest representation of how good this model is, and if it's not good, well, that's just the truth: your model is not that good. In some cases there's a bug, something was misprocessed in a way that wasn't really fair, so there are cases where you want to investigate. If it's way off the mark, if I had gotten 70% error, then maybe something was wrong and you would have to go investigate. But if it's in the ballpark, then whatever it is, that's kind of what you have to deal with. What you also want to do is make sure your validation error is representative of your test error, so that you don't have surprises at the end of the day. I think it's fine to run on the test set just to make sure there are no catastrophic problems, but aggressive tuning on the test set is something I would warn against.

Yeah, question: is there any sort of standard for how you should split the data into train, validation, and test, like what percentage of your data you should allocate to each, or do you just randomize it? So the question is how to split into train, validation, and test, and it depends on how large your data is. Generally, people shuffle the data and then randomly split it into train, validation, and test, say 80%, 10%, 10% as a rough rule of thumb. There are cases where you don't want to do that: for example, cases where you want to train on the past and test on the future, because that simulates a more realistic setting. Remember, the test set is meant to be as representative as possible of the situations that you would see in the real world.

Yeah: there are 7,000 labeled examples here; did someone have to label them manually? So the question is how the data set was labeled. I personally did not label this data set; this is a standard data set that someone labeled. Sometimes these data sets come from crowd workers, sometimes they come from experts, and sometimes they come from grad students. It's actually a good exercise to go label data; I've labeled a lot of data in my life.

Okay, let's go on. Switching gears now, let's talk about unsupervised learning. So far we've talked about supervised learning, where the training set contains input-output pairs: you're given the input, and this is the output that your predictor should produce. But this is very timely: we were just talking about how fully labeled data is very expensive to obtain, because 7,000 examples is actually not that much; you can often have 100,000 or even a million examples, which you do not want to be sitting down and annotating yourself. So here's another possibility: unsupervised learning, where the training data only contains inputs. Unlabeled data is much cheaper to obtain in certain situations. For example, if you're doing text classification, there's a lot of text out there; people write a lot on the internet, and you can easily download gigabytes of text, all of it unlabeled, and if you can do something with it, you've turned it into gold, so to speak. The same goes for images, videos, and so on. It's not always possible to obtain unlabeled data: for example, if you have some device that produces data, and you only have the one device that you built yourself, you're not going to be able to get that much data. But we're going to focus on the case where you do have a basically infinite amount of data and you want to do something with it. So here are some examples I want to share with you. This is a classic example from NLP that
goes back to the early 1990s. The idea is word clustering: the input is a bunch of raw text, lots of news articles, and you put it into this algorithm, which I'm not going to describe; instead, we're going to look at the output. And what is the output? It returns a bunch of clusters, where each cluster has a certain set of words associated with it. If you look at the clusters, they're pretty coherent: roughly, the first cluster is days of the week, the second cluster is months, the third cluster is materials of some sort, the fourth cluster is synonyms of "big", and so on. The critical thing to note is that the input was just raw text. No one said, hey, these are the days of the week, learn them and I'll test you later. It's all unsupervised. On a personal note, this is actually the example that got me into doing NLP research when I was doing my master's, because I was looking at this and thought, wow, you can actually take unlabeled data and mine really interesting signals out of it.

More recently, there are these things called word vectors, which do something very similar: instead of clustering words, they embed words into a vector space. If you zoom in here, each word is associated with a particular position, and words which are similar actually happen to be close by in vector space: for example, these are country names, these are pronouns, these are years, months, and so on. So this is operating on a very similar principle. There are also contextualized word vectors like ELMo and BERT, if you've heard of those, which have been taking the NLP community by storm more recently.

On the vision side, you also have the ability to do unsupervised learning. This is an example from 2015 where you run a clustering algorithm, which is also jointly learning the features with a deep neural network, and it can identify different types of digits: zeros, and nines, and fours that look like nines, and threes, and fives that look like threes, and so on. Remember, this is not doing classification: you're not telling the algorithm, here's a five, here's a series of twos. It's just looking at examples and finding the structure: oh, these are kind of the same thing, and these are all the same thing. And sometimes, but not always, these clusters actually correspond to labels. Here's another example, with ships, and planes, and birds that look like planes. You can see this is not doing classification; it's just going by visual similarity.

All right, so the general idea behind unsupervised learning is that data has a lot of rich latent structure in it; by that I mean there are patterns in there, and we want to develop methods that can discover this structure automatically. There are multiple types of unsupervised learning, clustering and dimensionality reduction among them, but we're going to focus on clustering, in particular k-means clustering, for this lecture.

Okay, so let's get into it more formally. The definition of clustering is as follows: I give you a set of points, x1 through xn, and you want to output an assignment of each point to a cluster. The assignment variables are going to be z1 through zn, so for every data point there's a zi that tells me which of the k clusters it's in, 1 through k. Pictorially, this looks like the board here: say I have seven points, and I tell you, hey, I want you to cluster them into two clusters. Intuitively, you can kind of see there's maybe a left cluster over here and a right cluster over here. But how do we formulate that mathematically? Here's the k-means objective function; this is the principle from which we're going to derive the clustering. So k-means says that
every cluster is going to be associated with a centroid. I'm going to draw the centroid as a red square here; the centroid is a point in the same space as the data points, and it represents where the cluster is. Then I'm going to associate each of the points with a particular centroid, which I'll denote by a blue arrow pointing from the point to its centroid. These two kinds of quantities together represent the clustering: the locations of the clusters in red, and the assignments of the points to the clusters in blue. Of course, neither the red nor the blue is known, and that's what we're going to have to optimize. But first we have to define the optimization objective function.

Intuitively, what do we want? We want each point φ(xi) to be close to its centroid: for the centroid to be really representative of the points in its cluster, that centroid should be close to all the points in that cluster. This is captured by this objective function: I look at all the points, and for every point I measure the squared distance between that point and the centroid that the point is associated with. Remember, zi is a number between 1 and k, so it indexes which centroid, μ1 or μ2 say, I'm talking about; I'm looking at the squared distance between that centroid and the point. Yeah, question: how does each point get assigned to a centroid? That's going to be specified by the z's, which are going to be optimized over; a priori, you don't know them. Question: do we always have a pretty good idea of how many clusters there should be? In general, no; the number of clusters is another hyperparameter, something that you have to set before you run the k-means objective, so when you're tuning, you try different numbers of clusters and see which one works better.

So we need to choose the centroids and assignments jointly; hopefully this is clear: you just want to find the assignments z and the centroids μ that make this number as small as possible. So how do you do this? Well, let's look at a simple one-dimensional example and build up some intuition. We're in 1D now; we have four points, at 0, 2, 10, and 12. I want to cluster, and intuitively you'd think there should be two clusters here, so two centroids. Suppose I knew the centroids: say someone told you magically that the centroids in this example are at 1 and 11. Now you have to figure out the assignments. How would you do this? Well, let's assign this point: where should it go? You look at this distance, which is 1, and this distance, which is 11; which is smaller? 1 is smaller, so you say, okay, that's where I should go. Same with this point: 1 is smaller. For these, 11 is smaller. Mathematically, you're comparing the distance from each point to each of the centroids and choosing the centroid which is closest. And you can convince yourselves that this is how you would minimize the objective function if the centroids were fixed, because if you choose a centroid which is farther away, you just get a larger value, and you want the value to be as small as possible. (I don't know why the slide says otherwise; I think this should be a 1 here; I'll fix that.)

Okay, so let's do it the other way. Suppose I now have the assignments: I know these two points should be in one cluster and these two should be in a different cluster, cluster two, and now I have to place the centroids. Where should I place them? Should I place one here, or here, or here? If you look at the slide, what you're doing is saying: for the first cluster, I know 2 and 0 are assigned to that cluster, and the sum of the squared distances from those points to the centroid μ is this quantity, and I want that number to be as small as possible. And if you did the first homework, you know that whenever you have one of these squared-sum objectives, you should be averaging the points. So you can actually solve this in closed form: given the assignments, the first centroid should be at 1, the average of 0 and 2, and for the other cluster you average the two points, and that centroid should be at 11.

Yeah, question: what's the difference between a centroid and an assignment? When you're clustering, you have k clusters, so there are k centroids; in this case there are two, and those are the red squares. The assignments are the associations between the points and the centroids, so you have n assignments, and those are the blue arrows. Question: is k a hyperparameter? Yes, k here is a hyperparameter, the number of clusters, which you can tune.

Okay, so here's a chicken-and-egg problem. If I knew the centroids, I could pretty easily come up with the assignments, and if I knew the assignments, I could come up with the centroids, but I don't know either one. So how do I get started? The key idea here is alternating minimization. This is a general idea in optimization, and it's usually not a bad one; the principle is that you have a hard problem, and maybe you can solve it by alternating between two easy problems. So here's the k-means algorithm. Step one: you're given the centroids, now in more general notation, μ1 through μk, and I want to figure out the assignments, so for every data point, I'm going to assign that data point
to the cluster with the closest centroid. Here I'm looking at all the clusters, 1 through k, testing how far the point is from each centroid, and taking the smallest value; that's where I assign the point. Step two: flip it around. You're given the cluster assignments, z1 through zn, and now you're trying to find the best centroids. So you go through each cluster, 1 through k, and you set the centroid of the k-th cluster to the average of the points assigned to that cluster. Mathematically, this looks like: you sum over all the points i which have been assigned to cluster k, you add up their feature vectors, and then you divide by the number of things you summed over. So, putting it together: to optimize this objective function, the k-means reconstruction loss, first you initialize μ1 through μk randomly (there are many ways to do this), and then you just iterate: set the assignments given the centroids, and set the centroids given the assignments, alternating.

Yeah, question: I can see how this makes sense for, say, images, where if you read in a similar image as bytes, it looks the same; but for words, where words that are spelled totally differently can have the same semantic meaning, how do you accurately map them to the same location, to cluster around a centroid? So the question is: for images, distances in pixel space make some sense, but for words, you shouldn't be looking at, say, the edit distance; two synonyms like "big" and "large" look very different as strings but are somehow similar. This is something that word vectors address, which we're not going to talk about here; basically, you want to capture the representation of a word by its context. The contexts in which "big" and "large" occur are going to be similar, and you can construct context vectors that give you a better representation. But we can talk more offline.

Question: is this one of those things where you can get stuck in a local optimum, or is it guaranteed to converge? You can get stuck, and I'll show you an example in a second. Question: here you show a fixed number of iterations, but can you use some kind of stopping criterion, like the assignments don't change anymore? Yeah, this runs up to a fixed number of iterations T; typically you would monitor the objective function, and once it stops changing very much, you stop. Actually, the k-means algorithm is guaranteed to always converge to a local minimum; let me show you this demo, and I think it will make some things clearer.

So here I have a bunch of points. This is a JavaScript demo; you can go play around and change the points if you want; it's linked off the course website. And then I'm going to run k-means. So I initialize with these three centroids, and these regions are basically the points that would be assigned to each centroid; this is the Voronoi diagram of the centroids. And this is the loss function, which hopefully should be going down. Okay, so now I iterate. Iteration one: I assign the points to the clusters, so these get assigned to blue, this one gets assigned to red, these get assigned to green. Then step two is optimizing the centroids: given all the blue points, I put the centroid smack in the middle of those blue points, and the same with green and red. Notice that now these points are in the red region, so if I reassign, then they become red, and then I can iterate, and keep on going, and you can see that the algorithm eventually converges to a clustering where these points are blue, these points are red, and these are green. And if you keep on running, you're not going to make any progress, because if the assignments don't change, then the cluster centroids aren't going to change either. Let me actually skip this next part, since I was just going to do it on the board, but I think you get the idea.

So let's talk about the local minimum problem. K-means is guaranteed to converge to a local minimum, but it is not guaranteed to find the global minimum. If you think of this as a toy visualization of the objective function, by going downhill you can get stuck here, and you won't get to that point. For example, with a different random seed, say you initialize here, so all three centroids are over here: if I run this, now I get this other solution, which is actually a lot worse. Remember, the other one was 44, and this is 114; that's where the algorithm converges, and you're just stuck. So in practice, people typically try different initializations, run it from different random points, and then just take the best. There's also a particular way of initializing called k-means++, where you put down a point, then put down the next point as far away as possible, and so on; that spreads out the centers so they don't interfere with each other, and it generally works pretty well, but there's still no guarantee of converging to the global optimum.

Okay, any questions about k-means? Yeah, how do you choose k? You guys love these hyperparameter tuning questions. One thing you can do is draw the following picture: plot k against the loss that you get for each k.
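The alternating algorithm, with random restarts as a hedge against local minima, might be sketched like this on 1-D points. This is a toy sketch with my own function name and signature; in practice you would use something like scikit-learn's `KMeans`, which is vectorized and supports the k-means++ initialization:

```python
import random

def kmeans(points, k, iters=100, restarts=10, seed=0):
    """K-means on 1-D points via alternating minimization, keeping the
    best of several random restarts to mitigate bad local minima."""
    rng = random.Random(seed)
    best_loss, best_centroids, best_assign = float("inf"), None, None
    for _ in range(restarts):
        centroids = rng.sample(points, k)  # random initialization
        for _ in range(iters):
            # Step 1: assign each point to the closest centroid.
            assign = [min(range(k), key=lambda j: (p - centroids[j]) ** 2)
                      for p in points]
            # Step 2: move each centroid to the mean of its assigned points.
            for j in range(k):
                members = [p for p, z in zip(points, assign) if z == j]
                if members:  # leave empty clusters where they are
                    centroids[j] = sum(members) / len(members)
        # Reconstruction loss: sum of squared distances to assigned centroids.
        loss = sum((p - centroids[z]) ** 2 for p, z in zip(points, assign))
        if loss < best_loss:
            best_loss, best_centroids, best_assign = loss, centroids, assign
    return best_centroids, best_assign, best_loss
```

On the lecture's toy data [0, 2, 10, 12] with k = 2, this converges to centroids at 1 and 11, with a reconstruction loss of 4.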
Usually, if you have one cluster, the loss is going to be very high, and at some point it goes down, and generally you look for the point where the curve levels off, where the loss stops going down by very much; so you can monitor that curve. Another thing you can do is hold out a validation set, measure the reconstruction error on that validation set, and choose the k that minimizes it. It's just another hyperparameter that you can tune. Question: how is the training loss calculated? The training loss is this quantity: you sum over all your points, and for each point you take the distance between that point and its assigned centroid, you square it, and you add all those numbers up.

Okay, so to wrap up. Actually, I have a few more slides. In unsupervised learning, you're trying to leverage a lot of unlabeled data, and we can get around a difficult optimization problem by doing this alternating minimization. These last points will be quick. To summarize the learning section: we've talked about feature extraction, where I want you to think about the hypothesis class that's defined by a set of features; prediction, which boils down to what kind of model you're looking at (for classification and regression in supervised learning you have linear models or neural networks, and for clustering you have the k-means objective); loss functions, where in many cases all you need to do is compute the gradient; and generalization, which is what we talked about in the first half of this lecture, and which is really important to think about; the test set, remember, is only a surrogate for unseen future examples.

A lot of the ideas presented here are actually quite old. The idea of least squares for regression goes back to Gauss, when he was solving an astronomy problem; logistic regression came from statistics. In AI, there was actually some learning done even in the 1950s, for playing checkers, as I mentioned on the first day of class. There was a period where learning fell out of favor, but it came back with backpropagation in the mid-80s. In the 90s there was actually a lot more rigorous treatment of optimization and formalization of when algorithms are guaranteed to converge, and then in the 2000s people looked at structured prediction, and there was a revival of neural networks.

Some things we haven't covered here: feedback loops. Learning assumes a static view, where you take data, train a model, and then go generate predictions; but if you deploy the system in the real world, those predictions are actually going to come around and become data, and those feedback loops can cause problems that you might not be aware of if you're only thinking, here I am, doing my machine learning thing. Also: how can you build classifiers that don't discriminate? We often have classifiers minimizing the average loss over a training set, so by construction you're trying to drive down the losses on the common examples, but you often get situations where minority groups actually get pretty high loss, because they look different, almost like outliers, and the model isn't really able to fit them; the training loss doesn't care. There are techniques, like distributionally robust optimization, that try to get around some of these issues. There are also privacy concerns: how can you learn if you don't have access to the entire data set? There are techniques based on randomization that can help there. And interpretability: how can you understand what the algorithms are doing, especially if what you've learned is a deep neural network? There's work on all of this which I'm happy to discuss with you offline.

So, we've concluded three lectures on machine learning, but I wanted you to think about learning in the most general way possible, which is that programs should improve with experience. We've talked about linear classifiers and all the nuts and bolts of what are basically reflex models, but in the next lectures we're going to see how learning can be used in state-based models and also variable-based models. Okay, with that, that concludes today; next week, Dorsa will be giving the lecture on state-based models.