Stanford CS221 AI Lecture 3: Machine Learning 2 – Features, Neural Networks
Okay, welcome back everyone. This is the second lecture on machine learning. Just before we get started, a couple of announcements. Homework 1, Foundations, is due tomorrow at 11 p.m. Note that it's 11 p.m., not 11:59. And I would recommend everyone do a test submission early. It would be unfortunate if you wait until 10:59 and realize that you can't log into the website. If that happens, please don't bombard me with emails. You can resubmit as many times as you want before the deadline, so there's no penalty for submitting something just to check that it works. You are responsible for any technical issues you encounter, so please do the test submission early; then you'll have peace of mind and can go back to finishing your homework. Homework 2, Sentiment, is out. This is the homework on machine learning, and it will be due next Tuesday. And finally, there's a section this Thursday, which will cover backpropagation and nearest neighbors, and maybe an overview of scikit-learn, which might be useful for your project. So please come to that. Okay, so let's jump in. I'm going to spend a few minutes reviewing what we did last time, starting at the very abstract level and drilling down into the details. At the abstract level, learning is about taking a data set and outputting a predictor f, which takes an input x, for example an image, and outputs a label y, for example whether it's a cat or a truck. If you unpack the learner, we talked about framing it as an optimization problem, which captures what we want to optimize, what properties the predictor f should satisfy, apart from the optimization algorithm, which is how we accomplish that objective. The optimization problem from last time was minimizing the training loss. In symbols, the training loss, which depends on a particular weight vector w, is the average over all examples in the training set of the loss on that example with respect to w: TrainLoss(w) = average over (x, y) of Loss(x, y, w). We want to find the w that minimizes the training loss, the single w that makes sure, on average, all the examples have low loss. Okay. Looking at the loss functions, this is where it depends on what we're trying to do. For regression, the pertinent thing to look at is the residual, which, remember, is the model's prediction minus the true label. This is how much we overshoot. The loss is zero if the residual is zero, and it increases either quadratically, for the squared loss, or linearly, for the absolute deviation, depending on how much we want to penalize large deviations. For classification, the pertinent quantity is the margin, which is the score times the label y, which, remember, is plus one or minus one. The margin is a single number that captures how correct we are. A large margin is good; in that case we obtain either zero or near-zero loss. A margin less than zero means we're making a mistake. The zero-one loss captures that mistake with a loss of one, but the hinge loss and the logistic loss grow linearly, because that allows us to optimize the function better.
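To make these losses concrete, here is a minimal sketch in Python (my own illustrative code, not the course's codebase):

```python
import numpy as np

def squared_loss(w, phi, y):
    # Regression: residual = prediction - target.
    residual = np.dot(w, phi) - y
    return residual ** 2

def absolute_deviation_loss(w, phi, y):
    return abs(np.dot(w, phi) - y)

def hinge_loss(w, phi, y):
    # Classification with y in {+1, -1}: margin = score * y.
    margin = np.dot(w, phi) * y
    return max(1 - margin, 0.0)

def logistic_loss(w, phi, y):
    margin = np.dot(w, phi) * y
    return np.log(1 + np.exp(-margin))
```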
Question? I see the squared loss curve there for regression. What would the residual look like on that graph? Would it just be a point away from the regression curve? So there are multiple graphs here. Remember, last time we looked at the residual on a plot of x against f(x). Here's a line, here's a particular point, and the residual is the difference between the model's prediction f(x) and the actual point. This graph is different; it's visualizing a different space. I'll show you another graph in a second that might make this clearer. One way to think about it: the residual is a number. If your residual is two, then you're here, and this is the loss you pay. If the residual is minus two, you pay the same. The residual is the x-axis here, and the margin is the x-axis over there. Any other questions about this? When would you use the absolute value? So the question is, when would you use absolute deviation versus the squared loss? There's a slide from the previous lecture, which I skipped, that talks about this. Most of the time people tend to use the squared loss, because it's easier to optimize. But you also see absolute deviation. The squared loss penalizes large outliers a lot more, which gives it mean-like qualities, whereas the absolute deviation penalizes them less, so it's more like a median; that's the intuition. But the general point is that all of these loss functions capture properties of a desired predictor. They basically say: hand me a predictor, and I'll assess how good it is. This establishes what we want. Another comment: I'm presenting this loss minimization framework because it is so general. Basically anything you see in machine learning can be viewed as some sort of loss minimization. PCA, deep neural networks, different types of autoencoders: they can all be viewed as minimizing some loss function. That's why I'm keeping the framework general. So now let's go the opposite direction of generality: let's look at a particular example and put all the pieces together. Suppose we have a simple regression problem with three training examples: input (1, 0) with output 2, input (1, 0) with output 4, and input (0, 1) with output -1. How do we visualize what learning on this training set looks like? Let's form the training loss, which, remember, is the average of the losses on the individual examples. We're doing linear regression, so x is two-dimensional and phi(x) = x, and we're basically fitting two numbers, w1 and w2. If you plug these values of x and y into the squared loss, you get the following. For the first example, the dot product between w and phi(x) is just w1, because x2 is zero, and you subtract 2 and square it: (w1 - 2)^2. The same for the second point, with 4 instead of 2: (w1 - 4)^2. For the third point, w dot phi(x) is w2, because now x2 is active, minus the label -1, squared: (w2 + 1)^2. These are the individual loss functions, each of which says what it wants out of w. This one says: if w1 is 2, that's great, I get a loss of zero.
This one says: if w1 is 4, that's great, I get a loss of zero. Obviously you can't have both, and the goal of the training loss is to take the average so you can pick one w that is, on average, good for all the points. Now, this is a function of two dimensions, w1 and w2, so let me try to draw it on the board to give you some intuition. I'm going to draw w1 and w2. The first function is (w1 - 2)^2. What does this function want? It wants w1 to be close to 2, and it doesn't care about w2. I'm not sure how to draw this; it really requires 3D. Think of a bowl shape coming out of the board, if this direction is meant to be the loss: a bunch of parabolas that look like this. What about the second one? (w1 - 4)^2: basically the same thing but centered around 4, around this axis, again parabolas coming out of the board. And finally, the third is (w2 + 1)^2, which is happiest when w2 is -1: again parabolas coming out of the board here. So you add all three functions up, and what do you get? First of all, where do you think the minimum should be? One of the two intersections, where the first or second parabola crosses the red lines? Yeah, it's going to be some sort of intersection here. If you look at the w2 axis, it should definitely be -1, because that third function is the only one that cares about w2. And this one wants 2, this one wants 4, so the average is somewhere in between. You can work all of this out mathematically; I'm just giving the rough intuition. Now let me draw the level curves. They're going to be something like this, where again, drawn in 3D, it's a paraboloid coming out of the board; here is the lowest point, and as you venture away from it, the loss increases. Yeah? How do I get this middle point? One way: if you add those two functions up and plot it, the minimum turns out to be at 3. Intuitively, the squared loss, when you average, acts like a mean, so it's going to be somewhere in between. This is also related to one of the homework problems, so hopefully you'll get a better appreciation for it there. Question? Once we have the 3, how do we merge it with the negative 1? Do we need another addition? So the question is: once we have the 3, how do you merge it with the minus 1? The 3 concerns w1 and the minus 1 concerns w2, so you just add the functions together. In this particular example they don't interact; in general, they will. Could you quickly summarize exactly what's going on with this example? Yeah. This plot shows, for every possible weight vector (w1, w2), a point, and the amount the function comes out of the board is the loss. The loss function is defined in the slides right there, and all I'm doing is trying to plot it.
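Here is a minimal sketch in Python of this exact example (illustrative code; the step size is a hypothetical choice), which recovers the minimum near (3, -1):

```python
import numpy as np
import random

# The three training examples from the board: (phi(x), y).
examples = [(np.array([1., 0.]), 2.),
            (np.array([1., 0.]), 4.),
            (np.array([0., 1.]), -1.)]

def train_loss(w):
    return np.mean([(np.dot(w, phi) - y) ** 2 for phi, y in examples])

def train_loss_gradient(w):
    return np.mean([2 * (np.dot(w, phi) - y) * phi for phi, y in examples], axis=0)

# Vanilla gradient descent: every step touches every example.
w = np.zeros(2)
eta = 0.1                                  # step size (hypothetical choice)
for _ in range(200):
    w = w - eta * train_loss_gradient(w)
print(w)                                   # roughly [3, -1], as on the board

# Stochastic gradient descent (reviewed next): update on ONE example per step.
w = np.zeros(2)
for _ in range(500):
    phi, y = random.choice(examples)
    w = w - eta * 2 * (np.dot(w, phi) - y) * phi
print(w)                                   # noisier, but hovers around [3, -1]
```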
So the two parabolas intersect at those two points, and the loss is what comes out of the board; can you plot more of it? Unfortunately it's hard to draw 3D here, so what I'm trying to do is take each of the pieces and explain what each piece is trying to do. In general, you don't have to think about exactly how the training loss composes the individual losses; this is about as complex an example as we'll need to work through to understand it. But it gives you an idea of how to connect these pictures of individual parabolas with the picture of the actual training loss. Okay. For now, assume you have the training loss. It's some function of the parameters. How do you optimize it? You do some sort of gradient descent. Last time we talked about vanilla gradient descent, where you initialize with zero, compute the gradient of the entire training loss, and take an update step, repeatedly. The problem is that computing that gradient requires going through all the training examples; if you have a million training examples, that's really slow. So instead, we looked at stochastic gradient descent, which lets you pick an individual example and make a gradient step right away. Empirically, we saw in code that it can be a lot faster. There are cases where it can also be less stable, so in general there's some trade-off. But by and large, stochastic gradient descent really dominates machine learning applications today, because it's the only way to really scale to large data sets. Yeah? Apart from being able to scale, is there any advantage of stochastic gradient descent? Besides computation, another advantage is that your data might be coming in an online fashion, over time, and you want to update on the fly. So there are cases where you don't actually have all the data at once. Okay. That was a quick overview of the general concepts. Now, to set the stage for this lecture, I want to ask you the following question. Can we obtain decision boundaries that are circles by using linear classifiers? Remember, a decision boundary is the line or curve that separates the region of space classified positively from the region classified negatively. So we want something like this, where the axes are phi1(x) and phi2(x), and we want a boundary that looks like this, classifying these as positive and these as negative. Is that possible? Yeah? If you take the square of those inputs, then you get something linear. Okay, so you're saying yes, and there is a punch line there. It turns out you can actually do this, which on the surface seems kind of surprising, because we're talking about linear classifiers. But as we'll see, it really depends on what you mean by linear classifiers, and hopefully that will become clear. So we're going to start by talking about features, which will let us answer this question. Then we're going to shift gears a little and talk about neural networks, which are, in some sense, an automatic way to learn features.
Then we're going to show you how to train neural networks using backpropagation, hopefully without tears. And then we'll talk about nearest neighbors, which is another way to get really expressive models, and is much simpler in a way. Okay. Recall that we have the score: the dot product between the weight vector and a feature vector. The score drives prediction. If you're doing regression, you just output the score as a number; if you're doing binary classification, you output the sign of the score. So far we've focused on learning, which is how you choose the weight vector based on a bunch of data and how you optimize for it. Now we're going to focus on phi(x), and talk about how you choose these features in the first place. Feature engineering, or feature extraction, is a really critical and important part of the machine learning pipeline that often gets neglected, because when you take a class you say, okay, there's some feature vector, and then you focus on all these algorithms. But whenever you go and apply machine learning in the world, feature extraction turns out to be the main bottleneck. Neural nets can mitigate this to some extent, but they don't make feature extraction obsolete. Recall that a feature extractor takes an input, such as a string, and outputs a set of properties useful for prediction; in this case, a set of named feature values. Last time we didn't say much about this: we waved our hands and said, here are some features. So how do you approach this problem in general? What features do you include? Do you just start making them up, and how many do you have? We need a better organizational principle. Feature engineering is going to be somewhat of an art, so I'm not going to give you a recipe, but at least a framework for thinking about features. The first notion is a feature template. Informally, a feature template is a group of features that are all computed in the same way. This is a somewhat pedantic terminology point, but one I want you all to be aware of. A feature template is basically a feature name with holes in it. For example: length greater than blank. Remember, the concrete feature was length greater than 10; now we say length greater than blank, where the blank can be replaced with 10, 9, 8, or any number. It's a template that gives rise to multiple features. Last character equals blank, contains character blank: these are all examples of feature templates. So when you describe the features in your project or elsewhere, think about grouping them in terms of these blanks. Another example is pixel intensity at position blank, blank. Even if your input is what you consider raw, like an image, there's still implicitly a feature template: pixel intensity at position blank comma blank is a feature template that gives rise to a number of features equal to the number of pixels in the image. This is useful because maybe your input isn't just an image; maybe it's an image plus some metadata, and having this language for describing all the features in a unified way is really important for clarity. So, as I alluded to, each feature template maps to a set of features.
By writing last three characters equals blank, I'm implicitly saying: I'm going to define a feature for each value of the blank, and that feature gets the natural value of evaluating it on the input. So all of these are zero except the one for com, the actual last three characters of the input, which is one. In general, each feature template might give rise to many, many features: the number of possible three-character strings is the alphabet size cubed, which is a large number. So one question is how we represent this. Yes? A sparse vector. Yeah, good answer. Mathematically, it's really useful to think of this as a d-dimensional vector, just d numbers laid out, because that's convenient. But when you go to actually implement this stuff, you might not represent things that way. In particular, what are the ways you can represent a vector? You can use an array, which is just a list of numbers, but that's inefficient if you have a huge number of features. In the case of a sparse feature vector, meaning only very few of the feature values are non-zero, you're better off representing it as a map, in Python a dictionary, where the feature name is the key and the value is that feature's value. Homework 2 will basically work in this sparse-feature framework. One note: in NLP especially, where we have discrete objects, it's traditionally been common to use these sparse feature maps. One thing that has happened with the rise of neural networks is that you often take your inputs and embed them into some fixed-dimensional vector space, and dense feature representations have become more dominant. But sparse features are still a good way to go if you want to use linear classifiers, so it's important to understand this. Now, instead of storing a possibly huge vector, you just store the keys and values of the non-zero features. So feature templates: the overall point is that they're an organizational principle.
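As a concrete sketch of this sparse representation (hypothetical templates and names, in the spirit of the homework but not its actual interface):

```python
def extract_features(x):
    """Map a string x to a sparse feature vector: a dict from feature name
    to value, where absent keys mean zero."""
    phi = {}
    phi["length>10"] = 1 if len(x) > 10 else 0       # template: length > ___
    phi["lastThreeChars=" + x[-3:]] = 1              # template: last three chars = ___
    for c in set(x):
        phi["contains_" + c] = 1                     # template: contains character ___
    return phi

def sparse_dot(w, phi):
    # Only iterate over the non-zero features; w is also a dict.
    return sum(w.get(name, 0.0) * value for name, value in phi.items())

phi = extract_features("abc@gmail.com")
print(phi["lastThreeChars=com"])   # 1; all other fillings of the blank are implicitly 0
```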
And now let's switch gears a little bit. What features, or feature templates, should you actually write down? To get at that, I want to introduce another notion that's pretty important, especially in the theory of machine learning: the hypothesis class. Remember we have this predictor: for a particular weight vector, it defines a function that maps inputs to some sort of score or prediction. The hypothesis class is just the set of all predictors you can get by varying the weight vector. Let me give you an example; I'll come back to this slide. Suppose you're doing linear regression in one dimension: here is x, and here is y. If your feature map is just the identity, phi(x) = x, then the set of functions you get is all lines through the origin: for every possible value of w1, you have a slope, and you also have zero. So your hypothesis class F1 is essentially all lines that go through the origin. The point is that when you write down a feature vector, you're implicitly committing to considering all possible predictors defined by that feature map. Here's another example: define the feature map to be phi(x) = [x, x^2]. Now what functions can I get? Reading it off the slide: all quadratic functions, and in particular, because I don't have a bias term, all quadratic functions that go through the origin. Let me draw another graph: they look like this, could be upside down, and so on; I'm not going to draw all of them. In particular, the class also includes the linear functions, because I can always set w2 = 0 and vary w1. That means F2, as a set of functions, is a larger set than F1. It's more expressive; that's what we mean by expressive: it can represent more things. So for every feature vector, think also about the set of functions you can get from it. Question? Are the more expressive sets harder to optimize? The short answer is not necessarily. Sure, you have more features, so it's more expensive at that level. But the difficulty of optimization depends on a number of factors, and sometimes adding more features makes optimization easier, because it's easier to fit the training data. Okay, so now let's go back to this picture, with these concrete examples of hypothesis classes on the board. Think of this big blob as the set of all predictors: any predictor in your wildest dreams is in this set. Whenever you go and define a feature map, that carves out a much smaller set of functions. And what is learning doing? Learning is choosing a particular element of that function family based on the data. So this picture shows the full pipeline of machine learning: you first declare, structurally, a set of functions you're interested in, and then you say, okay, based on data, let me search through that set and find the one that's best for me. So now there are two places where things can go wrong. In feature extraction, maybe you didn't have enough features, so your purple set is too small, and no matter how much learning you do, you're just not going to get good accuracy. Conversely, even if you define a nice hypothesis class, if you don't optimize properly, you won't find the element of that class that fulfills your goals.
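A small sketch of these two hypothesis classes, with hand-picked members (illustrative code only):

```python
import numpy as np

def phi1(x):
    return np.array([x])            # F1: lines through the origin

def phi2(x):
    return np.array([x, x ** 2])    # F2: quadratics through the origin (contains F1 via w2 = 0)

def f(w, phi, x):
    return np.dot(w, phi(x))        # a predictor = feature map + weight vector

print(f(np.array([2.0]), phi1, 3.0))         # 6.0: the line 2x
print(f(np.array([2.0, -1.0]), phi2, 3.0))   # -3.0: the quadratic 2x - x^2
```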
Question? The feature function phi that extracts features from the input: since that is itself a function, couldn't the weights compute that function too? So: you define a function phi, but it is fixed. Then learning sets the weights, and together they jointly specify a particular function or predictor. If you don't choose phi appropriately, you're limiting the space of things you'll be able to predict. But my intuition tells me the point of learning is that, regardless of the phi you choose, the model should be able to learn the phi you would have picked. Oh, I see. So the question is: does learning compensate and figure out the phi you would have picked? The short answer is no; phi really is a bottleneck here. For example, if you define phi(x) = x, then a linear function is all you're going to get. If your data moves around in a sinusoidal way, you'll fit a line through it and get horrible accuracy, and no amount of learning can fix that. The only way to fix it is to change your feature representation. Does that assume the model is linear? Yes, all of this assumes we're talking about linear predictors. But the same general idea applies to any function family, neural nets included. The equivalent there is that not just the feature map but also the neural net architecture constrains what you can express: if you have only a two-layer neural network of fixed size, there are some things you just can't express. Yeah, next question. Why perform featurization at all, rather than taking in the raw data and using a neural net? It's still built from linear classifiers, but it has enough complexity to describe nonlinear behavior. So the question is: why bother doing feature engineering when neural nets have basically solved that? To some extent, the amount of feature engineering you have to do today is much less. One thing that's still important to think about is what sources of information you want to predict from. For example, if you want to predict some property of a movie review, the first-order bits are: what even goes into the input? Does the text go in? Do you have metadata, other star ratings? Those are features. And I'd say there's no such thing as truly raw, because there's always some code that takes the world and distills it down into something that fits in memory; you can think of that as feature extraction. Okay, one last question in the middle. What is the problem with too many features? Why don't you want your hypothesis class to be too big? Is it the overfitting thing? Yeah. So the question is: why not make phi as large as possible and throw in all the features? Overfitting is one of the main concerns there, which we'll come back to in the next lecture. Great questions. Let's skip over this: there's another type of feature function you can define, but in the interest of time, I'm going to skip it. Now let's come back to this question. I keep saying linear predictors; what is linear? Remember, the prediction is driven by the score. So here's the question: is the score linear in w? Yes, because a linear function is basically some weighted combination of its inputs. Is it linear in phi(x)? By symmetry it should be, because it's just a dot product. Is it linear in x? No. In fact, this question doesn't even make sense; think about what x is.
x, remember, is a string; it's not even a number. That's how you know the answer should be no: there's a type error. Okay, so here's the cool thing. These predictors can be expressive, non-linear functions and decision boundaries of x, in the case where x is actually a real vector. But the score is a linear function of w. This is cool because there are two perspectives. From the point of view of actually doing prediction, you think about how the function operates on x, and you can get all sorts of crazy functions coming out: we just looked at quadratic functions, which are clearly not linear, and you can do much more. But from the point of view of learning, it doesn't care about x; all it sees is phi(x). Learning asks how the function depends on w, because it's tuning w, and from that perspective it's a linear function of w. For reasons I'm not going to go into, these functions permit efficient learning, because the loss function becomes convex; that's all I'll say about that. One cool way to visualize what's going on is to go back to our circles example. Remember, we want a two-dimensional classification problem where the true decision boundary is, say, a circle. How do we fit that, and what does linear even mean here, since linear sounds like it should be a line? Here's a nice graphic. Here are the points inside the circle, and they can't be separated by a line. But the point is that the feature map lifts these points into a higher-dimensional space; I'll have three features. And in that higher-dimensional space, things are linear: I can slice it with a knife, and the cut it induces in the lower-dimensional space is the circle. Okay. Hopefully that was a nice visualization showing how you can get non-linear functions out of essentially linear machinery. So the next time someone says linear classifiers are really limited and you really need neural nets, that's technically false, because you can actually get really expressive models out of linear models too. The point of neural networks is not that they're necessarily more expressive; it's that they have other advantages, for example the inductive bias that comes with the architecture, and the fact that they're more efficient as you go to more expressive models, and so on.
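Here is a minimal sketch of the circle idea, with hand-set weights (my own illustrative choice of lift):

```python
import numpy as np

def phi(x):
    # Lift the 2D input into 3D: [x1^2, x2^2, 1] (one possible choice of features).
    return np.array([x[0] ** 2, x[1] ** 2, 1.0])

# Hand-set weights: score = 1 - x1^2 - x2^2, positive exactly inside the unit circle.
# The score is linear in w and phi(x), but the decision boundary in x is a circle.
w = np.array([-1.0, -1.0, 1.0])

for x in [(0.2, 0.3), (0.9, 0.9), (-0.5, 0.1)]:
    print(x, np.sign(np.dot(w, phi(x))))
# (0.2, 0.3) -> +1 (inside), (0.9, 0.9) -> -1 (outside), (-0.5, 0.1) -> +1
```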
Okay. So to wrap things up, I want to do a simple exercise. Here's a task: imagine that for your final project you want to predict whether, given two consecutive messages in some forum or chat, the second one is a response to the first. It's binary classification: the input is two messages, and you're asked to predict whether the second is a response to the first. We're going to go through the exercise of coming up with features, or feature templates, that pick out properties of x that might be useful, assuming we're dealing with linear predictors. So what are some features that might be useful? Let's start with a few. What about time elapsed between the two messages? Is that a useful feature or not? Let me just say yes: this information is definitely good. One subtle point is that time elapsed is a single number, and this number goes into the score in a linear fashion. What does that mean? If I double the time, then its contribution to the score multiplies by two. So it's like saying that as I increase the time, it becomes linearly more likely to be, say, a response or not a response. And that's maybe not what you want: if the time elapsed is a year, that really dominates the score function, as if a year gap were far stronger evidence than a one-minute gap, which is not what you want. Yeah, question. Can you normalize this feature? So you have to be careful with normalization. If you normalize over a span of, say, one year, now there's no difference between five seconds and one minute, because everything gets squashed to zero. One way to approach this is to discretize the feature. A trick people often use, when you have a numerical value you really want to treat in a sensitive way, is to break it up into pieces. The feature template would look like: time elapsed is between blank and blank. So: is it between zero and five seconds, between five seconds and a minute, between a minute and an hour, between an hour and a year, and after that it doesn't matter. This injects domain knowledge that tells you what to look out for: the difference between a year and a year plus two seconds really doesn't matter, whereas the difference between one second and five seconds might be significant. This is all a long way of saying that if you're using linear classifiers, or even neural networks, it's really important to think about how your raw features enter the system, and to ask: if I scale this feature up, does the prediction change in a way I expect?
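Here is what that discretization trick might look like in code (hypothetical bucket boundaries, roughly matching the ones just named):

```python
# Buckets in seconds: 0-5s, 5s-1min, 1min-1hr, 1hr-1yr, beyond a year.
YEAR = 365 * 24 * 3600
BUCKETS = [(0, 5), (5, 60), (60, 3600), (3600, YEAR), (YEAR, float("inf"))]

def time_elapsed_features(elapsed_seconds):
    phi = {}
    for lo, hi in BUCKETS:
        if lo <= elapsed_seconds < hi:
            phi["timeElapsedIn[%s,%s)" % (lo, hi)] = 1   # exactly one bucket fires
    return phi

print(time_elapsed_features(42))   # {'timeElapsedIn[5,60)': 1}
```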
Yeah, question. If we use that feature, what prevents us from adding, say, zero-to-five seconds, then 30-to-40 seconds, and so on: the entire grid? So the question is: if you have every possible range, isn't that an infinite number of features? There are two answers. One is that even if you did that, you might still be okay. If you discretize the space of time-elapsed values, you're saying: for every bucket, I'm going to have a feature. It's true that in principle that's an unbounded number of features, but at some point you can just cut it off, and even if you don't, with a sparse feature representation you don't need a preset maximum, because most of these features are going to be zero: the chance of some data point being, say, ten years is essentially nil. The other answer is that, in general, when a feature spans multiple time scales, you want to space the buckets out logarithmically: two to four, four to eight, and so on, so that you get sensitivity at the low end while still covering a large range of magnitudes. Yeah, in the back. Is it possible to learn how to discretize the features, to make the most important cuts? There are definitely more automatic things you can do besides hand-specifying them. At some level, though, you have to input the value in some form, like x versus log of x, and those choices can often make a big difference, although more expressive models like neural networks can mitigate some of this. Yeah. I see the value in changing time elapsed from a number to a boolean for whether it falls in a range. When would you want to retain the numerical value; when would you not discretize? Good question. Essentially, when you expect the scale of that feature to really matter. If you think something behaves linearly, then you want to preserve the linearity. Or if you think it behaves quadratically, then keep the feature but also add a squared term. I want to move on; these are all good questions, happy to discuss more offline. Some other feature templates: first message contains blank, where blank is a string. Maybe things like question marks are indicative of the second message being a response. Second message contains certain words. Two messages both contain a particular word: there are cases where what matters isn't the presence or absence of particular words in the individual messages, but the fact that they share a common word. That might be useful. Here's another: two messages have some number of words in common. This feature is kind of interesting. When I say feature here, I actually mean feature template, and for the both-contain-a-word template there are many, many features, one for essentially every possible word. That again leads to cases with a lot of sparsity, where you might not have enough data to fit all the features. Whereas the overlap-count template is very compact: it says, I just look at the number of words in common. The two messages might share a word I've never seen before, but I know it's the same word and can recognize that pattern. So there's quite a bit you can do to play around with features that capture your intuitions about what might be relevant to your task.
Question? For a lot of these sparse features, like the third, fourth, and fifth ones here, would we want to do dimensionality reduction, knock out some of those many, many features? So the question is: when you have a lot of sparse features, do you want dimensionality reduction? Not necessarily. In terms of computation, having sparse features doesn't mean things will be really slow, because there are efficient ways of representing them. In terms of expressivity, in a lot of NLP applications you actually do want a lot of features, and you can handle a lot more features than you might think, because the first-order bit is just being expressive enough to even fit the data. Okay, let me move on, since I'm running short on time. Summary so far: we're looking at features. We can define feature templates, which organize features in a meaningful way. Then we talked about hypothesis classes, which are defined by the features, and which determine what is possible out of learning. And all of this is in the context of linear classifiers, which, incidentally, can actually produce nice non-linear decision boundaries. At this point you have enough tools to do a lot. But in the next section I want to talk about neural networks, because these are even more expressive models, which can be more powerful. One thing I often recommend: when you're given a problem, always try the simplest thing first. Try the linear classifier and see where it gets you; sometimes you'd be surprised how far you can get with linear classifiers. Then increase the complexity as you need it. I know there's sometimes a temptation to try the fancy new hammer, but keeping it simple is really good. Okay, so neural nets. There are a couple of ways of motivating these. One motivation comes from the brain; I'm going to use a slightly different one, which comes from the idea of decomposing a problem into parts. This is a somewhat contrived example, but hopefully it will let us build up intuitions for what's going on in a neural network. Suppose I'm building some sort of system to detect whether two cars are going to collide. I have one car at position x1 driving this way, and another car at position x2 driving this way, and I want to determine whether the situation is safe, which is positive, or whether they're going to collide. Let's suppose for simplicity that the true function just measures whether the cars are at least one unit apart: safe when |x1 - x2| >= 1. This is a little like what we did last lecture, where we posited a true function and checked whether learning could recover it. In practice we obviously don't know the true function, but this is for pedagogical purposes. Just to make sure we understand what function we're talking about: if x1 is 1 and x2 is 3, like on the board, then it's +1. That's like driving in the US. The other way around is like driving in the UK, and that's fine too. But if you're too close together, that's bad news.
Okay. So let's think about decomposing the problem. This could be a complicated function, but let's try to break it down into linear pieces, because at the end of the day neural networks are just a bunch of linear functions stitched together with some non-linearity; there are linear components that are critical to neural nets. One subproblem is detecting whether car 1 is to the far right of car 2: is x1 - x2 >= 1? Another subproblem is testing whether car 2 is to the far right of car 1. Then you can put these together by saying: if at least one of them is 1, I'm going to predict safe; otherwise, not safe. Here's a concrete example. For (1, 3), car 2 is to the far right of car 1, so that's a 1. You add these up, take the sign, and get +1. In the opposite direction it's still fine. And in this case, both h1 and h2 are 0, so that's bad news. So this just takes the expression for the true function and writes it in a more modular way, with different pieces corresponding to different computations. Okay, so now we could just write this down to solve the problem, but that's because we already knew the right answer. Suppose we didn't know the true function and just had data: can we learn these functions automatically? What I'm going to do is define a feature vector phi(x) = [1, x1, x2], and then rewrite the first intermediate subproblem as follows: x1 - x2 >= 1 is represented as v1 dot phi(x) >= 0, where v1 = [-1, +1, -1]. Pause for a second: you can verify that v1 dot phi(x) = -1 + x1 - x2, so this is exactly x1 - x2 >= 1, just written as a dot product, and you can see how this is moving toward something more general. Now, why is there this 1 in phi? It pairs with what's typically known as a bias term, which allows you to threshold not just at zero but at any arbitrary number. In the linear classifiers I've talked about, I've kind of swept this under the rug; generally, you always have a bias term, which modulates how likely you are to predict +1 versus -1. You can do the same for h2; it's the same thing with the roles of x1 and x2 switched. And the final sign prediction can be written as follows, where these are just weights w on h1 and h2. So now here is the punch line. For a neural network, we're just going to leave v1, v2, and w as unknown quantities that we try to fit through training. We motivated this problem by saying that in this case, there's some choice of v1, v2, w that works. But now we're generalizing: if we didn't know these quantities, we'd just leave them as variables, and we can actually still fit these parameters. Before, we were tuning only w; now we're tuning both v and w. The v's specify the choice of the hidden subproblems we're interested in, and w governs how we take the results of those subproblems and come to a final prediction. There's one problem here, which is that the gradient of h1 with respect to v1 happens to be zero. If the horizontal axis is v1 dot phi(x) and the vertical axis is h1, the function looks like the step function, because it's the indicator of some quantity being at least zero: 1 over here, 0 over here. And remember, we don't like zero gradients, because SGD doesn't work. The solution here is to take some sandpaper and sand this function down to smooth it out, and then you get something that is differentiable.
So the logistic function is a smoothed-out version of the step function: it rises in the middle, and it never quite hits one or zero but gets extremely close. You can think about it as a smooth version of the step function. It behaves and looks like the step function, serving the same intuition of testing whether some quantity is greater than zero, but it has no zero gradients anywhere. And you can double-check that if you take the derivative, it has a really interesting and nice form: the value of the function times one minus the value of the function. The value of the function never hits zero or one, so this quantity never hits zero. So now we can define neural nets, in contrast to linear functions. Remember the linear functions: we can visualize inputs going in, each input weighted by some w, and out comes the score. That's what a linear function looks like. A neural network with one hidden layer and two hidden units looks something like this: you have these intermediate hidden units, each the sigmoid, or to be concrete the logistic function, applied to vj dot phi(x). So h1 takes the input, multiplies it by a vector to get some number here, and sends that through the logistic function to get some number. Finally, you take the outputs h1 and h2, take the dot product with w, and get the final score. Again, the intuition is that neural nets are trying to break the problem down into a set of subproblems, where the subproblems are the results of these intermediate computations. You can think of h1 as the output of a linear classifier, h2 as the output of another linear classifier, and then you're taking those outputs and sticking them through yet another linear classifier to get the score. So this is what I mean by: at the end of the day, it's linear classifiers packaged up and strung together, and the expressive power comes from the composition. Yeah, question. How do you get h sub j when there are multiple phis? There's only one phi(x). This is the first component of phi(x): this vector is a three-dimensional vector, phi(x), and it has three components. Yeah. So the interpretation is that each hj is effectively a new feature, a function of the original features you put in, making new features that are better than the ones you started with? Yeah, and that's my next point: one way you can think about it is that the hj's are just features which are learned automatically from data, as opposed to a fixed set of features phi. Because at this layer, all w ever sees is these h's coming through, which look like features. And for deeper neural networks, you just keep stacking: the outputs of one set of classifiers become the features for the next layer, and then the outputs of that layer become the features for the layer after, and so on.
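Here is a minimal sketch of this one-hidden-layer network on the car example. The v's come from the worked example; the scale factor of 5 and the 0.5 offset are my own additions, so the sigmoids act nearly like steps and "no subproblem fired" maps to unsafe:

```python
import numpy as np

def sigma(z):                             # the logistic function
    return 1 / (1 + np.exp(-z))

def phi(x1, x2):
    return np.array([1.0, x1, x2])        # [bias, x1, x2], as on the slide

v1 = 5 * np.array([-1.0, +1.0, -1.0])     # h1: is x1 - x2 >= 1 ?  (scale 5: my choice)
v2 = 5 * np.array([-1.0, -1.0, +1.0])     # h2: is x2 - x1 >= 1 ?
w = np.array([1.0, 1.0])

def predict(x1, x2):
    h = sigma(np.array([np.dot(v1, phi(x1, x2)), np.dot(v2, phi(x1, x2))]))
    score = np.dot(w, h) - 0.5            # safe iff some hidden unit fired (offset: my choice)
    return np.sign(score)

print(predict(1, 3), predict(3, 1), predict(1, 0.5))   # 1.0, 1.0, -1.0
```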
And the intuition for deeper networks is that as you proceed, you can derive more and more abstract features. For example, with images, you start with pixels, then you find edges, then object parts, and then things which are closer to the actual classification problem. Yeah? Won't h1 and h2 develop the exact same value? Do you have to have a bias to start with? Yeah, that's a good question: why don't h1 and h2 basically end up in the same place, by symmetry? If you're not careful, that will happen: if you initialize all your weights to zero, or initialize them the same way, then they will move in lockstep. So what's typically done is random initialization, which breaks the symmetry. Then, because you're doing this joint learning, the network tends to learn subproblems that are complementary. Yeah, final question. How do you choose the sigmoid function? In general, these non-linear functions are called activation functions, and the important thing is that they're non-linear. I chose this particular logistic function because it's the classic neural-net activation, and it looks like the step function, taking a score to something like a classification result. I should responsibly note that these are maybe less in style than they used to be; the cool thing to do now is to use what is called a ReLU, or rectified linear unit, which looks like this. You might ask why this one. There's no one reason, but this function has less of the gradient-going-to-zero problem, and it's also simpler, because it doesn't require exponentials. I'm going to just leave it at that; the benefit of the logistic function here is pedagogical, and it's a little bit of a throwback too. If you read the notes and the lecture slides, there are more details on why you might choose one versus another. Okay, so now we're ready to do neural net learning. Remember, we have this optimization problem: the training loss, which now depends on both V and w, and which, remember, is the average of the losses on individual examples. The loss on an individual example, say for regression, is the squared difference between y and the function value, and the function value is the sum over the last-layer weights times the activations of the hidden layer. That's basically it. So what we're going to do is compute this gradient. You look at this and say: okay, with enough scratch paper, you can probably work it out. I'm going to show you a different way to do this, without grinding through the chain rule, based on the computation graph. It will give you additional insight into the structure of the computation, and let you visualize what a gradient means in some sense. It also happens that the computation graph is really at the foundation of all the modern deep learning frameworks like TensorFlow and PyTorch, so this is a real thing. It turns out that when I've taught this, many people still prefer to grind out the math; I can't really tell why, except maybe familiarity. So I would encourage everyone to at least try thinking about the computation graph as a way to understand gradients, even if initially it might not be faster.
And it's not to say that you always have to draw a graph to compute gradients, but doing it a few times might give you insight you wouldn't otherwise get. Okay, so here we go. We can think about functions as just boxes: some inputs go in, and you get an output. That's all a function is. Partial derivatives, or gradients, ask the following question: how much does the output change if the input changes a little bit? For example, take this function that computes 2 in1 + in2 in3. You take in1 and add a little epsilon, say 0.001, then read out the output and ask what happened to it. In this case, the output changed by 2 epsilon, additively. So you conclude that the partial derivative of this function with respect to in1 is two, because the gradient is the amplification: if I put in epsilon and get 2 epsilon out, the partial derivative is two. Okay, let's do this one. If I add epsilon to in2, then simple algebra shows the output changes by in3 times epsilon. So what's the partial derivative with respect to in2? in3, right? Good. You could have done the basic calculus and gotten that, but I really want to stress the interpretation of perturbing inputs and witnessing the output, because I think it's a useful one. Okay. So now, most of the functions we're interested in in this class are going to be made out of five building blocks. Each of these pieces is a function with inputs a and b: you pump them in and get some output. They are plus, minus, times, max, and the logistic function. On the edges, I'm going to write, in green, the partial derivative with respect to the input going into the function along that edge. So let's do this. If I have the function a + b, the partial derivative with respect to a is 1, and the partial derivative with respect to b is 1. If you have minus, then it's 1 and -1. If you have times, then the partials are b and a. Everyone following so far? Okay, max; this is maybe a little trickier. Remember, we experienced the max last time in the context of the hinge loss, as the max of two functions. Let's say one input is a and the other is b, and suppose a is 7 and b is 3, so a is greater than b. Now if I change a by a little bit, that change is going to be reflected in the output of the max function. And b is the smaller one, so it doesn't matter: if I change b by a little bit, say to 3.1 or 2.9, does the output change? No, so that partial derivative is going to be zero. So the max function's partial derivatives look like this: if a > b, the partial with respect to a is 1 and with respect to b is 0, and conversely, if b > a, it's 0 and 1. The partial of max is always one or zero, depending on that particular condition. And then the logistic function: this is just a fact you can derive in your free time, and it was on the previous slide; its derivative is the logistic function times one minus the logistic function.
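The perturbation interpretation is easy to check numerically. A small sketch (my own illustrative code) comparing finite differences against the claimed partials:

```python
import numpy as np

eps = 1e-6

def check(f, inputs, i, claimed):
    # Perturb input i by eps and compare the output change against claimed * eps.
    bumped = list(inputs)
    bumped[i] += eps
    numeric = (f(*bumped) - f(*inputs)) / eps
    print("numeric %.4f vs claimed %.4f" % (numeric, claimed))

a, b = 7.0, 3.0
check(lambda a, b: a * b, [a, b], 0, b)          # d(a*b)/da = b
check(lambda a, b: max(a, b), [a, b], 0, 1.0)    # a > b, so d(max)/da = 1
check(lambda a, b: max(a, b), [a, b], 1, 0.0)    # a > b, so d(max)/db = 0

sigma = lambda z: 1 / (1 + np.exp(-z))
check(sigma, [2.0], 0, sigma(2.0) * (1 - sigma(2.0)))   # logistic: sigma * (1 - sigma)
```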
So now you have these building blocks, and you can compose them and build castles out of them. It turns out that basically all the functions you see in deep learning are built out of these blocks. And how do you compose things? If an input goes into one function, and that output goes into another function, then the partial derivative of the final output with respect to the input is just the product of the partial derivatives along the way. This is just the chain rule. You can think of it as amplification: if this function amplifies by two, and this function amplifies by five, then the total amplification is two times five. All right, so now let's take an example: binary classification with the hinge loss, just as a warm-up. I'm going to draw the computation graph and then compute the partial derivative with respect to w. So what is this graph? I have w dot phi(x); that's the score. Times y; that's the margin. One minus the margin, and the max of that with zero is the loss. Now, for every edge, I can write down the partial derivative. Here, remember, the partial for the max is 1 if the left branch is bigger, that is, if 1 - margin > 0. For the minus, it's -1. For each times, it's whatever is on the other branch. And by the chain rule, if you multiply what's on all the edges, you get the gradient of the loss with respect to w. So this is a graphical way of doing what I did last time: if the margin is greater than one, everything is zero; if the margin is less than one, you perform that particular update. Okay. In the interest of time, I'm not going to do it for the simple neural network; you'll do that in section. But at a high level, you do the same thing: multiply all the blue edges, and you get the partial derivatives.
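Here is that hinge-loss graph written out as an explicit forward and backward pass (my own sketch; the example values are made up):

```python
import numpy as np

phi = np.array([1.0, 2.0])    # hypothetical feature vector
y = 1.0                       # label in {+1, -1}
w = np.array([0.5, -0.5])

# Forward pass: evaluate each node, from the leaves up.
score = np.dot(w, phi)        # -0.5
margin = score * y            # -0.5
one_minus = 1 - margin        # 1.5
loss = max(one_minus, 0.0)    # 1.5 (hinge)

# Backward pass: multiply the partials along each edge, from the root down.
g_loss = 1.0
g_one_minus = g_loss * (1.0 if one_minus > 0 else 0.0)   # max edge: 1 or 0
g_margin = g_one_minus * (-1.0)                          # minus edge
g_score = g_margin * y                                   # times edge: the other branch
g_w = g_score * phi                                      # times edge: the other branch

print(g_w)   # [-1. -2.], matching the closed form: -phi * y when margin < 1, else 0
```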
Okay. So far we've done everything manually. Now I want to systematize this and talk about an algorithm called backpropagation, which computes gradients for arbitrary computation graphs. That means for any function you can build out of these building blocks, you can just get the derivatives. One nice thing about packages like PyTorch or TensorFlow is that you don't have to compute the derivatives on your own. It used to be the case that people had to implement these derivatives by hand, which is really tedious and error prone, and part of why it's become so easy to develop new models is that all of this is done for you automatically.

So backpropagation computes two types of values: a forward value and a backward value. The forward value fi, for every node i, is simply the value of the expression at that node. And the backward value gi is the partial derivative of the output with respect to the value of that node. So for example, fi here would be w1 times the logistic of v1 · φ(x), and the gi of that node is the product of all the edges above it: basically, how much does this node change the output at the very top? The algorithm itself is quite straightforward. There's a forward pass which computes all the fi's, and then a backward pass that computes all the gi's. In the forward pass, you start from the leaves and go to the root, computing each value recursively from its subexpressions. In the backward pass, you similarly have a recurrence: the gi of a particular node is the gi of its parent times whatever is on the edge between them. So you take a forward pass and fill in all the fi's, and then you take a backward pass and fill in all the gi's you care about. A section will go through this in detail, since I realize this might have been a little quick, and there's a small code sketch below.

One quick note about optimization. Now you have all the tools to run SGD, which doesn't really care what the function is: you have a function, you can compute its gradient, and that's all you need. But one important thing to note is that just because you can compute a gradient doesn't mean you can optimize the function well. For linear predictors, it turns out that if you define these loss functions on top, you get convex functions: functions shaped like a bowl you could hold in your hand, with one global minimum. So if you think of SGD as going downhill, you'll converge to the global minimum and you've solved the problem. Whereas for neural networks, the loss functions turn out to be non-convex, which means that if you go downhill, you might get stuck in a local optimum. In general, optimizing neural networks is hard. In practice, people somehow manage to do it anyway and it works; there's a gap between theory and practice, which is an active area of research.
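Here is the promised sketch of the two passes, assuming a toy Node class of my own; this is illustrative, not PyTorch or TensorFlow, and it implements only the plus and times blocks (the others follow the same pattern with the edge partials given earlier):

```python
# A minimal sketch of backpropagation on a tiny computation graph.
# Each node carries a forward value f (the value of its subexpression)
# and a backward value g (the partial derivative of the final output
# with respect to this node).

class Node:
    def __init__(self, op, children=(), value=None):
        self.op, self.children = op, list(children)
        self.f = value   # forward value (set for leaves, computed otherwise)
        self.g = 0.0     # backward value, accumulated from parents

def forward(node):
    if node.op == "leaf":
        return node.f
    vals = [forward(c) for c in node.children]  # recurse: leaves to root
    node.f = vals[0] + vals[1] if node.op == "+" else vals[0] * vals[1]
    return node.f

def backward(node, g=1.0):
    node.g += g  # g = parent's backward value times the edge partial
    if node.op == "+":
        for child in node.children:   # edge partial for + is 1 on both inputs
            backward(child, g)
    elif node.op == "*":
        a, b = node.children          # edge partials for * are b and a
        backward(a, g * b.f)
        backward(b, g * a.f)

# Example: output = (x * y) + x with x = 3 and y = 4.
x, y = Node("leaf", value=3.0), Node("leaf", value=4.0)
out = Node("+", [Node("*", [x, y]), x])
forward(out)
backward(out)
print(out.f)      # 15.0
print(x.g, y.g)   # 5.0 (= y + 1, since x is used twice) and 3.0 (= x)
```

Note how x's backward value accumulates contributions from both places it is used, which is exactly why gi is defined as a sum over paths to the root.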
So in one minute I'm going to do nearest neighbors, and it will actually be fine, because nearest neighbors is really simple. Here it goes. Throw away everything we know about linear classifiers and neural nets. Here's the algorithm. Training: you store your training examples. That's it. And then to predict on a new input x', you go through all the training examples, find the one whose input is closest to x', and return that example's y. There's a short code sketch at the end. The intuition here is that similar inputs should get similar outputs.

Here's a pictorial example. Suppose we're in two dimensions doing classification, and you have a plus over here and a minus over here. If you ask what label is assigned to a point near the plus, it should be plus, because the plus is closer; the region near the minus should be minus. And one kind of cool thing: where is the decision boundary? If you look at the points that are equidistant from the two examples, that's the decision boundary, and the same thing over here. So you basically carve out regions where everything here is minus and everything there is plus. In general, what I've drawn is an instance of a Voronoi diagram: given a bunch of points, it defines, for each point, the region of space closest to that point, and everything in a particular region, like this yellow one, is assigned the same label as that region's point. And this is what's called a non-parametric model. That doesn't mean there are no parameters; it means the number of parameters is not fixed. The more points you have, the more parameters, since each point is in effect its own parameter. So you can actually fit really expressive models this way. It's very simple, but it's computationally expensive, because you have to store all your training examples (and scan them at prediction time).

Okay, so we've looked at three different models, and there's a saying that in college there are three things, study, sleep, and party, and you can only pick two of them. Well, for learning it's kind of the same. A method can be fast to predict, which linear models and neural networks are; it can be easy to learn, which linear models and nearest neighbors are; or it can be powerful, like neural networks and nearest neighbors. But there's always some sort of compromise, and exactly which method you choose will depend on what you care about. Okay, see you next time.
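And here is the promised sketch of the nearest neighbors algorithm. The names train and predict are illustrative, and a real implementation would use vectorized distances or a spatial index rather than a linear scan:

```python
# A minimal sketch of 1-nearest-neighbor classification. Training just
# stores the examples; prediction returns the label of the stored input
# closest to the query (squared Euclidean distance here).

def train(examples):
    return list(examples)   # that's it: remember the training set

def predict(model, x_query):
    def sq_dist(example):
        x, _ = example
        return sum((a - b) ** 2 for a, b in zip(x, x_query))
    x_nearest, y_nearest = min(model, key=sq_dist)
    return y_nearest

# With the two-dimensional picture in mind: one plus point, one minus point.
model = train([((0.0, 1.0), +1), ((2.0, 0.0), -1)])
print(predict(model, (0.5, 0.5)))  # 1, since the query is closer to the plus
```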