MIT 6.0002 Lecture 11: Introduction to Machine Learning


The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu. Okay, welcome back. You know, it’s that time of term when we’re all kind of doing this. So let me see if I can get a few smiles by simply noting to you that two weeks from today is the last class. That should be worth at least a little bit of a smile, right? Professor Guttag is smiling. He likes that idea. You’re almost there. What were we doing for the last couple of lectures? We were talking about linear regression. And I just want to remind you, this was the idea of: I have some experimental data, the case of a spring where I put different weights on it and measure displacements, and regression was giving us a way of deducing a model to fit that data. In some cases, it was easy. We knew, for example, it was going to be a linear model, and we found the best line that would fit that data. In other cases, we said we could use validation to let us explore to find the best model to fit the data, whether it’s linear, quadratic, cubic, some higher-order thing. So we were using that to deduce a model. That’s a nice segue into the topic for the next three lectures, the last big topic of the class, which is machine learning. And I’m going to argue, you can debate whether that’s actually an example of learning, but it has many of the elements that we want to talk about when we talk about machine learning. As always, there’s a reading assignment. Chapter 22 of the book gives you a good start on this, and we’ll follow up with other pieces. I want to start by basically outlining what we’re going to do. And I’m going to begin by saying, as I’m sure you’re aware, this is a huge topic.
I’ve listed just five subjects in Course 6 that all focus on machine learning. And that doesn’t include other subjects where learning is a central part. So natural language processing, computational biology, computer vision, robotics: all of these rely heavily on machine learning today, and you’ll see it in those subjects as well. So we’re not going to compress five subjects into three lectures. But what we are going to do is give you the introduction. We’re going to start by talking about the basic concepts of machine learning: the idea of having examples, how you talk about features representing those examples, how you measure distances between them, and how you use the notion of distance to try and group similar things together as a way of doing machine learning. And we’re going to look, as a consequence, at two different standard ways of doing learning. One we call classification methods; an example we’re going to see there is something called k-nearest neighbors. And the second class is called clustering methods. Classification works well when I have what we would call labeled data: I know labels on my examples, and I’m going to use that to try and define classes that I can learn. Clustering works well when I don’t have labeled data. We’ll see what that means in a couple of minutes. But we’re going to give you an early view of this. Unless Professor Guttag changes his mind, we’re probably not going to show you the current really sophisticated machine learning methods, like the convolutional neural nets or deep learning things you’ll read about in the news. But you’re going to get a sense of what’s behind those by looking at what we do when we talk about learning algorithms. Before I do that, I want to point out to you just how prevalent this is. I admit, with my gray hair, I started working in AI in 1975, when machine learning was a pretty simple thing to do. And it’s been fascinating to watch the change over 40 years.
And if you think about it, just think about where you see it. AlphaGo, a machine-learning-based system from Google, plays Go at a world-class level. Chess had already been conquered by computers for a while; Go now belongs to computers. The best Go players in the world are computers. I’m sure many of you use Netflix’s recommendation system. Netflix, Amazon, pick your favorite: they use a machine learning algorithm to suggest things for you. In fact, you’ve probably seen it on Google, right? The ads that pop up on Google are coming from a machine learning algorithm that’s looking at your preferences. Scary thought. Drug discovery. Character recognition: the post office does character recognition of handwritten characters using a machine learning algorithm with a computer vision system behind it. You probably don’t know this company; it’s an MIT spinoff called Two Sigma, a hedge fund in New York. They heavily use AI and machine learning techniques, and two years ago their fund returned 56%. I wish I’d invested in the fund. I don’t have the kinds of millions you need, but that’s an impressive return: 56% on your money in one year. Last year they didn’t do quite as well, but they do extremely well using machine learning techniques. Siri. Another great MIT company, called Mobileye, builds computer vision systems with a heavy machine learning component that are used in assistive driving and will be used in completely autonomous driving. It will do things like kicking in your brakes if you’re closing too fast on the car in front of you. Which is going to be really bad for me, because I drive like a Bostonian and it would be kicking in constantly. Face recognition: Facebook uses this, and many other systems do, to both detect and recognize faces. IBM’s Watson does cancer diagnosis. These are all just examples of machine learning being used everywhere. And it really is. I’ve only picked nine. So, what is it? I’m going to make an obnoxious statement.
You’re now used to that. I’m going to claim that you could argue that almost every computer program learns something. But the level of learning really varies a lot. So if you think back to the first lecture of 6.0001, we showed you Newton’s method for computing square roots. And you could argue, you’d have to stretch it, but you could argue that that method learned something about how to compute square roots. In fact, you could generalize it to roots of any power. But it really didn’t learn; I really had to program it. Now think about last week, when we talked about linear regression. That starts to feel a little bit more like a learning algorithm. Because what did we do? We gave it a set of data points, mass-displacement data points. And then we showed you how the computer could essentially fit a curve to those data points. It was in some sense learning a model for that data that it could then use to predict behavior in other situations. And that’s getting closer to what we would like when we think about a machine learning algorithm: we’d like to have a program that can learn from experience, something that it can then use to deduce new facts. Now, this has been a problem in AI for a very long time. And I love this quote, from a gentleman named Arthur Samuel. The quote is from 1959, and it gives his definition of machine learning: the field of study that gives computers the ability to learn without being explicitly programmed. And I think many people would argue he wrote the first such program. It learned from experience. In his case, it played checkers. It kind of shows you how the field has progressed: we started with checkers, we got to chess, we now do Go. But it played checkers, and it beat national-level players. Most importantly, it learned to improve its methods by watching how it did in games and then inferring something to change what it thought about as it played. Samuel did a bunch of other things; I’ve just highlighted one.
You may see in a follow-on course that he invented what’s called alpha-beta pruning, which is a really useful technique for doing search. But the idea is: how can we have the computer learn without being explicitly programmed? And one way to think about this is to think about the difference between how we would normally program and what we would like from a machine learning algorithm. Normal programming, I know you’re not convinced there’s such a thing as normal programming, but if you think of traditional programming, what’s the process? I write a program that I input to the computer so that it can then take data and produce some appropriate output. And the square root finder really fits that, right? I wrote code using Newton’s method to find a square root, and that gave me a process: give it any number, and it’ll give you the square root. But if you think about what we did last time, it was a little different. In fact, in a machine learning approach, the idea is that I’m going to give the computer output: I’m going to give it examples of what I want the program to do, labels on data, characterizations of different classes of things. And what I want the computer to do is, given that characterization of output and data, I want the machine learning algorithm to actually produce for me a program, a program that I can then use to infer new information about things. And that creates, if you like, a really nice loop, where I can have the machine learning algorithm learn a program which I can then use to solve some other problem. That would be really great if we could do it. And as I suggested, that curve-fitting algorithm is a simple version of that: it learned a model for the data, which I could then use to label other instances of the data, or predict what I would see in terms of spring displacement as I choose different masses. So that’s the kind of idea we’re going to explore. If we want to learn things, we could also ask: how do you learn? And how should a computer learn?
Well, for you as a human, there are a couple of possibilities. This is the boring one. This is the old-style way of doing it, right? Memorize facts. Memorize as many facts as you can, and hope that we ask you on the final exam about instances of those facts, as opposed to some other facts you haven’t memorized. This is, if you think way back to the first lecture, an example of declarative knowledge: statements of truth. Memorize as many as you can. Have Wikipedia in your back pocket. A better way to learn is to be able to infer, to deduce new information from old. And if you think about it, this gets closer to what we call imperative knowledge: ways to deduce new things. Now, in the first case, we built that in when we wrote the program to do square roots. But what we’d like in a learning algorithm is much more of that generalization idea. We’re interested in extending our capabilities to write programs that can infer useful information from implicit patterns in the data. So not something explicitly built in, like that comparison of weights and displacements, but actually implicit patterns in the data, and have the algorithm figure out what those patterns are and use them to generate a program you can use to infer new data about objects, about spring displacements, whatever it is you’re trying to do. Okay, so the basic paradigm that we’re going to see is: we give the system some training data, some observations; we did that last time with just the spring displacements. We then try to figure out how to write a program, a system, that will infer something about the process that generated the data. And then from that, we want to be able to use it to make predictions about things we haven’t seen before. Again, I want to drive home this point. If you think about it, the spring example fit that model.
I gave you a set of data, spring displacements for different masses: for each mass, how far did the spring move. I then inferred something about the underlying process. In the first case I said, I know it’s linear, but let me figure out what the actual linear equation is, what’s the spring constant associated with it. And based on that result, I got a piece of code I could use to predict new displacements. So it’s got all of those elements: training data, an inference engine, and then the ability to use that to make new predictions. But that’s a very simple kind of learning setting. The more common one is what I’m going to use as an example, which is when I give you a set of examples. Those examples have some data associated with them, some features, and some labels. For each example I might say, this is a particular kind of thing; this other one’s another kind of thing. And what I want to do is figure out how to do inference on labeling new things. So it’s not just predicting the displacement of a mass; it’s actually assigning a label. And I’m going to use one of my favorite examples. I’m a big New England Patriots fan; if you’re not, my apologies. But I’m going to use football players. So I’m going to show you in a second a set of examples of football players. The label is the position they play. And the data, well, it could be lots of things; we’re going to use height and weight. What we want to do is see how we would come up with a way of characterizing the implicit pattern of how weight and height predict the kind of position this player could play. And then we can come up with an algorithm that will predict the position of new players. We do the draft for next year: where do we want them to play? So what’s the paradigm? A set of observations, potentially labeled, potentially not. Think about how we do inference to find a model. And then how do we use that model to make predictions?
What we’re going to see, and we’re going to see multiple examples today, is that that learning can be done in one of two very broad ways. The first one’s called supervised learning. In that case, for every example I give you as part of the training data, I have a label on it: I know the kind of thing it is. And what I’m going to do is look for how to find a rule that would predict the label associated with unseen input based on those examples. The second one’s called unsupervised learning. In that case, the training data has no labels, and I’m just going to try and find what are the natural ways to group those examples together into different models. And in some cases I may know how many models there are; in some cases I may want to just say, what’s the best grouping I can find? OK, what I’m going to do today is not a lot of code. I was expecting cheers for that, but I didn’t get them. Not a lot of code. What I’m going to do is show you basically the intuitions behind doing this learning. And I’m going to start with my New England Patriots example. So here are some data points about current Patriots players. And I’ve got two kinds of positions: I’ve got receivers and I have linemen. And each one is just labeled by the name, the height in inches, and the weight in pounds. OK, five of each. If I plot those on a two-dimensional plot, this is what I get. OK, no big deal. What I’m trying to learn is: are there characteristics that distinguish the two classes from one another? And in the unlabeled case, all I have is just a set of examples. So what I want to do is decide what makes two players similar, with the goal of seeing, can I separate this distribution into two or more natural groups? Similar is a distance measure. It says, how do I take two examples with values or features associated with them and decide how far apart they are? And in the unlabeled case, the simple way to do it is to say, if I know that there are at least k groups there, and in this case, I’m going to tell you there are two different groups there.
How could I decide how best to cluster things together so that all the examples in one group are close to each other, all the examples in the other group are close to each other, and the two groups are reasonably far apart? There are many ways to do it; I’m going to show you one. It’s a very standard way, and it works basically as follows. If all I know is that there are two groups there, I’m going to start by just picking two examples as my exemplars. Pick them at random. Actually, at random is not great; I don’t want to pick two that are close to each other, I’m going to try and pick them far apart. But I pick two examples as my exemplars, and for all the other examples in the training data, I say which one is it closest to. What I’m going to try and do is create clusters with the property that the distances between all of the examples in each cluster are small (the average distance is small), and see if I can find clusters that get the average distance for both clusters as small as possible. So the algorithm works by picking two examples, then clustering all the other examples by simply saying, put each one in the group to which it’s closest. Once I’ve got those clusters, I find the median element of each group (not the mean but the median, the one closest to the center), treat those as my new exemplars, and repeat the process. I’ll do that either some fixed number of times or until I don’t get any change in the clusters. So it’s clustering based on distance, and we’ll come back to distance in a second. So here’s what would happen with my football players. If I just did this based on weight, there’s the natural dividing line. And it kind of makes sense. These three are obviously clustered; again, it’s just on this axis, they’re all down here. These seven are at a different place. There’s a natural dividing line there. If I were to do it based on height, it’s not as clean.
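The procedure just described (pick exemplars, assign every other example to the nearest one, re-pick the exemplar as the member closest to the group's center, repeat until nothing changes) can be sketched in a few lines of Python. This is a hypothetical illustration, not the course's actual code; the helper names (`euclidean`, `medoid`, `k_cluster`) and the toy data are mine.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(examples, exemplars):
    """Put each example in the group of the exemplar it is closest to."""
    groups = [[] for _ in exemplars]
    for e in examples:
        dists = [euclidean(e, ex) for ex in exemplars]
        groups[dists.index(min(dists))].append(e)
    return groups

def medoid(group):
    """The member of the group closest to the group's center:
    the 'median element' described in the lecture."""
    center = [sum(axis) / len(group) for axis in zip(*group)]
    return min(group, key=lambda e: euclidean(e, center))

def k_cluster(examples, initial_exemplars, max_iters=100):
    """Repeat assign / re-pick-exemplars until the exemplars stop changing."""
    exemplars = list(initial_exemplars)
    for _ in range(max_iters):
        groups = assign(examples, exemplars)
        new_exemplars = [medoid(g) for g in groups if g]
        if new_exemplars == exemplars:
            break
        exemplars = new_exemplars
    return groups

# Toy data with two obvious groups, and two far-apart starting exemplars:
data = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
groups = k_cluster(data, [(1, 0), (10, 11)])
```

On this toy data the loop converges in a couple of iterations to the two obvious groups of three points each.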
This is what my algorithm came up with as the best dividing line here, meaning that these four, again just based on this axis, are close together, and these six are close together. But it’s not nearly as clean. And that’s part of the issue we’ll look at: how do I find the best clusters? If I use both height and weight, I get that. Which is actually kind of nice, right? Those three cluster together; they’re near each other in terms of distance in the plane. Those seven are near each other. There’s a nice natural dividing line through here. And in fact, that gives me a classifier. This line is the equidistant line between the centers of those two clusters, meaning any point along this line is the same distance to the center of one group as it is to the center of the other. And so for any new example, if it’s above the line, I would say it gets that label; if it’s below the line, it gets that label. In a second, we’ll come back to look at how we measure the distances. But the idea here is pretty simple: I want to find groupings that are near each other and far apart from the other group. Now, suppose I actually knew the labels on these players. I shouldn’t call them objects; they’re people. These players. These are the receivers; those are the linemen. And for those of you who are football fans, you can figure out, right, those two are the tight ends. They’re much bigger. I think that’s Bennett and that’s Gronk, if you’re really a big Patriots fan. But those are tight ends; those are wide receivers. We’ll come back to this in a second, but there are the labels. Now, what I want to do is say, if I could take advantage of knowing the labels, how would I divide these groups up? And that’s kind of easy to see. The basic idea in this case is, if I’ve got labeled groups in that feature space, what I want to do is find a subsurface that naturally divides that space. Now, subsurface is a fancy word. In the two-dimensional case, it says I want to know what’s the best line.
Can I find a single line that separates all the examples with one label from all the examples with the second label? We’ll see that if the examples are well separated, this is easy to do, and it’s great. But in some cases it’s going to be more complicated, because some of the examples may be very close to one another. And that’s going to raise a problem that you saw last lecture: I want to avoid overfitting. I don’t want to create a really complicated surface to separate things, so we may have to tolerate a few incorrectly labeled things if we can’t separate them cleanly. And as you’ve already figured out, in this case, with the labeled data, there’s the best fitting line right there. Anybody over 280 pounds is going to be a great lineman, and anybody under 280 pounds is more likely to be a receiver. Okay, so I’ve got two different ways of trying to think about doing this labeling, and I want to come back to both of them in a second. Now, suppose I add in some new data. I want to label new instances. Now, these are actually players of a different position: these are running backs. But say all I know about is receivers and linemen. I get these two new data points, and I’d like to know, are they more likely to be a receiver or a lineman? And there’s the data for these two gentlemen. So if I now go back to plotting them, you notice one of the issues. There are my linemen; the red ones are my receivers; the two black dots are the two running backs. Notice right here: it’s going to be really hard to separate those two examples from one another. They are so close to each other. And that’s going to be one of the things we have to trade off. But if I think about using what I learned as a classifier with the unlabeled data, there were my two clusters. Now you see I’ve got an interesting example. This new example, I would say, is clearly more like a receiver than a lineman. But that one there? Unclear. It almost exactly lies along that dividing line between those two clusters.
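That idea, labeling a new point by which cluster center it is closer to, is easy to sketch. Purely illustrative: the player numbers below are made up, not the actual roster data.

```python
import math

def distance(p, q):
    """Euclidean distance between two (height, weight) points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(cluster):
    """Center of a cluster: the mean along each dimension."""
    return tuple(sum(axis) / len(cluster) for axis in zip(*cluster))

def classify(point, receivers, linemen):
    """The equidistant line between the two centers splits the plane;
    comparing the two distances tells us which side the point is on."""
    d_r = distance(point, centroid(receivers))
    d_l = distance(point, centroid(linemen))
    return "receiver" if d_r < d_l else "lineman"

# Made-up (height in inches, weight in pounds) clusters:
receivers = [(70, 190), (71, 195), (74, 250)]
linemen = [(76, 305), (77, 315), (78, 330)]
print(classify((72, 210), receivers, linemen))  # prints: receiver
```

A point lying almost exactly on the equidistant line, like the second running back in the lecture, would have `d_r` and `d_l` nearly equal, which is exactly why its label is unclear.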
And that would either say I want to rethink the clustering, or I want to say, you know what? Maybe there aren’t two clusters here; maybe there are three. And I want to classify them a little differently. So I’ll come back to that. On the other hand, if I’d used the labeled data, there was my dividing line. This is really easy. Both of those new examples are clearly below the dividing line. They are clearly examples that I would categorize as being more like receivers than they are like linemen. And I know it’s a football example; if you don’t like football, pick another example. But you get the sense of why I can use the data in the labeled case and the unlabeled case to come up with different ways of building the clusters. So what we’re going to do over the next two and a half lectures is look at how we can write code to learn that way of separating things out. We’re going to learn models based on unlabeled data (that’s the case where I don’t know what the labels are) by simply trying to find ways to cluster nearby things together, and then use the clusters to assign labels to new data. And we’re going to learn models by looking at labeled data and seeing how we best come up with a way of separating, with a line or a plane or a collection of lines, examples of one group from examples of the other group. With the acknowledgment that we want to avoid overfitting (we don’t want to create a really complicated system), and as a consequence we’re going to have to make some trade-offs between what we call false positives and false negatives. But the resulting classifier can then label any new data by just deciding where it sits with respect to that separating line. So here’s what you’re going to see over the next two and a half lectures. Every machine learning method has five essential components. We need to decide what’s the training data, and how we are going to evaluate the success of the system. We’ve already seen some examples of that.
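With the labeled Patriots data, the separator described earlier collapses to a single number. A minimal sketch of that classifier, using the roughly-280-pound dividing line the lecture mentions (the exact threshold and the sample numbers are illustrative):

```python
def label_player(height_inches, weight_pounds):
    """Classify by the single separating line the labeled data suggests:
    heavier than ~280 pounds looks like a lineman, lighter like a receiver.
    Height is carried along for the feature vector but unused, since
    weight alone separates the labels in this data."""
    return "lineman" if weight_pounds > 280 else "receiver"

# Two made-up running-back-sized data points: both fall below the line,
# so both get labeled "receiver", as in the lecture's example.
print(label_player(72, 210))  # prints: receiver
print(label_player(70, 205))  # prints: receiver
```

The simplicity is the point: with well-separated labeled data, the learned "program" can be a one-line rule.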
We need to decide how we are going to represent each instance that we’re given. I happened to choose height and weight for football players, but I might have been better off picking average speed or, I don’t know, arm length, something else. How do I figure out what are the right features? And associated with that, how do I measure distances between those features? How do I decide what’s close and what’s not close? Maybe it should be different in terms of weight versus height, for example. I need to make that decision. Those are the two things we’re going to show you examples of today. Starting next week, Professor Guttag is going to show you how you take those and actually start building more detailed versions: measuring clusterings, measuring similarities, to find an objective function that you want to minimize to decide what is the best cluster to use, and then what is the best optimization method you want to use to learn that model. So let’s start talking about features. I’ve got a set of examples, labeled or not. I need to decide what it is about those examples that’s useful to use when I want to decide what’s close to another thing or not. And one of the problems is, if it was really easy, it would be really easy. Features don’t always capture what you want. I’m going to belabor that football analogy, but why did I pick height and weight? Because it was easy to find. But, you know, if you work for the New England Patriots, what is the thing that you really look for when you’re asking what’s the right feature? It’s probably some other combination of things. But you as a designer have to say, what are the features I want to use? That quote, by the way, is from one of the great statisticians of the 20th century, and I think it captures this well. So feature engineering, for you as a programmer, comes down to deciding both what are the features I want to measure in that vector that I’m going to put together,
and how do I decide on relative ways to weight them? Now, John and I could have made our job this term really easy. If we had sat down at the beginning of the term and said, you know, we’ve taught this course many times, we’ve got data from, I don’t know, probably thousands of students over this time. Let’s just build a little learning algorithm that takes that data and predicts your final grade. You don’t have to come to class, you don’t have to go through all the problems; it will just predict your final grade. That might make our job a little easier, and you may or may not like that idea. But I could think about predicting that grade. Now, why am I telling you this example? I was trying to see if I could get a few smiles; I saw a couple of them there. But think about the features. What would I measure? Actually, I’ll pin this on John, because it’s his idea. What would he measure? Well, GPA is probably not a bad predictor of performance. If you do well in other classes, you’re likely to do well in this class. I’m going to use this one very carefully: prior programming experience is at least a predictor, but it is not a perfect predictor. Those of you who haven’t programmed before this class can still do really well in it. But it’s an indication that you’ve seen other programming languages. On the other hand, I don’t believe in astrology. So I don’t think the month in which you were born, the astrological sign under which you were born, has anything to do with how well you’d program. I doubt that eye color has anything to do with how well you’d program. You get the idea. Some features matter; others don’t. Now, I could just throw all the features in and hope that the machine learning algorithm sorts out the ones it wants to keep from the ones it doesn’t. But I remind you of that idea of overfitting. If I do that, there is the danger that it will find some correlation between birth month, eye color, and GPA.
And that’s going to lead to a conclusion that we really don’t like. By the way, in case you’re worried, I can assure you that Stu Schmill, the Dean of Admissions, does not use machine learning to pick you. He actually looks at a whole bunch of things, because it’s not easy to replace him with a machine. So what this says is we need to think about how we pick the features, and mostly what we’re trying to do is maximize something called the signal-to-noise ratio: keep the features that carry the most information and remove the ones that don’t. So I want to show you an example of how you might think about this. I want to label reptiles. I want to come up with a way of labeling animals as: are they a reptile or not? And I give you a single example. With a single example you can’t really do much, but from this example I know that a cobra lays eggs, has scales, is poisonous, is cold-blooded, has no legs, and is a reptile. So I could say, my model of a reptile is, well, I’m not certain, I don’t have enough data yet. But if I give you a second example, and it also happens to be egg-laying, have scales, be poisonous, be cold-blooded, and have no legs, there’s my model, right? A perfectly reasonable model. Whether I designed it or a machine learning algorithm produced it, it says: if all of these are true, label it as a reptile. Okay, and now I give you a boa constrictor. It’s a reptile, but it doesn’t fit the model. In particular, it’s not egg-laying, and it’s not poisonous. So I’ve got to refine the model, or the algorithm’s got to refine the model. And this, I want to remind you, is looking at the features. I started out with five features. This doesn’t fit, so probably what I should do is reduce them. I’m going to look at scales, I’m going to look at cold-blooded, I’m going to look at legs. That captures all three examples. Again, if you think about this in terms of clustering, all three of them would fit with that. Okay, now I give you another example.
A chicken. I don’t think it’s a reptile; in fact, I’m pretty sure it’s not a reptile. And the model nicely handles it, right? Because while it has scales, which you may not realize, it’s not cold-blooded and it has legs, so the model correctly says it’s not a reptile. So it is a negative example that reinforces the model. Sounds good. And now I give you an alligator. It’s a reptile. And, ah, fudge, right? It doesn’t satisfy the model. Because while it does have scales, and it is cold-blooded, it has legs. I’m almost done with the example, but you see the point. Again, I’ve got to think about how I refine this. And I could, by saying, all right, let’s make it a little more complicated: has scales, cold-blooded, zero or four legs, and I’m going to say it’s a reptile. Then I give you the dart frog. It’s not a reptile; it’s an amphibian. And that’s nice, because it still satisfies the rule: it’s an example outside of the cluster, with no scales, not cold-blooded, but it happens to have four legs. It’s not a reptile. That’s good. And then I have to give you a python. There has to be a python in here; you can at least groan at me when I say that. And I give you that and a salmon. And now I am in trouble. Because look at scales, look at cold-blooded, look at legs: I can’t separate them. On those features, there’s no way to come up with a rule that will correctly say that the python is a reptile and the salmon is not. So there’s no easy way to add in that rule, and probably my best move is to simply go back to just two features: scales and cold-blooded. And basically say, if something has scales and it’s cold-blooded, I’m going to call it a reptile; if it doesn’t have both of those, I’m going to say it’s not a reptile. It won’t be perfect; it’s going to incorrectly label the salmon. But I’ve made a design choice here that’s important. And the design choice is that I will have no false negatives.
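The final two-feature rule can be written down directly. A sketch; the animal tuples just restate the feature values from the lecture's examples.

```python
def is_reptile(has_scales, cold_blooded):
    """The design choice described above: scales AND cold-blooded.
    Every actual reptile in the examples passes (no false negatives),
    at the cost of letting the salmon through (a false positive)."""
    return has_scales and cold_blooded

animals = {
    # name: (has_scales, cold_blooded, actually_a_reptile)
    "boa constrictor": (True, True, True),
    "alligator": (True, True, True),
    "python": (True, True, True),
    "chicken": (True, False, False),
    "dart frog": (False, False, False),
    "salmon": (True, True, False),
}

for name, (scales, cold, truth) in animals.items():
    guess = is_reptile(scales, cold)
    if guess and not truth:
        print(name, "is a false positive")  # only the salmon prints
```

Running the loop flags exactly one animal, the salmon: the rule trades that one false positive for never missing a real reptile.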
What that means is there’s not going to be any instance of an actual reptile that I label as not a reptile; nothing real gets missed. I may have some false positives, in that I may have a few things that I will incorrectly label as a reptile. And in particular, the salmon is going to be an instance of that. This trade-off of false positives and false negatives is something we worry about, because in many cases there’s no perfect way to separate the data. And if you think back to my example of the New England Patriots, that running back and that wide receiver were so close together in height and weight that there’s no way I’m going to be able to separate them. And I just have to be willing to decide how many false positives or false negatives I want to tolerate. Once I’ve figured out what features to use, which is good, then I have to decide about distance. How do I compare two feature vectors? I’m going to say vectors because there will be multiple dimensions to them. How do I decide how to compare them? Because I want to use the distances to figure out either how to group things together or how to find a dividing line that separates things. So one of the things I have to decide is which features. I also have to decide the distance. And finally, I may want to decide how to weigh the relative importance of different dimensions in the feature vector: some may be more valuable than others in making that decision. And I want to show you an example of that. So let’s go back to my animals. I started off with a feature vector that actually had five dimensions to it: egg-laying, cold-blooded, has scales, poisonous, and number of legs.
So one of the ways I could think about this is saying I've got four binary features and one integer feature associated with each animal. And one way to learn to separate out reptiles from non-reptiles is to measure the distance between pairs of examples and use that distance to decide what's near each other and what's not. As I've said before, it'll either be used to cluster things or to find a classifier surface that separates them. So here's a simple way to do it. For each of these examples, I'm going to just let true be 1 and false be 0, so the first four features are either zeros or ones, and the last one is the number of legs. And now I could say, all right, how do I measure distances between animals, or anything else with these kinds of feature vectors? Here we're going to use something called the Minkowski metric, or the Minkowski distance. Given two vectors and a power p, we basically take the absolute value of the difference between each of the components of the vectors, raise it to the pth power, take the sum, and take the pth root of that. So let's do the two obvious examples. If p is equal to 1, I just measure the absolute distance between each component and add them up, and that's my distance. It's called the Manhattan metric. The one you've seen more, the one you saw last time: if p is equal to 2, this is Euclidean distance, right? It's the sum of the squares of the differences of the components; take the square root, take the square root because it makes it have certain properties of a distance. That's the Euclidean distance. So now, if I want to measure the difference between these two, here's the question: is this circle closer to the star or closer to the cross? Unfortunately, I put the answer up here. But it differs depending on the metric I use. Euclidean distance, well, that's the square root of two squared plus two squared, so it's about 2.8, and that's 3. So in terms of just standard distance in the plane, we would say that these two are closer than those two are.
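In symbols, the metric is dist(V1, V2, p) = (sum over i of |v1_i - v2_i|^p)^(1/p). A minimal sketch of it, my own version rather than the code from the handout; the point positions are chosen to reproduce the numbers quoted (about 2.8 versus 3):

```python
def minkowski_dist(v1, v2, p):
    """Minkowski distance between two equal-length feature vectors:
    the pth root of the sum of the pth powers of the component-wise
    absolute differences.  p=1 is the Manhattan metric, p=2 Euclidean."""
    total = 0.0
    for x1, x2 in zip(v1, v2):
        total += abs(x1 - x2) ** p
    return total ** (1 / p)

# The "star" is 2 across and 2 up from the "circle"; the "cross" is
# 3 straight up (an assumed layout matching the example's numbers).
circle, star, cross = (0, 0), (2, 2), (0, 3)
print(minkowski_dist(circle, star, 2))   # ~2.83 -- Euclidean: star is closer
print(minkowski_dist(circle, cross, 2))  # 3.0
print(minkowski_dist(circle, star, 1))   # 4.0  -- Manhattan: cross is closer
print(minkowski_dist(circle, cross, 1))  # 3.0
```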
Manhattan distance, why is it called that? Because you can only walk along the avenues and the streets. Manhattan distance would basically say this is one, two, three, four units away, and this is one, two, three units away. And under Manhattan distance, this pairing is closer than that pairing is. Now, you're used to thinking Euclidean, and we're mostly going to use that, but this is going to be important when we think about how we're comparing distances between these different pieces. So typically we'll use Euclidean, but we're going to see that Manhattan actually has some value. So if I go back to my three examples, boy, that's a gross slide, isn't it? But there we go. Rattlesnake, boa constrictor, and dart frog. There's the representation. I can ask, what's the distance between them? In the handout for today, we're giving you a little piece of code that will do that. And if I actually run it, I get a nice little result. Here are the distances between those vectors using the Euclidean metric. I'm going to come back to that, but you can see the two snakes nicely are reasonably close to each other, whereas the dart frog is a fair distance away from both. Nice, right? That's a nice separation that says there's a difference between these two groups. Now I throw in the alligator. It sounds like a Dungeons and Dragons game. I throw in the alligator, and I want to do the same comparison. And I don't get nearly as nice a result. Because now it says, as before, the two snakes are close to each other, but the dart frog and the alligator are much closer under this measurement than either of them is to the snakes. And to remind you, the alligator and the two snakes I would like to be close to one another and a distance away from the frog, because I'm trying to classify reptiles versus not. So what happened here?
Well, this is a place where the feature engineering is going to be important, because in fact the alligator differs from the frog in three features, but in only two features from, say, the boa constrictor. One of those features, though, is the number of legs, and there, while on the binary axes the difference is between a zero and a one, here it can be between zero and four. So that is weighing the distance a lot more than we would like. The legs dimension is too large, if you like. How would I fix this? This is actually, I would argue, a natural place to use Manhattan distance. Why should I think that the difference in the number of legs is more important than whether it has scales or not? Why should I think that measuring that distance Euclidean-wise makes sense? They are really completely different measurements. And in fact, I'm not going to do it, but if I ran the Manhattan metric on this, it would get the alligator much closer to the snakes, exactly because it differs in only two features, not three. The other way I could fix it would be to say, I'm letting too much weight be associated with the difference in the number of legs, so let's just make it a binary feature: either it doesn't have legs, or it does have legs. Run the same classification, and now you see the snakes and the alligator are all close to each other, whereas the dart frog, while not as far away as before, has a pretty natural separation from them. What's my point? Choice of features matters. Throwing too many features in may, in fact, give us some overfitting. And in particular, deciding the weights that I want on those features has a real impact. And you as a designer or a programmer have a lot of influence on how you think about using those. So feature engineering really matters. How you pick the features, what you use, is going to be important. Okay.
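The binary-legs fix can be sketched as follows. The feature encodings below are my reconstruction, not the lecture's slide: five features per animal, chosen so that, as the text says, the alligator differs from the frog in three binary features and from the boa in two features, one of which is the leg count.

```python
from math import sqrt

def euclidean(v1, v2):
    """Euclidean (Minkowski p=2) distance between feature vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# Illustrative encodings (my assumption):
# [egg-laying, has scales, poisonous, cold-blooded, number of legs]
boa       = (0, 1, 0, 1, 0)
frog      = (1, 0, 1, 0, 4)
alligator = (1, 1, 0, 1, 4)

# With a raw leg count, the 0-to-4 legs axis dominates:
print(euclidean(alligator, boa))   # ~4.12 -- far from the other reptile
print(euclidean(alligator, frog))  # ~1.73 -- "close" to the amphibian

def binarize_legs(v):
    """Replace the leg count with a has-legs bit (0 or 1)."""
    return v[:-1] + (1 if v[-1] > 0 else 0,)

# After making legs binary, the alligator moves next to the boa:
print(euclidean(binarize_legs(alligator), binarize_legs(boa)))   # ~1.41
print(euclidean(binarize_legs(alligator), binarize_legs(frog)))  # ~1.73
```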
The last piece of this, then, is we're going to look at some examples where we give you data with features associated with them. In some cases they'll be labeled, in other cases not, and we now know how to think about measuring distances between them. Yes, John? John notes that I probably didn't intend to say weights of features, but rather the scales. Thank you, John. You're right, I take that back. I did not mean to say weights of features; I meant to say the scale of the dimensions is going to be important. Thank you for the amplification and correction. And we're going to see next time why we're going to use weights in a different way. So rephrase it, block that out of your mind: we're going to talk about scales, and the scale on the axes, as being important here. And we already said we're going to look at two different kinds of learning, labeled and unlabeled, clustering and classifying. And I want to just finish up by showing you two examples of that, how we would think about them algorithmically, and we'll look at them in more detail next time. As we look at it, I want to remind you of the things that are going to be important to you. How do I measure distance between examples? What's the right way to design that? What is the right set of features to use in that vector? And then, what constraints do I want to put on the model? How do I decide how many clusters I want to have? Because I can give you a really easy way to do clustering: you've given me 100 examples, I say build 100 clusters. Every example is its own cluster. The distance is really good; every example is really close to itself. But it does a lousy job of labeling things. So I have to think about how do I decide how many clusters? What's the complexity of that separating surface? How do I basically avoid the overfitting problem, which I don't want to have? So, just to remind you, we've already seen a little version of this. The clustering method. This is a standard way to do it.
Simply repeating what we had on an earlier slide: if I want to cluster into groups, I start by saying how many clusters am I looking for. Pick an example I take as my initial representation of each cluster. For every other example in the training data, assign it to the closest cluster. Once I've got those, find the median, and repeat the process. And that led to that separation. Now, once I've got it, I'd like to validate it. And in fact, I should have said this better: those two clusters came without looking at the two black dots. Once I put the black dots in, I'd like to validate how well this really works. And that example there is really not very encouraging. It's too close. So that's a natural place to say, okay, what if I did this with three clusters? That's what I get. I like that. That has a really nice cluster up here. The fact that the algorithm didn't know the labeling is relevant. There's a nice grouping of five, there's a nice grouping of four, and there's a nice grouping of three in between. And in fact, if I looked at the average distance between examples in each of these clusters, it is much tighter than in that example. And so that, whoops, sorry, leads then to the question of, should I look for four clusters? Question, please. The question is, is the overlap between the two clusters a problem? No. I just drew it here so I can let you see where those pieces are. But in fact, if you like, the boundary is such that the center is there: those three points are all closer to that center than they are to that center. So the fact that they overlap is just the way I happened to draw them. I should really draw these not as circles, but as somewhat more convoluted surfaces. Having done three, I could say, should I look for four? Well, those points down there, as I've already said, are an example where it's going to be hard to separate them out.
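The loop just described can be sketched as follows. The lecture says "find the median"; I use the per-dimension mean, which is the usual k-means centroid, and the names here are mine, not the course code's:

```python
import random

def dist(v1, v2):
    """Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5

def k_means(examples, k, num_iters=100):
    """Pick k examples as initial centroids, assign every example to
    the nearest centroid, recompute each centroid as the mean of its
    cluster, and repeat until the centroids stop moving."""
    centroids = random.sample(examples, k)
    for _ in range(num_iters):
        # Assignment step: each example goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for e in examples:
            idx = min(range(k), key=lambda i: dist(e, centroids[i]))
            clusters[idx].append(e)
        # Update step: recompute each centroid as its cluster's mean.
        new_centroids = []
        for i, c in enumerate(clusters):
            if c:
                new_centroids.append(tuple(sum(vals) / len(c) for vals in zip(*c)))
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return clusters, centroids

# Two well-separated groups of points:
random.seed(0)
examples = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, centroids = k_means(examples, k=2)
```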
I don't want to overfit, because the only way to separate those out is going to be to come up with a really convoluted cluster, which I don't like. Let me finish by showing you one other example, from the other direction. Suppose I give you labeled examples. So again, the goal is: I've got features associated with each example, there are multiple dimensions to it, but I also know the label associated with each one. And I want to learn what is the best way to come up with a rule that will let me take new examples and assign them to the right group. There are a number of ways to do this. You can simply say, I'm looking for the simplest surface that will separate those examples. In my football case, where we're in the plane, what's the best line that separates them? Which turns out to be easy. I might look for a more complicated surface, and we're going to see an example in a second where maybe it's a sequence of line segments that separates them, because there's not just one line that does the separation. As before, I want to be careful: if I make it too complicated, I may get a really good separator, but I overfit to the data. And there's a third way, which you'll see next time and which I'll highlight here, that leads to almost the same kind of result, called k nearest neighbors. And the idea here is, I've got a set of labeled data, and what I'm going to do is, for every new example, find the k, say the five, closest labeled examples, and take a vote. If three out of five, or four out of five, or five out of five of those labels are the same, I'm going to say it's part of that group. And if I have less than that, I'm going to leave it as unclassified. That's a nice way of thinking about how to learn the labels. And let me just finish by showing you an example. I won't use football players on this one; I'll use a different example. I'm going to give you some voting data.
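The voting scheme just described, find the k closest labeled examples and require a minimum number of agreeing votes, can be sketched like this; the names and the tie-handling are my own choices, not the lecture's:

```python
from collections import Counter

def dist(v1, v2):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(v1, v2)) ** 0.5

def knn_classify(labeled, new_example, k=5, min_votes=3):
    """`labeled` is a list of (feature_vector, label) pairs.  Find the
    k nearest labeled examples, take a vote, and return the winning
    label only if it got at least `min_votes`; otherwise return None
    to mean "unclassified"."""
    nearest = sorted(labeled, key=lambda pair: dist(pair[0], new_example))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label if count >= min_votes else None

# Toy labeled data: two tight groups of 'D' and 'R' voters.
labeled = [((0, 0), 'D'), ((0, 1), 'D'), ((1, 0), 'D'),
           ((10, 10), 'R'), ((10, 11), 'R'), ((11, 10), 'R')]
print(knn_classify(labeled, (0.5, 0.5)))  # 'D': 3 of the 5 nearest are 'D'
```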
I think this is actually simulated data, but these are a set of voters in the United States with their preference: they tend to vote Republican, or they tend to vote Democrat. And the two features are their age and how far away they live from Boston. Whether those are relevant or not, I don't know, but those are just the two things I'm going to use to classify them. And I'd like to say, how would I fit a curve to separate those two classes? I'm going to keep half the data to test; I'm going to use half the data to train. So if this is my training data, I could say, what's the best line that separates these? I don't know about best, but here are two examples. The solid line has the property that all the Democrats are on one side; everything on the other side is a Republican, but there are some Republicans on this side of the line. I can't find a line that completely separates these, as I did with the football players, but there's a decent line to separate them. Here's another candidate. The dashed line has the property that on the right side you've got, boy, I don't think this is deliberate, John, but on the right side you've got almost all Republicans. Seems perfectly appropriate. One Democrat, but there's a pretty good separation there. And on the left side you've got a mix of things, but most of the Democrats are on the left side of that line. The fact that left and right correlates with distance from Boston is completely irrelevant here, but it has a nice punch to it. Not accidental, John says. Thank you. Now the question is, how would I evaluate these? How do I decide which one's better? I'm simply going to show you, very quickly, some examples. The first is to look at what's called the confusion matrix. What does that mean? It says, for one of these classifiers, for example the solid line, here are the predictions, based on the solid line, of whether they would be more likely to be Democrat or Republican, and here's the actual label. Same thing for the dashed line.
And that diagonal is important, because those are the correctly labeled results. In the solid line case, it correctly gets all of the labeling of the Democrats, and it gets half of the Republicans right, but it has some where it's actually a Republican and it labels it as a Democrat. That diagonal we'd like to be really large. And in fact, it leads to a natural measure called the accuracy. Just to go back to that: we say these are true positives, meaning I labeled it as being an instance and it really is. These are true negatives: I labeled it as not being an instance and it really isn't. These are the false positives: I labeled it as being an instance and it isn't. And these are the false negatives: I labeled it as not being an instance and it is. And an easy way to measure it is to look at the correct labels over all of the labels: the true positives and the true negatives, the ones I got right. And in that case, both models come up with a value of 0.7. So which one's better? Well, I should validate that, and I'm going to do that in a second by looking at other data. We could also ask, could we find something with less training error? This is only getting 70% right. Not great. Well, here's a more complicated model, and this is where you start getting worried about overfitting. Now what I've done is come up with a sequence of lines that separates them. Everything above this line, I'm going to say, is a Republican; everything below this line, I'm going to say, is a Democrat. So I'm avoiding that one, I'm avoiding that one, and I'm still capturing many of the same things. And in this case I get 12 true positives, 13 true negatives, and only 5 false positives. And that's kind of nice. You can see the five: it's those five red ones down there. Its accuracy is 0.833. And now, if I apply that to the test data, I get an OK result. It has an accuracy of 0.6. I could use this idea to try to generalize this.
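The accuracy measure just described is a one-liner; here it is with the complex model's training numbers plugged in. (The zero false negatives below is inferred from the quoted 0.833 accuracy, not stated in the lecture.)

```python
def accuracy(tp, fp, tn, fn):
    """Correct labels over all labels: (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# The complex model's training numbers quoted above: 12 true positives,
# 13 true negatives, 5 false positives, and (inferred) 0 false negatives.
print(accuracy(tp=12, fp=5, tn=13, fn=0))  # 25/30 = 0.8333...
```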
I say, could I come up with a better model? And you're going to see that next time. There could be other ways in which I measure this, and I want to use this as the last example. Another good measure we use is called PPV, positive predictive value, which is how many true positives do I come up with out of all of the things I labeled positively. And in the solid line and the dashed line models, I get values of about 0.57; the complex model on the training data is better. And finally, two other examples are called sensitivity and specificity. Sensitivity basically tells you, what percentage did I correctly find? And specificity says, what percentage did I correctly reject? And I show you this because this is where the tradeoff comes in. Sensitivity is how many I correctly labeled as positive, out of all the ones that are actually positive: the ones I correctly labeled plus the ones I incorrectly labeled as being negative. How many of them did I correctly label as being the kind that I want? I can make sensitivity one: label everything as the thing I'm looking for. Great, everything's correct. But the specificity will be zero, because I'll have a bunch of things incorrectly labeled. Or I could make the specificity one: reject everything, say nothing is an instance. Specificity goes to one, and I'm in a great place there, but my sensitivity goes to zero. I've got a tradeoff. As I think about the machine learning algorithm I'm using, and my choice of that classifier, I'm going to see a tradeoff where I can increase specificity at the cost of sensitivity, or vice versa. And you'll see a nice technique called ROC, the receiver operating characteristic curve, that gives you a sense of how you want to deal with that. And with that, we'll see you next time. We'll take your question offline, if you don't mind, because I've run over time. But we'll see you next time, when Professor Guttag will show you examples of this.
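To recap the measures from this closing passage, here is a sketch of PPV, sensitivity, and specificity, along with the degenerate "label everything" and "reject everything" extremes that illustrate the tradeoff. The counts used are made up for illustration:

```python
def ppv(tp, fp):
    """Positive predictive value: of everything labeled positive,
    what fraction really is positive."""
    return tp / (tp + fp)

def sensitivity(tp, fn):
    """Of the actual positives, what fraction did we correctly find."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Of the actual negatives, what fraction did we correctly reject."""
    return tn / (tn + fp)

# Label everything positive: nothing is missed (fn=0), so sensitivity
# is 1, but nothing is rejected (tn=0), so specificity is 0.
print(sensitivity(tp=30, fn=0))  # 1.0
print(specificity(tn=0, fp=70))  # 0.0
# Reject everything: specificity goes to 1, sensitivity goes to 0.
print(specificity(tn=70, fp=0))  # 1.0
print(sensitivity(tp=0, fn=30))  # 0.0
```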
