#12 Machine Learning Specialization [Course 1, Week 1, Lesson 3]
We've seen the mathematical definition of the cost function. Now let's build some intuition about what the cost function is really doing. In this video we'll walk through one example to see how the cost function can be used to find the best parameters for your model. I know this video is a little bit longer than the others, but bear with me; I think it will be worth it.

To recap, here's what we've seen about the cost function so far. You want to fit a straight line to the training data, so you have this model: f_w,b of x is w times x plus b. Here the model's parameters are w and b. Depending on the values chosen for these parameters, you get different straight lines like this, and you want to find values for w and b so that the straight line fits the training data well. To measure how well a choice of w and b fits the training data, you have a cost function J, and what the cost function J does is measure the difference between the model's predictions and the actual true values for y. What you'll see later is that linear regression tries to find values for w and b that make J of w, b as small as possible. In math, we write it like this: we want to minimize J as a function of w and b.

Now, in order to better visualize the cost function J, let's work with a simplified version of the linear regression model. We're going to use the model f_w of x is w times x. You can think of this as taking the original model on the left and getting rid of the parameter b, or setting the parameter b equal to zero, so it just goes away from the equation. f is now just w times x. You now have just one parameter, w, and your cost function J looks similar to what it was before, taking the difference and squaring it, except that now f of x_i is equal to w times x_i, and J is now a function of just w. The goal becomes a little bit different as well, because you have just one parameter, w, not w and b.
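To make the simplified setup concrete, here is a minimal sketch of the squared-error cost for the model f_w of x equals w times x; the function and variable names are illustrative choices, not taken from the course's own code.

```python
# Minimal sketch of the squared-error cost J(w) for the simplified
# model f_w(x) = w * x (parameter b set to zero).
# Names here are illustrative, not from the course's code.

def compute_cost(w, x, y):
    """J(w) = (1 / (2m)) * sum over i of (w * x_i - y_i)^2."""
    m = len(x)
    total = 0.0
    for x_i, y_i in zip(x, y):
        f_x = w * x_i              # the model's prediction for example i
        total += (f_x - y_i) ** 2  # squared error for example i
    return total / (2 * m)
```

With the three-point training set used in this lesson, `compute_cost(1.0, [1, 2, 3], [1, 2, 3])` returns 0.0, matching the perfect-fit case worked out next.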
So with this simplified model, the goal is to find a value for w that minimizes J of w. To see this visually, what this means is that if b is set to zero, then f defines a line that looks like this, and you see that the line passes through the origin, because when x is zero, f of x is zero too. Now, using this simplified model, let's see how the cost function changes as you choose different values for the parameter w. In particular, let's look at graphs of the model f of x and the cost function J. I'm going to plot these side by side, and you'll be able to see how the two are related.

First, notice that for f subscript w, when the parameter w is fixed, that is, it's always a constant value, then f_w is a function only of x, which means that the estimated value of y depends on the value of the input x. In contrast, looking to the right, the cost function J is a function of w, where w controls the slope of the line defined by f_w. So the cost defined by J depends on a parameter, in this case the parameter w.

So let's go ahead and plot these functions, f_w of x and J of w, side by side, so you can see how they are related. We'll start with the model, that is, the function f_w of x, on the left. Here the input feature x is on the horizontal axis and the output value y is on the vertical axis. Here's a plot of three points representing the training set, at positions (1, 1), (2, 2), and (3, 3). Let's pick a value for w, say w is 1. For this choice of w, the function f_w looks like this straight line with a slope of 1. Now what you can do next is calculate the cost J when w equals 1. You may recall that the cost function is defined as follows: it's the squared error cost function. If you substitute f_w of x_i with w times x_i, the cost function looks like this, where this expression is now w times x_i minus y_i. For this value of w, it turns out that the error term inside the cost function, this w times x_i minus y_i, is equal to 0 for each of the three data points.
Because for this data set, when x is 1, then y is 1, and when w is also 1, f of x equals 1. So f of x equals y for the first training example, and the difference is 0. Plugging this into the cost function J, you get 0 squared. Similarly, when x is 2, then y is 2 and f of x is also 2, so again f of x equals y for the second training example, and in the cost function the squared error for the second example is also 0 squared. Finally, when x is 3, then y is 3 and f of 3 is also 3, so in the cost function the third squared error term is also 0 squared. So for all three examples in this training set, f of x_i equals y_i for each training example i, and f of x_i minus y_i is 0. For this particular data set, when w is 1, the cost J is equal to 0.

Now what you can do on the right is plot the cost function J. Notice that because the cost function is a function of the parameter w, the horizontal axis is now labeled w and not x, and the vertical axis is now J and not y. So you have J of 1 equals 0; in other words, when w equals 1, J of w is 0. Let me go ahead and plot that.

Now let's look at how f and J change for different values of w. w can take on a range of values, right? w can take on negative values, w can be 0, and it can take on positive values too. So what if w is equal to 0.5 instead of 1? What would these graphs look like then? Let's go ahead and plot that. Let's set w equal to 0.5, and in this case, the function f of x now looks like this: a line with a slope equal to 0.5. Let's also compute the cost J when w is 0.5. Recall that the cost function is measuring the squared error, or difference, between the estimated value, that is, y hat i, which is f of x_i, and the true value, that is, y_i, for each example i. Visually, you can see that the error, or difference, is equal to the height of this vertical line here when x is equal to 1.
Because this little line is the gap between the actual value of y and the value that the function f predicted, which is a bit further down here. So for this first example, when x is 1, f of x is 0.5, so the squared error on the first example is 0.5 minus 1, squared. Remember, the cost function sums over all the training examples in the training set. So let's go on to the second training example. When x is 2, the model predicts f of x is 1, but the actual value of y is 2. So the error for the second example is equal to the height of this little line segment here, and the squared error is the square of the length of this line segment, so you get 1 minus 2, squared. Let's do the third example. Continuing this process, the error here, also shown by this line segment, is 1.5 minus 3, squared. Next we sum up all of these terms, which turns out to be equal to 3.5. Then we multiply this sum by 1 over 2m, where m is the number of training examples. Because there are three training examples, m equals 3, so this is equal to 1 over 2 times 3, where this m here is 3. If we work out the math, this turns out to be 3.5 divided by 6, so the cost J is about 0.58. Let's go ahead and plot that over there on the right.

Now, let's try one more value for w. How about w equals 0? What do the graphs for f and J look like when w is equal to 0? It turns out that if w is equal to 0, then f of x is just this horizontal line that lies exactly on the x axis. And so the error for each example is a line that goes from each point down to the horizontal line that represents f of x equals 0. So the cost J when w equals 0 is 1 over 2m times the quantity 1 squared plus 2 squared plus 3 squared, and that's equal to 1 over 6 times 14, which is about 2.33. So let's plot this point, where w equals 0 and J of 0 is 2.33, over here. And you can keep doing this for other values of w. Since w can be any number, it can also be a negative value.
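The three costs worked out above can be double-checked numerically. This is a small sketch using the same three-point training set; the helper name `cost_j` is an illustrative choice, not from the course code.

```python
# Check the costs worked out above for w = 1, w = 0.5, and w = 0,
# on the training set (1, 1), (2, 2), (3, 3).
# The helper name cost_j is illustrative, not from the course code.

x_train = [1, 2, 3]
y_train = [1, 2, 3]

def cost_j(w):
    m = len(x_train)
    return sum((w * x - y) ** 2 for x, y in zip(x_train, y_train)) / (2 * m)

print(round(cost_j(1.0), 2))  # 0.0  : perfect fit
print(round(cost_j(0.5), 2))  # 0.58 : 3.5 / 6
print(round(cost_j(0.0), 2))  # 2.33 : 14 / 6
```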
So if w is negative 0.5, then the line f is a downward-sloping line like this. It turns out that when w is negative 0.5, you end up with an even higher cost, around 5.25, which is this point up here. And you can continue computing the cost function for different values of w and so on and plot these, right? So it turns out that by computing the cost for a range of values, you can slowly trace out what the cost function J looks like, and that's what J is.

To recap, each value of the parameter w corresponds to a different straight-line fit, f of x, on the graph to the left, and for the given training set, that choice for a value of w corresponds to a single point on the graph on the right, because for each value of w you can calculate the cost J of w. For example, when w equals 1, this corresponds to this straight line fit through the data, and it also corresponds to this point on the graph of J, where w equals 1 and the cost J of 1 equals 0. Whereas when w equals 0.5, this gives you this line, which has a smaller slope, and this line, in combination with the training set, corresponds to this point on the cost function graph at w equals 0.5. So for each value of w you wind up with a different line and its corresponding cost J of w, and you can use these points to trace out this plot on the right.

Given this, how can you choose the value of w that results in the function f fitting the data well? Well, as you can imagine, choosing a value of w that causes J of w to be as small as possible seems like a good bet. J is the cost function that measures how big the squared errors are, so choosing the w that minimizes these squared errors, making them as small as possible, will give us a good model. In this example, if you were to choose the value of w that results in the smallest possible value of J of w, you would end up picking w equals 1. And as you can see, that's actually a pretty good choice: it results in a line that fits the training data very well.
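Tracing out J by sweeping w, as described above, can be sketched like this; the grid of w values is an arbitrary illustrative choice, and any reasonably fine range around 1 would do.

```python
# Sweep a grid of w values, trace out the cost J(w), and pick the w
# with the smallest cost, as in the lesson. The grid is illustrative.

x_train = [1, 2, 3]
y_train = [1, 2, 3]

def cost_j(w):
    m = len(x_train)
    return sum((w * x - y) ** 2 for x, y in zip(x_train, y_train)) / (2 * m)

w_grid = [i / 10 for i in range(-10, 31)]   # w from -1.0 to 3.0 in steps of 0.1
costs = [cost_j(w) for w in w_grid]
w_best = w_grid[costs.index(min(costs))]
print(w_best, cost_j(w_best))               # 1.0 0.0 -- the minimizing slope
```

For this data set the sweep lands exactly on w equals 1 with zero cost; it also reproduces the cost of 5.25 mentioned above for w equals negative 0.5.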
So that's how, in linear regression, you use the cost function to find the value of w that minimizes J. Or, in the more general case, when we have parameters w and b rather than just w, you find the values of w and b that minimize J. To summarize, you saw plots of both f and J and worked through how the two are related. As you vary w, or vary w and b, you end up with different straight lines, and when that straight line passes close to the data, the cost J is small. So the goal of linear regression is to find the parameters w, or w and b, that result in the smallest possible value for the cost function J. Now, in this video we worked through an example with a simplified problem, using only w. In the next video, let's visualize what the cost function looks like for the full version of linear regression, using both w and b, and you'll see some cool 3D plots. Let's go to the next video.