#16 Machine Learning Specialization [Course 1, Week 1, Lesson 4]
Let’s take a look at how you can actually implement the gradient descent algorithm. Let me write down the gradient descent algorithm. Here it is: on each step, w, the parameter, is updated to the old value of w minus alpha times this term, d over dw of the cost function J of w,b. So what this expression is saying is, update your parameter w by taking the current value of w and adjusting it a small amount, which is this expression on the right, minus alpha times this term over here. If you feel like there’s a lot going on in this equation, it’s okay, don’t worry about it, we’ll unpack it together. First, this equals notation here. Notice I said we’re assigning w a value using this equal sign. So in this context, this equal sign is the assignment operator. Specifically, in this context, if you write code that says a equals c, it means take the value of c and store it in your computer in the variable a. Or if you write a equals a plus 1, it means set the value of a to be equal to a plus 1, or increment the value of a by 1. So the assignment operator in coding is different than truth assertions in mathematics, where if I write a equals c, I’m asserting, that is, I’m claiming, that the values of a and c are equal to each other, and I would hopefully never write a equals a plus 1, because that just can’t possibly be true. So in Python and in other programming languages, truth assertions are sometimes written as equals equals. You may see code that says a equals equals c if you’re testing whether a is equal to c. In math notation as we conventionally use it, like in these videos, the equal sign can be used for either assignments or for truth assertions, and so hopefully it will be clear from the context when I write an equal sign whether we’re assigning a value to a variable or whether we’re asserting the truth of the equality of two values. Now let’s dive more deeply into what the symbols in this equation mean. This symbol here is the Greek letter alpha, and in this equation, alpha is also called the learning rate. 
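The distinction between assignment and a truth assertion shows up directly in Python, where a single equal sign assigns and a double equal sign tests equality:

```python
a = 2
c = 5
a = c          # assignment: take the value of c and store it in a
a = a + 1      # increment: a is now one more than it was
print(a == c)  # truth assertion: prints False, since a is now 6 and c is 5
```
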
The learning rate is usually a small positive number between 0 and 1, and it might be, say, 0.01. What alpha does is it basically controls how big of a step you take downhill. If alpha is very large, then that corresponds to a very aggressive gradient descent procedure where you’re trying to take huge steps downhill. And if alpha is very small, then you’ll be taking small baby steps downhill. We’ll come back later to delve more deeply into how to choose a good learning rate alpha. And finally, this term here, that’s the derivative term of the cost function J. Let’s not worry about the details of this derivative right now; later on, you’ll get to see more about the derivative term. For now, you can think of this derivative term that I drew a magenta box around as telling you in which direction you want to take your baby step. And in combination with the learning rate alpha, it also determines the size of the steps you want to take downhill. Now, I do want to mention that derivatives come from calculus, and even if you aren’t familiar with calculus, don’t worry about it. Even without knowing any calculus, you’ll be able to figure out all you need to know about this derivative term in this video and the next. One more thing. Remember, your model has two parameters, not just w, but also b. So you also have an assignment operation to update the parameter b that looks very similar: b is assigned the old value of b minus the learning rate alpha times this slightly different derivative term, d over db of J of w,b. Remember in the graph of the surface plot where you’re taking baby steps until you get to the bottom of the valley? Well, for the gradient descent algorithm, you’re going to repeat these two update steps until the algorithm converges. And by converges, I mean that you reach the point at a local minimum where the parameters w and b no longer change much with each additional step that you take. 
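Putting the pieces together, here is a minimal sketch of the gradient descent loop for a one-feature linear model f(x) = w*x + b with a squared-error cost. The derivative formulas inside the loop are the ones derived in a later video, so treat them as given for now; the function name, default values, and data shape are all hypothetical choices for illustration.

```python
def gradient_descent(x, y, w=0.0, b=0.0, alpha=0.01, num_iters=1000):
    """Repeat the two update steps until (approximately) converged."""
    m = len(x)
    for _ in range(num_iters):
        # derivative terms of the cost J(w, b) -- derived in a later video
        dj_dw = sum((w * x[i] + b - y[i]) * x[i] for i in range(m)) / m
        dj_db = sum((w * x[i] + b - y[i]) for i in range(m)) / m
        # simultaneous update: compute both right-hand sides first,
        # then copy them into w and b
        temp_w = w - alpha * dj_dw
        temp_b = b - alpha * dj_db
        w, b = temp_w, temp_b
    return w, b
```

Run on data lying on the line y = 2x, for example, the loop should drive w toward 2 and b toward 0, illustrating the "no longer change much" notion of convergence.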
Now, there’s one more subtle detail about how to correctly implement gradient descent. You’re going to update two parameters, w and b, right? So this update takes place for both parameters, w and b. One important detail is that for gradient descent, you want to simultaneously update w and b, meaning you want to update both parameters at the same time. What I mean by that is that in this expression, you’re going to update w from the old w to a new w, and you’re also updating b from its old value to a new value of b. The way to implement this is to compute the right-hand sides, computing this thing for w and for b, and simultaneously, at the same time, update w and b to the new values. So let’s take a look at what this means. Here’s the correct way to implement gradient descent, which does a simultaneous update. This sets a variable temp_w equal to that expression, which is w minus that term here. Let’s also set another variable, temp_b, to that, which is b minus that term. So you compute both right-hand sides, both updates, and store them into the variables temp_w and temp_b. Then you copy the value of temp_w into w, and you also copy the value of temp_b into b. Now, one thing you may notice is that this value of w is from before w gets updated. Here, notice that the pre-update w is what goes into the derivative term over here, okay? In contrast, here’s an incorrect implementation of gradient descent that does not do a simultaneous update. In this incorrect implementation, we compute temp_w same as before, so far that’s okay. And now here’s where things start to differ. We then update w with the value in temp_w before calculating the new value for the other parameter, b. Next, we calculate temp_b as b minus that term here, and finally we update b with the value in temp_b. 
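The two implementations can be contrasted in a few lines of Python. The cost J(w, b) = w² + w·b + b² below is a hypothetical example, chosen only because its derivative with respect to b depends on w (dJ/dw = 2w + b, dJ/db = w + 2b), so the ordering of the updates actually matters:

```python
def dj_dw(w, b):
    # partial derivative of J(w, b) = w**2 + w*b + b**2 with respect to w
    return 2 * w + b

def dj_db(w, b):
    # partial derivative with respect to b
    return w + 2 * b

alpha = 0.1  # learning rate

# Correct: simultaneous update -- both derivatives use the old w and b
w, b = 1.0, 1.0
temp_w = w - alpha * dj_dw(w, b)
temp_b = b - alpha * dj_db(w, b)
w, b = temp_w, temp_b
correct = (w, b)    # about (0.7, 0.7)

# Incorrect: w is updated before b's derivative is computed
w, b = 1.0, 1.0
temp_w = w - alpha * dj_dw(w, b)
w = temp_w                        # w changes too early...
temp_b = b - alpha * dj_db(w, b)  # ...so b's derivative sees the new w
b = temp_b
incorrect = (w, b)  # about (0.7, 0.73): b lands on a slightly different value
```

After one step, the two versions already disagree on b, which is exactly the subtle difference described above.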
The difference between the right-hand side and left-hand side implementations is that if you look over here, this w has already been updated to this new value, and it’s this updated w that actually goes into the cost function J of w,b. That means this term here on the right is not the same as this term over here that you see on the left. And that also means this temp_b term on the right is not quite the same as the temp_b term on the left, and thus this updated value for b on the right is not the same as this updated value for the variable b on the left. When gradient descent is implemented in code, it actually turns out to be more natural to implement it the correct way, with simultaneous updates. When you hear someone talk about gradient descent, they almost always mean the gradient descent where you perform a simultaneous update of the parameters. If, however, you were to implement the non-simultaneous update, it turns out it will probably work more or less anyway, but doing it this way isn’t really the correct way to implement it; it’s actually some other algorithm with different properties. So I would advise you to just stick to the correct simultaneous update and not use this incorrect version on the right. So that’s gradient descent. In the next video, we’ll go into the details of the derivative term, which you saw in this video but that we didn’t really talk about in detail. Derivatives are a part of calculus, and again, if you’re not familiar with calculus, don’t worry about it. You don’t need to know calculus at all in order to complete this course or this specialization, and you’ll have all the information you need in order to implement gradient descent. Coming up in the next video, we’ll go over derivatives together, and you’ll come away with the intuition and knowledge you need to be able to implement and apply gradient descent yourself. I think that’ll be an exciting thing for you to know how to implement. So let’s go on to the next video to see how to do that.