MaDL – Kullback-Leibler Divergence
We’ve seen how we can measure the uncertainty of a distribution, or the expected amount of information in a distribution. Let’s now look at how we can measure distances between distributions. In particular, we’re going to look at the so-called Kullback-Leibler divergence. Suppose we have two probability distributions, P(X) and Q(X), over the same random variable X. We can use the Kullback-Leibler divergence, also called KL divergence for short, as a measure of “distance”. I put the quotes here because, as we’ll see, it’s not a real distance metric.

We write this mathematically as follows: D as the symbol for divergence, KL for Kullback-Leibler, of distribution P and distribution Q, where one stands in front of and one after the vertical bars, D_KL(P‖Q). This is defined as the expectation, with X drawn from P, of the logarithm of P(X) over Q(X). It’s very similar to how the Shannon entropy was defined, except that now we have two distributions, in particular the ratio of two distributions. By the laws of the logarithm, we can also write this as the expectation of log P(X) minus log Q(X).

In the case of discrete variables, this measures the extra amount of information required to send a message containing symbols drawn from P(X) when we use a code that was actually designed to minimize the length of messages drawn from Q(X). Think back to the previous unit, where we looked at optimal message codes: if we design the code to be optimal for one distribution, but then send messages containing symbols drawn from another distribution, the KL divergence is that extra cost. This is how you can think of what the Kullback-Leibler divergence measures in the discrete case. And this is also where we’ll use it in lecture 9 of the deep learning lecture series, in the context of evaluating language models.

A property of the KL divergence is that it’s non-negative, and it’s 0 only if P(X) equals Q(X). However, the KL divergence is not a true distance measure, because it’s not symmetric. In other words, D_KL(P‖Q) is in general not the same as D_KL(Q‖P). Also, there are more divergences than just the KL divergence; the KL divergence is a special case of a larger class of so-called f-divergences, but we’ll not cover them in this short recap.

Of course, we can use this divergence both in the discrete and in the continuous case. In the discrete case we again, as before for the entropy, write the expectation as a summation over the state space of X: the sum over x of P(x) times log P(x)/Q(x). In the continuous case, we just replace the summation with an integration. And again, by convention, we treat 0 log 0 as equal to 0.

Here’s an example with two Gaussian distributions, where we measure the KL divergence between them. On the left, the distributions don’t overlap; they are quite dissimilar, and in this case the KL divergence will be large. On the right, in contrast, we have two distributions that overlap, in which case the KL divergence will be small.

Here’s another example with discrete distributions. On the left, we have a binomial distribution with parameter 0.4 and n = 2, which gives rise to a non-uniform distribution over three states. On the right, we have a distribution over the same domain that is uniform, with equal probability 1/3 on each of the three possible states.
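To make the two cases concrete, here is a minimal NumPy sketch (my own illustration, not from the slides) with one function for the discrete summation form and one for the closed-form KL divergence between two univariate Gaussians; the example parameters are assumptions chosen only to show the “large vs. small divergence” behaviour described above.

```python
import numpy as np

def kl_discrete(p, q):
    """D_KL(P || Q) for discrete distributions given as probability
    vectors over the same finite state space. Terms with p[i] == 0
    contribute 0, following the convention 0 * log 0 = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def kl_gaussian(mu1, sigma1, mu2, sigma2):
    """Closed-form D_KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))
    for univariate Gaussians."""
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

# Two well-separated Gaussians -> large divergence
print(kl_gaussian(0.0, 1.0, 5.0, 1.0))   # 12.5
# Two strongly overlapping Gaussians -> small divergence
print(kl_gaussian(0.0, 1.0, 0.5, 1.0))   # 0.125
```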
Now we can numerically calculate the value of the KL divergence. What we have to do is the summation over the probability times the logarithm of P over Q. Here, this amounts to the following: 0.36 (that’s P) times log of 0.36 over 0.33, plus 0.48 times log of 0.48 over 0.33, plus 0.16 times log of 0.16 over 0.33 (the logarithm is missing on the slide here). That’s roughly 0.085. But if we now swap the two and compute the KL divergence between Q and P instead of between P and Q, we obtain a value of 0.097. So we can see that the KL divergence is not symmetric.

This asymmetry also manifests itself here, where we can look at it a bit more explicitly. We observe that minimizing the divergence between P and Q, or between Q and P, with respect to Q does not lead to the same result. What we’re doing here is the following: we’re given the distribution in blue; that’s fixed, that’s the input, and we call it P. It’s the same on the left and on the right. Now we’re trying to find a parametric distribution, in this case a Gaussian drawn in green, that minimizes the KL divergence between the two. So we’re solving a minimization problem: we search for the distribution Q that minimizes the KL divergence to P, and we call the result Q star and draw it in green. On the left-hand side we minimize D_KL(P‖Q), and on the right-hand side we minimize D_KL(Q‖P). You can see how the green curves differ, despite P being the same on the left and on the right.

What happens intuitively is that on the left, we select a Q star that has high probability where P has high probability. In other words, this leads to a blurring of multiple modes in order to put high probability on all of them: we want a distribution with high probability on the first mode and high probability on the second mode, and because we assume Q is a Gaussian, the best we can do is spread the probability mass roughly uniformly over both modes. We don’t care so much that, in doing so, we assign a lot of probability mass where P actually has none.

However, if we turn this around and minimize D_KL(Q‖P), we select a Q that has low probability where P has low probability. We still want to find a Q that fits the blue distribution well, but now it’s more important that wherever P has low probability, Q also has low probability. So spreading mass across both modes is not a viable solution, and what happens in this case is that the minimizer, depending on the initialization, picks one of the modes, instead of spreading the probability mass as in the previous case, just to make sure that there’s no probability mass in the low-probability region. This solution has the same KL divergence as the one we would obtain by placing the distribution over the second mode. So this illustrates the asymmetry of swapping the arguments of the KL divergence.

There’s also a close relationship to the so-called cross-entropy, which is pervasive in machine learning and deep learning and which we often use as a loss function. The cross-entropy is denoted as H(P, Q).
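As a quick check of these numbers, the following sketch recomputes both directions with NumPy. It assumes natural logarithms, which reproduce the stated values 0.085 and 0.097, with P the binomial distribution (n = 2, parameter 0.4) and Q the uniform distribution over the three states.

```python
import numpy as np

def kl_discrete(p, q):
    """D_KL(P || Q) for discrete probability vectors, natural log."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# P: binomial with n = 2 and parameter 0.4 -> probabilities (0.36, 0.48, 0.16)
p = np.array([0.6**2, 2 * 0.4 * 0.6, 0.4**2])
# Q: uniform over the same three states
q = np.full(3, 1 / 3)

print(kl_discrete(p, q))  # ~0.085
print(kl_discrete(q, p))  # ~0.097  -> the KL divergence is not symmetric
```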
And this is the expected value, with x now drawn from P, of the negative logarithm of Q(x). That’s why it’s a cross-entropy: in contrast to the entropy, we now have two different symbols here, P and Q; they are not the same. This is the term highlighted in red here. What we can do now is a simple extension: we add a term and subtract the same term again, so nothing has changed. And because the expectation is a linear operator, we can group these together, which gives us the expectation of log P(x) minus log Q(x), and that is exactly the Kullback-Leibler divergence as we’ve seen before. What remains is the term on the left, which, as we now see, is the entropy of P.

So the cross-entropy between P and Q is the entropy of P plus the Kullback-Leibler divergence between P and Q: H(P, Q) = H(P) + D_KL(P‖Q). There’s a direct relationship between the cross-entropy and the Kullback-Leibler divergence. What you can also see is that minimizing the cross-entropy with respect to Q is equivalent to minimizing the Kullback-Leibler divergence with respect to Q, because the entropy of P does not depend on Q; there’s no Q in that term, so it’s constant with respect to Q. And this is also where the intuition comes from that if you minimize the cross-entropy loss, what you effectively do is minimize the difference, the distance, the Kullback-Leibler divergence, between the model distribution and the empirical data distribution. So it’s quite useful.
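The decomposition H(P, Q) = H(P) + D_KL(P‖Q) can also be checked numerically on the discrete example from above; this is my own small verification, again assuming natural logarithms.

```python
import numpy as np

p = np.array([0.36, 0.48, 0.16])   # the binomial example from above
q = np.full(3, 1 / 3)              # the uniform distribution

entropy_p     = -np.sum(p * np.log(p))   # H(P)
cross_entropy = -np.sum(p * np.log(q))   # H(P, Q) = E_{x~P}[-log Q(x)]
kl_pq         =  np.sum(p * np.log(p / q))  # D_KL(P || Q)

# H(P, Q) equals H(P) + D_KL(P || Q); here both are ~1.099
# (= log 3, since Q is uniform over three states).
print(cross_entropy, entropy_p + kl_pq)
```

Since H(P) does not depend on Q, any optimizer that minimizes `cross_entropy` over a family of distributions Q finds the same minimizer as one that minimizes `kl_pq`, which is the equivalence stated above.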