# 11-785, Fall 22 Lecture 21: Variational Auto Encoders

All right, let’s wait for a few seconds and we’ll start. Okay. 50 students. All right. So morning everybody. We’re going to start our new sequence of lectures on neural networks for modeling distributions. So what we’ve seen so far is that neural networks are universal approximators. They can model Boolean functions, classification functions. They can perform regressions. They can be feature extractors, classifiers, predictors. In all of these previous cases, we considered neural networks that are effectively functions. They take some input and generates some output. They can operate on a process are given input data and they can learn to perform these tasks from data. But then, can neural networks also be used to generate data and can they learn to generate data from examples? This is the topic for our next series of lectures. So here’s the problem. We’re given a large collection of images, say, of faces. And we’d like the network to examine all of these pictures and learn how to generate a new portrait. Basically, it must learn the distribution of face images and learn to generate new samples from this distribution. And it’s unclear how we can even characterize this distribution. Or it might learn from a large collection of landscape pictures. And based on these, it might learn the distribution of landscape pictures and learn how to generate new completely novel pictures by sampling from this distribution. So this is the problem of creating neural networks that act as generative models. What we’ve seen so far is that neural networks can perform classification or regression, MLP, CNNs and RNNs. The next step is neural networks is generic generative models that can model the distribution of any data. Such that we can draw samples from this distribution. But first, before we begin talking about neural networks as generative models, we must first explain what a generative model is. And how to estimate them. And in the process, we look briefly at expectation maximization. So what is a generative model? A generative model is the model for the probability distribution of some data. For example, a multinomial distribution or a Gaussian distribution. The computational equivalent is a model that can be used to generate data with a distribution that’s similar to some given data. So we are in a typical setting. We have a box that takes in random seeds and outputs random samples, which are statistically inding is in distinguishable from the data that we are trying to model, whose distribution we are trying to model. There’s a secondary aspect over here. What is the seed itself? That’s something that we won’t actually get into. But then there’s an extension to the model that we will be looking at in the next couple of classes that deals with that issue. So what is the problem of learning a generative model for data? You’re given some collection of observed data X. And we have to model the distribution for this data. Now the actual distribution may never be known, but we’re going to model it. And so we’re going to choose some parametric form, which I’ve represented as p of X semicolon theta. So this is the semicolon theta has to indicate that this probability distribution function p of X has a set of parameters represented by theta. And that by manipulating this theta, you can change the distribution. And so now our problem is to estimate these parameters theta such that the model distribution best fits the observations that we have. And the hope is that if we can make this distribution fit the observations we have, it will somehow also represent data outside the training set that we are using to learn the distribution. So let’s consider a few examples. Consider this example of someone rolling a dice. So let’s say this person rolls the dice a large number of times. And then this and then you plot the histogram of outcomes for the dice. So here are the six sides. And each bar represents the number of times that particular outcome was observed goes without saying that the dice is loaded and it’s not fair. So now given the choice between the two distributions to the right, which of these two distributions do you think is more likely to be the true distribution for this dice? Anyone? Number two, right? Why two? Why number two? Look similar, right? Here’s another example. The left figure shows the histogram of a collection of observations. And it has a nice bell shape. So we decide to model the distribution as a Gaussian. Now the parameters of a Gaussian are the mean and the variance. So when we model the distribution as a Gaussian, we’re basically saying we’re going to try to compute the mean and the variance of the Gaussian that best represents this distribution. Now I’m showing three different distributions to the right, the green, the pink and the yellow or three out of Gaussian, which of these three Gaussian’s do you think is most likely to represent the true distribution of the of the process that generated this data? Anyone? Red one, right? And once again, why? Why do you think it’s the red one? Looks like, can we quantify looks like? Can we quantify? It’s the observation, right? Lost? Why do you think it’s not the yellow one or not the red one or not the green one? The mean and variance. Can you give me something more? What is the underlying principle there? Why do you want the mean and variance to be similar? Yeah, but why do you want them to be so? So there is an underlying principle, right? Which you are not quite able to articulate. And that is this. And assuming of course that the data are generated by draws from the distribution. So the generating process draws from the distribution, but then we have an underlying assumption. And the assumption is that the world is a very, very, very, very boring place. And so boring that the data that you observed are typical of the process that generates these data. And so in fact, we’re going to assume that the data are so typical of the process that generates these data. And so if you’re given a collection of different processes and asked to choose which of these process could have generated the data, you’re going to choose the process for which these distributions are most typical. So basically, we’re assuming that these data are very typical of the process that you’re trying to characterize. And so you are going to choose the process that has the for which these data are most typical of the process for which the distribution that has the highest probability of generating this data. It assigns lower probability to less frequent observations and vice versa. So the assumption here is that the data that you observed are very typical. The world is a very boring place. And pretty much everything that you just stated is some expansion or some, some is based on this assumption that this is typical of the distribution for the process, which is why you chose the red one or that this histogram is what you would expect that dies to roll, which is why you chose the green one. And so this assumption can be quantified in terms of what is called the maximum likelihood principle. We are not choosing the data that are most likely for the distribution. Instead, given a number of distributions, which could be infinite, you’re choosing the particular distribution that has the highest likelihood of having generated this data. This is the maximum likelihood principle. And so here’s how the maximum likelihood principle works. You’re given a collection of data like this multinomial data or this cows, you know, observations or these continuous valued observations with a bell shaped histogram. And then you’re going to try to explain these data as observations from some distribution. Now this, this distribution is parametrized by the parameters theta. And so you’re going to choose the theta such that the probability of these observations is maximized or, importantly, because the logarithm as a monotonic function, the log of the probability of this collection of observations is maximized. And so this concept that you just sort of almost articulated is the concept of maximum likelihood. You’re going to choose the distribution that has the maximum likelihood of having generated the data that you just saw. And let’s put this to work in these two examples. In the case of the histogram, you have a dice. So the dice has the parameters of the dice and the probabilities of the six faces. And so the probability of generating this particular set of observations is going to be the probability of the observation one. Race to n one times the probability of the observation two, race to n two and so on. Similarly, the probability of this collection of observations for the Gaussian is going to be the product of the probabilities of the individual observations as computed from this Gaussian. And so if you write those down, the probability of this collection of observations is given by this term here is the product over all possible outcomes one through six of the probability of the outcome raised to the number of times that outcome was obtained. Or in this case, it’s the product over all observations of the Gaussian computed at that observation. Now, we’re taking the logs over here for convenience. And this term here is what we will maximize with respect to the piece. Similarly, over here, this term here is what we’re going to maximize with respect to the mean and the variance. And when you work through the mathematics, it’s very trivial. The this the log of the product of P irase to n i is simply going to be the sum over all observations of the number of times the observation was observed times the log of the probability of that particular outcome. And so this the sum over all outcomes of the number of times it was obtained times the log of its probability. This when maximized over all the probabilities of all the outcomes for the dice is going to give you this result. It will tell you that your best guess for the probability of the outcome i is the ratio of the number of times that outcome was observed to the total number of observations. Similarly, over here, when you take the logarithm of the logarithm of the product of the Gaussian computed at several points, the log of a product of course is a sum of logs. So this is going to be the sum over all observations of the log of the Gaussian computed at the observation, which is given by this term here. And when you maximize this with respect to the mean, you’re going to find that the estimate for the mean of the Gaussian is just the average of the data and the estimate for the variance of the Gaussian is your familiar equation for the variance that you’ve seen in high school. Now all of these these estimates that you see over here, they are probably what you would have come up with intuitively without even thinking about it. But in fact, these are the outcomes of maximum likelihood estimation. So any questions? Questions so far? Is the concept of maximum likelihood estimation clear to you guys? Raise your hand if it is. Say yes or no, right? Okay. At least of you. Okay, so here’s your first part. Okay, 10 seconds guys. And someone answered the first question to our false. In second, also true, right? Maximum likelihood estimation of probability distributions is based on the theory that the world is in fact a terribly going place. And it estimates the values of the parameters of a probability distribution such that they maximize the probability of the training data. Now this is all very fine. The problem is sometimes the data provided are incomplete. They may be insufficient to write out the complete log probability. And so as a result, they are insufficient to estimate your model parameters directly. In the previous examples that we just saw, I could just write out the log probability. I could differentiate them, equate the derivatives to zero and I caught an answer, but when the data are incomplete, this is not going to be possible. And why would the data be incomplete? This could be because the data themselves have missing components like data vectors might have some missing components or because of the structure of the model, like mixture models are multi-stage additive models. So to give you an idea, let’s take a consider this one, where the data themselves have missing components. Let’s say you have a collection of vectors which you assume have been generated by a Gaussian distribution. And so your problem is that of estimating the parameters of this Gaussian distribution. Unfortunately, in each of these vectors, there are some components that got obscured, that got hidden, and so you couldn’t see them. Those are the components shown by these little black patches. As a result, your data are incomplete. You have many vectors, but the vectors are themselves incomplete. You don’t have many of the vectors are not fully seen. And it is from this collection of incomplete vectors that you must estimate the parameters of this Gaussian. Now remember, the Gaussian itself is not defined on an incomplete vector. The Gaussian is defined on the complete vector. So the original problem that we just saw earlier was to estimate the Gaussian given a collection of complete vectors. You were trying to find the mean and the variance that maximized the log probability of a collection of complete vectors, which was the sum over all the observations in the collection of the log probability of the individual vectors in the collection. Unfortunately, in this case, many of our vectors are missing components. In fact, these are the actual data that you have. For each vector, you only have an incomplete collection of components represented by O1, the little vector, little character O, which represents observed. So each, you only have a subset of observed components for each vector. And you have, you’re missing some other subset of components from each of the vectors. And so now the actual problem is that you have to estimate the entire Gaussian, but only from this collection of incomplete vectors. Now, ideally, you’d have the complete data. And the complete data would include both the observed and the missing components. So for each vector x, you would see both the OI, which is the set of observed components for the vector and MI, which is the set of missing components for the vector. So this is what you would ideally see because that’s what the Gaussian is defined on. Unfortunately, you only have incomplete data. And the principle of maximum likelihood estimation says that it is what you have observed that are typical, right? They’re the outcome of a boring word. So you must maximize the likelihood of the observed data, which means that you must find the mean invariance of the Gaussian such that this term is maximized, which is the sum over all vectors of the log of the probability of the observed components of the vector. Unfortunately, the Gaussian is not defined on just subset of components. The Gaussian in fact is defined on the entire vector. And so in order to compute the log of the probability of observed components, this must, this P of O must somehow be derived from P of x. And how can we derive this? Now, every vector x is a combination of observed of the observed as a missing components. So if you want just the probability of the observed components, then you must marginalize out the missing components, which is to say you integrate out the missing components from the distribution of the complete vector from the probability of the complete vector. And so the log probability of the entire observed training data is the sum over all the vectors of the log of this integral, which is the probability of the complete vector where the missing components have been marginalized out. And so this is what we have to maximize. So our maximum likelihood estimation now will require us to find the mean invariance such that this term here is maximized, which is the sum over all the observed vectors of the log of the integral of the complete vector of the probability of the complete vector where the missing components are integrated out. And so this requires the maximization of the log of an integral. And this doesn’t have a close form. It’s really, really difficult to do even at the best of circumstances. So this was an issue situation where the data had missing components. We can also have incomplete information. Okay, similar is this entropy maximization. Here the relationship between entropy and likelihood is well known, but we won’t get into that yet. Here it is similar. It’s like saying I know nothing about these missing components. So that would be maximizing the entropy of the missing components. But then here there’s an also an estimation aspect to it, which makes it somewhat different. Luxured on say a question, if your data dimension was D, then some components might be missing. For example, if these were a factor of some data obtained from the census, but then some people chose not to reveal their income, then this is going to be a missing component in the data. That on say a question. Right. So here this was a situation where the data incomplete. You can also have incompleteness because of the structure of the models. Here’s another example. In this example, we have a Gaussian mixture where the genetic, that we have a district data was distribution of fairly complicated. We are modeling this as a mixture of Gaussian, but the generative process for this mixture Gaussian model says that the data could have come from one of a number of different Gaussian’s. In order to generate any given observation, the process first selects one of these Gaussian’s according to a distribution P of K. Then from the selected Gaussian, it generates an observation. All you get to see is the final observation. You don’t get to see which Gaussian was chosen. From this collection of observations that you have, where each observation could have come from a different Gaussian and you don’t know which Gaussian it came from, you must somehow estimate the parameters of all of the individual Gausses. So which is to say you need to learn the means and the variance and the variances of the individual Gausses. Now over here, the data that you actually needed to precisely learn the model was this. Along with each data vector, you would need ideally also know which Gaussian it was drawn from. So for example, if your model said that your data came from either a mixture of a green, a red or a blue Gaussian, then along with each vector, you would also be given the identity of which Gaussian was selected when that vector was drawn. So if you had this data, where for each vector, the specific Gaussian that was chosen to generate that observation was also known, then the answer is very simple. I can just pull up all of the blue vectors, all the vectors that came from the blue Gaussian and separate them out. I can separate out the vectors that came from the red Gaussian and then I can separate out the vectors that came from the green Gaussian, that from this collection of blue vectors. I can compute the mean and variance of the blue Gaussian. From the collection of the red vectors, I can compute the mean and variance of the red Gaussian and from the collection of green vectors I can compute the mean and Laden So the green Gaussian. So this is easy. If I knew which Gaussian each vector came from. Unfortunately, when you get the data, you are not actually informed which Gaussian generated each vector. So from what we see over here, ideally for each observation, we also want another Gaussian. So what we want is the union of the observation and the Gaussian it came from. What we have is just the collection of observations. And so once again, using the maximum likelihood principle, you would be maximizing the log likelihood of just the observed data because that is what and goes into the ML principle, right? But in order to compute, which is to say, you’d be estimating the means and variances of all of the Gaussian’s to maximize this term over here. The sum over all observations of the log of the probability of the observation. But then when you don’t know which Gaussian any individual vector came from, then you have to marginalize out the Gaussian. So any vector in drawing a vector, the process goes through a two step process. The generator goes through a two step process. First, it chooses a Gaussian and then it draws an observation. So you have the distribution, the actual probability is the joint probability of the Gaussian and the observation. You don’t know which Gaussian was chosen to generate a specific observation. So you have to sum out, marginalize out the Gaussian. And so the probability of any observation is the sum over all Gaussian’s of the probability of having chosen the Gaussian times the probability that the observation was obtained from that particular Gaussian. And so when we perform maximum likelihood estimation, we are going to be trying to estimate the means and variances of all of the Gaussian’s such that this term is maximized. The sum over all observations of the log of this term over here, this term inside is the probability of the observed vector. But then that itself is computed by summing over all possible Gaussian’s. So this includes the log of a sum. And when you try to maximize the log of a sum, this becomes, this is very challenging and it can become quite impossible to solve in close to five. So the general reform of the problem over here is that you have some data or variables that are missing. These are terms that had you had then the estimation would have been trivial. But these are missing. They are hidden from you for some reason. And so this results in a maximum likelihood estimate of the form where you are maximizing the log of the marginal probability of the actual observed data, where you are marginalizing out the missing variables or components by summing them out of the probability. And this inner summation could also be an integral of depending on whether this hidden variable is continuous or discrete. And now this maximization is the maximization of the sum of the log of the maximization of the log of a sum or an integral. And this makes estimation challenging. There’s no close form solution. So we need efficient iterative algorithms. Now the real problem here is that we are trying to maximize the log of an integral or the log of a sum. And this is challenging. So the question is can, sure, we can’t maximize this directly, but can we somehow get an approximation to this that is more tractable, that is without a summation or an integral within the log. So that is what we’re going to try to do. So any questions? Questions so far? Any? No? Okay. So if you got what we’ve spoken so far, discuss so far, reasons. Just so I know I’m making sense, at least to some small fraction of it. Okay. All right. So now here’s what we’re going to do. We’re going to try to simplify the problem. Okay. Here’s what we have. We have the log of the marginal probability of the observations, which is the log of the sum of integral over the missing or head and variables age of the joint probability of H and O. But then I can multiply and divide by a second term, Q of H, which only depends on the hidden variable age. And when I do this, this term over here is identical to the term on the left, which in turn and is therefore identical to log of uniform. Is this making sense to you guys? This one raise your hands. This is trivial. I want to see what 50 hands raised. What is one we have? 100 people in class. I’ll wait. This is the easiest one. Okay. 30. I’ll take 30. 25% of the class gets it. Okay. So now here, this is true regardless of Q of H. But we’re going to choose Q of H as that Q of H is actually a probability distribution over H, meaning it’s always non negative and it sums to 1. Right. So now by the concavity of the logarithm function, when you have log of the sum over H of Q of H times something, we have this, do I have the equation for some reason that picture is missing? Let me see if I can get the picture here. One second. This is okay. Yeah. So now the log function is concave. What do we mean by saying the log is concave? Okay. Here is the log of X, this blue curve. Right. Now consider two values X1 and X2. Any point between X1 and X2 can be written as Q times X1 plus 1 minus Q times X2, but Q is a number between 0 and 1. When Q is 0, this is equal to X1. And when Q is 1, it’s equal to X1, when Q is 0, it’s equal to X2. So any point on this is going to be Q times X1 plus 1 minus Q times X2. Now if I draw a line between log of X1 and log of X2, the point over here where this vertical line intersects this line is going to be Q of log X1 plus 1 minus Q of log X2, because it’s on this line. On the other hand, the point on the curve is the log function computed at this X, which is log of Q X1 plus 1 minus Q X2. And what you observe from here is that the log function is always above the lines. The log of Q X1 plus 1 minus Q X2 is greater than or equal to Q times log of X1 plus 1 minus Q times log of X2. So more generally, for any set of XIs, you’re going to have where if you have Q such that the Qs are non-negative and sum to 1, you’re going to find that log of summation iQI XI is going to upper bound summation iQI log XI. So is this concavity making sense to you guys? Raise your hands if this concavity is making sense. Right? So, all right. So now, because of that, just directly applying this formula here, I have log of the summation of QHP, pH over QH is going to upper bound. And this we know is this log of PFO, correct? This is going to upper bound summation over HQH log of P over Q. So, basically, as we saw over here, log of summation QX is greater than or equal to summation Q log X, where Q’s are a distribution. So log of summation Q this P over Q is greater than or equal to Q times log of P over Q. So, in other words, this term over here to the right, which is the summation over H of Q of H times log P over Q, this is a lower bound on log of P over. It’s also called a variational lower bound or also sometimes called the evidence lower bound for various reasons. So, what is the point of having a lower bound? So, here’s what we have. Even if I have a parametric model, the log of the probability of an observation is going to be lower bounded by this summation over HQH log of P over Q, right? Now, what is what is having a lower bound mean? If I have some function, observe that both the term to the left and the right are functions of theta, right? So, let’s say this red curve is my, is the term to the left, this function of theta. This blue curve is the function to the right, which is this term over here, which is also a function of theta. Now, this inequality is going to hold regardless of theta, right? And so, what this means is that the lower bound is always the lower bound curve, which is shown by the blue curve, is always going to be below the original log likelihood curve, which is shown by the red curve. And we like to have tight lower bounds. A tight lower bound is something of a discount, which hugs the upper red curve as closely as possible. When you have a tight lower bound, then the maximum of the lower bound will generally can be expected to be close near the maximum of the original function itself. So, what we want to do is to make a tight lower bound. And so, this gives us a two step estimation process. We know, for by the concavity of the log function, that log of P of 4 upper bounds this term, but we haven’t specified what Q of H is. Q of H can be anything, right? So, in the first step, we’re going to find the Q of H that maximizes the right hand side. You know that the largest value it can take is log of P of 4, right? So, I can just blindly maximize this using any maximization process and try to estimate Q of H, which is a distribution that maximizes this guy. And that’s going to give me a tight lower bound on log P of. So, once I’ve found those QHS, I’m going to fix the QH and then maximize the right hand side with respect to theta. And that gives me my next estimate for theta. So, here is the, so what maximizes, how do I maximize Q of H? For any P of H, P of H, P of H, O. So, for any function of this kind, any probability distribution P H O semicolon theta, it’s easy to show that the optimal Q of H is P of H given O semicolon theta. How is that? When I’m saying semicolon theta to say, I’m sure that it’s P of H given O as computed using this parametric function. How is that? You can see that Q of H log P over Q using this term is going to be P of H given O times log of P of H comma O divided by P of H given O. But then by Bay’s rule, P of H, P of H comma O is going to be P of H given O times P of O. So, log of P of H comma O divided by P of H given O is simply log of P of O. So, this term over here is simply P of O. And so, this entire term is going to be summation over H, P of H given O times log of P of O. And so, this log of P of O is not a function of H. So, this can come out. And so, it’s log of P of O times the summation over H, P of H given O. This is the sum of a probability distribution. It becomes 1. And so, this is just log of P of O. And so, clearly, if I choose Q of H to be exactly equal to P of H given O, then this log of Bound actually exactly becomes P of O. There is no other Q of H which is going to make this log of Bound greater than P of O because this is really a log of Bound. So, in other words, if I choose Q of H equals P of H given O, then this term takes the maximum value that it can get. And so, in the first step, in an iterative process, when I determine a Q of H that maximizes this right hand side, I’m going to have some initial estimate for my theta. And then using that estimate for theta, I’m going to use a computer Q of H that maximizes this guy. And the best case value is the conditional probability of H given O using the current estimate of theta. And then I go to the second step where I fix this Q of H and maximize this right hand side with respect to theta. And so, here, I’m going to say theta. Now, log of P over Q is simply log of P minus log of Q, but this is not a function of theta. So, I can zap it down to this. And I can just say theta is going to be updated to the theta that maximizes this guy, which is the sum over all Hs of Q of H times log of P of H comma O. So, the joint probability of H and O. And so, this whole thing is also called expectation maximization algorithm. This is training by maximizing a variational log of bound where you first find a Q of H that maximizes this lower bound. And then you find the theta that maximizes. So, you keep that maximizing a lower bound. In the first step, you fix theta, fix this guy and find the Q of H, which maximizes this term. And then in the next step, you fix Q of H and find the theta that maximizes this term to the right. You do this two step process. Let’s try it. Let’s try it. And that gives you that this is a training by maximizing a variational log of bound where you keep tightening a lower bound and then maximizing it. And eventually, it’s going to give you it’s always going to be because you’re always decreasing the gap between log P of O and this guy. You’re always going to be finding a theta that max increases this increases the log likelihood of P of O and eventually you’ll find a local maxim. And now, this is the standard formulation where you find the Q of H that maximizes this right hand side and then fix the Q of H and maximize this term over here. That’s training by maximizing a variational log of bound. And in the specific case where you find that for Q of H, you choose your H given O, you end up with what is called the expectation maximization algorithm. So you start up with an initial estimate for your parameters. And then you’re going to iterate over, you’re going to iteratively update your parameters in the first step. You’re going to fix your parameters and estimate the Q which is going to be P of H given over the current estimates for the parameters. Then you fix the Q and then you update the parameters by maximizing this term. And that’s going to give you your new estimate for the parameters. So this whole thing is your this general process is training by maximizing the parameters by maximizing a variational log of bound. And in the specific case where you choose P of H given O as your Q of H, this gives you the expectation maximization algorithm. If none of this made sense, don’t worry, we will spend some more time on it. But first, answer this. Okay, 10 seconds, guys. Okay. So why did we originally introduce Q of H again? So here is a thing that the point was this term is not tractable. Log of P of H comma always not tractable, you cannot maximize it directly. And so you come up with an iterative algorithm where you first choose a Q of H, it turns out that in these problems, log of P of H comma O is maximized. And we know that log of P of H comma O is maximized. I go over here, for instance, if I gave you H comma O, in this case, it’s the same as getting both the identity of the Gaussian and the vector. Here, it was trivial for me to learn the model, right? I could just segregate out the vectors and learn the parameters. For here, in this case, if I gave you the missing components, it was trivial for me to estimate the model. So when the H is given, it’s easy to maximize log of P of H comma O. The problem is H is not given. And so we had the log of the sum of over H of P of H given also having the summation inside the log was a killer. What this process did was move the summation outside the log, right? What was the summation here? He mort over here. And now this is something that’s maximizable. But now this is only a lower path. And so you end up with an iterative algorithm. Did that answer your question? Did that answer your question? Are you clear? Oh, yeah, professor, it was question in the chart. Yeah. So why is it a killer to have a log of a sum? I suggest you try, right? What will happen? You have a log of a sum, the first thing, remember, if you look at this guy, I have log of P of H comma O semicolon theta. This really has semicolon theta over here, right? When I take a derivative, I’m going to get one over summation H P of H comma O semicolon theta times some additional times, right? So now we have a ratio that you’re not that where with a summation and the denominator that you would be equating to zero. That’s not, that’s not easily solvable. In fact, you don’t have a solution. So getting back over here. Yeah, this is our previous question. Sorry. So who was the question? I think I was quite talking to you, but no, there’s someone called ER. I don’t know, ERS. Do we have answer to the answers to the first question? Anyone? 1 and 2. And second one. Also 1 and 2, yes. It’s an iterative algorithm that can be used to estimate the probability distributions when the data incomplete. It iteratively maximizes an elbow function. And the elbow is a lower bound on the actual log probability as computed by the model, and it’s a function of the model parameters, right? So now there’s so much math. I’m sure you know, you have no idea what that matters to you. If you do, if you do, then you’re smarter than my hand, we should look at the intuition. What is it really doing in how is it solving the problem? Let’s consider the first case where the data themselves have missing components. Here’s what we have. These are the actual data. These are the incomplete vectors. We are missing some components. In each vector, these black components are missing. The complete data would include both the missing and the observed components, right? Consider a single vector. I cannot use this vector to compute the Gaussian because the vector is incomplete. So when the vector, now the vector is incomplete because this guy is not available, right? So what is the cheapest trick to, what is the cheapest trick to fix this problem? I have a hole in my data. What would you do? Anyone? If you have a hole in the data, what would you do? You’re going to fill that hole somehow, right? What do you fill it with? Estimate. What kind of estimate? So let me give you an example. Suppose you had a Gaussian where on just a two component vector, right? Where the second component was always roughly twice the first component. They’re that highly correlated, right? So I mean, you’re assuming that you know that you know which data components are missing. If you don’t know, then it’s not really missing data, right? They are noisy data. But if you have, this is a hole in the data. Like some say, guy in the senses failed to fill in his summary. So now let’s consider this example. Suppose I have a Gaussian where for any time this guy takes some value x, this guy takes approximately two x, right? Suppose you knew that. And now you see a vector where this component misses missing, but this is seven. What will value would you fill over here? Is it going to be the overall mean of the Gaussian? Or is it going to be 14? What would you fill? Again, you’re going to fill what is typical, right? You’re going to fill the most plausible value here. Given the value in this case, if you know that this component is always some alpha times the first component, then if you see only the first component, you’re going to fill alpha times the first component in here. You’re going to fill the plausible term value. So well, you can’t be exactly sure what it is, but here’s how we can think about it. If I have a very large number of vectors from the Gaussian, for which all of these green components are identical, but this component is missing, right? What would their missing components be? You’re actually going to see every possible value for these missing components, but any particular value is going to be seen in proportion to the of that value given, oh, missing, I’m given all. Is this making sense, guys? Is the statement making sense? Right? Can I see some raised hands, right? Okay, so this is making sense, right? So what I’m going to do is to complete the vector by filling the missing components. Okay, so here is this. Check out. Suppose, let me go back to my example, right? The, suppose I have this two component Gaussian and the second component is always twice the first component plus some noise, right? Now, if I saw many, many, many vectors, all of which the first component was seven, what is the second component going to be? The second component is going to be an whole range of values, but it’s going to have a distribution centered around 40, right? So in other words, you’re going to see every possible value, but every value that you see is going to be, the number of times it’s going to be seen is proportional to P of m given, oh, that makes sense. Okay, and so we’re going to complete this vector by filling the missing components with every possible value. That is, we’re going to make complete clones of the incomplete vector, but assign a proportion to each value where the proportion is P of m given, which can be computed if we know P of x, which is P of m, right? And so now, if I have this collection of incomplete data, I’m going to expand every incomplete vector out into all possibilities where, you know, I’m going to fill in this missing, I’m going to complete the data with every value that that missing component could take, but the number of times that value is going, is going to be used to fill in that component will be proportion to P of m given, which I can compute if I know the distribution of the data. And this I could compute program of previous estimate of the model. And now, I have a complete set of data. And I can use this complete data with all of these repetitions to estimate my Gaussian. The new mean is going to be the mean of this completed set of vectors. The new variance is going to be the variance of this completed set of vectors. And what now it turns out that I don’t have to explicitly fill it out with every possible value, which can be infinite in number, it turns out that if you work out the mathematics, it’s simply going to be, it’s going to be an integral inside the summation, but you’re going to have a closed form. I won’t get into the mat, but the general concept is this, right? So also basically what we would do is you’re going to compute the, we’re basically, we’re going to try to complete the data. The reason I’m hiding up is running out of time. We’re going to try to complete the data by filling in the missing components, where the missing components are filled in in proportion to P of m given O. And then we’re going to estimate the parameters from the completed data. And basically what we are saying over here is x i of m is the complete vector x i when the missing data have been filled in by the value m, that has occurred in proportion, this has occurred P of m given O times. And so this entire, this term over here is basically actually explicitly considering the proportion of times where the specific value I was taken. This math may actually be simpler if you take a look at a different different problem where we are looking at the structure of the network, right? The structure of the model not the network, right? So here when we had a Gaussian mixture, we were not actually given the actual Gaussian for each observation our data are incomplete. For each vector you actually want to be both the vector and the Gaussian index. Instead, you only have the own vectors. But then what we know is that every Gaussian is capable of generating this vector with different probabilities, right? So this, if I have a specific vector, let’s say the vector was out here, then this vector could have been generated by the blue Gaussian or the green Gaussian or the red Gaussian. So just by looking at the vector, we can’t say which Gaussian it came from. But then we ask ourselves, if I saw this vector of very large number of times, how many of them would have come from the blue Gaussian? How many would have come from the green? How many would have come from the red? And the fact is that if I saw a very large number of instances of this of a vector occurring over here, then the number of times they came from the blue Gaussian is going to be proportionate to this height. The number of times they came from the green one is going to be proportional to this height. And the number of times they came from the red Gaussian is going to be proportional to this height, right? So all of them could have occurred, but each of them is going to occur in proportion to P of K given over K is the Gaussian identity. Is this making sense? Over here. So, okay. And so I’m going to complete the data by attributing it to every Gaussian. So I’m going to make many clones of the data effectively. But and the missing component is discrete now. So the number of these clones that occurs that that comes from a specific value of K is going to be proportional to P of K given over. And which can be computed from P of K and P of O given K, which we have from the current estimate of the model. And so now the rest is simple. My first vector I will expand it out with every possible value for these missing components, where although I’m showing three bars, this is like having a very large set of these copies of this vector, where this first bar has as many copies as a great number of copies where the number of copies is proportional to P of blue given this vector. The second bar represents a number of copies that’s proportional to P of green given this vector. The third bar represents a number of copies, which is proportional to the P of red given the vector. And now because I have all of these guys, now all of these are completed data, right? I’ve filled in the missing data missing information, where the number of times I’ve chosen a specific value for the missing information is proportional to the posterior probability of that missing value. And now having done that, I can just segregate out all the various individual terms. I can call out all the blue vectors and all the vectors with the that I’ve assigned to the blue Gaussian. And from this, I can compute the parameters of the blue Gaussian. I can call out all the vectors that I’ve assigned to the green Gaussian. And from this, I can compute the parameters of the green Gaussian. Then I can call out all the vectors that I’ve assigned to the red Gaussian. And from this, I can estimate the parameters of the red Gaussian. And that just going to the best is going to give me the formula that I have over here. Now, so here’s the overall, overall estimation procedure. You’d be initializing the means and variances of all of the Gaussian’s. And we prior probabilities for all of the Gaussian’s. Then for every observation, you’re going to compute the posterior probability of the for each of the Gaussian’s given the observation. And then you’re going to effectively make clones of each vector where the number of clones for each Gaussian is proportional to P of K given code. And then you’re basically segregating out all the vectors, including all the clones, which came from each of the Gaussian’s. And from those, you’d be re-estimating the Gaussian parameters. Now, the key over here is that, you know, the math may have been, I’ve gone a little over your head. That isn’t the important bit. The key over here is that we completed the data. But by considering every possible value for the missing data variables, where any particular value was chosen in proportion to its posterior probability. And then we re-estimated the parameters from the completed data. But then here is this bit that this, if you do it, do it in exactly this manner, by considering every possible value. And in detail, then what you will get is a complete solution, which is kind of theoretically correct, in terms of being of increasing a variational level bound. But then you can do something else. You don’t need to complete the data by considering every possible value for the missing component. So for example, for this vector, you didn’t have to complete many clones of it. And have as many clones with a blue Gaussian as, you know, as proportional to P of blue given observation. Or then have as many clones assigned to the green Gaussian as is proportional to P of green given the vector. Instead of doing that, you could have just filled in this missing component by sampling, free from P of missing given O. And then when you sample it, now at this point, you’re going to have complete data. And then once you have complete data, you can re-estimate the parameters from the completed data. So is this portion, is this bit making sense to you? The fact that you don’t actually have to go through the complete math, you could have just sampled the missing components from the posterior distribution. And now you would get a complete set of completed data from which you can estimate parameters. That making sense? Made sense to two people. So, okay, so guys, this is probably going away over your head, but you can pull this over outward on piers and if necessary, I’ll go over this again more slowly, right? But then here’s the overall principle. Initially, you have some data or information that I’m missing. So we’re going to initialize our model parameters. And then you’re going to complete the data according to the posterior probability computed by the current model, where either you consider every possible value with posterior base proportionality or simply by explicitly completing the data by sampling this term from pi of m given O. And then re-estimating the model from the completed data. And then once you re-estimate the model from the completed data, you can go back and use this re-estimated model to complete the data once again and repeat this process to the process converges. So this is the mechanism that we generally use. So here’s a third barne for 5 seconds, guys. This first statement is a true or false. Second one is a second one true or false. What about the third one? True or false? False or true? Right? What about the fourth? False because you got to redo the completion every time you update your model. So let’s leave this hanging for a bit. We’re going to change topics briefly. How many of you are familiar with principal component analysis? Raise your hands. Okay. Several of your principal family with PC. We’re going to spend the next 10 minutes talking about PCA. Here’s principal component analysis. Let’s say you have a collection of data that are centered. So the mean is zero. Principal component analysis tries to find the principal subspace such that when all vectors are approximated as lying on that subspace, the approximation error is minimum. So in this two dimensional case, a principal subspace would just be a line. We approximate all the data as lying on that subspace. You’re basically dropping them perpendicularly onto that line. So the orange dots now become the approximations for each of the data vectors. And so then you’re going to have an error in the approximation for each of these, which is going to be the squared error is going to be the squared lengths of all of those black lines. And we’re going to try to find this principal subspace that minimizes this total squared error, which is the sum of these squared lengths of all of these projection error lines shown by the black line. So basically what we’d be doing is, see, this is a, you’d have some initial data. And this would be the axis that you’re searching for. The, the principal subspace is that when you project every data point onto that subspace, the total squared error is minimum. So you literally sort of scan through all of the subspaces until you find the subspace where this total projection error is minimum. As you can see for different subspaces, the projection error is going to be different. On the principal subspace, this total projection error is going to be the smallest. Now you can do this in close form. The principal subspace is going to be represented by some, some normal vector, some vector w. I’m going to assume without loss of generality that W is a unit vector. So the projection of the data point onto this vector or this basis set is going to be just the inner product between the vector and w, right? That’s going to be this length over here, x transpose w. The inner product is going to be this length. And so the error that you make, which is this dot, the squared error, which is this dotted line, is going to be the square length of this guy. But using Pythagoras theorem, that’s going to be the squared number of x minus the square of this length, which is squared number of x minus x transpose w, which when you write it out, is going to be x transpose x minus w, to transpose x times x transpose w. And so this should have been a superscript. The total projection error for all of the data is given by this guy over here. You can minimize this with respect to w. And when you minimize this with respect to w, you’re going to end up with this close form solution, which says that the sum over all x of, this should have been x minus w. And so this is going to be the number of x transpose x, you know, have some error over here. Basically, you get your familiar eigenvalue equation. And it turns out that your principal subspace is simply going to be the principal eigenvector of the covariance matrix of your data. This is something that all of you are familiar with when you perform principal component analysis. And this is, there’s also an iterative solution. Our objective is to find a vector subspace such that the projection, the total projection error of all of the data points onto the subspaces, minimize it, right? So we want to find a vector subspace w such that the error that you get when x is placed on w, the total error across all training points is minimized. There are two things to estimate in this process. You want to estimate w. You also have to figure out where on this space x lies when it’s projected onto w. So basically, we want to find both w, which is the subspace. And for each x, we want to find this position z on the line that the projection calls on. So if you think of it this way, then I can say that if I have a collection of data vectors x, if I put the collection in a matrix, capital X, and if I put all of my bases for the subspace into the matrix w, and if I put the coordinates of the projection on the subspace into the vector into, for each vector x into a corresponding vector z, then the collection of coordinates would be this matrix z. Then we want to find w and z such that w and z, the w times z is as close to x as possible in an L2 sense. So this is the decomposition that principal component analysis is actually performed. And this can be done with a two-step process. I can initialize say my w. If I initialize w, if I assume w is known, then the z that minimizes the error between w, z and x is going to be the pseudo inverse of w times x. So then once I have a z, then the w that minimizes the error between w times z and x is going to be x times the pseudo inverse of z. So if I iterate between these two equations, then I’m guaranteed eventually to find a basis matrix w that that is in principle subspace. So if I write the whole thing down in cartoon form, this is what it looks like. If I were to do the iterative solution over here for principal component analysis from x, I first go, I’m going to try to find the z, which is the coordinates on this principal subspace for all of my training data. And that’s obtained by multiplying x by the pseudo inverse of the basis matrix. And once I have z, then I can get the updated as I can get the projection in terms of z times w and minimize the error. So the real problem over here is estimating z such that in estimating z is finding the pseudo inverse of w. If you know pseudo inverse of w, then z is obtained by a direct matrix multiply. But then here is the overall process. This is what principal component analysis is performing. You have x, you’re finding some matrix w plus such that x becomes z, which is the coordinate on the principal subspace. And then given the coordinate on the principal subspace, because w plus is known, w is known, you can go back and find the projection on the principal subspace. And you’re finding this w such that the error between x and x at is very much true. And this basically is just an autoencoder, right? You’re going from our data onto a hidden representation and back to the data. And so when you perform back, you’re on an autoencoder. This is an autoencoder with linear activations. You’re basically simultaneously estimating both that with boozzy and w in time to increase. But then there are a couple of things that you want to notice over here. You want to mention over here that principal component analysis is trying to find x as being approximately equal if x is, you know, if x hat as the closest possible to x, then x hat approximates x. And so x approach approximates w times z. And you’re trying to find w and z. You will find a minimum in wc. But then if you just pre multiply w by some matrix B and pre multiply, a post multiply w by some matrix B and pre multiply z by the inverse, the actual solution won’t change. So although this this iterative algorithm of all the back prop or this iterative algorithm that I just mentioned is guaranteed to find you a minimum in terms of error. The actual if we’ll find you the principal subspecies, but I can find you different. But the actual basis matrix you find can can can be one of any set of bases on that principal subspace. So the solution is not unique. Although the subspaces unique. And if you want to have a unique solution, then this can be further constrained to give you a unique solution by posing a unity constraint on the gradient solution. So all that is very fine. But pictorially what’s it doing? I’m giving you some vectors x. I’m asking you to find the principal subspace. Now for if for every vector, I told you where on the subspace, I could give you the coordinates of the data point point on the subspace. Then you can find the subspace. Unfortunately, we don’t have the coordinates. So what do we do? We can initialize the subspace somehow or rather the basis for the subspace. And then once you have the basis for the subspace, then I can try to find out the coordinates of each of these data vectors on the basis for those bases. And so now you’ll be filling out the z’s. And then once I have the z’s, I can go back and say, if these are my coordinates, what is the best subspace to represent the data? And so I can go back and re-estimate the subspace. But then once I re-estimate the subspace, I can go back and say, okay, for this subspace, what is the best set of coordinates for the data vectors on that subspace? And then I can keep going back and forth. So basically what was happening here was that there was some data missing. So I came up with an initial estimate for my subspace. And then I filled in my missing data. And then using the missing data, I re-estimated my subspace. But then I can reuse the subspace to go back and re-fill the missing data. And this looks up whole lot like the expectation maximization algorithm that we just write. But then we saw that the E and algorithm actually works on generative models, the represent distribution. So what is the distribution that’s actually being captured by principal component analysis? For that, to understand what’s going on, we have to look at what happens when we impose a constraint on z to have unit radians, which is zero mean and unit radians. You can think of the decoder of the autoencoder where you have a constraint on z being z is that z is a unit radians, zero mean Gaussian as being a generative model. Basically what we’re saying is that there is some principal subspace. And then on this principal subspace, the data have the projections of the data have a Gaussian distribution. So in order to generate the data, we’re first taking a Gaussian step on this subspace. And then we are taking off the subspace at 90 degrees to it because that’s the projection error. And so in order to generate any data, the generative model story of principal component analysis is that in order to generate any data, you first take a Gaussian step on some subspace. And then we take a Gaussian step at 90 degrees to the subspace and that gives us the data. And so PCA basically has this generative model that all of your data are generated in this manner, that there’s some subspace where you first took a Gaussian step on the subspace and then took a Gaussian step at 90 degrees to the subspace. And so the PCA model says that your z vector, which is the coordinate on the space, is first drawn from a K-dimensional isotropic Gaussian, where K is the dimensionality of the principal subspace. And so, and here is your basis matrix. So when you take a Gaussian step on the subspace, you’re going to say you’re basically drawing a vector from an isotropic Gaussian and then multiplying it by the basis matrix. That’s like taking a Gaussian step on this subspace. And then from here, you’re adding to it a noise that’s also distributed according to a Gaussian, but the noise is always orthogonal to the subspace. And the noise also has zero. So this is the principal component analysis model. In fact, PCA implicitly obtains maximum likelihood estimates of the variance of this orthogonal noise and the basis from the principal subspace. So when I think of it in terms of autoencoder terms, basically here’s what we have. You are saying that there’s some isotropic unit variance Gaussian, zero mean unit variance Gaussian data that are being entered to this linear layer. And then at this output, you’re adding some Gaussian noise and that’s what generates your observations. And the decoder weights are just the PCA basis matrix. And so the genetic story behind PCA is that it’s basically an autoencoder. The encoder basically from the data, it finds the specific Z value, the coordinates on the subspace where the assumption is the coordinates are distributed according to a standard Gaussian. And the decoder uses these coordinates to position the data on the principal subspace. And the actual data themselves are obtained by having an orthogonal noise, Gaussian noise, added to the point on the principal subspace. So the distribution that’s modeled by PCA is like this. Z is Gaussian. When I multiply Z by matrix, you basically Z is a K dimensional Gaussian, where when I, which is basically a coordinates on this K dimensional subspace. So the X hat is the actual location in the K dimensional subspace. And then this Gaussian, this noise is Gaussian noise. It’s orthogonal to the subspace of the overall distribution that PCA models is going to look something like this. It’s a Gaussian distribution, which is very close to some K dimensional subspace, some principal subspace. So I’ll skip this poll. But then PCA has a problem. It assumes that the noise is always orthogonal to the data, which is not always true. We know that noise in images can look like images. In sound, random noise can sound like speech. So we can generalize principal component analysis to something called the linear Gaussian model. And the linear Gaussian model is an extension of PCA, where it assumes that in order to generate any data point, you first take a Gaussian step on some subspace. And then you add a generic Gaussian noise, which is no longer constrained to be orthogonal to the subspace. It’s a full rank Gaussian uncorrelated Gaussian that’s independent of the position on this subspace. And so this is how a linear Gaussian model models data. It says that data, again, Gaussian distributed close to a principal subspace, and that in order to generate any data point, you first take a Gaussian step on the subspace and then you generate some Gaussian knots. Now, what this means is that unlike PCA, where there was only a unique way of generating a data point, you had to take a step directly below the data point and then take a 90-degree step off the space. Here, a given data point can be generated in a number of different ways. You could have come here and generated this noise. You could have come here and added this noise, right here and this noise and here and this noise. And all of them could give you the same point. And so the way to produce a data instance is no longer unique. This is still a parametric model for a Gaussian distribution. The parameters of the Gaussian distribution are now the basis for the subspace and the covariance of the noise that’s added to the to generate the final data point. And this whole thing is often known as the layering matrix and the coordinates on the data on the spin-score subspace are factors called factors and the covariance of the noise. There is added to the point on the subspace to generate the final data as far as you have to be done. So the probability distribution that’s modeled by NGM is also a Gaussian, which is close to some subspace with uncorrelated knots. I’m going to, this whole thing was kind of confusing. I’m assuming we can go back over the specifics which make it confusing, but I’m going to pick up from here on Wednesday and I’ll try to cover some of the more confusing aspects of second time. So please post questions on Piazza, right? I don’t expect any questions after doing the lecture, realize this was confusing. But if there are any, I can take questions. Man, all right folks, I’ll see you in the next class. Thank you.