11-785, Fall 22 Lecture 22: Variational Auto Encoders (Part 2)
All right, let’s begin. So I am going to assume that at least some of you have watched yesterday’s makeup lecture. For the rest, I’m going to quickly recap some of the things that we discussed in the last class. I expect that some of it will not make sense to those of you who didn’t watch the makeup lecture, but honestly, the Tuesday morning lecture, I mean, Monday morning lectures are not the greatest. And so if you have any questions, please follow up on the answer, right? So here’s the recap. We’re looking at neural networks as generative models. We’ve seen how neural nets can perform classification or regression. Now we want to use them as generative models, which can model the distribution of any data, such that we can draw samples from it. So the national question is, what is a generative model? In statistical estimation, a generative model is a functional or computational model for the probability distribution of a given data. This can be represented generically as your x semicolon theta, where x represents a data instance and theta are the parameters of the model. But this model actually encodes a generative story for how the data were produced. Now, why do you need generative models? We can compute the probability of observing a given value x. If you have a model, you can also use these models to generate samples that are statistically similar to the data that you’re trying to distribution. We’re trying to model. We saw some examples of generative models. They can be simple one-step generative models, where the process of obtaining a random sample, there is a simple drawing process, like a Gaussian or a category distribution. But you could also have multi-step generating processes, where the process of generating an observation could have multiple steps. Like in the case of the Gaussian mixture, where in the first step, you chose a Gaussian, and then in the second step, you draw an observation from the Gaussian or linear Gaussian models, which we haven’t seen yet, but we will shortly. So, the way we estimated the parameters of a generative model from some observative data was to try to find the parameters that maximized the log probability of the data that you did observe, and whose distribution you’re trying to model. So, this was based on the idea that the world is a boring place. Whatever you’re here, observe is very typical of the process, and so you’re going to pick the process parameters for which these data are more typical for which these data are most likely. This was the maximum likelihood estimation procedure of our principle. We also saw that in many situations, the data we observe may be missing information. For example, you might be trying to estimate a Gaussian distribution from a collection of vectors, but each of the vectors might be missing some components, or you might be trying to estimate the parameters of a Gaussian mixture distribution, but then you may not know exactly which Gaussian generated each observation. And because of the missingness of these variables, it becomes very hard to estimate the distribution. And the reason for this is that we can, the maximum likelihood principle says you must maximize the likelihood of the observed data, and so you want to maximize log p of O, where O is the observed data, but then the observed data incomplete. You may be missing some components, or some other variables which carry information. And so, the actual process generates a more complete data. The process generates the complete vector over here, or the process selects a Gaussian, and then generates an observation. But in the process of observing the data, some of this information is masked, like this, these components may be masked, or the identity of a Gaussian that generated the observation may be masked. And so now, in order to compute the probability of just the observed components of the data, the, you must marginalize out the missing or unseen components from the complete data, H, O, which includes both what you observe and what you did not observe. And the maximum likelihood estimation in this case is going to maximize the log of the F O, which is the log of the summation over the hidden or missing variables of the probability of the complete data, including the missing and the observed components. And so, maximizing this means you want to maximize, minimize, but maximize the log of a sum of functions, and this usually doesn’t have nice solutions. So, in many of these settings, it’s not so hard to optimize log of P of H, O. It’s the fact that we are having additional summation over here, which makes it challenging. Challenging. So, we came up with our workaround. We defined an empirical lower bound for the log probability of the observation, which is the expectation over any probability distribution Q of H computed over the missing components of the log of the probability of the ratio of the probability of the complete data and the probability assigned to the missing value by Q of H. So, this term we saw in the last class was a lower bound on the log probability of the observed data. And furthermore, we saw, particularly in yesterday’s session, that the gap between these two is smallest. When this distribution Q of H is actually the conditional distribution of the missing components given the observed components. And so, we can now maximize the slow bound and you get a nice iterative maximum likelihood estimator. If we maximize iteratively, maximize the low bound instead of log P of O directly. And this was our expectation maximization algorithm. Again, ideally, the tightest bound is obtained when this distribution Q of H is the conditional distribution of P of H given O. But the entire process would still work or kind of work if we used other distributions over here, which were close to this gap. We didn’t actually get into that, but we didn’t make that statement. But assuming that we’re going to use this optimal value for Q of H, then here’s how things worked. We could define an auxiliary function, which is the expectation of the log likelihood of the complete data where the expectation is taken with respect to the conditional distribution of the missing components given the observed components, where this conditional distribution is computed using a previous estimate of the model parameters. So, as a result, you get this auxiliary function, which is a function of two distinct sets of model parameters. The theta key that occurs here and theta, which occurs over here. And we saw in yesterday’s class, again, I recommend all of you who didn’t actually watch the lecture to go back and view the lecture. We saw in yesterday’s class that this gives us this iterative estimator, where you can set up this auxiliary function with the second argument being equal to the current estimate of the parameters and maximize it with respect to theta. And that to give you the next estimate. So, why is this not divided by Q over here? Because Q is not a function of theta. So, because when you’re maximizing this, it’s log of P minus log of Q, but Q is not a function of theta. So, you’d only be maximizing the numerator. Did I answer a question? Right. So, and this is guaranteed to increase log of P up over every iteration. Then we went back and tried to understand what this really meant. And we saw this. The problem was that we were trying to train generative models using incomplete data, like incomplete vectors or observations where the other variables drawn by the generating process were not observed. And what expectation maximization was doing was to complete the data by filling in these values that had been masked. So, it would fill in these black values over here or it would assign a Gaussian identity for every vector. And then once the data had been filled in, now we have complete data. And from the completed data, it’s usually easier to estimate the parameters of the process. So, how exactly do we fill in these missing components? What is the mechanism we use to decide how to fill in these missing components? So, let x be the observed component and y be the missing component. In the case of the Gaussian, because it’s a Gaussian distribution, x and y are jointly Gaussian. So, when I give you the observed value x 0, the values that y can take is going to be follow the distribution given by the slice of this Gaussian at x equals to x 0. So, the missing components, we may not know what the value is, but we do have a distribution for the missing components, which is going to be the slice over here. Is this making sense to you, guys? Yes or no? Like I’ve always requested, please respond, because otherwise you’re going to slow everyone down, right? So, and so, similarly, in the Gaussian mixture case, here’s what we had. You want to inform which Gaussian, any particular observation came from, but then from the knowledge of the distribution. You know that say these were the three Gaussian’s, then you know that if the observation is O 0, all three of these Gaussian’s could have generated O 0, but then the blue Gaussian is far less has a far lower probability of having generated O 0 than the green Gaussian, which in turn has an even lower probability of having generated this O 0, give them the red Gaussian. So, given O, if you had to be asked which Gaussian, do you think actually was chosen and when this O was generated, you would be able to say that the probability that the blue Gaussian was chosen was this, the probability that the green Gaussian was chosen was maybe this, and the probability that the red Gaussian was chosen was maybe this. And so, you can, although you may not be able to identify which Gaussian generated the observation, you could definitely assign probabilities to all of those, which is p of k given O. And now that you have these equals theory probabilities, p of y given x over here or p of k given over here, you can use those apis for theory probabilities to fill in the missing components. So, in this Gaussian case, each vector, you’re going to make many a very large number of clones of it, turning to infinity. And you’re going to fill in this missing value with a component by every value it could take, but the number of times the specific value is going to be assigned to it, is going to be proportional to p of m given O. Similarly, over here, every vector, you’re going to make copies of it, many copies of it, but then each copy is going to be assigned a Gaussian. But the number of vectors that are assigned say the blue Gaussian is going to be proportional to p of blue given O. And so, now we have this complete in data, where we have filled in the missing components using some notion of plausibility. What these values could have been and, you know, what is the frequency with which that particular value would have been seen when you had the rest of these lab observed values. So, and then you can use these completed data to re-estimate your model parameters. We saw how to do this yesterday. There was an alternative instead of trying to make many copies and filling in the missing component in any possible way. There was a second technique and, in use, someone who attended yesterday’s lecture. Can you recall what the second technique was? Anyone? What was the sample, right? You can say that you’re going to fill in this missing value by sampling from p of m given O. And then when you sample, they’re going to have completed data again, and you can use the completed data to re-estimate the model. So, questions? Any questions? I don’t know. I’ll go back to the last slide a little bit. Okay. Yeah. All right. No questions. Thank you. All right. Okay. So, let me proceed. A second topic that we sort of touched upon in the last class was principle component analysis. You’re given a set of data, centered at the origin, and this isn’t some high dimensional space. In this example, in two-dimensional space, and you want to find a lower dimensional projection, such that the error of the, you want to find a lower dimensional subspace, such that when you project all the data onto that subspace, the total error, squared error from the projection is minimized. So, we want to find a line like this one, in this example, such that when you project every data point perpendicularly onto the line, the squared projection error, which is the squared length of all of these black lines, is minimized. And so, the problem of principle component analysis was finding the subspace. So, when we actually think of this, you know, here’s a three-dimensional example, you’d be given a set of data points, all these ways, and you want to find the subspace shown by the plane over here, such that when you project every data point onto the subspace, the projection of every data point is shown by a gray shadow. Although, I should have drawn lines to show which shadow represents which point, but it would get very cluttered. But the point is, you want to find the subspace is that when every data point is projected onto the subspace, the total squared error is minimized. Now, there’s a different way of looking at it. This subspace has a bunch of bases, which basically acts as which lie entirely within the subspace. And the shadow of any data point on the subspace is going to have some coordinates z in terms of these bases. So, the shadow of every point and is going to be any point di is going to be some w times zi, where w is the matrix representing the set of bases. And zi are the coordinates of the shadow of that point in terms of these bases. So, the projection error is going to be vi minus w times zi. And you want to find the w, the base is specified by the subspace. And the sets of coordinates, such that the coordinates for each of the data instances, such that this total squared projection error is minimized. And when you work out this math, we found we would find if you’ve done PCA that the best set of bases to minimise this guy are the eigenvectors of the covariance matrix of the data. So, anyone who has done PCA, do you remember this? Is this making sense? Yes, no. 3 out of 100 man, this is bad. I can at least have 10. So, I’ll stay with 3. I have 3% of my class following. But then we saw two classes ago that PCA is also performed by a linear activation autoencoder. Right? We saw that if I just add an autoencoder, which had where dimensionality reducing encoder, reduced x to z. And then the dimensionality increasing decoder broad z backs to an act to x hat. And you minimise the square error between x and x hat. This is exactly equal to PCA. So, in fact, when you think about how this is trained, you’d start up with your data, you’d initialize your w’s. So, you’d initialize this encoder, which is the pseudo-anversive w. You start off with your data and then estimate the the hidden representation z, the intermediate representation. And then pass the z through the decoder to get w times z, which is x hat. And then you’d minimise your estimate w to minimise the error between x hat and x. And so, here you’re assuming that you initially have a z. And then you’re minimising the error. You are estimating w to minimise the error between w z and x. And that will give you, or if you initialize w, you’re going to get that z is the pseudo-anversive w times x, which is how you’ll be finding the optimal z’s given the initial estimate for w. But then w in turn is going to be x times the pseudo-anversive z. You can just retrieve this backward and forward. But then the process of learning the autoencoder was basically just this. And the decoder of this autoencoder is just going to be this component, which is and it’s got, and the decoder is now going to end up with the basis for the principal subspace. So, again, we saw this very briefly in the last class. What is the iterative solution actually doing? What we are doing is in the first step, if you have the w, you’re using the w to compute the z. And that’s actually trivially computed. Then once you have the z, you’re using the z and the x to compute the w. You’re minimising the error between w z and x. So, and the training the autoencoder basically repeats this back and forth. So, again, our point is that when we were trying to learn the principal subspace, we were trying to learn both the basis of the subspace and the coordinates for all of the data points and all the z’s. So, the way we did it was we initialise the subspace, we initialise the w, then use the w to find the coordinates of the shadows, of all of the data points on this principal subspace, basically the z’s. And then once you have the z’s, you use the z’s and the x’s to re-estimate the subspace. Then you use the estimated subspace to go back and find the coordinates and repeated the process. That’s what we did. You’d initially pass x through this encoder and get the z’s. Then given the z and x, you could you can estimate the w. But now from the w, you have this guy. So, you can pass x through this and get the z’s again and keep repeating the process. What’s really going on? The way we are looking at it as we are saying that if I knew the coordinates of the of each data point, of the shadow of each data point on this principal subspace, then I could estimate the principal subspace. Unfortunately, these coordinates are not known, they are missing. And so, we iteratively completed the data by estimating the coordinates of the data vector, of the shadow of the data vector on the subspace. And then using these coordinates, we re-estimated the subspace and flipped back and forth. And what does this look like when you try to complete the data by estimating some missing information and then re-estimate something. What does that look like? We just find out that it’s right. This is just em, right? So, basically, if I do this iterative mechanism, when we train an auto encoder to minimize the error between the reconstruction and the original data, base what we are performing is em. But then we just saw that em is really a model for a mechanism for learning the parameters of a generator model. What is a generator model over here? There doesn’t seem to be anything generator over here, right? You’re just trying to find a principal subspace. Where is the generation? So, for that, we have to go back and realize that there is in fact a generative story behind principal component analysis. And the generative story we have derives from the fact that when you compute the shadow of any point on the sub-subspace, you’re basically computing a perpendicular drop of the point on the subspace. So, the error between any point and its shadow is always going to be orthogonal to the subspace. So, in other words, the error between v and wz, which is its shadow on the subspace, is going to be 90 degrees to the subspace. Basically, the inner product between itself and the basis of the subspace is going to be zero. So, once you keep this in mind, here’s what it turns out, the generative model behind TCS. You’re assuming that you have, if you take this principal subspace as K-dimensional, that the coordinates of the data point in terms of the bases come from a K-dimensional isotropic Gaussian. And so, in order to generate any point, we’re first drawing the coordinates from this isotropic Gaussian. And then based on the coordinates, you’re actually finding the position, which is K times E, you’re drawing Z from the isotropic Gaussian and computing X hat, which is the position on the principal subspace, which is A times E. And then to this, we are adding some noise, also drawn from a zero mean Gaussian, but now this is uncorrelated noise, with the one constraint, that the noises always are orthogonal to the subspace. And then you’re adding the noise to this point X hat to obtain the final data. This is the actual model that PCA employs. And so, you can think of it this way, that the decoder of, which is the actual principal, which predicts data points into the principal subspace, is this linear transformation AZ, A times E, which starts off, which takes this K-dimensional nice circular distributed Gaussian distribution, and distorts it into an elliptical distribution, where the different components may be correlated with one another, on a K-dimensional subspace of the full dimensional space. And so, basically, the another, the say the generator model says that the actual data are generated by taking a Gaussian step on the subspace according to this transformed Gaussian. And then adding a Gaussian noise, which is always orthogonal to the subspace. And so, you’re getting X equals A times Z, which is X hat plus E. And Z is Gaussian, therefore A times Z is Gaussian. The noise E is also Gaussian. And so, because everything is Gaussian, AZ plus E is also going to be Gaussian. It’s actually assuming that the distribution of the data is Gaussian. Specifically, what is the distribution of X hat? X hat is just A times Z, right? So, if Z is zero mean, the mean of X hat is zero. And the covariance of X hat is going to be simply A times the variance of Z times A transpose, the variance of Z is I. So, the covariance of X hat is A transpose. I’m adding this noise to X hat. So, X equals X hat plus E, which is Vellaro case E. And the noise is also zero mean and has a covariance D. When I add two uncorrelated variables, that their variance is add and then means add. So, the result of the effects is also Gaussian with mean zero and covariance equal to A transpose plus D. So, is this model making sense to you guys? You see what we’re speaking on. Right? Okay. And so, now, what I’m doing when I try to estimate the model, what are the parameters of the model? The parameters of the model are this matrix A and this covariance D. And when I perform PCA, it turns out all I’m really doing is performing a maximum likelihood estimate of A and of D, which are the parameters of this model. And in fact, PCA basically just ends up giving you a maximum likelihood estimate of the subspace and the noise covariance given the collection of data where you’re assuming that the distribution of the data itself follows this format. And that the covariance matrix for the noise is always orthogonal to the subspace. So, here’s your first ball. Okay, 10 seconds, guys. Okay. This is first statement true. Second one, third and the fourth basically all are true. Right? PC actually performs maximum likelihood estimation of the estimation of a generative model for the data. But the generative model is that in order to generate any point, the process first takes each Gaussian step on the principal hyperplane followed by a Gaussian step perpendicular to the hyperplane. Because it’s a generative model, it can be estimated using EM. And because everything is Gaussian, basically the distribution model by PCA is Gaussian. Right? But then let’s try to improve on the PCA model. PCA is basically saying that your actual data consists of two components, the main signal. Any generative model of missing components can potentially be estimated by the EM or some variation of it, which we will see. So anyway, what PCA is assuming is that is is that the main signal lies on a subspace. And then there’s some noise for Fogunar to the subspace, which is why we often perform PCA to to reduce the dimensionality of the data. We are trying to capture the signal component of this of the data and we are throwing away the noise. Right? But does one see it, but do you see any issue for this generative model? Is it something that doesn’t seem quite perfect here? Anyone? Even if it’s supposed to be Gaussian, a Gaussian assumption is not a bad thing, right? The real issue is that you’re assuming the noise is always perpendicular to the principal subspace. You’re assuming that the noise is nothing like your data, right? But then in real life, you know that when I generate, if I just randomly speckled an image, that noise could itself look like an image. If you listen to noise, you can actually sort of hear whispers and music and noise. Random noise can sound like speech or music. Random noise can look like images. The noise doesn’t have to be orthogonal to the subspace, right? So the reality of it is that in real life, this noise doesn’t have to be merely orthogonal to the subspace. You can be full-rack. And so in this case, the correction you want to make to the model is that in order to generate any data point, you’re first taking a Gaussian step on the subspace. But then from there, you’re adding a noise, which is drawn from a Gaussian, which is still uncorrelated, but can be in any direction. And so now this noise is full-dimensional. And that’s a much more realistic model. This is what is called a linear Gaussian model. And so the probability distribution that is modeled by this linear Gaussian model is LgN, says that in order to generate any data point, you first draw a z from an isotropic Gaussian. You put it through a linear transform, which is to say you’re drawing a coordinate on this principle subspace and A represents the bases. And so X hat is the point in the principle subspace. But then to it, you add a full-rank noise, where the covariance matrix of the noise is diagonal. So the noises are uncorrelated. And this gives you the final observation. So again, if I assume that z is a standard Gaussian, then the probability distribution of X hat is going to be a Gaussian with mean 0 and variance A A transpose. And the probability distribution of X is also going to be Gaussian, which is with mean 0 and variance equal to the variance of X hat plus the variance of the noise, which is A A transpose. And also with this, suppose I gave you, if I told you what z was, what z you had drawn, then you can easily enough compute the probability distribution of X. If I gave you z, then X hat is known. And so because A Z is known. And so X is simply going to be the noise E plus this term A Z. And so the distribution of X given Z is going to have basically the same distribution as the noise, with the mean shifted to A Z. So is this making sense to you guys? You have X given Z. What I mean by a full-ranked diagonal is that every component is non-zero reduction. So the noise can occupy the entire space. And I assume that the variance between different components are independent. We are assuming that each component is independent of every other component is the noise. It’s just random noise. And you also see. So now this is a generator model for Gaussian. It’s assuming that the data distribution are Gaussian, lying largely on a hyperplane with some Gaussian fuzz. Now the first of the hyperplane is uncorrelated. So only the components that actually lie on this principle subspace have correlation. So this is a nice model which allows you to typically you assume that you’re going to assume that a Gaussian has been fully correlated. All components are correlated to one another or that it’s uncorrelated. The covariance matrix is diagonal. This gives you something in between which says that there is correlation but there’s only the correlation only occurs within a subspace. So this is basically saying and this is a very realistic model for Gaussian data. That the data lies has a Gaussian distribution which principally lies very close to a subspace and all of the correlations are cut off the subspace and the fuzz off the subspace is noise. And this is a superior model to principle component analysis as you can see because the model for the noise is more realistic. But then what if I give you a collection of data generated by this model? What are the parameters of the model over here? Anyone? What are the parameters of the model? Does someone want to venture? With the covariance matrix and what is it you don’t know? You don’t know a and d right? So these if I know these two then the model is fully specified. Take a look at the formula. The p of x is a Gaussian with mean 0 and covariance is a 38 transpose plus d. So these are the parameters of the model. And so the likelihood log the likelihood log likelihood of any data instance is going to be the log for Gaussian computed using these parameters at x because the mean is 0 this is x minus mean transpose is just x transpose. The covariance is a transpose plus d so this is a transpose plus d inverse. And so this entire term is going to be one awards by root of 2 pi raised to d times the the determinant of the covariance times the exponent of minus 0.5 times x transpose covariance inverse x. This is the log likelihood of any data instance x given a collection of data instances you’d be summing this over all of the data instances. And then you’d be estimating a and d to maximize this. And unlike pca this doesn’t have a closed form solution because d is full rack. And d is so what is the problem over here? If I gave you the subspace and if I gave you the point then in pca you knew the exact coordinates on the subspace because you had to drop a perpendicular on the plane and there was only one way of doing it. But in this case because the noise can be anything even if I know this point and if I know the subspace I don’t know exactly where on the subspace the data point came from. So the information about the intermediate values in generating x the coordinates in the subspace are not enough. If you knew the azif for each x then that estimation is simple. Let’s assume I know the z for each x then I have p of the complete data is x comma z. And so the probability of the complete data is p of x given z times pc. And z of course is a Gaussian distributions of p of z is Gaussian. And the probability of x given z is simply a Gaussian with me az and the covariance t. So if I try to maximize the log likelihood of the complete data because p of z is not a function of a or d it’s the same as maximizing the log likelihood of the condition probabilities of the data instances x given c. And this can be written down. And so this term over here can now be maximized with respect to a and d by differentiating it with respect to a and d. And equating it to zero I’m not going through the map but you can understand you can see how this follows right. It’s a nice thing it’s a nice formula for at each point we know both x and z because we are assuming that the coordinates are given that we have the complete data. And so now we just have to ask to be a and d. And so when you work out the arithmetic you get these nice closed from formula for a and d. I won’t go through the map you have to just have to I just want to get across the point that the having this z gives us a closed from formula formula. Unfortunately we do not observe c it’s miss the data incomplete. So what do we do. I don’t know the z right. So I’m going to go back to E and I’m going to I’m going to complete the data in every possible way proportional to p of z given x. And so what is p of z given x. We know that p of x is Gaussian. We know the joint distribution of z and x is Gaussian right because the data Gaussian the initial z that went into the linear processes Gaussian and that have processes linear. So the joint distribution is Gaussian without working out the map I can tell you that the conditional distribution of z given x which is you know what is the distribution of z if I tell you the final outcome x that 2 is going to be Gaussian given by this ugly looking formula. And so what we would do is to fill in these missing components according to this conditional distribution for each x. And subsequently we’re going to re-estimate the parameters of the distribution from this complete we’re going to estimate a and d from these completed data. And so the overall solution is you’re going to complete the data in every possible way proportional to p of z given x. And then compute the solution by maximizing the likelihood of the completed data. And once again if you work out the arithmetic the fact that you’re considering every possible value of z gives you an integral we don’t need to get into the math of it when we’re just trying to get convenient concept across it turns out that when you work the whole thing out you basically end up with nice closed form solutions right. What we are doing is we are drawing the missing components from p of z given x given the current estimates of a and d to complete the data. And then we estimate the parameters from the completed data. So there are two ways of completing the data. You could either have filled in the z every possible value of z or you could essentially sample the z of p of z given x which has a closed mathematical form. And either way you could now use the completed data to re-estimate your parameters and everything is fine right. So you sort of get the idea of what we are doing. What we are doing over here is to try to learn a Gaussian distribution. And the way we are learning the Gaussian distribution is by where the Gaussian distribution has a specific structure that we are assuming that the data lie on some principle subspecies and the noise is an ice and to generate any data point you take a Gaussian step on the subspecies and the Gaussian noise is added off of it. And that is going to so we just go there and then you take a step and you get your observation. But when you are trying to estimate it we don’t really know which location on the space generated the data point. So you go back and take a look at what are all the locations on the space that could have generated the data point. The location perpendicularly below it needs the smallest amount of noise. So that’s going to have the lowest probability. As you go further and further away from it there’s lower and lower probability that that the original z came from here. And all the equal probability contours are going to kind of look like these ellipses. They’re going to be Gaussian’s also. And we are using the entire lot of them to further the data to estimate the model. So that’s the linear Gaussian model. These are models for Gaussian distributions. There’s some sort of a of a agreed over PCA and specifically modeled the distribution of data as Gaussian where most of the variation is across a longer linear manifold. And these are excellent models for data that actually fit these assumptions which is going to be most realistic data in most realistic kinds of noise and many other kinds of deal. But there was a there was a lesson that we were trying to convey through all of this. And that lesson is perhaps brought across by this point. 10 seconds guys. Okay. You can’t find the poll 1609. All right. So just read this on the slide right. Is this first statement true or not? linear Gaussian models models the distribution of the data as a Gaussian centered on a principle hyperplane. What about the second one? The generator model is that in order to generate any point the process first takes a Gaussian step on the principle hyperplane followed by addition of uncorrelated Gaussian voids. What about the third one? Also true right. The parameters of the distribution are the bases of the hyperplane and the covariance of the nodes and the fourth. If you knew exactly where on the plane you started off from to generate each data point then the estimation is trivial. But because it’s not known the fifth is true the actual estimation is performed using EN. So far so good right. But what kind of distributions that do Gaussian linear Gaussian models? Again you can see that this is a generator model right. You’re drawing a Gaussian data from a ga a sample from a Gaussian pulling it through a linear transform and then adding Gaussian noise to it. It models Gaussian distributions which are close to a hyperplane. But then what happens if you have data which have distribution of this kind? They are not close to a hyperplane. They are close to some non-linear manifold like this guy or this one right. How do we model these? Here’s what I can do. I can model this distribution as maybe Gaussian but then it’s not Gaussian on a plane it’s Gaussian on a curved manifold. So you can model this as a Gaussian data on a plane where the plane has now been worked to give you the new distribution. Is this making sense? Right and so what is the model for that? It’s trivial. This is very much like the linear Gaussian model but I’m just going to make a small change. I’m going to say that my Gaussian noise now goes through a non-linear transform because this is like a non-linear autoencoder remember to give me X hat and now the first is added by adding some uncorrelated noise to this cap. This is the non-linear Gaussian model and it’s going to model distributions such as this. Is this making sense? Here’s how the NLGA models the data. Our first step if you want a model of distribution using an NLGA and you want to find a function that warves a lower dimensional input plane into the target manifold in the data space. Basically we want to transform a Gaussian on this lower dimensional linear plane into some distribution on this warp space. It doesn’t even the result may not even be Gaussian on this warp space. It says that the distribution is now on a curved manifold and the actual assumption is that the actual data are obtained by adding some fuzzy noise to this distribution and this is actually a perfectly good model for the distribution of most real-life data. Here is the generative story for non-linear Gaussian models. You have a non-linear function that warps this k-dimensional space on which you have this k-dimensional isotropic Gaussian data into a curved manifold and the final data space. Then samples drawn from here from this input Gaussian are going to end up as samples on this curved manifold. High density regions over here are going to correspond to high density regions here. Low density regions here are going to correspond to low density regions here. Then the final observations are obtained by adding uncorrelated noise to this data. So you’re drawing some samples from here for you’re putting it through this non-linear transform which flogs it somewhere on this manifold. Then you have some uncorrelated noise which is added to this point to give you this final data observation. That is the generative story for modeling distributions. This energy m can model very complicated distributions of this kind. Any distribution that may be viewed as lying close to a curved normal dimensional surface in data space or even a linear surface. F of Z equals AZ is a special case. So is this making sense to your guides? And here is the beauty of it. If you see any structured data, the fact that the data has structured means that it’s actually low dimensional. If the data were not structured the distribution would be all over the place and it would be just noise. And so pretty much any structured data like images of faces, human beings on planets will name it. If it plot their distributions they are going to be far Z, you know, an element of fuzz, the random variations around a distribution that lies on a curved low dimensional manifold. So this is a very generic model that can model pretty much any distribution in the world. And so when I want to learn the distribution of an often complex data, this is a beautiful way of modeling it. The key requirement is that we’re going to need to know the dimensionality of this manifold, which is a dimensionality of Z. And the function f that maps this guide into this curved space. But once we know that, so the design choices when I’m trying to learn a model of this kind for some data is to first guess the dimensionality here, then choose the right function that’s capable of learning the shape of the manifold. And this function is where we’re going to stick in a neural network. This making sense. Right. And so this is how we’re going to learn it. You’re given a collection of data x. And from the data x now you have to learn two things just like in the LGA. You’re going to learn the covariance of this noise. But then you’re also going to learn the parameters of this f over here. And this is a generic model that can that actually models the distribution of a data. And estimate theta and B. Again, they’re estimating a derivative model. So we’re going to use maximum likelihood estimation. Now let’s begin by assuming that in reality, you’re only given the collection of x’s. But this generative story says that every x began as a z. That z was put through this non-linear transform. And then some noise was added to it. Except you only see the outcome. You don’t see, you only see the fully grown adult. You don’t see the baby there in the ns. But then let’s assume that you are given this guy to begin with. If I were given z, what can you say about what the final outcome is going to be? The conditional probability of p of x given z is, you know, because e is a Gaussian. If I give you z f of z is just a constant. And so the distribution of x given z is simply going to be the distribution of e with the mean shifted to f of z. And so p of x given z is a Gaussian with mean f of z and covariance t. This making sense? This one. Right. And what is the probability of p of x is going to be the integral of p of x given z times p of z, where you’re integrating z out, which is going to be the product of a Gaussian with f of z is the mean and a Gaussian over z. But then here, because you have this ugly f of z as the mean of this Gaussian, this is not tractable. You cannot actually compute p of x. In this case, because of this ugly form, of this non-linear transform, even though you know the process, and you know all the details of the process, I gave you that data, I gave you that if I gave you everything, you still could not actually compute p of x in closed form. So this is a function that can model a distribution, but you don’t know exactly what it’s modeling. So here’s your form. Can you see this form? All right, 10 seconds. Right. So let’s get the answers. Is this first statement true? NLDM’s are generative models. Is the second one? Is that true? Yep. They modeled the distribution of the data as Gaussian distributed about a curve manner. Right. Third one. NL, the second one is not exactly true. On the third manifold, it’s of the data may not be Gaussian. The start off is Gaussian, but then the surface gets blocked. So the second one is actually not true. What about the third one? True. Right. The generative model is that not to the generate any point. The process first, again, this is not strictly true. It takes a Gaussian’s random Gaussian input and transforms it. So it’s the equivalent of taking a Gaussian step along the curved manifold, followed by addition of Gaussian nuts. And what is the fourth one? The fourth one, the parameters of the Gaussian are simply the parameters of the function that transform the k-dimensional plane, the k-dimensional manifold, and the convenience of the not this is true. And so basically the parameters are the parameters of theta and d. Right. The last one, I haven’t introduced the concept, but trust me it’s true. Right. So how do I learn this data? How do I learn this model? Now drawing a sample from the NLGM is a two-step process. First, a z is drawn is to get x-hat. You place the data here, then the noise is drawn, and then you add the noise to get x. So the complete data to describe how the generative process generated any x requires z, it requires, and so the complete data is going to be x, and z, and e. But you don’t need all three, sufficient to have just x and z, because if you’re given x and z, then e is implicit. Right. So the complete data in order to describe any generation, x and z, but only get to observe are the x’s. But then first consider the glass box process, which gives us the complete data for every x, you know the z that it began as. Then the output, so in this case let’s derive the estimation rules to learn theta and d. Now if I give you z for every x, x equals f of z plus e, so the probability distribution of x given z is going to be a Gaussian with mean f of z and variance d, right. So the log likelihood of the total training data, where for each data instance, it also know the z is the sum over all instances of the log of p of x comma z, which is p of x given z times p z, p of z is a Gaussian, say, and it cannot be maximized. So this is the term that you would have to maximize with respect to theta and d. But then I can write out log of p of x, given z. I know that p of x, given z is a Gaussian with mean f of z and variance d, right. And so log of p of x given z is going to be the log of this term over here, which is just the Gaussian with mean variance d and mean f of z. And I’m going to basically estimate my model parameters theta and d to maximize this. So for every observation z was also known, the estimation process would just maximize this. Is this making sense? Do you see where this comes from? It’s very trivial, right. So just the two of you, what about the rest of you guys? Do you see what’s coming up here? Okay, what is the problem? What’s the problem over here? We don’t have complete info. What don’t we have? We don’t have z and x, right. So how would we go about completely solving it? What have we been doing all along? We are going to complete the data by filling in the z, right. And specifically, we are going to fill in the z. So here, I’ve just written out the formula. You’re maximizing this guy with respect to theta and d. So you can define this as a loss, the negative of this is the loss. And you can minimize this with respect to theta and d and again, we can just use back product, right. But you don’t know z, right? The data are incomplete. So we’re going to complete the data. The two ways of doing it, I can fill in this missing z in every possible way proportional to p of z given x or alternatively fill in the missing z by drawing samples of from p of z given x, right. This making sense? Okay. And in general, option one using every possible value is to be preferred if you can get a tractable closed form solution. Otherwise, you want a sample, right? Because sampling is always noisy, but then there are two different solutions. Now, let’s look at the problem of completing the data. To complete the data, I need p of z given x, right, which is p of x given z times p of z divided by p of x. p of x given z is a Gaussian. p of z is a Gaussian. The denominator is the integral of this guy over z. But we know that this integral cannot be computed. It’s not tractable. It’s not in a closed form for most f of z’s except if for a of z being nonlinear, you can’t actually compute this guy. So this means p of z given x is not tractable. You can’t actually find p of z given x. So how do you generate samples from p of z given x? You don’t write. What we will do is to try to approximate p of z given x. Now, remember when I was speaking of EMS today or even earlier today, I said that when you don’t have p of z given x, if you come up with an approximation that’s close enough to it, it still works, right? And so we’re going to try to approximate p of z given x and x specifically. We’re going to approximate p of z given x by a Gaussian, whose name is a function of x and whose variance is also a function of x. Does this make sense? Yes or no? Right? And now we are going to use this approximation as a proxy of p of z given x. So here is the overall solution. I’m going to initialize f of z theta. Right? I’m also going to initialize the f’s which compute the mean and the variance for this key. Then actually, so I don’t know the f’s, but let’s say I’m going to initialize theta. Then given theta, I’m going to estimate the needs, new and sigma that give you the best approximation cube of x, comma z. And then you’re going to use this guy to complete the data. And then you’re going to use the complete data to re-estimate f. Once you use a re-estimate f, you’re going to use these guys to go back and update the parameters here to come up with a better estimate of your equivalent, q, q. Then you’re going to sample from that q to complete the data and use the completed data to re-estimate update f. So here’s the complete pipeline. So this is the generative model. This is the complete model. From the x, I’m first going to generate estimate q. From q, I’m going to sample a z. That gives me the complete data. And from the sample z, I can compute estimate theta and d. So I initialize phi, actually, sort of this slide has it backward. I’m going to initialize phi. And then for each x, from the initialized phi, I can compute this n. And then from this n, I can sample a z. Then using that z, I can update, I can estimate f. Because now I have complete data. But then using f, I can now, I know I have f, I can go backwards. And for each x, I can get you an estimate of what the z must have been. And using the z and the x, I can go back, I can get you an update the estimate for what the phi must have been. And so I can estimate the phi using this data. Is this process making sense? Let’s know, right? So that’s what we’re going to do. It’s very straightforward. So let’s take a look at what this, what are you doing? These steps. The first step is, once I’ve initialized these phi, I have a Gaussian. I want to sample a value from the Gaussian. The mean of the Gaussian is mu of x, the variance is sigma f. So what am I going to do? Sampling from a Gaussian is done using of standard reparameterization. Take trick. You’re in fact, they’re not, it’s very easy to sample a variable from a standard Gaussian with zero mean unit variance. So I’m going to sample z from a standard Gaussian. Then I’m going to shift and scale it so that it looks like a sample from this Gaussian. So for each training instance, I’m going to compute mu and sigma, the mean and variance of q. Then I’m going to draw samples from one sample, one sample from this by first drawing. So again, I know that this is a key dimension of Gaussian, right? So I’m going to first draw a key dimension of vector epsilon from a standard Gaussian. Then I’m going to scale it by multiplying it by the square root of the covariance. So now it has the right spread and do it and going to add the mean. And that can now gives me this new z, which is a sample from this Gaussian with mean, you know, and variance sigma. This making sense? So phi, you learn is remember, I’m trying to estimate approximate p of z given x using some Gaussian, right? So to make this approximation as close as possible, I’m going to have the mean and the variance of this Gaussian must themselves be functions of x. Those functions will have the numbers. Does that answer you either? Either did that answer your question? Okay. So we know how to sample this, right? And observe that the z that you sample is specific to x because basically the mean is a function of x and the variance is also functional x. And now we can re-estimate theta from the completed data. How do I do this? If I give you x and z, we already saw that we can write out a loss function that we can and that this loss function is now a function of theta, which you can optimize with respect to the parameters theta as well as the covariance of the noise. Typically, we assume that the noise is diagonal, is a diagonal matrix where all diagonal components are identical and equal to sigma squared. So this loss function can be minimized with respect to the variance of the noise and the parameters of theta once I have the z. So this making sense? Yes, no. Right. Okay. So this is easy, right? Now, so we know how to do this and we know how to do this. The last step is that simply assumption, right? The noise is uncorrelated. So the covariance matrix is diagonal. Once you begin assuming correlated noise, then the model becomes more complex, right? So the last step, so here’s what our process was. I was going to initialize the parameters of q and from that I compute a sample of the z, but once I have a z, I can use these pairs of z and z’s and x’s for my training data to estimate theta. But now once I have theta, I can use that to go back and estimate phi. And how do I do that? That’s the most complex part of it. Recall that q is a Gaussian with mean new invariance sigma, right? And you want it to approximate, you have z given x as closely as possible. So what we will do is to try to estimate this phi to minimize the error between q and p of z given x. And for this, we want to quantify the divergence between these two guys. And I’m going to use the KL divergence. The KL divergence between q and p of z given x is the expectation of what all z is drawn from q of the log of the ratio of q and p, right? Which is the expectation over z drawn from q of log q minus expectation over z drawn from q of log of p of z given x. P of z given x, I can write as p of z times p of x given z times p of x. So if I expand this out, this is p of log of p of z minus the expectation of the log of p of x given z plus the expectation of log of p of x. So this is what we’re going to try to minimize, right? But if I look at the first two guys, the first two guys by themselves are simply the KL divergence between q of z x and the distribution, the the a priori distribution of p of z, which we know to be a standard Gaussian. So this whole thing can be written as the KL between q and p of z minus the expectation over z of log of p of x given z plus the expectation over z of the log of p of x. Now this guy, this last term is basically q of z, if your q of z, x is a function of x, right? You want to minimize, minimize a function of f. So if you want to minimize this with respect to phi, we are going to get these two terms. This last term has no log of p of x has no dependence on z at all, right? So this guy, the expectation over z doesn’t change this. So modifying phi will not change this term at all. So we can ignore this as a minimization. And so basically all we want to minimize is the KL divergence between q and the prior distribution p of z minus the expectation over z strong from q of the log of p of x given z. So what exactly does this mean? Let’s go back and take a look at what we are trying to do very quickly, right? What we are saying is for every x we are missing some z’s. So we do, we’re going to sample some z’s, right? And then we want to minimize the divergence between q of z given x and p of q of z and z for each x and p of z given x. So another way of thinking about it is I want to maximize the minimize the difference between the log of p of z given x and log of q with all of the training instances. We’re going to minimize this discolency. And if you work through this, basically what you end up with is two terms. What you end up with is an average taken over all possible completions of the data. And again, remember in the past that I mentioned just a little while ago that when you’re trying to complete the missing components, ideally you will complete the missing components in every possible way. It does not give you a tractable solution, but if it’s not giving you a tractable solution, you’re going to just sample a single value. So that’s what we will do. So when you work out the arithmetic, you’re summing over all of the completions or the log of q minus log of p of z minus log of p of z. And this first term can be computed over all possible completions. And that ends up being the kL divergence between q and p and the cry. The second term, the expectation that we saw over here, that is not tractable. This expectation is not tractable because p of x given z has a has a z component in it. And so this second term is going to just be the so this kL divergence I just written out the formula for the kL divergence between the Gaussian q and the prior p. And the second term, which was the this guy here is just going to be this term here. And so the overall loss that you’ve got when you’re trying to minimize it with respect to pi is going to be this equation here. I’m not going to spend time with spend time with the equations. You can go go through those on the slides at your leisure. But basically what we want to say is that what I want to point out is that we can come up with a loss term that we would have minimize to to minimize the error between you and the posterior probability of z given x. And that loss term is going to have is going to be this guy here. So you take the derivative of this term with respect to pi. And so that’s going to be the loss with the derivative of this first term with respect to pi, which is explicitly a function of pi times the derivative of the second term with respect to pi, which is going to be using the change rule, the derivative of this term with respect to z times the derivative of z with respect to pi. And that gives us. And we know how z is a function of pi. z was a function of pi because of this repatimentalization trick. We would used a long time ago. If you change pi, mu is going to change and sigma is going to change. And so z would change. So we can compute the derivative of z with respect to pi. And so that gives us this overall problem that for this third step, we would define this loss with respect to pi, which is given by this formula over here. And we are going to minimize this loss with respect to the parameters of pi. Now, no thing looks very complex. But then when you write all of the arithmetic down, it’s sort of first minimizing with respect to theta and then minimizing with respect to pi, you can do it jointly by just adding the two losses up. So this is the loss you minimized with respect to theta. This is the loss you minimize with respect to phi. You sum them all up. And you say here is a loss. It’s a function of both theta and phi. And we need to minimize it. And so the overall training process is like this. You’re going to initialize both the phi’s and the theta’s. In the forward pass, you’d run for each x, you’d compute this Gaussian, you’d sample a z. And now we have complete data. Using complete data, you can now define this loss, which is a function of both phi and theta. And you’re going to just compute the gradient of this loss with respect to phi and theta and perform a gradient is an update. So training this kind of for model, which is called a variation order encoder, variational because you’re using a variational lower bound. You’re maximizing at you’re maximizing a variational lower bound. Right. And it’s an autoencoder because you can think of this as the encoder and the generating component as the decoder. So this whole thing can be learned by just minimizing this loss iteratively. You first sample the z to complete the data. Then you compute this loss using the samples e and the x and then you take gradient descent updates with respect to this loss. Once it’s trained, you’re really interested in a generative model. That’s just the decoder portion of this entire model. And so to generate new data, I can just randomly generate a Gaussian, a vector from a standard Gaussian, put it through this function, add some noise and I’m going to get x. So where are the neural networks? F is a neural network and nu and sigma over here are also going to be computed by a common neural network. So we’re at 2.0. Now the decoder is the actual generative model. The encoder is primarily required for training. But then this encoder can also be used to generate approximate distributions of latent space representations, condition on specific inputs. So for example, if I gave you an x, you can think of nu of x as a latent representation of x. And so let’s give this whole, you can, you have this entire model which can both now generate new samples and compute latent representations of x. The VEs are unfortunately strictly generative models because you cannot actually write out the formula for p of x. You cannot compute the probability of any given observation. All it is really useful for is to generate new samples that look like you’re training data. So for that you’d be sampling from z, a standard Gaussian and putting it through f and you’d be getting new samples. And what do they look like? Here in this case, this is from Carl Ders, he trained a VAE is on M nest and then generated random samples from it. You can see it actually looks like characters from M nest. Or this is from Rocco, he trained a VAE in images of faces and then generated random samples from it. And you can see it actually ends up looking like faces. So what you found is that it actually learns the manifold on which faces lie and is also able to characterize this distribution. So that when you sample these data from it, they look like faces. But then there’s more to it. This, even the M coder can be used. The M coder computes this new of X from X, which is roughly the me of the distribution of all z’s that could have generated this X. So if I started off by say two X’s, the first X was this face and I got a new from it. I got a second X which is this face and I got a new from it. So then if I take any Gaussian whose mean lies linearly between the meat for this, this new of X for this face and the new effects for this face and then generate z’s from that and pass it through F. The faces and generates are also going to be some kind of linear interpolations between these two limits. So you can actually use this to get latent space and calculations off. They were to get to explore the continuum between different images. So in conclusion, BAEs are simple long linear extensions of radios and models. They’re excellent genitive models for the distribution of a data. There are various extensions that is conditional means which model conditional distribution is like we have X given. They can also be embedded into dynamical system models like marble models. In all cases, the arithmetic for learning is very similar to the that presented here and is actually very simple. It’s very simple, similar to simply trading an autoencoder and what’s really well. So I’ll stop right here and I’ll take any questions. I’ve gone over by three minutes. I’ve spotted my apologies for that. We’ll take other questions on the answer. Any questions? I just have one if I may. Yeah. So actually, how efficient is using like the original expectation maximization to generate data versus the variation of autoencoder? So the variation of autoencoder is my expectation maximization algorithm. Right? If you could actually write P of Z given X in closed form, that would be here. The only difference over here is that we don’t know how to write P of Z given X. So we’re approximating it with this curve. Is there a reason why are we always trying Gaussian? So, Gaussians because Gaussians have any two parameters. You could try other things. Okay, thank you. In fact, there is some work on trying other kinds of models. So, Hasha, the poll is up. Any other questions? Done. So, in that case, thank you all for attending this lecture. I’ll stop right here. The next two classes are going to be about Gens, and then I’ll come back and see you the next. Thank you.