11-785, Fall 22, Lecture 21: Variational Autoencoders (redo)
So, all right folks, my apologies for the not-too-great lecture yesterday. I'm going to go through a lot of that again, and probably go through it reasonably fast, so please stay with me; if you have any questions, just ask. So here was the problem, right? I'm going to revisit many of the topics we did. We were trying to train neural networks to be generative models. From a large collection of images of faces, we want to train a network to learn to generate a new portrait; or maybe from a collection of landscapes, we want to train a network to generate new landscape pictures. So in each case, we want to learn the distribution of the data, in this case images of landscapes, and know how to draw a new sample from this distribution, so that when you draw this new sample, it also looks like a landscape. We've seen how to use neural networks as classifiers and regressors. Now we're speaking of how to use neural networks as generative models, to model the distribution of any data so that we can draw samples from it. Now what is a generative model? As we saw, this is a model for the probability distribution of the data. The computational equivalent is that we have some magic box which can be excited with some random seed, and it's going to generate data that looks statistically similar to whatever the box has been trained to emulate. But generative models can also be thought of as distributions. We've seen parametric distributions; for example, the probability mass function for the roll of a die, or a Gaussian distribution. These two are generative models. We also saw that the way you would learn a generative model is that you are given some observed data X, and you choose a parametric model for the distribution of the data.
So P(X; theta) is the parametric model, where theta represents the parameters of the model, and we want to estimate theta so that P(X; theta) best fits the distribution of X. For instance, if we had a collection of outcomes of the roll of a die which had this histogram, then we want to learn the probabilities for the faces of the die that best explain this histogram. Or if we had a collection of data whose histogram looked like this bell shape, and we decided to model it as a Gaussian, then we want to estimate the parameters of the Gaussian that best explain this data. Now, do you guys remember what the principle was behind how we estimated these parameters? Anyone? Does anyone remember that? Hello? Anyone here? Yeah, so it was the maximum likelihood estimate, and MLE was based on the assumption that the world is a boring place, that the data you're seeing are very typical of the process you're trying to model. And so, in fact, we're going to pick the distribution, or the distribution parameters, that assign the highest probability to your data. We saw a couple of instances of this. For example, when we were trying to learn the parameters of a categorical distribution that best explained the histogram, the log probability of the data was given by this formula over here: it's a sum over all possible outcomes of the number of times the outcome was observed, times the log of the probability of that outcome. If you maximize this with respect to the probabilities of the outcomes, you end up with the familiar ratio formula for these probabilities. Similarly, for the Gaussian, if you had a collection of data and you computed the log Gaussian likelihood for all of this data, this would be a function of the mean and the variance of the Gaussian.
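To make those two maximum likelihood estimates concrete, here is a small numpy sketch; the counts and data are made up for illustration:

```python
import numpy as np

# Categorical MLE (the die example): the ML probabilities are
# just the normalized counts -- the familiar ratio formula.
counts = np.array([10, 25, 5, 30, 20, 10])   # hypothetical histogram of die rolls
p_mle = counts / counts.sum()                # maximizes sum_k counts[k] * log p[k]

# Gaussian MLE: the empirical mean and the (biased) empirical variance.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=100_000)
mu_mle = x.mean()
var_mle = x.var()                            # ML estimate divides by N, not N - 1
```

With enough data, `mu_mle` and `var_mle` land close to the true 2.0 and 2.25, which is the "boring world" assumption at work.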
If you maximize this log likelihood with respect to the mean and the variance, you end up with the well-known formulas for mean and variance. So this we saw. We also saw that in some cases the data provided may be incomplete, may be insufficient to write out a complete log probability, because the data could have missing components, or because of the structure of the model. In the instance where the data have missing components, we saw an example of trying to learn a Gaussian distribution from a collection of vectors where some components were missing. So here, these are the data that you see: these are the observed vectors, and these components are missing. The complete data of course include the observed and missing components, but we don't actually have the complete data. Now, when you use the maximum likelihood principle, you want to maximize the likelihood of the observed data, because that's all you really have. And so this is the sum over all observations of the log of the probability of the specific components that you have observed in that vector. Unfortunately, as we saw, Gaussians are not defined on incomplete vectors; they are defined on complete vectors. A complete vector is the combination of the observed and missing components, and so the way we get the probability of just the observed components is by marginalizing out the missing components, by integrating them out of the distribution. And so the log probability of the training data, which only considers the log of the probabilities of the observed components, has an integral inside the log. Trying to maximize this with respect to the parameters of the Gaussian is not going to be easy, because you have the log of an integral that must be maximized. Or we saw this other case, where we had a Gaussian mixture model. Here the generative model is this: we are given a collection of Gaussians.
In each draw, the process first selects a Gaussian and then generates an observation from that Gaussian. If you look at the distribution of the final observations output by this box, it's going to be a convex combination of many Gaussians, and can have an arbitrarily complex shape. And so we want to learn the parameters of these Gaussians that best model the distribution of those observations, where the distribution can have a fairly complex shape. But if you think about this from the perspective of the generating process, then in order to generate any observation, the process actually draws two variables: first it selects a Gaussian, then it draws a vector from that Gaussian. So the probability of a specific generation is the joint probability of selecting a particular Gaussian and that Gaussian generating the observation you just saw. It's going to be the probability of selecting the Gaussian, times the probability of the observation as computed by the Gaussian. And so the overall probability of any observation is a sum over all Gaussians: you could have generated this observation by selecting this Gaussian and this Gaussian generating the observation, or by selecting that Gaussian and that Gaussian generating the observation, and so on. So you have to sum the product of the probability of selecting the Gaussian and the probability of the Gaussian generating that observation, over all of the Gaussians. Now, if you want to learn the distribution, you need to know everything that the generating process actually drew, all the steps it went through in order to generate an observation. In other words, with every vector, you would also need to know which Gaussian was used to generate that vector. And if you had that, you could segregate out the vectors drawn from each of the Gaussians, and then you could estimate the parameters of those Gaussians.
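That "sum over all Gaussians" can be sketched as follows; the mixture weights and component parameters here are invented for illustration:

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density N(x | mu, var)."""
    return np.exp(-0.5 * (x - mu) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

# Hypothetical 3-component mixture.
weights = np.array([0.5, 0.3, 0.2])   # P(k): probability of selecting Gaussian k
means   = np.array([-2.0, 0.0, 3.0])
vars_   = np.array([1.0, 0.5, 2.0])

def mixture_pdf(x):
    # P(x) = sum_k P(k) * P(x | k): marginalize out which Gaussian was selected.
    return sum(w * gaussian_pdf(x, m, v)
               for w, m, v in zip(weights, means, vars_))
```

Because the weights sum to one, `mixture_pdf` is itself a valid density, even though it is no longer a single Gaussian.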
So this was easy. Our problem was that we actually only get to see the observations; we don't see the Gaussians the observations were drawn from. What we need is the combination of the observation and the Gaussian; what we get are just the observations themselves. And so when we try to maximize the log likelihood of what we have observed, we'd be estimating the parameters of all of the Gaussians to maximize this term here, which is the log likelihood of the entire data that we have observed, except that we have only observed the vectors. We don't know which Gaussians they came from, so the Gaussians have to be marginalized out. And so the maximum likelihood estimate ends up with a summation term inside the log, which is hard to maximize. So we have this general form of the problem, where we have some missing data or variables, and the estimate takes this form where you have a summation or an integral inside a log, where the inner term marginalizes out the components or variables that you did not observe. This form of function is very hard to optimize, and so we need some kind of simple, tractable approach to dealing with these problems. So, do we remember all of this? I'd like to see some raised hands to make sure that all of this is registering. Okay. So guys, please stay with me and respond, because otherwise it will slow us down, right? So the way we did this, and this is where I believe we began losing it: we can rewrite log of P(O), which is log of the sum over all the missing variables, or the hidden variables, of the joint probability of H and O, as log of the sum over H of Q(H) times P(H, O) over Q(H), where Q(H) is some distribution over H. Once we write this, we use the concavity of the logarithm function to show that the log of a convex combination of terms is an upper bound on the convex combination of the logs of the terms.
So this term to the right, which is the weighted combination of log of P(H, O) over Q(H), where the weights are the Q(H)'s themselves, this term to the right is a lower bound on log of P(O). It is a variational lower bound on log of P(O), also called an evidence lower bound. Do you guys remember this? Any doubts about this over here? If you remember this, please raise your hands. Okay, so no doubts, right? And I've explicitly shown that this P(O) is actually a function of the parameters, by putting the theta on the side. Now how did this work? Log of P(O) is some function of theta: if you change the parameters, the probability changes. And the lower bound is going to be some function below it. If it's a tight lower bound, then when you maximize the lower bound, you hope to also maximize the original function. And so this is what we were trying to leverage. "Variational lower bound" is just terminology; this is called a variational approximation, and the name comes from the calculus of variations. It's terminology as far as we're concerned, right? That's why we often drop the word "variational" and use the term evidence lower bound, ELBO. And we'll see why the "evidence" comes in. So you're sort of comfortable with the idea that it's a lower bound: log of P(O) is lower bounded by this term. And so how can we make this lower bound tight? We can make this lower bound tight by choosing the Q(H) appropriately; you want it to be as tight as possible, right? And so we had a two-step solution, where we first made the lower bound tight by choosing the appropriate Q(H), then maximized this term with respect to theta using this tight lower bound, and then repeated the process.
Now how do we find the Q(H) that makes the lower bound tight? It's very simple. We know that this term to the right is upper bounded by log of P(O), right? This thing over here. So if I maximize this with respect to Q(H), what is the largest value this can take? Anyone? Log of P(O), right? And so I can maximize it. What I can do is literally take the derivative with respect to the Q(H)'s, with the constraint that they're all positive and sum to one, and then solve for Q(H). In other situations it's not unconstrained: I may have constraints on Q(H), where Q(H) is specifically modeled by a neural network or something else. In that case you can't just take derivatives and equate to zero, but you can still maximize this by using some kind of backprop on Q(H) to maximize this term. Either way, we can find the Q(H) that maximizes this lower bound. If we didn't have constraints and we just maximized this with respect to Q(H), unconstrained, what we would find is that the Q(H) that maximizes this term is the conditional probability of H given O, using the current value of theta. So here's what we're going to do for this optimization problem: I'm going to estimate the Q(H) that maximizes this term, and then fix Q(H) and maximize the term to the right with respect to theta. So let's choose a Q(H), based on this knowledge that if you explicitly maximize this, you're going to find that the best Q(H) is P(H given O). If theta prime is the current estimate of theta, I'm just going to set Q(H) to P(H given O), computed using theta prime as the parameters. So now, for the second step, I fix Q(H) and maximize the right-hand side, where Q(H) is given by P(H given O) with theta prime as the parameter.
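Both claims in the last two paragraphs are easy to check numerically. In this sketch, the joint P(H, O) over five hidden values is made up: any Q(H) gives a lower bound on log P(O), and Q(H) = P(H | O) makes it tight.

```python
import numpy as np

rng = np.random.default_rng(0)

# A made-up joint P(H, O) for a fixed observation O and 5 hidden values;
# it sums to P(O) = 0.4, not to 1, because O is held fixed.
p_h_o = rng.random(5)
p_h_o *= 0.4 / p_h_o.sum()
log_p_o = np.log(p_h_o.sum())

# Any distribution Q(H) gives: sum_H Q(H) log(P(H,O)/Q(H)) <= log P(O).
q = rng.random(5)
q /= q.sum()
elbo = np.sum(q * np.log(p_h_o / q))

# The bound is tight exactly when Q(H) = P(H | O) = P(H, O) / P(O).
q_star = p_h_o / p_h_o.sum()
elbo_tight = np.sum(q_star * np.log(p_h_o / q_star))
```

Here `elbo` comes out strictly below `log_p_o`, while `elbo_tight` equals it (up to floating point), which is exactly why the E-step sets Q(H) to the posterior.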
So this is going to be the sum over all H of P(H given O; theta prime) times the log of the joint P(H, O; theta) divided by P(H given O; theta prime). This here is Q(H), and this term here too is Q(H). So is this equation kind of clear, or is there any doubt over here? Any questions? Okay. So observe that this term to the right actually has two instances of theta. One is theta prime, which was used to obtain Q(H); the other is theta, which occurs inside this term. So this lower bound is a function of theta prime and theta. Is this making sense, guys? I'd like to see a few hands raised. We also know that, because it's a lower bound, log of P(O) is always an upper bound on J(theta, theta prime). I'm going to call this an auxiliary function; it's this term over here, which is the evidence lower bound. Now, theta prime is just some parameter that I used to decide Q: if I look at it from the perspective of this step over here, Q(H) is fixed, theta prime is just a parameter, and J is actually a function of theta. Is this making sense? And specifically, when the two thetas are the same, the right-hand side becomes log of P(O). How is that? Let me show you. This has two thetas, call them theta one and theta two. If both of them are the same, J(theta, theta), then it's the sum over H of P(H given O; theta) times the log of P(H, O) divided by P(H given O). And P(H, O) divided by P(H given O) is just P(O), right? So this entire thing becomes the sum over H of P(H given O) times log P(O).
And log P(O) is not a function of H, so it can come out of the summation. So this is log P(O) times the sum over H of P(H given O), and the sum of any probability distribution is 1. So this is just log of P(O). Do you see how J(theta, theta) is log of P(O)? Guys? Okay. So how does this work for us? Here's what we will do: this gives us an EM algorithm. I'm going to define J(theta, theta prime) as this term over here, where Q(H) is P(H given O) with theta prime as parameter, and inside you have log of P(H, O) with theta as parameter. We know that log P(O; theta) upper bounds J(theta, theta prime), and we also know that log P(O; theta) equals J(theta, theta). So using these, we end up with a nice iterative algorithm: I can use the current estimate of theta over here, maximize J with respect to the theta here, and make that my next estimate, and it will keep increasing my log likelihood. If you don't believe me, I'm going to show this pictorially. Let's say I start off with an initial value theta 0 and stick that theta 0 in out here. When I stick in that theta 0, I get a J(theta, theta 0), which is a function of theta, and this is a strict lower bound on the actual log probability, which is also a function of theta. Is that making sense to you guys? But then, when theta becomes exactly equal to theta 0, this lower bound becomes equal to log of P(O; theta 0); that's what we just saw. So the blue curve is going to touch the red curve when theta equals theta 0. And then I can vary this theta, and find the specific theta where J(theta, theta 0) is maximum. And I know, from the property of lower bounds, that at this value, if I call it theta 1, J(theta 1, theta 0) is guaranteed to be less than or equal to log P(O; theta 1). So, you guys get this? So what does this mean?
As a result, if I maximize this J function, then at the theta where J is maximized, the log of P(O) here is guaranteed to be greater than the log of P(O) over here. Do you see how that happens? Is this clear, guys? I'd like to see a few more hands raised. So this is our very simple trick. Here's what I'm going to do: I'm going to start off with a theta 0, plug that theta 0 in on the right over here, and create one J function. Then I maximize the J function. That gives me a theta 1, and at this new theta 1, the log likelihood of the data is greater than at theta 0. So now I'm going to create a new J function with theta 1 on the right-hand side. The new J function is going to touch my log likelihood function over here. Then I maximize this one, and when I maximize it, I get my theta 2. And again, using the same property, I know that the log likelihood of my data using theta 2 as parameter is greater than the log likelihood with theta 1 as parameter. So I increase the log likelihood. Now I create yet another J function with theta 2 as my parameter and maximize it, and I keep repeating this until I find a maximum where the maximum of the J and the maximum of the log P are the same, so the algorithm cannot move any further. These iterations are guaranteed to always increase, or at least not decrease, the log likelihood of the data. And you're right, Eric, this can get stuck in a local optimum; this is much like gradient descent. So this is the EM algorithm. Does this make sense to you guys? Here's the critical question: did this make more sense than it did yesterday? I'd like to see some hands raised if it didn't. Okay.
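The argument of the last two paragraphs can be written compactly as a chain of inequalities; this is the standard EM monotonicity guarantee, stated exactly as derived above:

```latex
\log P(O;\theta_{t+1})
  \;\ge\; J(\theta_{t+1}, \theta_t)   % J lower-bounds \log P(O;\theta) everywhere
  \;\ge\; J(\theta_t, \theta_t)       % \theta_{t+1} maximizes J(\cdot, \theta_t)
  \;=\;   \log P(O;\theta_t)          % the bound is tight at \theta = \theta_t
```

So each iteration can only move the log likelihood up, or leave it unchanged at a fixed point.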
So you can see how this works, right? How the math works, and that's your iterative algorithm. Okay, so I'll skip this. So this was training by maximizing a variational lower bound on the log likelihood of the data. That's the math. But what does it really do? Let's take a look; this is where I believe I really lost you. Consider the case where the data have missing components. What EM really does is to complete the data. So we have this incomplete data, where you only have partial components. Now consider one single vector. This single vector has a missing component, and I'm going to try to fill it in somehow. So let me ask you a question. (This whiteboard is being very annoying; let me see... why won't this write? I can't write with my finger. How do I erase this? Okay.) Suppose I have data of this kind, where the pair (x, y) takes only these four values. For whatever reason, it only takes these four values, okay? This axis is x and this axis is y, and y takes two values; let's call this one A and this one B. The probability here is 0.1, the probability here is 0.1, this one is 0.3, and this one is 0.3. Now let's assume that you saw this value of x, but nobody told you what the value of y was. If you had to choose a value of y, which value would you choose? Anyone? At this x, y only takes these two values, with probability 0.3 for A and 0.1 for B. So which value of y would you choose? A, because A is more likely, right? Does that make sense? Of all the instances of this x, three quarters had y equal to A, and one quarter had y equal to B.
And now suppose (I'm having trouble with Microsoft Whiteboard because they've changed the thing) that I had one million instances of observations with exactly the same x. Then of these one million instances, how many of them would have y value A, and how many would have y value B? Can you tell me? Anyone? 75–25, right? Because P(A given this x) is 0.75 and P(B given this x) is 0.25. So now here's what I will do: I'm going to make a large number of clones of this guy, some L clones, and each one is going to have its missing value filled in. When I fill in the missing value, I would expect that L times P(A given x) of them are going to have the value A over here, and L times P(B given x) are going to have the value B over here. Is that making sense? This is the missing component; this is observed. So if I made L clones of these, and imagined that these L clones were some L instances picked from a very large collection of draws where the x values were all the observed x, then I would expect that these many of them are going to have A in the missing value, and these many are going to have B in the missing value. Correct? Does that make sense? I'd like to see more hands raised. So here's what I'm going to do: I'm going to make many clones of this guy and fill in every possible value, but I'm going to make as many copies of each value as its posterior probability dictates. If I make L copies, then each value M is going to be filled in L times P(M given O) times, where O is always the observed value and M is the potential missing value. So each value is going to occur L times P(M given O) times in these filled-in values.
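The 75–25 completion above can be written out directly, using the numbers from the whiteboard example:

```python
# Joint probabilities at the observed x: the two candidate y values at this x
# have mass 0.3 (for A) and 0.1 (for B), as in the whiteboard example.
p_joint = {'A': 0.3, 'B': 0.1}

# Posterior of the missing y given the observed x: P(y | x) = P(x, y) / P(x).
p_x = sum(p_joint.values())                           # P(x) = 0.4
posterior = {y: p / p_x for y, p in p_joint.items()}  # A: 0.75, B: 0.25

# Complete the data: out of L clones, L * P(m | o) get each fill-in value.
L = 1_000_000
clones = {y: round(L * q) for y, q in posterior.items()}
# clones == {'A': 750000, 'B': 250000}
```

Nothing about the observed x is duplicated unevenly: every data point gets the same L clones, only the fill-ins differ.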
Does that make sense? So here's what I did: I'm going to complete the vector by filling up the missing components with every possible value, but the number of times each value gets filled in is going to be proportional to P(M given O), which can be computed from P(O, M). And so here's what happens. I take this guy and make L clones of it, where L is a very large number, and I fill in the missing components, where each missing value occurs a number of times proportional to P(M given O); it occurs L times P(M given O) times. So I make L copies of this guy and put in all the missing components. Similarly, for this one, I also make exactly L copies, where L is again a very large number, and I fill in each combination of missing values in proportion to P(M given O). Is this step making sense? Yes or no? Okay, can you at least type a yes, so I know that you guys are with me. Okay, thank you. And this posterior can come from a previous estimate of the parameters, right? So now I have a completed dataset. If I have a completed dataset, let's see what happens. Suppose each vector has been expanded L times. Now I can just say that the overall mean of the data is the mean of this expanded dataset, which has been computed by filling in these missing values. I've completed the data. And because every vector has been expanded by the same proportion L, I'm not over-representing anything; on the other hand, all the data are filled in, so there are no incomplete data. Is this making sense? Just say yes in the chat, so I know you're with me. And so now I can compute my mean from this, correct? So let (X_i, M) be...
Let's consider any single vector, which has X_i in the observed components and M over here. How many times is this combination (X_i, M) going to be observed? I'm going to see it L times P(M given X_i) times, correct? You agree? And so the contribution of (X_i, M) to the mean is going to be this completed vector times L times P(M given X_i). Correct? Actually, rather than X, I should call the observed part O. You guys with me so far? Yes? And I want to sum this over all possible values of M, because every O has been expanded; this is the contribution of each observed vector to the mean. Look here, not there: each vector has generated this box, and from that box, this specific combination (O, M) has occurred these many times. Is that making sense? And this I sum over all my observations, correct, to get the sum of all of these guys. And then I divide by the total number of observations; but what is the total number of observations? I have the total number of vectors, but each of them was expanded L times, correct? So the L's cancel. Do you see that? "So is L not the same for..." No, I'm saying we don't want to over-represent any given observation: you don't want to make more copies of some observations than others, otherwise you're assuming that observation occurred more times than it actually did, right, Eric? "Sure, but if you have one piece of missing data from one vector and three pieces of missing data from the next vector, won't they expand to different numbers, if you fill in every combination?" So I'm saying that if L is very, very large, everything is going to get represented; but if you make this box wider than that box, you're over-representing this observation.
So you're not literally enumerating every combination; you're making a large number of copies, and I'm saying that if L is a large enough number, you're going to see every combination anyhow. Okay. "So in the first example, with one piece of missing data, it's probably got some duplicates, maybe quite a few duplicates." Yes, everything is being duplicated; everything occurs L times P(M given O) times, right? "Sure, but if you need, say, 10,000 copies to fully represent three missing variables, you don't need 10,000 copies to fully cover the space of one missing variable." You have to look at it this way: the point is that you don't want to over-represent any one thing. Think of L as tending to infinity, always. And so when you do that, you end up with this equation over here: a sum over all missing components of the posterior probability of the missing component, times the completed vector; the (X_i with M in the parentheses) is basically (O, M). And if you work it all out, the L's cancel, so this term here takes a one-over-N form. But then, if M is continuous, in the limit this summation becomes an integral, and that's how you end up with this formula for the updated mean. Is this making sense? Do you guys see where this equation came from? Yes or no? Can I see a yes? Right. And I can do the same thing over here: for every completed vector, I can compute X minus mean, times X minus mean transpose. This combination is going to occur P(M given O) of the time, correct? And so I'm integrating over all possible values of M, and summing over all observations, with a one over N.
This is going to be my updated estimate for the variance. So do you guys see where this equation came from? Right, easy, right? So that's what EM is doing. Now, let's take the case of the Gaussian mixture problem. Here, we don't know which Gaussian each vector came from. Consider this case. Let's say I have these three Gaussians (for some reason it's not letting me use my pen, so I'm doing this with my mouse): Gaussian one, Gaussian two, and Gaussian three. Now, let's say I observed this particular observation. If I plot where this observation intersects each Gaussian, let's say the height on this first Gaussian is 0.2, and the height on both of the other Gaussians is 0.4, so the values are proportional to 0.2, 0.4, and 0.4. If you got this observation, and then you had, say, a billion of these observations, how many of them do you think were obtained because in the first step you selected the black Gaussian? 20 percent, right. And how many came from the purple? Yeah, 40 percent, and the red would be 40 percent, right? So once again, it's going to be proportional to P(k given O). So we're back to where we were. I'm going to ask: if I saw a very large number of these vectors, how many of them would have come from each Gaussian? Every Gaussian is theoretically capable of generating that particular observation. So if you had a very large number L, some proportion P(k given O) of those are going to come from the k-th Gaussian. So I'm going to complete the data by attributing every data vector to every Gaussian. I'm going to assume that I saw some very large number L of instances of this guy; in my example, I had three Gaussians. So of those L instances, how many came from the blue Gaussian? Anyone?
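The 20/40/40 attribution in that example is just a normalization, which this small sketch spells out (heights taken from the example):

```python
import numpy as np

# Heights of the three Gaussians at the observed point, i.e. values
# proportional to P(k) * P(o | k)  (numbers from the example above).
heights = np.array([0.2, 0.4, 0.4])

# Posterior over which Gaussian generated the observation:
resp = heights / heights.sum()            # approximately [0.2, 0.4, 0.4]

# Of L clones of this observation, L * P(k | o) are attributed to Gaussian k.
L = 1_000_000_000
attributed = np.round(L * resp).astype(int)
```

The same normalization is the E-step of EM for mixtures: every observation is fractionally assigned to every Gaussian.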
If I saw L instances of O, how many of them would have come from the blue Gaussian? That would be P(blue given O) times L, right? And P(green given O) times L are going to come from the green Gaussian, and P(red given O) times L from the red. And of course, P(k given O) can be obtained from the previous estimate of the model. So here's what I'm going to do: I'm going to take my data and complete it in every possible way. This box actually represents L clones of this O, where L is a very large number tending to infinity. This one represents all copies of this O that were assigned to the blue Gaussian, this one represents all copies assigned to the green Gaussian, and this one represents all copies assigned to the red Gaussian. Is this picture making sense? Yes or no? Okay, so, beautiful. And now once I do this, I can just segregate out all of the blues, all of the reds, all of the greens, because now I have complete data, and I can re-estimate my Gaussians. But let's take a look at this blue Gaussian. If I take a look at the blue Gaussian, I have many instances of vectors. I have vector O1, which was assigned to the blue; I have O2, which is the second vector; I have O3, which is the third vector. All of these are assigned to the blue Gaussian. So this first guy: how many copies of it do I have assigned to the blue Gaussian? Anyone? We just went through this, right? It's going to be whatever number I expanded it by, times P(blue given O1). That's what this guy represents. Yes or no? Is that making sense? There's absolute silence. Can someone answer: did this make sense, or do I have to go through this again? I just have one answer; anyone else?
So this is the right information, yeah, this one's correct. So how many copies of O2 are assigned to the blue Gaussian? It's going to be L times P of B given O2, right. And overall, the total count of vectors assigned to the blue Gaussian is going to be L times the sum over all observations of P of B given o. But then, if you want to take the average of all of these vectors, what is it? I had L times P of B given O1 copies of O1, right. So those contribute L times P of B given O1 times O1. And so the sum over all of these is going to be the sum over all observations o of L times P of B given o times o. And so if you want the average, it's the sum over o of P of B given o times o, divided by the sum over o of P of B given o. The L cancels out, correct? Is this making sense to everybody? Okay. Any questions? And so when you segregate these things out, you get that the updated mean for the k-th Gaussian is one over the sum over all observations of the posterior probability of the k-th Gaussian given the observation, times the sum over all observations of that posterior probability times the observation. So you see where this formula came from? Yes or no? All right. And you'll get the analogous formula for the variance. Yes or no? Okay, perfect. So the full recipe we came up with was: we initialize the means and variances and the prior probabilities of the Gaussians; we compute P of k given o using the current estimates of the parameters; and then we use those to update the parameters using these formulas. But that's the details. The point we wanted to bring across was this: what we did was to complete the data.
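The mean-update formula just derived can be written directly in code. This is a toy sketch under simplifying assumptions I'm adding (1-D data, two components, fixed unit variances, equal priors), just to show that iterating the weighted-average update recovers the component means:

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic 1-D data from two well-separated Gaussians (made-up example)
obs = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(3.0, 1.0, 200)])

means = np.array([-1.0, 1.0])  # deliberately bad initial estimates

for _ in range(20):
    # E-step: P(k|o) for every observation (unit variances, equal priors assumed)
    heights = np.exp(-(obs[:, None] - means) ** 2 / 2.0)
    post = heights / heights.sum(axis=1, keepdims=True)
    # M-step: mu_k = sum_o P(k|o) * o / sum_o P(k|o)  -- the clone count L cancels
    means = (post * obs[:, None]).sum(axis=0) / post.sum(axis=0)

print(means)  # should end up close to the true means, -3 and 3
```

Note that L never appears: as in the derivation, the very large clone count cancels between numerator and denominator, leaving only the posterior weights.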
By filling in the missing values, in both cases. And the way we filled in the missing values, we wanted it to be unbiased: we filled in every possible value for the missing variables, and the number of times each value occurred was in proportion to P of m given o. This now gave us a completed data set. And once we have a completed data set, we knew how to deal with the estimation problem: we just estimate from the completed data set. Does that make sense? Yes or no? Right. So this is easy, right? It's not complicated at all. The math makes it look a lot more complicated than what we're really doing. But then, given this, what is the key thing we are doing over here to make the problem tractable? Can someone tell me? Anyone? We were completing the data, right. Because once you had the completed data, then estimation was trivial. It's just that completing the data depended on the current estimate of the distribution itself. So we ended up with an iterative solution, where we initialize the distribution, use it to complete the data, use the completed data to re-estimate the distribution, and so on. Does that make sense? I see some hands raised, so this made sense. Okay, half a dozen people. So can someone give me some other way of completing the data? Sampling, right. It's sufficient to complete the data by sampling the missing values from the posterior. So all we could do is, after initializing the models, complete each vector simply by sampling. You don't even need to make many clones and complete each one separately; you could just complete each data point by sampling from the posterior. And now you have completed data, and from the completed data you can re-estimate the parameters. And that too should have essentially the same effect.
Considering every possible value is the more complete and cleaner solution, but sampling will work just fine. So the overall principle, remember this: initially you have some missing data or information; you initialize your model parameters; then you complete the data according to the posterior probabilities, either by implicitly considering every possible value, or by explicitly completing it by sampling from the posterior distribution; and then you re-estimate the model. So is everybody here with me so far? Any questions? Anybody? If you have understood this, can I see some hands raised? We're going to stop in a couple of minutes, but I'd like to see at least half of you with hands raised. Okay. So the key here was that you had to sample from P of m given o for this to work, right. What do you think would happen if you didn't sample from P of m given o but sampled from some other distribution? Would it still work? The examples given here, incomplete Gaussian data, are a very common problem, and Gaussian mixtures are a very, very common problem. So maybe you guys will let me run over by five minutes; I'll give you a preview into what's going to come in the next class. But before I do that, I want to show you something very surprising. We know that this whole completion process works if we sample exactly from P of m given o. What happens when we sample exactly from P of m given o is this: if you have some f of theta, which is your log likelihood function, then the J function is going to touch it exactly. So this is J of theta, and at that point the gap is zero. And if J exactly touches f over here, the gap being zero, then maximizing J is guaranteed to maximize f as well. But then you don't need to exactly touch it. If it comes close, it's often sufficient, even if it didn't exactly touch it. So what this means is the following.
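The sampling-based completion just described can be sketched the same way. This reuses the same made-up 1-D setup as before (two components, unit variances, equal priors are my simplifying assumptions), but instead of fractional weights, each point is completed with one hard assignment drawn from the posterior:

```python
import numpy as np

rng = np.random.default_rng(1)
obs = np.concatenate([rng.normal(-3.0, 1.0, 500), rng.normal(3.0, 1.0, 500)])
means = np.array([-1.0, 1.0])

for _ in range(30):
    # Posterior P(k|o) under the current estimates (unit variances, equal priors)
    heights = np.exp(-(obs[:, None] - means) ** 2 / 2.0)
    post = heights / heights.sum(axis=1, keepdims=True)
    # Complete the data: sample one Gaussian index per point from P(k|o)
    z = (rng.random(len(obs)) < post[:, 1]).astype(int)
    # Re-estimate each mean from the points assigned to it
    means = np.array([obs[z == k].mean() for k in range(2)])

print(means)
```

The sampled assignments are noisier than the full fractional completion, but with enough data the re-estimated means land in essentially the same place.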
If your q of h in the original formula was not exactly equal to P of h given o, but something approximating it, then the technique still has a reasonable chance of giving you a continuously increasing sequence of estimates. Does that make sense, guys? So what would that map onto over here? Anyone want to tell me? Where is this going to be useful? In problems where P of m given o is not a tractable function. This is what we're going to encounter in the problems that we will see in the next class. So what people do is approximate P of m given o by some other function which is tractable but very close to P of m given o. So remember, going back over here: basically, P of h given o is just the best-case value of your q of h. But if I can't actually compute P of h given o, because my function is very complicated, then I can try to learn some kind of neural network which maximizes this lower bound. In that case, q of h is going to be very close to P of h given o, but not exactly that. That's why it's called a variational lower bound, and then you can use that q to do the filling in and the estimation, and things will still work. Did that answer your question? Any other questions, guys? Yes, no? Okay. So how is this going to make life different for us? I'm going to skip all of this; tomorrow I'm going to talk about linear Gaussian models, where there's a kind of Gaussian probability distribution, something called factor analysis, where you assume that some Gaussian random variable has been put through a linear transform and noise has been added, and that's how the data were generated.
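The bound being described can be written out explicitly. In the standard decomposition (with h the hidden variable and q an arbitrary distribution over it):

```latex
\log p(o \mid \theta)
  = \underbrace{\mathbb{E}_{q(h)}\!\left[\log \frac{p(o, h \mid \theta)}{q(h)}\right]}_{J(\theta,\, q)}
  + \underbrace{D_{\mathrm{KL}}\!\left(q(h) \,\big\|\, p(h \mid o, \theta)\right)}_{\ge\, 0}
```

Since the KL term is non-negative, J(theta, q) is at most log p(o | theta), with equality exactly when q(h) = p(h | o, theta). That is why using the exact posterior makes J "touch" the log likelihood, and why a q that only approximates the posterior still gives a bound that comes close.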
This looks like an autoencoder, except that there's some additional noise added at the output. And this is a model for Gaussian probability distributions: it says that the data have a roughly Gaussian distribution close to a linear surface. Now, the problem we began with early in the class is that of trying to generate faces. And in that case, you're going to have a problem where the data are distributed according to some distribution, but not on a linear manifold, on a curved manifold. And so here this box becomes a neural network. So our generative model is that there's some nonlinear function which takes in Gaussians and generates faces. If you do a very good job of estimating this function, then it will indeed learn the distribution and generate faces. But to learn this function, for each face you need the corresponding z, and you won't have the corresponding z. So we're going to use the trick that we just saw. Does that make sense? Any questions, folks? No? Okay. So, what was z? Sure, I'll repeat what I just said. Over here, suppose I wanted to randomly synthesize faces. Then I would randomly sample a point from a Gaussian distribution and put it through a nonlinear function, and it's going to generate a face. That is my generative model. And so what we would say is that every training instance we are learning from was generated in this manner: some Gaussian random variable was put through this nonlinear function and gave you, say, the picture of my face. And so, if you knew the z that went into this box to give you the picture of my face, then you have the input and the output, and you can just use backprop to learn the network itself, which converts Gaussian random variables to face-like images. But all you're given are faces; you're not given the corresponding z's.
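The generator model being described (sample z from a Gaussian, push it through a nonlinear function) can be sketched as follows. The two-layer network, its sizes, and its random weights are all made up for illustration; in a real model these weights are exactly what you would learn:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up decoder weights; in practice these are learned, not random.
W1, b1 = rng.normal(size=(16, 2)), np.zeros(16)
W2, b2 = rng.normal(size=(8, 16)), np.zeros(8)

def generate():
    z = rng.normal(size=2)        # z ~ N(0, I): the hidden Gaussian variable
    h = np.tanh(W1 @ z + b1)      # nonlinear transform
    return W2 @ h + b2            # the generated "image" (here just an 8-vector)

samples = np.stack([generate() for _ in range(1000)])
print(samples.shape)
```

The training difficulty is exactly the one stated above: you observe the outputs (the faces) but never the z that produced each one, which is what the EM-style completion machinery is for.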
And so now we have to estimate it: you have to work your way back, right. So this is where the whole EM framework comes in. But I see a question. Sorry, it might be easier to talk in person at some point. Next class. Okay, guys, thank you very much. Let me stop recording.