# 11-785, Fall 22 Lecture 23: Generative Adversarial Networks (Part 1)

Hello to everybody on Zoom and to the few people here in person, and welcome to this first lecture on GANs. I'm Aparajith, and Abu will also be presenting part of this lecture. Before we start, a quick question: is anyone here named Gan? Last semester there was a student named Gan, and it kept derailing the lecture, for obvious reasons.

For the past two or three lectures we have been discussing generative models. You have a data distribution, and your model tries to learn that distribution and generate new data from it. In the example on the slide, we have a dataset consisting of human faces, and our aim is to generate a new face that is very similar to the ones actually present in the dataset. A cool thing about generating data with GANs is that the images on the slide look very much like real-world photographs, yet these people do not exist. You can go to the website thispersondoesnotexist.com, and every time you reload the page you get a new face of a person who does not exist anywhere in the world. It is a little unfortunate for the lady in the top-left corner that she doesn't exist.

Moving on to the characterization of discriminative and generative models: we have seen discriminative models, which model the conditional distribution P(Y | X). If the crosses are one set of data and the circles are another, what a discriminative model computes is, given a data point, the probability that it belongs to some class. It aims to find a decision boundary that separates one set of data from the other.
That is how discriminative models work. Generative models, by contrast, try to model the distribution as a whole. How do you characterize that distribution? With one class here and another class there, the joint distribution is P(X, Y), which you can factor as P(Y) P(X | Y): P(Y) is the probability of the class, and P(X | Y) is the distribution of the data given that it comes from a particular class. For example, one factor would be the distribution of the data given that it belongs to the class of circles, and another would be the distribution of the data given that it belongs to the class of crosses, and so on. So in generative models the aim is to find the distribution of the data itself, not merely the boundary that separates the classes. The slide shows a pictorial representation of the same idea: generative models learn the distribution of the data. The disadvantage of discriminative models is that they are limited in scope: they can be used for classification, and sometimes for a bit of regression, whereas generative models can also be used for discrimination.
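To make the factorization concrete, here is a small sketch in Python. The classes, regions, and probabilities below are all made up for illustration: a generative model stores the class prior P(Y) and the class-conditionals P(X | Y), their product gives the joint P(X, Y), and, as we will see next, Bayes' rule recovers the discriminative quantity P(Y | X) from the same pieces.

```python
# Toy discrete example (numbers made up): a generative model stores the class
# prior P(Y) and the class-conditional P(X | Y); their product is the joint
# P(X, Y) = P(Y) * P(X | Y), and Bayes' rule recovers P(Y | X) from it.

p_y = {"circle": 0.4, "cross": 0.6}                    # class prior P(Y)
p_x_given_y = {                                        # P(X | Y) over three regions
    "circle": {"left": 0.7, "middle": 0.2, "right": 0.1},
    "cross":  {"left": 0.1, "middle": 0.3, "right": 0.6},
}

def joint(x, y):
    """P(X = x, Y = y) = P(Y = y) * P(X = x | Y = y)."""
    return p_y[y] * p_x_given_y[y][x]

def posterior(x):
    """P(Y | X = x) via Bayes' rule: joint / evidence."""
    evidence = sum(joint(x, y) for y in p_y)           # P(X = x)
    return {y: joint(x, y) / evidence for y in p_y}

# The joint sums to 1 over all (x, y) pairs, as any distribution must:
assert abs(sum(joint(x, y) for y in p_y for x in p_x_given_y[y]) - 1.0) < 1e-9
# A point in the "left" region is far more likely to be a circle:
print(posterior("left"))
```

Note that the boundary is a byproduct here: once the joint is modeled, classification falls out of it, which is exactly the point being made above.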
This can be written using Bayes' rule, P(Y | X) = P(X | Y) P(Y) / P(X): from the modeled distribution you can also recover the decision boundary. So a generative model can also be used as a discriminative model, but the other way around is not possible. What becomes challenging for generative models is dimensionality. The picture above is a very low-dimensional space, but if we consider the problem of generating new images, X has a great many features: a 64 x 64 image with three channels already has 64 x 64 x 3 = 12,288 dimensions, and even if you reduce the dimensionality with some encoding, it remains quite high. The high dimensionality of the input makes the problem complicated, and characterizing these distributions becomes challenging. That is the hard part of generative problems. So this is the problem we are trying to solve: learn a distribution and generate new data from it.

Until now we have seen one generative model, the VAE. You train a VAE end to end and throw away the encoder, and the decoder by itself is a generator; that is what the professor mentioned in his lecture. If you take a random latent vector z, sampled from a distribution P(z), which is most commonly a Gaussian, and pass it through the generator, the generator produces synthetic data. So you can consider the decoder, the second half of the VAE, to be a generator. And as you all know from the previous lectures, the VAE is trained with the help of maximum likelihood.
Maximizing the likelihood of the data: what does that mean? Consider some subspace, and suppose this line represents the space of faces, so that sampling points on this line gives you human faces. Maximizing the likelihood means maximizing the probability of the data you actually have. Now consider a point a little way off the line. It is not necessarily the case that this point is not a face; there is a possibility that it is also a face. But since we are maximizing the likelihood of the training data, the probability assigned to such points, which may well be faces, might be quite low. That is what maximum likelihood does: it maximizes the probability of the training data, and as for the other points, they might or might not be garbage; we simply do not know. That is not ideal for a generator, and it does not give a direct training criterion, because we know nothing about those points that might or might not be garbage. So we have to make the training criterion more direct.

Before going on, what do you think is the simplest way to evaluate a generative model? The most naive way is to generate a result, look at it, and say whether it looks real or not. Unfortunately, you are not differentiable: we cannot compute gradients with respect to you, nor backpropagate gradients into you. But we need exactly that in order to make the training criterion direct. In come Generative Adversarial Networks, GANs. A GAN has a generator, which plays the role of the decoder of a VAE, and it is adversarial in that it is made up of
two networks that try to compete against each other, and "networks" here means neural networks; we are not using any other machine learning algorithms. GANs were introduced in 2014 by Ian Goodfellow. The goal is to model the distribution of the data. We are not using the notation P(Y, X) here because there is just one set of data, so it simply becomes P(X). Our goal is to model this distribution so that if we sample from it, we get a data point similar to the real data. As I said earlier, a GAN has two networks: the generator, which generates data, and the discriminator, which does your job; it is the differentiable version of you, evaluating the data.

This is the basic structure of a GAN. You take a latent variable z and pass it through the generator, and the generator produces what we can call fake data. You also have the real data distribution, from which you take a data point x. Both are passed to the discriminator, which tries to classify the generated data as fake and the real data as real. The aim of the generator, then, is to produce data that the discriminator classifies as real. Initially the discriminator should classify the data produced by the generator as fake, but as training progresses the generator should learn to produce data similar to the real data, and the discriminator should start to make mistakes.
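The data flow just described can be sketched in a few lines of Python. The affine "generator" and logistic "discriminator" below are toy stand-ins with arbitrary weights, not trained networks; the point is only the shape of the pipeline z -> G(z) -> D(G(z)).

```python
import math
import random

# A toy sketch of the GAN data flow (not a trained model): the generator maps
# a latent z ~ N(0, 1) to a fake sample, and the discriminator maps any
# sample to a probability of being real. The weights here are arbitrary.

def generator(z, w=2.0, b=5.0):
    # Toy generator: an affine map of the latent variable.
    return w * z + b

def discriminator(x, w=1.0, b=-5.0):
    # Toy discriminator: logistic regression giving P(x is real).
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

z = random.gauss(0.0, 1.0)      # sample a latent variable from P(z)
x_fake = generator(z)           # the "fake" data point G(z)
d_fake = discriminator(x_fake)  # the discriminator's verdict D(G(z))
assert 0.0 < d_fake < 1.0       # the output is always a probability
print(x_fake, d_fake)
```

In a real GAN both functions would be deep networks and their parameters would be updated adversarially, which is exactly what the next slides set up.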
Moving on to the generator of a GAN: its goal is to model a fake distribution P_G(x) that matches the true data distribution P_X(x). As mentioned earlier, the latent vector can be sampled from a Gaussian distribution. The discriminator, on the other hand, is just a binary classifier: the real images are the images from your dataset, and the fake images are the images produced by the generator. The discriminator has to learn to classify the generated data as fake and the real data as real, but as training progresses it should start making mistakes, because that means the generator is doing a good job of producing data close to the real data. So the aim is this: if a perfect discriminator is fooled, meaning the fake data produced by the generator is classified as real, then the generator has done a good job.

Now, moving on to a little math. Consider two distributions: say this one is the distribution of your fake data and this one is the distribution of your real data. A perfect discriminator, or any good discriminator, finds a boundary that separates the two regions into fake and real. First it finds the ideal boundary at the place where the two distributions cross, assigning this region to fake and everything on the other side to real. But what you can see is that these regions here are the data that gets misclassified.
These are the places where the discriminator makes errors; you may have seen this as the Bayes error rate. Now the goal of the generator is to produce a data point over here, on the real side of the boundary, so that the discriminator misclassifies it. When that happens, the decision boundary shifts so that this point falls in the fake region again. In this way the distribution of the fake data shifts, and the discriminator finds a new decision boundary such that the generated fake data is back on the fake side; the discriminator works hard to find a boundary that minimizes its error. This happens continuously, and at some point the fake distribution sits right on top of the real data distribution. The discriminator then makes a huge number of errors, but your generator has learned to generate fake data that is close to the real data. That is the whole training objective of GANs.

These slides describe the procedure. The first step in training a GAN is to train the discriminator, because you need a discriminator that can actually find the decision boundary between the two distributions. Since it is a binary classification problem, the training examples for the discriminator are real examples and synthetic (fake) ones: the real examples come from your dataset, and the synthetic examples come from the current state of the generator, so whatever the generator produces at the current step is used as synthetic data for training the discriminator. The discriminator is usually trained to minimize the binary cross-entropy loss.
The next step is that the discriminator's loss is backpropagated to the generator. Training the generator means producing data that the discriminator classifies as real; in other words, the generator is trained to maximize the discriminator's loss. Even though the generated data is fake, the generator should learn to make the discriminator output one for it. This slide gives the formulation. For real data x, the desired output of the discriminator is D(x) = 1, because it is data you already have and whose distribution you want to model, and the corresponding log probability is log D(x). For synthetic data x_hat = G(z), the desired output is D(x_hat) = 0, because we know it is synthetic, and during the discriminator's training step the corresponding log probability is log(1 - D(x_hat)) = log(1 - D(G(z))). That is the first step in training GANs, and after this we have a poll, so we'll wait a few seconds for everyone to finish that super difficult poll. Also, I just noticed that the number of posts on our course forum is absurdly high; 1,681 is an enormous number, which I think is pretty cool. Okay, that's about time for the poll. One thing you need to keep in mind is what Aparajith mentioned: the first thing to do when training a GAN is to get a good discriminator, because if you don't have somebody telling the generator how to generate images, the generator won't be able to make things that look like what you want them to look like.
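The two log-probability terms on the slide can be sketched directly as loss functions; the probability values passed in below are hypothetical, chosen only to show how the losses move.

```python
import math

# A sketch of the two loss terms from the slide (toy probability values):
# for real data x the discriminator's target is D(x) = 1 with reward
# log D(x); for synthetic x_hat = G(z) the target is D(x_hat) = 0 with
# reward log(1 - D(x_hat)).

def discriminator_loss(d_real, d_fake):
    # Binary cross-entropy the discriminator minimizes:
    # -[log D(x) + log(1 - D(G(z)))]
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def generator_loss(d_fake):
    # The generator wants D(G(z)) -> 1, i.e. it maximizes the discriminator's
    # loss on fake data; equivalently it minimizes log(1 - D(G(z))).
    return math.log(1.0 - d_fake)

# A confident, correct discriminator has a small loss:
print(discriminator_loss(d_real=0.9, d_fake=0.1))
# A fooled discriminator (fake classified as real) has a large loss:
print(discriminator_loss(d_real=0.9, d_fake=0.9))
```

The adversarial structure is visible in the signs: the same log(1 - D(G(z))) term appears in both functions, pushed in opposite directions by the two players.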
In theory you could do it the other way around as well, but the math works out a lot more nicely once we have a good discriminator first; we'll get to that in a second. First we want to revisit something known as information, in the mathematical sense. We have these two sentences here: "Today I didn't see a tornado" and "Today I did see a tornado." A quick question for anybody on Zoom or the few of you here, and the answer is on the slide: which of the two sentences contains more information? The one on the bottom, right. The reason is clear: the one on the bottom describes an event that does not happen very often. The same concept applies in NLP: if the only words in a sentence are articles like "a" and "the", it doesn't tell you much, because those words are extremely common and you get little information out of them. So the information content of an event is high when the probability of that event is low, which is why -log p(x), where p(x) is the probability of the event x, makes sense as a formulation of information content.

Now, what is entropy? We have all heard the term cross-entropy, because we have done all these homeworks and we know it as a loss function for classification, but what is entropy really? Entropy is just a weighted average of the information content of a bunch of events; in other words, it is a measure of uncertainty. It is calculated as a weighted average where the weights are the probabilities of the events. All of this comes together in a bit. There is also an interesting aside here, which I will come back to at the end of the lecture if we have time.
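The two definitions can be checked in a couple of lines; the tornado probabilities below are, of course, made up.

```python
import math

# Information content I(x) = -log P(x): rare events carry more information.
def information(p):
    return -math.log(p)

# "I did see a tornado" (rare) carries more information than
# "I didn't see a tornado" (common):
assert information(0.001) > information(0.999)

# Entropy: the probability-weighted average of information content,
# H(P) = -sum_x P(x) log P(x), a measure of uncertainty.
def entropy(dist):
    return sum(p * information(p) for p in dist if p > 0)

# A fair coin is maximally uncertain; a biased coin less so.
print(entropy([0.5, 0.5]))    # = log 2
print(entropy([0.9, 0.1]))    # smaller
```

A fully deterministic event (probability 1) has zero entropy, matching the intuition that there is no uncertainty left to measure.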
Now let's get to cross-entropy. The word "cross" means that instead of measuring the probability of an event under a single distribution p(x), we measure the probability of an event x under two different distributions, q(x) and p(x). This should look familiar, because it is exactly the loss function we use for classification. And now binary cross-entropy, which is very close to what we are doing today, because the discriminator is really just a binary classifier: it tries to classify all the real images as one and all the fake images as zero, and that is the training data we feed it. The binary cross-entropy formula expands into a summation, and the reason it works out is that p(x) can only take two values, zero or one: when it is zero the first term cancels and the second term stays, and when it is one the second term cancels and the first term stays. Hopefully you can also see why we can write it as an expectation, since an expectation expands into a probability-weighted sum.

Coming back to the discriminator: it wants to minimize the binary cross-entropy loss with respect to real and fake images, so that it correctly classifies real images as real and fake images as fake. The first term corresponds to what it believes is true data and the second to fake data. Note the bar over p_data in the second term: it denotes the complement of the distribution, everything outside the real data, which is fake. And we can rewrite this in terms of true and generated images, because x in the second term can be replaced by G(z), where z is the latent vector we sample.
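The cancellation just described can be checked directly. This is a minimal sketch; the small eps guard is only there to avoid log(0) and is not part of the formula itself.

```python
import math

# Binary cross-entropy between a label p in {0, 1} and a prediction q:
# BCE = -[p log q + (1 - p) log(1 - q)].
# When p = 1 the second term vanishes; when p = 0 the first term vanishes,
# which is exactly the cancellation described above.

def bce(p, q):
    eps = 1e-12                      # numerical guard against log(0)
    return -(p * math.log(q + eps) + (1 - p) * math.log(1 - q + eps))

# Only the first term survives for a "real" label:
assert abs(bce(1, 0.8) - (-math.log(0.8 + 1e-12))) < 1e-9
# Only the second term survives for a "fake" label:
assert abs(bce(0, 0.8) - (-math.log(0.2 + 1e-12))) < 1e-9
print(bce(1, 0.8), bce(0, 0.8))
```

The discriminator's loss is just this quantity averaged over a batch of real images (label 1) and generated images (label 0).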
This is important going forward: we will refer to p_data as P_X, because capital X is our data distribution. If we drop the negative sign, then instead of minimizing the binary cross-entropy the discriminator achieves the same thing by maximizing the same term without the negative sign. But the generator wants to fool the discriminator: it wants to ensure the discriminator cannot be a perfect classifier, so it minimizes the very term the discriminator is trying to maximize. With that, we arrive at the GAN loss: a rivalry between minimizing and maximizing the same expression. One part of it is colored red because the generator only affects the second term; the discriminator tries to maximize it while the generator tries to minimize it, and that is where the adversarialness comes in. This is the formula we saw earlier.

Something to notice is why we can write the second expectation as x sampled from p_g, and this becomes even more important later: a generative adversarial network is what is known as an implicit model. Before I go on, does anybody know the difference between an explicit model and an implicit model of a probability distribution? ... Right, exactly: an explicit model knows exactly what probability distribution it is sampling from, and if you give it a data point it can tell you where that point sits in the distribution, that is, how probable that data point x is. An implicit model cannot do that.
An implicit model has, as the name correctly suggests, only an implicit understanding of the distribution the data comes from, but it can still achieve the same end of generating data. Of course, if a model explicitly has the output distribution you can sample from it to get a data point, but you can do the same with an implicit model. Being implicit makes the GAN formulation what is called likelihood-free, and that is part of what keeps the math simple. As mentioned before, the objective of the discriminator is to output D(x) = 1 and D(G(z)) = 0, and the objective of the generator is to make sure the discriminator cannot output zero for generated images.

This is what we have seen before: say we have a setup where the red distribution is the fake data and the blue distribution is the real data. The generator tries to generate more and more images on the real side of the decision boundary, and the more it generates points there, the more its peak shifts that way, peaking where it generates the most images. Then the discriminator needs to do a better job and pushes its decision boundary over here, so the points that were falsely classified as real now fall on the fake side. This is shown across a series of slides: we shift our distribution towards the real side, the discriminator gets better, the generator gets better still, and the discriminator has to follow. There is a loop: the discriminator learns a perfect boundary, the generator learns to push its distribution past that boundary, and the discriminator must then learn a new
boundary, and this interplay between them keeps going on.

Now we come to the most important math part of this whole lecture. I will try to make this as interactive as possible, so I may ask you questions as we go, to make sure the math makes sense to everybody; bear with me here. A note for anybody watching later: please watch this part on the Zoom recording, where you can control how large each part of the screen is.

What we have is a minimizing and a maximizing objective, and it looks like E_{x ~ P_X}[log D(x)] + E_{z ~ p_z}[log(1 - D(G(z)))] (hopefully the number of brackets is correct here). First question: how do we open up an expectation term? Correct, it is a sum; and what if the variable is continuous? An integral, right. So I can write the first term as an integral over all x of P_X(x) log D(x), and similarly the second term as an integral over all z of p_z(z) log(1 - D(G(z))). Easy, right? And this is where the implicit nature of GANs really shines, because we cannot merge these two integrals yet: they integrate over different variables with different distributions. What we can do is use the implicit model that the GAN is: we simply assign a probability p_g(x) to a data point x.
Then the term after the plus sign becomes an integral over all x of p_g(x) log(1 - D(x)). Any questions so far? Awesome. This means we can merge the two integrals into one. We might run out of board space, and hopefully I make no math mistakes, but the whole objective now looks like the integral over x of P_X(x) log D(x) + p_g(x) log(1 - D(x)). That is the loss function, so let's find the optimum for the discriminator. How do we optimize an equation? We find the minimum or maximum by taking the derivative and setting it to zero, so we differentiate the integrand with respect to the discriminator output D(x). What is the derivative of the first term with respect to D(x)? It is P_X(x) / D(x), and the second term picks up a negative sign from inside the log, giving -p_g(x) / (1 - D(x)). With me so far? We set this to zero: if we set the integrand to zero pointwise, the integral of the whole thing is also zero. Setting P_X(x)/D(x) - p_g(x)/(1 - D(x)) = 0 and rearranging, the optimal discriminator, which I will call D*(x), comes out to be

D*(x) = P_X(x) / (P_X(x) + p_g(x)).

Are we good so far? Any questions on Zoom?
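As a sanity check of this closed form, we can verify numerically that D*(x) maximizes the pointwise integrand; the density values at the single point x below are made up for illustration.

```python
import math

# Numerical check (toy densities, made up) that
# D*(x) = P_X(x) / (P_X(x) + p_g(x)) maximizes the pointwise objective
# f(D) = P_X(x) log D + p_g(x) log(1 - D).

p_real, p_fake = 0.7, 0.2            # hypothetical densities at one point x

def objective(d):
    return p_real * math.log(d) + p_fake * math.log(1.0 - d)

d_star = p_real / (p_real + p_fake)  # the closed-form optimum

# The closed-form optimum beats every other candidate D on a grid in (0, 1):
candidates = [i / 100 for i in range(1, 100)]
assert all(objective(d) <= objective(d_star) + 1e-12 for d in candidates)
print(d_star)
```

The objective is concave in D on (0, 1), so the point where the derivative vanishes is indeed the global maximum, which is what the grid check confirms.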
This is what is on the slide as the optimal Bayes classifier: in the absence of any other information, it assigns the probability that a point is real as the probability of the point under the real distribution divided by the sum of its probabilities under the real and generator distributions. So now we have the optimal classifier. This slide mirrors the blackboard; the math is at the end of the slide deck, and it may be a little illegible since it is my handwriting, in which case we will fix it. Let's move on.

Once we have the optimal discriminator, we can try to build a generator that beats it: this is the best a discriminator can possibly do, and we beat it by making a better generator. So take the optimal discriminator, D*(x) = P_X(x) / (P_X(x) + p_g(x)), and plug it into the objective that we now minimize with respect to G:

min_G E_{x ~ P_X}[log D(x)] + E_{x ~ p_g}[log(1 - D(x))].

Ignoring the min for a moment and substituting, since 1 - D*(x) = p_g(x) / (P_X(x) + p_g(x)), the expression becomes

E_{x ~ P_X}[log(P_X(x) / (P_X(x) + p_g(x)))] + E_{x ~ p_g}[log(p_g(x) / (P_X(x) + p_g(x)))].

Any doubts so far? The few of you here are lucky enough to be able to ask in person, but does this math make sense? All we have done is plug the optimal discriminator into the equation we are minimizing for the generator.
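To see where this derivation is heading, here is a numerical check on toy discrete distributions (the numbers are made up) that the plugged-in objective equals 2 D_JS(P_X || p_g) - log 4, where D_JS is the Jensen-Shannon divergence derived in the next steps.

```python
import math

# Check on toy discrete distributions (made up) that the generator objective
# with the optimal discriminator equals 2 * D_JS(P_X || p_g) - log 4.

p = [0.6, 0.3, 0.1]        # stand-in for P_X
q = [0.2, 0.3, 0.5]        # stand-in for p_g

def kl(a, b):
    """KL divergence D_KL(a || b) = sum_x a(x) log(a(x) / b(x))."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

def js(a, b):
    """Jensen-Shannon divergence: average of two KLs against the mixture."""
    m = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    return 0.5 * kl(a, m) + 0.5 * kl(b, m)

# The objective with D* plugged in:
# sum_x p(x) log(p / (p + q)) + q(x) log(q / (p + q))
obj = sum(pi * math.log(pi / (pi + qi)) + qi * math.log(qi / (pi + qi))
          for pi, qi in zip(p, q))

assert abs(js(p, q) - js(q, p)) < 1e-12          # JSD is symmetric
assert abs(obj - (2 * js(p, q) - math.log(4))) < 1e-9
print(js(p, q), obj)
```

The identity holds because each log(a / m) term splits into log 2 + log(a / (a + b)), and the two log 2 pieces sum to exactly the log 4 that gets added and subtracted below.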
Does that make sense? Okay, cool. What we do next will look kind of silly, but it works out really well: we add a term log 4 and then subtract the same term log 4. You can write log 4 as 2 log 2, since 4 = 2 squared and the exponent comes out of the logarithm; so it is log 2 twice, which we can write as log 2 + log 2. Still silly, but wait for it to make sense. We can take each of these log 2 terms and absorb one into each expectation, because log a + log b = log(ab); the numerators become 2 P_X(x) and 2 p_g(x), the added log 4 goes away, and we are left with the subtracted -log 4. Still silly, admittedly, but look what happens now: if we take each factor of 2 and put it under the denominator, the first term now resembles the KL divergence between P_X and the average of the two distributions, D_KL(P_X || (P_X + p_g)/2), and the second term similarly becomes D_KL(p_g || (P_X + p_g)/2), with the -log 4 left over. Any questions about how we went from there to here? The math is on the slides at the end, where you will see that the KL divergence between p and q is E_p[log(p/q)]. Now one more silly thing: we take these two KL divergence terms and multiply them by 2/2. This does nothing to the equation, but it lets us take the denominator 2 and put it under both KL divergences, so the expression now looks like the average of two KL divergence terms between P_X, p_g, and the average of the two. Now if
you look at this square-bracketed term alone, do you think this function is symmetric? What I mean is, if I swap P_X and p_g, will it make any difference? No, and that is precisely the point: it is a symmetric variant of the KL divergence. This is what is known as the Jensen-Shannon divergence, denoted D_JS, which for two distributions p and q is

D_JS(p || q) = [ D_KL(p || (p + q)/2) + D_KL(q || (p + q)/2) ] / 2.

I know it is a bunch of math, but does this make sense? We still have this odd little log 4 term hanging around, and the reason it is on the slide is that we used the added log 4 to obtain the Jensen-Shannon divergence. So, erasing the scratch work, the objective for our generator ends up being to minimize 2 D_JS(P_X || p_g) - log 4. In other words, with the optimal discriminator, what the generator is trying to do is minimize the Jensen-Shannon divergence between what it generates and the real data.

The Jensen-Shannon divergence has some pretty nice properties. From the formula we just saw, it is the average of two KL divergence terms, one between p and the average of p and q, and one between q and that same average, which obviously makes it a symmetric function. It also has the nice property that it does not explode or exaggerate depending on which of p and q drives a probability to zero, and it gives us fewer log-zero terms. One simple way to make the KL divergence symmetric would be to use D_KL(p || q) + D_KL(q || p), which is symmetric as well, but that would have a bunch of log-zero terms, because there will be a lot of data points
And we don't know how to handle those log-zero terms. The only case where we'll get a log zero here is when both p and q assign a point zero probability, and in that case we multiply it by zero as well, because in the expectation there's a p_x or p_g factor in front — so we don't really care. Okay, that was a bunch of math. Any questions on all of this math? Because after this it becomes conceptual again. Any questions on Zoom? Awesome — either people are super confused and don't know what to ask, or everything is crystal clear; I hope it's the latter.

"Yeah, I was wondering how you brought the log 4 inside the expectation?" Oh yeah, I can show you that pretty simply. The expectation term can be written as a large sum, correct? And we all understand that log of some x plus log of some y equals log of xy — so all we've done is add that log term inside the expectation. This also comes from the linearity of expectation: if you add some constant a to f(x), the expectation of f(x) + a can be written as the expectation of f(x) plus a. What this means is you're essentially shifting the mean of all the values of f(x) by a — the expectation of anything is sort of the mean of that thing — and shifting the mean of a whole distribution by some constant a is the same as finding the mean of the distribution shifted by a. Say we have a bunch of points with a centroid somewhere in the middle; if you shift that centroid by some distance a, the centroid lies over here and all the points now lie around it. If instead we moved all the points by a and took the mean, we would end up at the same point. That's where the linearity of expectation comes in — it's a pretty foundational result in statistics; I didn't expect you guys to know it, so thank you for asking. But essentially that is how we moved the log term: it's a constant, it doesn't depend on x, so it wouldn't change as we take the expectation over p_x or p_g, and you can just add it inside the expectation. Does that make sense? Awesome, any other questions? Okay, thank you.

Awesome. So, as Abhrajith mentioned, the generator tries to minimize the Jensen–Shannon divergence between p_g and p_x, and once again p_g is the implicit distribution — the implicit probability the generator assigns to a point x. What this means is that there is a stationary point between this minimization and maximization objective. As the discriminator gets better, it stabilizes itself at the Bayes-optimal point; the generator tries to push its distribution toward the real one, the discriminator gets better and pushes back, and so on. But as the discriminator keeps getting better, the Bayes error rate also keeps increasing, because there is more and more overlap between the generated distribution and the real distribution. Eventually, in the limit — and this is what we hope happens — the distributions of the real data and the generated data completely overlap, so the peaks are at the same point, and now the best a discriminator can do is guess at random. Hopefully there's a little bit of chalk left: if we have one distribution here and another completely overlapping distribution here, the best the discriminator can do is sort of put itself in the middle, and then it's a coin toss — is this point real or fake? It can't tell; it has to guess at random. And this random guessing is unfortunately also the case for a random classifier.
If at the beginning all you have is a random classifier that throws around guesses — yeah, this is real, this is fake — it won't really be able to tell the generator anything important, which is why we first need to train the discriminator to be non-random, so that it tells our generator something useful. And of course the stationary point need not be stable, because the generator might overshoot or undershoot; things of that nature can happen, but we won't get into that for now. This is more important: if we don't train the two networks together, then information won't flow well. If the discriminator is not trained well enough — for example, if it's a random classifier — it will just give unhelpful feedback to the generator: if for the same image the discriminator first says real and then says fake, the generator got opposing feedback and knows nothing about what to do. And if the discriminator is too well trained, then there is no useful local feedback coming in either — the generator doesn't know how it can do better. This is what Abhrajith mentioned: the discriminator needs to be in a position where it can still make some mistakes, because in ML, and in most of life, we learn through mistakes — if things are going well, there's nothing for us to change and nothing for us to learn.

Cool, so this is where we get to the training formulation of GANs. We need to train them jointly, and we first need to arrive at a decent enough discriminator. We can do this in batches: in one batch we take a bunch of generator-produced images and a bunch of real images, automatically assign the real images a label of one and the fake images a label of zero, and train the discriminator. Once we have a discriminator, we can generate more images, feed them through the discriminator, and see whether the discriminator correctly classifies them as zero, or whether it has actually been fooled and starts classifying them as one. And of course, what we want at the end is a generative model — we can throw away the discriminator once everything is good.

This sounds okay, but GANs are super hard to train, because there are a bunch of problems we can face. Something like mode collapse can happen — does anybody know what mode collapse means? I used to be a super fan of physics, where there's this thing called wavefunction collapse over a probability distribution: everything is a probability, Schrödinger's equation, and once you observe something, the probability collapses and concentrates at the point of observation. The same thing sort of happens here: if your distribution has a bunch of modes — a bunch of different dimensions needed to describe it — and the GAN can't cover everything, it just drops whatever it can't cover and focuses on what it can. What this means is it can possibly find that one super hyper-realistic image the discriminator can't call fake and keep generating it. I mean, what good is a generator if it always generates the same face? If the website we saw before — thispersondoesnotexist.com — showed you the same picture of the same person every time you reloaded it, what good is that website? It's correct that the person doesn't exist, but we want a bunch of different faces coming out of the GAN. So GANs can produce really, really good images — and we'll see some examples in a bit — but they are super tedious to train, and since Homework 5 is about GANs, hopefully you'll see (or hopefully you won't see) that they're terrible to train, and you'll be able to find ways to actually train them. Yeah, that's just for the 685 people — but hopefully towards the end of this we have enough time to go over the code. All right, awesome, okay, cool. Of course there are a bunch of variants of GANs you must have heard of — maybe CycleGAN, or StarGAN, or the Wasserstein GAN, which sounds like that sauce from Britain that nobody can pronounce — is that the spelling? I don't know, maybe.
Okay, so how do we make sure our GAN is actually doing well? One way is to sort of eyeball the outputs, but that would be weird — the whole reason we have a discriminator is that we don't want to eyeball what the generator does; if we bring somebody else in and then have to watch over them too, we've just duplicated our work and employed somebody else as well, which is a waste of labor. So instead we can computationally figure out how well it is functioning. We won't go into too many details here, but essentially what they do is take a very good image classifier and feed the generated images through it, and the idea is that hopefully the classifier can confidently classify each generated image as one of the categories it knows. Inception v3 is one of those image classifiers that came out at some point, became really big, and topped the leaderboards for a bunch of classification tasks. And here's where entropy comes in again: we want the classifier to not be super uncertain about what it's seeing. If it's trying to classify between a hundred things and its output distribution is just uniform over all hundred, that means what we've given it doesn't look like anything — it doesn't know what to call it. So we want low entropy in its output label distribution, and that's what is called the inception score. The reason it's called the inception score is that it comes from Inception v3; it's just a measure of how well the GAN is doing — an evaluation metric.
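A rough numpy sketch of an inception-score-style computation — the toy class probabilities below are made up for illustration; a real inception score uses softmax outputs from Inception v3 over many generated images:

```python
import numpy as np

def inception_score(probs):
    """probs: (N, C) array, row i is the classifier's p(y|x_i) for one image.
    IS = exp(mean_i KL(p(y|x_i) || p(y))), where p(y) is the marginal."""
    marginal = probs.mean(axis=0)                                   # p(y)
    kl = (probs * (np.log(probs) - np.log(marginal))).sum(axis=1)   # per-image KL
    return float(np.exp(kl.mean()))

# Confident AND diverse predictions -> high score; uniform (uncertain) -> score 1.
confident = np.array([[0.9, 0.05, 0.05], [0.05, 0.9, 0.05], [0.05, 0.05, 0.9]])
uniform = np.full((3, 3), 1 / 3)
print(inception_score(confident) > inception_score(uniform))  # True
```

Note the score rewards both low entropy per image (the classifier is sure) and high entropy of the marginal (the images cover many classes) — which is why it also punishes mode collapse.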
So far we've seen VAEs and GANs as two large families of generative networks. The main key difference is the objective: VAEs minimize the KL divergence between the real data distribution and the model's — equivalently, they maximize the likelihood the model assigns to the real data — whereas GANs minimize the Jensen–Shannon divergence between the generated data and the real data. One of these divergences is symmetric and one is not; one of them is much nicer, much smoother, much easier to work with, and the other sort of isn't. The KL divergence is super convenient — that's why we use it everywhere: once you take a derivative or gradient over it, the self-entropy term goes away and all you're left with is the cross term, and that's where cross-entropy comes in; I'm sure we went over this a bunch of lectures ago, in the initial lectures. Another difference is that VAEs have this encoder which, during training, takes in a bunch of images and tries to find the best latent representation of the distribution; we sample from it and then generate a new image. A GAN is a much simpler formulation: you have somebody trying to generate, and somebody looking over the generation to make sure it's good. However, although VAEs are complex and the math is weird, VAEs are super easy and nice to train — it's just easier to achieve convergence with VAEs, and a stable convergence rather than an unstable one — whereas GANs are very hard to train: there's a lot of noise and it's really hard to optimize the learning process. But with that shortcut of easy learning you get blurry, noisy outputs, and with the labor of a hard optimization task you get much sharper results. Any questions? This is from the original GAN paper by Ian Goodfellow in 2014: the rightmost images in each panel are what the GAN produced, and everything before them comes from other generative networks. Clearly the GAN's images are the crispest — I think it's easiest to see in the top-left and bottom-left panels that the rightmost results are the most clear-cut. And with time, of course, they've gotten even better; the images are super hyper-realistic.
I would believe these people exist — but probably they don't. And you can do something like style transfer as well, with StarGAN: the "style" here essentially represents the emotion shown on a person's face, and you can apply that style to a random image to get the image in that style — so you can make somebody angry, happy, fearful, all these different emotions. It's pretty cool — just more examples of GANs doing really well. And since Abhrajith needs ten minutes, I'm going to take two more minutes to talk about something I find super interesting that we sort of glossed over — not to take you out of context, but how softmax and entropy work together. It will just take two minutes, and you can do an experiment on this yourselves. Softmax — exponentiation of a variable divided by the sum of the exponentials — gives you a probability distribution: it sums to one and has all those probability properties. There's also something you might have heard of called the Boltzmann distribution. It's very popular in physics, but it also becomes useful in computer science, because with this temperature term, tau, you can control how much entropy there is in your softmax distribution. Real quick: a high value of this temperature makes your distribution "hot," and if you remember from high school or middle school physics, warmer gases have more entropy — more energy, everything moving everywhere — while cooler systems have less energy, less entropy, and less movement. That's precisely what happens here: high values of tau make your distribution uniform, meaning the probability is spread equally everywhere, and the more you cool it, the more it focuses on where the distribution has its maximum.
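This is the numpy experiment being suggested — a temperature-scaled softmax over some made-up logits, showing the hot/cold behavior described above:

```python
import numpy as np

def softmax_t(logits, tau):
    """Softmax with Boltzmann-style temperature tau: softmax(logits / tau)."""
    z = np.asarray(logits, dtype=float) / tau
    z -= z.max()                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.1]         # toy logits (assumed values)
cold = softmax_t(logits, 0.1)    # low tau: nearly all mass on the max logit
hot = softmax_t(logits, 100.0)   # high tau: close to uniform
print(cold, hot)
```

Plotting `softmax_t` for a sweep of tau values with matplotlib makes the entropy change easy to see.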
At zero temperature — which would cause numerical issues, but say some extremely small temperature — essentially all the probability mass collapses onto the maximum and everything else goes to zero, and at something like 10,000 or 100,000 it approaches uniformity. You guys can try this out with numpy and matplotlib — it's pretty cool; I just wanted to plug something I find super cool, since we're talking about entropy and information. I think with that we can move to some code that Abhrajith will present. Until then, any questions? "Yeah — good that you talked about the Boltzmann distribution: in the 'Attention Is All You Need' paper, before you take the softmax, you divide by the square root of the key dimension. I was doing that without realizing I was actually using the Boltzmann distribution." Yes — and you'll also be doing it in Homework 4, so that's there too.

So we thought of sharing a bit of code to give this lecture a good conclusion, so you also understand how GANs are implemented — here, for art generation. I went over this with the Homework 5 people last week; it's a kind of cool application. The reference code I followed is the DCGAN reference, and this is the dataset — you can actually join this casual competition and download the Monet art data. Then there are the basic imports, like your homework ones, and some helper functions: we usually normalize the data because it gives faster training, faster convergence, and this unnormalize function, as the name suggests, undoes the normalization so you can get images back into the zero-to-one or zero-to-255 range; show_images just helps you plot images. These are all standard things you already saw in your homeworks. This is basically the dataset class, derived from the test dataset class of Homework 2 Part 2 — I've just renamed it and given it different paths, that's it.
And you have transforms for augmenting your images. I remember telling a lot of people not to use vertical flips for Homework 2 Part 2, because it doesn't make sense to see people's faces inverted — but here we're aiming to generate art, and rotating your head and seeing a painting in a different orientation arguably just gives you a different painting; that's something artists will understand, and they'll just say those are two different art forms. For the normalization, 0.5 is usually used for GANs, but it's just a design choice — you could use something else, like the dataset's actual mean and standard deviation, which are fairly similar anyway. Plotting some images from the training dataset, you can see these actually look pretty good as art. I used an image size of 64 because it helps with faster training. Moving on to the GAN model: this is just a normal weight-initialization function, and then you have a generator. As mentioned earlier, the generator takes in a latent variable from some distribution p(z), and when it's passed through the generator, out comes data. The latent vector here has dimension 100, and after it passes through the generator you get an image of size 3×64×64 — that's the final output of the generator. Then you define the generator; this is the summary — just standard stuff. The discriminator is basically a binary classifier: it takes in a 3×64×64 image and at the end produces one output. We use a sigmoid because we'll be using binary cross-entropy, and having a value between 0 and 1 helps with binary cross-entropy; that one output is the class information, 0 or 1. And there's a wrapper class to put the generator and the discriminator inside one whole network — it's not even an nn.Module class; I just wrote it because it makes the structure similar to the Homework 5 GAN model, no other reason. If you follow the PyTorch reference, they use two separate networks.
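The shapes described above — a 100-dim latent vector in, a 3×64×64 image out, and a discriminator squashing that image to a single sigmoid score — can be sketched like this. The layer widths are my own assumptions loosely following the DCGAN reference, not the exact notebook code:

```python
import torch
import torch.nn as nn

nz = 100  # latent dimension, as in the lecture

# Generator: 100x1x1 latent -> 3x64x64 image (sizes double each ConvTranspose).
G = nn.Sequential(
    nn.ConvTranspose2d(nz, 256, 4, 1, 0), nn.BatchNorm2d(256), nn.ReLU(),   # 4x4
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),  # 8x8
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),    # 16x16
    nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.BatchNorm2d(32), nn.ReLU(),     # 32x32
    nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),                          # 64x64
)

# Discriminator: 3x64x64 image -> one sigmoid score (binary classifier).
D = nn.Sequential(
    nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2),     # 32x32
    nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),   # 16x16
    nn.Conv2d(128, 256, 4, 2, 1), nn.LeakyReLU(0.2),  # 8x8
    nn.Conv2d(256, 1, 8), nn.Sigmoid(),               # 1x1 score
)

z = torch.randn(5, nz, 1, 1)
img = G(z)
print(img.shape)     # torch.Size([5, 3, 64, 64])
print(D(img).shape)  # torch.Size([5, 1, 1, 1])
```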
Moving on to the training configuration: as we've mentioned quite a few times, we'll be using the binary cross-entropy loss on the discriminator, so the discriminator learns to discriminate between fake and real data. The fixed noise vector is just for evaluation: you keep the same noise vector throughout training and watch what images it produces — you'll understand why in a bit. The real label is 1, the fake label is 0, and the generator and discriminator each have their own optimizer, as you can see: the generator's parameters are passed to one, and the discriminator's parameters to the other.

The first step in training a GAN is to train the discriminator, as we've said many times. You can either zero_grad the optimizer or, as some people do, zero_grad the model — either works; it just makes sure all the gradients inside the model are zeroed out. You get the real data — assume this batch comes from the dataloader — and, as Abu mentioned in the earlier half, the labels for real data are 1, so you fill a torch tensor with ones, pass the real data through the discriminator, and compute the binary cross-entropy loss with label 1. The next step is to train with fake data: you sample a random latent vector — of shape 100×1×1 in this example — and get the fake data by passing this noise vector through the generator. The labels for this step are the fake labels, i.e. 0, meaning the discriminator should predict 0 for data coming from the generator. This is trained with the same binary cross-entropy loss, and the error is backpropagated. So this is the first stage of GAN training: train the discriminator first — training on real data maximizes the first term of the objective we showed earlier, and training on fake data maximizes the second term.

Then, moving on to the generator: you have the fake images generated in the previous step, and your real data from the dataloader; again you zero-grad everything, and here you pass the fake data through the discriminator — but since you're training the generator in this step, the labels should be 1, because in this step we want the generator's output to be predicted as 1 (real) by the discriminator. That's how the two networks compete with each other: in the generator's training step, the fake data gets a label of 1. You compute the loss, backpropagate, and so on — these are pretty common steps. Then there's a training loop I wrote just to monitor how it goes, and this code runs straight out of the box — you can run it from the top without modifying anything, just so you understand how GANs work. In the first epoch, you just get random noise — for some people that might also be art, who knows, it's subjective, but for me at least it's noise. As the epochs progress, you can see the generated images start to look close to Monet's art, and after training for 500 or more epochs you start to get pretty decent images, similar to the ones I showed earlier.
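The alternating updates just described can be sketched end-to-end on a toy problem. This is not the notebook's DCGAN — the tiny linear networks, 8-dim "images," and batch sizes below are made up so the sketch runs in seconds — but the two training steps mirror the real loop exactly:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
G = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 8))
D = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(32, 8) + 3.0  # stand-in for a batch of real data

for step in range(100):
    # --- discriminator step: real -> label 1, fake -> label 0 ---
    opt_d.zero_grad()
    fake = G(torch.randn(32, 4)).detach()   # detach: don't update G here
    loss_d = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    loss_d.backward()
    opt_d.step()

    # --- generator step: fake -> label 1, trying to fool the discriminator ---
    opt_g.zero_grad()
    fake = G(torch.randn(32, 4))
    loss_g = bce(D(fake), torch.ones(32, 1))
    loss_g.backward()
    opt_g.step()
```

The `detach()` in the discriminator step is the one subtlety: it stops the discriminator's loss from flowing gradients back into the generator.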
And after 500 epochs, let's see how it goes — you can see these images are pretty good; they really do look like Monet's art, and that's the whole goal. But as Abu mentioned earlier, it takes a lot of time to converge. This run didn't take that long because the images are downsampled, but people working on Homework 5 will find they have to train for something like 2,000 epochs, with convergence only happening after around 600 — so training GANs will be a pain. Cool, that's it — that's the implementation side of GANs. Thank you all for coming. Oh — I just wanted to ask for a little feedback: are these slides legible? If they are, I'll leave them be and keep sort of my own artwork on the slides; if not, I can type them out and make them more professional. This one is colorful, though, so maybe that works — if it's readable, I'll just leave it be. Awesome, okay, cool. Thank you all for coming; it's snowing outside — hopefully it's stopped by now. Have a great rest of the day, the weekend, the Thanksgiving weekend, Christmas and New Year and whatever comes after, and Homework 4 — and Homework 5 for the 685 people — and the project. Yeah.