11-785, Fall 22 Lecture 20: Representation Learning
All right, good morning everybody, can you hear me? Morning, we can hear you. Okay. There seems to be some strange sound, but anyway, this is going to be the first of a few lectures that will be on Zoom. Today we're going to be talking about what neural networks learn. What we've seen so far is that neural networks are universal approximators. They can model any Boolean, categorical, or real-valued function. They can check static inputs for patterns, they can scan for patterns with MLPs and CNNs, and they can analyze time series for patterns; those were recurrent networks. In each case, they must be trained to make their predictions. But when we train them to make their predictions, what do they learn internally? Can you mute yourself, please? So here's the learning problem for neural networks. You're given a collection of input-output pairs: the inputs X and the corresponding desired target outputs. We have to learn the network parameters so that the network captures this desired function. When you're trying to train a network to perform classification, the network must learn the classification boundaries that separate the training instances from the two classes. For example, if we were given this set of training data with the red and the green dots, where color represents the labels, the task would be to learn a network that, based on position, could figure out which of the two classes a data point belongs to. That network must learn this double-pentagon decision boundary. So it must learn a function of the kind shown to the right, which takes the value, say, one within the double-pentagon region and zero outside. Now in reality, the kind of training data we get are not going to be so cleanly separated. You're not going to have clearly red regions and clearly blue regions; it's going to be somewhat noisy.
So you're going to have some blue dots on the red side and some red dots on the blue side. So what function do we learn then? To understand this, let's consider a trivial example of a network which has only one neuron, and specifically consider this two-dimensional example. If you have a single neuron, we know that these neurons learn linear decision boundaries. We would want to learn a step function of this kind, which can distinguish between the regions where the red dots lie and the regions where the blue dots lie—except the training data you're given are going to have some red dots on the blue side and some blue dots on the red side. So you'd have some red dots suspended over the blue region over here, and blue dots lying on the floor on the red side, and from these data—which are not noisily labeled, but are not cleanly separated—you must learn this function. To understand this even better, let's go down to an even simpler, one-dimensional example. Here we have data from two classes: the blue dots representing one class, the red dots representing the other. For the red dots the class label is one, for the blue dots the class label is zero. Now in this one-dimensional case, a linear classifier is a threshold function which says everything to the left of the threshold belongs to one class and everything to the right belongs to the other. And these two classes clearly overlap, so they are not linearly separable; there's no threshold that will cleanly separate the red dots from the blue dots. Also, as we've seen in an earlier class, the neural network is a universal approximator that can learn an arbitrarily complex function. So it could end up learning a function such as this one, which is something that we don't really want. And there's this other issue.
Even if you assume that the network could learn a function of this kind and is allowed to, what if we have a situation such as this, where you have red and blue dots at the same value of X? You have some training instances at this value of X where the class label is one, and other training instances where the class label is zero. Say you have 90 instances with class label one and 10 instances with class label zero. What is the function we must learn over here? Must the function value be one at this X, because red dominates, or would it make more sense for the function to take the value 0.9, to show what fraction of instances take the label one? Which of these two makes more sense to you? Anyone? The 0.9, right. That's because it gives you more information. You can indeed derive the fact that the majority are one from the 0.9, but it tells you more about the nature of the data. The 0.9 is actually an estimate of the a posteriori probability of class one given this input X: because 90% of the data belong to class one, when I say 0.9 you're saying that a randomly sampled instance with this value of X will belong to class one with probability 0.9. So that's why we like the 0.9. But now what if these instances were not all exactly at this X, but were each off by a tiny amount, say 10^-15 or 10^-17, which is about the resolution you can represent with doubles?
Now, if all of these were slightly off, does it suddenly become more meaningful to find a function that goes up and down at each individual X, or does it make more sense to say that in this range, the function value must be 0.9? Which of these two makes more sense? Anyone? You want to say that in this range, the ideal output should still be 0.9, right? That small perturbation shouldn't really change the output. But then what if this perturbation increases—at what perturbation does 0.9 stop making sense? It's not very clear, right? So let's look at this slightly differently; we've seen this earlier too. At each point, instead of looking at just that value of X, let's take a small window around that point, and plot the average Y value within that window. This average Y value is an approximation of the probability of Y = 1 at that point. Here, at the leftmost point, because all of the training instances belong to class zero, your best guess for the a posteriori probability of one given X is going to be zero. Then, as you slide right from the leftmost point, at some point you begin encountering instances with class label one, and so this average value goes up from zero. As you go further right, the average value keeps increasing until eventually you see only red points, meaning the average Y value is one. So the overall function that you would get is going to be something of this kind, and this is the function that makes sense for us to model. Now, what does this function look like? We've seen this in the previous class as well: it looks like a sigmoid. So if you have an X, the sigmoid has the function value 1 / (1 + e^(-(w0 + w1·x))). And this sigmoid actually represents an estimate of the a posteriori probability of class one given the input X. Do you guys recall this? Is this making sense to you?
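The sliding-window estimate of the posterior described here can be sketched in a few lines of NumPy. The data, the window width, and the class distributions below are all made up for illustration:

```python
import numpy as np

# Hypothetical 1-D data: class 0 clusters to the left, class 1 to the
# right, with overlap in the middle, mirroring the lecture's picture.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-1.0, 1.0, 500), rng.normal(1.0, 1.0, 500)])
y = np.concatenate([np.zeros(500), np.ones(500)])

def windowed_posterior(x0, x, y, width=1.0):
    """Estimate P(y = 1 | x = x0) as the mean label inside a window around x0."""
    mask = np.abs(x - x0) <= width / 2
    return y[mask].mean() if mask.any() else float("nan")

# Far left: mostly class 0, so the estimate is near 0.
# Far right: mostly class 1, so the estimate is near 1.
assert windowed_posterior(-3.0, x, y) < 0.5 < windowed_posterior(3.0, x, y)
```

Sweeping `x0` from left to right traces out exactly the smooth zero-to-one curve the lecture describes.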
Right, okay. When I ask you a question, guys, please raise your hand so that I can make sure that you're tracking me. Now let's see how the sigmoid works. Let's assume that w0 and w1 are positive. If X is a large negative value, then w1·x is also going to be a large negative value, and so e^(-(w0 + w1·x)) is e raised to a large positive value, which goes to infinity. So for large negative values of X, this is going to be one over infinity, which is zero; that's why the curve has value zero at large negative values of X. For large positive values of X, you're going to have e raised to minus a large positive value, which is like e^(-infinity) in the limit, which is zero, and so the function is going to be 1/(1 + 0), which is one. So for large positive values of X, this function takes the value one. And it swings smoothly from zero to one as you go from large negative values to large positive values; that gives it this nice characteristic shape. This shape is representative of the fact that when you have one-dimensional data of this kind, on one side one class dominates, on the other side the other class dominates, and as you go from one end to the other, the fraction of data from the other class slowly increases until it dominates. That's why you get this kind of curve. Now, a logistic perceptron, which is a perceptron with a sigmoid activation, is basically just this sigmoid function. If it had just a single input, then this perceptron computes 1 / (1 + e^(-(w0 + w1·x))), and in fact this computes the a posteriori probability of class one given the input.
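The limiting behavior of the sigmoid walked through above can be verified numerically; a minimal sketch, with arbitrary example weights:

```python
import math

def sigmoid(x, w0=0.0, w1=1.0):
    """Logistic perceptron with one input: 1 / (1 + e^-(w0 + w1*x))."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

# Large negative x: the exponent becomes e^(large positive), so the
# denominator blows up and the output approaches 0.
assert sigmoid(-50.0) < 1e-6

# Large positive x: the exponent becomes e^(large negative) ~ 0, so the
# output approaches 1 / (1 + 0) = 1.
assert sigmoid(50.0) > 1.0 - 1e-6

# The smooth transition passes through 0.5 where w0 + w1*x = 0.
assert abs(sigmoid(0.0) - 0.5) < 1e-9
```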
Now, even if you had multi-dimensional data, like this two-dimensional example, where the data are not separable—you have blue dots on the red side and red dots on the blue side—there is still a boundary. As you go towards the boundary, and then continue along in the same direction and cross the boundary, initially you're going to see all blue dots, then increasing numbers of red dots, and eventually all red dots. So even there, you're going to end up with an a posteriori probability function of this kind, which looks like a sheet that has been folded into the sigmoidal shape. And this function, too, is just your standard sigmoid: it's 1 / (1 + e^(-(Σᵢ wᵢxᵢ + w0))), the sum over all components plus the bias. This function has exactly the same a posteriori probability interpretation that we had in the one-dimensional case. Although this function is nonlinear, it represents a linear classifier. Why does it represent a linear classifier? Can anyone tell me? The decision boundary is linear. If I want to make a decision, the way you would do it is to say that any instance for which this posterior probability exceeds a threshold, say 0.5, is class one, and the rest are class zero. To find the boundary, you would equate this expression to 0.5, or two times this expression to one. If you solve it out, you're going to find that the equation it gives you is Σᵢ wᵢxᵢ + w0 = constant, which is the equation for a line or a hyperplane. So the boundary between class zero and class one is a hyperplane; it's linear. And so, although this perceptron captures this sigmoidal shape, it actually captures a linear decision boundary: it's a linear classifier.
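The claim that thresholding the sigmoid at 0.5 yields a linear boundary can be checked directly: the nonlinear posterior and the linear score always agree as classifiers. A sketch with made-up weights and random inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))           # random 2-D inputs
w, w0 = np.array([1.5, -2.0]), 0.3       # made-up weights and bias

# A posteriori probability of class 1: the multi-dimensional sigmoid.
p = 1.0 / (1.0 + np.exp(-(X @ w + w0)))

# Thresholding the (nonlinear) sigmoid at 0.5 makes exactly the same
# decision as thresholding the linear score at 0: the boundary
# w . x + w0 = 0 is a line (a hyperplane in higher dimensions).
assert np.array_equal(p > 0.5, X @ w + w0 > 0)
```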
How would we estimate a function of this kind? When I began drawing this function earlier, it would have been natural for you to ask: what is the width of the yellow oval? We're not actually going to assign a specific width to the oval. Instead, we're going to let the data decide. The way you do it is: you would be given the training data, many (X, Y) pairs, which are represented here by the dots. From these, we want to estimate w0 and w1. I'm going to be using this one-dimensional example over here, but if the data are multi-dimensional, then w1 is going to be a vector and X would be a vector; the rest would not change, so the math is going to be exactly the same. Now, if you want to estimate this model, how would you do it? This curve gives you the probability of Y being equal to one given X. Note that I have flipped the colors over here: now Y = 1 is being shown by blue. So this curve shows the probability that instances with that particular value of X belong to class one. Correspondingly, consider the a posteriori probability of Y being minus one—here I'm going to use a plus-one/minus-one notation for convenience, so the classes are plus one for the blue dots and minus one for the red dots. The probability of Y belonging to class minus one given X is going to be 1 − P(Y = 1 | X), and if you work it out, that's simply 1 / (1 + e^(+(w0 + w1·x))). So the difference between the a posteriori probability for class one and the probability for class minus one is merely the sign of the exponent. We can therefore write both of these in combined form and say P(Y | X) = 1 / (1 + e^(−y·(w0 + w1·x))). You can see immediately that when Y is one, this gives you the formula to the left, and when Y is minus one, minus times minus one becomes plus, so this gives you the formula to the right.
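The combined ±1 form of the posterior can be sanity-checked in code: the two class posteriors differ only in the sign of the exponent and must sum to one. The weights and input below are arbitrary:

```python
import math

def posterior(y, x, w0, w1):
    """P(y | x) = 1 / (1 + e^(-y * (w0 + w1*x))) for labels y in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(-y * (w0 + w1 * x)))

w0, w1, x = 0.5, 2.0, 1.3     # made-up parameters and input

# Flipping y from +1 to -1 flips only the sign of the exponent,
# and the two posteriors are complementary probabilities.
p_plus = posterior(+1, x, w0, w1)
p_minus = posterior(-1, x, w0, w1)
assert abs(p_plus + p_minus - 1.0) < 1e-12
```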
So this formula, P(Y | X) = 1 / (1 + e^(−y·(w0 + w1·x))), captures both of these curves. Is the math here making sense to you guys? It's not complex; just confirm it. Okay, I have a few hands raised; I'll assume that those few hands represent the class. But please do respond, because otherwise I don't know whether you're following. Now I want to learn the model. What we would be given is a collection of training instances, (X, Y) pairs. Each instance is going to have an X value—this is just a one-dimensional illustration; in general it's going to be a vector—and the class value, which would be either one or minus one. Now if I'm given this collection of training instances, then assuming all the training instances are independent, the probability that a random draw would give you this specific collection of training instances is going to be the product over all of the instances of the joint probability P(Xᵢ, Yᵢ). Using Bayes' rule, I can write this as the product over all instances of P(Xᵢ) times P(Yᵢ | Xᵢ). And our model for this posterior probability is given by this expression over here: P(Y | X) = 1 / (1 + e^(−y·(w0 + w1·x))). So I can write the total probability of all of my training data as the product over all training instances of P(Xᵢ) times 1 / (1 + e^(−yᵢ·(w0 + w1ᵀxᵢ)))—this w1 is a vector if you use the vector notation. So basically, when I'm given a collection of training instances, this term in red is, according to our model, the joint probability of all of the data.
And our model parameters are w0 and w1 over here. So the joint probability of all of our training data is the product over all instances of P(Xᵢ) times the sigmoid function computed at that (Xᵢ, Yᵢ), and I can separate these two factors out. The log of the probability of the training data is then simply the summation over all training instances of log P(Xᵢ), plus the summation over all training instances of the log of the sigmoid function computed at that (Xᵢ, Yᵢ). Now, if I were to see which of these terms actually depends on the network parameters, it's only this term highlighted in blue. There's a common estimation framework for estimating the parameters of statistical models—a standard framework for learning a parametric model for the probability distribution of data. The manner in which we do it is that we assign a probability distribution which has some parameters, and we try to estimate these parameters such that the probability assigned to the data by this model is maximized; we'll revisit this topic next week. That is called the maximum likelihood estimation procedure. Using maximum likelihood estimation, our best guess for w0 and w1 is simply the (w0, w1) that maximizes the log likelihood—the log probability—of the training data. That is going to be the argmax over (w0, w1) of this term in blue, because the first term doesn't depend on the parameters of the model; or alternately, this is going to be the argmin over the parameters of minus the summation over all training instances of the log of the probability assigned to that instance by the model. And this term over here, you will recognize, is simply −log P(Yᵢ | Xᵢ), and that is identical to the Kullback-Leibler divergence between the desired output Y, represented in one-hot format, and the actual output given by the model.
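The identity between the per-instance −log P(yᵢ | xᵢ) term and the KL divergence to a one-hot target means the two training objectives are literally the same number; a small sketch with made-up data and parameters:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neg_log_likelihood(w0, w1, data):
    """-log P(data) under the logistic model, for labels in {+1, -1}."""
    return -sum(math.log(sigmoid(y * (w0 + w1 * x))) for x, y in data)

def kl_to_one_hot(w0, w1, data):
    """Sum of KL(target || output) with one-hot targets; for a one-hot
    target this reduces to -log of the probability of the true class."""
    return sum(-math.log(sigmoid(y * (w0 + w1 * x))) for x, y in data)

# Made-up (x, y) training pairs with y in {+1, -1}.
data = [(-2.0, -1), (-0.5, -1), (0.3, +1), (1.7, +1)]

# The two objectives are identical, so minimizing the KL divergence
# is maximum-likelihood estimation of (w0, w1).
assert abs(neg_log_likelihood(0.2, 1.0, data) - kl_to_one_hot(0.2, 1.0, data)) < 1e-12
```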
And so what we find is that when we try to train this model to minimize the KL divergence between the target output and the output of the network, this is exactly the same as maximizing the log likelihood. In fact, training the model to minimize the KL divergence is the same as maximum likelihood learning of a logistic function. When we train our network to minimize the KL divergence between the output and the target output, we are in fact performing maximum likelihood learning of a parametric model for the distribution of the data. Is this making sense? Now, this was for a linear classifier, where we have data that we are trying to linearly separate. What happens when you have data of this kind, where the decision boundaries are not linear? A very analogous situation still holds. For the moment, let's first consider the case where the classes are separable. In this example we are trying to separate the red and blue classes, and they can indeed be separated by this double-pentagon decision boundary. So the network must learn to output a one within these pentagons, and a zero or a minus one, depending on how you set it up, outside. Assume that you have a sufficient network; we've seen in our first and second lectures that for this double-pentagon decision boundary, a network of this kind would suffice. You have this subnet with five first-layer neurons and a second-layer neuron over here which captures one pentagon, a second subnet which captures the second pentagon, and then this final neuron over here which ORs over these two. But we know that our perceptrons are linear classifiers, right? So suppose this final perceptron is a linear classifier and it's doing a perfect job of separating the red and the green classes.
Then if you look at what the final perceptron itself sees, what can you say about the values y1 and y2 that are being fed to it? What must their characteristic be? Can anyone tell me? This perceptron is a linear classifier, and assume this network has been trained so that it perfectly separates the red and the blue classes. This final perceptron gets as its inputs these two values; let me call them y1 and y2. If I were to plot the scatter of y1 and y2 for all of my training instances, what must it look like for this perceptron to be able to cleanly separate the red and the blue classes? They must be linearly separable, right? Because this final unit is a linear classifier: if it's able to separate red and blue data in this (y1, y2) space, then in the (y1, y2) space the red and blue data must be linearly separable. In other words, for this complex network, the output of the penultimate layer—the second-to-last layer—must comprise linearly separable data from the two classes. Is that making sense? So the network in fact consists of two parts. The first is this linear classifier, which is the final classification layer. The rest of the network starts off with data that have this ugly distribution, and somehow manipulates and transforms them so that you get a modified representation in which the classes are linearly separable. So the network actually has two parts: the first is this portion which takes your data from the various classes and rearranges them so that they are linearly separable, and the last is the final output layer, which actually performs a linear classification task. Is this making sense? Perfect. Now observe that this is true of any sufficient structure.
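This two-part view—a feature extractor feeding a final linear classifier—can be illustrated with the classic XOR problem, using a tiny hand-built network. The weights below are hand-picked for illustration, not learned:

```python
import numpy as np

# XOR: the canonical example of classes that are not linearly separable
# in the input space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
labels = np.array([0, 1, 1, 0])

# Hand-picked hidden layer acting as the "feature extraction" part of
# the network; its ReLU outputs are the penultimate-layer features.
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
H = np.maximum(0.0, X @ W1 + b1)    # features (y1, y2) fed to the final unit

# In the (y1, y2) feature space the classes ARE linearly separable:
# the linear score y1 - 2*y2 is large exactly for the XOR-true points.
score = H @ np.array([1.0, -2.0])
pred = (score > 0.5).astype(int)
assert (pred == labels).all()
```

The input points cannot be split by any single line, but after the hidden layer a plain linear threshold classifies them perfectly, exactly the structure described for the penultimate layer.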
This network was exactly what we needed for this double-pentagon model, but I could have used over-parametrized, larger networks which also do a perfect job of separating the red and the blue classes. So it's not just the optimal structure: for any sufficient structure, the network consists of two portions—the portion below the final output layer, which converts the data into linearly separable classes, and the final classification layer, which actually performs the job of linear separation. Now, what if for some reason the network below was somewhat lacking, so it didn't have a sufficient number of neurons or a sufficient number of connections? We know from what we've learned before that there's such a thing as sufficiency of architecture: for any problem, the network has to have the capacity to compute the specific function that we want it to compute. If the network doesn't have that capacity, then when we train the model, this portion of the network will nonetheless try to transform the data so that they are as linearly separable as possible. It won't fully succeed, because the model is insufficient, but it will still get as close as it possibly can: maybe instead of being perfectly separable, the data now have a little bit of overlap in the boundary region, but it will bring them as close to linearly separable as possible, and then the final output layer is going to try to perform linear separation on these data. So that gives us our first poll. Okay, 10 seconds, guys. Does anybody want to answer the first question? So: the portion of the network until the second-to-last layer is essentially a feature-extraction module that extracts linearly separable features for the classes, and the output layer is a linear classifier that can only perform well if the rest of the network transforms the input space such that the classes are linearly separable.
Both of these are true; the second one is not false, right? Because the output layer is a linear classifier, if the data input to it are not linearly separable, it's going to fail. So for the output layer to do a good job, the rest of the network must transform the input space such that the classes are linearly separable. Does that make sense? Now, in this example over here, we assumed we have data where the two classes can be separated. More generally, you're going to get data of this kind, where you have blue data on the red side and red data on the blue side; it's going to be fuzzy, not so clean. Even here, you're going to have the same situation. When the classes are not separable, it means they're not separable by the specific architecture that you have chosen—unless you have coincident data, where you have data with the same X but from different classes, which is highly unlikely in real life; the X's are never going to be exactly the same, they will always differ slightly. When you have data of that kind, then, because the neural network is a universal approximator, if I had a large enough network, eventually I could always learn a model which perfectly fits every instance. But that's going to be kind of bogus, so you're going to limit the operation of the network—make sure that it doesn't follow every little bump in your data, and that it learns a smoother surface—by limiting the architecture of the network. So when I say the classes are inseparable, it means the classes are not separable using the specific architecture that you have chosen. And so even in this case, what you will find is that the lower portion of the network is going to try to rearrange the data so that they are almost linearly separable, like in this figure to the left.
And then the classification layer on top is going to try to do the best job of computing a linear classifier that separates the red and blue classes with maximum possible accuracy. Is this making sense? Okay, at least to some. So now let's go back to what the output neuron is really doing. When I have a logistic function, what does the output neuron compute? What probability? Yes. It computes the a posteriori probability of the classes—but it actually computes the posterior probability of the classes given the input to the neuron, and the input to the neuron is going to be f(x), where f(x) is the function represented by this gray box, the rest of the network. Again, the network comprises two components: this function shown by the gray box, which I'm calling f(x), which tries to make the data linearly separable, and then the classification layer. So the softmax that you compute is computing the a posteriori probability of the classes given f(x). But being given f(x) is basically the same as being given x, and so this is basically the logistic computed at f(x). In fact, the output neuron computes the a posteriori probability of the classes given the input x, regardless of the fact that in the input space itself the classes are not actually linearly separable and have some ugly separating boundary. So even when the data are not separable and the boundaries are not linear, you still have this situation where the output of the network is in fact the a posteriori probability of the classes; for multi-class networks it's going to be the vector of a posteriori class probabilities. And as we saw earlier, when I have just this one neuron, if I try to train it to minimize the KL divergence:
minimizing the KL divergence is the same as performing maximum likelihood training of this neuron. But the same holds even when I think of the entire network as a single unit: if this output neuron has a softmax or a logistic function, and we are trying to minimize the KL divergence between the actual output of the network and the desired class labels, then what we are actually performing is maximum likelihood training of the entire network. The entire network is now just a parametric model that is intended to capture the a posteriori probability of the classes given the input, and the training process is simply a maximum likelihood algorithm that learns the parameters of this model. So any time we train a neural network, we might think that we are just minimizing a loss, training it to perform classification, but what we are actually doing is learning a statistical estimator for the distribution of the data, and we are learning it using maximum likelihood training. Is that making sense? Guys, any questions? No? Okay. So here's your second poll. Okay, 10 seconds, guys. All right. Is the first statement true—a classification neural network is just a statistical model that computes the a posteriori probabilities of the classes given the inputs? It is true. What about the second statement? Also true: training the network to minimize the KL divergence is the same as maximum likelihood training of the network. What about the third one—training the network by minimizing the KL divergence gives us an ML estimate only when the classes are separable? That is not true, right? And the fourth: it is valid and possibly beneficial to train the network and subsequently replace the final layer by other classifiers. You can imagine that, right? Basically, what is happening is that once you train the network, this portion of the network is actually transforming the input data to become as linearly separable as possible.
But once you've done that, you actually have a function f which takes your data and rearranges it so that the classes are linearly separable, and now, using these new features—which are the y's—you could use any other linear classifier; it doesn't have to be a logistic function. It could very well be, for instance, a support vector machine that you use to perform the classification. So the fourth statement is indeed true. So the story so far: a classification network actually comprises two components—a feature-extraction network that converts the inputs into linearly separable features, or nearly linearly separable features, and a final linear classifier that operates on those features. Using a softmax, the final layer of the network actually computes a posteriori probabilities of classes, and training the network to minimize the KL divergence is identical to maximum likelihood training of the network. But then here's the kicker: regardless of whether you're trying to minimize the KL divergence or some other divergence, like the L2 divergence, the minima can be assumed to be at the same point in parameter space. This really means that regardless of how you train the network, you are effectively performing maximum likelihood training of the network; it's just that some loss functions are going to be a cleaner representation of the likelihood than others. So that's all very fine: we found out what is happening in this Y space. What about the lower layers—how do they respond? Instead of Y, what does this portion of the network compute? These compute features, but what do those features look like? So here is the manifold hypothesis. You had some X in the input space; in the space of X the features were not linearly separable, and in the space of Y the features became linearly separable. So what would you expect?
What is happening to the data as they go through the network, keeping in mind that they start off not linearly separable, arranged in some horrible manner, and finally end up being linearly separable by class? Yeah, you'd sort of expect that as the data go through the network, they become more and more linearly separable, right? Let's look at exactly how this happens; here's a nice little example. Here the network is drawn top to bottom—just a flip in my notation for the purpose of this illustration, to match the figure. Here we have data from two classes where the decision boundary is circular: a bunch of blue dots inside the circle and a bunch of red dots outside. And I'm trying to train this network. The input is in two-dimensional space; the network has one hidden layer with three neurons, the activations for the three neurons are tanh, and then there's a single output neuron. So what happens as the data go through this network? Initially I'm going to look at just this first portion of the network, with the data itself. Over here you have data in the two-dimensional space, and this is the arrangement of my data. Now the first thing that we do is compute an affine transform—remember, when you implemented your network, there's a linear transform followed by an activation. So when I go from here to this hidden layer, the first thing that happens is that I use an affine transform to transform my two-dimensional data, and because I have three hidden neurons, I'm going to transform it from a two-dimensional space to a three-dimensional space. Is that making sense to everyone? So basically, before the activation we are applying an affine transform. And when I apply the affine transform, what happens?
This sheet, which is a two-dimensional sheet, is now going to end up as a two-dimensional sheet suspended in three-dimensional space, like so. Because it’s a linear transform, it’s not going to do anything crazy with it; it’s just going to take the space as such and make it a two-dimensional manifold in three-dimensional space. And so now the arrangement of the data is going to look something like this. Then you apply the activation. The activation is nonlinear, and when the activation is nonlinear it’s going to take this planar surface and warp it and bend it, and when you bend it you’re going to end up with a nonlinear surface; it’s no longer just a plane. Then the output of this neuron, which is the nonlinearly transformed data, is now projected down to one dimension using another affine transform, which means that all of this is going to be zapped down onto an axis, an axis given by the set of weights of this output neuron, and so now the data are going to end up scattered along a line, like so. And then this final neuron is going to apply a threshold on that scatter, and that’s going to give you a decision boundary. So let’s look at what it does. Initially, when I’ve just initialized the network, the first affine transform puts the data in three-dimensional space, then the tanh activation warps this surface to make it look like so. Then the second affine transform zaps it all down to a line, and then this final guy applies some threshold and says everything to the left is blue and everything to the right is red.
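The sequence of operations just described (an affine lift from two to three dimensions, a tanh warp, an affine projection down to a line, and a threshold) can be sketched as follows; the weights here are random stand-ins rather than trained values:

```python
import numpy as np

rng = np.random.default_rng(0)

# 2-D input points (e.g. some inside a circle, some outside)
X = rng.normal(size=(100, 2))

# hypothetical parameters for the 2 -> 3 -> 1 network in the figure
W1, b1 = rng.normal(size=(2, 3)), np.zeros(3)   # affine: lifts the 2-D sheet into 3-D
W2, b2 = rng.normal(size=(3, 1)), np.zeros(1)   # affine: projects back down to a line

H_lin = X @ W1 + b1        # a 2-D manifold suspended in 3-D space (still a plane)
H = np.tanh(H_lin)         # the nonlinearity warps and bends the sheet
y = H @ W2 + b2            # projection of the warped sheet onto a single axis
out = (y > 0).astype(int)  # final threshold gives the decision

print(H.shape, y.shape)    # (100, 3) (100, 1)
```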
So now if I go back and ask what the outcome of this threshold being applied to the data looks like in the original two-dimensional space, you find that it hasn’t actually learned the circle; it’s learned something completely stupid. So is this sequence of pictures making sense to you guys? Kind of, right, but it will make more sense when I actually play this animation. So now this animation shows what happens as you train the network. And observe what happened, this is so beautiful. The training sort of figured out, first, how to position this two-dimensional surface in three-dimensional space; that’s what learning this first layer’s affine transform does. It figured out how to position the two-dimensional data in three-dimensional space such that when I apply the tanh activation, the center of the circle gets stretched out and the points near the boundary go down the other side. It’s no longer a plane; the sheet looks a bit like a cone, with the blues coming out and the reds going down to the side. And it also learned what line it had to project the whole thing down onto, so that when you projected it, all the blues ended up on one side and the reds ended up on the other side, such that when this guy applies a threshold, you end up with a decision boundary that cleanly separates the blues from the reds. It hasn’t learned exactly a circle, but it’s done a pretty decent job. So are you able to see what’s going on when you learn the network over here? Any questions?
This is beautiful, right? You’re sort of repositioning the data in high-dimensional space and then distorting it so that when you project it down, on the projected dimension a single threshold captures the decision boundary of interest. That was for a trivial problem; here’s something for a more complex problem. This is for CIFAR-10, and this is a network with 11 hidden layers. We’ve projected the data down into two dimensions for illustration, and you can see that at the beginning of the network and the training, none of the classes are linearly separable, but as you train, here’s what happens: as you go through the layers, the classes become more and more linearly separable, and in fact by the time you get to this layer, which is not even the final layer, it’s three layers before the final layer, they are already linearly separable. All the last two layers do is sort of increase the separation between the classes, but basically, as you go through the network, the classes become increasingly linearly separable. You can see the same thing in three dimensions, and the same thing happens again: as you go through the layers, the classes become increasingly linearly separable, and so by the time you get to the final layer the classes are separable, and you’re able to fit a very nice linear classifier. In fact, as you train the network, you find that they become linearly separable way before you actually get to the final layer; in this case you didn’t actually need to get to the 9th or the 10th hidden layer.
They become linearly separable by the 7th or the 8th layer. So in fact, when you train a network, if you train the entire network and then throw away the final few layers, attach a linear classifier to the top of whatever remains, and fine-tune, you should still get the same performance, because the classes become linearly separable way before you actually get to the ultimate layer. The key point being that as the data pass through the network, the classes become increasingly linearly separable. Is this making sense? Questions, anyone? Okay, so we get an idea of what the network is doing, what the lower layers of the network are doing. Now let’s change gears a bit. We’ve seen what the network learns; we’ve seen what happens to the data as it goes through the network. But that was the overall picture, what happens to the overall patterns of the data as they go through the network. What about the individual neurons, what do they capture? To understand this, let’s go back to the basic perceptron itself. The basic perceptron was just a function of this kind: assuming a linear threshold activation, you computed a weighted sum of the inputs; if that exceeded the threshold, the output was one, otherwise it was zero. So if you write all of the weights as a vector, then you’re basically computing the inner product between the input vector and the weight vector and comparing it to a threshold. But here’s what this inner product means. First, here’s something surprising: in high-dimensional spaces, almost all vectors are the same length.
So this may shock you, but if I’m looking at something that’s in a hundred dimensions, then when I consider a hundred-dimensional sphere, almost the entire volume of the sphere is going to be very close to the surface, and as you increase the dimensionality of the sphere, more and more of the volume ends up being very close to the surface. And so, as a result, in high-dimensional spaces, if I randomly choose a vector, with very high probability it’s going to be very close to the surface, which means all randomly chosen vectors are going to be approximately the same length. So now, given that, if I assume that all of my vectors are the same length, consider what it means for the inner product between two vectors to exceed a threshold. We know that the inner product between two vectors is the magnitude of the first vector times the magnitude of the second vector times the cosine of the angle between the two vectors. So when I say x transpose w is greater than T, it means cos theta, where theta is the angle between w and x, is greater than some value. So saying that this neuron fires if this inner product is greater than a threshold is the same as saying that this neuron fires if the angle between w and x is less than something. In other words, you can think of it as saying that w represents a template that this perceptron is looking for, and the perceptron is looking at a small cone around the template, a cone of some angle over here, and whenever the input falls within this cone, this neuron is going to fire; otherwise it won’t fire. So is this making sense to you guys? Yes? Any questions? I’m going to build the rest of these next 20 minutes on this, so let me know if this is making sense or not; I can explain again. So, Rohan, you are going to be my representative for the class. Is this first equation making sense? The simple perceptron is going to fire if the inner product between the weights and the input exceeds a threshold.
Which is the same as saying that cos theta must be greater than some value, which is the same as saying theta must be less than some value, because cos theta is maximum when theta is zero. In other words, this perceptron fires if the angle between w and x is less than some threshold. So, in other words, w represents a typical input that the perceptron is searching for. And it doesn’t matter what the norms of w and x are, for any given norm of w, because what we are saying is that all the x’s have more or less the same length: in high-dimensional spaces, for any sphere, the majority, 99.9%, of the volume of the sphere is going to be very close to the surface. So the magnitude of x is going to be pretty much the same for any randomly chosen x. You can see this as you go from, you know, just a circle to a sphere. For a circle, if I take a circular disk and consider a small band around the edge of the disk, there’s not a lot of the area of the circle within that band. Going from a circle to a sphere, a small band near the surface of the sphere captures a much greater fraction of the overall volume than the band near the circumference of a disk. So as you keep increasing the dimensionality of the space, the fraction of the volume that lies close to the surface keeps increasing, and in high-dimensional spaces pretty much all of it lies close to the surface, which means all randomly chosen x’s are going to be the same length. Which means that for a high-dimensional input, the perceptron is simply treating w as a template that it’s searching for: it fires most strongly when x is exactly aligned with w.
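The concentration claim is easy to check numerically: sample points uniformly from a ball in d dimensions and measure what fraction lie within 5% of the surface. A small sketch (the sampling scheme draws a random direction, then a radius with the density that makes the points uniform in volume):

```python
import numpy as np

rng = np.random.default_rng(0)

def norms_in_ball(d, n=10000):
    # uniform samples inside the unit ball: random direction, radius ~ r^(d-1)
    v = rng.normal(size=(n, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # random unit directions
    r = rng.uniform(size=(n, 1)) ** (1.0 / d)       # radii for uniform volume
    return np.linalg.norm(v * r, axis=1)

# fraction of the ball's volume within 5% of the surface, by dimension
frac = {d: float(np.mean(norms_in_ball(d) > 0.95)) for d in (2, 10, 100, 1000)}
print(frac)   # the fraction grows toward 1.0 as the dimension increases
```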
And then as x moves away from w, the firing becomes weaker and weaker. And so if I wanted to build a perceptron that was looking at, say, a grid pattern like this, like, you know, old LED displays in watches, and it was trying to detect whether the input was a two, all you would do is set the weights to be in the pattern of a two. Now if you have an input which is not very similar to a two, the inner product between the two, which is the correlation, is going to be low and it won’t fire, whereas if the input begins looking more like the two, the inner product is going to be larger and the perceptron will fire. So the perceptron is in fact a correlation filter. And now suppose I have something more complex, like this one, where I have a network which looks at these grids and has to decide whether the input is a digit or not. What you would expect is that each of these lower-layer perceptrons is going to end up looking for specific features, meaning the weights of these perceptrons are going to be the actual patterns that each of them is trying to detect, and the perceptron will fire if that pattern is detected. So in trying to decide whether this is a digit or not, one perceptron may capture these horizontal bars on top, another might capture these vertical ones, a third one might be capturing the lower vertical ones, and so on. The second-layer neurons are going to be assembling these things to create individual digits, and the outermost one would fire if any of these guys fires; that’s just one hypothetical possibility. So basically these lowest-layer perceptrons are actually capturing salient features, and they fire if that salient feature is detected in the input.
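The template-matching view can be sketched directly: set a perceptron’s weights to a digit pattern and it acts as a correlation filter, firing for inputs that resemble the pattern. The 5x7 grid, the noise level, and the threshold (90% of the template’s self-correlation) are all made up for illustration:

```python
import numpy as np

# a hypothetical 5x7 binary template for the digit "2" on an LED-style grid
TWO = np.array([[1,1,1,1,1],
                [0,0,0,0,1],
                [0,0,0,0,1],
                [1,1,1,1,1],
                [1,0,0,0,0],
                [1,0,0,0,0],
                [1,1,1,1,1]], dtype=float)

w = TWO.flatten()                 # set the weights to the pattern itself
threshold = 0.9 * (w @ w)         # fire when the correlation with "2" is high

def perceptron_fires(x):
    # the perceptron computes an inner product and compares it to the threshold
    return x.flatten() @ w > threshold

rng = np.random.default_rng(0)
noisy_two = TWO + 0.1 * rng.normal(size=TWO.shape)   # still roughly a "2"
random_input = rng.uniform(size=TWO.shape)           # not a "2"

print(perceptron_fires(noisy_two), perceptron_fires(random_input))
```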
So what this means is that if all I did was look at this perceptron, and if it fired, then I would assume that its pattern was present, so I could just take a grid and fill in that pattern, which is the pattern of weights for this perceptron. So I check each perceptron, check whether it fired, and if it fired, I look at the weight pattern that it has and fill that weight pattern into my grid, and I go through these, perceptron by perceptron, find out which perceptrons fired, and fill in their weight patterns in my grid. Could we expect that this is going to reconstruct most of the input? What do you think, guys? So is the question making sense? First, there’s a question: if we train three different networks on the same data, would the same neurons fire for the same input? No, there’s no guarantee; this is training, we have no idea exactly what they learn. We can only make the statement that the lower-layer neurons are going to learn to detect low-level features. But anyway, is this statement making sense, that if I just found out which of these perceptrons fired and then reassembled their weight patterns, I would be reconstructing most of the salient features of the input, so something that looks like the input? That makes sense, right, because each of the perceptrons is actually detecting features, and so I could partially reconstruct the input using these features. In this particular example, the perceptrons are only going to be capturing features that are relevant to the detection of digits, so they will capture features that make the inputs distinctly like digits or not like digits, but in the more general case I can always do this: I can have a bunch of perceptrons.
I can, any time a perceptron fires, put back its weight pattern into the input, and then I would find that the output is going to look somewhat like the input. Now let me formalize this. In this particular problem it’s not going to look exactly like the input, because that network was optimized to recognize digits, and so the lower layer of neurons will only store distinctly digit-like, or obviously not digit-like, features; the rest are going to be irrelevant and will be lost. But let’s formalize it: let me train a neural network that tries to predict the input itself. This is what we call an autoencoder. The autoencoder has a lower portion, which we call an encoder, which learns to detect all the most significant patterns in the input, and a decoder, which learns to recompose the input from these patterns. So this is in fact an explicit instantiation of what I explained over here, where we said that by reassembling the weights of the perceptrons that fired, we could reconstruct the input; here we are actually trying to train a network to do just this. Now let’s consider the simplest instance of this guy. The simplest instance is just a single neuron: the single neuron has an input, it will fire if the input matches its weights, and if it fires, the reconstructed output is just the weight pattern. Now let me simplify this even further: let me say that this doesn’t have any threshold activation.
It just has a linear activation, so it computes the weighted sum of the inputs that come in here, and this takes a value; instead of being converted to a one/zero output, whatever value comes out over here is directly used to rescale the weights over here. And now I’m going to train this guy to minimize the error between X and X hat. So when I do that, X hat is going to be W transpose times W X, and if I minimize the L2 divergence between the input and the reconstruction, it’s going to be norm of X minus X hat squared, which is norm of X minus W transpose W X squared, and I’m learning the W to minimize this error over a bunch of training data. And what we will find is that, if your data are all zero mean, this ends up being PCA. Any of you who have ever dealt with PCA are going to recognize this equation: this W now represents the principal component of the collection of training data. And so basically what you would be doing is to detect whether this principal component has occurred in the data instance, and you’re going to be reconstructing the data as some weighted version of this principal component itself. Now, one outcome of this is that regardless of what this guy fires, the output is going to be some scaling of W transpose, and a scaling of W transpose is basically a scaling of the vector W itself, so regardless of the input that goes into this network, the output is always going to lie on a line, or more generally a hyperplane; in this case it’s just a line. So this autoencoder finds the direction of maximum energy, or maximum variance if the input is zero mean, and all input vectors are going to be mapped onto some point on this principal axis.
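As a concrete check of this claim, here is a small sketch (the data, the learning rate, and the iteration count are made up for illustration) that trains the single-linear-neuron autoencoder x_hat = W^T W x by gradient descent and compares the learned direction with the first principal component computed by SVD:

```python
import numpy as np

rng = np.random.default_rng(0)

# zero-mean 2-D data stretched along one direction (illustrative)
X = rng.normal(size=(500, 2)) * np.array([3.0, 0.3])
X -= X.mean(axis=0)

# single linear hidden neuron with tied weights: x_hat = W^T W x
W = rng.normal(size=(1, 2))
for _ in range(2000):
    E = X - (X @ W.T) @ W                             # reconstruction error
    grad = -2.0 * W @ (E.T @ X + X.T @ E) / len(X)    # d/dW of mean ||x - W^T W x||^2
    W -= 0.01 * grad

# the learned weight direction matches the top principal component (up to sign)
w_dir = W[0] / np.linalg.norm(W[0])
pc1 = np.linalg.svd(X, full_matrices=False)[2][0]
print(abs(w_dir @ pc1))   # close to 1: same direction
```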
And now, because of the nature of this, where the output is always some scaling of W transpose regardless of the input over here, meaning regardless of the output of this orange ball, the final output of the network is going to be something on this line; it’s simply going to be an output that lies along the major axis of the data. This means that this network basically learns to project the data down onto a single line such that the projections of the data onto this line have the lowest error resulting from the projection. So, for example, if this data instance is projected onto the line, this length is going to be the error. Over the entire training data, the total squared error over all training instances is going to be minimized, and the decoder portion of the network, which is the upper portion, is going to be capturing the slope of this line; this is the minimum-error direction, the principal axis. Is that making sense, guys? Yes? Right, so that’s with one dimension, but I can have a network which has multiple hidden neurons of this kind, and when I have multiple hidden neurons the equation is still the same: assuming these still have linear activations, the output of the hidden layer is W X, and the output of the network itself is W transpose W X. So if I find the W that minimizes the squared error between the reconstruction and the input, this is going to find me the principal subspace for the data, and this is still PCA, and the output reconstructions are always going to lie on the principal subspace regardless of the input to the network. So here is your last poll. Okay, five seconds, guys.
Does anyone want to answer the first question? True or false? The second one is also true: an autoencoder with a linear activation in the hidden layer performs principal component analysis of the input, and an autoencoder with linear activations in the hidden layer that has been trained on some data can only output values on the principal subspace, regardless of the input. So here is some terminology: the portion that reconstructs is what we will call the decoder, and the portion of the network that computes this lower-dimensional representation is what we’ll call the encoder. The encoder is the analysis network, which computes the hidden representation; the decoder is the synthesis network, which recomposes the data from the hidden representation. And what we’ve seen is that in the case where the hidden layer has linear activations and the decoder is just performing a linear combination, the weights of the decoder network represent a principal subspace, and regardless of the input, the output is always going to be on this principal subspace.
So there’s no guarantee, right, that the individual neurons will learn the different principal components, but if you’re trying to minimize that error with linear activations, the network as a whole will learn the principal subspace; with nonlinear activations it gets a little more complex. In the linear case, together these neurons will learn the principal subspace, although the individual neurons may not each learn a principal component; they just end up learning some linear combination of the principal components, but between them they will span the principal subspace. Now here is the intriguing part. When the network is linear, the output can only lie on a linear subspace. But what happens when the network is nonlinear, if I throw in a bunch of nonlinear activations in the decoder? Then, if you look at the relationship between the input and the output, it’s actually going to capture a nonlinear manifold. When the hidden layers of the decoder have nonlinear activations, the network folds the surface, as we saw earlier, and so it ends up performing nonlinear principal component analysis. So here, for example, if I have an encoder and a decoder of this kind with nonlinearities, this decoder is going to represent some nonlinear surface, and the encoder is going to compute some hidden representation which is essentially a position on this nonlinear surface. And as your network becomes deeper and more complex, with a more complex architecture, the nonlinear surface the network can capture becomes more complicated; these are deep autoencoders. So here are some examples. Here I have data which lie on a spiral. Now clearly there’s only one primary direction of variation, right, guys? I’m going to go a little bit over, please bear with me,
five minutes over. There’s only one principal direction of variation, so if I were to ask you where the data lie, I’d say the data lie at different positions for different points along the spiral. So suppose I train, in this case, an autoencoder of this kind (here we’ve used ELU activations, with the architecture given, and the hidden representation is just one variable, a single neuron). Then what we find is that the decoder ends up learning something like the spiral: regardless of the input that you give it, the decoder is going to generate something on the spiral, and so this network actually ends up learning the structure of the data. But it’s not so simple. Once I train the network, the decoder learns a spiral, but it’s not monotonic. For example, if I take the input over here and vary it from, in this case, zero upward, you’d expect the decoder to monotonically trace out the spiral. That’s not what it does. What it does is generate the spiral up to here, but then instead of continuing this way, it jumps to the side, and then it goes back this way, and then jumps back over here and continues. It stays on the spiral only within the region where it actually saw training data; if you give it z values which were not seen in training, which correspond to hypothetical inputs over here, it doesn’t continue to generate the spiral but sort of goes away. But at least within the region of the training data it actually learns the spiral. Same thing over here: here we have this data, and when you train an autoencoder, the decoder ends up learning to generate this sinusoid-like wave function, except of course when you give the decoder inputs that it never saw in training, it doesn’t continue the sinusoid; it sort of goes off along a line. So, what’s happening over here, and the
fact that the decoder in the autoencoder learns a nonlinear manifold, is this making sense to you guys? So what does that mean? When the hidden representation is of lower dimensionality than the input, we’ll often call this a bottleneck network. It’s a nonlinear PCA: it learns the manifold for the data, if properly trained. If I train this network on lots and lots of data from a specific source, it turns out that in the real world, data don’t lie scattered all over the space: when you take any particular source of data of a specific kind, the fact that the data are structured means that most of the data lie very close to some nonlinear manifold. And so, when you train the network, it actually learns the manifold that the data lie on, and once it’s properly trained, the decoder can only generate data on the manifold that the training data lie on. This also makes it an excellent generator for the distribution of the training data: once I train the network, because the decoder has learned the principal manifold of the data, regardless of what you give it, the decoder is going to generate some data on this principal manifold, which you can expect will look like the training data. So, for instance, if we trained our autoencoder on digits data and then took the decoder, regardless of the input that you give it, the output is going to end up looking something like a digit; it’s going to produce something that’s typical of the source. Here’s an example: in this case we trained an autoencoder on spectrograms from saxophones. Then I just take the decoder and excite it with, in this case, a one over here and zeros here, and here is what the decoder outputs. If I play it, it actually ends up sounding like something that could have come from a saxophone. Here’s something that I get when I give a different input to the decoder. This is a bottleneck autoencoder, because the hidden representation has lower dimensionality than the data, but as
you can see, the decoder is actually successfully learning the data manifold: for the saxophone, when I train it on the saxophone, regardless of what I give the decoder, it sounds saxophone-ish. Here I train it with a clarinet; these are not pure notes, but regardless of what I give the decoder once it’s trained, it’s producing clarinet-like sounds. Okay guys, just go ahead and do this poll, 30 seconds for the poll. Okay, five seconds, guys. Okay, the first statement, true or false? The second one, the third, and the fourth: the decoder is now a dictionary which composes data like the training data in response to any input. All of these statements are true. An autoencoder with nonlinear activations performs nonlinear PCA: it finds the principal manifold for the data, near which the training data lie, and this need not be linear. The decoder of the nonlinear autoencoder can only generate data on this principal manifold regardless of the input, and so the decoder can essentially be thought of as a dictionary which can only compose data like the training data in response to any input. And so I’m going to use this, in the next four minutes of your time, for a very cute application: signal separation. I’m given a mix of multiple sources, and I want to separate out the sources. So here’s the problem: I have a recording which includes guitars and drums, and I want to process it so that the guitar is separated from the drums. The standard approach here is something called a dictionary-based approach, where I learn a dictionary of building blocks for each source. So I’d have a collection of training data for, say, the guitar, and I learn a model which can only generate sounds that sound like the guitar; then similarly I’d have a collection of training data from the drums, and I’d learn a model which can only generate data that sound like the drums. And now, when I have a mixed recording,
I’m going to try to figure out how to select entries from the guitar dictionary and entries from the drums dictionary so that, when these are summed up, the result sounds like my mixed recording. And once I do that, the entries from the guitar dictionary, when recombined, give me just the guitar portion of the recording, and the entries from the drums dictionary give me just the drums portion. So that’s basically what I’m going to do. To build these dictionaries, I’m going to use these autoencoders: I’d train one autoencoder on my first source and a second autoencoder on my second source, and now I know that the decoder of this autoencoder can only generate sounds like the first source, and the decoder of this autoencoder can only generate sounds like the second source. And now, when I’m given my mixed recording, I’m going to say that there was some sound produced by this guy and some sound produced by this guy that, when added together, gave me my mixed recording. But then how do I generate sounds from these dictionaries?
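One way to see why exciting a trained decoder always yields something source-like: a decoder whose input is a single number can only ever trace out a one-dimensional curve in the output space, whatever that input is. A minimal sketch (the decoder here is a random, untrained stand-in; only the architecture matters for the argument):

```python
import numpy as np

rng = np.random.default_rng(0)

# a hypothetical decoder: 1 -> 16 (tanh) -> 2; the weights are random here,
# but the point depends only on the architecture, not on training
W1, b1 = rng.normal(size=(1, 16)), rng.normal(size=16)
W2, b2 = rng.normal(size=(16, 2)), rng.normal(size=2)

def decode(z):
    # map each scalar latent value to a point in 2-D output space
    h = np.tanh(np.atleast_2d(z).T @ W1 + b1)
    return h @ W2 + b2

# sweep the single latent variable: the outputs trace a 1-D curve in 2-D space,
# so whatever excitation you supply, the decoder can only produce points on it
z = np.linspace(-3, 3, 200)
curve = decode(z)
print(curve.shape)   # (200, 2)
```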
Each dictionary has to be excited in some manner, right? So I’m going to say that there was some excitation with which I had to excite the decoder for the first source, and some other excitation with which I had to excite the decoder, the dictionary, for the second source, such that when I summed their outputs, the result looked like my mixed recording. And so, given my mixed recording, I can use back propagation. These dictionaries are already fixed, already learned: this one has been learned from source 1, this one from source 2. So now I can just take the mixed recording and use backprop to ask: what must the input to this guy be, and what must the input to this guy be, such that, using these inputs, when I sum the outputs of these two networks, the result looks like my mixed recording? Once I learn those inputs, the output of just this portion of the network gives me my first source, and the output of just this portion gives me my second source. Let’s see how well that works. So this is a mixture of two instruments, one of which is a string instrument, and these dictionaries were the decoders of autoencoders, each of them with five layers, each 600 units wide. And here’s what it separates out for the first source, and for reference, these are the two sounds we mixed. As you can hear, it actually does, in this case, a near-perfect job of separating the two sources from this mixture. So the point over here is that the decoder of the autoencoder, the dictionary, learns the underlying structure, the underlying manifold, of the data, and when it’s properly trained it is therefore, in a sense, designed to only generate data from that manifold, which in this case we put to use for the problem of separating sources. So the story for the day is that classification networks learn to predict the
a posteriori probabilities of classes. The network up to the final layer is a feature extractor that converts the input data to be almost linearly separable; the final layer is a classifier, a predictor, that operates on the linearly separable data. And neural networks can also be used to perform linear or nonlinear PCA: autoencoders, which can be used to build constructive dictionaries for the data, which in turn can be used to model data distributions. We’ll focus more on this second topic next week. So I’ll stop here, I’ll take some questions, and I’ll also stop my recording. Any questions?
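The separation scheme described in the lecture can be sketched as follows. This is a simplified version: the two frozen decoders are replaced by linear dictionaries (hypothetical random matrices D1 and D2) so that the gradient step through each decoder can be written by hand; in the lecture the decoders are deep nonlinear networks, and backpropagation supplies the same gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 20, 3   # signal dimension, latent (excitation) dimension per source

# stand-ins for the two trained decoders; in the lecture these are deep
# nonlinear networks, here linear "dictionaries" for hand-written gradients
D1 = rng.normal(size=(k, d))   # decoder for source 1 (frozen)
D2 = rng.normal(size=(k, d))   # decoder for source 2 (frozen)

# a mixture of one sound from each dictionary
z1_true, z2_true = rng.normal(size=k), rng.normal(size=k)
mix = z1_true @ D1 + z2_true @ D2

# backprop-style search: the decoders stay fixed, only the inputs z1, z2 move
z1, z2 = np.zeros(k), np.zeros(k)
for _ in range(500):
    err = z1 @ D1 + z2 @ D2 - mix
    z1 -= 0.01 * (err @ D1.T)    # gradient of 0.5*||err||^2 w.r.t. z1
    z2 -= 0.01 * (err @ D2.T)    # gradient of 0.5*||err||^2 w.r.t. z2

s1, s2 = z1 @ D1, z2 @ D2        # the separated sources
print(np.allclose(s1 + s2, mix, atol=1e-3))
```

Once the excitations are found, the output of each decoder alone is the corresponding separated source; with deep decoders the search is the same, just driven by autodiff instead of these hand-written gradients.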