11-785, Fall 22 Lecture 20: Learning Representations
This is going to be the first of a few lectures that will be on Zoom. Today we’re going to be talking about what neural networks learn. What we’ve seen so far is that neural networks are universal approximators. They can model any Boolean, categorical, or real-valued function. They can check static inputs for patterns and scan for patterns; those were MLPs and CNNs. They can analyze time series for patterns; those were recurrent networks. In each case, they must be trained to make their predictions. But when we train them to make their predictions, what do they learn internally? What does our network learn? (Can you mute yourself, please?)
So here’s the learning problem for neural networks. You’re given a collection of input-output pairs: the inputs x and the corresponding desired target outputs. And we have to learn the network parameters so that the network captures this desired function. When you’re trying to learn a network to perform classification, the network must learn the classification boundaries that separate the training instances from the two classes. So for example, suppose we were given this set of training data with the red and the green dots, where the color represents the label, so that the task is to learn a network that, based on position, can figure out which of these two classes the data belong to. That network must learn this double pentagon decision boundary. So it must learn a function of the kind shown to the right, which takes the value one within the double pentagon region and zero outside.
Now in reality, the kind of training data we get are not going to be so cleanly separated. You’re not going to have clearly red regions and clearly blue regions. It’s going to be somewhat noisy. You’re going to have some blue dots on the red side and some red dots on the blue side. So what function do we want to learn then? Let’s try to understand this.
Let’s consider a trivial example of a network which has only one neuron, and specifically consider this two-dimensional example. If you have a single neuron, we know that it learns a linear decision boundary. We would want to learn a step function of this kind, which distinguishes between the region where the red dots lie and the region where the blue dots lie. Except the training data you’re given are going to have some red dots on the blue side and some blue dots on the red side. So you’d have some red dots suspended over the blue side over here, and blue dots lying on the floor on the red side. And from these noisy data (the labels are not noisy, but the data are not cleanly separated) you must learn this function.
To understand this even better, let’s go back down to an even simpler problem, a one-dimensional example. Here we have data from two classes, the blue dots representing one class and the red dots representing the other. For the red dots the class label is one; for the blue dots the class label is zero. Now in this one-dimensional case, a linear classifier is a threshold function which says everything to the left of the threshold belongs to one class and everything to the right belongs to the other class. These two classes are clearly overlapping, so they are not linearly separable. There’s no threshold that will cleanly separate the red dots from the blue dots. Also, as we’ve seen in an earlier class, the neural network is a universal approximator that can learn an arbitrarily complex function. So we could end up learning a function such as this one, which is something that we don’t really need. And you have this other issue. Even if you assume that the network could learn a function of this kind, and is allowed to learn a function of this kind, what if we have a situation such as this, where you have red and blue dots at the same value of x?
So you have some training instances with this value of x where the class label is one, and other training instances where the class label is zero. If, for instance, you have 90 instances with class label one and 10 instances with class label zero, then what is the function we must learn over here? Must the function value be one at this x because red dominates? Or would it make more sense for the function to take the value 0.9, to show you what fraction of the instances take the label value one? Which of these two makes more sense to you? Anyone? Can someone answer that question? The 0.9, right. That’s because it gives you more information. You can indeed derive the fact that the majority are one from this 0.9, but it also tells you more about the nature of the data.
So what is 0.9 over here? It is actually an estimate of the a posteriori probability of class one given this input x. Because 90% of the data belong to class one, when you say 0.9 you’re basically stating the probability that a randomly sampled instance with this value of x will belong to class one, and that probability is going to be 0.9. That’s why we like 0.9, right? But now suppose I have all of these hundred instances, and they are not all exactly at the same value of x. What if they were off by a tiny little bit, by say 10^-15 or 10^-17, which is about the resolution you can represent with doubles? Now, if all of these were slightly off, does it suddenly become more meaningful to find a function which goes up and down at each x? Or does it make more sense to say that in this range the function value must be 0.9? Which of these two makes more sense? Anyone?
You want to say that in this range the ideal output should still be 0.9, right? That small perturbation shouldn’t really change the output. But then what if this perturbation increases? At what perturbation will 0.9 stop making sense? It’s not very clear, right? So let’s look at this slightly differently, and we’ve seen this earlier too. At each point, instead of looking at just that value of x, let’s look at a small window around that point, and plot the average y value within that window. This average y value is an approximation of the probability of y = 1 at that point. So here, at the leftmost point, because all of the training instances in the window belong to class zero, your best guess for the a posteriori probability of class one given x is going to be zero. But then as you slide right from the leftmost point, at some point you begin encountering instances with class label one, and so this average value goes up from zero. As you go further right, the average value keeps increasing till eventually you see only the red points, meaning the average y value is one. So the overall function that you get is going to be something of this kind. And this is the function that makes sense for us to model.
Now, what does this function look like? We’ve seen this also in the previous class. This looks like a sigmoid, right? If we have an input x, then the sigmoid has the function value 1 / (1 + e^(-(w0 + w1 x))). And this sigmoid actually represents an estimate of the a posteriori probability of class one given the input x. Do you guys recall this? Is this making sense to you? Okay. When I ask you this question, guys, please raise your hand so that I can make sure that you’re tracking me. Now let’s see how the sigmoid works. Let’s assume that w0 and w1 are both positive. Then suppose x is a large negative value.
Then w1 x is also going to be a large negative value. So you’re going to have e raised to minus a large negative value, which is e raised to a large positive value, which is going to be about infinity. So for large negative values of x, this is going to be one over infinity, which is zero. That’s why the curve has the value zero at large negative values of x. For large positive values of x, you’re going to have e raised to minus a large positive value, which is like e raised to minus infinity in the limit, which is zero. And so this function is going to be one over one plus zero, which is one. So for large positive values of x, this function takes the value one, and it swings smoothly from zero to one as you go from large negative values to large positive values. That gives it this nice characteristic shape. This shape is representative of the fact that when you have one-dimensional data of this kind, on one side one class dominates, and on the other side the other class dominates. And as you go from one end to the other, the fraction of data from the other class slowly increases till it dominates. That’s why you get this kind of curve.
Now a logistic perceptron, as we know, which is a perceptron with a sigmoid activation, is basically just this sigmoid function. If it had just a single input, then this perceptron computes 1 / (1 + e^(-(w0 + w1 x))). And in fact this computes the a posteriori probability of class one given the input. Now consider multi-dimensional data, like the two-dimensional example over here. The data are in two dimensions and they’re not separable. You have blue dots on the red side and red dots on the blue side, but there is still a boundary. Now start on one side, walk toward the boundary, continue along in the same direction, and cross the boundary.
Initially, you’re going to see all blue dots, then you’re going to see increasing numbers of red dots, and eventually you’re going to see all red dots. So even there, you end up with an a posteriori probability function of this kind, which looks like a sheet that has been folded into the sigmoidal shape. And this function too is just your standard sigmoid. It’s 1 / (1 + e^(-(sum_i w_i x_i + w0))), where the sum is over all components of x and w0 is the bias. So it has exactly the same kind of a posteriori probability function that we had in the one-dimensional case.
Now, although this function is nonlinear, it represents a linear classifier. Why does it represent a linear classifier? Can anyone tell me? The decision boundary is linear. If I want to take a decision, the way to do it is to say that any instance for which this posterior probability exceeds a threshold, say 0.5, is going to be class one, and the rest are going to be class zero. So to find the boundary, you would equate this guy to 0.5, or two times this guy to one. And if you solve it out, you’re going to find that the equation it gives you is sum_i w_i x_i + w0 = some constant, which is the equation of a line or a hyperplane. So the decision boundary between class zero and class one is going to be a hyperplane; it’s linear. And so, although this perceptron captures this sigmoidal shape, it actually captures a linear decision boundary. It’s a linear classifier.
Now, how would we estimate a function of this kind? When I began drawing this function earlier, it would have been natural for you to ask what the width of the yellow window is. We’re not actually going to assign a specific width to the window. Instead, we’re going to let the data decide.
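The claim that a sigmoid output still yields a linear decision boundary can be checked numerically. The sketch below is only an illustration: the weights `w`, the bias `w0`, and the test points are hypothetical values chosen for the example, not anything from the lecture. It shows that thresholding the posterior at 0.5 gives the same decision as checking the sign of the affine score sum_i w_i x_i + w0.

```python
import math

def sigmoid(z):
    """Standard logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def posterior(x, w, w0):
    """P(y = 1 | x) for a logistic perceptron with weights w and bias w0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return sigmoid(z)

def classify(x, w, w0, threshold=0.5):
    """Class 1 if the posterior exceeds the threshold, else class 0."""
    return 1 if posterior(x, w, w0) > threshold else 0

# Hypothetical parameters and points, chosen only for illustration.
w, w0 = [2.0, -1.0], 0.5
points = [(-1.0, 0.0), (0.0, 0.0), (1.0, 3.0), (0.25, 1.0), (-2.0, -5.0)]

for x in points:
    score = w[0] * x[0] + w[1] * x[1] + w0
    # Thresholding the sigmoid at 0.5 is the same as checking score > 0,
    # so the boundary is the hyperplane w . x + w0 = 0.
    assert classify(x, w, w0) == (1 if score > 0 else 0)
    print(x, round(posterior(x, w, w0), 4), classify(x, w, w0))
```

A different threshold only changes the constant on the right-hand side of the hyperplane equation, so the boundary stays linear.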
So the way you do it is this. You’re given the training data, many (x, y) pairs, which are represented here by the dots, and from these we want to estimate w0 and w1. I’m going to be using this one-dimensional example over here, but if the data are multi-dimensional, then w1 is going to be a vector and x would be a vector, and the rest would not change. The math is going to be exactly the same. Now, if you want to estimate this model, how would you do it? This curve gives you the probability of y being equal to one given x. Note that I have flipped the colors over here: now y = 1 is being shown by blue. So this curve shows the probability of an instance with that particular value of x belonging to class one.
Correspondingly, there is the a posteriori probability of y being minus one. Here I’m going to be using a plus one / minus one notation for convenience, so the classes are plus one for the blue dots and minus one for the red dots. The probability of y belonging to class minus one given x is going to be 1 - P(y = 1 | x). And if you work it out, that’s simply 1 / (1 + e^(+(w0 + w1 x))). So the difference between the a posteriori probability for class one and the probability for class minus one is merely the sign of this exponent. So we can write both of these in combined form and say P(y | x) = 1 / (1 + e^(-y (w0 + w1 x))). You can see immediately that when y is one, that gives you the formula to the left. And when y is minus one, minus times minus one becomes plus, so this gives you the formula to the right. So this one formula, P(y | x) = 1 / (1 + e^(-y (w0 + w1 x))), captures both of these curves. Is this making sense? The math is not complex. Just confirm it. Okay, so I have a few hands raised. Okay.
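As a quick check of the combined formula, the following sketch evaluates P(y | x) = 1 / (1 + e^(-y (w0 + w1 x))) for both labels and confirms that the two posteriors sum to one at every x. The parameter values are hypothetical and chosen only for illustration.

```python
import math

def p_label(y, x, w0, w1):
    """P(y | x) for y in {+1, -1} under the logistic model."""
    return 1.0 / (1.0 + math.exp(-y * (w0 + w1 * x)))

# Hypothetical parameters, for illustration only.
w0, w1 = -0.5, 2.0

for x in [-3.0, -0.2, 0.0, 0.7, 4.0]:
    p_pos = p_label(+1, x, w0, w1)   # the formula "to the left"
    p_neg = p_label(-1, x, w0, w1)   # the formula "to the right"
    # The two class posteriors must sum to one at every x.
    assert abs(p_pos + p_neg - 1.0) < 1e-12
    print(f"x={x:+.1f}  P(+1|x)={p_pos:.4f}  P(-1|x)={p_neg:.4f}")
```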
I’ll just assume that those few hands represent the class. But please do respond, because otherwise I don’t know class-wide. Now, I want to learn the model. What we would be given is a collection of training instances, (x, y) pairs, where for each instance we have the x value (this is just a one-dimensional illustration; in general it’s going to be a vector) and the class value, which is either one or minus one. Now, if I’m given this collection of training instances, then assuming all the training instances are independent, the probability that a random draw would give you this specific collection of training instances is the product over all of the instances of the joint probability P(x_i, y_i). Or, using Bayes’ rule, I can write this as the product over all instances of P(x_i) times P(y_i | x_i). And our model for this posterior probability is given by this guy over here, right? P(y | x) = 1 / (1 + e^(-y (w0 + w1 x))). So I can write the total probability of all of my training data as the product over all training instances of P(x_i) times 1 / (1 + e^(-y_i (w0 + w^T x_i))). Here I’ve written w^T x_i, using w as shorthand notation for the weights, to cover the vector case. So when I’m given a collection of training instances, this term in red is the joint probability of all of the data as given by our model, and our model parameters are w0 and w over here. So the joint probability of all of our training data is the product over all instances of P(x_i) times the
sigmoid function computed at that x_i and y_i. I can separate the x_i and y_i terms out, so the log of the probability of the training data is simply the summation over all training instances of log P(x_i), plus the summation over all training instances of the log of the sigmoid function computed at that x_i and y_i. And now if I were to see which of these terms actually depends on the network parameters, it’s only this term highlighted in blue.
Now, there’s a standard framework for learning a parametric model for the probability distribution of data. The manner in which we do it is that we assume a probability distribution which has some parameters, and we try to estimate these parameters such that the probability assigned to the data by this model is maximized. We will revisit this topic next week. This is called the maximum likelihood estimation procedure. Using maximum likelihood estimation, our best guess for w0 and w1 is simply the w0 and w1 that maximize the log likelihood, that is, the log probability of the training data. That is the arg max over w0 and w1 of this term in blue, because the first term doesn’t depend on the parameters of the model. Or alternately, it is the arg min over the parameters of minus the summation over all training instances of the log of the probability assigned to that instance by the model. This term over here you will recognize as simply the negative log likelihood, and it is identical to the Kullback-Leibler divergence between the desired output y, represented in one-hot format, and the actual output given by the model.
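The maximum likelihood procedure described here can be sketched in a few lines. Everything specific in this example is an assumption made for illustration: the synthetic one-dimensional data (two overlapping Gaussian clusters), the learning rate, the number of steps, and the use of plain gradient descent, which the lecture does not prescribe at this point. The quantity being minimized is the negative log likelihood, sum_i log(1 + e^(-y_i (w0 + w1 x_i))), which is the parameter-dependent term identified above.

```python
import math
import random

def neg_log_likelihood(data, w0, w1):
    """-sum_i log P(y_i | x_i) = sum_i log(1 + exp(-y_i (w0 + w1 x_i)))."""
    return sum(math.log1p(math.exp(-y * (w0 + w1 * x))) for x, y in data)

def mean_gradient(data, w0, w1):
    """Gradient of the mean negative log likelihood w.r.t. (w0, w1)."""
    g0 = g1 = 0.0
    for x, y in data:
        # d/dz log(1 + exp(-y z)) = -y / (1 + exp(y z)), with z = w0 + w1 x
        s = -y / (1.0 + math.exp(y * (w0 + w1 * x)))
        g0 += s
        g1 += s * x
    n = len(data)
    return g0 / n, g1 / n

# Synthetic, overlapping 1-D data (an assumption for this sketch):
# class -1 centered at -1, class +1 centered at +1.
random.seed(0)
data = [(random.gauss(-1.0, 1.0), -1) for _ in range(100)]
data += [(random.gauss(+1.0, 1.0), +1) for _ in range(100)]

w0, w1 = 0.0, 0.0
initial_nll = neg_log_likelihood(data, w0, w1)

learning_rate = 0.5  # hypothetical setting
for _ in range(500):
    g0, g1 = mean_gradient(data, w0, w1)
    w0 -= learning_rate * g0
    w1 -= learning_rate * g1

final_nll = neg_log_likelihood(data, w0, w1)
print(f"w0={w0:.3f}  w1={w1:.3f}  NLL: {initial_nll:.2f} -> {final_nll:.2f}")
```

Because the classes overlap, no threshold separates them, yet the fitted sigmoid still gives a sensible estimate of P(y = 1 | x) across the overlap region.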
That output is this logistic formula. And so what we find is that training this model to minimize the KL divergence between the target output and the output of the network is exactly the same as maximizing the log likelihood of your training data. So in fact, training the model to minimize the KL divergence is the same as maximum likelihood learning of our logistic function. When we train our network to minimize the KL divergence between the output and the target output, we are in fact performing maximum likelihood learning of a parametric model for the distribution of the data. Is this making sense?
But this was for a linear classifier, right, where we have data that we are trying to linearly separate. What happens when we have data of this kind, where the decision boundaries are not linear? A very analogous situation still holds. For the moment, let’s first consider the case where the classes are separable. In this example we are trying to separate the red and the green classes, and they can indeed be separated by this double pentagon decision boundary. So to classify, the network must learn to output a one within these pentagons and a zero, or a minus one depending on how you set it up, outside. Assume that you have a sufficient network. We’ve seen in our first and second lectures that for this double pentagon decision boundary a network of this kind suffices. You have this subnet, with five first-layer neurons and a second-layer neuron over here, which captures one pentagon, and you have a second subnet which captures the second pentagon. And then you have this final neuron over here which ORs over these two guys.
Now, we know that our perceptrons are linear classifiers, right? So this final perceptron is a linear classifier, and it’s doing a perfect job of separating the red and the green classes. Then if you look at what the perceptron itself sees, what can you say about the values y1 and y2 that are being fed to it? What must their characteristics be? Can anyone tell me? The perceptron is a linear classifier, right? And assume this network is, as we said, set up so that it perfectly separates the two classes. This final perceptron gets as its inputs these two guys; let me call them y1 and y2. If I were to plot the scatter of y1 and y2 for all of my training instances, what must it look like for this perceptron to be able to cleanly separate the two classes? Anyone? They must be linearly separable, right? Because this guy is a linear classifier. If it’s a linear classifier and it’s able to separate the red and blue data in this (y1, y2) space, then in the (y1, y2) space the red and blue data must be linearly separable. In other words, for this complex network, the output of the penultimate layer, the second-to-last layer, must consist of linearly separable data from the two classes. Is that making sense?
So the network in fact consists of two parts. One is this linear classifier, which is the final classification layer. The other is the rest of the network, which starts off with data that have this ugly distribution and somehow manipulates and transforms them so that you get a modified representation of the data, such that in this new space the classes are linearly separable. So the network actually has two parts: first this guy, which takes your data from the various classes and rearranges them so that they are linearly separable, and then the final output layer, which actually performs a linear classification task. Is this making sense? Perfect.
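The two-part view of the double pentagon network can be sketched with hand-built threshold units. This is an illustrative construction, not the lecture’s trained network: the pentagon centers, the radius, and the helper names (`pentagon_units`, `subnet`, `network`) are all assumptions. Each subnet has five half-plane neurons and an AND neuron, and the final neuron ORs the two subnet outputs y1 and y2.

```python
import math

def step(z):
    """Threshold activation: fires when its input is non-negative."""
    return 1 if z >= 0 else 0

def pentagon_units(center, radius):
    """Five (normal, bias) pairs, one half-plane neuron per pentagon edge."""
    apothem = radius * math.cos(math.pi / 5)  # center-to-edge distance
    units = []
    for k in range(5):
        # Outward normal of edge k of a regular pentagon.
        angle = math.pi / 2 + math.pi / 5 + 2 * math.pi * k / 5
        n = (math.cos(angle), math.sin(angle))
        # Inside the half-plane when n . (p - center) <= apothem.
        bias = apothem + n[0] * center[0] + n[1] * center[1]
        units.append((n, bias))
    return units

def subnet(p, units):
    """Five first-layer neurons followed by an AND neuron."""
    hidden = [step(b - (n[0] * p[0] + n[1] * p[1])) for n, b in units]
    return step(sum(hidden) - 4.5)  # fires only if all five fire

# Hypothetical geometry: two unit pentagons side by side.
LEFT = pentagon_units((-1.5, 0.0), 1.0)
RIGHT = pentagon_units((1.5, 0.0), 1.0)

def network(p):
    """Return the penultimate features (y1, y2) and the final OR output."""
    y1, y2 = subnet(p, LEFT), subnet(p, RIGHT)
    return (y1, y2), step(y1 + y2 - 0.5)

for p in [(-1.5, 0.0), (1.5, 0.0), (0.0, 0.0), (0.0, 3.0)]:
    print(p, network(p))
```

In the (y1, y2) space the inputs collapse onto a handful of points: (0, 0) for everything outside, and (1, 0) or (0, 1) for points inside a pentagon. The line y1 + y2 = 0.5 separates them, which is exactly what the final linear unit needs.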
Now observe that this is true of any sufficient structure. Here this network was exactly what we needed for this double pentagon model, right? But I could have larger, over-parametrized networks which also do a perfect job of separating the red and the blue classes. So it’s not just the optimal structure. For any sufficient structure, the network consists of two portions: the portion below the final output layer, which converts the data into linearly separable classes, and the final classification layer, which actually performs the job of linear separation.
Now suppose that for some reason the network below was somewhat lacking, so it didn’t have a sufficient number of neurons or a sufficient number of connections. We know from what we’ve learned before that there’s such a thing as sufficiency of architecture for any problem: the network has to have the capacity to compute the specific function that we want it to compute. If the network doesn’t have that capacity, then when we train the model, this portion of the network will nonetheless try to transform the data so that they are as linearly separable as possible. It won’t succeed, because the model is insufficient, but it will still get as far as it possibly can. So maybe instead of being perfectly separable, the data now have a little bit of overlap in the boundary region. But it will bring them as close to linearly separable as possible, and then the final output layer is going to try to perform linear separation on these data.
Let me launch the first poll… Okay, 10 seconds, guys. Okay, does anybody want to answer the first question? What is the answer to the first question? And the second one? So the portion of the network until the second-to-last layer is essentially a feature extraction module that extracts linearly separable features for the classes.
And the output layer is a linear classifier that can only perform well if the rest of the network transforms the input space such that the classes are linearly separable. Both of these are true. The second one is not false, right? Because the output layer is a linear classifier, if the data input to it are not linearly separable, it’s going to fail. So for the output layer to do a good job, the rest of the network must transform the input space such that the classes are linearly separable. Does that make sense to those of you who said false? Yeah, good.
Now, in this example over here we just assumed that we have data of this kind, where the two classes can be separated by this model. More generally, you’re going to get data of this kind, where you have blue data on the red side and red data on the blue side. It’s going to be fuzzy. It’s not going to be so clean. Even here you have the same situation. When we say the classes are not separable, it means they’re not separable by the specific architecture that you have chosen, unless you have coincident data, where you have data with the same x but from different classes, which is highly unlikely in real life. They’re always going to be slightly perturbed; the x is never going to be exactly the same. When you have data of that kind, then as we know, because the neural network is a universal approximator, with a large enough network I can always eventually learn a model which perfectly fits every instance. But that’s going to be kind of bogus. So you limit the capacity of the network by limiting its architecture, to make sure that it doesn’t follow every little bump in your data and that it gives you a smoother surface. So when I say the classes are inseparable, it means the classes are not separable using the specific architecture that you have chosen. The same picture holds even in that case.
What you will find is that the lower portion of the network is going to try to rearrange the data so that they are almost linearly separable, like in this figure to the left. And then the classification layer on top is going to try to do the best job it can of computing a linear classifier that separates the red and blue classes over here with the maximum possible accuracy. Is this making sense? Okay, at least to some of you.
Now let’s go back to what the output neuron is doing. When I have a logistic function, what does the output neuron compute? A probability, of class one. Yeah, it computes the a posteriori probability of the classes, right? But it actually computes the posterior probability of the classes given the input to the neuron, and the input to the neuron is f(x), where f(x) is the function represented by this gray box, the rest of the network. Again, the network comprises two components: this function shown by the gray box, which I’m calling f(x), which tries to make the data linearly separable, and then the classification layer. So the output softmax that you compute is the a posteriori probability of the classes given f(x), which is basically the logistic computed at f(x). But being given f(x) is basically the same as being given x. So in fact the output neuron computes the a posteriori probability of the classes given the input x, regardless of the fact that in the input space itself the classes are not linearly separable and have some ugly separating boundary. So even when the data are not separable and the boundaries are not linear, you still have this situation where the output of the network is in fact the a posteriori probability of the classes. For multi-class networks, it is going to be the vector of a posteriori class probabilities.
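The remark that the output softmax is basically the logistic can be made concrete for two classes. In the sketch below, a softmax over the scores [z, 0] gives the same probability for the first class as the logistic function evaluated at z, and the outputs always form a probability vector. The score values are arbitrary test inputs chosen for illustration.

```python
import math

def softmax(scores):
    """Softmax over a list of scores, shifted by the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

for z in [-4.0, -0.5, 0.0, 1.5, 6.0]:
    probs = softmax([z, 0.0])
    # The outputs form a vector of class probabilities.
    assert abs(sum(probs) - 1.0) < 1e-12
    # With two classes, softmax reduces to the logistic of the score gap.
    assert abs(probs[0] - sigmoid(z)) < 1e-12
    print(f"z={z:+.1f}  softmax={probs[0]:.4f}  sigmoid={sigmoid(z):.4f}")
```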
And as we saw earlier, when I just have this one neuron, training it to minimize the KL divergence is the same as performing maximum likelihood training of this neuron. But the same holds when I think of the entire network as a single unit. If this output neuron has a softmax or a logistic function, and we are trying to minimize the KL divergence between the actual output of the network and the desired class labels, then what we are actually performing is maximum likelihood training of the entire network. The entire network is now just a parametric model that is intended to capture the a posteriori probability of the classes given the input, and the training process is simply a maximum likelihood algorithm that learns the parameters of this model. So anytime we train a neural network, we might think that we are just minimizing a loss, that we are training it to perform classification. But what we are actually doing is learning a statistical estimator for the distribution of the data, and we are learning it using maximum likelihood training. Is this making sense? Guys, any questions? No? Okay. Thank you.
Okay, 10 seconds, guys. All right, is this first statement true? A classification neural network is just a statistical model that computes the a posteriori probabilities of the classes given the inputs. It is true. What about the second statement? Also true: training the network to minimize the KL divergence is the same as maximum likelihood training of the network. What about the third one? Training the network by minimizing the KL divergence gives us an ML estimate only when the classes are separable. That is not true, right? And the fourth: it is valid, and possibly beneficial, to train the network and subsequently replace the final layer by other classifiers. You can imagine that, right? Basically, here is what is happening.
Once you train the network, this portion of the network is actually transforming the input data to become as linearly separable as possible, right? Once you’ve done that, you now have a function f which takes your data and rearranges it so that the classes are linearly separable. And now, using these new features, which are the y’s, you could use any other linear classifier. It doesn’t have to be a logistic function. It could very well be, for instance, a support vector machine that you use to perform the classification. So the fourth statement is also true.
So the story so far: a classification MLP actually comprises two components. There is a feature extraction network that converts the inputs into linearly separable, or nearly linearly separable, features. And there is a final linear classifier that operates on those features. The softmax, the final layer of the network, actually computes the a posteriori probabilities of the classes. And training the network to minimize the KL divergence is identical to maximum likelihood training of the network. But here is the kicker, right? Regardless of whether you’re minimizing the KL divergence or some other divergence like the L2 divergence, the minimum can be assumed to be at the same point in the parameter space. So this really means that regardless of how you train the network, you’re actually performing maximum likelihood training of the network. It’s just that some loss functions are going to be a cleaner representation of the likelihood than others.
So that’s fine, right? We found out what is happening in this y space. What about the lower regions? What do they compute? So instead of y, what does this portion of the network compute? These too compute features, but what do these features look like? Think of the manifold hypothesis. You had some arrangement of the data in the input space of x.
There, the features were not linearly separable. In the space of y, the features became linearly separable, right? So what would you expect was happening to the data in between, as they went through the network, keeping in mind that they start off not being linearly separable, arranged in some horrible manner, and finally end up being linearly separable by class? Yeah, you would expect that as the data go through the network, they become more and more linearly separable, right? But let’s look at exactly how this happens.
Here’s a nice little example. Here the network is drawn top to bottom, so I’m just flipping my notation for the purpose of this illustration, to match this figure. We have data from two classes, and the decision boundary is circular. We have a bunch of blue dots from inside the circle and a bunch of red dots from outside. And I’m trying to train this network. The input is in two-dimensional space. The network has one hidden layer with three neurons, the activations for the three neurons are tanh, and then there is a single output neuron. So what happens as the data go through this network? Initially, I’m going to look at just this initial portion of the network, the data itself. Over here you have data in two-dimensional space, and this is the arrangement of my data. Now, the first thing that we do is compute an affine transform, right? Remember, when you implemented your network, there’s a linear transform followed by an activation. So when I go from here to this hidden layer, the first thing that happens is that I use an affine transform to transform my two-dimensional data, and because I have three hidden neurons, I transform it from a two-dimensional space to a three-dimensional space. Does that make sense to everyone?
So basically what we’re doing is, before the activation, we are applying an affine transform. And when I apply the affine transform, what happens? This sheet, which is a two-dimensional sheet, is now going to end up as a two-dimensional sheet suspended in three-dimensional space, like so, right? Because it’s a linear transform, it’s not going to do anything crazy with it. It’s just going to take the space as such and make it a two-dimensional manifold in three-dimensional space. And so now the arrangement of the data is going to look something like this. And then you apply the activation. The activation is nonlinear, right? When the activation is nonlinear, it’s going to take this planar surface and it’s going to warp it and bend it. And when you bend it, you’re going to end up with a nonlinear surface. It’s no longer just a plane. Then the outputs of these neurons, which are this nonlinearly transformed data, are projected down to one dimension using another affine transform, which means that all of this is going to be zapped down onto an axis, projected onto an axis, which is given by the set of weights of this output neuron. And so now the data are going to end up on a line, scattered like so. And then this final neuron is going to apply a threshold on the scatter, and that’s going to give you a decision boundary, right? So let’s look at what it does. Initially, when I’ve just initialized the network, the first affine transform puts the data in three-dimensional space, then the tanh activation warps the surface. Then the second affine transform zaps it all down to a line. And then this final guy applies some threshold and says everything to the left is blue and everything to the right is red.
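The sequence just described (affine transform, tanh warp, projection onto a line, threshold) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the network from the slides: the weights `W1`, `b1`, `w2`, `b2` are random, untrained placeholders, and the helper `sheet_rank` is a hypothetical name introduced here to check that the data form a flat sheet before the activation and a bent one after it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 2-D points, labelled 1 inside the unit circle and 0 outside.
X = rng.uniform(-1.5, 1.5, size=(200, 2))
labels = (np.linalg.norm(X, axis=1) < 1.0).astype(int)

# Hypothetical, untrained weights (random initialisation only).
W1 = rng.standard_normal((2, 3))   # 2-D input -> 3 hidden neurons
b1 = rng.standard_normal(3)
w2 = rng.standard_normal(3)        # 3 hidden values -> 1 output value
b2 = 0.0

pre = X @ W1 + b1                  # affine: a flat 2-D sheet sitting in 3-D
hidden = np.tanh(pre)              # nonlinearity: the sheet gets warped
score = hidden @ w2 + b2           # second affine: everything lands on a line
pred = (score > 0).astype(int)     # threshold on that line


def sheet_rank(points):
    """Dimension of the smallest flat subspace containing the centred points."""
    centred = points - points.mean(axis=0)
    return np.linalg.matrix_rank(centred, tol=1e-8)


print("rank before tanh:", sheet_rank(pre))     # flat sheet
print("rank after tanh: ", sheet_rank(hidden))  # bent sheet fills 3-D
print("accuracy with random weights:", (pred == labels).mean())
```

With random weights the threshold on the line has no reason to match the circle; training is what moves the sheet, the warp and the projection axis into useful positions.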
So now if I go back and ask what the outcome of applying this threshold is in the original two-dimensional space, you’ll find that it hasn’t actually learned the circle. It’s learned something completely stupid. So, is this sequence of pictures making sense to you guys? Kind of, right? But it’ll make more sense when I actually play this animation. This animation shows what happens as you train the network. And observe what happened, right? This is so beautiful. The training sort of figured out first how to position this two-dimensional surface in three-dimensional space. That’s what learning this affine transform, this first layer, did. It figured out how to position this two-dimensional data in three-dimensional space such that when I apply this tanh activation, the center of the circle gets stretched out and the things at the boundary go down the other side. Now it’s no longer a plane. This sheet looks a bit like a cone, with the blues coming out and the reds going down to the side. And then it also learned the line that it had to project the whole thing down on, so that when you projected it, all the blues ended up on one side and the reds ended up on the other side, such that when this guy applies a threshold, you end up learning a decision boundary that more or less cleanly separates the blues from the reds. It hasn’t learned exactly a circle, but it’s done a pretty decent job. So are you able to see what’s going on when you learn the network over here? Any questions? So, you know, this is beautiful, right? You’re sort of repositioning the data in a high-dimensional space and then distorting it so that when you project it down, on the projected dimension, a single threshold captures the decision boundary of interest. This is for a trivial problem. Here’s something for a more complex problem. This is for CIFAR-10.
This is a network with 11 layers, and you can see the 11 hidden layers. We’ve sort of projected the data down into two dimensions for illustration, and you can see that at the beginning of training, none of the classes are linearly separable. But then as you train, here’s what happens. As you go through the layers, the classes become more and more linearly separable. And in fact, by the time you get to this layer, which is not even the final layer, it’s three layers before the final layer, they are already linearly separable. All these last two layers do is to sort of increase the separation between the classes. But basically, as you go through the network, the classes become increasingly linearly separable. You can see the same thing in three dimensions, and the same thing happens again, right? As you go through the layers, the classes become increasingly linearly separable. And so by the time you get to the final layer, the classes are separable, and so you’re able to learn a very nice linear classifier. But in fact, as you train the network, you’ll find that they become linearly separable way before you actually get to the final layer. In this case, you didn’t actually need to get to the ninth layer or the tenth layer for them to be linearly separable. By the seventh or the eighth layer, they’re already separable. So in fact, if you train the entire network, then throw away the final few layers, attach a linear classifier to the top of whatever remains, and then fine-tune, you should still get the same performance, because the classes become linearly separable way before you actually get to the final layer. But the key point is that as the data pass through the network, the classes become increasingly linearly separable. Is this making sense? So, questions, anyone?
OK, so we get an idea of what the network is doing, right? What the lower layers of the network are doing. Now let’s change gears a bit, right? We’ve seen what the network learns here and what the network learns here. We’ve seen what happens to the overall patterns of the data as they go through the network. But what about the individual neurons? What do they capture? To understand this, let’s go back to the basic perceptron itself. The basic perceptron was just a function of this kind. Assuming a linear threshold activation, you computed a weighted sum of the inputs. If that exceeded a threshold, the output was 1, otherwise it was 0. So if you write all of the weights as a vector, then you’re basically computing the inner product between the input vector and the weight vector and comparing it to a threshold, right? But then here’s what this inner product means. Firstly, here’s something surprising. In high-dimensional spaces, almost all vectors are the same length. This may shock you, but if I’m looking at something that’s in 100 dimensions, then when I consider a 100-dimensional sphere, almost the entire volume of the sphere is going to be very close to the surface. And as you increase the dimensionality of the sphere, more and more of the volume ends up being very close to the surface. And so as a result, in high-dimensional spaces, if I randomly choose a vector, with very high probability it is going to be very close to the surface, which means all randomly chosen vectors are going to be approximately the same length. So now, if I assume that all of my vectors are the same length, consider what it means when I say the inner product between two vectors exceeds a threshold. We know that the inner product between two vectors is the magnitude of the first vector times the magnitude of the second vector times the cosine of the angle between the two vectors.
So when I say x transpose w is greater than T, it means cos theta, where theta is the angle between w and x, is greater than some value. So if I say that this neuron fires if this inner product is greater than a threshold, it’s the same as saying that this neuron fires if the angle between w and x is less than something. So in other words, you can think of it as saying that the w represents a template that this perceptron is looking for. And this perceptron itself is looking at a small cone around the template, with some angle over here, and whenever the input falls within this cone, this neuron is going to fire, otherwise it won’t fire. So is this making sense to you guys? Any questions? I’m going to build the rest of the next 20 minutes on this, so let me know if this is making sense or not. I can explain again. Thank you, right? So Rohan, you are going to be my representative for the class. Is this first equation making sense? The simple perceptron is going to fire if the inner product between the weights and the input exceeds a threshold, which is the same as saying that cos theta must be greater than some value, right? Which is the same as saying theta must be less than some value, because cos theta is maximum when theta is 0. In other words, this perceptron fires if the angle between w and x is less than some threshold. So the w represents a typical input that the perceptron is searching for. It doesn’t matter what the norms of w and x are, right? For any given norm of w, what we are saying is that all the x’s have more or less the same length. Because in high-dimensional spaces, for any sphere, the majority, 99.9% of the volume of the sphere, is going to be very close to the surface. So the magnitude of x is going to be pretty much the same for any randomly chosen x. You can see this as you go from just a circle to a sphere, right?
If I take a circular disk and consider a small band around the edge of the disk, there’s not a lot of the area of the circle within that band. But when I go up from a circle to a sphere, a small band near the surface of the sphere captures a much greater fraction of the overall volume than the band near the circumference of a disk. So as you keep increasing the dimensionality of the space, the fraction of the volume that lies close to the surface keeps increasing. And in high-dimensional spaces, pretty much all of it lies close to the surface, which means all of the randomly chosen x’s are going to be about the same length. Which means that for a high-dimensional input, the perceptron is simply treating the w as a template that it’s searching for. It fires most strongly when x is exactly aligned with w. And then as x moves away from w, the firing becomes weaker and weaker, right? And so if I wanted to build a perceptron that was looking at, say, a grid pattern like this, like those old LED digital watches, and it was trying to detect if the input was a 2, all you would do is set the weights to be in the pattern of a 2. And now if you have an input which is not very similar to a 2, the inner product between the two, which is the correlation, is going to be low and the perceptron won’t fire, whereas if the input begins to look more like the 2, the inner product is going to be larger and the perceptron will fire. So the perceptron is in fact a correlation filter.
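A small numerical sketch may help with the two claims above. It relies on one standard fact, that the volume of a ball of radius r in d dimensions scales as r^d, so the fraction of a unit ball within distance eps of its surface is 1 - (1 - eps)^d. The Gaussian sampling, the 100-dimensional template, and the function names `shell_fraction`, `norm_spread` and `fires` are assumptions made for illustration, not something taken from the lecture slides.

```python
import numpy as np

rng = np.random.default_rng(0)


def shell_fraction(dim, eps):
    """Fraction of a unit ball's volume lying within eps of its surface."""
    return 1.0 - (1.0 - eps) ** dim


def norm_spread(dim, n=2000):
    """Relative spread (std / mean) of the lengths of random Gaussian vectors."""
    norms = np.linalg.norm(rng.standard_normal((n, dim)), axis=1)
    return norms.std() / norms.mean()


def fires(x, w, cos_threshold):
    """Perceptron viewed as a cone test: fire if the angle between x and w is small."""
    cos_theta = (x @ w) / (np.linalg.norm(x) * np.linalg.norm(w))
    return bool(cos_theta > cos_threshold)


for dim in (2, 3, 100):
    print(dim, "dims: fraction within 5% of the surface =", shell_fraction(dim, 0.05))

print("length spread in 2 dims:   ", norm_spread(2))
print("length spread in 1000 dims:", norm_spread(1000))

# Hypothetical 100-dimensional template and two inputs.
template = rng.standard_normal(100)
similar = template + 0.3 * rng.standard_normal(100)   # noisy copy of the template
unrelated = rng.standard_normal(100)                   # random, nearly orthogonal input

print("fires on similar input:  ", fires(similar, template, 0.8))
print("fires on unrelated input:", fires(unrelated, template, 0.8))
```

The spread of lengths shrinks as the dimension grows, which is what lets the inner-product threshold be read as a threshold on the angle alone.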
And so now if I have something more complex like this one, where I have a network which looks at these grids and has to decide if the input is a digit or not, what you would expect is that each of these lower-layer perceptrons is going to end up looking for specific features, meaning the weights of these perceptrons are going to be the actual patterns that each of these perceptrons is trying to detect. And the perceptron will fire if that pattern is detected. So if you’re trying to say whether this is a digit or not, one perceptron may capture these horizontal bars on top. Another might capture these vertical ones. The third one might be capturing the lower vertical ones, and so on. The second-layer neurons are now going to be assembling these things to create individual digits. And the outermost one could fire if any of these fire. That’s just one hypothetical possibility. So basically these lowest-layer perceptrons are actually capturing salient features, and they fire if that salient feature is detected in the input. So what this means is that I could look at this perceptron, and if it fired, I’m going to assume that its pattern was present. So I can just take a grid and fill in that pattern, which is the pattern of weights for this perceptron. I check each perceptron to see if it fired. If it fired, I look at the weight pattern that it has, and I fill that weight pattern into my grid. I go through these perceptron by perceptron, find out which perceptrons fired, and fill in their weight patterns in my grid. Could we expect that this is going to reconstruct most of the input? What do you think, guys? So this makes sense. So first, there’s a question: if we train three different networks on the same data, will the same neurons fire for the same input? No. There’s no guarantee, right? This is training. We have no idea what they learn.
We can only make the generic statement that the lower-layer neurons are going to learn to detect low-level features. But anyway, is this statement making sense, that if I just found out which of these perceptrons fired and then reassembled their weight patterns, I would be reconstructing most of the salient features of the input? So, something that looks like the input, right? That makes sense, right? Because each of the perceptrons is actually detecting features, and so I could sort of partially reconstruct the input using these features. In this particular example, these perceptrons are only going to be capturing features that are relevant to the detection of digits. So they will capture features that make the inputs distinctly like digits or not like digits. But in the more general case, I can always do this. I can have a bunch of perceptrons, and anytime a perceptron fires, I put back its weight pattern in the output. And then I would find that the output is going to look somewhat like the input, right? You can assume this. Now, let me formalize this, right? In this particular problem, it’s not going to look exactly like the input, because that network was optimized to recognize digits, and the lower layer of neurons is only extracting distinctly digit-like or obviously not digit-like features. The rest are going to be irrelevant and will be lost. But then let’s formalize it. Let me train a neural network to predict the input itself. This is what we call an autoencoder. The autoencoder has a lower portion, which we call an encoder, which learns to detect the most significant patterns in the input, and a decoder, which learns to recompose the input from these patterns.
So this is, in fact, an explicit instantiation of what I explained over here, where we said that by reassembling the weights of the perceptrons that fire, we could reconstruct the input. So here, I’m actually trying to build the network to do just this. Now, let’s consider the simplest instance of this guy. The simplest instance of this guy is just a single neuron, right? So the single neuron is going to have an input. It will fire if the input matches its weights. If it fires, I just reconstruct the output as the weight pattern, right? Now, let me simplify this even further. Let me say that this doesn’t have any nonlinear activation. It’s just a linear activation. So the weighted sum of the inputs comes in here, and this takes a value. And instead of being converted to a 1/0 pattern, whatever value has come out over here is directly used to rescale the weights over here, right? And now I’m going to train this guy to minimize the error between x and x-hat. So when I do that, what will this perceptron learn? x-hat is going to be w transpose times w x. So if I minimize the L2 divergence between the reconstruction and the input, it’s going to be x minus x-hat squared, which is x minus w transpose w x, the whole squared. I’m learning the w to minimize this error. I learn this over a bunch of training data. And what we will find is that, if your data are all zero mean, this just ends up being PCA. Any of you who have ever dealt with PCA are going to recognize this equation. This w now represents the principal component of the collection of training data. And so basically, what you would be doing is detecting if this principal component has occurred in the data instance, and reconstructing the data as some weighted version of this principal component itself.
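The claim that the single linear neuron ends up at the principal component can be checked numerically. The following is a minimal sketch under stated assumptions: the 2-D toy data, the learning rate, the number of steps and the plain gradient-descent loop are all illustrative choices, and `w` is stored as a 1-D array so the reconstruction is written `w * (w . x)`, which is the same quantity the lecture writes as w transpose w x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical zero-mean 2-D data, stretched along a direction at 30 degrees.
angle = np.pi / 6
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
X = (rng.standard_normal((500, 2)) * np.array([2.0, 0.5])) @ rotation.T
X -= X.mean(axis=0)

# One linear neuron with tied weights: z = w . x, x_hat = z * w.
w = 0.1 * rng.standard_normal(2)
lr = 0.01
for _ in range(2000):
    z = X @ w                      # hidden value for every training point
    err = np.outer(z, w) - X       # reconstruction minus input
    # Gradient of the mean squared reconstruction error with respect to w;
    # w appears twice (encoder and decoder), hence the two terms.
    grad = 2.0 * (err.T @ z + X.T @ (err @ w)) / len(X)
    w -= lr * grad

# Reference answer: top eigenvector of the sample covariance (classical PCA).
cov = X.T @ X / len(X)
eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
principal = eigvecs[:, -1]

cosine = abs(w @ principal) / np.linalg.norm(w)
print("learned w:", w, "with length", np.linalg.norm(w))
print("|cos| between w and the principal direction:", cosine)
```

The sign of `w` is arbitrary, which is why the comparison uses the absolute value of the cosine.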
Now, one outcome of this is that, regardless of how strongly this guy fires, the output is going to be some scaling of w transpose. And a scaling of w transpose is basically some scaling of the vector w itself. So regardless of the input that goes into this network, the output is always going to lie on a line, or a hyperplane, which is the scaling of w, right? In this case, it’s just a line. So this autoencoder finds the direction of maximum energy, or maximum variance if the input is zero mean. And all input vectors are going to be mapped onto some point on this principal axis. And because of the nature of this, where the output is always some scaling of w transpose regardless of the input over here, meaning regardless of the output of this orange ball, the final output of the network is going to be something on this line. It’s simply going to be an output that lies along the major axis of the data. And so this means that this network basically learns to project the data down onto a single line such that the projections of the data onto this line have the lowest error resulting from the projection. So for example, if this data instance is projected onto the line, this length is going to be the error. So over the entire training data, the total squared error over all training instances is going to be minimized. And the decoder portion of the network, which is the upper portion of the network, is going to be capturing the slope of this line. This is the minimum error direction. It’s a principal line in that sense. Is this making sense, guys? Yes, no? Right. OK. So that’s with one dimension, right? But then I can have a network which has multiple hidden neurons of this kind. And when I have multiple hidden neurons, the equation is still the same. Assuming these still have linear activations, the output of the hidden layer is W x. The output of the network itself is W transpose W x.
And so if I find the W that minimizes the squared error between the reconstruction and the input, this is going to find me the principal subspace for the data. And this is still PCA, right? And the output reconstructions are always going to lie on the principal subspace regardless of the input to the network. So here’s your poll. You don’t actually have the fourth option. OK, five seconds, guys. Does anyone want to answer the first question? True or false? Second one. Also, it’s right. An autoencoder with linear activations in the hidden layer performs principal component analysis of the input. And an autoencoder with linear activations in the hidden layer that has been trained on some data can only output values on the principal subspace, regardless of the input. So this is our terminology. The portion that reconstructs is what we will call the decoder. The portion of the network that computes this lower-dimensional representation is what we’ll call our encoder. So the encoder is the analysis network, which computes the hidden representation. The decoder is the synthesis network, which recomposes the data from the hidden representation. And what we’ve seen is the case where the hidden layer has linear activations and the decoder is just performing a linear combination. When that happens, the weights of the decoder network represent a principal subspace. And regardless of the input, the output is always going to be on this principal subspace. So there’s no guarantee, right, that the neurons will learn different principal components. If you have nonlinear activations, it gets a little more complex. In the case of linear activations, the error surface is well behaved, so together these guys will learn the principal subspace.
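To make the "output always lies on the principal subspace" statement concrete, here is a small sketch. It does not train anything: it takes the top principal directions from an SVD and uses them as a stand-in for the weights a trained linear autoencoder would span. The 5-D toy data, the choice of two hidden units, and the names `autoencode`, `mix` and `W_mixed` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical zero-mean 5-D data that varies mostly inside a 2-D subspace.
basis = rng.standard_normal((2, 5))
X = rng.standard_normal((1000, 2)) @ basis + 0.05 * rng.standard_normal((1000, 5))
X -= X.mean(axis=0)

# Top-2 principal directions as the rows of W (stand-in for trained weights).
_, _, vt = np.linalg.svd(X, full_matrices=False)
W = vt[:2]                                   # shape (2, 5): two linear hidden units


def autoencode(x, weights=W):
    """Linear autoencoder with tied weights: encode with W, decode with W transpose."""
    return weights.T @ (weights @ x)


# An arbitrary input that has nothing to do with the training data.
x_any = 10.0 * rng.standard_normal(5)
x_hat = autoencode(x_any)

# The output is already on the subspace: passing it through again changes nothing.
x_hat_again = autoencode(x_hat)

# Mixing the two hidden units by an orthogonal matrix gives different individual
# weight vectors but the same subspace, hence the same reconstruction.
theta = 0.7
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
W_mixed = mix @ W
x_hat_mixed = autoencode(x_any, W_mixed)

print("output unchanged by a second pass:", np.allclose(x_hat, x_hat_again))
print("same output after mixing neurons: ", np.allclose(x_hat, x_hat_mixed))
```

The second check illustrates why the hidden units together recover the subspace even when no single unit corresponds to an individual principal component.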
The individual neurons may not learn individual principal components. They may just end up learning some linear combination of the principal components. But together they will learn the principal subspace. That is if it’s linear. And when the network is linear, the output can only be on a linear subspace, right? But when the network is nonlinear, if I throw in a bunch of nonlinear activations in the decoder, what happens then? Then if you look at the relationship between the input and the output, it’s actually going to capture a nonlinear manifold. When everything is linear, the network is performing principal component analysis. When the hidden layers have nonlinear activations, they warp the surface, as we saw earlier. And so the network is going to end up performing nonlinear principal component analysis. So here, for example, if I have an encoder and a decoder of this kind, with nonlinearities, this decoder is only going to represent some nonlinear surface of this kind. And the encoder is going to capture some hidden representation, which is essentially some position on this nonlinear surface. And now as your network becomes deeper, with a more complex architecture, the nonlinear surface that the network can capture becomes a more complicated manifold. These are the deep autoencoders. So here are some examples. Here, I have data which are lying on a spiral. Now, clearly, there’s only one primary direction of variation. Guys, I’m going to go a little bit over. Please bear with me, five minutes over. There’s only one principal direction of variation. So if I were to ask you where the data lie, you’d say the different data points lie at different positions on a spiral. So if I were to train an autoencoder of this kind, where we’ve used ReLU or ELU activations, so the architecture is nonlinear, the hidden representation is just one variable, just a single neuron.
Then what we find is that the decoder ends up learning something like this spiral. So regardless of the input that you give the decoder, it’s going to generate something on the spiral. And so this network actually ends up learning the structure of the data. But it’s not so simple. Once I train the network, the decoder learns a spiral, but it’s not monotonic. So for example, if I take the input to the decoder over here and vary it from, say, 0 to infinity, or minus infinity to infinity, you would expect that the decoder monotonically traces out a spiral. That’s not what it does. What it does over here is that it generates the spiral up to here. But then instead of continuing this way, it jumps to the side. So these are just four points. And then it goes back this way, and then jumps back over here and continues here. Moreover, it stays on the spiral only within the region where it actually saw training data. If you give it Z values which were not seen in training, which correspond to hypothetical inputs over here, it doesn’t continue to generate the spiral, but sort of goes away. But at least within the region of the training data, it actually learns the spiral. Same thing over here. Here we have this data, and when you train an autoencoder, the decoder ends up learning to generate the sinusoid-like wave function, except, of course, when you give the decoder inputs that it never saw in training, it doesn’t continue the sinusoid. It sort of goes off along a line. So is it clear what’s happening over here? And the fact that the decoder of the autoencoder learns a nonlinear manifold, is this making sense to you guys? So what does that mean? When the hidden representation is of lower dimensionality than the input, we’ll often call this a bottleneck network. It’s a nonlinear PCA. It learns the manifold for the data, if properly trained.
So if I train this network on lots and lots of data from a specific source, it turns out that in the real world, data don’t lie scattered all over the space. When you take any particular source of data of a specific kind, the fact that the data are structured means that most of the data lie very close to some nonlinear manifold. And so when you train the network, the network actually learns the nonlinear manifold that the data lie on. And when it’s properly trained, the decoder can only generate data on the manifold that the training data lie on. And so this also makes it an excellent generator for the distribution of the training data, meaning once I train the network, because the decoder has learned the principal manifold of the data itself, regardless of what you give the decoder, it’s going to generate some data on this principal manifold, which you can expect will look like the training data. So for instance, if we trained our autoencoder on digit data and then took the decoder, regardless of the input that you give it over here, the output is going to end up looking something like a digit. It’s going to produce something that’s typical of the source. Here’s an example. In this case, we trained an autoencoder on spectrograms from saxophones. Then I just take the decoder and excite it with, in this case, a one over here and zeros here. Here is what the decoder outputs, if I can play it. It actually ends up sounding like something that could have come from a saxophone. Here’s something that I get when I give a different input to the decoder. It’s a bottleneck because the dimensionality of the hidden representation is lower than the dimensionality of the data. So that’s a bottleneck. But as you can see, the decoder is actually successfully learning the data manifold. For the saxophone, when I train it on the saxophone, regardless of what I give the decoder, it sounds saxophonish. Here I train it with a clarinet.
It’s not pure notes, but regardless of what I give the decoder once it’s trained, it’s producing clarinet-like sounds. So I’ll skip this poll. But actually, let’s just go through this. OK, guys, just go ahead and do this poll. It’ll take only a minute anyway. Just do this poll. 30 seconds for the poll. OK, five seconds, guys. OK. Is this first statement true or false? Second one. Third, and the fourth. Right. All of these statements are true. An autoencoder with nonlinear activations performs nonlinear PCA. It finds the principal manifold on which the training data lie. This may not be linear. The decoder of the nonlinear autoencoder can only generate data on this principal manifold, regardless of the input. And so the decoder can essentially be thought of as a dictionary which can only compose data like the training data in response to any input. And so I’m going to use the next four minutes of your time for a very cute application, signal separation. I’m given a mix of multiple sources. I want to separate out the sources. So here’s the problem. I have a recording which includes guitars and drums. I want to process it so that the guitar is separated from the drums. The standard approach here is something called a dictionary-based approach, where I learn a dictionary of building blocks for each source. So I’d have a collection of training data for, say, the guitar, and I learn a model which can only generate sounds that sound like the guitar. Then similarly, I’d have a collection of training data from the drums, and I’d learn a model which can only generate data that sound like the drums.
And now when I have a mixed recording, I’m going to try to figure out how to select entries from the guitar dictionary and entries from the drum dictionary so that when these are summed up, the result sounds like my mixed recording. Once I do that, the entries from the guitar dictionary, when recombined, give me just the guitar portion of the recording, and the entries from the drum dictionary give me just the drum portion of the recording. So that’s basically what I’m going to do. And for the dictionaries, I’m going to be using these autoencoders. I train one autoencoder on my first source and a second autoencoder on the second source. And now I know that the decoder of this autoencoder, this dictionary, can only generate sounds like the first source. And when I’m given my mixed recording, I’m going to say that there was some sound produced by this guy and some sound produced by this guy that, when added, gave me my mixed recording. But then how do I generate sounds from this dictionary? This dictionary has to be excited in some manner. So I’m going to say that there was some excitation with which I had to excite the decoder for the first source, and some other excitation with which I had to excite the decoder, the dictionary, for the second source, such that when I summed their outputs up, the result looked like my mixed recording. And so given my mixed recording, I can use backpropagation. These things are already fixed. These dictionaries are already learned. This has been learned from source 1. This has been learned from source 2. So now I can just take the mixed recording and use backprop to ask: what must the input to this guy be, and what must the input to this guy be, such that using these inputs, when I sum the outputs of these two networks, it looks like my mixed recording?
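The procedure just described, keeping both dictionaries frozen and solving for their inputs by gradient descent, can be sketched as follows. This is a deliberately simplified stand-in and not the networks from the lecture: the two "decoders" `D1` and `D2` are random linear maps rather than trained deep decoders, the signals are 20-dimensional vectors rather than spectrograms, and the gradient is written out by hand instead of obtained through backpropagation. With linear decoders of this size the separation is essentially exact; with real nonlinear decoders no such guarantee holds.

```python
import numpy as np

rng = np.random.default_rng(2)

dim, latent = 20, 3
# Frozen stand-in "decoders": each can only produce signals in its own subspace.
D1 = rng.standard_normal((dim, latent))   # dictionary for source 1 (assumed)
D2 = rng.standard_normal((dim, latent))   # dictionary for source 2 (assumed)

# Ground-truth sources, each generated by its own dictionary, and their mixture.
src1 = D1 @ rng.standard_normal(latent)
src2 = D2 @ rng.standard_normal(latent)
mixture = src1 + src2

# Unknown excitations, found by gradient descent; the dictionaries are not updated.
z1 = np.zeros(latent)
z2 = np.zeros(latent)

# Step size chosen from the largest singular value so the descent is stable.
step = 1.0 / (2.0 * np.linalg.norm(np.hstack([D1, D2]), 2) ** 2)

for _ in range(5000):
    residual = D1 @ z1 + D2 @ z2 - mixture      # how far the sum is from the mix
    z1 -= step * 2.0 * (D1.T @ residual)        # gradient of ||residual||^2 wrt z1
    z2 -= step * 2.0 * (D2.T @ residual)        # gradient of ||residual||^2 wrt z2

# Each decoder's own output is the estimate of its source.
est1 = D1 @ z1
est2 = D2 @ z2

print("error on source 1:", np.linalg.norm(est1 - src1))
print("error on source 2:", np.linalg.norm(est2 - src2))
```

The key design point carried over from the lecture is that only the excitations `z1` and `z2` are optimized; everything the decoders know about their sources was learned beforehand and stays fixed.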
Once I learn those inputs, the output of just this portion of the network is going to give me my first source. The output of just this portion of the network is going to give me my second source. Let’s see how well that works. So this is a mixture of two instruments, one of which is a string instrument. And these dictionaries were the decoders of autoencoders, each of which was five layers deep and 600 units wide. And here’s what it separates out from the mixture for the first source. And for reference, these are the two sounds we mixed. As you can see, it actually does, in this case, a near-perfect job of separating the two sources from this mixture. So the point over here is that the decoder of the autoencoder, the dictionary, learns the underlying structure, the underlying manifold, of the data. And therefore, when it’s properly trained, it can only generate data from that manifold, which, in this case, we used for the problem of separating sounds. So the story for the day is that classification networks learn to predict the a posteriori probabilities of classes. The network up to the final layer is a feature extractor that converts the input data to be almost linearly separable. The final layer is a classifier that operates on the linearly separable data. And neural networks can also be used to perform linear or nonlinear PCA, through autoencoders, which can be used to construct dictionaries for the data, which in turn can be used to model data distributions. We’ll focus more on this second topic next week. So I’ll stop here. I’ll take some questions. And I’ll also stop my recording. Any questions?