https://www.youtube.com/watch?v=yl8znINLXdg

S18 Lecture 20: Hopfield Networks 1

OK. Are we on? Are we not? Let me check if this works. Oh, perfect. Let me know when we are set. So, are we streaming? Yes. Perfect. You know, it's a sad state of affairs when we can actually fold 90% of the chairs. So, one of these ideas I've had is maybe I should not put the slides up on the web anymore. That way, if you want to see the slides, you have to watch the video, and if you want to answer the quiz, you have to see the slides. I'll put the slides up, you know, ten days later. Look at the class, there's nobody in the class. You guys, are we all set? All right. Morning, everyone. I'm still waiting for the day when the weather gods decide whether it's actually officially spring or still winter. It's not that cold. So, can the camera catch me here at this location? Do we have the other camera? Perfect.

All right. So far, we've been looking at networks that are strictly feed forward. What I mean by that is, any time an input went in, the processing was strictly unidirectional: no unit ever revisited an input. Even for recurrent networks, once an input went in, it just kept moving forward; a unit that had processed that particular input would not see that input again. Today, and for the next three lectures, we're going to move on to a different class of networks in which neurons actually revisit inputs. And you're familiar with this picture: the network on top is the standard feed-forward structure we've seen. The kind of structure we're going to be looking at is something like the one at the bottom, where the output of a neuron might actually go back into other neurons. Here's a simpler, less ugly, less complicated representation of the same network. In this example we have five neurons. I'm going to assume for the moment, actually for the rest of this lecture, that the activations of the neurons are threshold functions. Each neuron receives inputs from every other neuron, and if the weighted combination of the inputs from the other neurons plus the bias is greater than zero, the neuron outputs +1; otherwise it outputs -1. So every neuron receives input from every other neuron, and every neuron sends its output to every other neuron. Now, this is fundamentally different from everything we've seen. And specifically, we're going to look at a version of this network where the weights are symmetric. When every neuron talks both ways to every other neuron, you have a weight on the connection going from neuron i to neuron j, and also a weight on the connection going from neuron j to neuron i. We're going to set both of them to be the same. So these are symmetric networks; in other words, they are undirected. This is what is called a Hopfield network. So how does this network behave? If the weighted combination of everything that comes in is positive and the neuron's output is currently negative, the neuron is going to flip. If the neuron's output is also positive, it's not. So a neuron flips if the weighted sum of the other neurons' outputs has the opposite sign from the neuron's current output. If the two match, the neuron will not flip; otherwise it's going to flip. But what happens when a neuron flips? A flip has a ripple effect on the other neurons: it changes the total input seen by the other neurons, which may now be caused to flip.
And if those flip, guess what, everything is going to come back. Is this a PDF? Okay, good, because there are some animations there. So, if those flip, that may cause the first neuron to flip, and the whole thing can keep cycling. Here's how we can think about it. At each time, each neuron receives an incoming field, which is a weighted sum of the outputs of all the other neurons, except itself, plus a bias. If the sign of the field matches its own sign, the neuron does nothing. If the sign of the field is opposite to its own sign, it flips. I can write it like so: if the product of the output of the neuron and the weighted sum of the outputs of the other neurons (plus the bias) is negative, the neuron changes its sign. Again, these are binary neurons; the output is either +1 or -1. Very simple. That explains the logic of what's going on here, and we're going to keep seeing that equation, or some version of it, again and again. Here's an example. I'm assuming that all weights are plus or minus 1: the red edges are -1, the blue edges are +1. The yellow nodes are +1, the black nodes are -1. Now, say I started the network with this particular configuration: this yellow one is a +1, all the others are -1. Consider what happens at this neuron. It's getting a -1 from here; it's getting black times blue, so it gets two -1s. Wait, no, these two are going to cancel; I think I have my signs off, this should have been a +1. In that case, that neuron would flip. I have some issue with the colors on the edges, but basically what you will see is that, depending on the input, the neuron would flip. Because that neuron flips, some other neuron might flip, and because that neuron flips, this first guy might go back to black, and the whole thing can continue. So, if the sign of the field at any neuron opposes its own sign, it flips to match the field, which changes the field at other neurons, which may then flip, which may cause other neurons, including the first one, to flip, and so on. Here is a little animation of 20 evolutions of a small network with, I don't know, maybe 20 neurons; I don't remember the exact number, but you saw what happened. That was very brief, just 20 evolutions, so let's look at something slightly longer. This is 120 evolutions of a loopy network, and you can see what happens. You set the network to some pattern; each neuron sees some incoming field; if the field doesn't match its own sign, it flips; if it flips, it changes the field at other neurons, which might then flip, and so on. Question: does the state of the system depend on the order of processing of the nodes? The actual manner in which it evolves, well, yes and no, and I will tell you under which conditions it's a no. In general, the order might change the final outcome. And what do I mean by the final outcome? Let's ask this question: we just saw that the neurons are going to keep flipping. Will this behavior continue forever? Does anybody want to take a guess? Why? Suppose I give you two neurons and the connection between the two is -1. The point is, no, it will not continue forever. And why is that?
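(A minimal sketch in Python/NumPy of the single-unit update just described; the function name and array conventions are illustrative choices, not from the lecture. It assumes ±1 states, a symmetric weight matrix with zero diagonal, and a bias vector.)

```python
import numpy as np

def update_unit(y, W, b, i):
    """Asynchronously update unit i of a +/-1 Hopfield state vector y.

    W is assumed symmetric with zero diagonal; b is the bias vector.
    Returns True if unit i flipped.
    """
    field = W[i] @ y + b[i]        # incoming field at unit i (W[i, i] == 0, so y[i] contributes nothing)
    if y[i] * field < 0:           # the field's sign opposes the unit's current output
        y[i] = -y[i]               # flip to match the field
        return True
    return False                   # signs agree (or field is zero): do nothing
```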
And I will show you why that is so. Let's consider what happens at a particular neuron. The neuron has some current output; I will call that y_i^-. Then it senses the incoming field, which is this guy: the weighted sum of the outputs of the other neurons plus the bias, z_i = sum over j not equal to i of w_ji y_j, plus b_i. In response to the incoming field, it might or might not change its output. Let me call the output of the neuron after it has sensed the incoming field y_i^+. So y_i^- is before it has sensed the incoming field, y_i^+ is after. Now, if the sign of the neuron's output is the same as the sign of the incoming field, then y_i^+ equals y_i^-. So if I take the output of the neuron times the incoming field after it has sensed the field, minus the same value before it has sensed the field, then because these two outputs are the same, this difference is zero. This particular quantity has not changed when the incoming field matched the sign of the neuron itself. What if the field causes it to flip? If the sign of y_i^- does not match the sign of the field, then y_i^+ is going to be minus y_i^-, meaning the neuron flips. Now let us consider what happens to the same difference. I can group it as the field times (y_i^+ minus y_i^-). But y_i^- equals minus y_i^+, so this simply comes out to two times y_i^+ times the field. And what is that? What did the neuron do in response to the field? It changed its output so that its sign matched the sign of the field. So this term is always going to be positive. In other words, every flip of a neuron is locally guaranteed to leave unchanged or increase the product of the output of the neuron and the field at the neuron. Now consider that value summed across every single neuron in the network. If I consider that, I can write it simply as the weighted sum of the products: for every connection, I multiply the weight of the connection by the outputs of the neurons on either end, add it across all pairs, and also add the bias terms. This is the same definition as the one on top, except I am not double-counting: I sum over i and over j less than i, because the network is symmetric. Now, any neuron that flips because of its field causes a change in this value D, which is the sum over all neurons of the product of the output of the neuron and the field at the neuron. Let us say the k-th neuron responded to the field. What will the change in D be? If only the k-th neuron changes, the rest of the terms disappear, and all you are left with is a little term which looks exactly like what we saw for a single neuron in the earlier example, and this term is always positive. So every single flip of a unit, if you consider the units one at a time, causes an increase in D. And indeed, if I am processing the entire collection of neurons, eventually I can decompose that into the responses of individual neurons. So every single flip of a neuron results in an increase in D. Now, can this go on forever? It cannot. Why?
So, here is my value of D, which is simply the sum over all the neurons of the output of the neuron times the field at the neuron. And it is bounded, because the y values are all plus or minus one. In other words, the largest value this can take is simply the sum of the magnitudes of the weights plus the sum of the magnitudes of the biases. Now, when I have an upper bound on some value, and I give you a process which always increases this D, what can I say about how many increments I can actually take as the system evolves? Unless each increment is infinitesimally small, there is going to be a finite number of steps before I hit the ceiling. So all I have to show is that each increment is not infinitesimally small. What is the smallest value of an increment in D? We saw on the previous slide what the increment in D looks like. The magnitude of y is just one, so I only have to consider what the smallest possible value of the field term is going to be. If I take the smallest possible nonzero value of that term across all combinations of y, that is the smallest increment in D when a particular neuron flips. In other words, there is a lower bound on the increment per evolution, and there is an upper bound on this D value as the system evolves. So there is an upper bound on how many times these neurons can flip before the system stops evolving: any sequence of flips must converge in a finite number of steps. So if I take this network where every neuron connects to every other neuron, set all the units to some initial values, and just let it evolve, it is going to keep evolving, but eventually it is going to stop at some point. This is a guarantee — a very simple guarantee. Now, this is a Hopfield network. So here is how we are going to define this whole process. I am going to define the energy of this Hopfield network. I kept calling this value D; we have a name for its negative: we will call the negative of D the energy of the Hopfield network, and we will see why we use the term energy in a bit. So the evolution of a Hopfield network monotonically decreases its energy — I would not say constantly, but monotonically. And yes, I have an update from Windows. So, where did this energy term come from? We just explained this whole network structure; where did this energy term come from? How many of you were actually here for Gerald Stark's talk? You saw the video? Nobody. So, he used a magic word: he mentioned something called an Ising model. Anybody familiar with the term? No. So, where did the term energy associated with this whole thing come from? Any questions about what we have discussed so far — about the nature of the network, how it evolves, its overall behavior, why it must converge, and where this objective function actually came from? We used this objective function to show that there is a property of the network that must converge as the network evolves, and that the convergence must happen in a finite number of steps. That property is simply this energy term; I am calling it an energy. There is a lower bound on the energy, and each change decreases it by at least a minimum amount, which means that the number of steps of evolution is bounded.
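(A hedged illustration of the convergence argument: a toy simulation with randomly chosen symmetric weights, not anything from the slides. It computes the energy E(y) = -1/2 Σ_ij w_ij y_i y_j - Σ_i b_i y_i and checks that asynchronous flips never increase it.)

```python
import numpy as np

def energy(y, W, b):
    """E(y) = -1/2 y^T W y - b^T y for a symmetric, zero-diagonal W."""
    return -0.5 * y @ W @ y - b @ y

rng = np.random.default_rng(0)
n = 16
A = rng.standard_normal((n, n))
W = (A + A.T) / 2                    # symmetrize the random weights
np.fill_diagonal(W, 0.0)             # no self-connections
b = rng.standard_normal(n)
y = rng.choice([-1.0, 1.0], size=n)  # random initial +/-1 state

prev = energy(y, W, b)
for _ in range(100):                 # sweep until nothing flips
    flipped = False
    for i in range(n):
        if y[i] * (W[i] @ y + b[i]) < 0:
            y[i] = -y[i]
            flipped = True
            e = energy(y, W, b)
            assert e <= prev + 1e-9  # each flip never increases the energy
            prev = e
    if not flipped:
        break                        # converged: a local minimum of the energy
```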
Now, this energy term actually comes from an analogy to physics. In fact, Hopfield's original network was, I think, derived from this analogy, and that is from something called spin glasses. The term glass is used rather freely here; we are not really speaking of glass, we are speaking of magnetic materials, which are known to have a phase transition where you can go from ferromagnetic to paramagnetic at some point. So, think of it this way. You have a lot of magnetic dipoles in a medium and they are disordered. Now, let us consider the case where the orientation of each dipole can take only two directions. Every dipole is going to try to arrange itself to be aligned with the local field at the dipole. And the local field at the dipole is the effect of two influences: the arrangement of all of the other dipoles in the medium, which all contribute to the local field, and any external magnetic field that you might impose on the material. Now, if the dipole senses a local field pointing one way while the dipole is currently pointing the other way, it is going to flip to align itself. And if it flips, this has an effect on all the other dipoles: it changes their fields. When it changes their fields, some of them will flip, and when they flip, that has an effect on the current dipole. The whole process keeps evolving. This is a natural physical phenomenon, and the kinds of materials, the kinds of media, in which this behavior can be observed are sometimes called spin glasses. So, here is how we can think about it. There is a total field at the current dipole i, which is the sum over all of the other dipoles of an interaction term J times the spin of the other dipole, plus the external field. The interaction term J captures how much a dipole that is not at the current location influences the current location. It depends on many different things, including how far away the other dipole is, what else happens in between, the geometric arrangement, the properties of the medium, and so on; it is specific to that pair of dipoles. So every other dipole influences the field at the current location, and then you have an external magnetic field, and all of these together create the total magnetic field at the location of the current dipole. The field at any dipole is the sum of the field contributions of all the other dipoles, and the contribution of a dipole to the field at any point depends on the interaction J. This model was originally proposed by a man called Lenz in 1920, and it was worked out by a fellow called Ising in 1924, in one dimension, as part of his PhD thesis. He came to the conclusion that in one dimension nothing very interesting happens; he wrote a paper about it, found the whole thing so boring that he gave up physics, went away and became a school teacher. This was 1924. Twenty-five years later he happened to open a physics journal and found that he was being cited all over the place. Anyway, 1949. Amazing.
So, the overall model — we call it the Ising model after the man, and spin glasses are a specific instance where the Ising model applies — is this: the total field at the current dipole has the value on top; the response of the current dipole is that it remains the same if the field does not tell it to flip, otherwise it flips. This is physics. So dipoles will keep flipping. A flipped dipole changes the fields at the other dipoles, some of which will flip, which will change the field at the current dipole, which may flip again. It looks just like your Hopfield model, does it not? And specifically, when will this stop? You can define — again, this is a term that you might have heard Gerald bring up — a Hamiltonian of the system. The Hamiltonian is nothing but the total energy of the system. It has two components: the potential energy at any dipole, which is the value of the dipole times the field at the dipole — if you have gone through your first class in physics, you are familiar with this formula — plus whatever energy it has due to the external field. And what happens to a system when you set it at any particular energy? It is going to keep losing energy until it comes to some steady-state configuration. The steady-state configuration is going to depend on the temperature, the external field, and so on. So this Hamiltonian measures the total energy of the system, and the system is going to evolve to keep reducing its energy until it cannot reduce the energy anymore. And again, here is the kind of behavior you expect to see. You have a state for the system. I am drawing it as a continuum; clearly it is not a continuum, it is a discrete thing. You have a bunch of dipoles, and depending on how many dipoles you have, the number of possible states is going to be 2 raised to n if you have n dipoles, but I am drawing it as a continuum. For each state, each configuration, you have a total energy, the Hamiltonian, and you can plot it. If you set the system at any particular value, the system is going to evolve until it hits a valley, at which point it cannot go down any more. You might have multiple valleys, but it has got to evolve until it hits a valley. This is the standard evolution of the system. And if the system is in some stable configuration and you jitter the system — flip a few dipoles — what will happen? It is going to start evolving again until it goes back to exactly the same steady state. So, in other words, the system remembers the stable state, and if you jitter it away from the stable state, it is going to come back to the stable state. Now, according to Hopfield, this has a very nice analogy to human memory, and we will see how. So the Hopfield network looks just like that; it has exactly the same properties. If you thought of each of these neurons as a dipole, it has an influence on every other neuron, you have a total field, and the neurons are going to align themselves to the total field. You can define the energy of the entire network analogously to the Hamiltonian of the spin glass, and the system is going to evolve until the energy hits a local minimum. So, now, when I wrote it this way I included the bias term, which in the physics world would be the external magnetic field. Typically, when we design the computational equivalents, Hopfield networks, we do not explicitly use biases, though in some of my equations I actually want to keep them. Removing the bias does not affect the overall interpretation of the network, but I am not going to kill the bias entirely; it will come back into the discussion. Anyway, here is how you would expect the network to behave: you would have some initial state, and it is going to have some energy.
And if I let the system loose, the state is going to keep evolving until it hits some configuration where any further evolution is not going to decrease the energy of the system, at which point it will stop evolving. So the network evolves until it arrives at a local minimum of the energy contour. This is what we will call a content-addressable memory. You may have heard this term, CAMs, content-addressable memories. What is the standard format for memory when you store things in your computer? It is location-addressable: memory is addressed based on location. In this grid of memory locations, each location has a specific address; if you want to retrieve a particular element, you have to pull up the location and say, give me whatever is stored at this specific address. So that is location-addressable. The flip side is something called content-addressable memory, where you partially remember patterns, and when you start off from the partially remembered pattern, you eventually invoke the entire memory. Now, human memory works in somewhat the same manner, to some extent. If you are walking somewhere and you smell something, that invokes some pattern, and then you suddenly remember things that happened long ago. More generally, if you think about how you actually recall things, it starts with a partial memory which sort of evolves into a complete recall of whatever it is that you were trying, consciously or unconsciously, to recall. So it is content-addressable. This Hopfield network is a content-addressable memory. If I think of these local minima as the patterns that the network stores, then if I set the network to some pattern which is somewhat close to one of them, it is going to evolve until it arrives at the stored pattern. I give it a corrupted version of the content, and it comes down and gives you the actual value. So it is a content-addressable memory; it is also called associative memory. So here is how it works: you define the energy of the network, and the energy is going to have some contour like this. And you will see in the next class why all these figures are bogus, they are rubbish, but anyway. If I start the system at some point, it begins evolving until it arrives at the closest local minimum of the energy, and then it stops there. So you might think of this as partial content that you have set the network to, and the system evolves until it recalls the entire content. So it is content-addressable. Now, since we have proved that every change in the network is guaranteed to result in a decrease in the energy of the network, the path from wherever you are to wherever you are going is going to be monotonic. This is to answer the earlier question about the order of computation. The path is monotonic, but that does not mean the path is the same: I could be going down like this, or I could be going down like that, depending on the order in which you process the neurons, but you are going to end up at the same point. Now, I have used threshold activations in my network; we do not need threshold activations, you could use other activations.
If I use threshold activations, and I define the outputs of my units as being strictly +1 or -1, then it is the equivalent of saying that I have a hypercube in n dimensions, and every possible configuration of the network is one corner of this hypercube. Some small subset of these corners are the memories that the system will recall if I start off somewhere in the vicinity of one of them and let the system evolve. Questions? Anyone on the streaming, any questions? So, instead of the threshold activations, if I use tanh activations, I am going to get something that is not just defined on the corners of the cube; it is going to be continuous over the whole thing. So, now, mathematically, if I want to write this energy term on top, I can write it in matrix form like so: it is minus one-half y-transpose W y, where W is the weights matrix. It is an inner product, the weighted norm of the state vector. Why the minus half here, if you compare it to the equation right there? Because in the summation I wrote earlier, I sum over j less than i, so I am not double counting, whereas in y-transpose W y every pair appears twice; to be consistent I need the one half. And what are the diagonal elements of this weight matrix going to be? They are all going to be zeros, because no unit connects to itself. So, here is an example of the kind of energy contour that you might see. This is an example where I have only two neurons, and the two neurons are connected to each other by some weight w12. The scales seem to be somewhat off, but as the outputs of the neurons change, the overall energy of the system changes. And you see that there is a minimum where the energy is lowest, but it is not unique — there are two minima; the structure seems to have two bowls. Why are there two minima? Ignore the axes for now. Go back and look at this equation: if I have y-transpose W y and I replace y by minus y, does the energy change? No, because the two minuses cancel out. So, for any pattern that is a memory, if I flip the sign of every single bit in the pattern, that too is going to be a memory. That is why it has two minima in this case. It is very straightforward: minus half y-transpose W y is the same as minus half (minus y)-transpose W (minus y). So if y-hat is a local minimum, so is minus y-hat. There are some bogus bits in the figures, because some of them are pulled in maybe out of context; I will let you know when. Now, here is a little three-neuron network. It has eight possible states, because each neuron can be -1 or +1, and there are only two stable states. What do I mean by two stable states? These are the two local minima: one state over here and the other state over there. And if I start anywhere, I am going to end up at one of these two stable states. Now, here is an example of content-addressable memory. So, how exactly does it work? I have trained a Hopfield network on a picture; every pixel is +1 or -1, and I have trained the network such that this pattern is the pattern at which the network has the minimum energy.
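(A quick numerical check of the sign-flip symmetry just mentioned, with made-up weights and the bias dropped so the symmetry is exact: the energy E(y) = -1/2 y-transpose W y is unchanged when every bit is flipped, which is why every minimum has a "ghost".)

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
A = rng.standard_normal((n, n))
W = (A + A.T) / 2            # symmetric weights
np.fill_diagonal(W, 0.0)     # no self-connections

def energy(y, W):
    """Bias-free Hopfield energy: E(y) = -1/2 y^T W y."""
    return -0.5 * y @ W @ y

y = rng.choice([-1.0, 1.0], size=n)
# Flipping every bit leaves the energy unchanged, so E(y) == E(-y).
assert np.isclose(energy(y, W), energy(-y, W))
```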
Now, if I initialize the network with this guy, which is pretty crappy, the system evolves and ends up over here; it has basically recalled the pattern. Similarly, here is another example, this one is for Chip 'n Dale. The network has been trained such that this Chip 'n Dale pattern is the memory that it is going to recall, and then, if you initialize the network with this half-erased picture, it evolves — the animation is not exactly what it is supposed to be — and gives you this guy. So you can see how the whole thing works. Can we play the video here? This is another example I got from YouTube, where they are trying to store different patterns, and then, when they initialize it with different, very noisy patterns, you see how it ends up recalling one of these guys. It is actually rubbing things out; it is pretty cool. Now, you will observe that he has managed to store multiple patterns. Why would you expect that to happen? Why would you expect the network to have multiple memories that it could recall? Because the energy contour has multiple local minima, so you would expect it to have multiple values that it could recall. So, the computational algorithm, assuming that you have this — yes? [A student asks about the little spots of white that show up in the recalled patterns.] Yes, we will get to that, and it is a huge problem. I have almost trivialized the explanation that I have given, because reality is much more complex, as you will see in a few slides. So, assuming that the network has been trained, here is the overall algorithm you would have, as sketched below. You initialize the network with some initial pattern. Then, until convergence, you cycle through each of the neurons and check the incoming field. If the incoming field says the neuron must flip, it is flipped; otherwise it is not. You keep continuing until the whole thing stops changing. And updates can be done sequentially, one at a time, or all at once, meaning you take the current state of the network, test every neuron, decide whether every neuron should be updated, and repeat the process. You would expect everything to eventually end up at the same position, regardless of how you do it. Now, yes? [A student asks how many patterns the network can store — what its capacity is.] Okay. So, I just gave you some examples where the network remembered some patterns. How did we make it remember those patterns, and how many patterns can we actually remember? This is, to answer your question, something we will look at in some detail. And there is a third question that we can ask: can I do something to make this whole business of retrieving the correct memories even better? And this relates to the other question: what were those little white dots? First, how do we make the network store a specific pattern, or a specific set of patterns? We are not going to get much beyond the basics here. So, how do I teach the network to remember, say, this little image from Looney Tunes?
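(A sketch of the recall loop just described, using sequential asynchronous updates; the cap on the number of sweeps and the function name are my own choices, not from the lecture.)

```python
import numpy as np

def recall(W, b, y_init, max_sweeps=100):
    """Evolve a +/-1 state until no unit wants to flip (or max_sweeps is hit).

    W: symmetric weight matrix with zero diagonal; b: bias vector.
    Returns the converged state.
    """
    y = y_init.copy()
    for _ in range(max_sweeps):
        flipped = False
        for i in range(len(y)):          # sequential update, one unit at a time
            field = W[i] @ y + b[i]
            if y[i] * field < 0:
                y[i] = -y[i]
                flipped = True
        if not flipped:                   # a full sweep with no flips: converged
            break
    return y
```

A synchronous variant would instead compute every field from the current state and then update all units at once before repeating.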
Now, for an image with n pixels, obviously we are going to need a network with n neurons: every pixel is one neuron, because you are storing the bits; it is content-addressable. Every neuron connects to every other neuron, and we are assuming that the weights are symmetric, so how many weights will be required? Roughly n squared over 2 — more precisely, n times (n minus 1) over 2, since we can knock off the diagonal terms; it will be a few fewer. Now, we must keep in mind that any network that stores a pattern also stores its bit-reverse. In other words, in this example, if you stored this guy, you would also automatically be storing the exact inverse, where all the blacks are white and the whites are black, and from our perspective they are both the same thing; we shouldn't care. That is because the energy at P is going to be the same as the energy at minus P, and this is how we have defined the energy, if you will recall. Now, a network can store multiple patterns if the energy contour has multiple minima. Every stable point is going to be a stored pattern, so we could design the net to store multiple patterns. And remember that every stored pattern is actually two stored patterns, P and minus P. Now, before I even continue, those of you who are mathematically oriented should be screaming at me. Why? Let me go back a few slides — going back many, many slides — look at this equation over here, and I have just told you that it has multiple local minima. So why are you not screaming at me? What kind of a function is it? Yes, it is convex; it is more than convex. What kind of a function is it? (Do you mind shutting the laptop? You back there? Can you shut the laptop? Yeah.) So, what kind of a function is this? It is quadratic. How many minima does a quadratic have? One. So what do I mean by saying I have multiple local minima and drawing all of these beautiful plots with multiple local minima? Obviously it does have them; the entire field hasn't been lying to you. But then the math tells you it shouldn't be the case. Something is off. Can anybody take a guess as to what is off? You know the answer: it is a quadratic, so there is only one minimum; there are no multiple local minima. Anyone? Okay, we will get to that in the next class — a bit of suspense. But here is our task: we want to design the weights matrix W such that this energy has local minima at the desired patterns that we want to store. Now, consider the case where I have only one pattern that I want to store. I want to store the Looney Tunes picture; that is it, I do not want to store anything else. In this case, I want to design the W_ij such that the energy is minimum for the Looney Tunes pattern. So what does it mean for that pattern to be a local minimum? You want the sign of the weighted combination of the outputs of all the other neurons to be the same as the target sign of the neuron itself, for every single neuron. This is the objective we need. If this is satisfied, no bit is going to flip. That is a stable pattern, correct? Well, a stationary pattern — there is a difference between stationarity and stability, and we will get to that as well. So, we are going to use Hebbian learning. Anybody remember what Hebbian learning was? Way back when.
So, what was Hebbian learning? We simply took the product of the input and the output, and we incremented the weights by some scaled version of that product, because we are saying that for every input, the output must be aligned with the input. That was the basic Hebbian learning rule. So, in our case, Hebbian learning is going to say that the weight that connects neurons i and j is simply the product of the outputs of neurons i and j. And this makes sense: if y_i and y_j have different signs, then basically you want the weight to be negative, because if y_i is -1 and the weight is -1, then y_j sees -1 times -1, which is +1, so y_j will not flip — and vice versa. Whereas if y_i and y_j have the same sign and the weight were negative, the field would oppose them and cause them to flip. So it makes sense that each weight must simply be the product of the two neurons that it connects. This is basic Hebbian learning. So, here it is. Now, let us see if this Hebbian learning actually works. This is very trivial: I am trying to store one pattern, and I took a very simple rule. I am going to connect every neuron to every other neuron, and the way I set the weight is that I simply set it to the product of the two bits that I want to store from the target pattern. So, will this really work? Let us check. Consider the total field at any neuron. The total field is going to be the weighted sum of the outputs of all the other neurons, and its sign has to match the sign of the current bit itself. But the weight was set so that w_ji was simply y_j times y_i. So the field is the sum over all j not equal to i of y_j times y_i times y_j, which is y_j squared times y_i. And what is y_j squared? It is always one. And of course, y_i is not something that I am summing over, so it comes out of the sum. So this is some positive constant times y_i. What is its sign? The same as y_i. So, if I use the simple Hebbian rule, you are guaranteed that the pattern is stationary. What do I mean by stationary? If I set the network to that pattern, it is not going to begin evolving. Now, here is another thing. The pattern is stationary; let us check something else. What will the total energy be if I actually use the Hebbian rule to learn the network? It is going to be the negative sum over all pairs of neurons of the product of the bit values times the weight. But if I use the Hebbian rule, w_ji is simply y_j y_i, so each term becomes y_i squared times y_j squared, and all of these are one. Now, within this setting, what is the smallest value that the energy can possibly take? The smallest possible value the energy can take is when every one of those summed terms is one, and that is exactly what we are getting. So, in other words, this is the global minimum of the energy as well, which means this pattern is stable. What do I mean by stable? If I flip a bit, it is going to come back. If I flip four bits, it is going to come back. So this pattern is stable.
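(A minimal numerical check of the single-pattern Hebbian rule just described; the pattern itself is arbitrary and made up for illustration. With w_ji = y_j y_i and a zero diagonal, the field at every unit has the same sign as the stored bit, so the pattern is stationary.)

```python
import numpy as np

p = np.array([1, -1, 1, 1, -1, 1], dtype=float)  # an arbitrary pattern to store

W = np.outer(p, p)            # Hebbian rule: w_ji = y_j * y_i
np.fill_diagonal(W, 0.0)      # no self-connections

field = W @ p                 # incoming field at every unit when the net is set to p
assert np.all(np.sign(field) == np.sign(p))   # no unit wants to flip: p is stationary
```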
So, we have shown that the Hebbian learning rule actually gives you a network which stores the pattern that you want to store, for one pattern. And let us try to visualize this. So, here is a little four-bit pattern. I have two of the bits shown on the x axis, two of the bits shown on the y axis. The pattern that I am trying to store is this guy, which is 1, -1, 1, 1. If I use the Hebbian learning rule to actually learn the weights and then compute the energy for each bit pattern — how many bit patterns do we have? We have 16 bit patterns — I can draw them as a map. Remember the Karnaugh map from lecture 2: this edge connects to that edge, the bottom connects to the top, so the whole thing is a toroid. I have drawn this as a map, and I am plotting the energy for every single bit pattern on this map. This is what the energy contour looks like, and observe that it has a minimum at the pattern that we want. It also has a minimum at the flip, where every sign is reversed, which is what I will call its ghost. Also observe the gradation of these energies: no matter where I begin, if I start at this point, it is either going to travel this way or this way — this goes back to the earlier answer about ordering. If you are on the ridge, you could go either way, depending on the order in which you begin processing things. And once it goes here, there is only one way for it to go; it cannot go back up, because the energy must keep decreasing. So it comes here. So this is stable: regardless of where I begin, I am actually going to end up at the pattern that I am trying to store, or its ghost. So this works. Now, if I am trying to store multiple patterns, more than one pattern, can I use the same logic? I can still use Hebbian learning. If I am using Hebbian learning, what I will do now is use the same rule, except that I am going to sum over all of the patterns that I am trying to store, as in the sketch below. Very simple: I have a set of patterns that I am trying to store, and for each pattern I take the product of the two pixels which I am connecting with this weight, and I sum over all of the patterns. I could average it too; it does not really matter, the constants do not matter, we only worry about signs. And that is going to be my weight. So, if I use this kind of storage rule — we have shown that we can definitely store one pattern; it is stable, it is stationary, one pattern can definitely be stored — how many patterns can I store? Now, John Hopfield came up with this arithmetic, which said that if you have n bits, you can store up to about 0.15 n patterns. Keep in mind that there is a difference between storing patterns and having stationary patterns. In other words, when I store patterns, these are intended patterns, the patterns I want to store. But in the process of storing those patterns, I may end up with a network which also remembers other spurious, bogus patterns which I do not want to remember at all. These are what we will call parasitic memories. But for target patterns, John Hopfield claimed that you can store up to about 0.15 n patterns, provided they are far enough apart. Where did this number come from? Take a look at the following. Let us say I want to store K n-bit patterns of this form. This is just the representation: y^k is the k-th pattern.
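(A sketch of the multi-pattern Hebbian rule just described: sum the outer products of the stored patterns and, as in the lecture, optionally scale by 1/N. The function name and the example shapes are illustrative assumptions.)

```python
import numpy as np

def hebbian_weights(patterns):
    """patterns: array of shape (K, N) with +/-1 entries.

    Returns W[i, j] = (1/N) * sum_k y_i^k * y_j^k, with zero diagonal.
    The 1/N scaling is only for convenience; only the signs matter for the dynamics.
    """
    K, N = patterns.shape
    W = patterns.T @ patterns / N     # sum of outer products, divided by N
    np.fill_diagonal(W, 0.0)          # no self-connections
    return W
```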
When I represent it explicitly as a bit string, I have moved the k to the superscript, just because I need space to mark everything, and the subscript represents the bit index. So it is y^k_1 through y^k_N: there are N bits in the pattern, and there are capital K such patterns that I want to store. And let us say I am going to use Hebbian learning. Remember, I mentioned that with Hebbian learning you can scale the weights; it does not really matter, all you care about is the sign. So I am going to scale the weights just for mathematical convenience: each weight is going to be the sum over all patterns of the product of the two bits, normalized by the length of the pattern, a factor of 1 over N. The 1 over N is irrelevant for the actual operation of the network. (And on the slide, that is not a minus sign; the factor is 1 over N.) Now, for any pattern y^p to be stable, this has to hold: the product of the field at each neuron and the value of the neuron itself must be greater than 0; otherwise the bit will flip. So for the pattern to be stable, this must hold for every bit, and plugging in the value for w_ij from the Hebbian rule, I get this expression: for each pattern, for each bit, the value of the bit times 1 over N times the sum over all patterns of the products of the bits, times the corresponding j-th bit, summed over positions, must be positive. The actual arithmetic is not the point; it is just useful for the analysis. Now, observe that when I computed the weight, I actually computed it through Hebbian learning by summing over all the patterns I am trying to store. Let me separate that into two components. If I am trying to analyze the stability of the p-th pattern, let me separate out the contribution of the p-th pattern and everything else. This guy here is the contribution of the p-th pattern: if I set the network to the p-th pattern, this is what the p-th pattern contributed, when I was learning the weights, to the overall field. This other guy is the contribution of all of the other patterns. Again, the details of the arithmetic are not so important, except that here I am multiplying y_i^p, the bit whose stability I am testing, in red, by the field, which is the weights times the bit values at each of the other positions summed over all positions, and the weight I separated out into the contribution of the p-th pattern and the rest. Now, look at the first term. In the first term I have y_j^p times y_j^p, whose product is always 1, and similarly y_i^p times y_i^p, whose product is always 1. So the term I am summing simply becomes 1; I am summing N of those terms, so the sum becomes N, I divide by N, the N goes away, and the first term is simply 1. So the only thing I have to worry about is the second term. In other words, for the i-th bit to be stable, I want the second term to be greater than minus 1. If the second term is less than minus 1, then the whole expression turns negative and the bit is going to flip. So, what is the probability of that?
So, in other words, I want this term to be greater than minus 1 for stability; conversely, the bit is unstable if this term is less than minus 1. That is the criterion for the bit to be unstable. Now, the purpose of writing all of this little arithmetic out is simply to show that all I am really testing is whether the sum of a certain number of binomial-like variables is less than minus 1. And this can be tested easily enough: it is the sum of a bunch of variables, all of which take value plus or minus 1 with probability 0.5. If you work out the arithmetic — let me just write this whole crosstalk term as C — it turns out that for a large enough N, and when you are trying to store a large enough number of patterns, the whole thing can be approximated by a Gaussian with zero mean and variance K over N, where K is the number of patterns you are trying to store and N is the number of bits in the pattern. So you really want to ask: for this Gaussian random variable, what is the probability that a random draw is less than minus 1? If you work through the arithmetic, it tells you that if you randomly choose a collection of K patterns and try to store them, and you want to guarantee that the probability that any single bit is unstable is less than 0.004 — just an arbitrarily chosen number — then you want K over N to be less than 0.14. So this is where Hopfield's equation came from, and where he said that a network of N neurons trained by Hebbian learning can store up to 0.14 N patterns with low probability of error, meaning the probability that any single bit will flip is less than 4 in 1,000, 0.004. All of this is computed using bit probabilities of 0.5: the entire analysis assumes that the patterns you are trying to store are very far apart, meaning that within any pattern, the probability of any one bit being +1 is 0.5 and the probability of it being -1 is 0.5. In that case, using Hebbian learning, you can store up to 0.14 N patterns. What behavior would you expect if the patterns were less far apart, that is, if the probability of any bit taking the value 1 was different from 0.5? In that case, the target patterns are closer, because they are not perfectly random. So what would you expect the behavior of a network trained with Hebbian learning to be on those patterns? Would you expect to be able to store more patterns or fewer patterns? Fewer, presumably, because they are closer and you would expect them to be more confusable. So let us see what really happens. Here is the example again where I am trying to store only one pattern with simple Hebbian learning — not exactly the same one we saw, but we see that for every pattern there is a shadow pattern, its ghost, which is what you get when you flip all the signs. Now consider this business of trying to store multiple patterns. I will say two N-bit patterns are orthogonal if the two of them differ in exactly N over 2 bits. That is the largest distance between two patterns.
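(To reproduce the back-of-the-envelope capacity number: this is just the Gaussian approximation described above, not an exact result. With crosstalk variance K/N, the probability that a single bit is unstable is P(C < -1), and at K/N = 0.14 this comes out close to 0.004.)

```python
import math

def p_bit_unstable(K, N):
    """P(C < -1) for a zero-mean Gaussian crosstalk term with variance K/N."""
    sigma = math.sqrt(K / N)
    return 0.5 * math.erfc(1.0 / (sigma * math.sqrt(2)))   # Gaussian tail below -1

print(p_bit_unstable(14, 100))   # K/N = 0.14  ->  roughly 0.004
```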
For us, a pattern is the same as the negative of the pattern. So if I give you two patterns, what is the largest possible distance between them? It is N over 2, which means half the bits match and the other half do not, randomly: for every bit you are basically flipping a coin to decide whether the two patterns match or not. If you compute the inner product between two such patterns, you will find that the inner product is 0, so the patterns are orthogonal. So two patterns that differ in N over 2 bits are going to be called orthogonal. And for any N which can be written as 2 raised to m times l, where l is an odd number, you can show that there are at most 2 raised to m mutually orthogonal binary patterns. We will not work through that arithmetic, but pretty much any number on earth can be written in this manner, where l could be 1 and m could be 0. That is how many orthogonal binary patterns you are going to be able to find. So now let me try to store two orthogonal four-bit patterns. The two patterns are 1, 1, -1, -1 and 1, 1, 1, 1, and you can check that the two are orthogonal: if I multiply them component-wise and sum, the sum is 0. Now I am plotting the energy contour, meaning the energy of the network for different configurations of four bits, and you can see, sure enough, that these four guys have the lowest energy — but the patterns are perfectly confusable for recall. They are as far apart as possible, and yet if I start at any point, there is no saying which pattern I will end up recalling. If I flip the first bit, I am going to end up here; if I flip the second bit, I am going to end up there. So the probability of recalling the correct pattern from any initial location is just 0.5. So although the patterns are nicely stationary, as far as recall is concerned, even if you are only one bit away, observe this: if this is the pattern I am trying to remember and I am out here, I could flip a bit and go to that guy, which is the ghost of this other fellow. So even if only one bit has changed from your pattern, your ability to recall anything is essentially random. While you are able to say whether something is a valid memory, you cannot recall anything — and these are two completely orthogonal patterns I am trying to remember, as far apart as possible. Now let me take two non-orthogonal patterns. These two patterns are no longer orthogonal: this is 1, -1, -1, -1, and this is 1, 1, 1, 1. If you compute the inner product between the two, you will find that it is not 0. In this case, here is the energy contour, and something surprising happens: not only are you able to remember both patterns, you can recall them perfectly. If you start from here, you end up here; if you start from there, you end up there. You are actually able to recall the patterns, although they are closer together than in the orthogonal case. So it is kind of counterintuitive: if you try storing patterns which are really far apart, you are less able to recall them than when they are less far apart. But then something else comes in: what is 0.14 times N here? N was 4, so 0.14 times N is less than 1. Using Hopfield's formula, you would say that you cannot store even one pattern using 4 bits.
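(A hedged re-creation of the four-bit example above, with the weight computation inlined: enumerating all 16 states shows the two stored orthogonal patterns and their ghosts sitting at the lowest energy, even though recall from a one-bit perturbation is ambiguous.)

```python
import itertools
import numpy as np

p1 = np.array([1.0, 1.0, -1.0, -1.0])
p2 = np.array([1.0, 1.0, 1.0, 1.0])
assert p1 @ p2 == 0                              # the two patterns are orthogonal

N = len(p1)
W = (np.outer(p1, p1) + np.outer(p2, p2)) / N    # Hebbian rule summed over both patterns
np.fill_diagonal(W, 0.0)                         # no self-connections

def energy(y, W):
    """Bias-free Hopfield energy."""
    return -0.5 * y @ W @ y

# Enumerate all 16 four-bit states and sort by energy.
states = [np.array(s) for s in itertools.product([-1.0, 1.0], repeat=4)]
for y in sorted(states, key=lambda s: energy(s, W)):
    print(y, energy(y, W))   # p1, p2 and their ghosts come out at the lowest energy
```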
Yet I have shown you an example where I can actually store two patterns and they are recallable. So all patterns are local minima; they are stationary and stable. But when you recall from a perturbed pattern, something funny is going to happen. (I do not know what that text overlay says — can we get rid of this text? It keeps moving.) So here, when I try to store three orthogonal four-bit patterns, again all three are local minima, as you will find, but you cannot really recall any of the three. Whereas if I use non-orthogonal four-bit patterns, again all three are local minima, but some of the patterns can be recalled — in fact, in this case all of the patterns can be recalled, in the sense that if you start close to a pattern, the natural evolution of the system will go into that pattern. If I try to store four orthogonal four-bit patterns, all of the patterns are stationary, but guess what: every pattern is stationary. Whatever I set the initial value of the network to, it is not going to change. So you do not remember anything. And if they are not orthogonal, then there is at least one stable pattern that you can recall, although the rest cannot be recalled. So, Hopfield said that for a network of N neurons you can store up to 0.14 N patterns, and this is a fuzzy statement, because clearly his analysis assumed that the patterns were as far apart as possible, the probability of any bit being 1 was 0.5, whereas if I give you non-orthogonal patterns, you are actually able to store more in many cases. So his worst-case analysis does not quite give us the behavior that we actually see. And it is a fuzzy statement in another way: what does it mean to store 0.14 N patterns? Does he mean stationary? Does he mean stable? In any case, N equals 4 is probably not a good example, because 0.14 times N is less than one pattern. So let us try something a little more complex. This is a six-bit pattern. Now, this is a crazy kind of map: to think about what this map really is, you should fold it down the middle vertically on itself and then roll the whole thing up and make it a toroid. It is like a four-dimensional toroid; to actually visualize neighborhoods here, I have just opened it up so that you can see what happens. If I try to store only one pattern, you see that the pattern and its ghost are very clear local minima. But if I try to store two orthogonal six-bit patterns, both of them are perfectly stationary, perfectly stable, but you see several spurious minima — which goes back to your earlier question: where did the dots come from? Suppose I started over here: this location is perfectly stable, and if I were to start here, the system would go to this location and not evolve any further. That is a fake memory, something I am remembering which is not something I really want to remember. If I start close enough to the actual memories, the yellow spots, I recall those, but there are other spots on the map which can end up being recalled which are not things I really want to recall. Now, if I try to store two non-orthogonal patterns, it turns out that the number of fake memories is actually fewer; things get disorganized a little bit, but you can still quite nicely recall both patterns.
If I try to store three non-orthogonal six-bit patterns — observe that with six bits I cannot have more than two orthogonal patterns, because six is 2 times 3, and 3 is an odd number; the largest number of orthogonal patterns I can have is two, as I explained earlier — I can still find three perfectly stationary patterns, and many of them can be recalled; they are stationary and stable. In fact, it is crazy: when I try to store three non-orthogonal patterns, I end up with fewer fake memories than when I try to store two orthogonal patterns. So if I try to store two patterns which are very far apart, I get more fake memories than if I try to store three patterns which are not that far apart, which is kind of crazy. These kinds of behaviors are sometimes thought of as being somewhat analogous to how we remember things: if everything you remember is very far away from everything else you remember, then you have lots of room to create bogus memories; if things are close enough, there is not that much room to create bogus memories anyway. This is with four, and again, in the case of four you get some interesting patterns. You can keep going on. By the time I get to six non-orthogonal patterns, nothing can be recalled at all; everything breaks, because the reflection of each pattern, its negative, ends up being too close to some other pattern and they interfere with one another, which is one of the things that happens. Now, I can go on and take eight-bit patterns. This is like a four-dimensional tesseract. I can store one pattern, I can store two patterns in eight bits, but you can see all of these nice fake-memory structures. And this is with three, I think. With two non-orthogonal patterns you do not actually get any fake memories in the eight-bit case. I can store four orthogonal patterns, I can store four non-orthogonal patterns — you get the idea. By the time I get to eight patterns, the whole thing is wiped out; this one is neither stationary nor stable. So what we have is that we have many parasitic patterns, undesired patterns that also become stable or attractors. What I mean by attractor is that if you start off close to the pattern, you are going to end up in that pattern. But we also observe that apparently there is capacity to store more than 0.14 N patterns: with eight bits I was easily able to store three and even four patterns, and that is 0.5 N. So let us consider this business of parasitic patterns. Why do I get all of these fake patterns which do end up being remembered? There are two reasons for this. The first is this: if you work out the arithmetic, if I try to store patterns y_a, y_b and y_c, then the sum of these three patterns also ends up being a local minimum of the energy contour. So the sum of an odd number of target patterns also ends up being a local minimum of the energy contour. Here, for example, if I try to store these two, I am going to get a small dip in the energy somewhere — I mean, these are just two; in theory it is the sum of all three. So if I look at the sum of all three, I am going to get a dip. The more patterns I add, the more combinations of odd numbers of patterns I get, and those are highly likely to be local minima.
And then, of course, there is purely random behavior: you are simply going to get random local minima if you design your network this way. Now, this is only true if you are using Hebbian learning; everything that I have been saying so far assumes that you are using Hopfield's original proposal of Hebbian learning, right? So now let us look at the second thing, capacity. How many patterns can I actually store? It seems that I can store more than 0.14N patterns, right? In other words, I should be able to obtain a weights matrix that stores more than 0.14N patterns. But then I do not have the nice guarantees; the probabilistic guarantees that we saw earlier are based on the Hebbian learning rule, because you just plugged the Hebbian formula for the weights into the energy when you analyzed your patterns, right? Also, we find that non-orthogonal patterns are easier to remember, meaning patterns that are closer together seem to be easier to recall than patterns that are farther apart. So there is a certain degree of inability to predict exactly what is remembered: what is going to be remembered well, what the fake memories are going to be, whether the target memories are going to be remembered at all, or how well they will be remembered. This lack of control is because we do not really have a nice, controlled way of designing the weight matrix. We just take any odd set of patterns, use the Hebbian rule to compute the weights, and say this is it, this is my network, and whatever behavior you get from it is what you get. So the question is: if I change the learning rule, say go to something that is non-Hebbian, can I actually do better? Can I get better capacity and fewer spurious memories? And here is the bold claim. I can always store up to N orthogonal patterns with perfect stationarity. Why? We will get to that. I can also store N arbitrary patterns, so long as they are linearly independent: any set of N linearly independent patterns can be stored such that they are stationary and stable. Not only that, I can avoid spurious memories by adding some noise during recall. So, you saw those little dips in the energy contour. During recall, if I add a little bit of noise that allows the system to keep evolving even when it is sitting in a local minimum, then it can escape the small dips and fall into the larger depths, going to real memories as opposed to fake memories. So these are the two claims, and these are claims I am not actually going to back up today; I am going to hold you in suspense till the next class. Let us stop here.
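The construction behind these claims is deferred to the next lecture; purely as a rough illustration of what "adding noise during recall" could look like, here is a sketch of a stochastic update rule, a temperature-controlled flip probability in the spirit of what Boltzmann machines formalize. The function name, the Glauber-style update, and the cooling schedule are my assumptions for illustration, not the course's prescribed method.

```python
import numpy as np

def noisy_recall(W, y0, T=1.0, cooling=0.95, sweeps=50, rng=None):
    """Asynchronous recall with noise: flip neurons stochastically at temperature T,
    then gradually cool, so shallow spurious dips are escaped more easily than the
    deeper minima at the stored patterns. Illustrative sketch only."""
    rng = np.random.default_rng() if rng is None else rng
    y = np.asarray(y0, dtype=float).copy()
    for _ in range(sweeps):
        for i in rng.permutation(len(y)):      # visit neurons in random order
            field = W[i] @ y
            # Probability of setting y[i] = +1 rises with the field; at low T this
            # approaches the deterministic threshold update used so far.
            z = np.clip(2.0 * field / max(T, 1e-8), -50.0, 50.0)   # avoid overflow in exp
            p_plus = 1.0 / (1.0 + np.exp(-z))
            y[i] = 1.0 if rng.random() < p_plus else -1.0
        T *= cooling                           # gradually reduce the noise
    return y
```

Started from a corrupted version of a stored pattern, with the temperature slowly lowered, this kind of update tends to settle into one of the deeper minima, a real memory, rather than getting stuck in a shallow spurious dip.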
But even before we stop, there are a couple of loose ends. First, have you thought about why the quadratic business does not behave the way you might expect? It is because the space is discrete. What really happens is that when I define the energy I am defining a quadratic in continuous space, but the values that the y's can take are only the corners of a grid, and restricted to the corners of the grid I can have many local minima, whereas in continuous space there is only one global minimum. We will see exactly why this is so; that is number one. Number two: why would I always be able to store N orthogonal patterns? Because if you give me a set of N orthogonal patterns, I can always compute a matrix W for which they form a basis, and if I just set my weights to be that matrix, they become stable. Which basically means that everything we have seen so far is not quite optimal, and then of course you can optimize things on top of that; we will get to that. Now, the third question is: can I store more than N patterns? Clearly not by design. You might get more than N memories, but all but N of them are going to be fake, right? Clearly not by design. But I am going to claim that yes, I can store more than N patterns. How would I do that? This is where you think outside the box. Somebody just gave a sneaky hint; that one is cheating, right? So think of it this way: there was a relationship we had over here. If I am trying to store N-bit patterns, what was the size of my network? N neurons, right, and N(N-1)/2 weights. But who said the network must have only N neurons? If I want to store more than N patterns, why don't I just append a bunch of don't-care neurons? If I append a bunch of don't-care neurons, then the capacity of the network scales with the total number of neurons, not just with the number of bits I care about, correct? Which means that if I add another K neurons to my N neurons, the capacity is going to be N plus K. I can just keep adding more and more don't-care neurons and store more and more patterns, right? That gives us a Boltzmann machine. So just take your standard Hopfield network, append a bunch of neurons you do not care about, and you end up with a Boltzmann machine. And in fact, if you separate those two sets of neurons and restrict the manner in which the two sets communicate, you get a restricted Boltzmann machine. We will talk about that in the next two classes. Questions? Guys, any questions from the streaming audience? Yes. So, the Hopfield network was originally quite literally content-addressable memory, right? You could store patterns, start off with something close to a stored pattern, and it would recall it; or you could use it for things like denoising, whatever, right? You could also use it, for example, for pattern recognition, in the sense that if you store the 10 digits and you get a corrupted image, it can tell you which digit it was, right? But if you go one step further and begin thinking of adding this extra stuff, you can begin thinking of it as a generative model, and that is what Boltzmann machines are primarily used for: as generative models. You can use them for classification and so on, what have you. The reason I am only spending, you know, three lectures on this is that it is not the primary thrust of our course. But if you took, for example, 707, they actually spend a lot more time talking about restricted Boltzmann machines, Boltzmann machines, and you know, deep Boltzmann machines, and that entire hierarchy of models. All right, thank you.
