11-785 Deep Learning Recitation 11: Transformers Part 1
Hi everyone. So this is the first part of the transformers recitation, where we will code a transformer from scratch. This one will be more about the basics of transformers and how we code them, so that in future recitations we can just go over what researchers in the community have done: how they change the architectures, and their use cases in different fields such as vision and natural language. One of the cool things about transformers is that they are data-agnostic: the data could be from a camera, it could be from lidar, it could be so many things. It could also be multimodal. That really stems from the fact that you have these self- and cross-attention blocks, so you can join information from different sources. With that, let's begin. Before we jump into the code I wanted to go over some things. This is the transformer that the professor went through in the last lecture, and I will not go into the details of it because that happened in lecture. I will be going into how we implement each part and then link them together. I'll be coding some of it live during this meeting, and I'll jump through other parts because they'll either repeat what came before or be easy for you to do yourselves. Let's get some basics out of the way first, because it's easier that way. I won't go into all the intricacies, since in the poll it was quite clear that people don't want that, which is understandable. So: positional encoding, self-attention, cross-attention. This has been covered multiple times in lecture, but in case you don't remember or you're not sure, this is the easiest way of understanding it.
Key and value will always come from the same source, but your query can come from a different source. If your query is from the same source as the key and value, it's self-attention; if your query is from a different source, it's cross-attention. That's it. That's not always true, but in most cases it is, and that's all you need. The reason is that you're effectively trying to find which keys are closest to your query, and you take that projection and use it over the values: you're finding the values associated with the keys that are closest to the query. That's basically the attention computation, which is this one: you take a query, find its projection onto the keys, and then use that to get the final output as a weighted sum of the values. We'll come back to that. Now let's go to the positional encoding part. Basically you get an input embedding and you add it to the positional encoding. How does that happen? Consider your input to be a sentence, and in that sentence you have words. Those words are part of some vocabulary, and in that vocabulary each word has an index — you can think of them as numbers — and you use those to create initial embeddings, random at first, which will encode information about the word. We will also train those embeddings over time. We take those embeddings and add the positional part to them, which is what is being done here. The way we do it is with these exponential sine and cosine terms which depend on the position, and the interesting part is that they are constructed in such a way that no two positions end up with the same encoding, because of the way this sum is formed.
That's all you need to know: it makes the position of each word unique. Okay, the other part is the masking. If you've attended the lectures or read the slides, you would know that what you're doing is next word prediction. For that you basically need to mask the input so that each position only sees the words which have already been predicted. Each position only attends to the words before it; it doesn't get to see the next one. When you reach the last word, you would have seen all the words before it. This is to prevent the network from seeing the next word, because otherwise it could just copy the next word to "predict" it, which would make the task trivial. But for the encoder block that's not needed. So let's look at the architecture. This is the same as the one you've seen in the paper. You get inputs, you make input embeddings, add the positional encodings, and pass them on. Here is the encoder block. In the encoder block you have this first sub-block, which is self-attention. Then you have a skip connection and normalization, followed by a feed-forward network, and again one more skip connection plus normalization. The thing to note here is that there can be multiple such blocks stacked together — that's where this Nx comes in: you can have multiple such blocks creating the final encoder. So the way we will be coding it is: we create each module, we link those modules to make one encoder block, and then we create the encoder by stacking these encoder blocks. Now looking at the decoder, it's quite similar in the sense that once you get the input, you have self-attention, but it's masked self-attention, so the words are only attending to the words that they've already seen.
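To make the masking idea concrete, here is a minimal sketch (my own illustration, not the recitation's exact code) of a lower-triangular "look-ahead" mask: position i may only attend to positions 0..i, and a 0 marks a spot that must be hidden from attention.

```python
import torch

def make_trg_mask(trg):
    # trg: (batch, seq_len) of token indices
    N, trg_len = trg.shape
    # tril gives 1s on and below the diagonal, 0s above it,
    # so each position can only "see" itself and earlier positions
    mask = torch.tril(torch.ones(trg_len, trg_len))
    # expand so every sequence in the batch (and every head) shares it
    return mask.expand(N, 1, trg_len, trg_len)

m = make_trg_mask(torch.zeros(2, 4, dtype=torch.long))
```

The name `make_trg_mask` and the `(N, 1, len, len)` layout are assumptions for illustration; what matters is the triangular pattern.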
And then you use that to get the queries. This is the part which gets interesting. If you look at this part, it's actually the same as these two — the only difference being the way the key, value and query are passed. In the earlier case the key, value and query were all the same, so that was a self-attention block; this one is a cross-attention block. The queries are coming from the decoder side that we've already written, and the key and value are coming from the encoder. This forms one decoder block, and we'll have N such decoder blocks to create the decoder. I hope that makes it clear. Okay, let's jump into the code. We'll begin by writing the self-attention block. Please feel free to speed the video up if you feel I'm going slow. Writing the code out is more for my benefit: there are some small parts you think you know, but until you really write them down, you don't quite understand them. That's how I like to code. A couple of the videos I'm referring to here are ones where people actually coded it out, and that made it make more sense to me. So that's how I'll be doing it — not all of it will be coded live; I'll be skipping through some parts and reusing those. Okay, so I'm creating the self-attention block, which is this part here, and it will be multi-headed: you have your values, keys, and queries coming in. We'll look at that as well. And I'll be jumping back to the visuals in between, so it's easier to see which part of the code we're on. Okay, so you have your embedding size, and you have your heads. Now, the thing to note here is that the input embeddings will be split into equal parts, and the reason that happens is that you'll be passing these embeddings to the heads.
You have these heads; you'll be splitting the embeddings into equal parts and then concatenating them again afterwards, so the result is the same shape. That's why I split the embedding I've got, and because this block is self-attention, I know for a fact that they'll all be the same size. One more thing to note: we want to make sure that the embed size is actually divisible by the number of heads, because otherwise the split and the concatenation won't work. So we'll have an assert statement here which checks that the embedding size is divisible by the number of heads. Okay, now we'll be creating some linear layers. Before I pass the inputs directly to the attention, I actually pass them through a linear layer first and then split — we'll see this again in the forward pass, but there's a linear layer before the split. So this goes from the embedding size back to the embedding size: you pass the input through the linear and then you split it. I'm just going to copy this for the keys and queries, because it's self-attention and they're all the same. This one is the query, and this one is the output projection, and again it'll be the same size — well, not necessarily in general, it just happens to be the same size in this case. Moving on to the forward: you'll be passing in your values, keys, query, and the mask. If you're paying attention you might object that, if we're using this just for the encoder block, the mask doesn't really matter, because the encoder won't use one. But the cool thing is that this block, together with the feed-forward network, is exactly the structure we need in the decoder as well. That's why we handle the mask here, so we can reuse this code. And we need the mask for next word prediction.
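The setup just described can be sketched like this. The class and attribute names (`embed_size`, `heads`, `head_dim`, the four `nn.Linear` layers) follow the naming used in this recitation, but the exact layer shapes here are my assumption; the key point is the divisibility assert and the four projections.

```python
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, embed_size, heads):
        super().__init__()
        self.embed_size = embed_size
        self.heads = heads
        # each head gets an equal slice of the embedding
        self.head_dim = embed_size // heads
        assert self.head_dim * heads == embed_size, \
            "embed_size must be divisible by the number of heads"
        # one linear projection each for values, keys, queries,
        # plus a final projection after the heads are concatenated
        self.values = nn.Linear(embed_size, embed_size)
        self.keys = nn.Linear(embed_size, embed_size)
        self.queries = nn.Linear(embed_size, embed_size)
        self.fc_out = nn.Linear(embed_size, embed_size)

sa = SelfAttention(embed_size=256, heads=8)
```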
What I'm doing here is getting the length of each input; it will become clear later why we need it. Okay, so now let's go in the order of what happens inside the block. You get the values, keys and queries, you pass them through the linear layers, and then you pass them to the multi-head attention. What happens in the multi-head attention is: the query and key come in, you do a matrix multiply (we'll be doing it differently, but you could do a bmm), you scale it, which is this, you apply a mask if you want to, then you do the softmax, and then again a matrix multiply between this attention matrix and the values — that's your attention. You have multiple of these because it's multi-head attention, and the inputs were split across the heads. So let's look at that. Here we're passing everything through the linear layers. Okay, now we will split the embeddings before passing them to the multi-head attention. We will reshape using the value length — this is one of the reasons we need the lengths: we want to reshape so that the length doesn't change, but the way the embeddings are represented changes. So you put self.heads and self.head_dim: before, the last dimension was self.head_dim times self.heads, but now we have split it so that we can pass it to the multi-head attention. The same thing goes for the keys and the queries: same reshape, same sizes. Feel free to skip ahead if you understood this; I just find it useful to still write it down. So now we have split the tensors after the linear layers. Next we go into the multi-head part itself, which is basically this; I'll just add some comments here. The matrix multiply can be done multiple ways — torch.bmm, torch.tensordot, torch.matmul — but we will be using einsum.
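The split described above can be sketched in two lines (all sizes here are illustrative): after the linear layers the tensor is `(N, length, embed_size)`, and we reshape so the last dimension becomes `(heads, head_dim)` without touching the length.

```python
import torch

N, value_len, heads, head_dim = 2, 7, 4, 8
# output of the linear layer: (N, value_len, embed_size)
values = torch.randn(N, value_len, heads * head_dim)
# split the embedding across heads; the length dimension is untouched
values = values.reshape(N, value_len, heads, head_dim)
```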
I will try to give a small introduction to einsum here; if that's not enough, you can search online — there is a great video I can link as well. Once you get past the initial barrier, I believe einsum is a lot easier to implement with, and one of the hardest to make a mistake with, because you write out exactly the shapes you're working with. I'll explain what's being done so it's easier for you. This part will be useful for homework 4 as well, because you'll be implementing attention there: the way the attention is calculated will be the same, and you could also use a multi-head attention block in homework 4. So this part is actually quite useful for homework 4. Okay, so we'll compute the energy term. We know the shapes of both the key and the query, and the first thing we're doing is this matrix multiplication between the query and the key. Maybe I can keep the window like this — it's a little small, but I think it should still be visible. So the first thing you need is this product, and the way you do it in einsum is by writing out the shapes. For the query, let's call the shape n for the batch size, q for the query length, h for the number of heads, and d for the head dimension. The heads and the dimension will be the same for both the query and the key. We're doing a matrix multiplication between the query and the key, so let's add the key: its dimensions are n, the key length k, h, and d. And what are we doing here? We're doing this initial part: let's not call it a dot product, let's call it an inner product of the query and the key, taken across the last dimension — that is, across the embeddings — so in the einsum output you remove that dimension. Let's look at the result: it becomes n, h, q, k — the batch size, the number of heads, the query length, and the key length.
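The einsum step just described can be sketched as follows (the shapes here are illustrative): an inner product over the head dimension d, leaving a (batch, heads, query_len, key_len) score matrix.

```python
import torch

N, q_len, k_len, heads, head_dim = 2, 5, 7, 4, 8
queries = torch.randn(N, q_len, heads, head_dim)   # n q h d
keys = torch.randn(N, k_len, heads, head_dim)      # n k h d
# contract over d: for each batch and head, a (q_len x k_len)
# matrix of query-key similarity scores ("energy")
energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)
```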
What this gives, if you look at it closely, is: for every position in the query, you have a score corresponding to each position in the key. For every element of the query, you have something for each key, and that is what you will multiply with the value as well — you'll basically be taking a weighted sum over the value for each query. So: this is the query's dimension signature, which is this; this is the key's, which is this; you multiply them such that the last dimension is contracted away, and what you finally get is your batch size, your number of heads, and for each query position in each head, k scores. You will use those to take a weighted sum of the values. Okay, now: if the mask is not None. This is for the case where you use this same block in the decoder. There we're looking at next word prediction, so we don't want to look at words which are in the future. In this case, you set energy to energy.masked_fill of the mask. This masked_fill will be useful for the homework as well. What this is doing, basically, is: for the positions where you have a mask, you fill the score with a very large negative value. You could do it differently, but let me show you the mask part. For the positions that are supposed to be hidden, so that you don't look at them, you make the score really negative, so that after the softmax those positions are reduced to essentially zero weight. Let me add some comments so it's easier to understand. So, now that we're done with this part, we'll focus on the last part: combining it with the values. Let's look at that. First we need to divide by the square root term, which is the scaling.
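The masking step just described can be sketched like this: wherever the mask is 0, overwrite the score with a very large negative number, so the softmax sends that position's weight to essentially zero. Sizes are illustrative.

```python
import torch

energy = torch.randn(1, 1, 3, 3)  # (batch, heads, q_len, k_len)
# lower-triangular mask: 0 above the diagonal = "future, hide it"
mask = torch.tril(torch.ones(3, 3)).expand(1, 1, 3, 3)
# a huge negative score softmaxes to ~0 weight
energy = energy.masked_fill(mask == 0, float("-1e20"))
attention = torch.softmax(energy, dim=-1)
```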
Then we need to softmax it, so the weights lie between zero and one, which is what you'll be multiplying with the values. For each head, every element of the query is paying attention to every part of the key: for each head and each query position, you have a weight attending to each of the key positions. The reason we do it this way is that you want to map each part of the query, through the keys, to the values. So you softmax over the last dimension and get values between zero and one, and that will give us the weighted sum eventually. This is the last step: we've done the matrix multiplication, we've scaled it — apologies, we scaled it and applied the mask if necessary, which we did already — and then we applied the softmax, which we did just now: we made the last dimension, which is where the key scores live, lie between zero and one. Now we do a matrix multiplication of this final attention matrix with the values, so that you get something which is back in the value space. Let's look at that. You have a torch.einsum again, and the first input is the attention matrix, whose shape we already know: for each batch, for each head, for each query position, you have the weights over all the key positions. Let's call that last dimension l. The other input is the value; notice its shape: n, value length, heads, head dimension — so n, l, h, d. The l will be the same, because the key and value lengths are the same, and then you have the heads and the head dimension. Now, what we want is to contract over this l dimension.
We remove this l because each key position is being matched to its position in the value, and you take a weighted sum of those for your query. So what you effectively get is: for each query element, for each head, you have a head-dimension vector. This is useful because we'll be concatenating across heads to finally get back into the same dimension space as the values. So now let's combine, which is: as I said, we combine these two dimensions. Once this is done, let's walk through it again so it's quite clear what happens. This final output we pass through the final linear layer, and that does not modify the shape — we have the same shape as before. Let's look at it once more so it's completely clear. You have your query and key coming in, and you basically do an inner product to find the projection of the query in the key space. Then you scale it, and apply a mask if necessary — only when you're in the decoder block do you mask it, so that you're only looking at words that precede the current one. Once you do that, you do a softmax along the last dimension, because you want the final output to be a weighted sum, and for that weighted sum you want the weights to lie within 0 to 1 and to sum to 1. That's where this term comes in: you use it to get a weighted sum of the value terms, and you're doing that for each query. That's what we have in the code: you do it for each query, and that's why the query length q appears in the output.
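Putting the pieces together, the whole walk-through above can be sketched end to end (sizes illustrative; the scaling by the square root of the embedding size follows this recitation's convention): scale, softmax over the key length, weight the values with a second einsum, then fold the heads back into one embedding.

```python
import torch

N, q_len, k_len, heads, head_dim = 2, 5, 7, 4, 8
embed_size = heads * head_dim
queries = torch.randn(N, q_len, heads, head_dim)
keys = torch.randn(N, k_len, heads, head_dim)
values = torch.randn(N, k_len, heads, head_dim)

# inner product of queries and keys over the head dimension
energy = torch.einsum("nqhd,nkhd->nhqk", queries, keys)
# scale, then softmax over the key length so each row sums to 1
attention = torch.softmax(energy / (embed_size ** 0.5), dim=-1)
# weighted sum over the key/value length l, back to (n, q, h, d)
out = torch.einsum("nhql,nlhd->nqhd", attention, values)
# concatenate the heads again: (n, q, h*d) == (n, q, embed_size)
out = out.reshape(N, q_len, heads * head_dim)
```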
You then pass it through the final linear layer after concatenating — that's why we reshaped: we split here after passing through a linear layer, and now we concatenate again and pass through the output linear. That completely finishes our self-attention block. I know this took a long time, but the other blocks are quite easy compared to this. Let's move on to the transformer block, which is basically just a combination of what we've done. What will we be passing to this transformer block? We will be passing the embedding size, the heads, the dropout, and the forward expansion — we'll come to what this forward expansion is; you can also keep it as one if you want. Now, this is the attention module of the transformer block, which is what we have already implemented: you give it the size and the heads. Next we'll be adding the layer norms; I'll show the architecture when we get to forward. Now let's look at the feed-forward that comes after the attention. It's a linear layer. The input is the embedding size that we got from the attention block, which is the same as the size we had from the input directly. And this is the part I was talking about — the forward expansion: you can keep the hidden size the same, or you can expand it; the authors expand it in the paper, but it depends on how you want to implement it. You could keep the activation as ReLU, or I would suggest GELU, because we've seen that GELU performs better. But that's really a hyperparameter choice — use whatever activation you found to perform best. Okay. Wait, this second linear should actually map back down to the embedding size — and I believe this is right now. Now let's look at the forward. What you're passing to the forward is the keys, the values and the queries, right here.
You pass them to the self-attention, then the layer norm — you have a skip connection here, this is important — then you pass it through the feed-forward, and you have a skip connection and a normalization for that as well, and that's the output of our transformer block. So: value, key, query and the mask. I'm passing them through the attention block directly — the value, the key, the query and the mask. After the attention there's a skip connection, and you also have a dropout; you can choose not to have the dropout, it's completely up to you. So we pass it to norm1, and we add the input here directly, which was the query. Now this x is the input to the feed-forward. And again there's a norm and a skip connection, so it becomes out. Typically having just one dropout should also work, but that's again a regularization choice. Now we will be using this transformer block to create the encoder. We've done the multi-head attention and combined it with the feed-forward; now we use that to create the encoder. So, what is the input to the encoder? This is the part where we actually look at the position embedding as well. We will be implementing it in a simpler way, but you could use the class that I've pasted here as well. Let's look at it briefly, even though I'm not using it for the code right now. This is from The Annotated Transformer blog. What's happening here is you create the position term using this sine and cosine, and it is added to the input, just as in the diagram: your input embedding comes in, you compute the positional encoding, and you add them — your input x is being added to this variable which is generated here — and then you apply a dropout.
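The transformer block just assembled — attention, add & norm, feed-forward, add & norm, with dropout after each residual — can be sketched like this. So that the sketch runs on its own, I use PyTorch's built-in nn.MultiheadAttention as a stand-in for the SelfAttention class we wrote; the wiring of the skips and norms is the point here, not the attention internals.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super().__init__()
        # stand-in for our hand-written SelfAttention
        self.attention = nn.MultiheadAttention(embed_size, heads,
                                               batch_first=True)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.GELU(),  # ReLU also works; GELU is the suggestion above
            nn.Linear(forward_expansion * embed_size, embed_size),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, value, key, query):
        attn_out, _ = self.attention(query, key, value)
        x = self.dropout(self.norm1(attn_out + query))  # skip connection
        fwd = self.feed_forward(x)
        return self.dropout(self.norm2(fwd + x))        # second skip

block = TransformerBlock(embed_size=64, heads=4, dropout=0.1,
                         forward_expansion=2)
x = torch.randn(2, 10, 64)
out = block(x, x, x)  # self-attention: value = key = query
```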
You don't optimize over this, but if you wanted to, you could. You have this position tensor, which runs up to whatever your maximum length is, and then this div term — the div term together with the sine is the part which actually encodes the information about the position. The div term is created using an exponential, which is multiplied with the position. I don't want to go too much into the depth here, because it would take some time, but basically — maybe I can annotate it. You have sines of different frequencies. This drawing looks terrible, but what I'm trying to say is: you have sines of different frequencies, and those encode the information of the position. That's where the sine term and cosine term come in and help. The reason you do it this way is that, even though these terms are periodic, when you combine all this information you make sure that no two positions in the sequence ever have the same value. This idea is actually really useful in some other places too. If you've heard of the Neural Radiance Fields (NeRF) paper from the computer vision community — they use this positional encoding to make the network better. Basically, whenever you have, say, two elements which are close together, but you want the network to have some way to distinguish between them when you pass them in, you can add a positional encoding term. That makes sure that even if two inputs look the same but are at different positions, you pass that information to the network. That's why this is required, and it's also why you don't optimize over it: it's just there to add information, to tell the network that these two things are not at the same position.
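The sinusoidal encoding from The Annotated Transformer, as described above, can be sketched like this: even dimensions get a sine and odd dimensions a cosine, each at a different frequency (the `div_term`), so no two positions share the same vector. This is a condensed sketch, not the blog's exact class.

```python
import math
import torch

def positional_encoding(max_len, d_model):
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len).unsqueeze(1).float()
    # exponentially decreasing frequencies across the dimensions
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))
    pe[:, 0::2] = torch.sin(position * div_term)  # even dims
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dims
    return pe  # fixed values, not a learned parameter

pe = positional_encoding(max_len=50, d_model=16)
```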
Now, back to the encoder — I digressed a bit, but it's useful. You have the source vocab size, the embed size, and the number of layers of the encoder blocks that we just created. Let's assign those to self. Oh, my bad, there are actually some other terms; let's do those first. We'll also be passing the device, the forward expansion term that we used in the encoder block, the dropout, and the max length — we need this for the positions. All right, with the device set, let's go to the word embedding. Like I said, this is the input embedding, and we create it using nn.Embedding. You have the source vocab size, which is the number of embeddings, and the embedding dimension, which is the embed size. Now, for the positions I'm doing it a hacky way, but you could use the annotated-transformer code from above as well; this way is just simpler. Hmm, is this not working? Oh, my bad — that was a typo. And then you have the embedding dimension, which is the embed size. What this is showing: in your word embedding, the number of embeddings is the vocab size. Like I said at the beginning of this recitation, you have an index for each word in the vocabulary, and you use those to create an embedding which is random initially; the size of those embeddings is the embed size that you've already passed as a parameter. Now, think of this second one as the positional encoding I just explained, except here you have the max length, and that max length is the number of embeddings, because you're looking at the position of each word. This encoding doesn't take the word itself into account — the word embedding handles that.
We are only looking at the position here. And since it's learned this way — it also makes sense if you don't want to learn this part, which is fine; you could keep it fixed, and there are ways in which that can be done. Then you again have the embedding size to be the same, because you'll be adding the two. Now let's look at the layers. This is for the transformer block: you have the embed size, the heads, the dropout and the forward expansion — these have been passed into the network. Oh wait — I must have missed defining heads. This num_layers is basically the number of times the transformer block will be stacked; that's all there is. We again add a dropout. Now to the forward: the input and the mask — the mask is just passed forward. So you have N and the sequence length from the input's shape. Then you have the positions — this is for the positional embedding. What you do here is create the range up to the sequence length, expand it across the batch without changing the dimension, and move it to the device. Now we work with the embeddings: you have self.word_embedding, which depends only on the words — you pass it x — plus self.position_embedding, where you pass the positions, and a dropout over the sum. What you created here was a list from 0 to sequence length — one index for each word position — and then you expanded it N times, once for each sequence in the batch. You pass that to the position embedding, so each word comes out with an embedding-dimension vector, and to that you add the positional embedding lookup for that word's location. That's all this is doing. Now let's look at the final part of the encoder.
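The embedding step just described can be sketched like this, with the learned position lookup (the hacky-but-simple route taken here). The vocab size, max length, and batch sizes are illustrative values of my choosing.

```python
import torch
import torch.nn as nn

vocab_size, embed_size, max_length = 100, 32, 50
word_embedding = nn.Embedding(vocab_size, embed_size)      # what the word is
position_embedding = nn.Embedding(max_length, embed_size)  # where it sits

x = torch.randint(0, vocab_size, (2, 9))  # (N, seq_length) of word indices
N, seq_length = x.shape
# 0..seq_length-1, expanded so every sequence in the batch shares it
positions = torch.arange(0, seq_length).expand(N, seq_length)
# word meaning + word position, elementwise sum
out = word_embedding(x) + position_embedding(positions)
```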
In this encoder, the key, value and query are the same, because it's self-attention; in the decoder this will change. What we do here is go through all the layers: for each layer, out = layer(out, out, out, mask). The key, value and query are the same — that's why out is passed three times. It looks a little weird, but that's what it is, because it's self-attention. That's the encoder; all I do now is return out. Just to make clear what is happening here: we did the self-attention, we did the feed-forward, and made the transformer block. Now we're doing the encoder, and for the encoder these blocks are stacked on top of each other: from one block you get some output, and you pass that same output in again as all three inputs, which might seem a little weird, but that is exactly what self-attention is. Value, key and query come from the same source, so each part of the query is attending to the sequence itself. I hope that makes it clearer, and helps if you hadn't quite understood it before. Once this is done, we will be passing it to the decoder. By the way, for the future transformer recitations this is a useful thing to remember, because other variants of transformers — BERT and such — use these pieces separately and just work with those, because they carry a lot of information; they use them for different tasks. Now let's go to the decoder and the decoder block. If you're tired, just to reassure you: the self-attention part was the only involved one. The others are just engineering, but we will still go through them to make sure we're clear about even the smallest things. So, what is the input to the decoder block?
The input is the embed size — because again this also takes an input — the heads, the forward expansion, the dropout, and the device. This is where that transformer block comes in useful, because we will be reusing that piece of code. Let's also add a layer norm. For the initialization I won't go into the architecture; it's easier to understand the sequence in which these are called when we come to forward, so I'll go into the architecture there. So you have the self-attention, with the embed size and the heads. Then you have a transformer block, and to that you pass the embed size, again the heads, the dropout and the forward expansion. Okay, I'll just add one more dropout. Now let's get into the forward, which is where I'll explain the sequence. First we do self-attention at the beginning: you have attention = self.attention(...), and you pass the inputs and you pass the target mask. This is something interesting to note here — let me finish this and then you can look at it. You have the skip connection producing the query. Okay, now let's look at this one by one. You have a masked multi-head attention on the input: self-attention, where you pass x again as the key, value and query. Then you have a dropout on top of this. So this is the first part: the key, query and value are all x. Then you have a skip connection — you add x to the attention output; remember, we made sure the dimensions were the same — and then you have a norm and then a dropout. So that's this part of the diagram. The interesting part is that the rest is the same transformer block, except now the value and key are coming from elsewhere. That's why, if you look at the first, masked self-attention: there, the x, x, x are all coming from the decoder's own input.
And so what you have is a target mask, a mask on the input coming in from the target data. Now, for the part after this first attention: you can actually call the transformer block directly, which is what we are doing, because it is the same as the encoder's block except that the inputs are now different. If you remember, that was the block for the encoder, and there you can clearly see the inputs are all the same, because in the encoder it is self-attention all across. In the decoder, the first part is self-attention: you pass x, x, and x, so the key, value, and query all come from the same data. But for the next part, the key and value come from the encoder, and that is why the encoder outputs are passed in as key and value, while the query comes from what we just computed. That is the cross-attention part of the decoder. For that reason, the mask here is the source mask, because the keys and values come from the source. And then you just return it. These are the small things that become quite clear once you code them out. So that is the decoder block. Now we will make the decoder using the decoder block; I will copy some stuff because it is the same. Just like the encoder, you have the target vocab size, because that is what we will use for the embedding on the target side; the embed size; num_layers; heads, the number of heads for multi-head attention; the forward expansion; the dropout; the device; and the maximum length. You have self.layers equal to an nn.ModuleList, the same way we had it for the encoder, and inside you have the decoder block, to which you pass the embed size, heads, forward expansion, dropout, and device. This might seem intimidating, but most of these things are trivial.
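The three-step structure just described (masked self-attention, then cross-attention with encoder keys/values, then feed-forward) can be sketched as below. Again this is a simplified stand-in, not the recitation's exact code: it uses `nn.MultiheadAttention` for both attentions and inlines the transformer block instead of reusing it.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Masked self-attention over the target, then cross-attention
    where key/value come from the encoder output and the query
    comes from the decoder side."""
    def __init__(self, embed_size, heads, dropout, forward_expansion):
        super().__init__()
        self.self_attention = nn.MultiheadAttention(embed_size, heads, batch_first=True)
        self.cross_attention = nn.MultiheadAttention(embed_size, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(embed_size)
        self.norm2 = nn.LayerNorm(embed_size)
        self.norm3 = nn.LayerNorm(embed_size)
        self.ff = nn.Sequential(
            nn.Linear(embed_size, forward_expansion * embed_size),
            nn.ReLU(),
            nn.Linear(forward_expansion * embed_size, embed_size),
        )
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, enc_out, trg_mask):
        # 1) Masked self-attention: query, key, value all come from x.
        attn, _ = self.self_attention(x, x, x, attn_mask=trg_mask)
        query = self.dropout(self.norm1(attn + x))        # skip connection + norm
        # 2) Cross-attention: key/value from the encoder, query from step 1.
        cross, _ = self.cross_attention(query, enc_out, enc_out)
        x2 = self.norm2(cross + query)
        # 3) Feed-forward with its own skip connection.
        return self.dropout(self.norm3(self.ff(x2) + x2))
```

Note that `nn.MultiheadAttention` expects a boolean `attn_mask` where `True` marks positions that may *not* be attended to, so the causal mask is the upper triangle here.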
Forward expansion, dropout, and device aside, all you really care about is the number of heads and the embedding size; that is it. Then you have self.fc_out, a linear layer after everything, the same as in the encoder. A very cool thing to see here is that the same block actually works for vision as well; we will look at that in the next recitation. And there is no convolution: it is just linear layers and attention, which makes it very impressive. So for that last linear layer you have the embed size and the target vocab size. Now, in forward, let me go through each part again just to be clear. What do we have here? We have N and the sequence length, the same way we did it for the encoder; I am just repeating that, because the only thing that has changed here is the target vocab size, which we will look at. Again, like before, we are just adding the positional embeddings. Then comes the cross-attention. Just copy what we did in the encoder: for each layer, you call the decoder block. Because it is cross-attention, your queries come from what flows through the layers: each layer outputs a query, that query is passed on again, and the key and value are the encoder outputs. Finally, after those N layers are done, you pass it through the final linear layer. Let us look at the diagram again just to be sure: we have the masked self-attention here for each query, and after that, this query plus the key and value coming in from the encoder feed the transformer block you have already defined; this entire decoder block is stacked N times to make the decoder, which is exactly what you have done in the code if you look at it.
In the decoder, you have x, which is the target input coming in, and the encoder output, which actually acts as the key and value, but only during cross-attention. So you have N and the sequence length, you create the position embeddings just like before, and then you pass x through the decoder blocks multiple times; that is why you have `for layer in self.layers`. This entire block gets stacked on top of itself: the x coming in does self-attention in the masked multi-head attention, and the encoder outputs are passed in alongside x at every layer. The interesting thing to note is that the encoder output is the same across all the layers: for the first layer the encoder output is the same and x changes; encoder output the same, x changes again. Your x, the information coming from the target side, is what changes. And then you have the final linear layer and the output. This last linear is of size target vocab size, which is interesting because you are doing next-word prediction. All right, so now we combine the encoder and the decoder to create the Transformer; this is the finale of all the parts you made. What do you pass? The source vocab size, the target vocab size, the source pad index, and the target pad index; I will look at these again once we are in forward, this is just so we know what we are passing. The embed size we take to be 512; it can be increased or decreased depending on how much you want to train. Num layers is 6 and forward expansion is 4, just to match the paper, though I would keep it at one. Now again, the encoder is defined the same way as before: pass the source vocab size, pass the embedding size, pass the number of layers. At this point it is trivial.
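Here is a minimal sketch of the decoder wrapper: embeddings plus learned positions, a stack of layers that all receive the *same* encoder output, and a final linear projection to vocabulary logits for next-word prediction. It is an illustrative simplification under my own naming: each stacked layer is reduced to a single cross-attention (`nn.MultiheadAttention`) instead of the full decoder block, and masks are omitted.

```python
import torch
import torch.nn as nn

class Decoder(nn.Module):
    """Embed target tokens, add learned position embeddings, run the
    stacked layers, and project to target-vocabulary logits."""
    def __init__(self, trg_vocab_size, embed_size, num_layers, heads, max_length):
        super().__init__()
        self.word_embedding = nn.Embedding(trg_vocab_size, embed_size)
        self.position_embedding = nn.Embedding(max_length, embed_size)
        self.layers = nn.ModuleList(
            nn.MultiheadAttention(embed_size, heads, batch_first=True)
            for _ in range(num_layers)
        )
        # Final linear: next-word prediction over the target vocabulary.
        self.fc_out = nn.Linear(embed_size, trg_vocab_size)

    def forward(self, trg, enc_out):
        N, seq_length = trg.shape
        positions = torch.arange(seq_length).unsqueeze(0).expand(N, seq_length)
        x = self.word_embedding(trg) + self.position_embedding(positions)
        for layer in self.layers:
            # enc_out is the *same* for every layer; only x changes.
            # query = x, key = value = encoder output (cross-attention).
            x, _ = layer(x, enc_out, enc_out)
        return self.fc_out(x)
```

The key point from the transcript is visible in the loop: the encoder output is fed unchanged into every layer, while x is what evolves.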
We just have to pass all the parameters, and the good thing to notice here is what each parameter really means in the grand scheme of things, because that actually helps make it clearer in your head how many components there are and how each small thing connects to the entire transformer. So: the target vocab size, the embed size, num layers, heads, forward expansion, dropout, device, and max length; you need all of those. Okay. Now we have the encoder and the decoder defined, and we will call them in forward. But first, there are a couple of functions we should look at, because we have been passing their outputs around without really looking at them: the masks. How do you create the masks? Let us look at that and then tie it all up in forward. So you have make_src_mask, which takes the source. This is where the indices really come in: the source mask is `(src != self.src_pad_idx).unsqueeze(1).unsqueeze(2)`. Let me go through this again in case it is not completely clear. The shape becomes (N, 1, 1, src_len), and the reason you unsqueeze twice is that you multiply the mask with the attention scores, so the dimensions need to broadcast; they are not literally the same dimension, but broadcasting handles that. So you find the padding indices. The source sequences might have different lengths, and you have padded them; this is also quite important for homework 4, where you have the same scenario. The parts you have padded you do not want to include, so the mask zeroes out the positions where you have padding.
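The padding mask can be sketched as a small standalone function; the name and the default pad index are my own, but the double unsqueeze mirrors what is described above.

```python
import torch

def make_src_mask(src, pad_idx=0):
    """Padding mask: True where the token is real, False where it is
    padding. Unsqueezed twice so it broadcasts against attention scores
    of shape (N, heads, query_len, key_len)."""
    # (N, src_len) -> (N, 1, 1, src_len)
    return (src != pad_idx).unsqueeze(1).unsqueeze(2)

src = torch.tensor([[5, 3, 0, 0],
                    [7, 2, 9, 0]])   # 0 is the pad token here
mask = make_src_mask(src)
```

Positions holding the pad token come out False, so the attention scores at those keys can be masked out before the softmax.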
This zeroes out the parts where you have padding; that is what it does, so that you only attend to the parts that are important. Now, the target mask is the mask we have been talking about until now: you have this triangular mask, which is what we are creating here, so that each position only attends to the positions that come before it. Let me write the code and then we will look at it again. So the target mask is `torch.tril(torch.ones(trg_len, trg_len))`, expanded out to the batch. torch.tril is not entirely obvious; I mean, it is straightforward, but I would suggest looking at the documentation, which will help you completely understand what is going on. You basically create a square of ones and keep only the lower triangle. Then, just as before, we expand it to the shape we need, because it will be multiplied with the attention scores. And with that, the masks are done. Now the final part, forward; let us look at it. You create the source mask and the target mask the way we just described: the target mask is the triangular one, because you do not want to see values that are in the future, and the source mask is the one that removes the padding. Now I can actually call the encoder and the decoder. You have the encoder take the source and the source mask, and then the decoder: `self.decoder(trg, enc_src, src_mask, trg_mask)`. This is again cross-attention, remember: the trg, the target, is the input to the decoder, and it gets both the source mask and the target mask. And then we return the output. And that is it, we are done with the transformer, I think.
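The triangular (causal) target mask described above can be sketched like this; the function name and shapes are my own convention, matching the broadcasting style of the padding mask.

```python
import torch

def make_trg_mask(trg):
    """Lower-triangular causal mask: position i may only attend to
    positions <= i, so the decoder cannot peek at future tokens."""
    N, trg_len = trg.shape
    # Square of ones, then keep only the lower triangle.
    trg_mask = torch.tril(torch.ones(trg_len, trg_len)).bool()
    # Expand to (N, 1, trg_len, trg_len) to broadcast over heads.
    return trg_mask.expand(N, 1, trg_len, trg_len)

trg = torch.zeros(2, 4, dtype=torch.long)  # dummy target batch
m = make_trg_mask(trg)
```

Row i of the mask has ones (True) up to and including column i, which is exactly "attend only to what came before".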
So this is the transformer. There might be some small errors here, but this is mostly it. It might have seemed a little long, and given how long the video is I cut some parts, but this was sort of necessary for the things I will explain in the next recitation: because we know these pieces now, it becomes a lot easier to understand and appreciate the different parts of the transformer in the next recitation, since we will be using different parts of this, and how they all come together is quite impressive. All right, see you guys in the next recitation.