11-785 Deep Learning Recitation 11: Transformers Part 2

All right, welcome to the Transformers recitation, part 2. This is a continuation of part 1, where we coded a transformer from scratch. One thing I’d like to note before we begin: I reviewed the rest of the presentation I recorded earlier, and I think I spoke quite slowly, so please feel free to watch this at 2x speed.

I’ll go over the architecture once more, because in this part we’ll be focusing on uses of the transformer in computer vision, and the architecture itself doesn’t really change a lot. So I’ll quickly review what we did in the previous recitation and the two lectures before, in case you’ve been busy and haven’t looked at them.

If you remember, this is what the transformer architecture looks like. On the encoder side, the input embeddings come in, there’s a self-attention block, and you attend to different parts of the input. In language, you basically have a sentence and you’re attending to different parts of that sentence. The final features from the encoder then go into the decoder block, which takes the output embeddings. In our case this was trained for the next-word prediction task, though this will change with vision, and that’s one of the interesting parts: what have researchers done in this respect? In the decoder you first have a self-attention block, which is generally masked. Then there is a cross-attention block, where the key and value come from the encoder and the query comes from the decoder. Finally, there is a classifier layer for predicting the next word.

So let’s begin. For the vision part we’ll be covering ViT and DETR, and the language side in the next recitation after that, depending on when we get time. Because vision was the most popular application, we’ll cover ViT, the Vision Transformer, and DETR in detail.

This slide is from a survey paper on the use of transformers in vision. Some of the architectures that turned out to be influential are the Vision Transformer, which is the one that broke open the field because it didn’t use any CNN at all; the Swin Transformer, which I’ll briefly mention later on; and, for object detection, DETR, which is from Meta and did quite well, because they show how a transformer can be used for object detection, and the way they train it is quite interesting.

All right. So, ViT, the Vision Transformer. It comes from the paper “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”. The “16×16 words” part is interesting: each patch is 16×16 pixels, so a flattened patch is a 256-dimensional vector per channel. If you look at the figure, you have the transformer encoder, and what’s being passed into the network is patches. The reason they do this is the following, and it’s why we reviewed the architecture: what was being passed into the transformer for natural language tasks were sentences made of words. If you now want to use images for the same kind of task, how would you do that? Ideally, you want to break the image up, because you want to attend from some parts of it to other parts. Now, which parts would you attend to?
Would you attend to pixels? In that case, depending on how large the image is, your input length increases significantly, and that causes an issue because the attention module is O(n²) in the sequence length: the compute effectively grows quadratically as the length increases. So what would you do? The authors suggest you break the image down into patches of 16×16 pixels, so each flattened patch becomes a 256-dimensional vector per channel. They pass these patches through a linear projection of the flattened patches: they first flatten (unroll) each patch, project it linearly, and then add the position embedding, just the way you had it with sentences.

The main intuition is that images can be thought of as sentences, which is what they’re trying to say here, and the way you treat a sentence is to break it down into smaller parts; the smaller parts here are these 16×16 patches. You could also think of it as doing global attention instead of local attention: instead of attending from each small part to every other small part, you attend from these bigger 16×16 parts directly to the other bigger parts. This sequence is then passed to the transformer encoder, which is exactly the same as before. In addition, you prepend something called the CLS token, just as with sentences, and the output embedding corresponding to that CLS token is passed to the MLP head, which is then used for classification.

So let’s look at this in detail. You take the patches, flatten them, linearly project them using the weight matrix, and then add the positional encodings to get the final patch embeddings. These are the tokens you pass to your encoder. From the output, you use the embedding of the CLS token, which you can think of like a start-of-sentence token. Because that token itself doesn’t carry any image content, you’re effectively asking what it attends to the most, and through that, what the image attends to the most: it could be the trees, it could be the building, the structure. One way to look at that is through the CLS token; this is done a lot in language tasks, and that’s where the idea came from.

So this is basically what is done in the ViT paper, and if you look at it, it’s quite simple. They haven’t really changed the architecture at all; it’s just the way they’re treating the images and then the attention between these parts. Nothing extraordinary is being done here, but the fact that they’re not using a convolutional network was quite revolutionary, because until that time it was convolutional networks, or convolutional-network-based features, that were being passed into transformers.
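To make the patch-embedding step concrete, here is a minimal PyTorch sketch of the tokenization described above. This is not the authors’ code; the class names, the 224×224 image size, and the 768-dimensional embedding are illustrative assumptions. It turns a 224×224 RGB image into 14×14 = 196 patch tokens plus a CLS token, adds learned position embeddings, and feeds the result to an unchanged transformer encoder; compare 196 tokens to the 224×224 = 50,176 tokens you would get by attending over individual pixels.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into 16x16 patches and linearly project each one (illustrative sketch)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2           # 14 * 14 = 196
        # A conv with stride = kernel = patch_size is equivalent to flattening
        # each patch and applying one shared linear projection.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                                          # x: (B, 3, 224, 224)
        x = self.proj(x)                                           # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)                        # (B, 196, 768)

class TinyViT(nn.Module):
    def __init__(self, num_classes=1000, embed_dim=768, depth=4, num_heads=8):
        super().__init__()
        self.patch_embed = PatchEmbed(embed_dim=embed_dim)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))  # learnable CLS token
        self.pos_embed = nn.Parameter(torch.zeros(1, 1 + self.patch_embed.num_patches, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)  # standard encoder, unchanged
        self.head = nn.Linear(embed_dim, num_classes)                  # MLP head on the CLS output

    def forward(self, x):
        tokens = self.patch_embed(x)                               # (B, 196, D)
        cls = self.cls_token.expand(x.shape[0], -1, -1)            # (B, 1, D)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed  # prepend CLS, add position embeddings
        out = self.encoder(tokens)                                 # (B, 197, D)
        return self.head(out[:, 0])                                # classify from the CLS embedding

logits = TinyViT()(torch.randn(2, 3, 224, 224))                    # (2, 1000)
```

The convolution-as-projection trick is just a convenience; conceptually it is exactly the flatten-then-linear-project step described above.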
But what they really showed was that you could get it to be as good as, or even outperform, convolutional networks by using the Vision Transformer. However, if you have done homework 2 part 2 by now, you know that ConvNeXt actually came after this, and its entire claim was that if you use the training strategies used for transformers, convolutional networks effectively beat them. So that was another ablation study. There’s nothing really conclusive here that says transformers are better than convolutional networks, or the other way around; it depends on a lot of factors, which you’ve probably realized by now, since the course is almost coming to an end: there are a lot of factors that determine how well a network performs.

That brings us to the next part, but before we get there, let’s look at this plot about why attention behaves differently from convolution. The graph shows, depending on how deep the layer is, how far the network can attend to other parts of the input. With a convolutional network, unless your kernel sizes are really large, the early layers cannot attend to parts that are far away, but attention can do that right from the beginning, because every token can attend to every other token. This is why attention performs well even with networks that are not very deep, as opposed to convolution. If you remember, ConvNets really perform better when you have quite a deep network, while transformers are not that deep; the number of encoder layers L can be changed, but it’s still a lot shallower than typical ConvNets.

All right, from this we can move to the DETR architecture, which is for object detection. What this paper really talks about is how you can use a transformer-based architecture in conjunction with a CNN front end to detect objects. The way it works is: you have this image of birds, and the network predicts a set of N box predictions; in the original paper N is 100. So no matter what you give as input, this network outputs 100 predictions. Each prediction gives a class and a bounding box. Since there are 100 of them, for the two birds in the image, two of the predictions would give the correct class and box, and for all the others the prediction would be the “no object” class (marked ∅ in the figure); those still come with some bounding box, but their class output is no object.

So let’s now come to the training. Because you’re doing an object detection task, there are several questions to answer: is there an object, what is the object, and where is the object. The network is trained using a bipartite matching loss. What that does is, for any predicted box, you try to find the ground-truth box that matches it the closest, and you only allow a one-to-one correspondence.
So you’re not matching each predicted box with multiple ground-truth boxes; you’re matching each one with exactly one. On the right are the ground-truth boxes, on the left is what you predicted, and for each predicted box you try to find the one ground-truth box that matches it. For that pair you then have a loss function based on the class and the bounding box (we’ll come to that), and you try to minimize it over all N predictions.

Let’s look at this in detail. The matching step uses what is called the Hungarian matching algorithm; there are other matching algorithms too, which you can look up if you’re interested. First, the matching function finds a one-to-one correspondence between the predictions and the ground-truth objects (I won’t go into the details of how). Once you have that matching, you take a cross-entropy loss on the classification, based on what was predicted and what the class should have been. Then, whenever the matched prediction is not supposed to be “no object”, i.e. there actually is something there, you also add the bounding-box loss, which is a weighted sum of an intersection-over-union (IoU) loss and an L1 loss. IoU is quite a famous measure; I guess most of you have heard of it, but in case you haven’t: if your predicted box is one rectangle and your ground-truth box is another, the IoU is the area of their intersection divided by the area of their union. The other term is the L1 loss between the bounding boxes: depending on how you define a bounding box, it’s a four-dimensional vector (for example, center coordinates plus width and height), and you take an L1 loss on that. Together, that’s your loss function.
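As a rough illustration of the bipartite matching step (not DETR’s actual implementation; the cost weights and the helper name `hungarian_match` are my own assumptions), the sketch below builds a cost matrix between the N predictions and the ground-truth objects from the class probability and an L1 box distance, then uses the Hungarian algorithm from SciPy to get the one-to-one assignment. DETR’s real matching cost also includes a generalized IoU term, which is omitted here for brevity.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes,
                    w_class=1.0, w_l1=5.0):
    """Match N predictions to M ground-truth objects one-to-one (illustrative sketch).

    pred_logits: (N, num_classes + 1) raw class scores, last class = "no object"
    pred_boxes : (N, 4) normalized (cx, cy, w, h)
    gt_labels  : (M,)   ground-truth class indices
    gt_boxes   : (M, 4) ground-truth boxes, same format
    """
    prob = pred_logits.softmax(-1)                       # (N, C+1)
    # Classification cost: negative probability of the correct class.
    cost_class = -prob[:, gt_labels]                     # (N, M)
    # Box cost: pairwise L1 distance between predicted and ground-truth boxes.
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)     # (N, M)
    cost = w_class * cost_class + w_l1 * cost_l1
    # Hungarian algorithm: minimum-cost one-to-one assignment.
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx   # every unmatched prediction is supervised as "no object"

# Toy example: 100 predictions, 2 ground-truth birds of class 3.
pred_idx, gt_idx = hungarian_match(torch.randn(100, 92), torch.rand(100, 4),
                                   torch.tensor([3, 3]), torch.rand(2, 4))
```

After the assignment, the cross-entropy and box losses described above are computed only on the matched pairs, while all remaining predictions are pushed toward the no-object class.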
So what I’ve done until now is just talk about the loss function; we haven’t really gone into the network, so let’s do that now. From the CNN you get a set of image features: some number of channels d, each of size height × width. You add positional encodings to that, and then you flatten it, so for each spatial location you get a d-dimensional vector, i.e. a sequence of height × width vectors. These are passed to the transformer encoder. The encoder in the DETR paper is basically the same encoder we had for the transformer, stacked the same way, and the same goes for the decoder, although the way the decoder is used is different. So this (height × width) × d block is passed into the transformer encoder, you get the same number of outputs from the encoder, and those are then passed to the decoder through cross-attention, the same way as in a normal transformer.

The transformer decoder, however, takes object queries as its inputs. For the sentence case we had masked output embeddings coming in and then a final classification; here you have object queries coming in and object-query embeddings coming out, and these are then passed through a feed-forward network, which gives you the outputs. There are 100 of these, because that’s the number of outputs the model produces: no matter what your input is, it always sends out 100 outputs, so if you had, say, 100 objects in the image, it could in theory detect all of them.

So the question is: what are these object queries? They are learned: you initialize them and the goal is to learn the object queries themselves. What I’ve shown below, from the paper, are visualizations associated with the learned object queries (a subset of them is shown). Effectively, each query attends to a different part of the image to see whether there’s an object there; think of it as asking the image, region by region, whether an object lies there and where. After that comes the decoder, which, if you look at it, is again similar to the decoder we had for the language tasks. This is really interesting, because the same structure that was used with language is now being used with images, and that’s also what the Vision Transformer title is getting at by calling an image patch a 16×16 “word”. That’s why transformers are considered really revolutionary: you’re joining both domains and using the same architecture for them. Well, here you are still using a CNN, and earlier work used CNN features with LSTMs too, but still.
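Here is a minimal sketch of this encoder/decoder wiring. It is an illustrative assumption rather than the authors’ code: the ResNet-50 backbone choice, hidden size, placeholder positional encodings, and head shapes are all stand-ins. CNN features are projected to d channels, flattened into a sequence with positional encodings, encoded, and then a fixed set of 100 learned object queries cross-attends to that sequence in the decoder; each output query is mapped to class logits (including the no-object class) and a 4-dimensional box.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class TinyDETR(nn.Module):
    def __init__(self, num_classes=91, d_model=256, num_queries=100):
        super().__init__()
        backbone = resnet50()
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])   # CNN front end -> (B, 2048, H/32, W/32)
        self.input_proj = nn.Conv2d(2048, d_model, kernel_size=1)        # reduce channels to d_model
        self.transformer = nn.Transformer(d_model, nhead=8,
                                          num_encoder_layers=3, num_decoder_layers=3,
                                          batch_first=True)
        self.query_embed = nn.Embedding(num_queries, d_model)            # 100 learned object queries
        self.pos_embed = nn.Parameter(torch.zeros(1, 2500, d_model))     # placeholder positional encodings
        self.class_head = nn.Linear(d_model, num_classes + 1)            # +1 for the "no object" class
        self.box_head = nn.Linear(d_model, 4)                            # (cx, cy, w, h), normalized

    def forward(self, images):                                           # images: (B, 3, H, W)
        feats = self.input_proj(self.backbone(images))                   # (B, d, h, w)
        B, d, h, w = feats.shape
        src = feats.flatten(2).transpose(1, 2)                           # (B, h*w, d) feature sequence
        src = src + self.pos_embed[:, : h * w]                           # add positional encodings
        queries = self.query_embed.weight.unsqueeze(0).expand(B, -1, -1) # (B, 100, d)
        hs = self.transformer(src, queries)                              # decoder output for each query
        return self.class_head(hs), self.box_head(hs).sigmoid()          # (B, 100, C+1), (B, 100, 4)

logits, boxes = TinyDETR()(torch.randn(1, 3, 480, 640))
```

The real DETR feeds the query position embeddings in at every decoder layer and uses more layers, but the flow is the same: a fixed number of queries in, a fixed number of class-plus-box predictions out, trained with the Hungarian matching loss above.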
Okay. Now, after this, let’s look at some interesting results; this is the part that makes it really cool. What they’ve plotted here is the self-attention from the encoder: you take the input image, encode it, look at the attention maps, and you get these outputs, which are basically the parts that the image attends to. If you then pass those, together with the ResNet features, through a small convolutional head, you get something that looks a lot like a segmentation of the different things in the scene: the cow, the ground, the bush line behind it, and the sky. So you basically get a pixel-wise classification, which is segmentation, from the same architecture. This other image is also cool: they’ve taken the self-attention over the image tokens output from the encoder and looked at what the image really attends to, and you can see it attending to the different cows in the image. So what you see here is that you can use this one architecture to do classification, pixel-wise classification, and object detection.

There’s something else I’d like to show you, from the same paper. Because the model outputs multiple objects among those 100 predictions, there is one prediction whose class is elephant and another prediction whose class is also elephant, and the decoder attention for each of them attends to different parts of the image, so it knows that these pixels belong to one instance and those belong to a different instance, which is really cool. You can see the same thing in the other example: it can figure out which part of the image belongs to which predicted output.

So that’s DETR. I’ve rushed through it a bit, because when you go into detail there’s a lot there. I also want to mention, really briefly, two other architectures you should know about. There’s the Swin Transformer: what they’re basically doing is shifted-window attention. They break the image into windows, compute attention within each window, and shift the windows between layers, somewhat like a strided operation, and pass that through the transformer. But what ConvNeXt showed is that it’s better than the Swin Transformer and the Vision Transformer, so if you want to do image classification using convolutions with ConvNeXt, feel free to do that. And there’s a newer transformer, the Perceiver, which is aimed at multimodal inputs. One of the issues you face with transformers is that, because self-attention is O(n²), you can’t really feed in very large inputs. What this architecture does is get around that by passing the inputs only as keys and values through cross-attention into a small set of latent vectors, and then running a latent transformer on those, so the quadratic cost is in the small latent size rather than the input size. So they try to solve the issue of large inputs. Those are the two extra architectures; feel free to ask questions on Piazza.
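To illustrate that last Perceiver-style trick, here is a rough sketch under my own assumptions (not the Perceiver authors’ code; the latent count, dimensions, and class name are made up): a small set of learned latent vectors acts as the queries and cross-attends to the long input, which supplies only the keys and values, so the expensive self-attention afterwards runs only over the latents.

```python
import torch
import torch.nn as nn

class LatentBottleneck(nn.Module):
    """Cross-attend from a few learned latents to a long input, then self-attend over the latents."""
    def __init__(self, d_model=256, num_latents=64, num_heads=8, depth=4):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(1, num_latents, d_model))        # learned latent array
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        layer = nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True)
        self.latent_transformer = nn.TransformerEncoder(layer, num_layers=depth)  # runs on 64 tokens only

    def forward(self, inputs):                                   # inputs: (B, n, d), n possibly huge
        q = self.latents.expand(inputs.shape[0], -1, -1)         # (B, 64, d) queries
        # Cross-attention: queries are the latents, keys/values are the raw inputs.
        # Cost is O(num_latents * n) instead of O(n^2).
        z, _ = self.cross_attn(q, inputs, inputs)
        return self.latent_transformer(z)                        # (B, 64, d)

out = LatentBottleneck()(torch.randn(1, 50_176, 256))            # e.g. every pixel of a 224x224 image
```

The point is only the asymmetry: the input length shows up linearly in the cross-attention, while the quadratic self-attention cost depends only on the fixed latent size.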
That’s about it for the vision tasks; for the language tasks, we’ll probably record something later. Thank you, guys.
