HW4 Bootcamp C (Part 2 Starter Notebook)
Hi everyone, this is Swati and I will be going through the starter notebook for Homework 4P2. As you all know, this homework deals with the implementation of an attention-based end-to-end speech-to-text deep neural network. Our main aim is to implement an attention-based system which solves the sequence-to-sequence problem. As you might have seen in all the previous homeworks, we had a vertical architecture where the input is passed sequentially through several layers until the output is computed from the final layer. But in this particular homework, we have a sequence-to-sequence problem where there is no one-to-one correspondence between input and output. In such situations, we use an encoder-decoder architecture, which contains two separate modules: an encoder which processes the input, and a decoder which generates the output, and these two are linked together through an attention module. So what does the encoder actually contain? The encoder takes in the input feature vectors and produces a high-dimensional feature representation of the input. The input features have strong structural correspondence with the adjacent vectors in their neighborhood, and there are also long-term contextual dependencies with distant vectors. So usually the encoders contain CNNs and RNNs: CNNs are mainly used because they have the ability to capture the structural correspondence between neighboring vectors, and RNNs are used because we have to capture long-term contextual dependencies. Then we have the decoder. The decoder uses the feature vectors produced by the encoder to produce an output probability distribution over the output sequence. And then we have the attention module.
So the attention module can be seen as a module which creates a weighted sum of the sequence of encoder representation vectors, and the vectors are combined in such a way that the context pays most attention to the most relevant part of the input. So we will explore ways of computing attention where the network is able to learn these weights to give the most attention to the most relevant part of the input for each element of the output sequence. In this homework, the baseline architecture we are using is from the LAS paper, the Listen, Attend and Spell paper by Google Brain. The Listen module is the encoder, the Attend module is the attention mechanism, and the Spell module is the decoder. And as we always tell you, please get started early. This homework will take a significant amount of time in debugging and getting the pipeline working, and it is also the hardest homework that you will encounter in this course. The models also typically take a longer time to converge and produce good results; the training time is a lot more compared to Homework 2P2. This homework is the true litmus test for this course, so please do get started early. So let's go through the starter notebook. As usual, the starter notebook is very similar to all your previous homeworks. The dataset is very similar to Homework 1P2 and Homework 3P2: the input contains sequences of 15-dimensional mel spectrogram feature vectors, and the corresponding output contains the text transcription. The transcriptions use 30 characters: 26 characters from the English alphabet, the blank space, the comma, and two special characters, EOS and SOS. So the output dimension of your model would be 30 characters, which is nothing but a probability distribution over the vocabulary.
We also provide the vocabulary map, where you have to use the indexes, the char-to-index and index-to-char mappings. Then we have the data loader. The data loader divides your entire dataset into batches and loads them onto the model. Then we have the model. In the model, we are going to implement the baseline, LAS, starting with the encoder. In the encoder, we are going to have CNNs, LSTMs, and pyramidal BiLSTMs. Then we have the attention mechanism, where we implement single-head attention and multi-head attention. The decoder contains LSTM cells. Then we write the pipeline for training, and then the pipeline for inference. So let's quickly go through the starter notebook. The homework write-up is posted on Piazza. The Kaggle competition link is provided; kindly join the Kaggle competition. And then go through these two papers: the LAS paper gives you the model architecture, and the attention paper will help you understand what single-head attention and multi-head attention are. These are all your initial setup. We are going to be using the Levenshtein distance as a metric in this homework, which is very similar to your Homework 3P2. Then we import all the necessary libraries and our configuration. You are free to change the configuration and train for longer epochs if you want better convergence. We usually start with the learning rate that has been provided here. We also provide a toy dataset, which is very essential for you in this homework. It is there to make sure that you are able to get a diagonal attention plot and just to get your pipeline working. Kindly proceed with the actual training only after you have got a diagonal attention plot in the toy dataset setup. And note that the toy dataset is just a reference: the hyperparameters that you use here are not necessarily the same hyperparameters that you should use for your actual training. They could vary.
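To make the char-to-index and index-to-char mappings concrete, here is a minimal sketch. The exact ordering of the 30 characters and the spelling of the special tokens come from the provided vocabulary map, so the ordering below is purely illustrative:

```python
# Illustrative vocabulary: 2 special tokens + space + comma + 26 letters = 30 symbols.
# The real homework supplies this map; do not hard-code your own ordering.
VOCAB = ["<sos>", "<eos>", " ", ","] + [chr(c) for c in range(ord("a"), ord("z") + 1)]
char2idx = {c: i for i, c in enumerate(VOCAB)}
idx2char = {i: c for i, c in enumerate(VOCAB)}

def transcript_to_indices(text):
    # Wrap every transcript with <sos> ... <eos> before feeding the decoder
    return [char2idx["<sos>"]] + [char2idx[c] for c in text] + [char2idx["<eos>"]]

def indices_to_transcript(indices):
    # Strip the special tokens when converting predictions back to text
    specials = {char2idx["<sos>"], char2idx["<eos>"]}
    return "".join(idx2char[i] for i in indices if i not in specials)
```

The round trip (text to indices and back) is what you will use both when preparing targets and when reading out predictions.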
So this is for loading our toy dataset and creating the data loader. The vocabulary map is for the correspondence between your characters and IDs; kindly use the vocabulary map for all the transcript processing. Here, this is the toy dataset class that is provided. And this is the Kaggle setup, just to download the actual data from Kaggle. Like I said, the vocabulary map contains 30 characters: 26 from the English alphabet, the blank space, the comma, and two special characters, EOS and SOS. The vocabulary map is provided here; this is for the character-to-index and index-to-character mapping. The dataset and data loader here are very similar to your Homework 3P2, so feel free to use the same code. In the dataset, you can implement cepstral normalization; it is recommended that you normalize all your inputs for this homework. In your data loader, in the collate function, the output lengths are different for different inputs, so within a mini-batch you pad the sequences up to the batch maximum. For test, since we do not have any transcripts, we recommend that you cap the maximum length of your generated output at 600, for both the test and the validation data. As for the model, as I mentioned earlier, the baseline approach for this assignment is derived from the LAS paper. It describes an encoder-decoder approach. The Listener listens to the audio; the decoder spells out the transcription, which is why it is called the Speller; and the Attend module is, of course, the attention module. The Listener mainly comprises a pyramidal BiLSTM network. It takes the input audio feature vectors of the given utterances and outputs a sequence of high-level representation vectors at approximately the rate expected by the Speller network.
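As a sketch of what that collate function might look like, assuming each dataset item is a (mel, transcript_indices) pair of tensors (the starter code's exact item format may differ):

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    # batch: list of (mel, transcript_indices) pairs from the Dataset
    mels = [b[0] for b in batch]                 # each (T_i, 15)
    transcripts = [b[1] for b in batch]          # each (L_i,)
    mel_lens = torch.tensor([m.shape[0] for m in mels])
    txt_lens = torch.tensor([t.shape[0] for t in transcripts])
    # Pad variable-length sequences to the batch max; batch_first gives (B, T_max, 15)
    mels_padded = pad_sequence(mels, batch_first=True)
    txts_padded = pad_sequence(transcripts, batch_first=True)
    return mels_padded, txts_padded, mel_lens, txt_lens
```

Keeping the true lengths alongside the padded tensors is essential, because the encoder needs them for packing and the loss needs them for masking.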
The Speller network takes the high-level feature output from the Listener network and uses it to compute a probability distribution over sequences of characters using the attention mechanism. We can look at attention as trying to learn a mapping from an output character to some areas of the utterance. So a character at time step t=8 might have a long-term dependency on some input feature vector at time step t=0. That is why this attention module is used: so that we are able to focus on the parts of the input that most strongly correspond to the output probability distribution of that particular transcript. We recommend that you implement this LAS baseline architecture. However, you can use other methods, try various other things, and feel free to develop your own module and write a model from scratch. So the Listener, the encoder, contains a base BiLSTM and pyramidal BiLSTMs (pBLSTMs). The Listener typically contains 1D CNN layers, which capture the structural dependency between adjacent vectors in the input, and bi-directional LSTMs, which capture long-term contextual dependencies. Then we have a variant of the BiLSTM called the pyramidal BiLSTM. The pBLSTM can be seen as analogous to a CNN with a stride of 2, which reduces the time resolution of the input by a factor of 2. So the pBLSTMs downsample the inputs by a factor of 2, either by concatenating adjacent pairs of inputs before running a conventional BiLSTM on the reduced-length sequence, or by averaging the two adjacent vectors and passing them to the next layer. However, we recommend that you use concatenation of the adjacent vectors instead of averaging them. So in the pBLSTM, say for example the input sequence has shape (batch, T, 15); the input to the next layer would then be (batch, T/2, 15*2).
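The concatenation-based downsampling described above amounts to a reshape. A minimal sketch (the starter code's handling of odd lengths may differ; here we simply drop the last frame):

```python
import torch

def downsample_concat(x, lengths):
    """Concatenate adjacent timestep pairs: (B, T, F) -> (B, T // 2, 2 * F)."""
    B, T, feat = x.shape
    if T % 2 == 1:            # drop the trailing frame if T is odd
        x = x[:, : T - 1, :]
        T -= 1
    x = x.reshape(B, T // 2, feat * 2)
    return x, lengths // 2    # per-utterance lengths shrink by the same factor
```

Each output frame is just frame 2i and frame 2i+1 glued together along the feature axis, which is why concatenation retains all the information while halving the rate.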
The feature dimension size increases because we are concatenating, and we reshape the output from the first layer. So there are two approaches: you can concatenate them or you can average them, but our suggestion would be to concatenate, so that we retain all the information in the sequence while still reducing the rate. We suggest having a base BiLSTM and then 3 layers of pBLSTM, which reduces the time resolution by a factor of 8. There can also be other CNN networks used here in the Listener: you can replace the 1D CNN network with ResNet blocks or ConvNeXt blocks from your previous homework. The baseline LAS paper just has the LSTM and pBLSTMs, without any 1D CNNs, but you can always feel free to explore that dimension. We write the forward function here. The forward function is again similar to your Homework 3P2: we pack the sequence, pass it through the LSTM, then pack the sequence again and pass it through your pyramidal BiLSTM layers. So this is the encoder; we initialize the Listener here. Next we have the attention block. A typical implementation of the attention block has key, value, and query projections. We project our embeddings from the encoder through a linear transformation to compute the key projections, and similarly the embeddings are linearly projected to compute the value projections. The weights W_K and W_V, the weights for the keys and the values, are the learned parameters. For each output it produces, the decoder computes a query to the attention module. The attention module uses this query and the keys K to compute a set of attention weights. These attention weights must be non-negative and must sum to 1, and they are typically computed by a softmax applied to the normalized inner product of the query and the key.
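A minimal sketch of the base encoder forward pass with packing might look like this (a single base BiLSTM only; the real Listener stacks pBLSTM layers on top, and the class and argument names here are ours, not the starter code's):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

class Listener(nn.Module):
    def __init__(self, input_dim=15, hidden_dim=64):
        super().__init__()
        self.base_lstm = nn.LSTM(input_dim, hidden_dim,
                                 batch_first=True, bidirectional=True)

    def forward(self, x, lengths):
        # Pack so the LSTM skips padded frames; enforce_sorted=False means
        # you do not have to sort the batch by length yourself
        packed = pack_padded_sequence(x, lengths.cpu(), batch_first=True,
                                      enforce_sorted=False)
        packed_out, _ = self.base_lstm(packed)
        out, out_lens = pad_packed_sequence(packed_out, batch_first=True)
        return out, out_lens   # (B, T, 2 * hidden_dim), per-utterance lengths
```

The full encoder repeats this pack/unpack pattern around each pBLSTM layer, halving the lengths after every reshape.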
This is then used to get the context vector c, which is the attention weights multiplied by the values. This context vector contains the portion of the encoded embeddings which most strongly corresponds to the output produced at that instant of time. So we write the key, value, and query projections and the softmax here, and we have the forward function for this. The pseudocode for attention has been provided in the handout, so please do go through that. Then we have the Speller. The Speller is the decoder architecture; its main purpose is to produce a probability distribution over the output sequence. The Speller is autoregressive in nature, which means that the output from the previous time step is fed to the next time step for predicting the sequence of outputs. Here we have LSTM cells. You cannot directly use the LSTM layer; you are supposed to use LSTM cells, according to the LAS architecture, and then run a for loop over all your time steps in the forward function to pass your embeddings through the decoder for the entire time sequence. When we use this Speller, we also pass the attention module to it, and then we return the predictions and the attention plot. For the LAS model, here we initialize the encoder, initialize our attention module, pass the attention module to the decoder, and call the forward function here. This is the model setup: we initialize our model and define the embedding dimension size, the input size, the output size, etc. Then here we are defining the optimizers. The optimizers that we recommend are Adam or AdamW. The loss that we are using here is cross-entropy loss. If you want mixed precision training, then you can use the gradient scaler, but make sure you are using the scaling properly. Kindly go through the PyTorch documentation for the scaler.
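Putting the key/value/query projections, the softmax, and the context vector together, a single-head attention sketch could look like this (names such as key_proj are our own; follow the handout's pseudocode for the actual assignment):

```python
import torch
import torch.nn as nn

class SingleHeadAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, proj_dim):
        super().__init__()
        self.key_proj = nn.Linear(enc_dim, proj_dim)    # W_K
        self.value_proj = nn.Linear(enc_dim, proj_dim)  # W_V
        self.query_proj = nn.Linear(dec_dim, proj_dim)  # W_Q

    def forward(self, decoder_state, encoder_out):
        # encoder_out: (B, T, enc_dim); decoder_state: (B, dec_dim)
        keys = self.key_proj(encoder_out)               # (B, T, d)
        values = self.value_proj(encoder_out)           # (B, T, d)
        query = self.query_proj(decoder_state)          # (B, d)
        # Normalized inner product, then softmax over time:
        # weights are non-negative and sum to 1 along T
        energy = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)
        energy = energy / keys.shape[-1] ** 0.5
        weights = torch.softmax(energy, dim=1)          # (B, T)
        # Context = weighted sum of the values
        context = torch.bmm(weights.unsqueeze(1), values).squeeze(1)  # (B, d)
        return context, weights
```

Collecting the returned weights at every decoder step, stacked over time, is exactly the attention plot you will inspect for the diagonal.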
And yes, so here we introduce another concept called teacher forcing. Teacher forcing can be viewed as a gate which controls the percentage of ground truth that is fed into the network versus the autoregressive input, since our decoder is an autoregressive network. In the initial stages of training our model, the output produced by the decoder is erroneous, so passing an erroneous output of the previous step as input to the current time step can lead to a faulty prediction at this time step as well. That is why, instead of feeding the previous time step's outputs, we feed the ground truth inputs in the initial stages of training. When the teacher forcing rate is set to 1, only the ground truth is fed into the network and the autoregressive inputs are cut off; as we reduce the teacher forcing rate, the previous time step's outputs are passed with a higher weightage compared to the ground truth vectors. This helps the network perform better as we keep training the model. So initially the teacher forcing rate should be 1, and as the loss decreases and we approach good convergence, the teacher forcing rate is reduced on a schedule. You can have your own custom class for the teacher forcing rate schedule. After that, here we have the Levenshtein distance. The Levenshtein distance is the number of modifications that you need to make for your predicted sequence to match the ground truth sequence. The Levenshtein distance is calculated here: it is computed for each transcript and then averaged over the mini-batch. This is the training loop. We take our mel spectrogram and the transcript, pass them through our model, and just be careful when you are using mixed precision: make sure that you unscale the gradients before doing the gradient clipping. You can use gradient clipping to prevent exploding gradient problems.
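A sketch of one mixed-precision training step with the unscale-before-clip ordering. The model signature model(x, y, tf_rate) and the clipping norm of 5.0 are assumptions for illustration, not the starter code's exact interface:

```python
import torch

def train_step(model, optimizer, criterion, scaler, x, y, tf_rate):
    """One training step. tf_rate is the teacher forcing rate the decoder
    uses to choose between ground truth and its own previous output."""
    device_type = "cuda" if torch.cuda.is_available() else "cpu"
    optimizer.zero_grad()
    # Autocast only does anything on GPU; on CPU this runs in full precision
    with torch.autocast(device_type=device_type, enabled=(device_type == "cuda")):
        logits = model(x, y, tf_rate)            # (B, L, vocab), assumed shape
        loss = criterion(logits.transpose(1, 2), y)
    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)                    # unscale BEFORE clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=5.0)
    scaler.step(optimizer)
    scaler.update()
    return loss.item()
```

If you clip without unscaling first, you would be clipping the scaled gradients, which defeats the purpose of the norm threshold.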
And now we have the validation function. This is the inference, and this is for plotting our attention. We save our model and checkpoints if our validation distance is less than the best Levenshtein distance that we have obtained so far. Then you can write the code for testing here; this can be very similar to your previous homework. Now, some debugging tips. Make sure that you use the toy dataset and ensure that you get attention plots that are diagonal like this. Use mixed precision training to improve the training time. There are two ways of approaching this sequence-to-sequence problem: one is word-based and one is character-based. The word-based models won't have incorrect spellings and are very quick in training, but the problem is that they cannot predict rare words. So we prefer that you use character-based models; the paper also prescribes character-based models, and most of the TAs are familiar with them, so you can receive more help if you stick to character-based predictions. When you are processing the transcripts, use the vocab map that has been provided for index-to-char and char-to-index conversion. And use the built-in PyTorch utilities, pad_packed_sequence and pack_padded_sequence, with the enforce_sorted=False flag, so that you do not have to sort the inputs within a batch. You can experiment with CNNs in front of the pyramidal BiLSTMs: you can use ResNet blocks, ConvNeXt, or MobileNet, and feel free to explore various other CNN architectures. But just make sure that you do not downsample your input vectors by more than a factor of 8. The pseudocode for single-head attention has been provided in the handout; you can follow it. Similarly, multi-head attention consists of several single-head attention blocks right after the encoder; you concatenate all of their outputs and pass them through a linear layer to get multi-head attention.
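The multi-head construction described above, several independent heads whose contexts are concatenated and mixed by a final linear layer, can be sketched as follows (again with our own names; this is one common way to structure it, not the handout's exact code):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, enc_dim, dec_dim, head_dim, n_heads):
        super().__init__()
        # One key/value/query projection per head
        self.key_projs = nn.ModuleList(nn.Linear(enc_dim, head_dim) for _ in range(n_heads))
        self.value_projs = nn.ModuleList(nn.Linear(enc_dim, head_dim) for _ in range(n_heads))
        self.query_projs = nn.ModuleList(nn.Linear(dec_dim, head_dim) for _ in range(n_heads))
        # Final linear layer mixes the concatenated per-head contexts
        self.out = nn.Linear(n_heads * head_dim, n_heads * head_dim)

    def forward(self, decoder_state, encoder_out):
        contexts = []
        for kp, vp, qp in zip(self.key_projs, self.value_projs, self.query_projs):
            keys, values = kp(encoder_out), vp(encoder_out)      # (B, T, d)
            query = qp(decoder_state)                            # (B, d)
            energy = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)
            weights = torch.softmax(energy / keys.shape[-1] ** 0.5, dim=1)
            contexts.append(torch.bmm(weights.unsqueeze(1), values).squeeze(1))
        return self.out(torch.cat(contexts, dim=1))              # (B, n_heads * d)
```

Each head can learn to attend to a different region of the utterance, which often stabilizes the attention diagonal.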
So in inference, we want to select the most probable output sequence. There are three methods. One is exhaustive search over all possible sequences, but that is very time consuming. Then we have random search; random search might or might not give you the most probable output sequence. And then we have beam search, which iteratively expands the K most probable paths until you encounter the end-of-sentence token. Then for the Levenshtein distance: when computing it, do not include the EOS and SOS tokens. There are also weight initialization techniques that you can use, like Xavier initialization or Kaiming initialization, and for regularization we can have dropout, locked dropout, weight tying, and embedding dropout. This is similar to what was covered in Homework 3P2. Feel free to use data augmentations like time and frequency masking. As we have suggested, use Adam or AdamW with the weight decay parameter. Actually, the LAS paper uses SGD, but we suggest Adam or AdamW; you can always try SGD too. For the learning rate schedule, the initial learning rate is 1e-3. You can try different schedulers; you might have already used ReduceLROnPlateau, CosineAnnealingLR, ExponentialLR, etc., so you can try them here. The last part is pre-training the Listener. The Listener can be pre-trained separately using an autoencoder architecture. The autoencoder contains a separate encoder and decoder block; you would pass the audio input to it to get a hidden representation. You can use this pre-trained Listener so that you get better convergence. We can also pre-train the Speller: you use the transcripts as training data and train your model like you would train a language model, as was covered in P1. Yes, I think all of these should definitely help you reach the high cutoffs. I can't stress this enough, but please do start early, because this is one of the hardest and most challenging tasks in this course.
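As one concrete decoding example, taking the argmax at every step until EOS or the length cap is the simplest greedy strategy. A toy sketch, where step_fn is a hypothetical stand-in for one decoder step (the real Speller also carries hidden state and the attention context):

```python
import torch

def greedy_decode(step_fn, sos_idx, eos_idx, max_len=600):
    """step_fn(prev_token) -> logits over the vocab for the next character.
    Returns the decoded token indices, without SOS/EOS."""
    tokens = []
    prev = sos_idx
    for _ in range(max_len):
        logits = step_fn(prev)
        prev = int(torch.argmax(logits))
        if prev == eos_idx:          # stop at end-of-sentence
            break
        tokens.append(prev)
    return tokens
```

Beam search generalizes this by keeping the K most probable partial sequences at each step instead of just one, which usually lowers the Levenshtein distance at the cost of extra compute.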
Thank you.