Hey, in this video, I'm going to cover one nice paper about summarization. This is a very recent paper from Chris Manning's group, and it is nice because it tells us that, on the one hand, we can use an encoder-decoder architecture and it will work somehow; on the other hand, we can think a little bit and improve a lot. The improvement will be based on pointer networks, which are also a very useful tool to be aware of. Also, we sometimes give rather hand-wavy explanations of architectures with pictures, and sometimes it is good to go into the details and see the actual formulas. That's why I want to be very precise in this video, and by the end of it you will be able to understand all the details of the architecture. So, this is just a recap: first of all, we usually have some encoder, for example a bidirectional LSTM, and then we have some attention mechanism, which means that we produce probabilities that tell us which moments in our input sentence are the most important. Now, you see there is an arrow on the right of the slide. Do you have any idea what this arrow means? Where does it come from? Well, the attention mechanism picks out the important moments of the encoder based on the current moment of the decoder. So, we definitely also have the yellow part, which is the decoder, and the current state of this decoder tells us how to compute the attention. Just to have the complete scheme, we can say that we use this attention mechanism to generate our distribution over the vocabulary. Awesome. So, this was just a recap of the encoder-decoder architecture with attention. Let us see how it works. We have some sentences, and we try to get a summary. The summary would look like this. First, we see some UNK tokens because the vocabulary is not big enough. Then, we also have some problems in this paragraph that we will try to fix. One problem is that the model is abstractive, so the model generates a lot, but it doesn't know that sometimes it would be better just to copy something from the input. The next architecture will tell us how to do that. Let us have a closer look at the formulas and then see how we can improve the model. So, first, the attention distribution. Do you remember the notation? Do you remember what h and s are? Well, h is the encoder states and s is the decoder states. We use both of them to compute the attention weights, and we apply a softmax to get probabilities. Then, we use these probabilities to weight the encoder states and get v_j. v_j is the context vector specific to position j of the decoder. Then how do we use it? We have seen in some other videos that we can use it to compute the next state of the decoder. In this model, we will go a slightly simpler way. Our decoder will be just a normal RNN, but we will take the state of this RNN, s_j, concatenate it with v_j, and use it to produce the probabilities of the outcomes. So, we just concatenate them, apply some transformations, and do a softmax to get the probabilities of the words in our vocabulary. Now, how can we improve our model? We would want to have some copy distribution. This distribution should tell us that sometimes it is nice just to copy something from the input. How can we do this? Well, we have the attention distribution, which already has the probabilities of different moments in the input. What if we just sum them by words? For example, suppose we have seen "as" two times in our input sequence; then the probability of "as" should be equal to the sum of those two attention weights. In this way, we get a distribution over the words that occurred in our input.
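To make these formulas concrete, here is a minimal NumPy sketch of one decoder step. The dimensions are made up and the parameter names (W_h, W_s, v_a, W_out) are my own illustrative choices, not the paper's code: it computes the attention distribution from the encoder states and the decoder state s_j, the context vector v_j, the generative distribution over the vocabulary, and the copy distribution obtained by summing the attention weights of repeated input words.

```python
import numpy as np

rng = np.random.default_rng(0)

T_enc, hid, vocab = 6, 4, 10          # input length, hidden size, toy vocabulary size
h = rng.normal(size=(T_enc, hid))     # encoder states h_i (one row per input position)
s_j = rng.normal(size=hid)            # current decoder state s_j

# Attention scores e_i = v_a . tanh(W_h h_i + W_s s_j), then softmax -> probabilities
W_h = rng.normal(size=(hid, hid))
W_s = rng.normal(size=(hid, hid))
v_a = rng.normal(size=hid)
e = np.tanh(h @ W_h + s_j @ W_s) @ v_a
a = np.exp(e - e.max())
a /= a.sum()                          # attention distribution over input positions

# Context vector v_j = sum_i a_i * h_i
v_j = a @ h

# Generative distribution: concatenate [s_j; v_j], transform, softmax over the vocabulary
W_out = rng.normal(size=(2 * hid, vocab))
logits = np.concatenate([s_j, v_j]) @ W_out
P_vocab = np.exp(logits - logits.max())
P_vocab /= P_vocab.sum()

# Copy distribution: sum the attention weights of all positions holding the same word
input_ids = np.array([3, 7, 2, 7, 5, 1])   # toy word ids; word 7 plays the role of "as"
P_copy = np.zeros(vocab)
np.add.at(P_copy, input_ids, a)            # word 7 receives a[1] + a[3], and so on
```

In a trained model, the green parameters above would of course be learned rather than sampled at random; the sketch only shows how the quantities fit together.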
Now, the final thing to do is to take a mixture of those two distributions. One is this copy distribution, which says that some words from the input are good, and the other is the generative model that we discussed before. So, just a little bit more formulas: how do we weight these two distributions? We weight them with some probability p_gen, which is also some function. Everything shown in green on this slide is parameters, so you just learn these parameters and learn to produce this probability for weighting the two kinds of distributions. This weighting coefficient depends on everything you have: on the context vector v_j, on the decoder state s_j, and on the current input to the decoder. You just apply transformations to everything you have and then a sigmoid to get the probability. The training objective for our model will be, as usual, the cross-entropy loss with this final distribution. So, we will try to predict those words that we need to predict; this is just likelihood maximization, and we will need to optimize this objective. Now, this is the whole architecture once again. We have an encoder with attention, we have the yellow decoder, and then we have two kinds of distributions that we weight together to get the final distribution on top. Let us see how it works. This is called a pointer-generator model because it has two pieces: a generative model and a pointer network. The part that copies some phrases from the input is the pointer network here. Now, you see that we are doing better, so we can learn to extract some pieces from the text, but there is one drawback: the model repeats some sentences or some pieces of sentences. We need one more trick here, and the trick is called the coverage mechanism. Remember that you have attention probabilities, so you know how much attention you give to every distinct piece of the input. Now, let us just accumulate it. At every step, we are going to sum all those attention distributions into a coverage vector, and this coverage vector will know that certain pieces have already been attended many times. How do you compute the attention then? Well, to compute attention, you also need to take the coverage vector into account. The only difference here is that you have one more term: the coverage vector multiplied by some parameters, green as usual. And this is not enough: you also need to put it into the loss. Apart from the loss that you had before, you will have one more term, called the coverage loss, and the idea is to minimize the sum of the element-wise minimum of the attention probabilities and the coverage vector. Take a moment to understand that. Imagine you want to attend to some moment that has already been attended a lot; then this minimum will be high and you will want to minimize it, and that's why you will have to have a small attention probability at this moment. On the opposite side, if you have some moment with a low coverage value, then you are safe to have a high attention weight there, because the minimum will still be the low coverage value, so the loss will not be high. So this loss motivates you to attend to those places that haven't been attended a lot yet. Let us see whether the model works nicely and whether the coverage trick helps us avoid repetitions.
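Here is a small, self-contained NumPy sketch of these two pieces, the p_gen mixture and the coverage loss. All names, shapes, and the lambda value are illustrative assumptions, and P_vocab and P_copy are random placeholders rather than the outputs of a real decoder step.

```python
import numpy as np

rng = np.random.default_rng(1)
T_enc, hid, vocab = 6, 4, 10

v_j = rng.normal(size=hid)                 # context vector from the attention step
s_j = rng.normal(size=hid)                 # current decoder state
x_j = rng.normal(size=hid)                 # current decoder input embedding
P_vocab = rng.dirichlet(np.ones(vocab))    # generative distribution (placeholder)
P_copy = rng.dirichlet(np.ones(vocab))     # copy distribution (placeholder)

# p_gen = sigmoid(w_v . v_j + w_s . s_j + w_x . x_j + b): a learned mixing weight
w_v = rng.normal(size=hid)
w_s = rng.normal(size=hid)
w_x = rng.normal(size=hid)
b = 0.0
p_gen = 1.0 / (1.0 + np.exp(-(w_v @ v_j + w_s @ s_j + w_x @ x_j + b)))

# Final distribution: a mixture of generating and copying
P_final = p_gen * P_vocab + (1.0 - p_gen) * P_copy

# Cross-entropy (negative log-likelihood) of the target word under P_final
target = 3
nll = -np.log(P_final[target])

# Coverage vector: running sum of the attention distributions from past decoder steps
past_attention = rng.dirichlet(np.ones(T_enc), size=4)   # attention at steps 0..3
coverage = past_attention.sum(axis=0)                     # c_i = sum over past steps
a_now = rng.dirichlet(np.ones(T_enc))                     # current attention distribution

# Coverage loss: penalize re-attending positions that are already well covered
cov_loss = np.minimum(a_now, coverage).sum()
loss = nll + 1.0 * cov_loss                # lambda = 1.0 chosen just for illustration
```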
We can compute the ratio of duplicates in the summaries we produce, and also the same ratio for the human reference summaries, and you can see that it is okay to duplicate unigrams, but it is not okay to duplicate sentences, because the green level there is really low: it is zero. The model before coverage, the red one, didn't know that, and it duplicated a lot of trigrams, 4-grams, and whole sentences. The blue one doesn't duplicate them, and this is really nice. However, we have another problem here: the summaries become really extractive, which means that we do not generate new sentences, we just extract them from our input. Again, we can compare what we have with the reference summaries. Let us compute the ratio of n-grams that are novel. You can see that the reference summaries have rather high bars for all of them, while the model with the coverage mechanism has substantially lower levels than the model without it. So, in this respect, coverage spoils the model a little bit. And again, for a real example, this is the summary generated by the pointer-generator network plus coverage; let us have a look. Somebody says he plans to do something, and in the original text we see exactly the same sentences, only somehow linked: the model just connects them with "he says that" and so on. Otherwise, it is just an extractive model that extracts these three important sentences. Now, I want to show you a quantitative comparison of the different approaches. The ROUGE score is an automatic measure for summarization; you can think of it as something like BLEU, but for summarization instead of machine translation. You can see that pointer-generator networks perform better than vanilla seq2seq plus attention, and the coverage mechanism improves the system even more. However, all those models are not that good if we compare them to some baselines. One very competitive baseline is just to take the first three sentences of the text. But it is a very simple, extractive baseline, so there is no clear way to improve it; this is just what you get out of a very straightforward approach. On the contrary, for the models with attention and coverage there are ideas for improving them even more, so the hope is that in a few years neural systems will be able to beat those baselines.
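As a small illustration of these two evaluation ideas, here is a sketch, assuming naive whitespace tokenization and period-based sentence splitting, of the duplicate n-gram ratio and the first-three-sentences baseline; this is my own toy code, not the evaluation setup used in the paper.

```python
from collections import Counter

def duplicate_ngram_ratio(summary: str, n: int) -> float:
    """Fraction of n-gram occurrences that repeat an earlier n-gram in the summary."""
    tokens = summary.split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeats = sum(c - 1 for c in counts.values())
    return repeats / len(ngrams)

def lead3_baseline(article: str) -> str:
    """The simple extractive baseline: take the first three sentences of the article."""
    sentences = [s.strip() for s in article.split(".") if s.strip()]
    return ". ".join(sentences[:3]) + "."

print(duplicate_ngram_ratio("the cat sat on the mat the cat sat", 2))   # 0.25
```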