In this video, we will speak about simple Recurrent Neural Network,

and how to use Backpropagation to train it.

In the previous video, we spoke about a recurrent architecture.

In this architecture, we work with sequences, element by element.

Here, Xs are the elements of the input sequence and at each time step,

we have some output or a prediction y-hat.

We can use this architecture for a lot of different tasks.

For example, if we want to use it for language modeling,

then Xs should be word embeddings and the output

of our model at each time step is a probability distribution over a vocabulary,

which says which word is more probable to be the next one in our sequence.

Of course, we can generate this word from this probability distribution and give

it to our model as an input in the next time step so we can generate text.

We can use a recurrent architecture to solve a POS tagging task.

In this task, we want to tag each word in the text,

with a part of speech label.

Here we have the same inputs, word embeddings,

but the output is different now.

Now, at each time step,

the output is a probability distribution over the set of

all possible POS tags which says which POS tag is more probable to the current word.

Now we see that we can use this architecture for different tasks and here,

in the middle of this architecture,

we see this box MLP,

which can be quite different inside.

Let's use the simplest MLP, which exists.

The MLP with one layer of hidden neurons.

Then, we have a vector of hidden neurons in the recurrent layer of our network

and we can use these hidden neurons both

to calculate the output of our network or the predictions y-hat,

and to calculate the hidden neurons from the next time step.

On the slide, you can see formulas for a forward pass in our network,

and actually these formulas are pretty similar to the ones you already saw in the course.

Here, again, you have some nonlinear functions f

of linear combinations of the inputs.

But here you need to remember that all the parameters are shared between

time steps so we use same weight matrices and the same bias vectors at each time step.

This type of visualization of recurrent neural network usually called an unfolded version.

And there is another version of the visualization for recurrent neural network,

the more compact one.

Actually, this compact visualization is more general

because it doesn't depend on the length of the input sequence.

While the unfolded version is actually depends on it,

but in the compact visualization,

we have this recurrent connection,

which forms a loop in our scheme and it's actually

not clear how to train a network with a loop.

That's why we will use an unfolded visualization when we will speak about training.

Now, let's speak about training of Recurrent Neural Network.

Here we see a Recurrent Neural Network in the unfolded form.

To train it we actually need some data.

Let's say that we have a training dataset so for some training sequence we know

all the true labels y and we make

some predictions y-hat and to understand how wrong are we,

we use some loss function L but we use this loss function separately at each time step,

and then we sum up all of these losses to take their total loss, to have a total loss.

Now, we have our usual neural network,

we have data, we have a loss, what should we do?

We need to backpropagate.

As usual, in backpropagation algorithm,

we need to make a forward pass and a backward pass.

In forward pass, we need to calculate all of the elements in

our own neural network so we calculate the hidden elements,

then we make our predictions, and we calculate the loss.

In the backward pass,

we need to calculate the gradient of our loss function with respect to

all the parameters so with respect to weight matrices and bias vectors.

In usual feedforward neural network,

we make a backpropagation only through layers so in the vertical direction,

but here we have sequences.

This why we need to backpropagate not only through layers but

also through time, in horizontal direction.

That's why the algorithm of training of recurrent neural networks,

usually called backpropagation through time.

Additionally, we need to remember that all the parameters are shared between time steps.

Therefore, if we want to calculate the gradient

of our loss with respect to some parameter,

we actually need to sum up the gradients from all the time steps.

Let's, for example, calculate the gradient of

the loss function with respect to weight matrix U.

As I already said, we need to sum up the gradients from all the time steps in

our sequence. And to calculate the gradient of the loss at the specific time step t,

we can simply use a chain rule.

As a result, we have this product and the first element in this product,

we can calculate if we know which loss function we use,

and the second element,

we can actually also calculate very simply

because our prediction depends on the matrix U only in one point.

Now let's consider a more difficult example.

Let's calculate the gradient of our loss function with respect to

the recurrent weight matrix W.

In the first step, we will do the same thing as earlier.

We simply sum up the gradients from all the time steps in

our sequence and actually we want to do the same thing thing as earlier in the next step,

but if we look at the formula for the hidden units,

we see that there is not only one dependence

between our weight matrix W and hidden units at

time step t because hidden units in the previous time step actually also depend on W.

So, here, we need to go deeper and use not a simple chain rule,

but use a formula of total derivative.

After applying this formula,

we end up with the following equation for gradient of

our loss function with respect to W.

At the last part of this equation,

you can see that first we take into

account dependence between hidden units at time steps t and the

weight matrix W.

When we take one step back and take into account the dependence

between previous hidden units and weight matrix W,

and then we take one step back and so on and so on while we

don't reach the beginning of our sequence.

As a result, here,

the last part of this equation is the sum of

the contributions from the all previous time steps

to the gradient at time step t.

And to calculate the contribution from time step k to the gradient at time step t,

we actually need to go from the hidden units at time step t

to hidden units at time step k. In each element on the sum,

we have this product of the jacobian matrices of the gradients of hidden units

at one time step with respect to the hidden units at the previous time step.

Now, we calculated the gradient of our loss function

with respect to two weight matrices, U and W.

When we calculated the gradient of our loss function with respect to U,

we needed only to go backwards in layers,

so in vertical direction.

When we calculated the gradient with respect to W, we need to go

backwards both in vertical direction and in horizontal direction,

so backwards in layers and time.

Now we have the last weight matrix V,

and it's a question for you.

Do we really to go backwards in time to calculate the gradient of

our loss function with respect to weight matrix V?

Yes, here we have the same situation as with

the recurrent weight matrix W because the dependence

between hidden units and weight matrix V is actually not only in one place.

Hidden units from all the previous time steps also depend

from V so we need to go backwards in time to calculate this gradient.

Let's summarize what we’ve learned in this video.

We now know what is the simple Recurrent Neural Network is

and how to train it using backpropagation through time.

This backpropagation through time algorithm is actually a simple backpropagation,

but with a fancy name.

In the next video, we will discuss

the difficulties that arise during the training of Recurrent Neural Networks.