Today, we will cover one main idea of statistical machine translation.

Imagine you have a sentence, let's say,

in French or in some other foreign language and then,

you want to have its translation to English.

How do you do this? Well, you can try to compute

the probability of the English sentence given your French sentence.

And then, you want to maximize this probability and

take the sentence that gives you this maximum probability,

right? Sounds very intuitive.

Now, let us apply Bayes' rule here.

So let us say that instead of computing the probabilities of E given F,

we would rather compute probabilities of F given E.

And multiply it by some probability of the English sentence.

And also, normalize it by some denominator.

Now, do you have any idea?

Can we further simplify this formula?

Well, actually, we can.

So, the denominator doesn't depend on the English sentence,

which means that we can just get rid of it, okay.
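
Written out, the steps so far look roughly like this. The lowercase e and f are just shorthand for the English sentence E and the French sentence F, and the hat on e is my own notation for the sentence we pick:

```latex
\begin{align*}
\hat{e} &= \arg\max_{e} P(e \mid f) \\
        &= \arg\max_{e} \frac{P(f \mid e)\, P(e)}{P(f)} \\
        &= \arg\max_{e} P(f \mid e)\, P(e)
\end{align*}
```

The last step works because P(f) is the same for every candidate e, so dividing by it cannot change which e wins.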

Now, we have this formula and now,

the question is, why is that easier?

Why do we like it more than the original formula?

This slide is going to explain why.

So, we have two models now.

We have decoupled our complicated problem into two simpler problems.

One problem is language modeling.

And actually, you know a lot about it.

So, this is how to produce some meaningful probability of a sequence of words.

Now, the other problem is the translation model.

And this model doesn't think about some coherent sentences.

It just thinks about some good translation of E to F,

so that you do not end up with something that is not related to your source sentence.

So, you have two models: one about the language and one about the adequacy of the translation.

And then you have argmax to perform the search in

your space and find the sentence in English that gives you the best probability.
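
To make the argmax concrete, here is a minimal sketch in Python. It assumes a tiny hand-made list of candidate English sentences and invented probabilities for one fixed French sentence; the names LM_LOGPROB, TM_LOGPROB, score and decode are hypothetical, and a real system has to search a far larger space than three candidates.

```python
import math

# Hypothetical numbers, invented for illustration only.
# Language model: log P(e), how fluent each English candidate is.
LM_LOGPROB = {
    "the house is blue": math.log(0.020),
    "the house blue is": math.log(0.001),
    "the horse is blue": math.log(0.010),
}

# Translation model: log P(f | e) for one fixed French sentence f,
# how adequate each candidate is as a source of that sentence.
TM_LOGPROB = {
    "the house is blue": math.log(0.30),
    "the house blue is": math.log(0.30),
    "the horse is blue": math.log(0.01),
}


def score(e):
    """log P(f | e) + log P(e); the constant denominator P(f) is dropped."""
    return TM_LOGPROB[e] + LM_LOGPROB[e]


def decode(candidates):
    """Return the candidate English sentence with the highest score."""
    return max(candidates, key=score)


best = decode(list(LM_LOGPROB))
print(best)
```

In this toy setup the translation model cannot tell the first two candidates apart, so it is the language model that rules out the badly ordered one, while the translation model rules out the fluent but unrelated one.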

Now, I have one more interpretation for you.

The Noisy Channel is a super popular idea,

so you definitely need to know about it.

And it is actually super simple.

So, you have your source sentence and you have some probability of this source sentence.

And then, it goes through the noisy channel.

The noisy channel is represented by the conditional probability of

what you get as the output given your input for the channel.

So, as the output,

you obtain your French sentence.

So, let's say that your source sentence was

spoiled by the channel and now you have obtained it in French.
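
The noisy channel story can be sketched as a small simulation. Everything here is an assumption made for illustration: the two-sentence SOURCE distribution, the CHANNEL table, and the helper names sample, transmit and posterior are all hypothetical.

```python
import random

# Hypothetical toy distributions, invented for illustration.
SOURCE = {"good morning": 0.6, "good evening": 0.4}   # P(e)
CHANNEL = {                                           # P(f | e)
    "good morning": {"bonjour": 0.9, "bonsoir": 0.1},
    "good evening": {"bonjour": 0.2, "bonsoir": 0.8},
}


def sample(dist, rng):
    """Draw one outcome from a dict mapping outcome -> probability."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def transmit(rng):
    """Pick a source sentence, then pass it through the noisy channel."""
    e = sample(SOURCE, rng)
    f = sample(CHANNEL[e], rng)
    return e, f


def posterior(f):
    """P(e | f) by Bayes' rule: guess the source from the channel output."""
    joint = {e: SOURCE[e] * CHANNEL[e][f] for e in SOURCE}
    total = sum(joint.values())  # this is P(f)
    return {e: p / total for e, p in joint.items()}


rng = random.Random(0)
print(transmit(rng))
print(posterior("bonjour"))
```

Translation is then the reverse direction: we only see the French output, and posterior recovers the most probable English input.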

Now, the rest of the video is about how to model these two probabilities,

the probability of the sentence and the probability of

the translation given some sentence.

Okay. First, about the language model.

You know a lot about it, since we covered this in week two.

So, I will have just one slide as a recap for you.

So, we need to compute the probability of a sequence of words.

We apply the chain rule and then we know that we can factorize

it into the probabilities of the next word given some previous history.
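
As a reminder, the chain rule factorization can be written like this, where w_1, ..., w_m are the words of the sentence (the symbols w and m are my own notation):

```latex
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
```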

You can use the Markov assumption and then end up with n-gram language models.

Or you can use some neural language model, such as an LSTM, to produce the next word

given the previous words.
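
For the n-gram case, here is a minimal sketch of a bigram model (n = 2), where each word depends only on the one before it. The corpus is invented, the function names train_bigram and sentence_prob are hypothetical, and the estimates are plain relative counts with no smoothing, so any unseen bigram gets probability zero.

```python
from collections import Counter

# Tiny hypothetical corpus, invented for illustration.
CORPUS = [
    "the house is blue",
    "the house is big",
    "the sky is blue",
]


def train_bigram(sentences):
    """Count contexts and bigrams, with <s> and </s> as sentence boundaries."""
    context_counts = Counter()
    bigram_counts = Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return context_counts, bigram_counts


def sentence_prob(sentence, context_counts, bigram_counts):
    """Product of P(word | previous word), estimated by relative counts."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        if context_counts[prev] == 0:
            return 0.0  # context never seen in training
        prob *= bigram_counts[(prev, word)] / context_counts[prev]
    return prob


contexts, bigrams = train_bigram(CORPUS)
print(sentence_prob("the house is blue", contexts, bigrams))
print(sentence_prob("blue the is house", contexts, bigrams))
```

The well-formed sentence gets a positive probability and the scrambled one gets zero, which is the kind of signal the language model contributes to translation.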

Now, the translation model.

Well, it is not so easy.

So, imagine you have a sequence of words in one language and you need to

produce the probability of a sequence of words in some other language.

For example, take a foreign language,

like Russian, and the English language,

and these two sentences.

How do you produce these probabilities?

Well, it is not obvious to me.

So, let us start at the word level.

We can understand something at the level of separate words in these sentences.

Okay. What can we do?

We can have a translation table.

So, here, I have the probabilities of Russian words given some English words.

And they are normalized, right.

So, each row in this matrix sums to one.

And these are just translations that I learn or

that I look up in the dictionary or build somehow.
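
Such a table could be sketched as follows. The counts are invented, not the ones on the slide, and the names COUNTS, normalize_rows and T are hypothetical; the point is only that each English word owns a row of Russian words, and that the row is normalized to sum to one.

```python
# Hypothetical counts, invented for illustration:
# COUNTS[english_word][russian_word] = how often the pair was observed.
COUNTS = {
    "house": {"дом": 8, "домик": 2},
    "blue": {"синий": 6, "голубой": 4},
}


def normalize_rows(counts):
    """Turn each row of counts into probabilities P(russian | english)."""
    table = {}
    for e_word, row in counts.items():
        total = sum(row.values())
        table[e_word] = {f_word: c / total for f_word, c in row.items()}
    return table


T = normalize_rows(COUNTS)
for e_word, row in T.items():
    print(e_word, row, "row sum:", sum(row.values()))
```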

Okay, it's doable.

Now, how do I build the probability of

the whole sentence given these separate probabilities?

We need some word alignments.

So, the problem is that we can have some reorderings in the language like here,

or even worse, we can have some one-to-many or many-to-one correspondences.

For example, the word "appetit" here corresponds to "the appetite".

And the word "with" here corresponds to two Russian words [FOREIGN].

It means that we need some model to build those alignments.
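
To see why alignments matter, here is a minimal sketch that scores a sentence pair under a given alignment by multiplying word translation probabilities. It is a simplification under my own assumptions: the alignment is handed to us rather than modeled, the French toy example and its numbers are invented rather than taken from the slide, and the names T and aligned_prob are hypothetical.

```python
# Hypothetical word translation probabilities P(f_word | e_word),
# invented for illustration.
T = {
    "the": {"la": 0.5, "le": 0.5},
    "blue": {"bleue": 0.6, "bleu": 0.4},
    "house": {"maison": 0.9, "domicile": 0.1},
}


def aligned_prob(f_words, e_words, alignment, table):
    """Multiply P(f_j | e_i) over the aligned pairs.

    alignment[j] is the index i of the English word that foreign word j
    is aligned to. Several foreign words may share the same index.
    """
    prob = 1.0
    for j, i in enumerate(alignment):
        prob *= table[e_words[i]].get(f_words[j], 0.0)
    return prob


e = ["the", "blue", "house"]
f = ["la", "maison", "bleue"]

reordered = [0, 2, 1]  # "maison" <- "house", "bleue" <- "blue"
monotone = [0, 1, 2]   # word-by-word, ignoring the reordering

print(aligned_prob(f, e, reordered, T))
print(aligned_prob(f, e, monotone, T))
```

The same word table gives a sensible score with the reordered alignment and zero with the word-by-word one, so the sentence probability really does depend on which alignment we assume.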

Now, another example would be words that can appear or disappear.

For example, some articles or some auxiliary words can appear in one language and then,

they can just vanish in some other language.

This is why we need word alignment models,

and this is the topic of the next video.