In this example,

we will see linear regression.

But before we start, we need to define the multivariate and

univariate normal distributions.

The univariate normal distribution has the following probability density function.

It has two parameters, mu and sigma.

Mu is the mean of the random variable, and sigma squared is its variance.

Its functional form is given as follows.

It is a normalization constant, which ensures that the probability

density function integrates to 1, times the exponential of a parabola.
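The functional form referred to above is the standard univariate normal density (reconstructed here, since the formula itself lives on the slide):

```latex
\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)
```

The factor $1/\sqrt{2\pi\sigma^2}$ is the normalization constant, and the exponent $-(x-\mu)^2/(2\sigma^2)$ is the downward parabola with its maximum at $x = \mu$.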

The maximum value of this parabola is at the point mu,

and so the mode of the distribution is also at the point mu.

If we vary the parameter mu, we will get different probability densities.

For example, for the green one, we'll have the mu equal to -4, and for

the red one, we'll have mu equal to 4.

If we vary the parameter sigma squared,

we will get either a sharp distribution or a wide one.

The blue curve has the variance equal to 1, and

the red one has variance equal to 9.
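This behavior is easy to check numerically. Here is a minimal sketch of the density in Python (the function name `normal_pdf` is ours, not from the lecture):

```python
import math

def normal_pdf(x, mu, sigma2):
    """Univariate normal density N(x | mu, sigma^2)."""
    norm_const = 1.0 / math.sqrt(2 * math.pi * sigma2)
    return norm_const * math.exp(-(x - mu) ** 2 / (2 * sigma2))

# The density peaks at x = mu ...
print(normal_pdf(0, 0, 1) > normal_pdf(1, 0, 1))   # True
# ... and a larger variance gives a wider, flatter curve (lower peak).
print(normal_pdf(0, 0, 1) > normal_pdf(0, 0, 9))   # True
```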

The multivariate case looks exactly the same.

We have two parameters, mu and sigma.

Mu is the mean vector, and sigma is the covariance matrix.

We again have a normalization constant to ensure that the probability

density function integrates to 1, and a quadratic term in the exponent.

Again, the maximum value of the probability density function is at mu,

and so the mode of the distribution is also equal to mu.
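For reference, the standard multivariate normal density in D dimensions (the slide formula, reconstructed):

```latex
\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{D/2} \lvert \Sigma \rvert^{1/2}} \exp\left(-\frac{1}{2}(x-\mu)^{\top} \Sigma^{-1} (x-\mu)\right)
```

Here the quadratic term $-\frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)$ plays the role the parabola played in one dimension.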

In neural networks, for example, we have a lot of parameters.

Let's denote the number of parameters by D.

The sigma matrix has a lot of parameters, about D squared.

Actually, since sigma is symmetric, we need D(D+1)/2 parameters.

It may be really costly to store such a matrix, so we can use an approximation.

For example, we can use diagonal matrices.

In this case, all elements that are not on the diagonal will be zero,

and then we will have only D parameters.

An even simpler case has only one parameter;

it is called the spherical normal distribution.

In this case, the sigma matrix equals some scalar times the identity matrix.
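To make the storage savings concrete, here is a small sketch (the function name is ours) comparing the three covariance structures for a D-dimensional distribution:

```python
def covariance_param_counts(D):
    """Parameters needed to store a D x D covariance matrix
    under the three structures: full symmetric, diagonal, spherical."""
    full = D * (D + 1) // 2  # symmetric: upper triangle including the diagonal
    diagonal = D             # one variance per dimension
    spherical = 1            # a single scalar times the identity matrix
    return full, diagonal, spherical

print(covariance_param_counts(1000))  # (500500, 1000, 1)
```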

Now let's talk about linear regression.

In linear regression, we want to fit a straight line into data.

We fit it in the following way: we want to minimize the errors.

On the plot, the red line is the prediction, the blue points are the true values,

and we want, somehow, to minimize the lengths of those black lines.

The line is usually found by solving the so-called least squares problem.

Our straight line is parameterized by a weight vector w.

The prediction of each point is computed as w transposed times xi,

where xi is our point.

Then, we compute the sum of squared errors, that is,

the squared differences between the predictions and the true values.

And we try to find the vector w that minimizes this function.
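For the one-dimensional case (a single feature plus an intercept), the least squares problem has a well-known closed-form solution. A minimal sketch, with names of our choosing:

```python
def fit_line(xs, ys):
    """Fit y = w1 * x + w0 by minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: sample covariance of (x, y) divided by sample variance of x.
    w1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
          / sum((x - mean_x) ** 2 for x in xs))
    w0 = mean_y - w1 * mean_x
    return w0, w1

# Noise-free data on the line y = 2x + 1 is recovered exactly.
print(fit_line([0, 1, 2, 3], [1, 3, 5, 7]))  # (1.0, 2.0)
```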

Let's see how this works from the Bayesian perspective.

Here's our model.

We have three random variables, the weights, the data, and the target.

We're actually not interested in modeling the data, so we can write down the joint

probability of the weights and the target, given the data.

This will be given by the following formula.

It would be the probability of the target given the weights and the data,

times the probability of the weights.
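In symbols, the factorization just described is:

```latex
p(w, y \mid X) = p(y \mid X, w)\, p(w)
```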

Now we need to define these two distributions.

Let's assume them to be normal.

The probability of the target given the weights and

the data would be a Gaussian centered at the prediction, that is, w transposed times X,

with covariance equal to sigma squared times the identity matrix.

Finally, the probability of the weights would be a Gaussian centered around zero,

with covariance matrix sigma squared times the identity matrix.
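Written out in the transcript's notation (note that it uses the same sigma squared for both the likelihood and the prior), the two Gaussians are:

```latex
p(y \mid X, w) = \mathcal{N}\!\left(y \mid w^{\top} X,\ \sigma^2 I\right), \qquad
p(w) = \mathcal{N}\!\left(w \mid 0,\ \sigma^2 I\right)
```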