In the last two videos,

we learned about the most important idea of machine learning,

namely the idea of

generalization as the purpose of learning.

We found that for a general regression problem,

the generalization error is

given by the expected square loss,

where the expectation is taken over all data,

both seen and unseen.

Lastly, the no free lunch theorem

stated that the generalization error for

a given machine learning algorithm is determined by

properties of the data-generating process.

So the key point here is that

we should care about all data,

both seen and unseen.

But how can we do it if all

we have is one particular dataset?

By definition, we can't have any unseen data.

What we do is, we split our dataset into two parts.

Let's say we have our whole data

in the form of a design matrix X.

The design matrix stores the data row-wise, so that

each record has D features and there are N rows in total.

For each record, we have an output y.

So the vector y of

the N true outputs is the second part of our data.
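As a minimal sketch, with made-up numbers (N = 5 records, D = 3 features), the two parts of the data look like this:

```python
import numpy as np

# Hypothetical design matrix: N = 5 records (rows), D = 3 features (columns).
X = np.array([
    [1.0, 2.0, 0.5],
    [0.3, 1.1, 2.2],
    [2.5, 0.7, 1.8],
    [0.9, 1.5, 0.1],
    [1.7, 0.2, 2.9],
])

# One output per record, collected in a vector y of length N.
y = np.array([3.1, 2.4, 4.0, 1.8, 3.5])

print(X.shape)  # (5, 3): N rows, each with D features
print(y.shape)  # (5,): one output per row
```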

The key assumption we make here is that all pairs

of rows in X and y are i.i.d.,

that is, independent, identically distributed

samples from some unknown data-generating

distribution p_data.

Now let us split the data matrix X

into two data matrices,

that we call X_train and X_test of

the same width D and lengths N_train and N_test.

The same way we split the vector Y

of length N into two vectors,

Y_train and Y_test of lengths N_train and N_test.

Now, as all entries in the original set of pairs of X and

y are i.i.d. samples from

some data-generating distribution p_data,

the same is true for both sets X_train,

y_train and X_test, y_test separately.
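A minimal sketch of such a split, assuming a random shuffle and an 80/20 ratio (both are my choices for illustration, not prescribed by the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N = 10 i.i.d. samples with D = 2 features each.
N, D = 10, 2
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Shuffle the row indices, then cut once: because the pairs are i.i.d.,
# a random partition leaves both parts i.i.d. draws from the same p_data.
perm = rng.permutation(N)
n_train = int(0.8 * N)  # assumed 80/20 ratio
train_idx, test_idx = perm[:n_train], perm[n_train:]

X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
```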

Now we do a simple trick which is

absolutely critical for most machine learning tasks.

We simply withhold X_test

and y_test from the process of training the model.

That is, the model is first trained using

only a training set X_train, y_train.

In doing this, the model fine-tunes its parameters

so that its predicted values of y match

the values in y_train as closely as possible.

Now, when the model is trained,

we use the second set of pairs,

X_test and y_test to estimate

the unknown generalization error

using its empirical mean.

That is, we run the model on the yet-unseen inputs X_test

and compare the predictions of

the model with the true outputs y_test.

Because the model didn't see or use the pairs X_test,

y_test in training, this is a fair comparison.

We take this empirical error

as an estimate of the true generalization error,

and so we measure the performance of the model using

the holdout set of pairs X_test and y_test.

As long as all data points are i.i.d. samples from

the true data-generating

distribution, and as long as the set X_test,

y_test is large enough,

this procedure gives a good estimate of

the true and unknown generalization error.
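As a sketch of this estimation step, the empirical mean of the square loss on the holdout pairs is just the test-set mean squared error. The ordinary-least-squares model and the toy data below are my own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression data: y = 2*x + noise, already split into train and test.
X_train = rng.uniform(-1, 1, size=(80, 1))
y_train = 2.0 * X_train[:, 0] + 0.1 * rng.normal(size=80)
X_test = rng.uniform(-1, 1, size=(20, 1))
y_test = 2.0 * X_test[:, 0] + 0.1 * rng.normal(size=20)

# Train on (X_train, y_train) only: ordinary least squares.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# Estimate the generalization error by the empirical mean of the
# square loss over the withheld pairs (X_test, y_test).
y_pred = X_test @ w
test_mse = np.mean((y_pred - y_test) ** 2)
print(test_mse)  # roughly the irreducible noise variance, 0.1**2
```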

Now the main question is how to find

the model with the smallest generalization error.

To illustrate this process,

let's consider this graph.

On the x-axis, we plot what is called the model capacity.

We won't need an exact definition of this term,

but we can think of model capacity

as constraints on the model architecture

that ultimately determine the model's ability

to fit arbitrary functions.

For example, a restriction to a linear architecture

means that we can only

consider the class of linear regressions.

So we have the model capacity on

the x-axis and on the y-axis,

we will have either the train error or the test error.

First, let's see the behavior of

the train error when we increase the model capacity.

As we see, this is typically

a monotonically decreasing function

of model capacity.

This is because by making

a model progressively more complex,

we can eventually fit

any given training data set arbitrarily well.

So the training error always decreases

with increased complexity of the model.

Now, let us see a typical behavior of the test error.

Here, we see something different.

Initially, the test error

drops roughly in sync with the drop of the train error.

But then it typically reaches

some minimum point after which it starts to grow.

This behavior when the test error starts to grow at

a certain level of model complexity

is called over-fitting.

A proper balance in this example is attained at

this point where the test error is at the minimum.

To the left of this point,

our model under-fits because it's too simple

and, as a result, has high bias.

To the right of this point,

our model over-fits because it's too complex,

leading to a high variance.

The position and the height of

the optimal point are a property of the data.

In particular, this is because the test error picks

up any irreducible noise in the data

itself, as we remember

from the bias-variance decomposition.

In other words, the optimal model complexity needs to match

the complexity of the data in order to

avoid either under-fitting or over-fitting.
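The whole picture can be reproduced in a small experiment. Here, model capacity is taken to be polynomial degree, and the target function, sample sizes, and noise level are made-up choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy samples of a smooth target function (made-up example).
def target(x):
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, size=15)
y_train = target(x_train) + 0.2 * rng.normal(size=15)
x_test = rng.uniform(0, 1, size=200)
y_test = target(x_test) + 0.2 * rng.normal(size=200)

def mse(y_hat, y):
    return np.mean((y_hat - y) ** 2)

# Sweep model capacity = polynomial degree, fitting on the train set only,
# and record the train and test errors at each capacity.
train_errs, test_errs = {}, {}
for degree in [0, 1, 3, 9, 14]:
    coefs = np.polyfit(x_train, y_train, degree)
    train_errs[degree] = mse(np.polyval(coefs, x_train), y_train)
    test_errs[degree] = mse(np.polyval(coefs, x_test), y_test)
    print(degree, round(train_errs[degree], 3), round(test_errs[degree], 3))

# The train error keeps falling as the degree grows, while the test error
# first falls and then grows again once the model starts fitting the noise.
```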