In this video, I'd like to talk about how to evaluate a hypothesis that has been
learned by your algorithm. In later videos we'll build on this to
talk about how to prevent the problems of overfitting and underfitting as well.
When we fit the parameters of our learning algorithm, we think about
choosing the parameters to minimize the training error.
One might think that getting a really low value of training error might be a good
thing, but we've already seen that just because a hypothesis has low training
error, that doesn't mean it's necessarily a good hypothesis.
And we've already seen the example of how hypotheses can overfit, and therefore
fail to generalize to new examples, not in the training set.
So, how do you tell if a hypothesis might be overfitting?
In this simple example, we could plot the hypothesis h of x and just see what was
going on. But in general, for problems with more than just one feature, for
problems with a large number of features like these, it becomes hard, or maybe
impossible, to plot what the hypothesis function looks like, and so
we need some other way to evaluate our hypothesis.
The standard way to evaluate a learned hypothesis is as follows.
Suppose we have a data set like this. Here, I've just shown ten training
examples, but of course, usually we may have dozens or hundreds or maybe
thousands of training examples. In order to make sure we can evaluate our
hypothesis, what we are going to do is split the data we have into two portions.
The first portion is going to be our usual training set.
And the second portion is going to be our test set.
And a pretty typical split of all the data we have into a training set
and a test set might be around, say, a 70%-30% split, with more of the data going
to the training set and relatively less to the test set.
So if we have some data set, we assign, say, 70% of the data to be our
training set, where m is, as usual, our number of training examples.
And the remainder of our data might then be assigned to become our test set.
Here I'm going to use the notation m subscript test to denote the number of
test examples, and in general this subscript test is going to denote
examples that come from the test set. So that x one subscript test, comma,
y one subscript test is my first test example, which I guess, in this example,
might be this example over here. Finally, one last detail.
Whereas here I've drawn this as though the first 70% goes to the training set
and the last 30% to the test set, if there is any sort of ordering to the data,
it would be better to send a random 70% of your data to the training set and
a random 30% of your data to the test set. So if
your data were already randomly ordered, you could just take the first 70% and
last 30%. But if your data were not randomly
ordered, it would be better to randomly shuffle, or randomly reorder, the
examples before sending the first 70% to the
training set and the last 30% to the test set.
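As a concrete aside, here is a minimal sketch of that shuffle-and-split step in Python with NumPy; the array names X and y, the seed, and the exact 70/30 fraction are assumptions of mine for illustration, not anything fixed by the lecture.

```python
import numpy as np

def train_test_split(X, y, train_frac=0.7, seed=0):
    """Randomly shuffle the examples, then split them into train and test portions."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])      # random reordering of the example indices
    n_train = int(train_frac * len(perm))   # size of the training portion (70% here)
    train_idx, test_idx = perm[:n_train], perm[n_train:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```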
Here then, is a very typical procedure for how you would train and test a
learning algorithm, maybe linear regression.
First, you learn the parameters theta from the training set; that is, you
minimize the usual training error objective J of theta, where J of theta
here is defined using only that 70% of all the data you have, the training data.
And then you would compute the test error, which I'm going to denote
J subscript test. What you do is take the parameters theta that you've
learned from the training set, plug them in here, and compute your test set
error. This is basically the average squared error as measured on your test
set; it's pretty much what you would expect, where you run every test example
through your hypothesis with parameters theta and measure the squared error of
the hypothesis on your m subscript test test examples. I'm going to write it as follows.
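In symbols, the test set error being described here, written in the usual notation, is:

```latex
J_{\text{test}}(\theta) = \frac{1}{2 m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} \left( h_\theta\big(x^{(i)}_{\text{test}}\big) - y^{(i)}_{\text{test}} \right)^2
```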
And of course, this is the definition of the test set
error if we are using linear regression and using the squared error metric.
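As a quick sketch of that computation in Python, again assuming NumPy arrays X_test (one test example per row, including the usual intercept column) and y_test, names of my own choosing:

```python
import numpy as np

def j_test_linear(theta, X_test, y_test):
    """Average squared error of the linear hypothesis h_theta(x) = theta . x on the test set."""
    m_test = X_test.shape[0]
    residuals = X_test @ theta - y_test   # h_theta(x_test) - y_test for every test example
    return (residuals @ residuals) / (2 * m_test)
```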
How about if we were doing a classification problem,
and say, using logistic regression instead?
In that case, the procedure for training and testing logistic regression is
pretty similar. First, we learn the parameters from
the training data, that first 70% of the data.
And then we compute the test error as follows.
It's the same objective function as we always use for logistic regression, except
that it is now defined using our m subscript test test examples.
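Written out, that logistic regression test cost, again in the usual notation, is:

```latex
J_{\text{test}}(\theta) = -\frac{1}{m_{\text{test}}} \sum_{i=1}^{m_{\text{test}}} \Big[ y^{(i)}_{\text{test}} \log h_\theta\big(x^{(i)}_{\text{test}}\big) + \big(1 - y^{(i)}_{\text{test}}\big) \log\big(1 - h_\theta\big(x^{(i)}_{\text{test}}\big)\big) \Big]
```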
While this definition of the test set error J subscript test is perfectly
reasonable, sometimes there's an alternative test set metric that might
be easier to interpret, and that's the misclassification error. It's also called
the zero-one misclassification error, with zero-one denoting that you either
get an example right, or you get an example wrong.
Here's what I mean. Let me define the error of a prediction h of x,
given the label y, as equal to one if my hypothesis outputs a value greater
than or equal to 0.5 and y is equal to zero, or if my hypothesis outputs a value
less than 0.5 and y is equal to one.
So both of these cases correspond to your hypothesis mislabeling the
example, assuming you threshold it at 0.5:
either it thought the label was more likely to be one but it was actually zero,
or it thought the label was more likely to be zero but it was actually one.
Otherwise, we define this error function to be zero, meaning your hypothesis
classified the example y correctly. We could then define the test error,
using the misclassification error metric, to be one over m subscript test of the
sum from i equals one to m subscript test of the error
of h of x i test, comma, y i test. And that's just my way of writing out
that this is exactly the fraction of the examples in my test set that my
hypothesis has mislabeled. So that's the definition of the test
set error using the misclassification error, or zero-one misclassification
error, metric. And that's the standard technique for
evaluating how good a learned hypothesis is.
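Putting the zero-one metric into code, here is a minimal sketch under the same assumptions as the earlier snippets (NumPy arrays, a logistic hypothesis that outputs values between zero and one, and function names that are mine, not the lecture's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def misclassification_error(theta, X_test, y_test):
    """Fraction of test examples mislabeled when thresholding h_theta(x) at 0.5."""
    h = sigmoid(X_test @ theta)             # hypothesis outputs on the test set
    predictions = (h >= 0.5).astype(int)    # predict 1 when h >= 0.5, else 0
    return np.mean(predictions != y_test)   # err is 1 exactly when prediction != label
```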
In the next video, we'll adapt these ideas to help us do things like
choose which features, such as the degree of polynomial, to use with a learning
algorithm, or choose the regularization parameter for a learning algorithm.