So, you've just seen what evolution strategies are capable
of, and hopefully you were impressed by the practical results on the previous slide.
Now it's high time to find out whether they
are actually that good, or whether there is a drawback.
It turns out that there is,
otherwise there would be no point in extending our course for five more weeks.
And the drawbacks turn out to be shared between
evolution strategies and other black-box methods like the cross-entropy method,
since they treat the entire interaction process,
the entire trajectory sampling, as a black box
and only ever consider full trajectories.
The first drawback is that to even begin training,
each of those methods needs a full trajectory, from the start all the way to a terminal state.
This is required to estimate the total reward, and it is of course
a more or less reasonable thing to demand,
depending on how your reward function is currently formalized.
However, in many cases it will be logistically impractical.
For example, a trajectory may technically contain
an infinite number of time steps:
you might want your robot to keep walking forward indefinitely,
instead of running a fixed episode of, say, ten seconds.
And in other cases,
the trajectories might be finite but very long: say,
five minutes of sampling per trajectory on your current PC.
This does break your algorithm, although you can of course patch
it up with an ever darker flavor of duct tape.
But if you think about it for a moment,
this reveals that those methods train in a very different way from how we humans learn.
You there, on the other side of the screen, already know a lot of cool stuff:
how to walk upright, use a computer,
surf the internet, maybe ride a bicycle, maybe swim.
There are a lot of capabilities you've learned, reading at the very least.
Now, all of them are fairly complicated, yet you've learned them
without ever getting a single full trajectory of your life.
You are presumably still alive, therefore
you have never observed a full trajectory of a life from birth to demise.
And this basically means that you managed to train from partial experience.
This is one property that we humans have that our current algorithms do not have.
This, in fact, is the main objective for next week: to find another family of
algorithms that works from partial experience and
is capable of training even before it has finished a single session.
Another common drawback is that
the cross-entropy method, evolution strategies, and their relatives generally require a lot of samples.
Just think about it.
The cross-entropy method asks you to sample, say,
100 full trajectories.
You go out of your way to sample them, and then it just casually discards most of them.
It extracts no further information from the non-elite ones,
it simply drops them.
Technically, it does use them to estimate the reward threshold, but in general,
this is not the thing you would
want your method to do with trajectories you went to such lengths to collect.
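To make that concrete, here is a minimal sketch of a single cross-entropy iteration; `sample_trajectory` is a hypothetical helper standing in for your environment interaction, and the 100 samples and 70th percentile are just illustrative numbers:

```python
import numpy as np

def cem_iteration(sample_trajectory, policy_params, n_samples=100, percentile=70):
    # Play n_samples FULL episodes with the current policy parameters.
    # `sample_trajectory` is assumed to return (states, actions, total_reward)
    # for one complete episode.
    sessions = [sample_trajectory(policy_params) for _ in range(n_samples)]
    rewards = np.array([total_reward for (_, _, total_reward) in sessions])

    # All the collected rewards are used only to estimate the elite threshold...
    threshold = np.percentile(rewards, percentile)

    # ...and every trajectory below that threshold is simply thrown away.
    elite = [(states, actions) for (states, actions, r) in sessions if r >= threshold]

    # The policy would then be refit on the elite (state, action) pairs only.
    return elite, threshold
```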
Evolution strategies are a bit better in this respect:
they do have the same problem, but they don't formally throw everything away.
However, on every iteration, to make just one improvement,
you're required to sample several full trajectories
from different values of theta drawn from a Gaussian around your current parameters,
which is also rather wasteful most of the time.
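Likewise, here is a rough sketch of one evolution-strategies update under the same assumptions; `evaluate_return` is a hypothetical helper that plays one full episode, and sigma, the learning rate, and the number of perturbations are placeholder values:

```python
import numpy as np

def es_update(evaluate_return, mu, sigma=0.1, lr=0.01, n_perturbations=50):
    # `evaluate_return` is assumed to run one FULL episode with the given
    # parameter vector and return its total reward.
    dim = mu.shape[0]

    # Sample several parameter vectors theta = mu + sigma * epsilon,
    # i.e. draws from a Gaussian centred on the current parameters...
    noise = np.random.randn(n_perturbations, dim)
    returns = np.array([evaluate_return(mu + sigma * eps) for eps in noise])

    # ...and all of those full trajectories buy just ONE small step on mu.
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return mu + lr / (n_perturbations * sigma) * noise.T @ returns
```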
So instead, there is a way to train from partial experience,
which again is going to be covered in more detail next week.
But so far, the main condition under which
those algorithms apply well is that you can get a lot of cheap samples.
So, if you have an emulator, a small and cheap model of your environment,
it's perfectly okay to use any of them.
They'll converge pretty fast, and sometimes they're going to be
even more efficient than the more complicated methods we're going to see later.
But once the sampling gets more expensive,
say, you are driving an actual physical robotic car down an actual physical street,
you won't be able to apply those methods as
efficiently, because the main bottleneck is going to be sampling 100 trajectories,
which means something like 100 trips through a busy city
by a physical robotic car under the oversight of a physical driver.
So, this is where the limitation comes.
And in general, you can think of
those black-box algorithms as ones that sacrifice a lot of agent-environment interaction,
a lot of samples, per one unit of improvement.
But when such interaction is cheaply available, they're not that bad.
Now, to say it for the third time:
the whole of week two,
week three, and partially week four
is going to be dedicated to learning ways to fix this problem.
To train from partial experience,
to be more sample efficient.
We will find a lot of cool ways to improve there. Until then.