Applying heuristic search to planning in reinforcement learning has
many advantages.
Essentially, computational resources are focused only on valuable paths and
on the nearest states, which contribute the most to the return.
However, if we don't know the precise model of the world and
approximate it with some sort of function approximation,
the model can in fact be worse than the current value estimates.
In that case, lookaheads based on the approximate model can spoil the learning
and make the estimates of an otherwise reliable value function less precise.
Remember, it only makes sense to plan with a model of the world
if that model is more accurate than the current value function estimates.
So beware.
Another disadvantage of using heuristic
search is that it obviously depends on the quality of the heuristic.
We will talk about this a little bit later.
One way to obtain a heuristic is to estimate the returns with Monte Carlo.
As I said previously, if we limit the horizon of a lookahead search,
we should estimate the value of possible continuations from the leaves onward.
These leaf node values can be computed with the help of function approximation:
we can learn the state values with function approximation,
just as we did before in the model-free setting.
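To make this concrete, here is a minimal sketch of a depth-limited lookahead that bootstraps from a learned value function at the leaves. The `model` interface (`step`, `actions`) and `value_fn` are hypothetical placeholders for illustration, not a specific library API, and the model is assumed deterministic for simplicity.

```python
# Sketch: depth-limited lookahead where leaf states are evaluated
# with a learned value function approximator (value_fn).
# `model.step` and `model.actions` are assumed/hypothetical interfaces.

def lookahead_value(state, depth, model, value_fn, gamma=0.99):
    """Best value reachable from `state` within `depth` steps of the model."""
    if depth == 0:
        # At the search horizon, fall back on the function approximator
        # instead of expanding the tree any further.
        return value_fn(state)
    best = float("-inf")
    for action in model.actions(state):
        next_state, reward, done = model.step(state, action)  # approximate model
        if done:
            value = reward
        else:
            value = reward + gamma * lookahead_value(
                next_state, depth - 1, model, value_fn, gamma)
        best = max(best, value)
    return best


def lookahead_policy(state, depth, model, value_fn, gamma=0.99):
    """Pick the action whose depth-limited lookahead value is highest."""
    def q(action):
        next_state, reward, done = model.step(state, action)
        if done:
            return reward
        return reward + gamma * lookahead_value(
            next_state, depth - 1, model, value_fn, gamma)
    return max(model.actions(state), key=q)
```

Note that the value function enters only at the leaves; everything above the horizon is computed by unrolling the (possibly approximate) model.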
However, until today we were not allowed to use a model of the world;
now we can try to make use of such a model
instead of relying on a complex parametric function approximator.
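For contrast, here is a hedged sketch of that alternative: instead of a parametric value function at the leaves, we estimate a leaf's value by averaging the returns of a few rollouts simulated in the model itself, which is one way to realize the Monte Carlo heuristic mentioned above. It reuses the same hypothetical `model` interface as the previous sketch, and the rollout policy is assumed to be uniformly random.

```python
import random

def rollout_value(state, model, gamma=0.99, n_rollouts=10, max_steps=50):
    """Monte Carlo leaf estimate: average discounted return of random
    rollouts simulated in the (approximate) model."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = state, 0.0, 1.0
        for _ in range(max_steps):
            a = random.choice(model.actions(s))
            s, reward, done = model.step(s, a)
            ret += discount * reward
            discount *= gamma
            if done:
                break
        total += ret
    return total / n_rollouts

# Plugging it in as the leaf evaluator of the lookahead sketched earlier:
# action = lookahead_policy(s0, depth=3, model=model,
#                           value_fn=lambda s: rollout_value(s, model))
```

Of course, the quality of these rollout estimates is only as good as the model they are simulated in, which brings us back to the caveat from the beginning of this section.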