Now, let's talk about how we can compute the optimal policy in this model. We already know that the optimal policy, pi*, is a policy that maximizes the value function Vt, and the optimal value function, V*, satisfies the Bellman optimality equation shown here. If we know the dynamics, this equation can be solved using methods such as Value Iteration, which we reviewed in our previous course. But especially for the second and more interesting case, when the dynamics are unknown, it turns out that another version of the value function is more useful. This other version is called the action-value function.

The action-value function Q at time t is a function of two arguments, rather than the single argument Xt of the value function V. The first argument of the Q-function is the same Xt, while the second argument is the time-t action at. The Q-function is defined as an expectation of the same expression that defines the value function V, but this time conditioned on the first action at being exactly equal to a, while all subsequent actions follow policy pi. By applying the same splitting of the last term into the time-t term and a sum over future values of t, we can obtain the Bellman equation for the action-value function.

Now, this equation holds for the Q-function of any policy pi, but we are more interested in finding the optimal policy, pi*. This is defined in the same way as before: the optimal policy pi* should maximize the action-value function, as shown in the last equation here. The Bellman equation for the Q-function relates it not to itself, but rather to the value function V. But if we look at the optimal Q-function, Q*, we can obtain a Bellman equation that involves only this function. To this end, note that we actually have two equations here. The first one is an equation for Q* that is obtained by replacing pi with pi* in the Bellman equation for the Q-function.
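To make the known-dynamics case concrete, here is a minimal Python sketch of Value Iteration on a small discrete MDP. The Q-function is computed from V as a one-step lookahead, and the greedy policy is read off from it. The transition tensor P, reward matrix R, and discount gamma below are toy placeholders, not the model from this course.

```python
import numpy as np

def value_iteration(P, R, gamma=0.99, tol=1e-8):
    """Solve V*(x) = max_a [ R(x,a) + gamma * sum_x' P(x'|x,a) V*(x') ]
    for a small discrete MDP with known dynamics.
    P: (n_actions, n_states, n_states) transition probabilities,
    R: (n_states, n_actions) rewards."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q(x, a) = R(x, a) + gamma * E[ V(X') | x, a ]
        Q = R + gamma * np.einsum('aij,j->ia', P, V)
        V_new = Q.max(axis=1)          # V(x) = max_a Q(x, a)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

# toy 2-state, 2-action MDP (illustrative numbers only)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.1, 0.9], [0.8, 0.2]]])
R = np.array([[1.0, 0.0],
              [0.0, 1.0]])
V, Q = value_iteration(P, R)
pi_star = Q.argmax(axis=1)  # greedy policy extracted from the Q-function
```

Note that once the optimal Q-function is known, the greedy policy is a simple argmax over its second argument, which is exactly why the Q-function is the more convenient object when the dynamics are unknown.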
But there is also a second equation here that says that the optimal value function, V*, is obtained from Q* when its second argument at is chosen optimally. So if we substitute the second of these equations into the first one, we obtain an equation that contains only Q*. This is called the Bellman optimality equation for the action-value function, and it plays a central role in reinforcement learning.

The Bellman optimality equation is a backward equation that should be solved backward in time, starting from time T-1, together with a terminal condition for Q* given in terms of the distribution of the portfolio value Pi_T at maturity T. The optimal action at time t is therefore given by the value of at that maximizes the Q-function. This is sometimes referred to as a greedy policy: it simply picks the current action that maximizes the Q-function, without worrying about how this action will impact other actions in the future.

Now, if we substitute the explicit form of the reward function r into the Bellman optimality equation, we get this equation. What this equation shows is that the Q-function is quadratic in the action at, and therefore easy to maximize with respect to at.

We can also check what happens with these formulas if we take the limit of the risk aversion parameter lambda going to 0. In this limit, we get the first equation shown here. But now we can replace the optimal Q-function with minus the portfolio value, whose expectation is exactly the mean option price C hat, as we discussed earlier. Therefore, by using this and flipping the overall sign, the first equation can be written as the second one. But the second equation is something that we already saw: it's a recursive relation for the mean option price that has the right Black-Scholes limit when the time steps delta t are very small. Therefore, we see that our formulas reduce to the Black-Scholes model when both the risk aversion lambda and the time step delta t vanish.
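Because the Q-function is quadratic in the action at, the backward recursion can be sketched directly: at each step the greedy action is the analytic maximizer of a concave parabola, and V*_t is Q*_t evaluated at that action. The quadratic coefficients c2, c1, c0 and the action-independent transition matrix below are synthetic placeholders standing in for the model's actual expressions, which this sketch does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_states, gamma = 4, 5, 0.99

# Hypothetical quadratic-in-action coefficients for each (t, x); c2 < 0 makes
# Q_t(x, a) = c2*a^2 + c1*a + c0 + gamma*E[V*_{t+1}] concave in a, so the
# greedy action has the closed form a* = -c1 / (2*c2).
c2 = -np.abs(rng.normal(1.0, 0.1, (T, n_states)))
c1 = rng.normal(0.0, 1.0, (T, n_states))
c0 = rng.normal(0.0, 1.0, (T, n_states))

# toy action-independent transition matrix (a simplifying assumption)
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)

V = np.zeros((T + 1, n_states))         # terminal condition V*_T (zero here)
a_star = np.zeros((T, n_states))

for t in reversed(range(T)):            # solve backward from t = T-1
    cont = gamma * P @ V[t + 1]         # expected continuation value E[V*_{t+1} | x]
    a_star[t] = -c1[t] / (2.0 * c2[t])  # analytic maximizer of the quadratic
    V[t] = c2[t] * a_star[t]**2 + c1[t] * a_star[t] + c0[t] + cont  # V* = max_a Q*
```

The key design point this illustrates is that no numerical optimization over actions is needed at any step: concavity of Q in the action turns the max in the Bellman optimality equation into a closed-form expression.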