In the previous video, we introduced the value function and the optimal value function, and discussed how the latter can be found by a numerical solution of the Bellman optimality equation. The optimal policy can then be computed from the optimal value function once that function is known. But it turns out that this is not the only possible approach, and there is a more convenient one that suits the goals of Reinforcement Learning better. This approach is based on the so-called action-value function. Let's see how the action-value function differs from the value function. First, recall that the value function at a given time depends only on the state S and the policy pi. It does not explicitly contain information about the action that the agent takes now. The action-value function Q is a function of both the current state S_t and the action A_t taken in this state. It computes the same expected cumulative reward as the value function does, but conditions it on both S_t and A_t. Like the value function, the action-value function Q, or simply the Q-function, depends on the policy pi; its two arguments S and A are the current state and the current action. This means that in the Q-function, the first action is fixed and given by the value of A, while all later actions are determined by the policy pi. We can use this definition to obtain the Bellman equation for the Q-function, in the same way that we obtained the Bellman equation for the value function. What we have to do is again split the sum into the first term and the rest. The first term is the reward from taking the action A now. Because the action A is fixed, as it is an argument of the Q-function, this reward is fixed as well. The second term is the same sum, but starting from the next time step. As we just said, this term does not depend on A; it depends only on the policy pi. So it is simply the value function for the same problem, but starting from the next time step.
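The splitting described above can be written compactly. This is a minimal sketch in the Sutton–Barto style of notation; since the slides themselves are not shown here, the exact symbols are an assumption:

```latex
% Definition of the action-value function: the first action is fixed
% to a, all later actions follow the policy pi.
Q^{\pi}(s, a) = \mathbb{E}\!\left[\left.\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
    \,\right|\, S_t = s,\; A_t = a\right]

% Splitting off the first reward: the remaining sum depends only on pi,
% so it is the value function starting from the next time step.
Q^{\pi}(s, a) = \mathbb{E}\!\left[\left. R_{t+1} + \gamma\, V^{\pi}(S_{t+1})
    \,\right|\, S_t = s,\; A_t = a\right]
```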
And this gives us a Bellman equation for the Q-function, which is shown here. It expresses the Q-function at time t in terms of the reward at the same time t, plus a discounted expected value of the value function at the next moment. Such an equation would not be very convenient to work with, as it now involves two unknown functions instead of just one. But a more convenient equation can be obtained if we look at the optimal Q-function. The optimal Q-function is defined in the same way as the optimal value function. Namely, the optimal action-value function Q-star is the maximum of the Q-function over all possible policies pi. So the optimal Q-function says something like this: take action A now, and then follow the optimal policy pi-star later. Now, let's compare this with the definition of the optimal value function, which essentially says: take the optimal action as determined by the optimal policy, both now and later. But this means that the optimal value function V-star is simply the maximum of the optimal action-value function Q-star over all possible actions A taken now. In other words, it is the maximum of Q-star with respect to its second argument. Now, we can use these definitions to obtain a Bellman equation that involves the optimal action-value function Q-star. First, we take the maximum over all policies pi in the Bellman equation for the Q-function and use the definition of Q-star. This produces the first equation at the bottom of the slide. This equation relates Q-star and V-star. Next, we use the definition of V-star as the maximum of Q-star over all actions A, and this produces the last equation, which involves only the Q-star function. This equation is called the Bellman optimality equation for the action-value function Q. This is the equation that we will be working with next week, when we look into Q-learning.
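The two steps just described can be sketched as follows; again, the notation is an assumption in the Sutton–Barto convention, not a reproduction of the slide:

```latex
% Definitions of the optimal action-value and value functions:
Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a),
\qquad
V^{*}(s) = \max_{a} Q^{*}(s, a)

% Taking the maximum over pi in the Bellman equation for Q relates Q* and V*:
Q^{*}(s, a) = \mathbb{E}\!\left[\left. R_{t+1} + \gamma\, V^{*}(S_{t+1})
    \,\right|\, S_t = s,\; A_t = a\right]

% Substituting V* = max_a Q* gives the Bellman optimality equation
% for the action-value function:
Q^{*}(s, a) = \mathbb{E}\!\left[\left. R_{t+1} + \gamma \max_{a'} Q^{*}(S_{t+1}, a')
    \,\right|\, S_t = s,\; A_t = a\right]
```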
For now, let me make a few comments on how the state value function or the action-value function can be estimated from experience, following the machine learning paradigm. An agent can follow policy pi and maintain the average of the returns observed after each state it visits. If the number of visits to each state is large enough, then the average will converge to the value function for that state. We can also compute such averages separately for each action taken in each state, and in this case, the averages will converge to the values of the action-value function. In Reinforcement Learning, estimations of this kind are called Monte Carlo methods. They are simple enough as long as the number of states and actions is low. But if this number is large, or if states or actions are continuous, it may not be practical to keep separate averages for each state-action combination. In such cases, we should rely on some function approximation method for the action-value function. We will get back to these topics later in this course. But for now, I want to show you one of the most classical examples of Reinforcement Learning: the pole balancing problem. The task in this problem is to apply horizontal forces to a cart that can move left or right on a track of some length, so as to keep a pole hinged to the cart from falling from its initial vertical position. Failure happens when the pole falls past a given angle from the vertical line. Now, this can be formulated as an episodic task for a specified time period T. For each time step in which a failure did not occur, the agent gets a reward of plus one. If a failure occurs, then the episode terminates and the agent starts a new episode. This problem can also be formulated as a continuing task without a fixed time period T; it continues indefinitely, or until the pole falls. This setting requires using some discounting gamma to handle the infinite sums in the definition of the value functions.
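The Monte Carlo idea just described, averaging returns per state-action pair, can be sketched in a few lines of Python. The two-state MDP below is a hypothetical toy example invented for illustration, not part of the lecture; its dynamics are deterministic so that the averages match the exact Q-values:

```python
import random
from collections import defaultdict

# Hypothetical two-state MDP (illustrative only).
# State 0: action "L" -> reward 0, episode ends; "R" -> reward 1, go to state 1.
# State 1: either action -> reward 2, episode ends.
def step(state, action):
    if state == 0:
        return (0.0, None) if action == "L" else (1.0, 1)
    return (2.0, None)  # None marks the terminal state

def episode(policy, gamma=1.0):
    """Roll out one episode; return (state, action, return-to-go) triples."""
    path, state = [], 0
    while state is not None:
        action = policy(state)
        reward, state = step(state, action)
        path.append((path[-1][0] if False else None, action, reward)) if False else None
        path.append((0, action, reward))
    # NOTE: the line above is wrong placeholder -- see corrected loop below
    return path

def episode(policy, gamma=1.0):  # noqa: F811 -- corrected version
    """Roll out one episode; return (state, action, return-to-go) triples."""
    path, state = [], 0
    while state is not None:
        action = policy(state)
        reward, next_state = step(state, action)
        path.append((state, action, reward))
        state = next_state
    triples, g = [], 0.0
    for s, a, r in reversed(path):
        g = r + gamma * g          # discounted return-to-go
        triples.append((s, a, g))
    return triples

def mc_q_estimate(policy, n_episodes=2000, gamma=1.0):
    """Monte Carlo estimate: average return-to-go per (state, action) pair."""
    total, count = defaultdict(float), defaultdict(int)
    for _ in range(n_episodes):
        for s, a, g in episode(policy, gamma):
            total[(s, a)] += g
            count[(s, a)] += 1
    return {sa: total[sa] / count[sa] for sa in total}

random.seed(0)
q = mc_q_estimate(lambda s: random.choice(["L", "R"]))
# Dynamics are deterministic, so the averages equal the exact Q-values:
# q[(0, "R")] = 3.0, q[(0, "L")] = 0.0, and 2.0 for both actions in state 1.
```

With a large or continuous state space, the dictionaries `total` and `count` above become impractical, which is exactly where the function approximation methods mentioned in the lecture come in.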
In this case, the reward for a given step would be minus one for each failure and zero otherwise. For an MDP formulation of the pole balancing problem, we need a set of states and a set of actions. The states, in this case, will be the positions and velocities of the cart and the pole, and the actions will be horizontal forces applied to the cart. The pole balancing problem is one of the oldest problems in Reinforcement Learning, and it is used to this day as a standard test case for new Reinforcement Learning algorithms. Here, we can see a video showing the actions of a trained Reinforcement Learning agent for this task. More details on the pole balancing problem can be found in the book by Sutton and Barto. As we want to focus specifically on Reinforcement Learning in finance, we will not go into much detail on it. Starting in the next lesson, we will develop a financial model that is simple enough to serve as a test case for financial applications of Reinforcement Learning. So, let's wrap up this lesson with our traditional questions and quickly move on to the next lesson.
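For the curious, the episodic MDP formulation of the pole balancing problem described above can be sketched in a few dozen lines. This is an illustrative simplification with Euler integration and constants roughly following the classic formulation in Sutton and Barto; it is not the exact benchmark environment:

```python
import math
import random

# Illustrative physical constants (assumed, in the spirit of the classic task).
GRAVITY, CART_MASS, POLE_MASS, POLE_HALF_LEN = 9.8, 1.0, 0.1, 0.5
FORCE, DT = 10.0, 0.02
FAIL_ANGLE = 12 * math.pi / 180   # failure: pole past 12 degrees from vertical

def step(state, action):
    """One Euler integration step; action 0 pushes left, 1 pushes right.

    State = (cart position, cart velocity, pole angle, pole angular velocity),
    matching the MDP formulation in the lecture.
    """
    x, x_dot, th, th_dot = state
    f = FORCE if action == 1 else -FORCE
    total_mass = CART_MASS + POLE_MASS
    tmp = (f + POLE_MASS * POLE_HALF_LEN * th_dot**2 * math.sin(th)) / total_mass
    th_acc = (GRAVITY * math.sin(th) - math.cos(th) * tmp) / (
        POLE_HALF_LEN * (4.0 / 3.0 - POLE_MASS * math.cos(th)**2 / total_mass))
    x_acc = tmp - POLE_MASS * POLE_HALF_LEN * th_acc * math.cos(th) / total_mass
    return (x + DT * x_dot, x_dot + DT * x_acc,
            th + DT * th_dot, th_dot + DT * th_acc)

def run_episode(policy, max_steps=200):
    """Episodic formulation: +1 reward for each step until the pole falls."""
    state, total_reward = (0.0, 0.0, 0.0, 0.0), 0
    for _ in range(max_steps):
        state = step(state, policy(state))
        if abs(state[2]) > FAIL_ANGLE:
            break                  # failure terminates the episode
        total_reward += 1
    return total_reward

random.seed(1)
score = run_episode(lambda s: random.randint(0, 1))  # random agent as a baseline
```

A random policy like the one above fails quickly; a trained agent, such as the one shown in the lecture's video, would choose forces that keep the pole near vertical for the full episode.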