We said that the goal of the Markov decision process model is to maximize the expected total reward. Because this problem has to be solved now, while the rewards will only be obtained in the future, it is solved in terms of a policy, which gives a rule for what action should be taken in any possible state of the world. But how can we quantitatively compare different possible policies for the same MDP problem? We simply compute the expected total reward, the same quantity that we want to maximize, but with one important addition. Namely, in this calculation we condition on the current state of the system, S_t. This is done in recognition of the fact that we can get different total rewards if we start from different initial states of the system. Such a conditional expectation of the total cumulative reward is called the value function. For each possible state S_t, the value function gives us the value of that state for our ultimate task of maximizing the total reward using policy pi. This state is therefore an argument of the value function. In addition, the value function V has a superscript pi to emphasize its dependence on the policy chosen to accumulate the rewards. Finally, it has a subscript that specifies the time at which the system has the state vector S. We use discrete notation for time, as we work in discrete time with Markov decision processes, and therefore in this formula we show the value function defined for time zero by using the subscript zero here.

Next, let's see how we can evaluate this expression. Let's separate the first term from the sum and write it as the first term plus a similar sum, but starting with the second term of the original sum. Next, we replace the expectation of the sum with the sum of expectations. Now, the first term is just the reward from the current step. The second term is the expectation of the same value function at the next time moment, when the system is in some other state S' that serves as its argument, multiplied by the discount factor gamma. We can now replace the expectation sign in this term by an explicit sum that involves the probabilities of transitions from the current state to all possible next states. This produces a recursive relation for the value function that is called the Bellman equation. It was proposed in the 1950s by Richard Bellman in the context of his pioneering work on dynamic programming. We will encounter a few other equations a bit later that also carry the name of Bellman, so it will be important to understand the differences between them. Now, a slightly more general version of the same Bellman equation that I just presented refers to the case when the reward R depends on the next state. All that is needed in this case is to put the reward inside the expectation, so that the Bellman equation takes the form shown here.

Now, a special case arises when the Markov decision process is such that time does not appear in it as an independent variable. That means that for such an MDP the transition probabilities and rewards cannot explicitly depend on time, and the final time T is infinite, so there is an infinite sum in the definition of the value function. In this case the value function will also be independent of time, and therefore we can drop the time index from it. The Bellman equation for this case becomes simpler: a system of linear equations, one equation for each possible state S. So, if our discrete set of states has N states, we will have N such linear equations.
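To make that last point concrete, here is a minimal sketch in Python of how such a system can be written down and solved for a fixed policy. NumPy, the function name evaluate_policy, and all the toy numbers below are my own illustrative choices and are not taken from the course; the only thing the sketch relies on is the time-independent Bellman equation itself, V^pi(s) = R(s) + gamma * sum over s' of P(s'|s) * V^pi(s'), where P and R are the transition probabilities and rewards induced by following the policy pi.

```python
# A minimal sketch (my own illustration, not from the video): evaluating a fixed
# policy pi by solving the N linear Bellman equations  V = R + gamma * P * V,
# i.e. (I - gamma * P) V = R, where P and R are already those induced by pi.
import numpy as np

def evaluate_policy(P_pi, R_pi, gamma):
    """P_pi[s, s'] : transition probability from s to s' under the policy's action.
    R_pi[s]       : expected one-step reward in state s under the policy.
    gamma         : discount factor, 0 <= gamma < 1."""
    n_states = P_pi.shape[0]
    # One linear equation per state: V(s) - gamma * sum_s' P(s'|s) V(s') = R(s).
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Toy three-state example; all numbers are made up purely for illustration.
P_pi = np.array([[0.8, 0.2, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])
R_pi = np.array([1.0, 0.0, 2.0])
print(evaluate_policy(P_pi, R_pi, gamma=0.9))  # value of each state under this policy
```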
If the transition probabilities are known, we can easily solve this linear system, as in the sketch above, using standard methods of linear algebra. Next, we introduce an optimal value function called V-star. The optimal value function for a state is simply the highest value of the value function for that state among all possible policies. So, the optimal value function is attained for some optimal policy that we will call pi-star. The important point here is that an optimal policy pi-star is optimal for all states of the system. This means that V-star should be larger than or equal to the value function obtained with any other policy, for any state S. Now, we can express the optimal value function in terms of itself, similarly to how we derived the Bellman equation for a value function with a fixed given policy pi. Formally, this can be done by simply applying the max operator to both sides of the Bellman equation. Then on the left-hand side we will have V-star by definition, while on the right-hand side we will have the same optimal value function at the later time. But in addition to replacing the value function by V-star on the right-hand side, we also have to make sure that the current action at time t is optimal. This expresses Bellman's famous principle of optimality, which states that the optimal cumulative reward is obtained by taking an optimal action now and following an optimal policy later. By following this principle we obtain the Bellman optimality equation for V-star shown on this slide. Because of the max operator standing on the right-hand side, this is now a non-linear equation that usually needs to be solved numerically. In the next video we will talk about simple numerical algorithms that can be used to this end.

But before that, let's discuss how we can find the optimal policy if we already know the optimal value function. This part is easy. The Bellman optimality equation involves taking the maximum, over all possible choices of the action at time t, of its second term. But this selection is exactly what should be prescribed by the optimal policy itself. This gives us a way to find the optimal policy for a given state and time: it is the choice of action for this state and time that maximizes the second term in the Bellman optimality equation. So, let's take a quick look at the questions for this video and then move on to methods for solving the Bellman optimality equation in the next video.
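As a preview of what such a numerical solution can look like, here is a hedged sketch of value iteration, one standard algorithm for the time-independent case (the next video may use a different method or notation). It repeatedly applies the right-hand side of the Bellman optimality equation, V-star(s) = max over a of [ R(s, a) + gamma * sum over s' of P(s'|s, a) * V-star(s') ], and, once the values stop changing, reads off the optimal policy as the action that attains the maximum in each state. The array names, shapes, and numbers are my own illustrative assumptions.

```python
# A minimal sketch, assuming a finite, time-independent MDP with made-up conventions:
# P[a, s, s'] = probability of going from s to s' under action a, R[s, a] = reward.
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Iterate V(s) <- max_a [ R(s, a) + gamma * sum_s' P(s'|s, a) V(s') ]."""
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[s, a]: reward now plus the discounted expected value of the next state.
        Q = R + gamma * (P @ V).T        # (P @ V) has shape (A, S); transpose to (S, A)
        V_new = Q.max(axis=1)            # the max operator of the Bellman optimality equation
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    # The optimal (greedy) policy picks the action that attains the maximum in each state.
    return V, Q.argmax(axis=1)

# Toy example with 2 actions and 3 states; every number is purely illustrative.
P = np.array([[[0.9, 0.1, 0.0],          # action 0
               [0.1, 0.8, 0.1],
               [0.0, 0.1, 0.9]],
              [[0.2, 0.8, 0.0],          # action 1
               [0.0, 0.2, 0.8],
               [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 1.0],
              [0.0, 1.0],
              [2.0, 0.0]])               # R[s, a]
V_star, pi_star = value_iteration(P, R, gamma=0.9)
print(V_star, pi_star)                   # optimal state values and greedy actions
```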