Reinforcement learning is a problem formulation for sequential decision making under uncertainty. Earlier, we learned that the agent's role in this interaction is to choose an action on each time step. The choice of action affects both the immediate reward and the next state. In this video, we will describe policies: how an agent selects its actions. By the end of this video, you'll be able to: recognize that a policy is a distribution over actions for each state, describe the similarities and differences between stochastic and deterministic policies, and generate valid policies for a given MDP, or Markov Decision Process.

In the simplest case, a policy maps each state to a single action. This kind of policy is called a deterministic policy. We will use the Greek letter Pi to denote a policy. Pi of S represents the action selected in state S by the policy Pi. In this example, Pi selects the action A1 in state S0, and action A0 in states S1 and S2. We can visualize a deterministic policy with a table. Each row gives the action chosen by Pi in one state. Notice that the agent can select the same action in multiple states, and some actions might not be selected in any state.

Consider the example shown here, where an agent moves towards its house on a grid. The states correspond to locations on the grid. The actions move the agent up, down, left, and right. The arrows describe one possible policy, which moves the agent towards its house. Each arrow tells the agent which direction to move in each state.

In general, a policy assigns probabilities to each action in each state. We use the notation Pi of A given S to represent the probability of selecting action A in state S. A stochastic policy is one where multiple actions may be selected with non-zero probability. Here we show the distribution over actions for state S0 according to Pi. Remember that Pi specifies a separate distribution over actions for each state.
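To make the two policy types concrete, here is a minimal sketch in Python. The state and action labels (s0, s1, s2, a0, a1) follow the video's example, but the dictionary representation itself is just one possible way to write a policy down, not part of the lecture:

```python
# A deterministic policy maps each state to a single action: pi(s) = a.
deterministic_pi = {
    "s0": "a1",
    "s1": "a0",
    "s2": "a0",
}

# A stochastic policy maps each state to a distribution over actions:
# pi(a | s) is the probability of selecting action a in state s.
# The probabilities shown here are made up for illustration.
stochastic_pi = {
    "s0": {"a0": 0.5, "a1": 0.5},
    "s1": {"a0": 0.9, "a1": 0.1},
    "s2": {"a0": 1.0},  # a deterministic choice is a special case
}

print(deterministic_pi["s0"])     # the single action chosen in s0
print(stochastic_pi["s1"]["a0"])  # pi(a0 | s1)
```

Note that each state gets its own, independent distribution, which is exactly the point made above about Pi specifying a separate distribution per state.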
So we have to follow some basic rules: the action probabilities must sum to one for each state, and each action probability must be non-negative. Let's look at another state, S1. Pi in S1 corresponds to a completely different distribution over actions. In this example, the set of available actions is the same in each state, but in general this set can be different in each state. Most of the time we won't need this extra generality, but it's important nonetheless.

Let's go back to our house example. A stochastic policy might choose up or right with equal probability in the bottom row. Notice that this stochastic policy takes the same number of steps to reach the house as the deterministic policy we discussed before. Previously, we discussed how a stochastic policy, like Epsilon-greedy, can be useful for exploration. The same kind of exploration-exploitation trade-off exists in MDPs. We'll talk more about that later.

It's important that policies depend only on the current state, not on other things like time or previous states. The state contains all the information the agent uses to select the current action. In this MDP, we can define a policy that chooses to go either left or right with equal probability. We might also want to define a policy that chooses the opposite of what it did last, alternating between left and right actions. However, that would not be a valid policy, because the choice is conditioned on the last action; that means the action depends on something other than the state. It is better to think of this as a requirement on the state, not a limitation on the agent. In MDPs, we assume that the state includes all the information required for decision-making. If alternating between left and right would yield a higher return, then the last action should be included in the state. That's it for this video.
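The two rules above are easy to check mechanically. The sketch below validates a policy (non-negative probabilities that sum to one in every state) and samples an action from it; the function names and the example policy are mine, not from the lecture:

```python
import random

def is_valid_policy(pi):
    """Check the two rules: for every state, the action probabilities
    must be non-negative and must sum to one."""
    for state, dist in pi.items():
        if any(p < 0 for p in dist.values()):
            return False
        if abs(sum(dist.values()) - 1.0) > 1e-9:
            return False
    return True

def sample_action(pi, state, rng=random):
    """Draw one action from the distribution pi(. | state)."""
    actions = list(pi[state])
    weights = [pi[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]

# A stochastic policy for the house example: in some state s0
# the agent goes up or right with equal probability.
pi = {
    "s0": {"up": 0.5, "right": 0.5},
    "s1": {"up": 1.0},
}
print(is_valid_policy(pi))  # True
print(sample_action(pi, "s0") in ("up", "right"))  # True
```

Note that the policy is a function of the state alone: `sample_action` never looks at the time step or at previously chosen actions, which mirrors the requirement discussed above.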
The most important things to remember are: one, an agent's behavior is specified by a policy that maps each state to a probability distribution over actions; and two, a policy can depend only on the current state, not on other things like time or previous states. See you next time.