This week you were introduced to the k-armed bandit problem and several fundamental concepts in reinforcement learning. We started by describing the problem setting: an agent takes actions and receives rewards based on the action selected. We defined the value of each action as the expected reward received when taking that action. The value function q star is unknown to the agent.

We then introduced the sample-average method for estimating q star. It uses the sum of rewards received when action a is taken, divided by the number of times a has been taken. We derived an incremental update rule by finding a recursive definition of the sample average. By doing so, we need only store the number of times each action has been taken and the previous estimate. We then showed that we can make the step size more generic, and in fact, by using a constant step size, we can more effectively solve bandit problems that change over time.

Next, we ran into the issue of exploration versus exploitation. We defined exploration as the chance for the agent to improve its knowledge: some actions may be better than we realize, so we have to try actions we may not think are the best in order to improve our estimates. We defined exploitation as the agent taking the action it currently thinks is best, in the hope that it will generate the most reward. The agent cannot explore and exploit simultaneously, so how do we choose when to explore and when to exploit?

To answer this question, we introduced epsilon-greedy action selection. Epsilon-greedy explores with probability epsilon and exploits with probability 1 minus epsilon. When it exploits, it chooses the action that maximizes the current value estimate. When it explores, it chooses an action uniformly at random.

We also investigated the effects of optimistic initial values. If the initial values are larger than q star, the agent will systematically explore the actions. The optimism fades with time, and the agent eventually stops exploring.

Finally, we discussed upper confidence bound (UCB) action selection. UCB mixes exploitation and exploration through the use of confidence intervals, following the strategy of optimism in the face of uncertainty.

And that's it for bandits. This week we introduced some of the most fundamental concepts in reinforcement learning. We can largely view the bandit problem as a subset of the larger reinforcement learning problem. Concepts like maximizing reward and choosing actions in the face of uncertainty are key to reinforcement learning. We're excited to begin introducing the full reinforcement learning problem next week. Be sure to read the textbook before next week's lectures. For reference, a few illustrative sketches of this week's methods follow below.
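
To make the incremental update rule concrete, here is a minimal Python sketch. The general form is NewEstimate = OldEstimate + StepSize * (Target - OldEstimate); with step size 1/N(a) this reproduces the sample average exactly. The array names Q and N, the choice of 10 arms, and the constant step size of 0.1 are illustrative assumptions, not values from the lecture.

```python
import numpy as np

k = 10                      # number of arms (illustrative choice)
Q = np.zeros(k)             # current value estimates, one per action
N = np.zeros(k, dtype=int)  # number of times each action has been taken

def update_sample_average(action, reward):
    """Move Q[action] toward the new reward with step size 1/N(action),
    which is exactly the running sample average of rewards for that action."""
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]

def update_constant_step(action, reward, alpha=0.1):
    """Constant step size: recent rewards are weighted more heavily,
    which helps when the bandit problem changes over time."""
    Q[action] += alpha * (reward - Q[action])
```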
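
Epsilon-greedy action selection can be sketched in a few lines. This assumes the value estimates are stored in a NumPy array Q as above; the seed and the default epsilon of 0.1 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only for reproducibility

def epsilon_greedy(Q, epsilon=0.1):
    """With probability epsilon choose a uniform random action (explore);
    otherwise choose an action with the highest current estimate (exploit),
    breaking ties among greedy actions at random."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))      # explore
    greedy_actions = np.flatnonzero(Q == Q.max())
    return int(rng.choice(greedy_actions))    # exploit
```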
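
Optimistic initial values require only a change to how Q is initialized. The value 5.0 below is an illustrative assumption meant to exceed any plausible q star; because every action initially looks better than it really is, even a greedy agent is driven to try each one before the optimism fades.

```python
import numpy as np

k = 10
Q_optimistic = np.full(k, 5.0)  # start every estimate above the true values
```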
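
Finally, a sketch of upper confidence bound action selection, assuming the same Q and N arrays. The exploration parameter c controls how much weight the uncertainty term gets; 2.0 is an illustrative default, and the handling of untried actions is one common convention.

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """Pick the action with the largest upper confidence bound:
    Q[a] + c * sqrt(ln(t) / N[a]).
    Actions that have never been taken are treated as maximally uncertain
    and tried first."""
    if np.any(N == 0):
        return int(np.argmax(N == 0))          # take an untried action first
    bounds = Q + c * np.sqrt(np.log(t) / N)
    return int(np.argmax(bounds))
```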