Imagine you maintain a website that receives millions of hits a day. You can frame this as a k-armed bandit problem: each action corresponds to displaying a particular ad, the estimated value of an ad is the money it generates, and your goal is to display the ad that will generate the most money. How can you keep your value estimates up to date without storing the data for millions of clicks?

By the end of this video, you'll be able to: describe how action values can be estimated incrementally, identify how the incremental update rule is an instance of a more general learning rule, and describe how the general learning rule can be used in non-stationary problems.

The sample average method can be written in a recursive manner. By doing so, we can avoid storing all the previous data. Let's see how that works. First, we pull the current reward out of the sum, so the sum now runs only up to the previous reward. We want to write our next value estimate in terms of our previous value estimate. To do so, we multiply and divide the remaining sum by n minus one; multiplying and dividing by the same thing is effectively multiplying by one. The circled term should look familiar: it is just our definition of Q_n, the current value estimate. We can simplify this equation a little further. We distribute, then pull nQ_n out of the sum and multiply by one over n. Now we have an incremental update rule for estimating values.

This equation has a form that will show up many times throughout the course, so let's define some of its terms. The error in the estimate is the difference between the new target and the old estimate. Taking a step towards that new target creates a new estimate with reduced error. Here, the new reward is our target. The size of the step is determined by our step size parameter and the error in our old estimate. We now have a general rule for updating an estimate incrementally.
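The derivation narrated above can be written out explicitly. Here is a sketch in the standard notation, with Q_n the current estimate and R_i the i-th observed reward:

```latex
\begin{aligned}
Q_{n+1} &= \frac{1}{n}\sum_{i=1}^{n} R_i
         = \frac{1}{n}\Big(R_n + \sum_{i=1}^{n-1} R_i\Big) \\
        &= \frac{1}{n}\Big(R_n + (n-1)\,\frac{1}{n-1}\sum_{i=1}^{n-1} R_i\Big)
         = \frac{1}{n}\big(R_n + (n-1)\,Q_n\big) \\
        &= \frac{1}{n}\big(R_n + n\,Q_n - Q_n\big)
         = Q_n + \frac{1}{n}\big(R_n - Q_n\big)
\end{aligned}
```

The last line is an instance of the general learning rule: NewEstimate = OldEstimate + StepSize (Target − OldEstimate), with the new reward R_n as the target and 1/n as the step size.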
The step size can be a function of n that produces a number between zero and one. In the specific case of the sample average, the step size is equal to one over n.

Let's go back to our friendly doctor. What if one of the treatments were more effective under certain conditions? Specifically, let's say treatment B is more effective during the winter months. This is an example of a non-stationary bandit problem. These problems are like the bandit problems we've discussed before, except the distribution of rewards changes with time. The doctor is unaware of this change but would like to adapt to it.

One option is to use a fixed step size. If Alpha_n is constant, say 0.1, then the most recent rewards affect the estimate more than older rewards. This graph shows the amount of weight the most recent reward receives versus a reward received t time steps ago. The weighting fades exponentially with time: as we move to the right on the x-axis, we go further back in time.

To see why this is true, let's take another look at the incremental update equation. We first distribute Alpha and rearrange, writing the next value as a weighted sum of the reward and the last value. Notice the recursive form. We can substitute the definition of Q_n into this equation, then distribute one minus Alpha, giving an equation that relates Q_n plus one to the rewards R_n and R_n minus one. We can unroll this recursive relationship further, repeatedly replacing Q with its definition until we get all the way back to the initial value Q_1. Finally, let's clean up this long equation. We write it as our initial value Q_1 plus a weighted sum of the rewards over time.

What does this equation tell us? It relates our current estimate of the value, Q_n plus one, to Q_1 and all the observed rewards. The first term tells us that the contribution of Q_1 decreases exponentially with time.
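The unrolling described above can be sketched in standard notation, writing Alpha as α:

```latex
\begin{aligned}
Q_{n+1} &= Q_n + \alpha\,(R_n - Q_n)
         = \alpha R_n + (1-\alpha)\,Q_n \\
        &= \alpha R_n + (1-\alpha)\big[\alpha R_{n-1} + (1-\alpha)\,Q_{n-1}\big] \\
        &= \alpha R_n + (1-\alpha)\,\alpha R_{n-1} + (1-\alpha)^2\,Q_{n-1} \\
        &\;\;\vdots \\
        &= (1-\alpha)^{n}\,Q_1 + \sum_{i=1}^{n} \alpha\,(1-\alpha)^{n-i}\,R_i
\end{aligned}
```

In the final sum, the most recent reward R_n carries weight α, and a reward received k steps earlier carries weight α(1 − α)^k, which is the exponential fading shown on the graph.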
The second term tells us that rewards further back in time contribute exponentially less to the sum. Taken together, we see that the influence of our initialization of Q goes to zero with more and more data, and the most recent rewards contribute the most to our current estimate.

In this video, we derived an incremental update for the sample average method. We generalized the incremental update rule into a form that will be revisited throughout the course. Finally, we discussed how a constant step size might help solve non-stationary bandit problems.
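The contrast between the two step sizes can be sketched in a few lines of code. This is an illustrative simulation, not from the lecture: the function name and the reward stream (a mean that jumps from 0 to 1 halfway through) are made up for the example.

```python
# Compare the sample-average step size (1/n) with a constant step size
# on a non-stationary problem: the reward is 0 for the first 100 steps,
# then jumps to 1 for the next 100.

def estimate(rewards, step_size=None):
    """Incrementally estimate the mean reward.

    step_size=None uses the sample-average rule (Alpha_n = 1/n);
    otherwise a constant Alpha is used on every step.
    """
    q = 0.0
    for n, r in enumerate(rewards, start=1):
        alpha = (1.0 / n) if step_size is None else step_size
        # NewEstimate = OldEstimate + StepSize * (Target - OldEstimate)
        q += alpha * (r - q)
    return q

rewards = [0.0] * 100 + [1.0] * 100      # distribution shifts halfway through

q_avg = estimate(rewards)                # sample average: averages both regimes
q_const = estimate(rewards, step_size=0.1)  # constant Alpha: tracks the change
```

The sample average lands near 0.5, the mean over the whole stream, while the constant step size ends up close to 1, the mean of the recent regime — exactly the recency weighting derived above.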