Welcome back. Today we're going to talk about maximizing data source quality for designed data. In a previous lecture, we talked about optimal sample design, and today we're going to look at an example of actually doing optimal sample design. There are many different ways to do sample design, many different objectives, and for a fuller treatment, you can look at textbooks on sampling theory. We're going to focus on one particular sample design and look at how an optimization might work. We're going to start with this idea of the Neyman allocation. What the Neyman allocation does is minimize sampling variance for a fixed sample size. We'll come back to a key assumption later because, as I said, there are many ways to optimize a sample design. We're going to introduce a little bit of notation here. We'll start with capital W_h, which is equal to N_h divided by N, in other words, the proportion of the population in stratum h, and we're going to have H strata. We'll show an example in a minute if this isn't completely clear. The next piece of information we'll need is the stratum standard deviation, S_h. Here you'll see again we're using population values: we take the difference between each value in a stratum and the stratum mean, and square that term. Then we sum those squared differences across all of the elements in stratum h and divide by N_h minus 1. That gives us the population variance of elements within a stratum, and its square root is S_h. We'll come back to that quantity as well. We need to know W_h and have estimates of S_h. Once we've done that, we can allocate the sample based on that information. I also need to know n, the total sample size that I have available. Then the Neyman allocation is simply the formula you see at the bottom of the slide.
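The two inputs just described, W_h and S_h, can be sketched in a few lines of Python (a sketch only; the function name and the dict-of-lists input format are illustrative, not from the lecture):

```python
import math

def stratum_weights_and_sds(strata):
    """Given a dict mapping stratum label -> list of population values,
    return for each stratum the pair (W_h, S_h), where
    W_h = N_h / N and S_h is the stratum standard deviation
    computed with the N_h - 1 divisor, as on the slide."""
    N = sum(len(values) for values in strata.values())
    out = {}
    for h, values in strata.items():
        N_h = len(values)
        mean_h = sum(values) / N_h
        # Sum of squared differences from the stratum mean, over N_h - 1
        var_h = sum((y - mean_h) ** 2 for y in values) / (N_h - 1)
        out[h] = (N_h / N, math.sqrt(var_h))
    return out

weights_sds = stratum_weights_and_sds({
    "small": [1.0, 2.0, 3.0],
    "large": [10.0, 20.0, 30.0],
})
```

In practice, of course, we rarely observe the full population; we only need W_h and estimates of S_h, however obtained.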
The sample size allocated to stratum h, n_h, is equal to n, the total sample size, times W_h times S_h, divided by the sum of W_h times S_h across all H strata. The Neyman allocation therefore uses the proportion of the population in the stratum, and the variability within that stratum, to allocate the sample n efficiently across all of the H strata. Let's take a look at how that works in practice to help us understand it a little better. I'm going to look at a made-up example, a commercial building survey, where I'm looking at energy consumption. The natural choice for stratification in this case is building size, because it's highly correlated with energy consumption. In this example, I've got four strata, so H is equal to 4. The strata are defined by the square footage of the building. The stratum with the smallest buildings is those less than 50,000 square feet. The stratum with the largest buildings is those with 4 million or more square feet. We've created these strata and we categorize all buildings into one of the four strata. For example, in the stratum with the smallest buildings, there are 1,182 buildings. In the stratum with the largest buildings, there are 895 buildings. That's our population: a total of 5,318 commercial buildings. In the next column, we have their mean energy consumption. This is a made-up example, so we have those numbers and we can look at the impact of our sample design on our outcome. 72.7 is the average energy consumption in the smallest buildings, while the much larger buildings have an average consumption of 11,659. The last column is the stratum-level estimates of variability, the population variances S_h squared. The first stratum, with the smallest buildings, has a population variance of 11,783; the largest buildings, on the other hand, have a variance of approximately 183 million. Those are the characteristics of the strata in our problem.
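The allocation formula itself is one line. Here is a minimal Python sketch of it (the function name is illustrative; this is the textbook formula, not the lecture's software):

```python
def neyman_allocation(n, W, S):
    """Neyman allocation: n_h = n * W_h * S_h / sum over k of (W_k * S_k).

    n : total sample size available
    W : list of stratum population proportions W_h (sum to 1)
    S : list of stratum standard deviations S_h
    Returns the exact (non-integer) allocation; round for field use.
    """
    products = [w * s for w, s in zip(W, S)]
    total = sum(products)
    return [n * p / total for p in products]

# Two equal-sized strata, one three times as variable as the other:
alloc = neyman_allocation(100, [0.5, 0.5], [1.0, 3.0])
# The more variable stratum gets three times the sample: [25.0, 75.0]
```

Notice that a stratum's share of the sample scales with both its population share and its variability, which is exactly the behavior we'll see in the building example.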
Now, I can use this information to create a Neyman allocation, which is what you see happening in this table. The first step is to take W_h times S_h, and that's what you see in the fourth column. For the first stratum, that quantity is 24.1. I should mention I've also calculated the W_h in this table. The proportion of buildings, that is, the proportion of the population, in Stratum 1 is about 22.2 percent. The proportion in the largest stratum, on the other hand, is 16.8 percent. You can see that the W_h all sum to one. I've then divided each W_h times S_h by their sum, 2,949.4, and multiplied by my total sample size of 500 to get a sample size in each of the strata; you'll see the formula in the last column on the right-hand side at the top. That allocation, rounded to integer values, assigns four units to Stratum 1, twenty-seven of my 500 units to Stratum 2, and 385 units to Stratum 4, for a total of 500 sample units. Now, that might seem inefficient, because there were few buildings in that fourth stratum, the one with the largest buildings. However, the other thing to note about that stratum is that it had by far the largest population variability. In order to get the most precise estimates, we need to allocate a big portion of our sample to that stratum. That's one way to interpret the Neyman allocation. Note that I rounded to get an integer allocation. I also needed input estimates of S_h. The first question people often ask me is: how do you get these? Where do I get those estimates? That's actually the hard part. I either have to have a previous survey with similar estimates that I can use to try to get estimates of S_h, or I may have to make educated guesses to create some starting values that I can use to build this allocation. Presuming that I can get those estimates of S_h, I can then develop the Neyman allocation, which minimizes the sampling variance of the estimated mean for a fixed sample size.
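As a quick sanity check, we can reproduce Stratum 1's allocation directly from the numbers quoted above. A Python sketch (not the lecture's spreadsheet or software); since the transcript doesn't state every stratum's S_h, we reuse the quoted denominator of 2,949.4 rather than recomputing it:

```python
import math

# Numbers quoted in the example:
# N = 5,318 buildings in total, N_1 = 1,182 in Stratum 1,
# Stratum 1 population variance S_1^2 = 11,783,
# sum of W_h * S_h across all four strata = 2,949.4,
# total sample size n = 500.
N, N_1 = 5318, 1182
S_1 = math.sqrt(11783)            # stratum standard deviation, about 108.5
W_1 = N_1 / N                     # population proportion, about 0.222
n_1 = 500 * W_1 * S_1 / 2949.4    # Neyman allocation formula
print(round(n_1))                 # rounds to 4, matching the table
```

The same arithmetic for Stratum 4, with its variance of roughly 183 million, produces the large allocation of around 385 units, despite that stratum's small population share.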
Therefore, it maximizes data source quality. Now, there are packages that carry out these allocations; for example, the R package PracTools has a function called strAlloc, which will do this Neyman allocation. You just need to give it the inputs: the W_h, the S_h, and the fixed sample size that you hope to achieve. That's how we do the Neyman allocation. Next, we're going to turn our attention to maximizing data source quality for gathered data.