Okay. This is the last lab we're going to look at as part of this module on generalization and sampling, and it's pretty comprehensive, so if it took you quite a while to work through all the steps, that's completely expected. Let's take a look at a solution walkthrough right now. If you haven't attempted it yet, go ahead and pull up the Datalab IPython notebook yourself, run through the code you see there in the cells, and then come back to this solution walkthrough video.

All right, for those of you sticking around, let's take a look at what we've got here. I've pulled up the Google Cloud taxicab fare estimation notebook, and ultimately what we want to do is follow those three steps. First, explore the data. Second, create the datasets, so you're now getting really familiar with how to deal with those hash functions, and those datasets are the training dataset, the evaluation dataset, and the testing dataset. The last thing, which you might not have seen yet, is how to create a benchmark, so that we can pummel it later once you learn a lot more about machine learning and beat that simplistic model with some of the more advanced techniques you'll learn in future courses, like building a deep neural network with TensorFlow.

Before we do that, we have to start from zero and work our way up from the bottom. The first thing we need to do, as you see here, is get a data sample. The great thing about BigQuery is that it has a lot of public datasets, and much like the flight data, the taxicab data is there too. What we're going to be pulling is cab fares for New York City from this public dataset. Which fields do we want to look at? This is a little bit of feature engineering: deciding what we're going to explore and what will eventually make its way into our model. If you think about the problem of predicting cab fare, what would you be interested in? You'd want to know when the passenger was picked up, the exact latitude and longitude of the pickup and drop-off, and how many people were in the taxicab, since there could be a tiered fare structure based on the number of occupants. You'd want to know how far the trip went, and what happens if you cross any of the bridges in New York. Then there's the total amount, which is the fare amount plus any tips or discretionary spending. We're going to see which of these factors ultimately play into determining the final fare of a cab ride, even before you set foot in the door.

So the first thing is to get the data. To get data here in Cloud Datalab, we invoke a BigQuery query, as you see here, against the public sample of New York City Yellow Cab trips. We pull all of those fields I just mentioned and look at only a very small part of the data, much like the one percent sample we used on the flights data in the last lab. Here's the initial query: we have, say, 100,000 records to choose from, and we'll see if we can pull out maybe 10,000 taxicab rides from that.
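For reference, here's a minimal sketch of what that initial raw-data query might look like. The table name `nyc-tlc.yellow.trips`, the `EVERY_N` placeholder, and the use of the standard BigQuery Python client are assumptions made to keep the sketch self-contained; the actual notebook may use Datalab's BigQuery magics and slightly different column aliases.

```python
# A hedged sketch of the raw-data query, not the notebook's exact cell.
# Assumptions: the public `nyc-tlc.yellow.trips` table, standard SQL, and an
# EVERY_N placeholder that the sampling step will replace later.
from google.cloud import bigquery

client = bigquery.Client()

rawdata_query = """
SELECT
  pickup_datetime,
  pickup_longitude, pickup_latitude,
  dropoff_longitude, dropoff_latitude,
  passenger_count,
  trip_distance,
  tolls_amount,
  fare_amount,
  total_amount
FROM
  `nyc-tlc.yellow.trips`
WHERE
  MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))), EVERY_N) = 1
"""
```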
So we've parameterized the SQL query. You can parameterize it much like you would with string replacement: take the raw data query we specified up above, replace EVERY_N (that's the "grab every nth record" sampler), set the total size we're looking at to 100,000 records, then print the query and execute it. Here's the query that gets executed; we sample where the remainder of that modulo operation is one, and now we're down to only 10,000 taxicab rides. The reason we go through this sampling effort again is that you don't want to just take the first few thousand rows, because the data could be ordered and you'd get bias in your results. A great example for taxicab data: it might be sorted so that the most recent cab rides come first. If you explore only the most recent 3,000 rides, you could get bias introduced into your results, because maybe there was a fare increase or decrease captured recently, and you wouldn't know it just by looking at that slice. We call that recency bias.

So we've sampled effectively, and here's what we have. We haven't done anything to it yet; these are just the fields returned from the dataset. The next step is to actually explore it. You see passenger counts of one to five in some of the examples, and how far each trip went. Really interesting: there's a trip distance of zero, and if this is in miles, that looks kind of weird. Zero tolls, that can be expected. A fare amount of $2.50 and a total amount of $2.50. Okay, the data looks interesting; let's see if we can explore it a little more quickly, and the best way to do that is to create a data visualization. Often in machine learning we'll create a scatter plot and look at some of the points. Here we've plotted trip distance against fare amount. You might be thinking the longer you travel in a cab, the more that meter is going to tick up, and in general you do see that: even at a trip distance of 40 you see fare amounts around $100. But notice a couple of strange anomalies in the data. First, there's a ton of extremely small trips, even trips of zero distance, sitting right on this axis. That's an anomaly we want to filter out of the dataset; I don't know how you can have a cab ride that doesn't go anywhere. Maybe you get in and get immediately kicked out. Second, look at this solid line going up diagonally. It looks like a line, but it's actually a ton of points collected along that line, and that's because of the nature of the data. In New York, when you exit JFK, one of the airports there, you can get a flat-rate cab to pretty much anywhere inside Manhattan. The fare is known at pickup time based on the distance you're traveling, which is why that relationship is so easy to model: it's just a line. But we want to predict fares not just for folks coming from JFK; we want to predict for folks traveling anywhere within New York. So, interesting things, right?
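As a rough illustration of that string-replacement parameterization and the scatter plot, here's a sketch, assuming the `rawdata_query` and `client` from the previous sketch. The `every_n` value is illustrative only; the notebook sets it so the full public table thins down to the small exploratory sample described above.

```python
# A minimal sketch of the parameterized sampling step and the exploratory
# scatter plot. Assumes rawdata_query and client from the sketch above;
# every_n is illustrative, chosen so only a small sample of rides survives.
import matplotlib.pyplot as plt

def sample_query(base_query, every_n):
    """Keep only rows whose hash bucket modulo every_n equals one."""
    return base_query.replace('EVERY_N', str(every_n))

query = sample_query(rawdata_query, every_n=100000)
print(query)
trips = client.query(query).to_dataframe()
print(len(trips))

# Longer trips should cost more, but look for the zero-distance points on the
# y-axis and the straight flat-fare line coming out of JFK.
trips.plot(kind='scatter', x='trip_distance', y='fare_amount')
plt.show()
```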
Let's look at some of the ways we can preprocess and clean that data before we ultimately split it into the training, validation, and testing datasets. You don't want to jump to creating those splits before you clean your data first: garbage in, garbage out. If you split data that's dirty, you're going to get dirty model results, and you won't be able to model the behavior that's out there in the real world. A good rule of thumb is that all data is dirty; you want to clean it and make sure it's in good shape before you feed it into your model. Your model only wants good, high-quality data. That's what it loves.

Okay, so looking at some of the rides here, let's look at anything that crossed a bridge, so a tolls amount greater than zero, on a particular pickup day, in this case May 20, 2014. One interesting thing, just gazing at the data: a pickup longitude of zero or a pickup latitude of zero is clearly wrong, dirty data. We need to filter out anything that doesn't have a valid pickup location, so that the dataset ultimately makes sense and doesn't have records that look very strange. Another thing you might notice is that nowhere in the available columns do we actually record what the customer paid as a cash tip. For the purposes of our model, since that's unknown and since tips are discretionary rather than part of the fare, we're not going to try to predict it. What we'll do instead is set a new total amount, our new fare amount, to be just the fare for the distance traveled plus any tolls. In this particular example, the fare amount of 8.5 for the 2.22 miles traveled, plus the $5.33 bridge toll, gives you the new total fare amount. So we're going to recalculate that, just by adding those two together, and that becomes the new total amount, ignoring tips.

An interesting function you can call is .describe(), which gives you a feel for the bounds, or ranges, of the columns you have; it's very useful in statistics. Let's look at the minimum and maximum values. In case it wasn't clear before, for something like a pickup longitude or latitude of zero, you can see a max value of zero and a min value of zero, so you can start to spot very strange things. Another thing that might immediately jump out is a minimum fare of negative ten dollars. You can't have a negative cab fare; no one is paying you money to enter the cab and take the trip, you have to pay for the ride. And look at something like the maximum passenger count, which thankfully is six here; if you had a max passenger count of, say, twelve, that's not a cab, that's a bus, and it shouldn't be in this dataset either. So what we're slowly zeroing in on is shaving and cleaning our whole dataset through an exercise called preprocessing.
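Here's a small sketch of that tip-free recalculation and the .describe() sanity check, assuming the sampled `trips` DataFrame from the sketch above; the column names simply mirror the query fields, so rename them if your notebook differs.

```python
# A small sketch of the tip-free fare recalculation and the describe() check.
# Assumes `trips` is the sampled DataFrame pulled earlier.
trips['new_total_amount'] = trips['fare_amount'] + trips['tolls_amount']

# describe() surfaces suspicious bounds in one shot: zero pickup coordinates,
# negative fares, or a passenger count no real cab could hold.
print(trips[['pickup_longitude', 'pickup_latitude',
             'passenger_count', 'fare_amount',
             'new_total_amount']].describe())
```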
And then we'll ultimately get it ready to split into those three buckets, and create a very simple benchmark from that which we'll have to beat later on. All right, so that's once you've slogged your way through understanding the data, and by the way, this process can take weeks. If you're unfamiliar with the dataset you're looking at, or you're not a subject matter expert, and it could be hundreds of columns or billions of records, engage with an SME, a subject matter expert, who knows the data really well. Really understand what the relationships in the data are, visualize it, use different visualizations and statistical functions, even before you get to the machine learning side. You have to fundamentally understand what's going on in the data. So although this took us only five minutes, the exploration part of machine learning, understanding the datasets, can take weeks or even months.

Okay, let's look at some of the individual trips. Here we're actually plotting them, which is pretty cool; you can see the trips themselves as lines between the pickup and drop-off latitudes and longitudes, and you can see that the longer lines typically involve a toll. That intuitively makes sense: if you're crossing a bridge, you're probably going a longer distance. It's not as if somebody gets into a cab at the start of a bridge and gets out immediately after the bridge ends. So that's a good insight.

Okay, so here's how we're actually going to clean up all this data. These are the five insights we talked a little bit about before. We honed in on the fact that New York City longitudes should be around negative 74 and latitudes around 40 to 41, so we filter to a sensible range around those values. You can't have zero passengers, and arguably you shouldn't have more than a certain amount either, but we'll just set the baseline of no zero-passenger rides. Just like we discussed with tips, we recalculate the total amount to be the fare amount plus the tolls amount, as you see here. And then there's trip distance: we know the pickup and drop-off locations, but not the trip distance. This is an interesting pitfall a lot of people run into when creating training datasets for machine learning models: if a value isn't known at prediction time, you can't train on it. You can't say, "the trip distance was 5.5 miles, I'll charge a dollar per mile, so a very simplistic model says the trip will cost $5.50," because when new data arrives, say I've just requested a taxicab, the model would ask, "Okay, cool, how far did you travel?" And you'd say, wait a minute, I haven't gotten into the taxicab yet. It's trying to know the future before it happens. You can't train on data that only materializes in the future, and that's why we're dropping trip distance from the feature set as well. That's a really important point: think about what data will actually exist when you launch this in production.

So, there are lots of WHERE clause filters in the BigQuery query you see here. We're recalculating the fare amount, we have the different columns renamed with aliases, and we're creating a function that builds a parameterized query we can ultimately use to sample between particular ranges (a sketch of that function follows below).
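To make those five insights concrete, here's a hedged sketch of a parameterized, cleaned-up query builder. The bounding-box values, the fare floor, and the hash-bucket math are plausible stand-ins for what the notebook uses rather than its exact numbers, and this version omits any extra subsampling factor layered on top of the percent split.

```python
# A hedged sketch of the cleaned, parameterized query builder. Exact filter
# thresholds are assumptions; the structure mirrors the five cleanup insights.
def sample_between(a, b):
    """Query rows whose hash-of-pickup_datetime bucket lands in [a, b) percent."""
    base_query = """
    SELECT
      (tolls_amount + fare_amount) AS fare_amount,  -- recalculated, tip-free target
      pickup_datetime,
      pickup_longitude, pickup_latitude,
      dropoff_longitude, dropoff_latitude,
      passenger_count
      -- trip_distance is deliberately excluded: it is unknown at prediction time
    FROM
      `nyc-tlc.yellow.trips`
    WHERE
      trip_distance > 0
      AND fare_amount >= 2.5
      AND pickup_longitude BETWEEN -78 AND -70
      AND dropoff_longitude BETWEEN -78 AND -70
      AND pickup_latitude BETWEEN 37 AND 45
      AND dropoff_latitude BETWEEN 37 AND 45
      AND passenger_count > 0
    """
    sampler = """
      AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))), 100) >= {a}
      AND MOD(ABS(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING))), 100) < {b}
    """.format(a=a, b=b)
    return base_query + sampler
```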
So, here are all of our filters, as we talked about a little bit above, and here are our modulo operators on the Farm Fingerprint hash function. We're hashing on pickup_datetime, and the all-important message is: whatever you hash on, be prepared to lose. We're willing to part with pickup_datetime so that column can be used in service of creating the barriers between those buckets: training, evaluation, and testing. Ultimately we're saying that the time of day is not going to have predictive power for how much the cab fare will be.

All right, so we've created a query that can be parameterized, and think ahead to when we loop through this query three times, because we have to create three datasets: training, evaluation, and test. If we're in training, we want 70 percent of the data, so we sample between zero and 70. As you can see, sample_between is the function we created earlier; the a and b get plugged in and work through the modulo operator you saw there. So training gets that 70 percent, validation is between 70 and 85, which is an additional 15 percent of the available data, and the last 15 percent, 85 through 100, is going to be your testing set. Okay, so that gets everything ready to run, and here's what a query would look like if we ran it.

Now what we're actually going to do is specify where the outputs are going to be stored, because ultimately we need something like CSV files, or some other format the machine learning model can reach out, touch, and access, for the training, evaluation, and testing data. To do that, we create a function that writes those CSVs. In this particular case we're training locally, so within Datalab we're storing and creating the CSVs; this is a bit of a prototyping step. In future modules, when you get more familiar with Cloud Machine Learning Engine and other scalable tooling, you'll see that you can also reference data directly from Google Cloud Storage, through a Cloud Storage bucket. So here are the CSVs we're creating. We're removing the fare amount and replacing it with the new recalculated one inside the CSV, and here are all the features we're writing out, which is pretty much everything included in the query up above. And then here's the all-important loop, sketched below: for each phase in train, validation, and test, invoke that query over the sample of 100,000, execute it in BigQuery, and return the results as a DataFrame we can iterate and operate over. With those results, we store the DataFrame with the prefix "taxi-" plus the name of the dataset, so taxi-train, taxi-validation, and taxi-test, as CSVs in storage, and you can see that's exactly what happens here.

Trust, but verify: we need to make sure those datasets actually do exist. Doing a simple ls on the files we have, we see roughly 58,000 in the testing dataset, 400,000 in training, and 100,000 in validation, which roughly reflects the 70/15/15 split we set up at the top.
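For the CSV-writing loop itself, here's a minimal sketch, assuming the `sample_between` builder and the BigQuery `client` from the earlier sketches; the 70/15/15 ranges and the taxi-* file names come straight from the walkthrough.

```python
# A minimal sketch of the phase loop that materializes the three CSVs.
# Assumes sample_between() and client from the earlier sketches.
def to_csv(df, filename):
    out_columns = ['fare_amount', 'pickup_datetime',
                   'pickup_longitude', 'pickup_latitude',
                   'dropoff_longitude', 'dropoff_latitude',
                   'passenger_count']
    df.to_csv(filename, header=False, index=False, columns=out_columns)
    print('Wrote {} rows to {}'.format(len(df), filename))

phases = {'train': (0, 70), 'validation': (70, 85), 'test': (85, 100)}
for phase, (a, b) in phases.items():
    df = client.query(sample_between(a, b)).to_dataframe()
    to_csv(df, 'taxi-{}.csv'.format(phase))
```

In the notebook you can then do the "trust, but verify" step with a quick shell escape such as `!ls -l taxi-*.csv`.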
An interesting point, if you're wondering why the testing and validation splits could come out at slightly different sizes: that's because of the distribution of the data. It might not be evenly distributed, and if you have a lot of dates clumped together, hashing on one day, like January 1, 2018, returns the same hash value for every ride on that day. The data isn't noisy enough, so even if you ask for 70/15/15, it hashes in blocks: all the taxicab rides that took place on New Year's Day have to land in one of the buckets, because you can't split the single date you're hashing on across two different places.

Okay, great. So let's take a look at the splits. We do that here, and now that we have all the data ready in those three siloed buckets, it's finally time to start creating what I'll call a dummy model. This is your benchmark: a simplistic guess of what the cab fare is going to be. It doesn't take into account the weather, and it doesn't take into account whether or not you're coming from an airport. All of those more complex features, and the intuition you can build into an advanced model, we'll save for later when we learn TensorFlow and proper feature engineering. Right now we want a pretty simplistic model that says, "Hey, your advanced model had better beat the RMSE, the loss metric, of the model we're running here as a benchmark."

So what is that simple model going to be? We're going to need to estimate the trip distance first of all, so the simple model does that, and then it takes the fare amount divided by the distance, so we just use a rate per mile or per kilometer. Then, based on the training dataset, which is labeled, meaning at the end of the day we actually do know the fare amount, we can calculate the loss metric on the data, and we'll use RMSE since we're predicting a continuous value.

Here's how we actually do that. We define a couple of functions to take the distances between the latitudes and longitudes of the pickup and drop-off points, estimate the distance between the two, and get a figure for how many miles the taxicab actually drove. Again, we do know that information in training, but since it's unknown at prediction time we can't use the recorded column; we estimate it instead. Then you compute the RMSE with the equation listed there, and print it out. We pass our features into the model, and the target we're actually predicting is the fare amount. We list the features, we define where our DataFrames for the training, validation, and test datasets live, and then we train a very simple model: predict the fare as a rate times the distance, where the rate is simply the average cost, say a ten-dollar cab ride, divided by the average distance traveled. Line 28 is the only place right here where you see any kind of modeling actually happen.
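Here's a rough sketch of what that distance-based benchmark boils down to, assuming the taxi-*.csv files written in the previous step. The straight-line degree-to-kilometer factors are crude approximations for New York's latitude, not the notebook's exact distance formula.

```python
# A hedged sketch of the benchmark: one global rate learned from training,
# applied to a crude straight-line distance estimate.
import numpy as np
import pandas as pd

COLS = ['fare_amount', 'pickup_datetime',
        'pickup_longitude', 'pickup_latitude',
        'dropoff_longitude', 'dropoff_latitude',
        'passenger_count']

def estimate_distance_km(df):
    """Straight-line pickup-to-dropoff distance; fine for a benchmark only."""
    dlat_km = (df['dropoff_latitude'] - df['pickup_latitude']) * 111.0
    dlon_km = (df['dropoff_longitude'] - df['pickup_longitude']) * 85.0
    return np.sqrt(dlat_km ** 2 + dlon_km ** 2)

def compute_rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))

dfs = {phase: pd.read_csv('taxi-{}.csv'.format(phase), header=None, names=COLS)
       for phase in ['train', 'validation', 'test']}

# "Training" the benchmark is one line: average fare over average distance.
rate = dfs['train']['fare_amount'].mean() / estimate_distance_km(dfs['train']).mean()
print('Rate = ${:.2f} / km'.format(rate))

for phase, df in dfs.items():
    predicted = rate * estimate_distance_km(df)
    print('{} RMSE = {:.2f}'.format(phase, compute_rmse(df['fare_amount'], predicted)))
```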
So, we've spent 15 or 20 minutes going through this lab demo already, and line 28 is the only place where we actually do any prediction or modeling. It took this long to create the datasets, do the cleaning and preprocessing, set up the CSV files for ingestion so the model has an easy time, and ultimately make this model the benchmark for future model performance. Now, this ratio, roughly 99 percent exploration, cleaning, creating the new datasets, and establishing the benchmark versus one percent actual modeling, is going to shift as we get into model building, creating more sophisticated models, and feature engineering in the future. For right now, this can just be our benchmark.

Okay, so this is the rate per kilometer we end up with: about $2.60 per kilometer of taxicab ride. And here are the RMSEs you see: a training loss metric of 7.45, validation of 9.35, and, surprisingly, the best of the three on the test set at 5.44. Now, this benchmark, globally saying your taxicab ride is going to cost roughly $2.60 per kilometer no matter where you're going, doesn't take into account traffic, doesn't take into account where in Manhattan you're headed, doesn't take into account bridge tolls (we have no parameter at all for whether you'll be going over a bridge), and doesn't take into account time of day. All of those things you were just thinking in the back of your head, "hey, you can't just hardcode 2.6 times the kilometers," are exactly the intuition we're going to build into more sophisticated models, and at the end of the day they had better do a much better job with all of those additional advanced insights. We'll revisit this in the future to beat 5.44, so that is your benchmark RMSE to beat. Keep in mind that 5.44 is for this particular sampled dataset, so when you run it you might get a slightly different result.

All right, excellent. That wraps up this end-to-end lab, and I encourage you to continue taking the courses in the specialization, because you can't stop here. Now that you know how to get the data, clean it, ingest it, and build the benchmarking model, you need to figure out, okay, cool, I'm ready to build more sophisticated models, program in all those cool learning capabilities that can produce more sophisticated insights, and ultimately beat this model and this RMSE. So stick around for the future courses on TensorFlow and how to actually beat this RMSE. And feel free to repeat this lab; you have three attempts, so edit the code as you see fit in your Cloud Datalab notebooks. All right, we'll see you around. Nice job.