We said we would talk about common big data challenges and how they're addressed. One of the most common challenges in managing on-premises Hadoop clusters is making sure they're efficiently utilized and properly tuned for all the workloads their users throw at them. So let's check in with our team and see what challenges they're facing running their Spark ML job on the on-premises cluster. Our team has an on-premises data center running Hadoop, represented by the blue box here. On any given week, they manage four different jobs for the organization, and our Spark ML recommendation job is one of the four. The data is persisted in traditional disk-based HDFS storage.

Here's an example scenario that may seem familiar to you. The team launches job number 1, say a big data pipeline job, which ramps up and consumes 50 percent of the available on-premises cluster resources. Then the team gets a special request from marketing to run their prediction job for an upcoming campaign later today. That becomes job number 2, and it ramps up and consumes the other 50 percent of available cluster resources. You probably see the problem now. If our Spark ML job were number 3 or number 4, it would be starved of resources, because the cluster's compute capacity is under-provisioned for the demands of the organization's Hadoop jobs. The costly alternative is when only one job is running and uses 50 percent of the cluster's resources, while the other 50 percent of the machines are powered on and available but not needed. The problem lies in the static nature of on-premises cluster capacity. The team needs a better way to optimize the cluster's resource usage without the headache of continually tuning it and adding and removing servers themselves.

This is where GCP can help. You can now think of clusters as flexible resources. With Cloud Dataproc, GCP's managed Hadoop product, you can spin up as many or as few cluster resources as your job needs in the cloud. Notice how I said cluster resources: you use them when you need to run jobs and turn them off when they're no longer needed. So in this scenario, jobs 1 and 2 can run on their own customized cluster, and jobs 3 and 4 can run on their own clusters too. Your jobs get the resources they need, and you can shut down the clusters when they're not in use. The clusters themselves become fungible resources. You can even automate when a cluster shuts down, so you don't pay for resources you're not using. You can set shutdown triggers based on how long a cluster has sat idle, a specific timestamp, or a duration in seconds to wait after creation, or you can say: when I submit a job, run this job and then shut down.

But what if the needs of a job change, and on some days we need a bigger cluster and on other days a smaller one? Cloud Dataproc autoscaling provides flexible capacity to meet that need. It makes scaling decisions by looking at Hadoop's YARN metrics. Autoscaling works only if shutting down a cluster node doesn't remove any data, so you can't store your data on the cluster. That's why we store our data off-cluster, in Cloud Storage, Bigtable, or BigQuery. In short, autoscaling works as long as you don't store your data in HDFS.
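To make those shutdown triggers concrete, here's a minimal sketch using the google-cloud-dataproc Python client. The project, region, cluster name, and durations below are hypothetical placeholders, not values from the course:

```python
# A minimal sketch, assuming the google-cloud-dataproc client library.
# Project, region, and cluster names are hypothetical placeholders.
from google.cloud import dataproc_v1

project_id = "my-project"
region = "us-central1"

# Dataproc clients talk to a regional endpoint.
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": project_id,
    "cluster_name": "sparkml-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
        # Scheduled deletion: the shutdown triggers from the lecture.
        "lifecycle_config": {
            "idle_delete_ttl": {"seconds": 1800},      # delete after 30 idle minutes
            "auto_delete_ttl": {"seconds": 8 * 3600},  # ...or 8 hours after creation
            # "auto_delete_time" takes a specific timestamp instead of a duration.
        },
    },
}

# create_cluster returns a long-running operation; result() waits for it.
operation = client.create_cluster(
    request={"project_id": project_id, "region": region, "cluster": cluster}
)
print(f"Cluster ready: {operation.result().cluster_name}")
```

The fourth trigger, "run this job and shut down," is the ephemeral-cluster pattern: Dataproc workflow templates can create a managed cluster, run the submitted jobs, and delete the cluster when they finish.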
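Autoscaling is configured as a standalone policy that you attach to a cluster. Here's a sketch, again with hypothetical names and illustrative bounds, of a policy whose scaling decisions are driven by YARN memory metrics:

```python
# Sketch continued: a YARN-metrics autoscaling policy.
# Policy id, factors, and worker bounds are illustrative, not prescriptive.
policy_client = dataproc_v1.AutoscalingPolicyServiceClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

policy = {
    "id": "sparkml-policy",
    "basic_algorithm": {
        # Scaling decisions are based on YARN pending/available memory.
        "yarn_config": {
            "scale_up_factor": 0.5,    # add half the capacity YARN is asking for
            "scale_down_factor": 0.5,  # remove half the excess when demand drops
            "graceful_decommission_timeout": {"seconds": 600},
        }
    },
    "worker_config": {"min_instances": 2, "max_instances": 20},
}

policy_client.create_autoscaling_policy(
    request={"parent": f"projects/{project_id}/regions/{region}", "policy": policy}
)

# The policy attaches at cluster-creation time via the cluster config:
#   cluster["config"]["autoscaling_config"] = {
#       "policy_uri": f"projects/{project_id}/regions/{region}"
#                     "/autoscalingPolicies/sparkml-policy"
#   }
```

The graceful decommission timeout is what makes scaling down safe for running jobs: YARN gets a window to finish work on a node before Dataproc removes it.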
In addition to autoscaling, another advantage of running Hadoop clusters on GCP is that you can incorporate preemptible virtual machines into your cluster architecture. Preemptible VMs are highly affordable, short-lived compute instances suitable for batch jobs and fault-tolerant workloads. Why fault tolerant? Because preemptible machines offer the same machine types and options as regular compute instances, but they last at most 24 hours, and they can be taken away whenever someone else comes along and needs that compute capacity. So if your applications are fault tolerant, and Hadoop applications are, then preemptible instances can reduce your Compute Engine costs significantly. How significantly? Preemptible VMs are up to 80 percent cheaper than regular instances, and the pricing is fixed, so you always get low cost and financial predictability without gambling on variable market pricing. But just like autoscaling, preemptible VMs work only when your workload can function without its data being stored on the cluster. You want to store that data off-cluster, and that's what we will look at next.
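Before moving on, here's a rough sketch of how preemptible capacity might slot into the hypothetical cluster config from the earlier sketch; the instance count is illustrative:

```python
# Sketch continued: add preemptible secondary workers to the cluster config.
# The count is illustrative; Dataproc secondary workers default to preemptible.
cluster["config"]["secondary_worker_config"] = {
    "num_instances": 4,
    "preemptibility": "PREEMPTIBLE",  # being explicit about the default
}
# Secondary workers don't run HDFS DataNodes, so a preempted node costs
# only recomputation, never data, as long as the data lives off-cluster.
```

That last comment is the design point: because the data sits in Cloud Storage, Bigtable, or BigQuery rather than on the cluster, losing a preemptible worker is an inconvenience, not a data loss.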