Now that we've designed for reliability, let's explore disaster planning. High availability can be achieved by deploying to multiple zones in a region. When using Compute Engine, for higher availability you can use a regional managed instance group, which provides built-in functionality to keep instances running. Use autohealing with an application health check, and use load balancing to distribute load.

For data, the storage solution you select will affect what is needed to achieve high availability. For Cloud SQL, the database can be configured for high availability, which provides data redundancy and a standby instance of the database server in another zone. This diagram shows a high availability configuration with a regional managed instance group for a web application that's behind a load balancer. The master Cloud SQL instance is in us-central1-a, with a replica instance in us-central1-f. Some data services, such as Firestore or Spanner, provide high availability by default.

In the previous example, the regional managed instance group distributes VMs across zones. You can choose between single-zone and multiple-zone, or regional, configurations when creating your instance group, as you can see in this screenshot. Google Kubernetes Engine clusters can also be deployed to either a single zone or multiple zones, as shown in this screenshot. A cluster consists of a master and collections of node pools. Regional clusters increase the availability of both a cluster's master and its nodes by replicating them across multiple zones of a region.

If you are using instance groups for your service, you should create a health check to enable autohealing. The health check is a test endpoint in your service. It should indicate that your service is available and ready to accept requests, and not just that the server is running. A challenge with creating a good health check endpoint is that if you use other backend services, you need to check that they are available in order to provide positive confirmation that your service is ready to run. If the services it depends on are not available, it should not report itself as available; a minimal sketch of such an endpoint follows below. If a health check fails, the instance group will remove the failing instance and create a new one. Health checks can also be used by load balancers to determine which instances to send requests to.

Let's go over how to achieve high availability for Google Cloud's data storage and database services. For Cloud Storage, you can achieve high availability with multi-region storage buckets, if the latency impact is negligible. As this table illustrates, the multi-region availability benefit is a factor of two, as the unavailability decreases from 0.1% to 0.05%.

If you are using Cloud SQL and need high availability, you can create a failover replica. This graphic shows the configuration where a master is configured in one zone and a replica is created in another zone, but in the same region. If the master is unavailable, the failover replica will automatically be switched over to take over as the master. Remember that you are paying for the extra instance with this design.

Firestore and Spanner both offer single-region and multi-region deployments. A multi-region location is a general geographical area, such as the United States. Data in a multi-region location is replicated in multiple regions; within a region, data is replicated across zones. Multi-region locations can withstand the loss of entire regions and maintain availability without losing data.
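To make the health-check idea concrete, here is a minimal sketch of such an endpoint in Python using Flask. The /healthz path, the DEPENDENCY_URL value, and the two-second timeout are illustrative assumptions rather than anything from the course; the point is simply that the endpoint confirms a dependent backend is reachable before reporting the service as ready.

```python
# Minimal health-check endpoint sketch (illustrative; path and URL are assumptions).
# It reports healthy only if a dependent backend also responds, so autohealing and
# load balancers treat the instance as ready to serve, not merely running.
from flask import Flask
import requests

app = Flask(__name__)

# Hypothetical backend this service depends on before it can serve traffic.
DEPENDENCY_URL = "http://backend.internal/status"

@app.route("/healthz")
def healthz():
    try:
        # Positive confirmation: the backend we depend on must answer quickly.
        resp = requests.get(DEPENDENCY_URL, timeout=2)
        if resp.status_code == 200:
            return "ok", 200
    except requests.RequestException:
        pass
    # If the dependency is down, report unhealthy so the instance receives no traffic
    # (and, with autohealing enabled, is eventually recreated).
    return "dependency unavailable", 503

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

A Compute Engine autohealing health check, or a load balancer health check, would then be pointed at this path; repeated failures cause the managed instance group to remove the instance and create a new one, exactly as described above.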
The multi-region configurations for both Firestore and Spanner offer five nines of availability, which is less than six minutes of downtime per year.

Now, I already mentioned that deploying for high availability increases costs, because extra resources are used. It is important that you consider the costs of your architectural decisions as part of your design process. Don't just estimate the cost of the resources used; also consider the cost of your service being down. The table shown is a really effective way of assessing risk versus cost, by considering the different deployment options and balancing them against the cost of being down.

Now, let me introduce some disaster recovery strategies. A simple disaster recovery strategy may be to have a cold standby. You create snapshots of persistent disks, machine images, and data backups and store them in multi-region storage. This diagram shows a simple system using this strategy: snapshots are taken that could be used to recreate the system. If the main region fails, you can spin up servers in the backup region using the snapshots, images, and persistent disks. You will have to route requests to the new region, and it's vital to document and test this recovery procedure regularly.

Another disaster recovery strategy is to have a hot standby, where instance groups exist in multiple regions and traffic is forwarded with a global load balancer. This diagram shows such a configuration. I already mentioned this, but you can also implement this with data storage services, like multi-region Cloud Storage buckets, and database services like Spanner and Firestore.

Now, any disaster recovery plan should consider its aims in terms of two metrics: the recovery point objective and the recovery time objective. The recovery point objective is the amount of data that it would be acceptable to lose, and the recovery time objective is how long it can take to be back up and running. You should brainstorm scenarios that might cause data loss or service failures and build a table similar to the one shown here. This can be helpful to provide structure for the different scenarios and to prioritize them accordingly. You will create a table like this in the upcoming design activity.

Along with the recovery plan, you should create a plan for how to recover based on the disaster scenarios that you define. For each scenario, devise a strategy based on the risk and the recovery point and time objectives. This isn't something that you want to simply document and leave. You should communicate the process for recovering from failures to all parties. The procedures should be tested and validated regularly, at least once per year, and ideally recovery becomes a part of daily operations, which helps streamline the process.

This table illustrates the backup strategy for different resources, along with the location of the backups and the recovery procedure. This simplified view illustrates the type of information that you should capture.

Before we get into our next design activity, I just want to emphasize how important it is to prepare a team for disaster by using drills. Decide what you think can go wrong with your system, think about the plans for addressing each scenario, and document these plans. Then practice these plans periodically, in either a test or a production environment. At each stage, assess the risks carefully and balance the cost of availability against the cost of unavailability.
The cost of unavailability will help you evaluate the risk of not knowing your system's weaknesses.
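To put rough numbers behind this availability-versus-cost trade-off, here is a small Python sketch that converts an availability target into allowed downtime per year and weighs an assumed cost of that downtime against an assumed cost of extra high-availability resources. The dollar figures are made-up placeholders; only the availability-to-downtime arithmetic follows from the percentages mentioned earlier (99.9% versus 99.95% for storage, and five nines giving roughly five minutes per year).

```python
# Convert an availability target into permitted downtime, then weigh the cost of that
# downtime against the extra spend for a higher-availability deployment.
# Dollar figures below are placeholder assumptions, not figures from the course.

MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(availability: float) -> float:
    """Minutes of downtime per year implied by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for label, availability in [
    ("single-region storage (99.9%)", 0.999),
    ("multi-region storage (99.95%)", 0.9995),
    ("five nines (99.999%)", 0.99999),
]:
    print(f"{label}: ~{downtime_minutes_per_year(availability):.1f} min/year")

# Rough cost comparison with assumed values:
cost_of_downtime_per_hour = 10_000   # assumed revenue/penalty impact of an outage
extra_ha_cost_per_year = 20_000      # assumed cost of standby instances, replicas, etc.

saved_minutes = downtime_minutes_per_year(0.999) - downtime_minutes_per_year(0.9995)
avoided_cost = (saved_minutes / 60) * cost_of_downtime_per_hour
print(f"Downtime avoided: ~{saved_minutes:.0f} min/year, "
      f"worth ~${avoided_cost:,.0f} against ~${extra_ha_cost_per_year:,} of extra spend")
```

Working through numbers like these for each deployment option is one way to fill in the risk-versus-cost table discussed above and to decide whether the extra spend on high availability is justified.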