In this module, we will discuss how to manage data pipelines with Cloud Data Fusion and Cloud Composer. Cloud Data Fusion provides a graphical user interface and APIs that increase time efficiency and reduce complexity. It equips business users, developers, and data scientists to quickly and easily build, deploy, and manage data integration pipelines. Cloud Data Fusion is essentially a graphical, no-code tool for building data pipelines. Data Fusion is used by developers, data scientists, and business analysts alike. For developers, Data Fusion allows you to cleanse, match, remove duplicates, blend, transform, partition, transfer, standardize, automate, and monitor data. Data scientists can use Cloud Data Fusion to visually build integration pipelines, and to test, debug, and deploy applications. Business analysts can run Cloud Data Fusion at scale on GCP, operationalize pipelines, and inspect rich integration metadata.

Behind the scenes, Cloud Data Fusion creates ephemeral execution environments to run pipelines. In the beta release, Data Fusion supports Cloud Dataproc as an execution environment, where you can choose to run pipelines as MapReduce, Spark, or Spark Streaming programs. Data Fusion provisions an ephemeral Cloud Dataproc cluster in your customer project at the beginning of a pipeline run, executes the pipeline using MapReduce or Spark in that cluster, and then tears the cluster down after the pipeline execution is complete. Alternatively, if you manage your Cloud Dataproc clusters in controlled environments through technologies like Terraform, you can configure Data Fusion not to provision clusters. In such environments, you can run pipelines against existing Cloud Dataproc clusters.

You can create multiple instances in a single project, and you can specify a GCP region in which to create each instance. Based on requirements and cost constraints, you can create a basic or an enterprise instance. Each instance contains a unique, independent Data Fusion deployment with a set of services responsible for pipeline life-cycle management, orchestration, coordination, and metadata management. These services run using long-running resources in a tenant project. Note that although each instance creates long-running resources, you are only charged for pipeline execution beyond the upfront cost. A long-running but idle instance does not incur additional charges over time; you only incur charges when you run pipelines to process data using the instance. Data Fusion creates instances on a GKE cluster inside a tenant project, and you can find more details about the resources used by an instance in the architecture components documentation. You can create and manage Data Fusion instances using the GCP Console UI by clicking the Data Fusion link in the big data section.

Let's take a closer look at how Data Fusion is integrated with GCP. At the time of this recording, Data Fusion executes your pipelines on the ephemeral Cloud Dataproc clusters we just saw, with support for executing on Cloud Dataflow planned for the future. Inside a Data Fusion instance that is booted up on a Dataproc VM are the five core products and services listed here. Data Fusion runs in a containerized environment on GKE, with persistent disks and long-term data storage in Cloud Storage. To manage user and pipeline data, it is backed by a Cloud SQL database, and it uses the Key Management Service.
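Returning to instance creation for a moment: beyond the Console UI, instances can also be created through the Data Fusion REST API. The following is a minimal sketch of that call in Python, assuming Application Default Credentials and the google-auth library are available; the project ID, region, instance name, and request fields shown are placeholder assumptions and should be checked against the current API reference.

```python
# Minimal sketch: create a basic Cloud Data Fusion instance via the REST API.
# Assumes Application Default Credentials are configured; project, region,
# and instance name below are placeholder values.
import google.auth
from google.auth.transport.requests import AuthorizedSession

PROJECT = "my-project"         # placeholder project ID
REGION = "us-central1"         # placeholder GCP region
INSTANCE_ID = "my-datafusion"  # placeholder instance name

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

parent = f"projects/{PROJECT}/locations/{REGION}"
url = f"https://datafusion.googleapis.com/v1/{parent}/instances"

# "type" selects the edition discussed above: BASIC or ENTERPRISE.
body = {"type": "BASIC", "description": "Instance created from a script"}

resp = session.post(url, params={"instanceId": INSTANCE_ID}, json=body)
resp.raise_for_status()
print("Long-running operation started:", resp.json().get("name"))
```

Instance creation runs as a long-running operation, so in practice you would poll the returned operation name until the instance is ready before using it.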
All of that said, you will likely never interact with any of these underlying services, because a goal of Data Fusion is to abstract them away from you so you can focus your time on exploring datasets and building beautiful pipelines with no code in a UI. Through the rest of this module, we'll show you tips and tricks for working in the Data Fusion UI, as you will practice in your lab. At a high level, Data Fusion provides you with a graphical user interface to build data pipelines with no code. You can use existing templates, connectors to GCP and other cloud service providers, and an entire library of transformations to help you get your data into the format and quality you want. Also, you can test and debug the pipeline and follow along with each node as it receives and processes data. As you will see in the next section, you can tag pipelines to help organize them more efficiently for your team, and you can use the unified search functionality to quickly find field values or other keywords across your pipelines and schemas. Lastly, we will talk about how Data Fusion tracks the lineage of transformations that happen before and after any given field in your dataset.

One of the advantages of Cloud Data Fusion is that it's extensible. This includes the ability to templatize pipelines, create conditional triggers, and manage and templatize plugins. There are UI widget plugins, as well as custom provisioners, custom compute profiles, and the ability to integrate with the Hub. The two major user interface components we will focus our attention on in this course are the Wrangler UI, for exploring datasets visually and building pipelines with no code, and the Data Pipeline UI, for drawing pipelines right onto a canvas. You can choose from existing templates for common data processing tasks, like GCS to BigQuery.

There are other features of Cloud Data Fusion that you should be aware of too. There's an integrated rules engine where business users can program their predefined checks and transformations and store them in a single place. Data engineers can then call these rules as part of a rulebook or pipeline later. We mentioned data lineage as part of field metadata earlier. You can use the metadata aggregator to access the lineage of each field in a single UI, and analyze other rich metadata about your pipelines and schemas as well. For example, you can create and share a data dictionary for your schemas directly within the tool.
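To give a sense of how a deployed, templatized pipeline can be operationalized outside the UI, here is a minimal sketch that starts a batch pipeline through the CDAP REST API exposed by a Data Fusion instance, passing runtime arguments to fill in pipeline macros. The instance endpoint, pipeline name, and argument name are placeholder assumptions; the real API endpoint is shown on the instance details page, and the macro names depend on how your pipeline was built.

```python
# Minimal sketch: start a deployed batch pipeline with runtime arguments,
# using the CDAP REST API exposed by a Data Fusion instance.
# Endpoint, pipeline name, and macro name are placeholders.
import google.auth
from google.auth.transport.requests import AuthorizedSession

API_ENDPOINT = "https://my-instance.example.datafusion.googleusercontent.com/api"  # placeholder
PIPELINE = "gcs-to-bigquery"  # placeholder pipeline name
NAMESPACE = "default"

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
session = AuthorizedSession(credentials)

# Batch pipelines are exposed as a workflow named DataPipelineWorkflow.
url = (f"{API_ENDPOINT}/v3/namespaces/{NAMESPACE}/apps/{PIPELINE}"
       "/workflows/DataPipelineWorkflow/start")

# Runtime arguments fill in macros defined in the pipeline, which is one way
# a single templatized pipeline is reused across runs.
runtime_args = {"input.path": "gs://my-bucket/raw/*.csv"}  # placeholder macro

resp = session.post(url, json=runtime_args)
resp.raise_for_status()
print("Pipeline run requested:", resp.status_code)
```

The same style of call, scheduled from a tool like Cloud Composer, is how these no-code pipelines are typically folded into larger orchestrated workflows, which is where this module heads next.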