- [Morgan] Before we hop right into the next category of AWS services for data lakes, think of a real lake. How did the water get into that lake? In other words, how was that lake formed? Is it fed by a small but consistent stream of water always flowing in? Is it fed by a large waterfall that pours thousands of gallons of water into the lake per minute? Perhaps it was formed by glacial meltwater thousands of years ago and is now filled only by occasional rainfall, or maybe it's a combination of sources that fill the lake. Let's take this idea of how literal lakes get their water and apply it to the concept of a data lake. The data in a data lake will likely come from multiple sources, and the way you move, or ingest, that data will depend on the source and the type of data you are ingesting.

Consider how a stream feeds a real lake: a small amount of water pours in consistently and constantly, with a steady flow. Now, think of your data lake. Maybe you have a data source where you have to constantly ingest small pieces of data to fill the data lake. For example, imagine you have water passing through a pipe, and you put a sensor in the pipe that measures the temperature of the water and sends the data off to be ingested and then processed. This is an example of ingesting real-time data, or ingesting streams of data in motion. You can use a service from the Kinesis family for real-time data ingestion. The Kinesis family includes multiple AWS services that allow you to ingest data streams of various types and analyze that data in real time, moving data into AWS in a secure and scalable way.

The first service to know about in the Kinesis family is Kinesis Data Streams, which Raf touched on briefly in a previous lesson. This could be used for our water pipe device example. Kinesis Data Streams is used to collect data from various sources, and you can then build Kinesis applications to continuously process the data, generate metrics, power live dashboards, or send aggregated data into stores like Amazon S3. To use Kinesis Data Streams, you use the AWS Software Development Kit, or SDK, to write and process the incoming data, which is organized into shards. You are responsible for scaling the number of shards up and down, splitting or merging them as your throughput changes. The data stored in the shards can be processed by one or multiple consumers, all using the SDK; there's a short producer sketch just below.

If you are not a developer type and you don't want to worry about using the SDK, or think about what shards are or how to manage them, you can instead use another service in the Kinesis family called Kinesis Data Firehose. Kinesis Data Firehose ingests data from sources in real time, just like Kinesis Data Streams, but allows you to simply designate where you want that data stored. There is no need to use the SDK. This makes it easier to use Kinesis for simple use cases that don't require data processing or aggregation before storage. Raf will talk more about Kinesis over the next week.

Now, not all of the data in your data lake will arrive through real-time ingestion. Let's say you would like to ingest data in a RESTful manner. For example, when you are working with IoT devices, you will likely want the device to transmit data that you can collect. IoT devices are often microcontrollers with limited hardware capabilities. Given these constraints, using an SDK like the Kinesis SDK could work, but it might be simpler to use standard HTTP calls, since HTTP libraries are lightweight and can be loaded easily.
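Before we get to that HTTP option, here is a minimal sketch of what the SDK route looks like for our water pipe example, written in Python with boto3. It assumes AWS credentials are configured and that a stream named water-pipe-temps already exists; the stream name, region, and sensor payload are all made up for illustration.

```python
# Minimal Kinesis Data Streams producer sketch (Python + boto3).
# Assumes AWS credentials are configured and that a stream named
# "water-pipe-temps" already exists -- both are illustrative assumptions.
import json
import time

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")  # region is illustrative


def send_reading(temperature_c: float) -> None:
    """Send one hypothetical water-temperature reading to the stream."""
    record = {
        "sensor_id": "pipe-01",  # hypothetical device identifier
        "temperature_c": temperature_c,
        "ts": time.time(),
    }
    kinesis.put_record(
        StreamName="water-pipe-temps",            # illustrative stream name
        Data=json.dumps(record).encode("utf-8"),  # payload must be bytes
        PartitionKey=record["sensor_id"],         # determines which shard receives the record
    )


if __name__ == "__main__":
    send_reading(21.5)
```

A consumer would read those same shards through the SDK as well. With Kinesis Data Firehose, you would skip this code entirely and just point a delivery stream at a destination like S3.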
So, what AWS service can you use for data ingestion that allows an HTTP call or even WebSockets? Look no further than Amazon API Gateway. Amazon API Gateway is a service that allows you to host APIs that act as a front door, or interface, to a backend. This backend could be an application running on an EC2 instance, an AWS Lambda function, or even another AWS service like Kinesis. By hosting a RESTful API with API Gateway, you can ingest data into your data lake by simply submitting HTTP calls from your data source, with the data posted in the payload, without needing any specific SDKs or heavy libraries; there's a short sketch of this at the end of this segment. API Gateway can be a simpler alternative for data ingestion, but it doesn't offer the same depth of features for ingesting stream-like data as a service like Kinesis. What you choose really depends on your use case.

There are often use cases where you need to ingest data from third-party sources that you do not manage or have control over. Depending on your goals, third-party data might enrich the data you have already collected. For example, let's say you are a company that creates software for healthcare services. You might be interested in collecting anonymized data from healthcare providers or data processors that you can store alongside your own data in order to gain insights. Instead of trying to curate this data yourself, using physical media, managing FTP credentials, or integrating with different APIs, look at the service AWS Data Exchange. Data Exchange offers an interface to hundreds of commercial products from data providers across many industries, like financial services and healthcare. You can use the Data Exchange APIs to securely extract and deliver the relevant data into your S3 destination.

Now, what about data that is generated by your use of third-party software, but that you do own? For example, data that comes from tools like Salesforce or Google Analytics. You might think you need to write your own connectors to these services to get the data from such SaaS applications into your data lake. The good news is that the service Amazon AppFlow can handle a lot of this for you. AppFlow does the data collection and secure transfer to your data lake for you, without you needing to write your own custom connector code. You can run data transfer flows on demand, on a schedule, or after an event; a short sketch of triggering a flow on demand also appears at the end of this segment. You can then quickly analyze this data using services like Amazon Athena, joining it with the other datasets already stored in S3.

You can also use pre-curated datasets through the Registry of Open Data on AWS, which makes it easy to find datasets made publicly available through AWS services. You can consume data hosted through this program, and you can even apply to share your own datasets. Sharing data in the cloud lets users focus on analysis and deriving insights.

To wrap this up: just like real lakes are formed in various ways, or in a combination of ways, your data lake on AWS will be formed by data coming from various places, and the tools you use for data ingestion will vary as well. You can use a service from the Kinesis family for real-time data ingestion, Amazon API Gateway for RESTful data ingestion, AppFlow for data ingestion from SaaS applications, and AWS Data Exchange or the Registry of Open Data for third-party and public datasets that you want to ingest into your data lake.
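To make the API Gateway path from earlier concrete, here is a minimal sketch of RESTful ingestion using only Python's standard library, with no AWS SDK involved. The endpoint URL and route are hypothetical; they assume you have already deployed an API with a POST route that forwards the payload to a backend such as Lambda or Kinesis.

```python
# Minimal sketch of RESTful ingestion through Amazon API Gateway.
# The endpoint URL is hypothetical -- it assumes an API with a
# POST /readings route that forwards the payload to your backend.
import json
from urllib import request

# Hypothetical invoke URL of a deployed API Gateway stage.
ENDPOINT = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/readings"


def post_reading(sensor_id: str, temperature_c: float) -> int:
    """POST one reading as JSON and return the HTTP status code."""
    body = json.dumps(
        {"sensor_id": sensor_id, "temperature_c": temperature_c}
    ).encode("utf-8")
    req = request.Request(
        ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return resp.status


if __name__ == "__main__":
    print(post_reading("pipe-01", 21.5))
```

Notice there is no SDK here at all; anything that can make an HTTP call, including a constrained IoT device, can ingest data this way.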
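And for running an AppFlow flow on demand, here is a similarly minimal sketch, again with Python and boto3. It assumes you have already created a flow in AppFlow that pulls from a SaaS source into S3; the flow name salesforce-to-s3 is hypothetical.

```python
# Minimal sketch: trigger an existing Amazon AppFlow flow on demand.
# Assumes AWS credentials are configured and that a flow named
# "salesforce-to-s3" was already created -- the name is hypothetical.
import boto3

appflow = boto3.client("appflow", region_name="us-east-1")  # region is illustrative

# Start one on-demand run of the flow. Scheduled and event-triggered
# runs are configured on the flow itself rather than started here.
response = appflow.start_flow(flowName="salesforce-to-s3")
print(response["flowStatus"], response.get("executionId"))
```

Once the run finishes, the transferred records land in the S3 location configured on the flow, ready to be queried with Athena alongside your other datasets.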