- [Morgan] Now let's talk about some analytical tools you can use once your data is processed and in a format ready to be consumed. You can use various tools to analyze data depending on the data itself and your desired outcomes. Consider this example. The data you want to analyze is in CSV format and located in S3, and let's say you want to run a SQL query against it. You might be tempted to run a process to load it into a relational database and then run your SQL queries. However, with AWS, you actually don't need to do that. Let's talk about Amazon Athena. Athena is a great serverless service to use with your data lakes because it allows you to query data in S3 using standard SQL, similar to how you would with a relational database, all without needing to move or load the data into a database. You can query the data directly out of S3, which allows you to run ad hoc queries against your datasets. With Athena, there is no need to manage a cluster or data warehouse to run your SQL queries. Athena integrates with AWS Glue, as it natively supports querying datasets and data sources that are registered with the AWS Glue Data Catalog. These sorts of native integrations are one of the reasons AWS is a great choice for operating data lakes. The different AWS services work together as building blocks to form your solution. Another AWS service that integrates directly with S3 for analytics is Amazon Redshift, the data warehousing service. Redshift allows you to spin up clusters to run complex SQL queries for analytics against your data. You can load data into Redshift from various sources beyond S3, which gives you the ability to run joins across datasets. Athena and Redshift are great for analyzing data that has already been stored in S3. But what if you want to analyze data as it comes into your data lake in near real time?
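As a minimal sketch of what querying S3 data through Athena can look like, the snippet below composes a standard SQL query and submits it with boto3's `start_query_execution` API. The bucket, database, and table names are hypothetical placeholders, not values from this lesson, and running the submit step would require AWS credentials and an existing Glue Data Catalog table.

```python
def build_query(table: str, limit: int = 10) -> str:
    """Compose a standard SQL query that Athena can run directly
    against CSV data sitting in S3 -- no load step required."""
    return f"SELECT * FROM {table} LIMIT {limit}"


def run_athena_query(database: str, output_s3: str, table: str):
    """Submit the query to Athena. All names here are illustrative."""
    import boto3  # imported lazily so the sketch can be read without AWS set up

    athena = boto3.client("athena")
    return athena.start_query_execution(
        QueryString=build_query(table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )


# Hypothetical usage (requires credentials and a registered table):
# run_athena_query("my_datalake_db", "s3://my-athena-results/", "sensor_readings")
```

Athena writes query results to the S3 output location you configure, so the results themselves stay in your data lake as well.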
For example, imagine you have IoT devices sending seismic activity data to AWS for analysis, each sending 60 data points every second, and you are ultimately storing this data in S3. In this scenario, let's say you have tens of thousands of devices sending data, and you are running analysis on it in batches, once every hour, with the goal of generating an alert when the data points exceed a defined threshold. Well, if you want to alert people that an earthquake might be coming, running this analysis once an hour won't be good enough. You need a way to detect anomalies in real time. As you might have noticed by now, when I say the words "real time," I'm most likely going to be talking about Amazon Kinesis, which is the case here. You can use Kinesis Data Streams or Kinesis Data Firehose for real-time data ingestion, but you can also analyze that data in real time using Kinesis Data Analytics to do things like generate real-time dashboards. The service enables you to quickly author and run SQL code against streaming sources to perform time series analytics, feed real-time dashboards, and create real-time metrics. You can process and analyze streaming data using standard SQL. This would work perfectly for our example of IoT devices sending seismic data: you could query the data as it comes in, and if the values exceed normal thresholds, you could then send an alert via Simple Notification Service. The other great thing about Kinesis Data Analytics is that you can configure destinations for where you want to send the results of your analytics. So if you want to generate a real-time dashboard for the incoming data and also save it to S3 or elsewhere, like Amazon Elasticsearch Service, for search capabilities, that is all possible using the Kinesis family of services. Speaking of Amazon Elasticsearch Service, or Amazon ES, let's go over what that service is all about. Amazon ES is a managed service for running Elasticsearch, an open-source search and analytics engine for all types of data.
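To make the threshold idea concrete, here is a small, self-contained sketch of the anomaly check such a pipeline performs on each incoming batch of readings. In the real architecture this logic would be expressed as SQL in Kinesis Data Analytics, with SNS handling the alert; the threshold value and batch contents below are invented examples.

```python
SEISMIC_THRESHOLD = 5.0  # invented example threshold, not a real seismological value


def exceeds_threshold(readings, threshold=SEISMIC_THRESHOLD):
    """Return the readings that breach the threshold. In the streaming
    pipeline, a non-empty result would trigger an alert (e.g. via SNS)."""
    return [r for r in readings if r > threshold]


# Example: one second of data from a single device (60 data points).
batch = [0.4] * 58 + [5.3, 6.1]
anomalies = exceeds_threshold(batch)
if anomalies:
    print(f"ALERT: {len(anomalies)} readings exceeded {SEISMIC_THRESHOLD}")
```

Because the check runs per batch as data arrives, the alert fires within seconds of the anomalous readings instead of waiting for an hourly batch job.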
That data can include text, geospatial, structured, or unstructured data. You can send real-time data into Amazon Elasticsearch Service, but it also supports using data from S3. One common way to use Amazon ES is for log analytics, since it's great at indexing full-text log data from various sources to make it queryable and discoverable. Amazon ES is very commonly used as part of the Elasticsearch, Logstash, and Kibana (ELK) Stack, which gives you a robust suite of tools to ingest, store, and visualize data. Visualization is a great way to wrap your head around the data that you have collected, aggregated, and analyzed. Kibana is a supported visualization tool through Amazon ES, but AWS also offers Amazon QuickSight for visualization, which we will cover in a future lesson. All right, lots of information in this video. The main takeaway is this: everything is entirely dependent on your use case. Sometimes it can seem like different AWS services do basically the same thing, but each service has its own strengths that are tailored to specific use cases, and those strengths may or may not play well with your use case. Make sure you dig deeper into each service before deciding what you want to use for processing and analyzing data in your data lake solution.
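As a small illustration of the log-indexing idea, the sketch below assembles a payload in Elasticsearch's bulk-API format (newline-delimited JSON, alternating an action line with a document line) using only the standard library. The index name and log records are made up, and actually sending the payload to an Amazon ES domain endpoint would additionally require a signed HTTPS request.

```python
import json


def to_bulk_payload(index: str, docs: list) -> str:
    """Serialize documents into the Elasticsearch bulk-index format:
    one action line followed by one document line per record,
    newline-delimited, with a required trailing newline."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"


# Hypothetical log records to index for full-text search.
logs = [
    {"timestamp": "2020-01-01T00:00:00Z", "level": "ERROR", "message": "disk full"},
    {"timestamp": "2020-01-01T00:00:05Z", "level": "INFO", "message": "retry ok"},
]
payload = to_bulk_payload("app-logs", logs)
```

Once indexed, each field of each record becomes searchable, which is what makes the log-analytics use case work.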