As Morgan covered in the last video, data prep is 80 percent of the processing you do in a data lake. The primary objective of this video is to cover some file optimizations you can apply to extract better performance and lower costs from your data lake. I will go over columnar data formats, compression and splitting, and also talk about Athena partitioning. Get your beverage of choice, and let's begin with columnar data formats.

Imagine a data set that looks like this table, containing id, name, and age. Now, imagine that we are using a CSV file to store this data. Since CSV is row-oriented, it stores the information on disk organized in blocks per row: the first block contains the information for the first row, the second one for the second row, and so on. Now, let's say you want to run the following query: SELECT AVG(age) FROM table. This query needs every single value from one specific column. Because the data is stored in a row format, the system has to perform multiple disk I/Os to get the data needed to calculate the average. In this example, I am matching one row per file system block, just so you get the idea. That being said, the query would need six storage I/Os, because we have six rows in the table.

On the other hand, this is how data is stored when you have a data format that stores information in a columnar way. Here, the information is arranged in blocks by column: the first disk block would have the id values 1, 2, 3, 4, 5, 6; the second disk block would have all the names; and the third disk block would have all the ages. Again, I am abstracting block sizes and matching them to my data set just to illustrate the difference. If we run the same query, which requires all the values in a column, the number of consumed I/Os is smaller, because the information on disk is naturally organized in blocks that represent the values of a column. With less disk activity, the query tends to run faster. We can tell that this type of data arrangement is more optimized for this type of query.

But again, there is no single best format. Row-based systems are designed to efficiently return data for entire rows. Also, if you are constantly sorting and adding new entries to the data set, row-based formats can be more appropriate, since the pieces of information needed to sort and write are stored together, requiring fewer I/Os. Columnar data stores shine for analytics because, for most big data and data lake related queries, you will be interested in extracting values from entire columns.

Columnar data formats are also optimized for compression. Let me tell you why compression is strategic in the big data world. Compression is important for two main reasons. First, it allows data processing systems to optimize for memory. Most high-performance big data technologies copy data to RAM for faster execution, and compressed data lets you fit more data in the same memory space. Second, compression matters because in services like Amazon S3, where you pay for usage, one of the aspects taken into account when building up the price is the size of your buckets. In other words, compression makes things faster and cheaper. It is something you want in your data set.

Speaking of cost and performance, data partitioning is also important, especially when you are working with Amazon Athena. That's because Amazon Athena charges you per terabyte scanned.
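To make the row-versus-column idea concrete, here is a minimal sketch in Python using the pyarrow library; this is my choice of tooling for illustration, not something shown in the video, and the file names and sample values are hypothetical. It writes the same six-row table as row-oriented CSV and as columnar, snappy-compressed Parquet, then answers SELECT AVG(age) by reading only the age column from the Parquet file.

```python
import pyarrow as pa
import pyarrow.csv as pacsv
import pyarrow.compute as pc
import pyarrow.parquet as pq

# The six-row id/name/age table from the example (hypothetical values).
table = pa.table({
    "id": [1, 2, 3, 4, 5, 6],
    "name": ["Ana", "Bo", "Cy", "Di", "Ed", "Fay"],
    "age": [23, 35, 41, 29, 52, 38],
})

# Row-oriented storage: each row is written in full, so answering
# SELECT AVG(age) means reading every column of every row.
pacsv.write_csv(table, "people.csv")

# Columnar storage: the values of each column are stored together, and
# snappy compression shrinks the on-disk (and S3) footprint.
pq.write_table(table, "people.parquet", compression="snappy")

# SELECT AVG(age): only the "age" column is read from disk.
ages = pq.read_table("people.parquet", columns=["age"])
print(pc.mean(ages["age"]).as_py())  # -> 36.33...
```

On a six-row table the I/O difference is invisible, but the same column pruning is what saves real disk reads, and real scan costs, once the table has millions of rows.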
If we convert our files to appropriate data formats before running Athena queries, we can make Athena scan less data, resulting in queries that run faster and cheaper. Columnar data formats are instrumental for Athena usage because, as you learned, data is organized in the form of columns. If you have a data set in a columnar data format and run a query that scans entire columns, under the hood Athena will perform S3 partial GET operations, fetching only the specific parts of the S3 objects that contain the columns you are looking for. That makes Athena scan less data and consume less memory, again resulting in a query that runs faster and cheaper.

In addition to file formats, compression, and partitioning, there are more best practices I recommend applying in your data lake. The additional reading for this week contains good information, so don't forget to check it out. In the next video, you will see a demo where I show you the difference between running Amazon Athena queries against CSV, which is row-based, and Parquet, which is column-based. Remember, data lakes are still schema-on-read, which means the processing layers can adapt to the data set. But optimizations and data prep are highly recommended, mostly because of cost and performance.
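As a companion to the upcoming demo, here is a sketch of the kind of conversion that makes Athena scans cheaper, again in Python with pyarrow; the file names, the year column, and the bucket layout are all hypothetical assumptions of mine. It rewrites a CSV file as a Parquet data set partitioned by year, so an Athena query that filters on year only scans the matching S3 prefixes, and columnar reads fetch only the referenced columns.

```python
import pyarrow.csv as pacsv
import pyarrow.parquet as pq

# Read a row-oriented source file (hypothetical name and layout;
# assume it carries id, name, age, and year columns).
table = pacsv.read_csv("events.csv")

# Rewrite it as columnar Parquet (snappy-compressed by default in
# pyarrow), partitioned by year. On S3 this yields prefixes such as
#   s3://my-bucket/events/year=2023/...
pq.write_to_dataset(
    table,
    root_path="events_parquet",  # an s3:// path also works via pyarrow's S3 filesystem
    partition_cols=["year"],
)

# After registering that location as a partitioned table, an Athena
# query like the one below scans only the year=2023 prefix, and its
# columnar reads fetch only the age column:
#   SELECT AVG(age) FROM events WHERE year = 2023;
```

Since Athena bills per terabyte scanned, combining partition pruning with a columnar, compressed format attacks the scanned-bytes number from both directions.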