This leads us to our discussion of two parallel distributed file systems. The first one we'll look at is Hadoop, and again, we're just scratching the surface on this, and the other one is the Lustre file system. These ideas that I mentioned on the previous slide are very similar to a notion called Storage Compute. Has anyone heard of Storage Compute? Raise your hand if you've heard of Storage Compute. No. The primary idea and the driver behind Storage Compute is throughput and parallel access to the data, and dramatically reducing the power consumption for moving the data. So, go back to that original picture of the IBM Watson box and the NetApp storage appliance: there's power expended to read that data, bring it into the CPU, do some computations on it and store the results back to the storage unit. There's an energy cost associated with that. Storage Compute is all about pushing that compute out as close to the storage as possible. A friend of mine at Seagate, Kevin Gomez, has written a number of papers on this. You can Google his name, Kevin Gomez, along with Storage Compute, and find his papers. These file systems are similar in this notion to what's happening in Storage Compute, but they have slightly different purposes. So, I'll just leave it at that; you can go Google Storage Compute and Kevin Gomez and read his papers to see more about that.

So, we'll take a look at Hadoop. The builders of Hadoop set out with a series of goals that they wanted to achieve and set as targets; think of them as requirements, product requirements, okay? They understood that systems would fail, and so they devised a system that embraces failure. So, if some component fails, say a storage device fails, the system just keeps going along, running great. They wanted to provide streaming access to data and batch processing, and they wanted to deal with very large files; a typical file in Hadoop is gigabytes to terabytes in size. It's also a write-once, read-many scheme. It does support appending and truncating, but it's not a good fit for general-purpose file I/O like we have on our laptops or the machines we use every day. We used to call these WORM drives when I was at Seagate; we were discussing whether we wanted to build WORM drives or not, which is write-once, read-many as an acronym. I don't know what happened to that notion, but I haven't seen any products coming out of Seagate, so it probably went off in a corner and died. They also wanted it to be supported and portable across heterogeneous hardware and software systems. Heterogeneous is the opposite of homogeneous, where everything's the same: you've got all different kinds and sizes of compute and storage devices, different types of software, and everything can be different. That was one of their goals; it doesn't have to be a homogeneous hardware and software system.

So, here's my cartoon picture of what's called a Hadoop cluster. We've got clients over here that want to access data, and when a client executes fopen and fclose commands, creating files and closing files, those commands go to a master called a NameNode, and there's a bunch of storage associated with that. This is where all the metadata is stored: names and locations and how many replicas of the data there are. When you set up a Hadoop file system, you can set up the cluster to say, I want three copies of every file, or I want five copies of every file.
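To make that concrete, here's a rough sketch of what setting a replication factor looks like from a client machine. This is a minimal Python example that just shells out to the standard hdfs command-line client; the file names and paths are hypothetical, and on a real cluster the default usually comes from the dfs.replication property in hdfs-site.xml.

```python
import subprocess

# Minimal sketch: assumes the stock `hdfs dfs` command-line client is
# installed and pointed at your NameNode.  File and path names here are
# hypothetical.

# Copy a local file into HDFS and ask for five replicas of each of its
# blocks instead of the cluster default (the dfs.replication property).
subprocess.run(
    ["hdfs", "dfs", "-D", "dfs.replication=5",
     "-put", "results.dat", "/data/results.dat"],
    check=True)

# Or change the replication factor of a file that already exists; -w waits
# until the DataNodes have actually created the extra copies.
subprocess.run(
    ["hdfs", "dfs", "-setrep", "-w", "5", "/data/results.dat"],
    check=True)
```

The NameNode records that target replica count in its metadata, which is what lets it notice later that a file has dropped below its target and schedule another copy.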
So, if a drive fails someplace, it is marked as bad, and behind the scenes, as time allows, the Hadoop file system will recognize, all right, I've only got four copies of that data; now I'm going to go make a fifth copy. It'll go find one of the other four copies and copy it someplace else to get back to five copies again. This is how it embraces failure; this is how it deals with failures. The DataNodes are where your fwrites and your freads are directed. In this example, I've shown that the data is replicated. As I understand it, replication can be supported by the application as well, in addition to being set up when you initialize and bring up a Hadoop file system. In this case, data is operated on in blocks, and you'll notice they're much larger than the blocks in a block storage device. In a block storage device, blocks were 512 bytes, or up to 4K bytes. In Hadoop, a block is typically 128 megabytes, but it is configurable. A file, of course, would consist of multiple of these very large 128-megabyte blocks; a one-gigabyte file, for example, would be split into eight of them. So, the NameNodes here perform the opening, closing and renaming of files and those other types of administrative functions, and the DataNodes perform the creation of files, the deletion of files and the replication of data. You can think of this as one rack and that as another rack; they might be in the same room, might be across the business campus, might be around the world someplace else.

The goals of the Lustre file system are similar to yet different from the goals of the Hadoop file system. It was intended to run on Linux machines, and it was intended to scale capacity and performance very easily. This information is a year old; I probably should have gone in and freshened it up. It supports client counts ranging from hundreds of clients up to a million clients. At the time I did my research, according to the documentation and FAQ on the Lustre site, there were deployments out in the field somewhere (they didn't say who) that had 50,000 clients, users and machines, accessing data on a single Lustre file system. They wanted to provide very good performance to the clients. The goal was to get to a point where a single client could utilize 90 percent of the network bandwidth, and the aggregate client bandwidth can reach upwards of about 10 terabytes per second. As for file size, it has a maximum single file size of 32 petabytes, which is pretty good by today's standards, and it wouldn't surprise me if that grows as we move along. These standards and these systems continue to grow and adapt; nothing is static. Very few things are static in the computer industry. We're constantly in a cycle where we make a deployment, we observe how it behaves for a while, and then we stand back as engineers and say, "You know, we can make that better." We make it bigger and we make it faster. That's what we do in the computer business. It can support, at the time I wrote this, up to 512 petabytes of total capacity, with over a trillion files making up that 512 petabytes. Like I said, I imagine these numbers are going to just continue to move up; go do your research and grab a snapshot of what the values are today, then check a year from now and they're going to be bigger. Unlike Hadoop, which was a write-once, read-many file system, Lustre was designed for general-purpose file I/O. Replication is not a focus, but it can be supported. I added a link here from the lustre.org site about file-level replication and how to design that in if you're going to deploy a Lustre file system.
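As a rough sketch of what that looks like from a client node, the commands are roughly along these lines. This assumes a Lustre file system mounted at a hypothetical /mnt/lustre, the standard lfs utility, and a release new enough to support file-level replication (FLR); it's illustrative, not a deployment recipe.

```python
import subprocess

# Rough sketch only: assumes a Lustre client mounted at /mnt/lustre with the
# standard `lfs` tool available, and a Lustre release that supports
# file-level replication (FLR).  Paths and file names are hypothetical.

# Stripe a new file across 4 OSTs in 1 MB stripes.  Striping is about
# performance: the client can then read and write through several object
# storage servers in parallel.
subprocess.run(
    ["lfs", "setstripe", "-c", "4", "-S", "1M", "/mnt/lustre/striped.dat"],
    check=True)

# Create a file with two mirrors (replicas) of its data.  This is the
# file-level replication described on lustre.org; it is something you opt
# into per file, rather than a cluster-wide default like Hadoop's
# replication factor.
subprocess.run(
    ["lfs", "mirror", "create", "-N2", "/mnt/lustre/replicated.dat"],
    check=True)
```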
You can work this notion of data replication into your network and your file system. So, here's my cartoon picture. Lustre presents object semantics to the application: an Object Storage Server presents these object semantics to the clients, and sitting behind it is an Object Storage Target. These are all built out of block devices today, because 99.99 percent of all the storage in the world is block-based, so these Object Storage Servers are doing a translation from object semantics to block semantics to actually get to the data, because that's what exists today. So, again, here we have our clients out here. They're doing fopens and fcloses and so forth, and much like Hadoop, those go to what's called the management server and the metadata server, and again, this database of information is keeping track of file names, directory names and permissions, file layout, and the structure of where all of the data is. The fwrite and fread commands are routed to the object servers that service the actual data write requests and data read requests to bring our data back over the network to the client.

In a screenshot from the Lustre website, they show two networks running in parallel: one of them is InfiniBand and the other one is Ethernet. So, we've got a bunch of Lustre clients out here. The management servers are sitting here off to the side, and there is a notion of failover, so you can set up one to be active and one or more to be in standby. Then there are metadata servers that are taking care of all the metadata, all the file names and directory names and all the additional metadata needed to manage a file system; you can have an active metadata server and then one or more standby servers. Then there are some Lustre routers out here that route all the requests to these Object Storage Servers. They call them OSS; I don't know whether people try to pronounce that or not. The OSS receives the requests for the reads and writes, and they turn around and go out to Object Storage Targets out here, built with commodity storage devices. So, you could have low-grade SATA-type drives out here, or you might have high-end enterprise storage devices out here, which would run the SAS interface for example; that's not uncommon. These Object Storage Servers have a notion of failover as well. So, when you build your system and architect your file system, you decide how many of these you want. Of course, money's not infinite, so there's always a limitation when you're building anything; you just can't go crazy and put an infinite amount of replication or redundant nodes in there, there's some practical limit. But it does support this notion of failover, and that's an idea that's been around in the enterprise data center for a long time. SAS drives have two ports on them, or have had for years. When I was at Seagate, many enterprise-class customers would put all these drives in a big rack, and they'd use all the A ports and not transfer any data on the B port unless the A port failed. Then the drive would switch over to the B port, and they did that for reliability purposes. To my knowledge, customers are not generally transferring data on both the A port and the B port simultaneously. That may be different now.
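Since the same failover idea shows up at several levels here, from standby metadata servers down to dual-ported drives, here's a tiny conceptual sketch of that A-port/B-port behavior. It's purely illustrative; the class and method names are made up and don't correspond to any real driver or Lustre API.

```python
# Conceptual sketch only: a host-side view of a dual-ported drive where all
# I/O goes to the A port until it fails, then switches over to the B port.
# All names here are invented for illustration.

class PortFailedError(Exception):
    pass

class FakePort:
    """Stand-in for one physical port; flip `healthy` to simulate a failure."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def read(self, lba, length):
        if not self.healthy:
            raise PortFailedError(self.name)
        return f"{length} bytes from LBA {lba} via port {self.name}"

class DualPortedDrive:
    def __init__(self, port_a, port_b):
        self.ports = [port_a, port_b]  # prefer the A port, fall back to B
        self.active = 0

    def read(self, lba, length):
        # Try the active port; on failure, fail over to the other port and
        # retry once.  A real enterprise stack would also log and alert.
        for _ in range(len(self.ports)):
            try:
                return self.ports[self.active].read(lba, length)
            except PortFailedError:
                self.active = (self.active + 1) % len(self.ports)
        raise IOError("both ports failed")

drive = DualPortedDrive(FakePort("A"), FakePort("B"))
print(drive.read(0, 4096))         # served by the A port
drive.ports[0].healthy = False     # simulate the A port failing
print(drive.read(0, 4096))         # transparently failed over to the B port
```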
There was this notion of failover, this notion of redundancy, this notion of resiliency, and that's one of the things that separates enterprise-class products from client-level products: there's quite a bit more resiliency, redundancy and robustness built into enterprise-class products.