We're going to look at the difference between a block-based device and an object-based device. We're going to take a look at three specific file systems, not in any great detail: NFS, the Apache Hadoop file system, and the Lustre file system. There are many others out in the world that we're not going to cover: ZFS, which was made by Sun; one called Quantcast; Ceph, Gluster, PVFS, and others. I'm sure on the computer science side of the world there are probably semester-long courses on file system development. We're just going to scratch the surface, just to give you an idea of what a file system does and what some of the limitations of a traditional file system like NFS are when you're dealing with very large amounts of data.

So, let's take a look at block devices first. In my cartoon picture here, I have a host system communicating with the drive over some kind of an interface. There are lots of interfaces out there. SATA is incredibly popular, still is. We have discussions all the time where I work about when SATA is going to go away, and it seems to hang on and hang on and hang on. Our customers keep wanting to buy SATA drives, so we keep building SATA drives for them. SATA stands for Serial ATA. Then there's what we call SAS in the industry, Serial Attached SCSI; those are the types of drives I worked on when I was at Seagate. USB, we're all familiar with that, the Universal Serial Bus. And to wrap it up, the up-and-comer: PCIe-based NVMe drives. NVMe stands for Non-Volatile Memory Express, and it runs over PCIe.

Drives store data in what are called blocks. They're also called sectors, and the terminology is used interchangeably: you can say "I'm going to move this many blocks" or "I'm going to move this many sectors," so blocks equal sectors. Each block, just like a byte in memory, has an address, referred to as the logical block address, or just LBA. When you talk about the LBA, you're talking about the address of some block on the device. A drive can only transfer in units of these blocks. If you want to read one byte or write one byte of a file, you have to read, say, 512 bytes at a time if the block size is 512 bytes. Block sizes vary from 512 bytes at the smallest all the way up to 4,224 bytes. The strange sizes above 4,096 are used in the enterprise space: 4,096 bytes of that is user data, and there's some additional metadata tacked onto the end for validating the integrity of that sector. We're not going to get into all of that. The takeaway is that a block device is formatted with some sector size, some block size, and then all subsequent transfers take place in units of whatever that size is.

So, what happens here? We've got a diagram like you see in the SCSI specifications; they use this kind of picture, where time goes down the page, the same as in a UML sequence diagram. When a host system wants to read some information from a drive, it issues a read command containing the starting logical block address, the first sector that needs to be read from the media and transferred to the host system, and a length n, so many sectors. The command is transferred, shown here in red, over to the drive. The drive goes and finds that first sector and starts transferring sectors across the interface, indicated by these blue lines. When all of the sectors for the command have been transferred, the drive returns status to the host system saying the read command is complete, or that it had errors.
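To make that block arithmetic concrete, here's a minimal sketch in C of what a file system has to do to turn a byte-level request into the block-level command we just walked through. The struct and function names are made up for illustration; the point is just the rounding to sector boundaries.

```c
#include <stdint.h>
#include <stdio.h>

#define SECTOR_SIZE 512u  /* assume the drive is formatted with 512-byte blocks */

/* Hypothetical block read command: a starting LBA plus a sector count. */
struct read_cmd {
    uint64_t start_lba;    /* address of the first sector to read */
    uint32_t num_sectors;  /* length n, in sectors */
};

/* Round a byte-level request out to whole sectors (assumes byte_count > 0). */
struct read_cmd bytes_to_blocks(uint64_t byte_offset, uint64_t byte_count)
{
    struct read_cmd cmd;
    uint64_t first = byte_offset / SECTOR_SIZE;                    /* round down */
    uint64_t last  = (byte_offset + byte_count - 1) / SECTOR_SIZE; /* round up */
    cmd.start_lba   = first;
    cmd.num_sectors = (uint32_t)(last - first + 1);
    return cmd;
}

int main(void)
{
    /* Asking for 1,000 bytes at offset 0 still costs two full sectors. */
    struct read_cmd cmd = bytes_to_blocks(0, 1000);
    printf("read LBA %llu, %u sectors (%u bytes moved)\n",
           (unsigned long long)cmd.start_lba, cmd.num_sectors,
           cmd.num_sectors * SECTOR_SIZE);
    return 0;
}
```

Notice the output: a 1,000-byte request becomes "read LBA 0, 2 sectors," moving 1,024 bytes, which is exactly the example coming up next.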
Drives do report errors; it's possible. When that happens, the host system can come back and try to reread that information, or, in a data center where a server got a read error from one drive, it might go to a different drive. Many times data centers use data replication, and they'll read the data from someplace else. There are also higher levels of protocol that can fix a read error by rewriting the data to the drive. Questions? Okay.

So, what happens when we do a file read? These three lines of C code here you've probably written a whole bunch of times in your past; I would expect you have by now. It's standard C file I/O. You define a file handle, which has been done up above here, and you call fopen, giving it the name of the file and read permissions or read-write permissions; in this case we're opening the file for read. You use malloc to allocate a buffer in your application's memory space that you want to read the file data into; malloc returns a pointer to a buffer in your memory that's 4,000 bytes long. Then you call fread, passing it the buffer pointer, a size, the number of those sizes to read, and the file handle; in this case it's just like saying we want to read 1,000 bytes. It returns the amount that was actually read. In all the applications I've ever written, I think the only thing I've ever done with the amount read is check it to see if there was an error.

So, here's your application that you've been writing in C or C++, and you make this call to fread into the file system, which is part of the operating system. The operating system turns this into a command for the drive: read, starting at an LBA, two sectors. The file system knows this device is formatted with 512-byte blocks, so to service a 1,000-byte request it has to move at least two blocks, or 1,024 bytes. That's the command the drive sees. The drive says, okay, I can go find those LBAs for you, and it reads those two LBAs. The operating system maintains a buffer that those two sectors get placed into, 512 bytes and 512 bytes. Then the OS file system copies the 1,000 bytes you asked for from its private buffer into your buffer, and returns the amount read, 1,000 bytes, to your application, which assigns it to this integer variable here.

Now, the file pointer. When I think about files, I think about them as a long string of bytes. When a file is created or first opened, the file pointer is pointing at byte zero, and as you read, the file pointer moves along; you can use the seek calls to move the file pointer back and forth in the file and read from various locations within it. What's actually happening behind the scenes is that the OS is moving data in 512-byte blocks. So when this read is over, we've read byte zero through byte 999, and the file pointer is pointing at byte 1,000, ready to read the next byte, unless you call seek and move it someplace else.
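Here's a minimal, self-contained sketch of what those three lines from the slide look like fleshed out. The file name and sizes are placeholders, and the error checks are the parts I admitted I usually skip; note that fread signals trouble with a short count rather than a negative return.

```c
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *fh = fopen("myfile", "r");   /* open the file for read */
    if (fh == NULL) {
        perror("fopen");
        return 1;
    }

    char *buf = malloc(4000);          /* 4,000-byte buffer in our address space */
    if (buf == NULL) {
        fclose(fh);
        return 1;
    }

    /* Read 1,000 one-byte items: "give me 1,000 bytes of this file." */
    size_t amount_read = fread(buf, 1, 1000, fh);

    /* fread returns the number of items actually read; a short count
       means end-of-file or a read error, which ferror() distinguishes. */
    if (amount_read < 1000 && ferror(fh))
        fprintf(stderr, "read error\n");
    else
        printf("read %zu bytes\n", amount_read);

    free(buf);
    fclose(fh);
    return 0;
}
```

Behind that one fread call, the OS is issuing the two-sector drive command and copying out of its own buffer, exactly as described above.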
Object devices, or object drives, change the semantics from blocks to objects. In an object-based drive, data is referred to by an object ID, and these objects can be of arbitrary size; the whole block-semantics thing is done away with. The idea behind object drives is that there's a lot of software involved in the traditional call stack: your C file I/O, the file system, the block-based device underneath. One routine calls another routine, which calls another routine, which calls another routine, so there's quite a bit of latency before the command actually gets to the drive. The value proposition of object devices is that the application can talk directly to the drive, or through a very, very shallow shim, with much less overhead.

Seagate, when I was there, built some storage devices called Kinetic drives. Kinetic boxes, I think they were: rack-mounted devices, if I remember correctly; this is going back a number of years. I think each had twin 10-gigabit Ethernet ports on it, and applications would talk to the storage devices inside the box with object semantics.

So, a read command in this situation has an ID for the object to be read, some offset, and a length in bytes. The command is delivered to the drive, the drive looks up that object's ID, finds where it's stored on the media, and returns length bytes starting at that offset into the object. It streams some number of bytes, and it doesn't have to be 512: it can be 1,000, it could be seven, it could be any number. There might be some limitations depending on the implementation; maybe it's rounded off to 16 bytes as the smallest thing you can move, or something. But the idea is that it gives a much finer granularity, and it eliminates quite a bit of the overhead of the traditional file system, squishing it down to a very, very thin layer. I don't know if Seagate is still selling those Kinetic drives or not, but that's the idea behind them.

Now, this next slide is pretty much a cartoon, but I wanted to make it similar to the block-based example. In this case, we do an object open on "myfile" for read and get a handle. Again, we allocate a memory buffer to put the data into, and we call object read with the buffer pointer, 1,000 bytes, and the handle. The command that ends up going from the operating system out to the device along that read path is: read, ID, offset, length. The drive goes off, finds that object based on the ID, and transfers 1,000 bytes to the operating system; the operating system copies that into the application's buffer and returns the amount of bytes read to the caller. Now, in an optimized system, this OS buffer probably wouldn't even exist. The drive would return the data right through that very thin shim, which would just shuffle the bytes into the application buffer as they come off the drive. So it might only need a small buffer acting like a FIFO here, or no buffer at all.
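As a sketch of those object semantics in C: none of this is a real API. The names object_open and object_read are hypothetical stand-ins for whatever a device's client library actually exposes (Kinetic had its own library), and the stub "device" is just there so the sketch runs. The point is the shape of the request: ID, offset, length in bytes, no sector rounding.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical object-device API; illustrative names, not a real library. */
typedef uint64_t object_id_t;

/* --- Stub "device" so the sketch runs: one 4,096-byte object of 'x'. --- */
static char fake_object[4096];

static object_id_t object_open(const char *name, const char *mode)
{
    (void)name;
    (void)mode;
    memset(fake_object, 'x', sizeof fake_object);
    return 42;  /* a real device would hand back the object's ID */
}

/* Read 'length' bytes starting 'offset' bytes into the object:
   byte granularity, no block semantics anywhere. */
static int64_t object_read(object_id_t id, uint64_t offset,
                           void *buf, uint64_t length)
{
    (void)id;
    if (offset + length > sizeof fake_object)
        return -1;
    memcpy(buf, fake_object + offset, length);
    return (int64_t)length;
}

int main(void)
{
    object_id_t id = object_open("myfile", "r");

    char *buf = malloc(4000);
    if (buf == NULL)
        return 1;

    /* The command the device sees: read, ID, offset, length in bytes. */
    int64_t amount_read = object_read(id, 0, buf, 1000);
    printf("read %lld bytes\n", (long long)amount_read);

    free(buf);
    return 0;
}
```

Compare this with the fread version above: the application asks for exactly 1,000 bytes and the device delivers exactly 1,000 bytes, with no private OS buffer holding two sectors in the middle.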
Again, I made this slide about an hour ago. Since Andy wanted this to be a class where new, upcoming, and evolving technologies get brought forward, there's this notion of a key-value drive, a key-value device, and it's very, very similar to an object-based device. I've given you some links here to go read on your own. Undoubtedly there will be an exam question on the final that asks what the difference is between an object drive and a block-based device, but I'm not going to ask you questions about key-value devices. It's interesting reading, though: this one is out of Enterprise Storage Forum, there's a link to a Wikipedia article about NoSQL and key-value stores, and then someone has this Project Voldemort; they must be Harry Potter fans. All of these have to do with key-value devices.

It works almost exactly the same as an object device. The read command has a key and an offset. Actually, I should have said value here, because what happens is the key gets hashed, and the hash value is the thing used to look up the actual data inside the drive. The key gets created, say with a SHA-2 hash, to give us some large number identifying the thing, and then the drive hashes that and uses it to find the data. This is happening in real time right now in our world, and I'm sure other storage device vendors are looking into it too, because our customers are starting to talk about it. We're trying to figure out what it really means: what kind of hardware does the drive have to have in order to support this? What does it mean for the drive in terms of performance, power dissipation, quality of service, and all these different metrics we look at? So keep your eye out for key-value devices, and if you hear that term now, you'll have some idea of what's going on there.
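To give a flavor of that key-to-location step, here's a toy sketch in C. I use FNV-1a only because it fits on a slide; a real device would use something stronger, like the SHA-2 family mentioned above, and the bucket table is a made-up stand-in for whatever index the drive actually keeps internally.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* FNV-1a: a tiny non-cryptographic hash, standing in for the stronger
   hash (e.g., SHA-2) a real key-value device would use. */
static uint64_t fnv1a_64(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;    /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;             /* FNV prime */
    }
    return h;
}

#define NUM_BUCKETS 1024u  /* toy stand-in for the drive's internal index */

int main(void)
{
    const char *key = "sensor-42/reading-0001";  /* hypothetical key */

    /* Hash the key into one large number, then map that number to a
       slot; the device would use this to find the value on the media. */
    uint64_t h = fnv1a_64(key, strlen(key));
    unsigned bucket = (unsigned)(h % NUM_BUCKETS);

    printf("key \"%s\" -> hash 0x%016llx -> bucket %u\n",
           key, (unsigned long long)h, bucket);
    return 0;
}
```

The open questions I mentioned, about hardware, performance, and quality of service, come down to doing that hashing and lookup inside the drive itself rather than in host software.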