In this session, we're going to try to bridge the gap between data parallelism in the shared-memory case, which is what we learned in the parallel programming course, and distributed data parallelism. So we're taking that idea of data parallelism and extending it to the situation where you no longer have data on just one node. Now you have data spread across several independent nodes. As usual, let's get started by trying to develop a little bit of intuition first, and try to visualize the differences between shared-memory and distributed data parallelism.

Let's start with the shared-memory case. If I had to draw a picture of data parallelism, what does it look like? Since we're talking about the shared-memory case, we assume just one node, one computer. That means we have some dataset on one computer, and we'd like to exploit the computer's multiple processors to process this data more quickly by doing it in parallel, by breaking up the work and trying to do many things at once. So let's assume that our data is in a collection, perhaps a parallel collection. This jar is going to be a collection of jelly beans, let's just say, and our collection of jelly beans is called jar. It doesn't matter whether we're using parallel collections or sequential collections; we generally have the same API available to us. We can say, okay, I have this collection in jar and I can do a map on that collection. So I want to do the same operation to all of the jelly beans in the jar. For each jelly bean in the collection, I'm going to do something to it, whatever it is, maybe I change the color.

Now, in the shared-memory case you can have this API, but under the hood, what parallel collections in Scala will do is somehow split up the data. So there's some way to make chunks of the data, and then some kind of task or worker or thread abstraction, something that does the processing on these individual data shards in parallel. Typically you have many of these, and whenever that work is done, we start combining the results into a complete result again, if necessary, right? So visually, we chunk up our dataset into several pieces, and we have something like a worker or a thread doing the individual work on individual pieces of that data, all in parallel at the same time. Again, this is the shared-memory case: we have all of this data sitting in memory, and then different pieces of that data are being worked on by different individual processes, however our process is realized, okay?

So that's the visual picture of the shared-memory case. What's important to note is that this can all live underneath a collections abstraction. For example, we can take exactly the same abstraction that we already used in the case of regular collections. For all you know, because I've left out the types here, this jar could be a list or it could be a parallel array; you don't know. It has the same API, it looks exactly the same as a regular Scala collection or a parallel collection. So the point here is that we can have this data parallelism automatically happening for us underneath the hood, and we can reuse the same already very familiar and comfortable API that people already use. The key takeaway here is that this collections abstraction fits over this data-parallel model very well, and we can extend this to the distributed case as well.
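To make that concrete, here's a minimal sketch of the jelly-bean example using Scala parallel collections. The JellyBean case class, the jar value, and the color change are hypothetical names for illustration; note that in Scala 2.13 and later, parallel collections live in the separate scala-parallel-collections module.

```scala
// Scala 2.13+: parallel collections come from the scala-parallel-collections module.
import scala.collection.parallel.CollectionConverters._

case class JellyBean(color: String)

// A jar of jelly beans (hypothetical example data).
val jar: Vector[JellyBean] = Vector.fill(1000)(JellyBean("red"))

// Sequential map: one thread visits every jelly bean.
val recolored = jar.map(bean => bean.copy(color = "blue"))

// Parallel map: same API, but the collection is split into chunks
// that worker threads process in parallel under the hood.
val recoloredPar = jar.par.map(bean => bean.copy(color = "blue"))
```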
So what would the same data-parallel execution look like in the distributed case, sort of under the covers, independent of the abstraction? In shared-memory data parallelism, we split the data up on the same machine; in the distributed case, we split the data over several nodes. Okay, that's the difference: we split it up over several nodes. In the shared-memory case, workers or threads independently operate on the data shards in parallel. In the distributed case, nodes operate on the shards in parallel. So it's not workers or threads, it's independent nodes, independent machines, working on the data in parallel, okay? This very high-level description of data parallelism in the distributed case is very similar to the shared-memory case. The only real difference is that we're splitting things over several nodes, and the nodes are the things that do the work, instead of individual threads.

So what does this look like? We can go back to the example of having some kind of collection; once again, let's have a collection of jelly beans. In this case, we have individual compute nodes, and each one holds a piece of the dataset as an independent collection on each independent node. But one thing that we didn't have to deal with in the parallel programming course, the prerequisite to this course, when we were learning about data parallelism in the shared-memory case, was the network: the latency involved when these nodes have to share data or communicate with one another in some way. The latency imposed by needing to do this was never a worry when we were learning about shared-memory data parallelism. This is a big, fundamental difference that, as we'll see in later lectures, is actually going to impact the programming model. The latency is something that we will never be able to forget about.

However, just like with parallel collections, we can still keep the same familiar collections abstraction over our distributed data-parallel execution. Just like before, we can have code that looks just like this, and in this case, jar can be one of these distributed collections in Spark. And yet we have this wonderful collections API that looks just like Scala collections, over top of this distributed data-parallel execution. So the key takeaway here is that this model in general fits very, very nicely underneath this collections abstraction.
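As a minimal sketch of that idea, here's what the same jelly-bean map might look like when jar is a Spark RDD instead of a local collection. The SparkSession setup and the JellyBean case class are assumptions for illustration; the point is just that the map call itself looks the same as on an ordinary collection.

```scala
import org.apache.spark.sql.SparkSession

case class JellyBean(color: String)

val spark = SparkSession.builder().appName("jelly-beans").getOrCreate()
val sc = spark.sparkContext

// `jar` now refers to a distributed collection: the jelly beans are
// partitioned across the nodes of the cluster.
val jar = sc.parallelize(Vector.fill(1000)(JellyBean("red")))

// Same familiar API as a local Scala collection, but each node applies
// the function to its own partition of the data in parallel.
val recolored = jar.map(bean => bean.copy(color = "blue"))
```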
So just to summarize the differences between shared-memory and distributed data parallelism: in the shared-memory case, you have this data-parallel programming model, this collections programming model, and underneath the hood the way it's actually executed is that the data is partitioned in memory and then operated upon in parallel by independent threads, or a thread pool, or something like that. In the distributed case, we have the same collections abstraction we had in the shared-memory model on top of this distributed execution, but now the data is partitioned between machines, with the network in between, which is important. And just like in the shared-memory case, we still operate on that data in parallel.

So what does this mean? It means that a lot of what we learned in the previous course, in the shared-memory data parallelism part of that course, can actually be applied to the distributed counterpart. For example, non-associative reduction operations. Just to remind you what that is: if you had a parallel collection and you did a reduce on it, and you subtracted the elements in your reduction operation, that's a non-associative operation, and in parallel collections that will give you a non-deterministic result. Spark is exactly the same way: when you do these things in parallel and you have a non-associative reduction, exactly the same thing happens; you can get non-deterministic results. So the same lessons, the same sort of key takeaways that we got from the shared-memory data parallelism part of the parallel programming course, apply in large part to distributed data parallelism in Spark. However, the one thing we have to think about really carefully when we're using Spark's programming model is latency. We'll see that more in subsequent lectures.

So as you know, the implementation of this distributed data-parallel programming model that we'll use is Apache Spark. Apache Spark has an abstraction called resilient distributed datasets, RDDs for short, and an RDD is basically the distributed counterpart of a parallel collection. Throughout the course we're going to be focusing a lot on these RDDs: how they work and how to get the best performance out of them.

Here's one last high-level illustration of how to think about Spark and how it works. Let's assume that we have a very large dataset, for example the English Wikipedia, which we've reduced so that we don't have all of the possible data per article; we have just the titles and the texts of those articles, and just the English part of Wikipedia. Even that reduced dataset is pretty big; we can't fit it in the memory of one machine unless it's a really special machine. Most normal people's laptops can't hold it in memory. So we know we have to split it up over a couple of nodes. Assuming we have a dataset like this, what Spark will do is chunk up this dataset somehow via some partitioning mechanism, and then distribute this partitioned dataset over a cluster of nodes. What you get back from Spark is a reference to this entire distributed dataset. So now you have a very large collection that's spread over, in this example, let's say eight machines, and you have one name to refer to all of that distributed data. And now I can think of it as if it's a single collection, because to me it's going to look like a single collection abstraction, like a single list or something, right? So I can now refer to this thing called wiki, and in fact I'm referring to the data that's distributed over these, for example, eight machines here, okay? I can then just use this collections abstraction that we're already very comfortable with. So we've named this thing wiki, and now I can just do a map on wiki: for every single article, I can get its text and convert the text to all lowercase, getting rid of any uppercase characters in the Wikipedia text, for example. This is just a high-level picture of how we're going to be interacting with this distributed data, and as you can see, it's going to look very familiar. It's going to look just like the collections that you're already very comfortable with, except that it's going to be distributed over many nodes.
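Here's a minimal sketch of what that might look like in Spark, plus a tiny illustration of the non-associative reduce mentioned above. The WikiArticle case class, the sample data, and the SparkSession setup are assumptions for illustration; only the map over article text and the lowercasing come from the discussion.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical shape of a reduced Wikipedia article: just a title and its text.
case class WikiArticle(title: String, text: String)

val spark = SparkSession.builder().appName("wiki-example").getOrCreate()
val sc = spark.sparkContext

// Stand-in for the real reduced Wikipedia dump; Spark partitions this data
// and spreads the partitions over the nodes of the cluster.
val wiki = sc.parallelize(Seq(
  WikiArticle("Scala", "Scala Is A Programming Language"),
  WikiArticle("Spark", "Spark Is A Cluster Computing Framework")
))

// One name, `wiki`, refers to the whole distributed dataset, and the API
// looks just like an ordinary Scala collection: map over every article
// and convert its text to lowercase.
val lowercased = wiki.map(article => article.copy(text = article.text.toLowerCase))

// Non-associative reduction: subtraction. As with parallel collections,
// the order in which partition results are combined can vary, so the
// result is non-deterministic.
val risky = sc.parallelize(1 to 100).reduce(_ - _)
```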