Which means these things are lazy, right?
I'm just going to remind you again, lazy operations.
So map we already know pretty well, apply a function to each element in the RDD and
then return another RDD of the result of all of these applications per element.
flatMap, same idea, but flatten the result.
So return an RDD of the contents of the iterators returned.
In the case of filter, we use this predicate function.
And we return an RDD of elements that have passed this predicate
condition in this case.
It's a function from some type to a Boolean.
And finally, distinct.
This is also another rather common operation.
And basically, this is kind of like the distinct operation on a set.
So it will remove duplicated elements and return a new RDD without duplicates.
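Since the RDD API deliberately mirrors the Scala collections API, the semantics of these four transformations can be sketched on a plain List. The data and names below are made up for illustration; on a real RDD, these same calls would be lazy and return new RDDs instead of Lists.

```scala
// Plain Scala collections stand in for RDDs here.
val words = List("spark", "rdd", "spark", "lazy")

// map: apply a function to each element.
val lengths = words.map(w => w.length)          // List(5, 3, 5, 4)

// flatMap: map each element to a collection, then flatten the results.
val chars = words.flatMap(w => w.toList)        // all 17 characters, flattened

// filter: keep only elements that pass a predicate of type T => Boolean.
val longWords = words.filter(w => w.length > 3) // List("spark", "spark", "lazy")

// distinct: remove duplicated elements.
val unique = words.distinct                     // List("spark", "rdd", "lazy")
```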
Now that we've seen a few common transformations that you're going to stumble
across in your assignments and
in the wild, let's look at common actions that you might find in the wild.
And again, I'm going to remind you with big letters that these are eager.
These are how you kick off staged up computations.
So one of the most commonly used actions is collect.
collect is very popular because it's how you get the result back
after you've reduced your data structure down.
Let's say you did a bunch of filter operations, and you've gone from some RDD
that was 50 gigabytes to one that now perhaps has only a handful of elements in it.
This is how you get the result back.
After you've reduced the size of your dataset, you can use collect to
get back the subset or the smaller data set that you filtered down.
So you get all elements out of the RDD.
But typically you use collect after you've done a few transformations.
You know your data set's going to be smaller,
then you do collect to get all of those results collected on one machine.
The next operator which we've used so far actually is count.
So the idea behind count is very similar to the idea
behind count in the collections API.
Just return the number of elements in the RDD.
Take is another very important action, okay?
It's an action because we go from an RDD to an array.
So you maybe have a very large RDD of 50 gigabytes worth of elements.
And you say okay, take 100.
And you get back then an array of those 100 elements of type T.
So you basically are converting what was once an RDD into an Array.
That's the same with collect, which I didn't mention: the return type of collect
is an Array of T, because you're taking things out of an RDD and you're putting
them into an Array on one machine, instead of spread out on many machines in an RDD.
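A rough sketch of collect, count, and take using plain Scala collections as a stand-in for an RDD (the values here are illustrative; on a real RDD, collect() and take(n) return Array[T], and count() returns a Long):

```scala
val bigList = (1 to 1000).toList            // stand-in for a large RDD[Int]

// First shrink the dataset with a transformation...
val small = bigList.filter(x => x % 100 == 0)

// ...then materialize it, like the actions on an RDD would.
val asArray: Array[Int] = small.toArray     // like RDD.collect(): Array[T]
val howMany: Int = small.size               // like RDD.count(): Long
val firstTwo: Array[Int] = small.take(2).toArray // like RDD.take(2): Array[T]
```

The point is the return types: none of these results is an RDD, which is exactly how you know they're eager actions rather than lazy transformations.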
Finally, we have reduce and foreach.
Reduce I think we know pretty well, combine all of the elements
in RDD together using this operation function and then return the result.
So the return type of this is A instead of an RDD, that's how we know it's an action.
And finally, foreach, which returns type Unit.
So this applies this function to each element in the RDD.
But since it's not returning an RDD, it's an action.
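The same two actions sketched on a plain List standing in for an RDD (note that on a real RDD, the function passed to reduce should be associative and commutative, since the combining happens in parallel across machines):

```scala
val nums = List(1, 2, 3, 4)                 // stand-in for an RDD[Int]

// reduce: combine all elements with an operator function; returns A, not an RDD.
val sum = nums.reduce((a, b) => a + b)      // 10

// foreach: apply a side-effecting function to each element; returns Unit.
nums.foreach(n => println(n))
```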
So, again, how do you determine whether something is an action or
a transformation?
You look at the return type.
If it's not an RDD, it's an action, which means it's eager.
Never forget this.
Let's look at another example.
So let's assume we have an RDD of type String.
So it doesn't matter where it came from.
We have this RDD of type String now, which contains gigabytes and
gigabytes of collected logs over the previous year.
And each element of this RDD represents one line of logging.
Okay, so this is a very realistic scenario which you might want to use Spark for.
Perhaps you have many devices or machines constantly logging to some persistent
storage somewhere like S3, okay.
And now you want to analyze maybe how many errors you have in the last month.
So assuming that we have these dates in a typical
year-month-day-hour-minute-second format, and errors, when there is an error,
the errors are logged with a prefix that includes the word error in it.
How would we determine the number of errors that were logged in the month of
December, 2016?
Easy, we can start by calling filter.
And on each line in the log, we can check to see
if the string 2016-12 exists.
And if so, a line only passes this predicate if a second condition is also satisfied:
if error also exists in that line of text.
So if you have both 2016-12 and error in that line of text,
then filter it down into a new RDD with just these elements.
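Put together, the query looks like the chain below. The sample log lines are made up for illustration, and a plain List stands in for the gigabytes-large RDD[String] so the logic can run locally; on a real RDD, the filter would be lazy and count() would be the eager action that kicks off the computation.

```scala
// Hypothetical sample log lines standing in for a huge RDD[String].
val lastYearsLogs = List(
  "2016-12-01 10:00:00 error: disk full",
  "2016-12-02 11:30:00 info: job finished",
  "2016-11-30 09:15:00 error: timeout",
  "2016-12-15 23:59:59 error: connection reset"
)

// Same chain of calls you'd write on an RDD[String].
val numDecErrorLogs = lastYearsLogs
  .filter(line => line.contains("2016-12") && line.contains("error"))
  .size                                   // on a real RDD: .count()
```

Only the lines containing both "2016-12" and "error" survive the filter, so this counts exactly the errors logged in December 2016.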