If you want to avoid going to memory when you don't need to,
you have to rely on caches,
but the cache is transparent to the programmer: you cannot interact with it directly.
Instead, you should try to program with two concepts in mind:
locality of memory access in space and locality of memory access in time.
You already know that processors are built with
two levels of parallelism: multiple cores and vector instruction support.
The memory subsystem is also built with parallelism.
Whenever you request a single floating-point number from memory or from a cache,
you get this number and, in parallel, you also get several of its neighbors,
because the minimal block of data that can be transferred between
memory and cache is 64 bytes long.
These blocks are called cache lines,
and they are aligned on 64-byte boundaries on Intel architectures.
The reason memory is organized this way is that
your application is expected to have unit stride.
If it does have unit stride,
then for the cost, the latency, of one memory access,
you get seven more accesses for free in the case of double precision,
or 15 more accesses for free in the case of single precision.
Because of such memory organization,
it is important to maintain spatial locality of memory access,
and that, to the programmer,
translates to unit-stride access in memory.
Unit-stride access means that in a loop,
you access elements that sit back to back in memory.
Then you pay the latency of one memory access
and get 7 or 15 accesses for free.
Sometimes, you need to look at the order of loop nesting
and make sure that the order is such that the innermost loop has unit stride.
For example, if you have a row-major layout of a two-dimensional container,
then you should traverse it so that the column index changes in the innermost loop:
you request a cache line and then use all of the elements in that cache line.
You should not have strided access,
where the row index changes fastest, because in that case,
you get one element,
memory automatically delivers seven of its neighbors,
but you use only that one element and then you move on to the next cache line.
You will eventually use these elements in memory,
but you will have to revisit them and potentially endure
the latency of another memory access.
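As a minimal sketch of these two traversal orders (my illustration, not the lecturer's slide; the n-by-n row-major array a is hypothetical):

    // Good order: the column index j changes in the innermost loop,
    // so a[i][j] and a[i][j+1] are adjacent in memory (unit stride).
    double sum_good(int n, double a[n][n]) {
        double total = 0.0;
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                total += a[i][j];
        return total;
    }

    // Bad order: the row index i changes in the innermost loop,
    // so consecutive accesses are n doubles apart (strided access).
    double sum_bad(int n, double a[n][n]) {
        double total = 0.0;
        for (int j = 0; j < n; j++)
            for (int i = 0; i < n; i++)
                total += a[i][j];
        return total;
    }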
Here are two code snippets that illustrate how loop permutation might work in practice.
What I'm showing here is matrix-matrix multiplication.
I have matrices A and B and I'm computing matrix C so that
C_ij is the sum over k of A_ik times B_kj.
You probably know this formula from basic linear algebra,
and I'm showing two codes to do this.
One code has the order of loops i-j-k and the other has i-k-j.
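The slides themselves are not part of this transcript, but a sketch of the two versions for square n-by-n row-major matrices (hypothetical names A, B, C, n; C is assumed to start out zeroed) could look like this:

    // Left: loop order i-j-k. The innermost loop over k reduces into the
    // scalar C[i][j]; A[i][k] is unit-stride in k, but B[k][j] is strided.
    void matmul_ijk(int n, double A[n][n], double B[n][n], double C[n][n]) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                for (int k = 0; k < n; k++)
                    C[i][j] += A[i][k] * B[k][j];
    }

    // Right: loop order i-k-j. C[i][j] and B[k][j] are unit-stride in j,
    // and A[i][k] is constant with respect to j.
    void matmul_ikj(int n, double A[n][n], double B[n][n], double C[n][n]) {
        for (int i = 0; i < n; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < n; j++)
                    C[i][j] += A[i][k] * B[k][j];
    }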
My claim is that one is better than the other,
and it is easy to understand why.
In the code on the left, the innermost loop over k has a scalar,
a constant, on the left-hand side,
so this is going to result in a vector reduction.
This may or may not be a problem depending on the length of the loop,
but let's take a look at the other memory accesses.
When I access A, I'm good,
this is unit-stride in k. It is unit-stride
because when k is incremented by a value of one,
I move in memory to the next immediate element.
In this way, I exploit the parallelism in
the memory architecture by using the entire cache line.
But the access to B is a problem.
Here, when I increment k by one,
I move in memory by n elements,
so this is strided access.
When you read memory with a stride,
it's called a gather operation,
and it is much less efficient than unit stride:
in double precision,
potentially eight times less efficient.
In contrast, the loop on the right has unit stride on the left-hand side,
the access to A is a constant that doesn't depend on j,
and for B, I also have unit stride.
So this is a more efficient way to do matrix-matrix multiplication.
Now I want to make a disclaimer.
The code on the right is actually bad code for
matrix multiplication, and the code on the left is even worse.
What the code on the right is lacking is locality of access in time.
There are ways to improve this code with loop tiling, but in reality,
if you need a standard mathematical operation that is found in the BLAS or
LAPACK libraries, don't write loops.
You should instead use a highly optimized implementation of this function,
for example, from the Intel Math Kernel Library, MKL.
It is likely going to be more efficient than any loops you could potentially write.
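For instance (my sketch, not shown in the lecture), the same product of n-by-n row-major matrices, stored as flat arrays of n*n doubles, could be handed to MKL's BLAS routine cblas_dgemm:

    #include <mkl.h>

    // C = 1.0 * A * B + 0.0 * C for row-major n-by-n matrices
    // stored as flat arrays of n*n doubles.
    void matmul_mkl(int n, const double *A, const double *B, double *C) {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, A, n,
                    B, n,
                    0.0, C, n);
    }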
At the same time, what I've illustrated is how to
improve locality of access in space by permuting loops.
Sometimes, you can also improve locality in space by changing your data containers.
Most frequently, you see it
when arrays of structures are converted to structures of arrays.
Say I have particles with coordinates x, y, z stored as an array of structures,
and I'm trying to vectorize a loop
that traverses these particles and performs some calculations on them.
Then when I load data into a vector register,
I will have to go with a stride of three.
So, on average, I will discard
two-thirds of the data that memory gives me in a cache line.
And that is much less efficient than changing the data container so that,
instead of particle[i].x,
you have particle.x[i].
Then you touch one element,
memory delivers the entire cache line to you, and you use
this entire cache line for free to load the data into the vector register.
This is how you take advantage of parallelism built into
the memory subsystem by maintaining unit stride.
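As a closing sketch (the names and the particle count N are mine, not from the lecture), the two container layouts look like this in C:

    #define N 1024  // hypothetical number of particles

    // Array of structures: x, y, z of one particle are adjacent in memory,
    // so loading all x values into a vector register requires a stride of 3.
    struct ParticleAoS { double x, y, z; };

    // Structure of arrays: all x values are contiguous,
    // so one cache line delivers 8 consecutive x values.
    struct ParticlesSoA { double x[N], y[N], z[N]; };

    // Unit-stride traversal of the structure of arrays, easy to vectorize.
    void shift_x(struct ParticlesSoA *p) {
        for (int i = 0; i < N; i++)
            p->x[i] += 1.0;
    }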