The pyramid shown below (CSAPP Figure 6.21) depicts the memory hierarchy in computer systems:
The triangular shape reflects the size, speed, and cost of the storage devices: devices at the top are smaller, faster, and costlier than those towards the bottom of the pyramid. Each level of the pyramid serves as a cache for the level below it.
Temporal locality: Once a memory location has been referenced it is likely to be referenced again in the near future
Spatial locality: Once a memory location has been referenced, nearby memory locations are likely to be referenced in the near future
Let’s motivate our study of cache-conscious programming by working with 2-dimensional arrays in C.
Consider the following 2-dimensional array declaration in C:
int a[3][2];
The array is laid out in row-major order in memory as follows:
+---------+---------+---------+---------+---------+---------+
| a[0][0] | a[0][1] | a[1][0] | a[1][1] | a[2][0] | a[2][1] |
+---------+---------+---------+---------+---------+---------+
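To make the row-major layout concrete, here is a small program (just a sketch) that prints each element’s byte offset from the start of a; a[i][j] lives (i*2 + j) ints past the beginning of the array:

#include <stdio.h>

int main(void)
{
    int a[3][2];

    // Row-major order: a[i][j] sits (i*2 + j) ints past the start of the array.
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 2; j++)
            printf("&a[%d][%d] = %p (offset %zu bytes)\n",
                   i, j, (void *)&a[i][j],
                   (size_t)((char *)&a[i][j] - (char *)a));

    return 0;
}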
Consider the following nested loop that computes the sum of all elements in an M x N matrix a:
for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        sum += a[i][j];
    }
}
This algorithm exhibits good spatial locality. We are accessing the elements of a one by one as they are laid out in memory, i.e., a stride-1 reference pattern. Each cache miss will be handled by bringing in a cache-line-sized chunk of a, which will result in subsequent cache hits for the rest of the elements residing in that cache line.
There is also good temporal locality in the loop body. The local variables sum, a, i, and j are referenced on every iteration, and an optimizing compiler will probably cache these variables in registers.
On the other hand, the following modified version of the nested loop exhibits poor spatial locality:
for (int j = 0; j < N; j++) {
    for (int i = 0; i < M; i++) {
        sum += a[i][j];
    }
}
Here, the order of the loops has been switched. We are now accessing the matrix column-by-column instead of row-by-row, which is not the order in which the elements are laid out in memory. Unless the entire matrix can fit in the cache, this reference pattern will result in significantly more cache misses than the previous version.
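If you want to observe the difference yourself, here is a rough benchmark sketch; the matrix size and the use of clock() are arbitrary choices, and the exact timings depend on your machine, compiler, and optimization level:

#include <stdio.h>
#include <time.h>

#define M 4096
#define N 4096

static int a[M][N];     // large enough that the whole matrix cannot fit in cache

int main(void)
{
    long sum = 0;
    clock_t t;

    t = clock();
    for (int i = 0; i < M; i++)         // row-by-row: stride-1 accesses
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    printf("row-major:    %.2f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    t = clock();
    for (int j = 0; j < N; j++)         // column-by-column: stride-N accesses
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    printf("column-major: %.2f s\n", (double)(clock() - t) / CLOCKS_PER_SEC);

    return (int)(sum & 1);              // use sum so the loops aren't optimized away
}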
Let’s continue to study the effect of rearranging nested loops on cache performance by considering various matrix multiplication implementations from CSAPP Figure 6.44. The following three code snippets (Figures 6.44 a, d, and f) multiply two N x N matrices a and b, yielding matrix c. We focus our analysis on the innermost nested loop.
First consider code snippet 1:
for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        sum = 0;
        for (int k = 0; k < N; k++) {
            sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
    }
}
In the inner loop:

- We ignore references to sum because we assume that the compiler will cache sum in a register. This is why the code uses sum in the inner loop instead of repeatedly accessing c[i][j].
- a is accessed in stride-1
- b is accessed in stride-N

Next consider code snippet 2:
for (int k = 0; k < N; k++) {
    for (int j = 0; j < N; j++) {
        r = b[k][j];
        for (int i = 0; i < N; i++) {
            c[i][j] += a[i][k] * r;
        }
    }
}
In the inner loop:

- a and c are both accessed in stride-N

Finally consider code snippet 3:
for (int i = 0; i < N; i++) {
    for (int k = 0; k < N; k++) {
        r = a[i][k];
        for (int j = 0; j < N; j++) {
            c[i][j] += r * b[k][j];
        }
    }
}
In the inner loop:

- b and c are both accessed in stride-1

Benchmarking these three snippets with a large enough size N reveals that snippet 3 runs the fastest, followed by snippet 1, and snippet 2 performs the worst. The runtimes are largely determined by the snippets’ cache performance. Snippet 3 performs best by far due to its stride-1 reference pattern for both b and c. Snippet 1 performs worse than snippet 3 due to its stride-N access to matrix b, despite having fewer memory accesses. Snippet 2 performs worst by far due to its stride-N access to both matrices.
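To reproduce the comparison, a harness along the following lines can be used (a sketch: N, the int element type, and the clock()-based timing are arbitrary choices; CSAPP instead reports cycles per inner-loop iteration):

#include <stdio.h>
#include <time.h>

#define N 1024

static int a[N][N], b[N][N], c[N][N];

// snippet 1: stride-1 on a, stride-N on b in the inner loop
static void mult1(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            int sum = 0;
            for (int k = 0; k < N; k++)
                sum += a[i][k] * b[k][j];
            c[i][j] += sum;
        }
}

// snippet 2: stride-N on both a and c in the inner loop
static void mult2(void)
{
    for (int k = 0; k < N; k++)
        for (int j = 0; j < N; j++) {
            int r = b[k][j];
            for (int i = 0; i < N; i++)
                c[i][j] += a[i][k] * r;
        }
}

// snippet 3: stride-1 on both b and c in the inner loop
static void mult3(void)
{
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            int r = a[i][k];
            for (int j = 0; j < N; j++)
                c[i][j] += r * b[k][j];
        }
}

static void time_it(const char *name, void (*mult)(void))
{
    clock_t t = clock();
    mult();
    printf("%s: %.2f s\n", name, (double)(clock() - t) / CLOCKS_PER_SEC);
}

int main(void)
{
    time_it("snippet 1", mult1);
    time_it("snippet 2", mult2);
    time_it("snippet 3", mult3);
    return 0;
}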
mmap()
Instead of performing I/O with read()/write(), we can map a region of the file into the address space and treat it like an array of bytes. The kernel ensures that updates to this region of memory will get written to the corresponding region on disk. This mapping can be viewed as an example of using main memory as a cache for disk. This is demonstrated by APUE Figure 14.26:
Recall that the virtual address space is broken up into pages (typically 4KB) and that the OS kernel maintains a mapping from these virtual pages to physical pages in RAM:
File-backed mappings add an additional mapping from physical pages in RAM to regions of disk. Two processes can map the same region of a file into their address spaces and see each other’s updates to the file. The OS kernel maps the virtual pages of the two processes to the same physical pages, so updates made by one process are immediately reflected in the other process even before the contents get written to disk. In fact, if you just want two processes to share the same memory, you can remove the file from the equation and create an anonymous memory mapping.
The mmap()
system call creates new memory mappings:
#include <sys/mman.h>
void *mmap(void *addr, size_t length, int prot, int flag, int fd, off_t offset);
// Returns: starting address of mapped region if OK, MAP_FAILED on error
Here is a brief explanation of the parameters:
- addr: pass NULL and let mmap() decide the starting address and return it
- MAP_SHARED: Updates to the mapping are visible to other processes, and are carried through to the underlying file
- MAP_PRIVATE: Updates to the mapping are not visible to other processes, and are not carried through to the underlying file
- MAP_ANONYMOUS: The mapping is not backed by any file. This has the effect of simply allocating memory to the process when used with MAP_PRIVATE
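Putting the pieces together, here is a minimal sketch that modifies a file through a mapping instead of read()/write(); the file name data.bin is hypothetical and error handling is kept short:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDWR);  // hypothetical, non-empty file
    if (fd < 0) {
        perror("open");
        exit(1);
    }

    struct stat st;
    if (fstat(fd, &st) < 0) {
        perror("fstat");
        exit(1);
    }

    // Map the whole file; MAP_SHARED so updates are carried through to the file.
    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    p[0] = 'X';     // write through memory; the kernel writes it back to the file

    munmap(p, st.st_size);
    close(fd);
    return 0;
}

Because the mapping uses MAP_SHARED, the store to p[0] reaches the file and is visible to any other process that maps the same region. Replacing the file with MAP_ANONYMOUS | MAP_SHARED (and fd set to -1) gives the file-less shared memory mentioned above, which a parent and a child created by fork() can both access.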
Having discussed memory hierarchy, cache-conscious programming, and memory mappings, we can now complete our discussion of allocators.
Recall the optimization from last lecture: repurpose the unused payload in free blocks for two pointers to link together free blocks explicitly:
With the explicit free list, searching for a free block only requires iterating through the free blocks in the list, as opposed to all blocks, as was the case with the implicit free list.
The explicit free list needs to be updated whenever a block is freed. Where should the newly freed block be placed?
An obvious strategy is to maintain the free list in LIFO order: simply insert the newly freed block at the beginning of the list. The following diagram illustrates this:
This approach is simple and fast, and may benefit from temporal locality: reusing a recently freed block means that its memory location may still be in the cache, so we avoid bringing in a new cache line and possibly evicting other data from the cache.
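A minimal sketch of the LIFO insertion, assuming a hypothetical free_block struct whose two pointers occupy the otherwise unused payload of a free block:

#include <stddef.h>

struct free_block {
    struct free_block *prev;
    struct free_block *next;
};

static struct free_block *free_list_head;

// LIFO policy: push the newly freed block onto the front of the explicit free list.
static void free_list_push(struct free_block *blk)
{
    blk->prev = NULL;
    blk->next = free_list_head;
    if (free_list_head != NULL)
        free_list_head->prev = blk;
    free_list_head = blk;
}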
We can further optimize free list search time and memory usage by maintaining multiple free lists. Each free list contains blocks of the same or similar size.
For example, we can partition the block sizes by powers of 2:
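One possible way to map a request to a size class is sketched below; the number of classes and the power-of-2 thresholds mirror the example above, but the constants and names are made up (free_block is the hypothetical struct from the earlier sketch):

#include <stddef.h>

#define NUM_CLASSES 10

struct free_block;                                  // from the earlier sketch
static struct free_block *free_lists[NUM_CLASSES];  // one free list per size class

// Class 0 holds blocks up to 32 bytes, class 1 up to 64, and so on,
// doubling each time; the last class is a catch-all for anything larger.
static int size_class(size_t size)
{
    size_t limit = 32;
    for (int c = 0; c < NUM_CLASSES - 1; c++) {
        if (size <= limit)
            return c;
        limit *= 2;
    }
    return NUM_CLASSES - 1;
}

A malloc() built on this scheme would then only search free_lists[size_class(n)] rather than one long list of all free blocks.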
The allocator obtains additional memory from the kernel with the mmap() syscall, which creates new page mappings for the process.

References:
TCmalloc, from Google, and jemalloc, from FreeBSD and currently maintained by Facebook, are some alternatives to GNU libc’s implementation. These allocators are optimized for highly concurrent workloads. We’ll revisit some of the techniques used by these allocators when we cover threads later in the course.
A program may rely heavily on frequent allocations and deallocations of the same kind of object (e.g., a graph algorithm managing fixed-size node objects). Instead of using a general-purpose allocator, such programs can use a specialized memory pool allocator for that object.
The pool allocator will allocate a large region of memory (e.g., some multiple of the page size) and divide it into equal-sized objects. Allocation and deallocation are as simple as marking an object used and freed, respectively. The fixed-size chunks eliminate the external fragmentation seen in general-purpose allocators, which must support variable-sized chunks.
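Here is a minimal sketch of such a pool allocator, assuming fixed-size objects carved out of a single anonymous mapping; the object size, pool size, and function names are all illustrative:

#include <stddef.h>
#include <sys/mman.h>

#define OBJ_SIZE  64                   // hypothetical fixed object size
#define POOL_SIZE (1024 * 4096)        // some multiple of the page size

// Free objects are threaded into a singly linked list through their own
// storage, so the pool needs no separate metadata.
union pool_obj {
    union pool_obj *next;
    char payload[OBJ_SIZE];
};

static union pool_obj *pool_free_list;

int pool_init(void)
{
    void *p = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    // Carve the region into equal-sized objects and link them together.
    union pool_obj *objs = p;
    size_t n = POOL_SIZE / sizeof(union pool_obj);
    for (size_t i = 0; i + 1 < n; i++)
        objs[i].next = &objs[i + 1];
    objs[n - 1].next = NULL;
    pool_free_list = objs;
    return 0;
}

void *pool_alloc(void)
{
    union pool_obj *obj = pool_free_list;
    if (obj != NULL)
        pool_free_list = obj->next;    // mark it used by unlinking it
    return obj;                        // NULL if the pool is exhausted
}

void pool_free(void *p)
{
    union pool_obj *obj = p;
    obj->next = pool_free_list;        // mark it free by pushing it back
    pool_free_list = obj;
}

Both pool_alloc() and pool_free() are constant-time list operations, and since every object has the same size, there is no external fragmentation within the pool.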
Last Updated: 2024-01-29