COMS 4995 Advanced Systems Programming

Caching, Memory Mapping, and Advanced Allocators

Memory Hierarchy

The pyramid shown below (CSAPP Figure 6.21) depicts the memory hierarchy in computer systems:

[Figure: CSAPP Figure 6.21, the memory hierarchy pyramid]

The triangular shape reflects the size, speed, and cost of the storage devices: devices toward the top are smaller, faster, and more expensive per byte than devices toward the bottom of the pyramid. Each level of the pyramid serves as a cache for the level below it.

Cache-conscious Programming

Principle of Locality

Programs tend to reuse data they have accessed recently (temporal locality) and to access data located near recently accessed data (spatial locality). Caches are effective because typical programs exhibit both kinds of locality.

2-dimensional Arrays in C

Let’s motivate our study of cache-conscious programming by working with 2-dimensional arrays in C.

Consider the following 2-dimensional array declaration in C:

int a[3][2];

The array is laid out in row-major order in memory as follows:

+---------+---------+---------+---------+---------+---------+
| a[0][0] | a[0][1] | a[1][0] | a[1][1] | a[2][0] | a[2][1] |
+---------+---------+---------+---------+---------+---------+
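
To see this layout concretely, the following minimal sketch (not from the lecture) prints each element's address and its byte offset from the start of the array; a[i][j] sits at offset (i * 2 + j) * sizeof(int):

#include <stdio.h>

int main(void)
{
    int a[3][2];

    /* Row-major order: rows are contiguous, so consecutive offsets are
     * 0, 4, 8, 12, 16, 20 for 4-byte ints. */
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 2; j++)
            printf("&a[%d][%d] = %p (offset %zu)\n", i, j,
                   (void *)&a[i][j],
                   (size_t)((char *)&a[i][j] - (char *)&a[0][0]));

    return 0;
}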

Example 1: Summing Matrix Elements

Consider the following nested loop that computes the sum of all elements in an M x N matrix a:

for (int i = 0; i < M; i++) {
    for (int j = 0; j < N; j++) {
        sum += a[i][j];
    }
}

This algorithm exhibits good spatial locality. We access the elements of a one by one in the order they are laid out in memory, i.e., a stride-1 reference pattern. Each cache miss brings in a cache-line-sized chunk of a, so subsequent accesses to the rest of the elements in that cache line are hits.

There is also good temporal locality in the loop body: the variables sum, a, i, and j are referenced on every iteration, and an optimizing compiler will likely keep them in registers.

On the other hand, the following modified version of the nested loop exhibits poor spatial locality:

for (int j = 0; j < N; j++) {
    for (int i = 0; i < M; i++) {
        sum += a[i][j];
    }
}

Here, the order of the loops has been switched. We are now accessing the matrix column by column instead of row by row, which is not the order in which the elements are laid out in memory. Unless the entire matrix fits in the cache, this reference pattern results in significantly more cache misses than the previous version.
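
The difference is easy to observe empirically. Below is a rough timing sketch (the matrix size and timer choice are arbitrary, not from the lecture) that sums the same matrix in both orders:

#include <stdio.h>
#include <time.h>

#define M 4096
#define N 4096

static int a[M][N];

static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    long sum = 0;
    double t;

    for (int i = 0; i < M; i++)        /* fill the matrix with something */
        for (int j = 0; j < N; j++)
            a[i][j] = i + j;

    t = now_sec();
    for (int i = 0; i < M; i++)        /* row-by-row: stride-1 accesses */
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    printf("row-major sum:    %.3f s\n", now_sec() - t);

    t = now_sec();
    for (int j = 0; j < N; j++)        /* column-by-column: stride-N accesses */
        for (int i = 0; i < M; i++)
            sum += a[i][j];
    printf("column-major sum: %.3f s\n", now_sec() - t);

    return (int)(sum & 1);             /* use sum so the loops aren't optimized away */
}

On a typical machine the column-major loop runs several times slower, even though both versions perform exactly the same additions.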

Example 2: Matrix Multiplication

Let’s continue to study the effect of rearranging nested loops on cache performance by considering various matrix multiplication implementations from CSAPP Figure 6.44. The following three code snippets (Figures 6.44 a, d, and f) multiply two N x N matrices a and b, yielding matrix c. We focus our analysis on the innermost loop.

First consider code snippet 1:

for (int i = 0; i < N; i++) {
    for (int j = 0; j < N; j++) {
        sum = 0;
        for (int k = 0; k < N; k++) {
            sum += a[i][k] * b[k][j];
        }
        c[i][j] += sum;
    }
}

Next consider code snippet 2:

for (int k = 0; k < N; k++) {
    for (int j = 0; j < N; j++) {
        r = b[k][j];
        for (int i = 0; i < N; i++) {
            c[i][j] += a[i][k] * r;
        }
    }
}

Finally consider code snippet 3:

for (int i = 0; i < N; i++) {
    for (int k = 0; k < N; k++) {
        r = a[i][k];
        for (int j = 0; j < N; j++) {
            c[i][j] += r * b[k][j];
        }
    }
}

Benchmarking these three snippets with a large enough size N reveals that snippet 3 runs the fastest, followed by snippet 1, with snippet 2 performing the worst. The runtimes are largely determined by the snippets' cache performance. Snippet 3 performs best by far thanks to its stride-1 reference pattern for both b and c. Snippet 1 performs worse than snippet 3 because of its stride-N accesses to b, even though it performs fewer memory accesses per iteration (it accumulates into a register rather than updating c[i][j] inside the inner loop). Snippet 2 performs worst by far because of its stride-N accesses to both a and c.
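
To make the comparison concrete, here is a rough per-iteration analysis of the three innermost loops, assuming (for illustration) 64-byte cache lines, 4-byte ints, and matrices far too large to fit in the cache, so 16 elements share a cache line and every stride-N access misses:

             loads  stores  a misses  b misses  c misses  total misses
snippet 1      2      0      0.0625     1.0       0.0        1.0625
snippet 2      2      1      1.0        0.0       1.0        2.0
snippet 3      2      1      0.0        0.0625    0.0625     0.125

The estimated miss counts track the observed ordering: snippet 3 incurs far fewer misses than snippet 1, which in turn incurs far fewer than snippet 2.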

Summary

Memory Mapping using mmap()

Instead of performing I/O with read()/write(), we can map a region of a file into the address space and treat it like an array of bytes. The kernel ensures that updates to this region of memory get written back to the corresponding region of the file on disk. This mapping can be viewed as an example of using main memory as a cache for disk, as shown in APUE Figure 14.26:

[Figure: APUE Figure 14.26, a memory-mapped file]

Recall that the virtual address space is broken up into pages (typically 4KB) and that the OS kernel maintains a mapping from these virtual pages to physical pages in RAM:

[Figure: process virtual address space, with virtual pages mapped to physical pages in RAM]

File-backed mappings add a further mapping from physical pages in RAM to regions of the file on disk. Two processes can map the same region of a file into their address spaces and see each other's updates to the file. The OS kernel maps the virtual pages of the two processes to the same physical pages, so updates made by one process are immediately visible to the other, even before the contents are written back to disk. In fact, if you just want two processes to share memory, you can remove the file from the equation and create an anonymous memory mapping.
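
As an illustration of the last point, the following minimal sketch (hypothetical; it uses the mmap() call introduced below) creates an anonymous shared mapping so that a child process's write is visible to its parent:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* A shared anonymous mapping: not backed by any file, but both parent
     * and child map the same physical pages. */
    char *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }

    pid_t pid = fork();
    if (pid == 0) {
        strcpy(shared, "hello from the child");    /* child writes */
        exit(0);
    }

    waitpid(pid, NULL, 0);
    printf("parent reads: %s\n", shared);          /* parent sees the update */

    munmap(shared, 4096);
    return 0;
}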

The mmap() system call creates new memory mappings:

#include <sys/mman.h>

void *mmap(void *addr, size_t length, int prot, int flag, int fd, off_t offset);
        // Returns: starting address of mapped region if OK, MAP_FAILED on error

Here is a brief explanation of the parameters:

addr: a hint for where in the address space to place the mapping; usually NULL, letting the kernel pick the address.
length: the number of bytes to map.
prot: the memory protection of the mapping, a bitwise OR of PROT_READ, PROT_WRITE, and PROT_EXEC, or PROT_NONE.
flag: MAP_SHARED makes updates visible to other processes mapping the same region and carries them through to the underlying file; MAP_PRIVATE creates a private copy-on-write mapping; MAP_ANONYMOUS creates a mapping not backed by any file.
fd: the descriptor of the file to map; ignored for anonymous mappings, where it is conventionally -1.
offset: the byte offset into the file at which the mapping starts; must be a multiple of the page size.
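
As a quick example of the "file as an array of bytes" idea, here is a minimal sketch (error handling kept short, behavior illustrative) that maps a file with MAP_SHARED and modifies it in place:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        exit(1);
    }

    int fd = open(argv[1], O_RDWR);
    if (fd < 0) {
        perror("open");
        exit(1);
    }

    struct stat st;
    if (fstat(fd, &st) < 0) {
        perror("fstat");
        exit(1);
    }
    if (st.st_size == 0)
        exit(0);    /* nothing to map */

    /* Map the whole file; MAP_SHARED carries our updates back to the file. */
    char *p = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    close(fd);    /* the mapping remains valid after the descriptor is closed */

    /* Treat the file like an array of bytes: uppercase its first character. */
    if (p[0] >= 'a' && p[0] <= 'z')
        p[0] -= 'a' - 'A';

    munmap(p, st.st_size);
    return 0;
}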

Advanced Allocator Strategies

Having discussed memory hierarchy, cache-conscious programming, and memory mappings, we can now complete our discussion of allocators.

Explicit Free List

Recall the optimization from last lecture: repurpose the unused payload in free blocks for two pointers to link together free blocks explicitly:

[Figure: an explicit free list, with next and prev pointers stored in the payloads of free blocks]

With the explicit free list, searching for a free block only requires iterating through the free blocks in the list, as opposed to all blocks, as was the case with the implicit free list.
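
One possible way to represent such a free block in C is sketched below; the field names and header layout are illustrative, not a prescribed format:

#include <stddef.h>

/* A free block reuses its (otherwise unused) payload to hold list pointers. */
struct free_block {
    size_t header;              /* block size plus allocated bit, as before   */
    struct free_block *next;    /* next free block in the explicit free list  */
    struct free_block *prev;    /* previous free block in the list            */
    /* ... remainder of the payload, followed by the boundary-tag footer ...  */
};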

The explicit free list needs to be updated whenever a block is freed. Where should the newly freed block be placed?

An obvious strategy is to maintain the free list in LIFO order: simply insert the newly freed block at the beginning of the list. The following diagram illustrates this:

[Figure: LIFO insertion of a newly freed block at the head of the explicit free list]

This approach is simple and fast, and it may benefit from temporal locality: reusing a recently freed block means its memory may still be in the cache, so we avoid bringing in a new cache line and possibly evicting other data.
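
A minimal sketch of LIFO insertion, reusing the illustrative free_block layout above and a global head pointer:

static struct free_block *free_list_head;    /* head of the explicit free list */

/* Insert a newly freed block at the front of the list (LIFO order). */
static void free_list_push(struct free_block *bp)
{
    bp->prev = NULL;
    bp->next = free_list_head;
    if (free_list_head != NULL)
        free_list_head->prev = bp;
    free_list_head = bp;
}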

Segregated Free Lists

We can further optimize free list search time and memory usage by maintaining multiple free lists. Each free list contains blocks of the same or similar size.

For example, we can partition the block sizes by powers of 2 into size classes such as {1}, {2}, {3-4}, {5-8}, ..., {1025-2048}, {2049-4096}, and {4097 and larger}.
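
A sketch of how the size-class lookup might work (the class count and indexing are illustrative; the lists reuse the free_block layout from above):

#include <stddef.h>

#define NUM_CLASSES 16

static struct free_block *seg_lists[NUM_CLASSES];   /* one free list per size class */

/* Map a block size to a size class index: class k roughly covers sizes in
 * [2^k, 2^(k+1)). malloc would search seg_lists[size_class(n)] first, falling
 * back to larger classes (and finally the heap) if that list is empty. */
static int size_class(size_t size)
{
    int k = 0;
    while (size > 1 && k < NUM_CLASSES - 1) {
        size >>= 1;
        k++;
    }
    return k;
}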

Real Life Allocators

GNU libc Implementation

References:

Industry Implementations

TCMalloc, from Google, and jemalloc, which originated in FreeBSD and is currently maintained by Facebook, are well-known alternatives to GNU libc's implementation. These allocators are optimized for highly concurrent workloads. We'll revisit some of the techniques they use when we cover threads later in the course.

Memory Pool Allocator

A program may rely heavily on frequent allocations and deallocations of the same kind of object (e.g., a graph algorithm managing fixed-size node objects). Instead of using a general-purpose allocator, such a program can use a specialized memory pool allocator for that object type.

The pool allocator allocates a large region of memory (e.g., some multiple of the page size) and divides it into equal-sized slots, one per object. Allocation and deallocation are then as simple as marking a slot used or free. Because every slot is the same size, the pool avoids the external fragmentation that general-purpose allocators incur by supporting variable-sized blocks.
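
A minimal sketch of such a pool (object size, pool size, and function names are illustrative; a real implementation would grow the pool by mapping more pages when it runs out):

#include <stddef.h>
#include <sys/mman.h>

#define OBJ_SIZE   64           /* size of the fixed-size object */
#define POOL_BYTES (16 * 4096)  /* a multiple of the page size */

/* Free slots are threaded into a singly linked list through their own storage. */
union slot {
    union slot *next;
    char payload[OBJ_SIZE];
};

static union slot *free_slots;

static int pool_init(void)
{
    union slot *pool = mmap(NULL, POOL_BYTES, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (pool == MAP_FAILED)
        return -1;

    size_t n = POOL_BYTES / sizeof(union slot);
    for (size_t i = 0; i < n - 1; i++)      /* chain every slot onto the free list */
        pool[i].next = &pool[i + 1];
    pool[n - 1].next = NULL;
    free_slots = pool;
    return 0;
}

static void *pool_alloc(void)               /* pop a slot: O(1), no searching */
{
    union slot *s = free_slots;
    if (s != NULL)
        free_slots = s->next;
    return s;
}

static void pool_free(void *p)              /* push the slot back: also O(1) */
{
    union slot *s = p;
    s->next = free_slots;
    free_slots = s;
}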


Last Updated: 2024-01-29