COMS 4995 Advanced Systems Programming

Threads 2

Thread Safety

A thread safe function can be called by multiple threads at the same time without stepping on each other’s toes. A function is thread safe either because it is reentrant or because it protects its shared state with the necessary synchronization primitives.

Thread safety via reentrancy

A function is reentrant if it can be safely called again before a previous execution has completed. A function can be reentered from the same thread when it is interrupted by a signal and the signal handler calls the function again. A function can also be reentered by multiple threads running at the same time.

An example of a nonreentrant function is one that maintains internal static data, such as strtok() – see man strtok for more details. The C stdlib offers reentrant versions of many such functions so that they can be used in multi-threaded environments. strtok_r() is the reentrant version of strtok(): it doesn’t use internal static data to track state, and instead requires the caller to pass in the state as a function argument.

All reentrant functions are, by definition, thread safe.

Thread safety via synchronization

We previously discussed the C stdio functions and their use of internal static buffers for reading and writing. As such, these functions are not reentrant. However, these functions are still thread safe. This is achieved by assigning a lock to each FILE object and having each library function acquire the lock.

It is also possible for user programs to explicitly acquire the FILE object lock using flockfile() to perform a series of I/O operations atomically. Note that the FILE object lock is recursive: the same thread can acquire it again (for example, by calling a stdio function while already holding the lock) without deadlocking. See man flockfile for more details.

Signal safety vs. Thread safety

Recall that async-signal-safety refers to whether or not a function can be safely called from a signal handler. Reentrant functions are, by definition, thread safe and async-signal safe.

However, async-signal safety and thread safety are not the same. As previously discussed, thread safety can be ensured by adding locking to a function. If a signal handler interrupts a critical section and attempts to reacquire a non-recursive lock, the program deadlocks. Therefore, a thread safe function is not necessarily async-signal safe. Note that the C stdio functions use a recursive lock, so this concern of deadlocking doesn’t apply. These functions are still not async-signal safe due to their use of static buffers.

One way to make a function async-signal safe is to simply block all signals whose handlers may invoke that function while calling it. Note that this does not necessarily make the function thread safe.

Thread Signal Handling

We learned previously how signals work in a single-threaded environment. man 7 signal states the following about how signals work in a multi-threaded environment:

A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked. If more than one of the threads has the signal unblocked, then the kernel chooses an arbitrary thread to which to deliver the signal.

To control signal delivery, we usually designate a single thread to handle the signals of interest and block the signals in all other threads. We previously learned that sigprocmask() can be used to set up a single-threaded process’s signal mask. This function’s behavior is undefined in multi-threaded environments. Instead, we must use pthread_sigmask() to configure the signal masks of threads.

The following sample program demonstrates how to set up a signal-handling thread:

/* volatile sig_atomic_t so the handler's write is well-defined */
volatile sig_atomic_t sig = 0;

void handler(int signo) { sig = signo; }

#define handle_error_en(en, msg) \
    do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)

static void *sig_thread(void *arg) {
    int s;
    sigset_t *set = arg;

    /* Unblock SIGUSR1 and SIGQUIT for this thread */
    s = pthread_sigmask(SIG_UNBLOCK, set, NULL);
    if (s != 0)
        handle_error_en(s, "pthread_sigmask");

    /* Race condition: you can miss a signal here. Alternatively, use
     * sigsuspend() for atomic signal unblock and pause() */

    for (;;) {
        pause();
        printf("Signal handling thread got signal %d\n", sig);
    }

    return NULL;
}

int main(int argc, char *argv[]) {
    pthread_t thread;
    sigset_t set;
    int s;

    /* Block SIGUSR1 and SIGQUIT */
    sigemptyset(&set);
    sigaddset(&set, SIGQUIT);
    sigaddset(&set, SIGUSR1);
    s = pthread_sigmask(SIG_BLOCK, &set, NULL);
    if (s != 0)
        handle_error_en(s, "pthread_sigmask");

    /* Install handlers for SIGQUIT and SIGUSR1 */
    struct sigaction act;
    act.sa_handler = handler;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    sigaction(SIGUSR1, &act, NULL);
    sigaction(SIGQUIT, &act, NULL);

    s = pthread_create(&thread, NULL, &sig_thread, &set);
    if (s != 0)
        handle_error_en(s, "pthread_create");

    /* Main thread carries on to create other threads and/or do
       other work */

    pause();            /* Dummy pause so we can test program */
}

Note that the signal handling thread simply pauses until a signal interrupts it – the thread is effectively waiting until a signal is delivered. In other words, we are handling signals synchronously by using a mechanism meant for asynchronous delivery. The POSIX API offers a cleaner alternative that enables synchronous signal handling without installing a handler:

int sigwait(const sigset_t *set, int *sig);

man sigwait states the following:

The sigwait() function suspends execution of the calling thread until one of the signals specified in the signal set set becomes pending. The function accepts the signal (removes it from the pending list of signals), and returns the signal number in sig.

The following program shows how to set up a signal handling thread that uses sigwait() instead of installing signal handlers. Note that the signals of interest remain blocked in both the main and signal handling threads, and the default dispositions of the signals never get invoked.

#define handle_error_en(en, msg) \
    do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)

static void *sig_thread(void *arg) {
    sigset_t *set = arg;
    int s, sig;

    for (;;) {
        /* sigwait() consumes a pending signal */
        s = sigwait(set, &sig);
        if (s != 0)
            handle_error_en(s, "sigwait");
        printf("Signal handling thread got signal %d\n", sig);
    }
    return NULL;
}

int main(int argc, char *argv[]) {
    pthread_t thread;
    sigset_t set;
    int s;

    /* Block SIGUSR1 and SIGQUIT; other threads created by main()
       will inherit a copy of the signal mask. */
    sigemptyset(&set);
    sigaddset(&set, SIGUSR1);
    sigaddset(&set, SIGQUIT);
    s = pthread_sigmask(SIG_BLOCK, &set, NULL);
    if (s != 0)
        handle_error_en(s, "pthread_sigmask");

    s = pthread_create(&thread, NULL, &sig_thread, &set);
    if (s != 0)
        handle_error_en(s, "pthread_create");


    /* Main thread carries on to create other threads and/or do
       other work */

    pause();            /* Dummy pause so we can test program */
}

Thread Local Storage

Recall that threads share global and static variables. However, it is sometimes desirable for threads to have their own independent copies of global/static variables. This is known as thread local storage (TLS). The following program demonstrates the difference between a regular global variable and a thread-local global variable:

// Global variable shared by all threads
uint64_t x = 0;

// Thread local global variable
__thread uint64_t y = 0;

void *inc_and_print(void *unused) {
    for (int i = 0; i < 1e7; i++) {
        x++;
        y++;
    }

    printf("x=%lu, y=%lu\n", x, y);
    return NULL;
}

int main(int argc, char **argv) {
    pthread_t threads[2];
    pthread_create(&threads[0], NULL, &inc_and_print, NULL);
    pthread_create(&threads[1], NULL, &inc_and_print, NULL);

    pthread_join(threads[0], NULL);
    pthread_join(threads[1], NULL);

    return 0;
}

The two threads will race to increment the value of the regular global variable x, resulting in an unpredictable value. On the other hand, the two threads get their own copies of the thread-local global variable y so there is no race.

The GNU C compiler offers the __thread modifier to declare a thread-local global/static variable. Adding the modifier is all the programmer needs to write; the compiler/linker will take care of the rest. Unfortunately, this feature is non-standard (C11 later standardized an equivalent _Thread_local keyword). The POSIX standard offers an alternative known as “thread-specific data”, but the API is more complicated to use.

TLS is used by libraries to support multi-threaded applications. One such example of this is errno, which is a thread-local global variable. This allows multiple threads to use syscalls and errno without interfering with each other.

Thread vs. Process Creation

Consider the following program that simply creates a noop thread and joins on it:

void *noop(void *arg) {
    return NULL;
}

int main(int argc, char **argv) {
    pthread_t thread;
    pthread_create(&thread, NULL, &noop, NULL);
    pthread_join(thread, NULL);
    return 0;
}

We can use the strace program to inspect the system calls invoked by this program for thread creation. Shown below is a section of the output that is of interest to us:

$ strace ./pthread
...
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7efe0d9ff000
mprotect(0x7efe0da00000, 8388608, PROT_READ|PROT_WRITE) = 0
getrandom("\x3a\x4e\xe0\xd8\x0b\xb1\x20\x58", 8, GRND_NONBLOCK) = 8
brk(NULL)                               = 0x55820065f000
brk(0x558200680000)                     = 0x558200680000
rt_sigprocmask(SIG_BLOCK, ~[], [], 8)   = 0
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7efe0e1ff910, parent_tid=0x7efe0e1ff910, exit_signal=0, stack=0x7efe0d9ff000, stack_size=0x7fff00, tls=0x7efe0e1ff640} => {parent_tid=[0]}, 88) = 3141615
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
...

The calls to mmap() and mprotect() set up the new thread’s stack:

mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7efe0d9ff000
mprotect(0x7efe0da00000, 8388608, PROT_READ|PROT_WRITE) = 0

clone3() is the main underlying system call for pthread_create(). Guarding its invocation are two calls to rt_sigprocmask() to block signals:

rt_sigprocmask(SIG_BLOCK, ~[], [], 8)   = 0
clone3(...) = ...
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0

Let’s now look at the clone3() invocation in detail. The clone() family of functions in Linux are used to create a new task (i.e., thread or child process):

clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7efe0e1ff910, parent_tid=0x7efe0e1ff910, exit_signal=0, stack=0x7efe0d9ff000, stack_size=0x7fff00, tls=0x7efe0e1ff640} => {parent_tid=[0]}, 88) = 3141615

The flags and arguments passed to clone3() determine the semantics of the newly created task. Here are some flags and arguments of interest:

- CLONE_VM: the child shares the parent’s address space instead of getting its own copy
- CLONE_FS, CLONE_FILES, CLONE_SIGHAND: the child shares the parent’s filesystem attributes (e.g., current working directory), file descriptor table, and signal handler dispositions
- CLONE_THREAD: the child is placed in the same thread group as the parent (i.e., it shares the parent’s PID)
- CLONE_SETTLS: set the child’s thread-local storage descriptor to the tls argument
- exit_signal=0: no signal is sent to the parent when the child terminates
- stack and stack_size: the child’s stack, which was set up by the preceding mmap() and mprotect() calls

Let’s now consider process creation. The following program forks a child process that immediately exits. The parent process calls waitpid() to wait for the child process before exiting:

int main(int argc, char **argv) {
    pid_t pid;
    if ((pid = fork()) > 0) {
        waitpid(pid, NULL, 0);
    }
}

Let’s compare fork()’s system call invocations to that of pthread_create()’s. Shown below is a section of the strace output that is of interest to us:

clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2c04fc1a10) = 3157115
wait4(3157115, NULL, 0, NULL)           = 3157115
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3157115, si_uid=1008, si_status=0, si_utime=0, si_stime=0} ---

Note the differences from thread creation: none of the sharing flags (e.g., CLONE_VM) are passed, so the child gets its own copy-on-write copy of the parent’s address space; child_stack=NULL means the child continues on a copy of the parent’s stack; and the exit signal is SIGCHLD, which is delivered to the parent when the child terminates, as seen in the last line of the strace output.

False Sharing

Consider the following sample program that allocates two integers on the heap using our malloctopus allocator (implicit free list version) and increments them serially:

void inc(uint64_t *val) {
    for (int i = 0; i < 1e7; i++)
        ++*val;
    printf("%lu\n", *val);
}

int main(int argc, char **argv)
{
    mm_init();

    void *unused = mm_malloc(32);
    uint64_t *p = mm_malloc(8);
    uint64_t *q = mm_malloc(8);

    mm_checkheap(1);

    inc(p);
    inc(q);

    mm_free(p);
    mm_free(q);
    mm_free(unused);
    mm_deinit();
}

In our testing, it took about 57ms for this program to run. We can try to speed up the program by making it multi-threaded as follows:

void inc(uint64_t *val) {
    for (int i = 0; i < 1e7; i++)
        ++*val;
    printf("%lu\n", *val);
}

void* inc_thread(void *v) {
    inc(v);
    return NULL;
}

int main(int argc, char **argv)
{
    mm_init();

    void *unused = mm_malloc(32);
    uint64_t *p = mm_malloc(8);
    uint64_t *q = mm_malloc(8);

    mm_checkheap(1);

    pthread_t t1, t2;
    pthread_create(&t1, NULL, &inc_thread, p);
    pthread_create(&t2, NULL, &inc_thread, q);

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    mm_free(p);
    mm_free(q);
    mm_free(unused);
    mm_deinit();
}

With two threads incrementing the integers in parallel, you may expect the program to run about twice as fast. Yet, our testing shows that it was even slower – it took about 63ms to run!

Let’s inspect the program output to see what’s happening:

$ ./mm-test
Heap (0x7f1721600010):
0x7f1721600010: header: [16:a] footer: [16:a]
0x7f1721600020: header: [48:a] footer: [48:a]
0x7f1721600050: header: [32:a] footer: [32:a]
0x7f1721600070: header: [32:a] footer: [32:a]
0x7f1721600090: header: [3984:f] footer: [3984:f]
0x7f1721601020: EOL
10000000
10000000

The heap contains five blocks: the prologue, the unused block, the two blocks pointed to by p and q, and a 3984-byte free block. Note that the minimum block size is 32 bytes – 16 bytes for the header and footer and a minimum payload of 16 bytes.

Recall that the cache line size on x86-64 CPUs is 64 bytes (or 0x40 in hex). We can see that the blocks pointed to by p and q reside on the same cache line.

Generally speaking, when two threads running on two CPUs keep updating the same memory location, they keep invalidating each other’s cached copy of that memory location, severely degrading performance. While the two threads in this example are not updating the same memory location, they are accessing the same cache line. When one CPU modifies any byte in a cache line, the other CPUs’ copies of the entire line are invalidated. So if two separate memory locations are close enough to sit in a single cache line, the program behaves as if the two threads were accessing a single memory location. This phenomenon is known as false sharing.

The unused block in the sample program was placed there deliberately to force the next two blocks into the same cache line; if we remove it, the two blocks fall on separate cache lines. We can also ensure that allocated blocks do not share cache lines by padding allocation requests so that the payloads land on separate cache lines.

We can change the allocations for p and q such that they are at least a cache line apart as follows:

uint64_t *p = mm_malloc(64);
uint64_t *q = mm_malloc(64);

In our testing, making this change brings down the program’s running time to 33ms!

Thread considerations in memory allocators

Memory allocators must be thread safe so that they can be used in multi-threaded environments. One can imagine the allocator implementation synchronizing all malloc() and free() operations using an internal lock. This naive approach will not scale well in a heavily multi-threaded environment.

Modern memory allocators are optimized to reduce contention in multi-threaded environments. For example, the GNU C library’s malloc does the following:

- It maintains multiple arenas (independent heaps), each protected by its own lock, and spreads threads across arenas so that they rarely contend for the same lock
- It keeps a per-thread cache (tcache) of recently freed small blocks, which lets many malloc() and free() calls complete without taking any lock at all


Last updated: 2024-02-26