A thread safe function can be called by multiple threads at the same time without stepping on each other’s toes. A function is thread safe either because it is reentrant or the function was made thread safe by using necessary synchronization primitives.
A function is reentrant if it can be safely called again before completing a previous execution. A function can be reentered from the same thread when interrupted by a signal and calling the function from the signal handler. A function can also be reentered by multiple threads that are running at the same time.
An example of a nonreentrant function is one that maintains internal static
data, such as strtok()
– see man strtok
for more details. The C stdlib
offers reentrant versions of many such functions so that they can be used in
multi-threaded environments. strtok_r()
is the reentrant version of strtok()
because it doesn’t use an internal static data to track state and requires the
user to pass in the state as a function argument instead.
All reentrant functions are, by definition, thread safe.
We previously discussed the C stdio functions and their use of internal static
buffers for reading and writing. As such, these functions are not reentrant.
However, these functions are still thread safe. This is achieved by assigning a
lock to each FILE
object and having each library function acquire the lock.
It is also possible for user programs to explicitly acquire the FILE
object
lock using flockfile()
to perform a series of I/O operations atomically. Note
that this means that the FILE
object lock is recursive, which allows the
same calling thread to acquire the lock again without deadlocking. See man
flockfile
for more details.
Recall that async-signal-safety refers to whether or not a function can be safely called from a signal handler. Reentrant functions are, by definition, thread safe and async-signal safe.
However, async-signal safety and thread safety are not the same. As previously discussed, thread safety can be ensured by adding locking to a function. If a signal handler interrupts a critical section and attempts to reacquire a non-recursive lock, the program deadlocks. Therefore, a thread safe function is not necessarily async-signal safe. Note that the C stdio functions use a recursive lock, so this concern of deadlocking doesn’t apply. These functions are still not async-signal safe due to their use of static buffers.
One way to make a function async-signal safe is to simply block all signals whose handlers may invoke that function. These functions may not necessarily be thread safe.
We learned previously how signals work in a single-threaded environment. man 7
signal
states the following about how signals work in a multi-threaded
environment:
A process-directed signal may be delivered to any one of the threads that does not currently have the signal blocked. If more than one of threads has the signal unblocked, then the kernel chooses an arbitrary thread to which deliver the signal.
To control signal delivery, we usually designate a single thread to handle the
signals of interest and block the signals in all other threads. We previously
learned that sigprocmask()
can be used to set up a single-threaded process’s
signal mask. This function’s behavior is undefined in multi-threaded
environments. Instead, we must use pthread_sigmask()
to configure the signal
masks of threads.
The following sample program demonstrates how to set up a signal-handling thread:
pthread_create()
)int sig = 0;
void handler(int signo) { sig = signo; }
#define handle_error_en(en, msg) \
do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)
static void *sig_thread(void *arg) {
int s;
sigset_t *set = arg;
/* Unblock SIGUSR1 and SIGQUIT for this thread */
s = pthread_sigmask(SIG_UNBLOCK, set, NULL);
if (s != 0)
handle_error_en(s, "pthread_sigmask");
/* Race condition: you can miss a signal here. Alternatively, use
* sigsuspend() to atomically adjust the signal mask and then suspend
* execution. */
for (;;) {
pause();
printf("Signal handling thread got signal %d\n", sig);
}
return NULL;
}
int main(int argc, char *argv[]) {
pthread_t thread;
sigset_t set;
int s;
/* Block SIGUSR1 and SIGQUIT */
sigemptyset(&set);
sigaddset(&set, SIGQUIT);
sigaddset(&set, SIGUSR1);
s = pthread_sigmask(SIG_BLOCK, &set, NULL);
if (s != 0)
handle_error_en(s, "pthread_sigmask");
/* Install handlers for SIGQUIT and SIGUSR1 */
struct sigaction act;
act.sa_handler = handler;
sigemptyset(&act.sa_mask);
act.sa_flags = 0;
sigaction(SIGUSR1, &act, NULL);
sigaction(SIGQUIT, &act, NULL);
s = pthread_create(&thread, NULL, &sig_thread, &set);
if (s != 0)
handle_error_en(s, "pthread_create");
/* Main thread carries on to create other threads and/or do
other work */
pause(); /* Dummy pause so we can test program */
}
Note that the signal handling thread simply pauses until a signal interrupts it – the thread is effectively waiting until a signal is delivered. In other words, we are handling signals synchronously by using a mechanism meant for asynchronous delivery. The POSIX API offers a cleaner alternative that enables synchronous signal handling without installing a handler:
int sigwait(const sigset_t *set, int *sig);
man sigwait
states the following:
The sigwait() function suspends execution of the calling thread until one of the signals specified in the signal set set becomes pending. The function accepts the signal (removes it from the pending list of signals), and returns the signal number in sig.
The following program shows how to set up a signal handling thread that uses
sigwait()
instead of installing signal handlers. Note that the signals of
interest remain blocked in both the main and signal handling threads, and the
default dispositions of the signals never get invoked.
#define handle_error_en(en, msg) \
do { errno = en; perror(msg); exit(EXIT_FAILURE); } while (0)
static void *sig_thread(void *arg) {
sigset_t *set = arg;
int s, sig;
for (;;) {
/* sigwait() consumes a pending signal */
s = sigwait(set, &sig);
if (s != 0)
handle_error_en(s, "sigwait");
printf("Signal handling thread got signal %d\n", sig);
}
return NULL;
}
int main(int argc, char *argv[]) {
pthread_t thread;
sigset_t set;
int s;
/* Block SIGUSR1 and SIGQUIT; other threads created by main()
will inherit a copy of the signal mask. */
sigemptyset(&set);
sigaddset(&set, SIGUSR1);
sigaddset(&set, SIGQUIT);
s = pthread_sigmask(SIG_BLOCK, &set, NULL);
if (s != 0)
handle_error_en(s, "pthread_sigmask");
s = pthread_create(&thread, NULL, &sig_thread, &set);
if (s != 0)
handle_error_en(s, "pthread_create");
/* Main thread carries on to create other threads and/or do
other work */
pause(); /* Dummy pause so we can test program */
}
Recall that threads share global and static variables. However, it sometimes desirable for threads to have their own independent global/static variables. This is known as thread local storage (TLS). The following program demonstrates the difference between a regular global variable and a thread-local global variable:
// Global variable shared by all threads
uint64_t x = 0;
// Thread local global variable
__thread uint64_t y = 0;
void *inc_and_print(void *unused) {
for (int i = 0; i < 1e7; i++) {
x++;
y++;
}
printf("x=%lu, y=%lu\n", x, y);
return NULL;
}
int main(int argc, char **argv) {
pthread_t threads[2];
pthread_create(&threads[0], NULL, &inc_and_print, NULL);
pthread_create(&threads[1], NULL, &inc_and_print, NULL);
pthread_join(threads[0], NULL);
pthread_join(threads[1], NULL);
return 0;
}
The two threads will race to increment the value of the regular global variable x
, resulting in an unpredictable value. On the other hand, the two threads get their own copies of the thread-local global variable y
so there is no race.
The GNU C compiler offers the __thread
modifier to declare a thread-local
global/static variable. Adding the modifier is all the programmer needs to
write; the compiler/linker will take care of the rest. Unfortunately, this feature is non-standard. The POSIX standard offers an alternative known as “thread-specific data”, but the API is more complicated to use.
TLS is used by libraries to support multi-threaded applications. One such
example of this is errno
, which is a thread-local global variable. This allows
multiple threads to use syscalls and errno
without interfering with each
other.
Consider the following program that simply creates a noop thread and joins on it:
void *noop(void *arg) {
return NULL;
}
int main(int argc, char **argv) {
pthread_t thread;
pthread_create(&thread, NULL, &noop, NULL);
pthread_join(thread, NULL);
return 0;
}
We can use the strace
program to inspect the system calls invoked by this program for thread creation. Shown below is a section of the output that is of interest to us:
$ strace ./pthread
...
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7efe0d9ff000
mprotect(0x7efe0da00000, 8388608, PROT_READ|PROT_WRITE) = 0
getrandom("\x3a\x4e\xe0\xd8\x0b\xb1\x20\x58", 8, GRND_NONBLOCK) = 8
brk(NULL) = 0x55820065f000
brk(0x558200680000) = 0x558200680000
rt_sigprocmask(SIG_BLOCK, ~[], [], 8) = 0
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7efe0e1ff910, parent_tid=0x7efe0e1ff910, exit_signal=0, stack=0x7efe0d9ff000, stack_size=0x7fff00, tls=0x7efe0e1ff640} => {parent_tid=[0]}, 88) = 3141615
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
...
The calls to mmap()
and mprotect()
set up the new thread’s stack:
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7efe0d9ff000
mprotect(0x7efe0da00000, 8388608, PROT_READ|PROT_WRITE) = 0
MAP_STACK
flag is
also specified to indicate that the mapping is for a stack. This doesn’t have
any effect on Linux, but other operating systems may make use of this flag.PROT_NONE
with a size of 8MB + 4KB. A subsequent
call to mprotect()
sets 8MB of the region to have read/write permissions but
leaves 4KB of the region as a “guard page”. Reads and writes to this guard page
will result in a segfault. The guard page is used to protect against stack
overflows.clone3()
is the main underlying system call for pthread_create()
. Guarding
its invocation are two calls to rt_sigprocmask()
to block signals:
rt_sigprocmask(SIG_BLOCK, ~[], [], 8) = 0
clone3(...) = ...
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
clone3()
returns.Let’s now look at the clone3()
invocation in detail. The clone()
family of
functions in Linux are used to create a new task (i.e., thread or child process):
clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, child_tid=0x7efe0e1ff910, parent_tid=0x7efe0e1ff910, exit_signal=0, stack=0x7efe0d9ff000, stack_size=0x7fff00, tls=0x7efe0e1ff640} => {parent_tid=[0]}, 88) = 3141615
The flags and arguments passed to clone3()
determine the semantics of the
newly task. Here are some flags and arguments of interest:
CLONE_VM
: the new task shares the same address spaceCLONE_FILES
: the new task shares the same file descriptor tableCLONE_SIGHAND
: the new task shares the same signal dispositionsCLONE_THREAD
: the new task is placed in the same thread group; it will have
the same pid but get a unique thread ID (tid)CLONE_SETTLS
: sets the new task’s TLS descriptor to the tls
argumentstack
: the address of the new task’s stack. This is the address that
mmap()
returned earlierLet’s now consider process creation. The following program forks a child process
that immediately exists. The parent process calls waitpid()
to wait for the
child process before exiting:
int main(int argc, char **argv) {
pid_t pid;
if ((pid = fork()) > 0) {
waitpid(pid, NULL, 0);
}
}
Let’s compare fork()
’s system call invocations to that of
pthread_create()
’s. Shown below is a section of the strace
output that is of
interest to us:
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f2c04fc1a10) = 3157115
wait4(3157115, NULL, 0, NULL) = 3157115
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=3157115, si_uid=1008, si_status=0, si_utime=0, si_stime=0} ---
child_stack=NULL
: the child process uses the same stack area as the parent,
but in its own address spacepthread_create()
’s invocation
aren’t specified here.SIGCHLD
added to the flags argument indicates that a SIGCHLD
should be
sent to the parent process when this child process terminates.strace
output shows that the parent process receives a SIGCHLD
when
the process terminates.Consider the following sample program that allocates two integers on the heap
using our malloctopus
allocator (implicit free list version) and increments
them serially:
void inc(uint64_t *val) {
for (int i = 0; i < 1e7; i++)
++*val;
printf("%lu\n", *val);
}
int main(int argc, char **argv)
{
mm_init();
void *unused = mm_malloc(32);
uint64_t *p = mm_malloc(8);
uint64_t *q = mm_malloc(8);
mm_checkheap(1);
inc(p);
inc(q);
mm_free(p);
mm_free(q);
mm_free(unused);
mm_deinit();
}
In our testing, it took about 57ms for this program to run. We can try to speed up the program by making it multi-threaded as follows:
void inc(uint64_t *val) {
for (int i = 0; i < 1e7; i++)
++*val;
printf("%lu\n", *val);
}
void* inc_thread(void *v) {
inc(v);
return NULL;
}
int main(int argc, char **argv)
{
mm_init();
void *unused = mm_malloc(32);
uint64_t *p = mm_malloc(8);
uint64_t *q = mm_malloc(8);
mm_checkheap(1);
pthread_t t1, t2;
pthread_create(&t1, NULL, &inc_thread, p);
pthread_create(&t2, NULL, &inc_thread, q);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
mm_free(p);
mm_free(q);
mm_free(unused);
mm_deinit();
}
With two threads incrementing the integers in parallel, you may expect the program to run about twice as fast. Yet, our testing shows that it was even slower – it took about 63ms to run!
Let’s inspect the program output to see what’s happening:
$ ./mm-test
Heap (0x7f1721600010):
0x7f1721600010: header: [16:a] footer: [16:a]
0x7f1721600020: header: [48:a] footer: [48:a]
0x7f1721600050: header: [32:a] footer: [32:a]
0x7f1721600070: header: [32:a] footer: [32:a]
0x7f1721600090: header: [3984:f] footer: [3984:f]
0x7f1721601020: EOL
10000000
10000000
The heap contains five blocks: the prologue, the unused block, the two blocks
pointed by p
and q
, and a 3984-byte free block. Note that the minimum block
size is 32 bytes – 16 bytes for header and footer and a minimum payload of 16
bytes.
Recall that the cache line size on x86-64 CPUs is 64 bytes (or 0x40 in hex). We
can see that the blocks pointed to by p
and q
reside on the same cache line.
Generally speaking, when two threads running on two CPUs keep updating the same memory location, they keep invalidating each other’s cache for that memory location, severely degrading the performance. While the two threads in this example are not updating the same memory location, they are accessing the same cache line. If any byte in a cache line is modified, the entire cache line is invalidated. So if the two separate memory locations are close enough to be in a single cache line, the program would behave as if the two threads are accessing a single memory location. This phenomenon is known as false sharing.
The unused block in the sample program was placed to force the next two blocks into the same cache line. If we remove the unused block, the two blocks would have fallen on separate cache lines. We can also ensure that allocated blocks do not share cache lines by sufficiently padding allocation requests such that the payloads are on separate cache lines.
We can change the allocations for p
and q
such that they are at least a
cache line apart as follows:
uint64_t *p = mm_malloc(64);
uint64_t *q = mm_malloc(64);
In our testing, making this change brings down the program’s running time to 33ms!
Memory allocators must be thread safe so that they can be used in multi-threaded
environments. One can imagine the allocator implementation synchronizing all
malloc()
and free()
operations using an internal lock. This naive approach
will not scale well in a heavily multi-threaded environment.
Modern memory allocators are optimized to reduce contention in multi-threaded environments. For example, the GNU C implementation does the following:
References:
Last updated: 2024-02-26