zookeeper

HW6: zookeeper [ 🐙🦎🐄🐞🦭 ]

Submission

We will be using GitHub for distributing and collecting your assignments. At this point you should already have a repository created on your behalf in the cs4157-hw GitHub org. Follow the instructions on the class listserv to get access to that repository.

To obtain the skeleton files that we have provided for you, you need to clone your private repository. Your repository page should have a button titled “< > Code”. Click on it, and under the “Clone” section, select “SSH” and copy the link from there. For example:

$ git clone git@github.com:cs4157-hw/hw6-<id>-<your-team-name>.git

The TAs will use scripts to download, extract, build, and display your code. It is essential that you DO NOT change the names of the skeleton files provided to you. If you deviate from the requirements, the grading scripts will fail and you will not receive credit for your work.

You need to have at least 5 git commits total, but we encourage you to have many more. Your final submission should be pushed to the main branch.

As always, your submission should not contain any binary files. Your program must compile with no warnings, and should not produce any memory leaks or errors when run under valgrind. This requirement applies to all parts of the assignment.

At a minimum, README.txt should contain a description of your solution for each part of the assignment.

The description should indicate whether your solution for the part is working or not. You may also want to include anything else you would like to communicate to the grader, such as extra functionality you implemented or how you tried to fix your non-working code.

Answers to written questions, if applicable, must be added to the skeleton file we have provided.

Overview

In this assignment, you will create a container manager called “zookeeper” from scratch, using only the facilities provided by the Linux kernel. Zookeeper will run a given program in an isolated, virtualized container environment, similar to the containers created by existing tools like Docker or Podman. Along the way, you will learn how to use Linux namespaces, capabilities, seccomp, control groups (cgroups), and user-space networking to build such an environment.

Part 1: Create Main Container Process

1.1 Command Line Processing

Zookeeper needs to be given a program to run in the container, along with command line options to configure the container. The provided skeleton code implements the following command line options using the getopt(3) library function:

$ ./zookeeper -h

Usage: zookeeper [options] <command>

Options:
    -h         Print this help message and exit
    -r <dir>   Set <dir> as / in the container
    -m <mem>   Limit the amount of memory available to the container
    -c <cpu>   Limit the amount of CPU percentage
    -p <procs> Limit the number of processes the container can create
    -P         Mount /proc inside the container
    -n         Enable networking in the container

Review the skeleton code and the usage of getopt().

1.2 Clone Child Process

Use the clone() syscall to start <program> as a child process of zookeeper. Things to keep in mind:

As is, the child process will start running after clone(). Later on in this assignment, the parent process will need to set up a few things before the child process can start running. Let’s synchronize the parent and child processes using an unnamed pipe.

The parent process can wrap the arguments for the child process in a simple struct like this:

struct child_args {
    int channel;  // Read-end of the pipe
    char **args;
};

The child process should immediately block on reading from the pipe. The parent process will write a character to the pipe once it’s finished setting up, unblocking the child process.
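For reference, here is a minimal sketch of this synchronization. The entry-point name child_func and the STACK_SIZE macro are illustrative, not part of the skeleton; the struct is repeated only to keep the sketch self-contained.

#define _GNU_SOURCE
#include <err.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)   /* illustrative child stack size */

struct child_args {
    int channel;  // Read-end of the pipe
    char **args;
};

static int child_func(void *arg)
{
    struct child_args *ca = arg;
    char c;

    /* Block until the parent writes one byte after finishing its setup. */
    if (read(ca->channel, &c, 1) != 1)
        err(1, "read from sync pipe");

    execvp(ca->args[0], ca->args);
    err(1, "execvp");
}

int main(int argc, char **argv)
{
    int pipefd[2];

    if (argc < 2)
        errx(1, "usage: %s <command> [args...]", argv[0]);
    if (pipe(pipefd) < 0)
        err(1, "pipe");

    char *stack = malloc(STACK_SIZE);
    if (!stack)
        err(1, "malloc");

    struct child_args ca = { .channel = pipefd[0], .args = &argv[1] };

    /* clone() takes a pointer to the TOP of the child's stack; the
     * namespace flags (CLONE_NEW*) will be added here in Part 2. */
    pid_t pid = clone(child_func, stack + STACK_SIZE, SIGCHLD, &ca);
    if (pid < 0)
        err(1, "clone");

    /* ... parent-side container setup goes here ... */

    if (write(pipefd[1], "x", 1) != 1)  /* unblock the child */
        err(1, "write to sync pipe");

    /* Waiting for the child is covered in Section 1.3. */
    return 0;
}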

Note: If you test your program now, the zookeeper process will terminate before the child process. We will fix that in the following section.

1.3 Wait for the Child

Now that we know how to create a child process using clone(), we need to make the zookeeper parent process wait for the child to terminate. There are two reasons to do this:

  1. The parent should return the child’s return code.
  2. Later on in this assignment, we will need to perform some cleanup after the child has exited.

Modify the parent process to use waitpid(2) to block until the child process has terminated. If the child terminated normally, obtain its return code and use it as the return code for the parent zookeeper process. If the child was terminated by a signal, the parent process should send itself the same signal so that it terminates in the same way. See man 2 waitpid for more details.
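A sketch of this wait logic, placed at the end of main(); child_pid is the pid returned by clone(), and the macros come from <sys/wait.h>:

int status;

if (waitpid(child_pid, &status, 0) < 0)
    err(1, "waitpid");

if (WIFEXITED(status))
    return WEXITSTATUS(status);         /* propagate the child's return code */

if (WIFSIGNALED(status)) {
    int sig = WTERMSIG(status);
    signal(sig, SIG_DFL);               /* make sure the default action runs */
    raise(sig);                         /* terminate the same way the child did */
}
return 1;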

Part 2: Isolate the Container Process

Start by copying your part1/ into part2/.

2.1 Virtualize Global OS Resources

In this step, you will ask clone() to create dedicated (virtualized) copies of various system resources for the container process. The process will thus have its own user/group, cgroup, IPC, network, mount, PID, and UTS namespaces, isolating it from the rest of the OS.

Pass additional flags to clone() to virtualize the following resources of the container process:

  1. Users (uid) and groups (gid)
  2. Control group (cgroup) view
  3. Interprocess communication (IPC) resources
    • Note that this refers to System V IPC mechanisms, not the POSIX IPC mechanisms we learned in class – see APUE chapter 15 for more details. We won’t use it, but we’ll still virtualize this resource for completeness.
  4. Network resources
  5. Mount points
  6. Process IDs
  7. UTS (hostname, NIS domain) resources

Note well: Creating new instances of various global OS resources requires privileged access (CAP_SYS_ADMIN). Existing container managers (Docker, Podman) often run under root to get around this. Unfortunately, you don’t have that luxury in this homework since you do not have root access on SPOC. For this reason, we always have to create a new user/group namespace using the CLONE_NEWUSER flag. When this flag is combined with any other CLONE_NEW* flags, clone() first creates a new user/group namespace in which the container process will be given all privileges, including CAP_SYS_ADMIN. clone() then creates all other namespaces, setting the new user namespace as the owner. That is how a rootless container manager such as zookeeper can get around the CAP_SYS_ADMIN requirement.
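As a sketch, the clone() call from Part 1 grows to include one CLONE_NEW* flag per namespace; the surrounding variable names are the illustrative ones from the Part 1 sketch:

int flags = SIGCHLD
          | CLONE_NEWUSER    /* users and groups */
          | CLONE_NEWCGROUP  /* cgroup view */
          | CLONE_NEWIPC     /* System V IPC resources */
          | CLONE_NEWNET     /* network resources */
          | CLONE_NEWNS      /* mount points */
          | CLONE_NEWPID     /* process IDs */
          | CLONE_NEWUTS;    /* hostname, NIS domain */

pid_t pid = clone(child_func, stack + STACK_SIZE, flags, &ca);
if (pid < 0)
    err(1, "clone");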

If you run /bin/bash via zookeeper now, you should see that it runs under the user “nobody” in the container. We’ll fix this in the next section.

2.2 Install subuid and subgid Maps

Currently, our container process runs under the user “nobody”, which is not what we want. We would like the process to run under “root”. To accomplish that, we need to install subuid and subgid maps to map user/group IDs inside of the container to real user/group IDs in the global namespace.

The first mapping we’ll make is from the root user/group in the container to your primary user/group ID in the global namespace. Use geteuid() and getegid() to retrieve your user/group IDs.

Processes within the container may need to run as different users/groups than root. Check out /etc/passwd on SPOC – some system services run as their own dedicated account. We’ll have to allow for more than one user/group inside our container. Luckily, Linux allocates a per-user range of UIDs/GIDs for this purpose; see man subuid and man subgid for more details. We’ll use these ranges to map additional users/groups in our container to the global namespace.

Retrieve your user subuid and subgid ranges from /etc/subuid and /etc/subgid. These ranges are keyed by username, so you’ll have to use getpwuid() to translate your UID to username. Then use the newuidmap and newgidmap utilities to install the two mappings described above. You’ll have to fork() and execvp() these utilities from the zookeeper process after clone() but before the container starts running. Note that we invoke these utilities instead of writing to /proc/<container-pid>/uid_map and /proc/<container-pid>/gid_map directly because writing the subuid and subgid ranges to these files requires root privileges. We get around that by invoking the newuidmap and newgidmap utilities, which are setuid-root. See the utilities’ man pages for more details and the format in which to specify the mappings.
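The sketch below shows one way to invoke newuidmap; newgidmap is invoked analogously with getegid() and the /etc/subgid range. Here child_pid is the pid returned by clone(), and subuid_start/subuid_count are unsigned long values you parsed from /etc/subuid (all illustrative names).

char pid_str[32], euid_str[32], start_str[32], count_str[32];

snprintf(pid_str, sizeof(pid_str), "%d", child_pid);
snprintf(euid_str, sizeof(euid_str), "%u", (unsigned)geteuid());
snprintf(start_str, sizeof(start_str), "%lu", subuid_start);
snprintf(count_str, sizeof(count_str), "%lu", subuid_count);

/* newuidmap <pid> <uid-inside> <uid-outside> <count> ... */
char *uidmap_argv[] = {
    "newuidmap", pid_str,
    "0", euid_str, "1",            /* root in the container -> your uid  */
    "1", start_str, count_str,     /* uid 1 and up -> your subuid range  */
    NULL
};

pid_t helper = fork();
if (helper < 0)
    err(1, "fork");
if (helper == 0) {
    execvp(uidmap_argv[0], uidmap_argv);
    err(1, "execvp newuidmap");
}
if (waitpid(helper, NULL, 0) < 0)
    err(1, "waitpid newuidmap");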

Restart zookeeper with /bin/bash. You should now see the shell running under root.

2.3 Update the Hostname

Before you execvp() into <program>, change the hostname to “zoo” using sethostname().
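A minimal sketch of this step, executed in the container process:

if (sethostname("zoo", strlen("zoo")) < 0)
    err(1, "sethostname");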

Restart zookeeper with /bin/bash; you should see “root@zoo” on the command prompt.

Part 3: Limit the Container’s Kernel Access

Start by copying your files from part2/ into part3/.

You now have a container process that runs under (pseudo) root, i.e., with capabilities such as CAP_SYS_ADMIN. The process also has access to the entire Linux kernel syscall API – this is generally fine for syscalls that support namespaces, but not all syscalls do. In this step, you will drop the container process’s capabilities and install seccomp rules to restrict the process’s access to the Linux kernel.

3.1 Drop Dangerous Capabilities

part3/zoo-cap-seccomp.h provides an array called capabilities_to_drop enumerating various capabilities the container process should not have. The list includes capabilities to access the kernel audit framework, suspend, wake up, or reboot the system, load kernel modules, perform raw I/O, configure resource limits, etc. See man 7 capabilities for more details. In this part, we will drop such capabilities.

You might wonder why this step is necessary, since we only run zookeeper under an ordinary (non-root) user. Zookeeper creates a separate user namespace, which restricts the container’s admin privileges to that namespace alone; in this assignment, neither zookeeper nor the container process has elevated privileges outside the user namespace. In general, however, you cannot assume that your container manager will never be run under root. Even though we won’t be able to test our changes, we’ll still drop these capabilities for completeness.

The provided drop_capabilities() function in zoo-cap-seccomp.h uses prctl() and the libcap library to drop the capabilities from the capabilities_to_drop array. The Linux capability framework is very complex and hard to understand. To save time, we’ve provided the code for you to simply call before you execvp() into <program>. Review the code and the associated man pages to get a grasp of what the code is trying to do.

Note: Since the list includes CAP_SYS_ADMIN, your container process cannot perform many administrative operations after this step in its user namespace. Make sure to call anything that requires the admin capability before this step.

3.2 Deny Syscall Access with Seccomp

At this point, our container process still has access to the full Linux syscall API. Dropping capabilities in the previous step prevents it from invoking some syscalls, but many others remain available. Unfortunately, not all of those syscalls support namespaces. In this step, we will install seccomp rules to deny access to such syscalls.

part3/zoo-cap-seccomp.h provides a data structure called syscalls_to_prohibit. It is an array of structures where each structure contains a human-friendly name, syscall number, the syscall’s arguments to match, and the number of such matches. The func, nargs, and arg fields are meant to be passed as arguments to seccomp_rule_add().

The provided configure_seccomp() function from zoo-cap-seccomp.h iterates over the array and installs a seccomp rule for each record using libseccomp library functions. We decided to also give away this function to save you some time. Study the code and read the associated man pages. Simply call this function before you execvp() into <program>.

Task

Pick one of the rules that we installed and come up with an experiment that triggers it. Compare the result before and after you installed the seccomp rules. Describe your experiment and results in the README.txt at the top-level of your repo.

Part 4: Virtualize the Filesystem

Start by copying your part3/ into part4/.

Up to this point, your container uses the filesystem provided by the host, i.e., it has access to all files and directories on the host. In this step, we create a separate, dedicated filesystem for the container and teach zookeeper how to use it.

4.1 Prerequisite: Create Container Filesystem Directory

We have uploaded a tarball with a minimal Linux filesystem for you at /opt/asp/zoo-fs.tar. Extract the tarball into your home directory as follows:

$ mkdir ~/zoo-fs
$ cd ~/zoo-fs
$ # Temporarily clear umask so that files/directories retain their intended
$ # permissions.
$ (umask 0 && tar xvf /opt/asp/zoo-fs.tar)

4.2 Prepare Container’s Mount Namespace

Use mount() to recursively configure the container’s entire (starting at /) mount namespace as private. See man 2 mount for more details and for the appropriate flags. This step prevents mount and unmount event propagation between mount namespaces, as explained in this LWN article.
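A sketch of that call, run from within the container’s new mount namespace (i.e., by the container process, which holds CAP_SYS_ADMIN there); the flags come from <sys/mount.h>:

/* Recursively mark every mount as private so mount/unmount events do not
 * propagate between the container and the host. */
if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
    err(1, "mount MS_PRIVATE");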

4.3 Bind-Mount the Filesystem

Right now, you have the container filesystem in a regular directory. For the container to mount the directory as /, we must first turn the directory into a mount point. Create a temporary directory with mkdtemp(), using /tmp/zoo.XXXXXX as the template path. Bind-mount the filesystem into the temporary directory using mount(). See man 2 mount for more details. Make sure zookeeper removes this temporary directory at cleanup time, before it terminates.
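A sketch of these two steps; rootfs_dir is the directory passed via -r, the variable names are illustrative, and as with the other mount() calls, the bind mount must run inside the container’s new user/mount namespaces where the process holds CAP_SYS_ADMIN:

char tmpdir_template[] = "/tmp/zoo.XXXXXX";
char *tmpdir = mkdtemp(tmpdir_template);
if (!tmpdir)
    err(1, "mkdtemp");

/* Turn the container filesystem directory into a mount point. */
if (mount(rootfs_dir, tmpdir, NULL, MS_BIND | MS_REC, NULL) < 0)
    err(1, "bind mount");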

4.4 Switch Container’s Filesystem

First, use chdir() to change the container’s current working directory to the temporary directory where the filesystem is bind-mounted.

Next, mount the proc pseudo-filesystem in your mount namespace at /path/to/tempdir/proc/ using the following mount() call:

mount("proc", mount_path, "proc", MS_NOSUID | MS_NODEV | MS_NOEXEC, NULL);

Next, use pivot_root() to make the filesystem bind-mounted at the temporary directory the root of the container’s mount namespace. Since glibc does not provide a pivot_root() wrapper, you’ll have to invoke it using the syscall() library function. See man 2 pivot_root and man 2 syscall for more details.

Note: pivot_root() requires a second temporary directory to move the original (host) root mount. Consider using the pivot_root(".", ".") trick mentioned at the end of man pivot_root.

Finally, unmount the original root mount using the umount2() syscall so the container can no longer access the host’s filesystem. You’ll need to use the MNT_DETACH flag here since the original mount is still in use by the host system.
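Putting 4.4 together, a sketch of the whole sequence run by the container process; tmpdir is the temporary directory from 4.3, and SYS_pivot_root comes from <sys/syscall.h>:

if (chdir(tmpdir) < 0)
    err(1, "chdir tmpdir");

/* Mount proc at <tmpdir>/proc (the mount_path from the call shown above). */
if (mount("proc", "proc", "proc", MS_NOSUID | MS_NODEV | MS_NOEXEC, NULL) < 0)
    err(1, "mount proc");

/* pivot_root(".", "."): new_root and put_old are the same directory, so the
 * old root ends up stacked underneath the new one. */
if (syscall(SYS_pivot_root, ".", ".") < 0)
    err(1, "pivot_root");

/* Detach the old root; the host filesystem is no longer reachable. */
if (umount2(".", MNT_DETACH) < 0)
    err(1, "umount2");

/* Make sure the working directory is the new root. */
if (chdir("/") < 0)
    err(1, "chdir /");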

Part 5: Restrict the Container

Start by copying your part4/ into part5/.

The fully functional container implemented in the previous steps has one important shortcoming: it has full access to all of the host’s hardware resources. In this step, we will implement command line options that allow zookeeper to limit the resources the container can use through Linux’s cgroup mechanism. See man 7 cgroups for an overview.

5.1 Create New Container cgroup

Until now, all container processes share the same cgroup as the zookeeper process and are thus subject to the same resource limits as zookeeper. In this step, we will create a new cgroup for the container and move the main container process into it. All child processes created by the container will be placed in the same cgroup, so they will all be subject to the same resource limits.

Note: On systemd-based Linux systems (such as Ubuntu), the cgroup hierarchy is managed by systemd. Unprivileged users have only limited access to it. Furthermore, that access needs to be coordinated by systemd. The skeleton code provides a helper function designed to help you get around these limitations. The function systemd_move_to_scope(), which is called from zookeeper’s main() function, asks systemd to carve out a cgroup subtree for zookeeper to manage. For the scope name, use "zookeeper-<pid>.scope", where <pid> is zookeeper’s pid.

The following function, get_container_cgroup_path(), returns the pathname (directory) of the cgroup for your container. You must stick to this pathname. This is the only place where an unprivileged zookeeper can create the cgroup. Note that CONTAINER_SCOPE is a macro that the skeleton code already defines.

/*
 * Return the full pathname to the container process' control group under
 * /sys/fs/cgroup. On systemd-powered systems, the returned pathname must match
 * the cgroup hierarchy set up for us by systemd since a non-root user does not
 * have write access to any other cgroups.
 */
static const char *get_container_cgroup_path(uid_t uid, pid_t pid)
{
    static char cgroup[PATH_MAX];

    int rc = snprintf(cgroup, sizeof(cgroup), CONTAINER_SCOPE, uid, uid, pid);
    if (rc < 0 || rc >= (int)sizeof(cgroup)) {
        warn("snprintf(CONTAINER_SCOPE)");
        return NULL;
    }

    return cgroup;
}

First, call get_container_cgroup_path() with your account’s uid and the pid of the main container process (the pid returned by clone()). Create the directory using mkdir(). Give the directory user read, write, and execute permissions. Next, move your main container process to the new cgroup by writing its pid (as a string) into the file cgroup.procs in the cgroup directory.
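A sketch of these steps, performed by zookeeper after clone(); child_pid is the pid returned by clone(), and the other names are illustrative:

const char *cgroup = get_container_cgroup_path(getuid(), child_pid);
if (!cgroup)
    exit(1);

if (mkdir(cgroup, 0700) < 0)            /* user read/write/execute */
    err(1, "mkdir %s", cgroup);

char procs_path[PATH_MAX];
snprintf(procs_path, sizeof(procs_path), "%s/cgroup.procs", cgroup);

FILE *f = fopen(procs_path, "w");
if (!f)
    err(1, "fopen %s", procs_path);
fprintf(f, "%d\n", child_pid);
if (fclose(f) == EOF)
    err(1, "write %s", procs_path);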

You can use systemd-cgls to see if these steps worked. You should see the zookeeper process and the container process (e.g., /bin/bash) in separate cgroups under “user.slice”. The container cgroup is called zoo-<pid>, where <pid> is the pid of the main container process; in the command and output below, <uid> is your user ID.

$ systemd-cgls --unit "user@<uid>.service"
Unit user@<uid>.service (/user.slice/user-<uid>.slice/user@<uid>.service):
├─user.slice (#4053613)
│ ├─zookeeper-1597640.scope (#4064832)
│ │ └─1597640 ./zookeeper -r /home/janakj/root/ /bin/bash
│ ├─zoo-1597641 (#4064876)
│ │ └─1597641 /bin/bash
└─init.scope (#4053146)
...

5.2 Enforce Resource Limits

To enforce a cgroup resource limit, the zookeeper process writes the specified limit, as a string, into the corresponding interface file in the cgroup directory after calling clone() but before the container process runs.
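As a sketch, assuming the parsed option values are kept in variables named mem_limit (-m, a string such as "100M"), cpu_pct (-c, an integer percentage), and proc_limit (-p, a string), the limits map to the standard cgroup v2 interface files memory.max, cpu.max, and pids.max; write_cgroup_file() is an illustrative helper, not part of the skeleton:

static void write_cgroup_file(const char *cgroup, const char *file,
                              const char *value)
{
    char path[PATH_MAX];
    snprintf(path, sizeof(path), "%s/%s", cgroup, file);

    FILE *f = fopen(path, "w");
    if (!f)
        err(1, "fopen %s", path);
    fputs(value, f);
    if (fclose(f) == EOF)
        err(1, "write %s", path);
}

/* ... in zookeeper, after the container cgroup has been created ... */
char buf[64];

if (mem_limit)                                  /* -m: memory limit */
    write_cgroup_file(cgroup, "memory.max", mem_limit);

if (cpu_pct > 0) {                              /* -c: percentage of one CPU */
    snprintf(buf, sizeof(buf), "%d 100000", cpu_pct * 1000);
    write_cgroup_file(cgroup, "cpu.max", buf);  /* "<quota> <period>" in microseconds */
}

if (proc_limit)                                 /* -p: max number of processes */
    write_cgroup_file(cgroup, "pids.max", proc_limit);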

5.3 Clean up

The cgroup created in the previous steps is persistent: it will linger on after all container processes have exited. Thus, we need to delete it manually when it is no longer needed, i.e., once the main container process has exited. This is the reason why zookeeper uses waitpid(2) to wait for the container process to terminate.

Once the main container process has terminated, delete the cgroup directory with rmdir(). You can retrieve the path to the cgroup directory using get_container_cgroup_path().

5.4 Experiment

Devise an experiment that attempts to exceed the limit for each of the cgroup controllers specified above. Describe your experiments and results in the README.txt at the top-level of your repo.

Part 6: Networking

In this part, we will configure networking in the container. When the container manager is run under root (e.g., Docker), it typically creates a virtual Ethernet pair between the host and the container. The manager then configures NAT (IP masquerading) in the host to allow the container to communicate over the Internet. Unfortunately, these steps require administrative privileges in the host. In other words, this approach is unavailable to an ordinary (rootless) container manager.

A workaround (as implemented in rootless Podman) is to use a helper program called slirp4netns. This program implements a userspace networking stack and makes the stack available to the container via a TAP network interface. slirp4netns operates like a VPN client, except there is no VPN. The slirp4netns userspace stack receives packets from the container and forwards the packets as an ordinary TCP/UDP/IP client on the host. This approach is not very performant, but without root on the host, this is the best we can do.

To enable networking in the container, we need to start the slirp4netns helper in the zookeeper process, giving it the pid of the container process. Create a new child process using fork() and then execute slirp4netns with execlp() as follows:

 execlp("slirp4netns", "slirp4netns",
        "--disable-host-loopback",
        "--mtu=65521",
        "--enable-seccomp",
        "--enable-ipv6",
        "-c", container_pid_str, "eth0",
        (char*)NULL);

The helper outputs debugging information to stdout and stderr, so consider closing those file descriptors before execlp() if you don’t want to see it.

If you start the helper correctly, systemd-cgls should show the slirp4netns process in the same cgroup as the zookeeper process:

│   ├─user@1000.service … (#4120130)
│   │ ├─user.slice (#4120428)
│   │ │ ├─zookeeper-1803218.scope (#4128689)
│   │ │ │ ├─1803218 ./zookeeper -r /home/janakj/fs -n /bin/bash
│   │ │ │ └─1803220 slirp4netns --disable-host-loopback --mtu=65521 --enable-seccomp --enable-ipv6 -c 1803219 eth0
│   │ │ └─zoo-1803219 (#4128733)
│   │ │   └─1803219 /bin/bash

When the container terminates, don’t forget to properly shut down the slirp4netns process with kill() followed by waitpid().
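A sketch of the helper’s lifecycle; slirp_pid and container_pid_str are illustrative names:

pid_t slirp_pid = fork();
if (slirp_pid < 0)
    err(1, "fork");
if (slirp_pid == 0) {
    /* Silence the helper's debugging output. */
    close(STDOUT_FILENO);
    close(STDERR_FILENO);
    execlp("slirp4netns", "slirp4netns",
           "--disable-host-loopback", "--mtu=65521",
           "--enable-seccomp", "--enable-ipv6",
           "-c", container_pid_str, "eth0",
           (char *)NULL);
    exit(127);   /* exec failed; stderr is already closed */
}

/* ... after waitpid() reports that the container has terminated ... */
kill(slirp_pid, SIGTERM);
waitpid(slirp_pid, NULL, 0);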

Note on DNS: To resolve host names in the container, ensure you have a usable DNS server in /etc/resolv.conf in the container’s filesystem. You can use the Google DNS server:

echo "nameserver 8.8.8.8" > /etc/resolv.conf

The filesystem provided in the tarball has this correctly configured already. If you ever create your own filesystem, you may need to edit /etc/resolv.conf manually. In Docker and Podman, this file is usually bind-mounted from the host.

You should now be able to install new packages from within your container, for example:

# apt update; apt install cowsay

You’ll probably see some errors/warnings because our container environment still has some kinks to work out – just ignore them; the installation will still succeed.


Last updated: 2024-04-16