We will be using GitHub for distributing and collecting your assignments. At this point you should already have a repository created on your behalf in the cs4157-hw GitHub org. Follow the instructions on the class listserv to get access to that repository.
To obtain the skeleton files that we have provided for you, you need to clone your private repository. Your repository page should have a button titled “< > Code”. Click on it, and under the “Clone” section, select “SSH” and copy the link from there. For example:
$ git clone git@github.com:cs4157-hw/hw6-<id>-<your-team-name>.git
The TAs will use scripts to download, extract, build, and display your code. It is essential that you DO NOT change the names of the skeleton files provided to you. If you deviate from the requirements, the grading scripts will fail and you will not receive credit for your work.
You need to have at least 5 git commits total, but we encourage you to have many more. Your final submission should be pushed to the main branch.
As always, your submission should not contain any binary files. Your program must compile with no warnings, and should not produce any memory leaks or errors when run under valgrind. This requirement applies to all parts of the assignment.
At a minimum, README.txt should contain the following info:
The description should indicate whether your solution for the part is working or not. You may also want to include anything else you would like to communicate to the grader, such as extra functionality you implemented or how you tried to fix your non-working code.
Answers to written questions, if applicable, must be added to the skeleton file we have provided.
In this assignment, you will create a container manager called “zookeeper” from scratch, using only the facilities provided by the Linux kernel. Zookeeper will run a given program in an isolated, virtualized container environment, similar to the containers created by existing tools like Docker or Podman. You will learn how to:
Zookeeper needs to be given a program to run in the container and command line options to configure the container. The skeleton code implements the following command line options using the getopt(3) library function:
$ ./zookeeper -h
Usage: zookeeper [options] <command>
Options:
-h Print this help message and exit
-r <dir> Set <dir> as / in the container
-m <mem> Limit the amount of memory available to the container
-c <cpu> Limit the amount of CPU percentage
-p <procs> Limit the number of processes the container can create
-P Mount /proc inside the container
-n Enable networking in the container
Review the skeleton code and the usage of getopt().
Use the clone() syscall to start <program> as a child process of zookeeper.
Things to keep in mind:

- Allocate the memory for the child's stack using mmap(). Be sure to specify the MAP_STACK flag; see man mmap for more details.
- The stack pointer you pass to clone() should point at the end of the memory buffer (i.e., the top of the stack).
- Add SIGCHLD to clone()’s flags argument; see “The child termination signal” section in man 2 clone for more details.
- Use execvp(3) to execute <program>.

As is, the child process will start running right after clone(). Later on in this assignment, the parent process will need to set up a few things before the child process can start running. Let’s synchronize the parent and child processes using an unnamed pipe.
The parent process can wrap the arguments for the child process in a simple struct like this:
struct child_args {
    int channel;   // Read-end of the pipe
    char **args;
};
The child process should immediately block on reading from the pipe. The parent process will write a character to the pipe once it has finished setting up, unblocking the child process.
Note: If you test your program now, the zookeeper process will terminate before the child process. We will fix that in the following section.
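The following is a minimal sketch of how these pieces might fit together; it is not the official solution. The names STACK_SIZE, child_func(), and start_container() are invented for illustration, struct child_args mirrors the one shown above, and error handling is abbreviated.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

#define STACK_SIZE (1024 * 1024)    /* 1 MiB stack for the child */

struct child_args {
    int channel;    /* Read-end of the pipe */
    char **args;
};

static int child_func(void *arg)
{
    struct child_args *ca = arg;
    char c;

    /* Block until the parent writes a byte to signal that setup is done. */
    if (read(ca->channel, &c, 1) != 1)
        exit(1);

    execvp(ca->args[0], ca->args);
    perror("execvp");
    exit(1);
}

static pid_t start_container(char **argv, int *write_end)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    /* Allocate the child's stack; MAP_STACK marks the mapping as stack memory. */
    char *stack = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    static struct child_args ca;
    ca.channel = fds[0];
    ca.args = argv;

    /* The stack grows down, so pass the end (top) of the buffer. SIGCHLD makes
     * the child's termination reported like that of a fork()ed child. */
    pid_t pid = clone(child_func, stack + STACK_SIZE, SIGCHLD, &ca);

    *write_end = fds[1];    /* Parent writes one byte here when setup is done. */
    return pid;
}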
Now that we know how to create a child process using clone(), we need to make the zookeeper parent process wait for the child to terminate. There are two reasons to do this:
Modify the parent process to use waitpid(2) to block until the child process has terminated. If the child terminated normally, obtain its return code and use it as the return code for the parent zookeeper process. If the child was terminated by a signal, the parent process should send itself the same signal so that it terminates in the same way. See man 2 waitpid for more details.
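Here is one way the waiting logic could look; wait_for_container() is an invented helper that receives the pid returned by clone(), and error handling is abbreviated.

#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>

static int wait_for_container(pid_t pid)
{
    int status;

    if (waitpid(pid, &status, 0) < 0)
        return 1;

    if (WIFEXITED(status))
        return WEXITSTATUS(status);    /* propagate the child's exit code */

    if (WIFSIGNALED(status)) {
        int sig = WTERMSIG(status);
        /* Re-raise the same signal so zookeeper terminates the same way. */
        signal(sig, SIG_DFL);
        raise(sig);
    }
    return 1;
}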
Start by copying your part1/ into part2/.
In this step, you will ask clone() to create a dedicated (virtualized) copy of various system resources for the container process. Thus, the process will have its own user/group, process, network, and mount namespaces. This will isolate the process from the rest of the OS.
Pass additional flags to clone() to virtualize the following resources of the container process:
Note well: Creating new instances of various global OS resources requires privileged access (CAP_SYS_ADMIN). Existing container managers (Docker, Podman) often run under root to get around this. Unfortunately, you don’t have that luxury in this homework since you do not have root access on SPOC. For this reason, we always have to create a new user/group namespace using the CLONE_NEWUSER flag. When this flag is combined with any other CLONE_NEW* flags, clone() first creates a new user/group namespace in which the container process will be given all privileges, including CAP_SYS_ADMIN. clone() then creates all other namespaces, setting the new user namespace as the owner. That is how a rootless container manager such as zookeeper can get around the CAP_SYS_ADMIN requirement.
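As an illustration only, the flags passed to the clone() call from part 1 might be combined as follows. The exact set of CLONE_NEW* flags must match the resource list above; CLONE_NEWUTS is included here on the assumption that the hostname change later in this part should not affect the host.

int flags = SIGCHLD
          | CLONE_NEWUSER    /* new user/group namespace (enables rootless operation) */
          | CLONE_NEWPID     /* new process (PID) namespace */
          | CLONE_NEWNS      /* new mount namespace */
          | CLONE_NEWNET     /* new network namespace */
          | CLONE_NEWUTS;    /* new UTS (hostname) namespace */

pid_t pid = clone(child_func, stack + STACK_SIZE, flags, &ca);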
If you run /bin/bash via zookeeper now, you should see that it runs under the user “nobody” in the container. We’ll fix this in the next section.
Currently, our container process runs under the user “nobody”, which is not what we want. We would like the process to run under “root”. To accomplish that, we need to install subuid and subgid maps to map user/group IDs inside of the container to real user/group IDs in the global namespace.
The first mapping we’ll make is from the root user/group in the container to your primary user/group ID in the global namespace. Use geteuid() and getegid() to retrieve your user/group IDs.
Processes within the container may need to run as different users/groups than root. Check out /etc/passwd on SPOC – some system services run as their own dedicated account. We’ll have to allow for more than one user/group inside our container. Luckily, Linux allocates a per-user range of UIDs/GIDs for this purpose; see man subuid and man subgid for more details. We’ll use these ranges to create mappings for more users/groups in our container to the global namespace.
Retrieve your user subuid and subgid ranges from /etc/subuid and /etc/subgid. These ranges are keyed by username, so you’ll have to use getpwuid() to translate your UID to a username. Then use the newuidmap and newgidmap utilities to install the two mappings described above. You’ll have to fork() and execvp() these utilities from the zookeeper process after clone() but before the container starts running. Note that we invoke these utilities instead of writing to /proc/<container-pid>/uid_map and /proc/<container-pid>/gid_map directly because writing the subuid and subgid ranges to these files requires root privileges. We get around that by invoking the newuidmap and newgidmap utilities, which are setuid-root. See the utilities’ man pages for more details and for the format in which to specify the mappings.
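A sketch of installing the uid map by invoking newuidmap (the gid map with newgidmap is analogous). The helper name install_uid_map() and the subuid_start/subuid_count parameters are assumptions; they would come from parsing /etc/subuid for the username returned by getpwuid().

#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int install_uid_map(pid_t container_pid, uid_t euid,
                           unsigned long subuid_start, unsigned long subuid_count)
{
    char pid_s[32], euid_s[32], start_s[32], count_s[32];

    snprintf(pid_s, sizeof(pid_s), "%ld", (long)container_pid);
    snprintf(euid_s, sizeof(euid_s), "%ld", (long)euid);
    snprintf(start_s, sizeof(start_s), "%lu", subuid_start);
    snprintf(count_s, sizeof(count_s), "%lu", subuid_count);

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* newuidmap <pid> <inside> <outside> <count> ...: map container uid 0
         * to our euid, and container uids 1..count to our subuid range. */
        execlp("newuidmap", "newuidmap", pid_s,
               "0", euid_s, "1",
               "1", start_s, count_s,
               (char *)NULL);
        _exit(127);
    }

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}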
Restart zookeeper with /bin/bash. You should now see the shell running under root.
Before you execvp() into <program>, change the hostname to “zoo” using sethostname().
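In the container process just before execvp(), something like the following should suffice (sethostname() is declared in <unistd.h>; this assumes a UTS namespace was created in the previous step so the change does not affect the host):

if (sethostname("zoo", strlen("zoo")) != 0)
    perror("sethostname");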
Restart zookeeper with /bin/bash; you should see “root@zoo” on the command prompt.
Start by copying your files from part2/ into part3/.
You now have a container process that runs under (pseudo) root, i.e., with capabilities such as CAP_SYS_ADMIN. The process also has access to the entire Linux kernel syscall API – this is generally fine for syscalls that support namespaces, but not all syscalls do. In this step, you will drop the container process’s capabilities and install seccomp rules to restrict the process’s access to the Linux kernel.
part3/zoo-cap-seccomp.h provides an array called capabilities_to_drop enumerating various capabilities the container process should not have. The list includes capabilities to access the kernel audit framework, suspend, wake up, or reboot the system, load kernel modules, perform raw I/O, configure resource limits, etc. See man 7 capabilities for more details. In this part, we will drop such capabilities.
You might wonder why this step is necessary since we only run zookeeper under an ordinary (non-root) user. Zookeeper creates a separate user namespace, restricting the container’s admin privileges to that user namespace only. In this assignment, neither zookeeper nor the container process has elevated privileges outside the user namespace. In general, however, you cannot assume that your container manager will not be run under root. Even though we won’t be able to test our changes, we’ll still drop these capabilities for completeness.
The provided drop_capabilities() function in zoo-cap-seccomp.h uses prctl() and the libcap library to drop the capabilities listed in the capabilities_to_drop array. The Linux capability framework is very complex and hard to understand. To save time, we’ve provided the code for you to simply call before you execvp() into <program>. Review the code and the associated man pages to get a grasp of what the code is trying to do.
Note: Since the list includes CAP_SYS_ADMIN, your container process will no longer be able to perform many administrative operations in its user namespace after this step. Make sure to call anything that requires the admin capability before this step.
At this point, our container process still has access to all Linux syscalls. Dropping the capabilities in the previous step prevents it from successfully invoking some of them, but many others remain available. Unfortunately, not all syscalls support namespaces. In this step, we will install seccomp rules to deny access to such syscalls.
part3/zoo-cap-seccomp.h provides a data structure called syscalls_to_prohibit. It is an array of structures where each structure contains a human-friendly name, the syscall number, the syscall’s arguments to match, and the number of such matches. The func, nargs, and arg fields are meant to be passed as arguments to seccomp_rule_add().
The provided configure_seccomp() function from zoo-cap-seccomp.h iterates over the array and installs a seccomp rule for each record using libseccomp library functions. We decided to also give away this function to save you some time. Study the code and read the associated man pages. Simply call this function before you execvp() into <program>.
Pick one of the rules that we installed and come up with an experiment that triggers it. Compare the result before and after you installed the seccomp rules. Describe your experiment and results in the README.txt at the top level of your repo.
Start by copying your part3/ into part4/.
Up to this point, your container uses the filesystem provided by the host, i.e., it has access to all files and directories on the host. In this step, we create a separate, dedicated filesystem for the container and teach zookeeper how to use it.
We have uploaded a tarball with a minimal Linux filesystem for you at /opt/asp/zoo-fs.tar. Extract the tarball into your home directory as follows:
$ mkdir ~/zoo-fs
$ cd ~/zoo-fs
$ # Temporarily clear umask so that files/directories retain their intended
$ # permissions.
$ (umask 0 && tar xvf /opt/asp/zoo-fs.tar)
Use mount() to recursively configure the container’s entire mount namespace (starting at /) as private. See man 2 mount for more details and for the appropriate flags. This step prevents mount and unmount event propagation between mount namespaces, as explained in this LWN article.
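For example, running the following in the container process after clone() has created the new mount namespace should work (mount() is declared in <sys/mount.h>; a sketch, not the only valid flag combination):

if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
    perror("mount MS_REC | MS_PRIVATE");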
Right now, you have the container filesystem in a regular directory. For the container to mount the directory as /, we must first turn the directory into a mount point. Create a temporary directory with mkdtemp(), using /tmp/zoo.XXXXXX as the template path. Bind-mount the filesystem onto the temporary directory using mount(). See man 2 mount for more details. Make sure zookeeper removes this temporary directory at cleanup time, before it terminates.
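A sketch of this step; prepare_rootfs_mount() is an invented helper, rootfs_dir stands for the directory given with -r, and the call is assumed to run in a process that holds CAP_SYS_ADMIN in the new namespaces.

#include <stdio.h>
#include <stdlib.h>
#include <sys/mount.h>
#include <unistd.h>

static int prepare_rootfs_mount(const char *rootfs_dir, char *tmpdir, size_t len)
{
    char tmpl[] = "/tmp/zoo.XXXXXX";

    if (mkdtemp(tmpl) == NULL) {
        perror("mkdtemp");
        return -1;
    }

    /* MS_REC also bind-mounts any submounts of the source directory. */
    if (mount(rootfs_dir, tmpl, NULL, MS_BIND | MS_REC, NULL) < 0) {
        perror("mount MS_BIND");
        rmdir(tmpl);
        return -1;
    }

    snprintf(tmpdir, len, "%s", tmpl);    /* remember it for cleanup (rmdir) */
    return 0;
}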
First, use chdir() to change the container’s current working directory to the temporary directory where the filesystem is bind-mounted.
Next, mount the proc pseudo-filesystem in your mount namespace at /path/to/tempdir/proc/ using the following mount() call:
mount("proc", mount_path, "proc", MS_NOSUID | MS_NODEV | MS_NOEXEC, NULL);
Next, use pivot_root() to make the filesystem bind-mounted at the temporary directory the root directory of the container’s mount namespace. Since glibc does not provide a pivot_root() wrapper, you’ll have to invoke it using the syscall() library function. See man 2 pivot_root and man 2 syscall for more details.
Note: pivot_root() requires a second temporary directory to move the original (host) root mount. Consider using the pivot_root(".", ".") trick mentioned at the end of man pivot_root.
Finally, unmount the original root mount using the umount2() syscall so the container can no longer access the host’s filesystem. You’ll need to use the MNT_DETACH flag here since the original mount is still in use by the host system.
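Putting the last few steps together, one possible sequence in the container process is sketched below. switch_root() is an invented helper, tmpdir is the bind-mounted temporary directory from earlier, the proc mount would normally be conditional on the -P option, and error handling is abbreviated.

#define _GNU_SOURCE
#include <sys/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

static int switch_root(const char *tmpdir)
{
    /* Work from inside the new root. */
    if (chdir(tmpdir) < 0)
        return -1;

    /* Mount /proc for the container ("proc" is relative to tmpdir here). */
    if (mount("proc", "proc", "proc",
              MS_NOSUID | MS_NODEV | MS_NOEXEC, NULL) < 0)
        return -1;

    /* pivot_root(".", ".") stacks the old root underneath the new one... */
    if (syscall(SYS_pivot_root, ".", ".") < 0)
        return -1;

    /* ...so detaching "." removes the host filesystem from view. */
    if (umount2(".", MNT_DETACH) < 0)
        return -1;

    return chdir("/");
}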
Start by copying your part4/ into part5/.
The fully functional container implemented in the previous steps has one important shortcoming: it has full access to all of the host’s hardware resources. In this step, we will implement command line options that allow zookeeper to limit the amount of resources the container can use through Linux’s cgroup mechanism. See man 7 cgroups for an overview.
Until now, all container processes share the same cgroup as the zookeeper process and are thus subject to the same resource limits as zookeeper. In this step, we will create a new cgroup for the container and move the main container process into it. All child processes created by the container will be put in the same cgroup, so they will all be subject to the same resource limits.
Note: On systemd-based Linux systems (such as Ubuntu), the cgroup hierarchy is managed by systemd. Unprivileged users have only limited access to it. Furthermore, that access needs to be coordinated by systemd. The skeleton code provides a helper function designed to help you get around these limitations. The function systemd_move_to_scope(), which is called from zookeeper’s main() function, asks systemd to carve out a cgroup subtree for zookeeper to manage. For the scope name, use "zookeeper-<pid>.scope", where <pid> is zookeeper’s pid.
The following function, get_container_cgroup_path(), returns the pathname (directory) of the cgroup for your container. You must stick to this pathname; it is the only place where an unprivileged zookeeper can create the cgroup. Note that CONTAINER_SCOPE is a macro that the skeleton code already defines.
/*
 * Return the full pathname to the container process' control group under
 * /sys/fs/cgroup. On systemd-powered systems, the returned pathname must match
 * the cgroup hierarchy set up for us by systemd since a non-root user does not
 * have write access to any other cgroups.
 */
static const char *get_container_cgroup_path(uid_t uid, pid_t pid)
{
    static char cgroup[PATH_MAX];
    int rc = snprintf(cgroup, sizeof(cgroup), CONTAINER_SCOPE, uid, uid, pid);
    if (rc < 0 || rc >= (int)sizeof(cgroup)) {
        warn("snprintf(CONTAINER_SCOPE)");
        return NULL;
    }
    return cgroup;
}
First, call get_container_cgroup_path() with your account’s uid and the pid of the main container process (the pid returned by clone()). Create the directory using mkdir(). Give the directory user read, write, and execute permissions.
Next, move your main container process to the new cgroup by writing its pid (as a string) into the file cgroup.procs in the cgroup directory.
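A sketch of these two steps, run from the zookeeper process after clone(); setup_container_cgroup() is an invented helper and error handling is abbreviated.

#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

static int setup_container_cgroup(uid_t uid, pid_t container_pid)
{
    const char *cg = get_container_cgroup_path(uid, container_pid);
    if (cg == NULL)
        return -1;

    if (mkdir(cg, 0700) < 0) {    /* user read, write, and execute */
        perror("mkdir cgroup");
        return -1;
    }

    char path[PATH_MAX];
    snprintf(path, sizeof(path), "%s/cgroup.procs", cg);

    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    fprintf(f, "%ld\n", (long)container_pid);    /* moves the process */
    return fclose(f) == 0 ? 0 : -1;
}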
You can use systemd-cgls to see if these steps worked. You should see the zookeeper process and the container process (e.g., /bin/bash) in separate cgroups under “user.slice”. The container cgroup is called zoo-<pid>, where <pid> is the pid of the main container process; in the command below, <uid> is your user ID.
$ systemd-cgls --unit "user@<uid>.service"
Unit user@<uid>.service (/user.slice/user-<uid>.slice/user@<uid>.service):
├─user.slice (#4053613)
│ ├─zookeeper-1597640.scope (#4064832)
│ │ └─1597640 ./zookeeper -r /home/janakj/root/ /bin/bash
│ ├─zoo-1597641 (#4064876)
│ │ └─1597641 /bin/bash
└─init.scope (#4053146)
...
To enforce cgroup resource limits, you’ll write the specified limit as a string into the corresponding file in the cgroup directory, from the zookeeper process after calling clone() but before the container process runs:
- cpu.max: write "<val> 100000", where val/100000 specifies the maximum CPU bandwidth as a percentage. For example, if a zookeeper user specifies -c 30, you should write "30000 100000" to cpu.max, which restricts the container process to 30% CPU.
- memory.max: write the memory limit given with the -m option.
- pids.max: write the process limit given with the -p option.
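For instance, a small helper like the one below (name invented) could write each limit file; for -c 30 you would call write_cgroup_file(cgroup_dir, "cpu.max", "30000 100000"), and similarly for memory.max and pids.max.

#include <limits.h>
#include <stdio.h>

static int write_cgroup_file(const char *cgroup_dir, const char *file,
                             const char *value)
{
    char path[PATH_MAX];
    snprintf(path, sizeof(path), "%s/%s", cgroup_dir, file);

    FILE *f = fopen(path, "w");
    if (f == NULL) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s\n", value);
    return fclose(f) == 0 ? 0 : -1;
}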
The cgroup created in the previous steps is persistent: it will linger on after all container processes have exited. Thus, we need to delete it manually when it is no longer needed, i.e., once the main container process has exited. This is the reason why zookeeper uses waitpid(2) to wait for the container process to terminate.
Once the main container process has terminated, delete the cgroup directory with rmdir(). You can retrieve the path to the cgroup directory using get_container_cgroup_path().
Devise an experiment that attempts to exceed the limit for each of the cgroup controllers specified above. Describe your experiments and results in the README.txt at the top level of your repo.
In this part, we will configure networking in the container. When the container manager is run under root (e.g., Docker), it typically creates a virtual Ethernet pair between the host and the container. The manager then configures NAT (IP masquerading) in the host to allow the container to communicate over the Internet. Unfortunately, these steps require administrative privileges in the host. In other words, this approach is unavailable to an ordinary (rootless) container manager.
A workaround (as implemented in rootless Podman) is to use a helper program called slirp4netns. This program implements a userspace networking stack and makes the stack available to the container via a TAP network interface. slirp4netns operates like a VPN client, except there is no VPN. The slirp4netns userspace stack receives packets from the container and forwards them as an ordinary TCP/UDP/IP client on the host. This approach is not very performant, but without root on the host, this is the best we can do.
To enable networking in the container, we need to start the slirp4netns helper in the zookeeper process, giving it the pid of the container process. Create a new child process using fork() and then execute slirp4netns with execlp() as follows:
execlp("slirp4netns", "slirp4netns",
"--disable-host-loopback",
"--mtu=65521",
"--enable-seccomp",
"--enable-ipv6",
"-c", container_pid_str, "eth0",
(char*)NULL);
The helper outputs debugging information to stdout and stderr, so consider closing those file descriptors before execlp() if you don’t want to see it.
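One possible way to start the helper is sketched below; start_slirp4netns() is an invented name, and stdout/stderr are redirected to /dev/null instead of closed, which achieves the same effect.

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static pid_t start_slirp4netns(pid_t container_pid)
{
    char pid_str[32];
    snprintf(pid_str, sizeof(pid_str), "%ld", (long)container_pid);

    pid_t pid = fork();
    if (pid != 0)
        return pid;    /* parent (or fork error) */

    /* Silence the helper's debugging output. */
    int devnull = open("/dev/null", O_WRONLY);
    if (devnull >= 0) {
        dup2(devnull, STDOUT_FILENO);
        dup2(devnull, STDERR_FILENO);
    }

    execlp("slirp4netns", "slirp4netns",
           "--disable-host-loopback", "--mtu=65521",
           "--enable-seccomp", "--enable-ipv6",
           "-c", pid_str, "eth0", (char *)NULL);
    _exit(127);
}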
If you start the helper correctly, you should see the slirp4netns process in the same cgroup as the zookeeper process in the output of systemd-cgls:
│ ├─user@1000.service … (#4120130)
│ │ ├─user.slice (#4120428)
│ │ │ ├─zookeeper-1803218.scope (#4128689)
│ │ │ │ ├─1803218 ./zookeeper -r /home/janakj/fs -n /bin/bash
│ │ │ │ └─1803220 slirp4netns --disable-host-loopback --mtu=65521 --enable-seccomp --enable-ipv6 -c 1803219 eth0
│ │ │ └─zoo-1803219 (#4128733)
│ │ │ └─1803219 /bin/bash
When the container terminates, don’t forget to properly shut down the slirp4netns process with kill() followed by waitpid().
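For example, assuming slirp_pid is the pid returned when you forked the helper:

kill(slirp_pid, SIGTERM);
waitpid(slirp_pid, NULL, 0);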
Note on DNS: To resolve host names in the container, ensure you have a usable DNS server in /etc/resolv.conf in the container’s filesystem. You can use the Google DNS server:
echo "nameserver 8.8.8.8" > /etc/resolv.conf
The filesystem provided in the tarball already has this correctly configured. If you ever create your own filesystem, you may need to edit /etc/resolv.conf manually. In Docker and Podman, this file is usually bind-mounted from the host.
You should now be able to install new packages from within your container, for example:
# apt update; apt install cowsay
You’ll probably see some errors/warnings because our container environment still has some kinks to work out – just ignore them; the installation will still succeed.
Last updated: 2024-04-16