
Running system services in containers


At FOSDEM, in the awesome Guile track, I briefly demoed a new experimental GuixSD feature as part of my talk on system services: the ability to run system services in containers or “sandboxes”. This post discusses the rationale, status, and implementation of this feature.

The problem

Our computers run many programs that talk to the Internet, and the Internet is an unsafe place as we all know—with states and assorted organizations collecting “zero-day exploits” to exploit them as they see fit. One of the big tasks of operating system distributions has been to keep track of known software vulnerabilities and patch their packages as soon as possible.

When we look closer, many vulnerabilities out there can be exploited because of a combination of two major weaknesses of GNU/Linux and similar Unix-like operating systems: lack of memory-safety in the C language family, and ambient authority in the operating system itself. The former leads to a huge class of bugs that become security issues: buffer overflows, use-after-free, and so on. The latter makes them more exploitable because processes have access to many resources beyond those they really need.

Security-sensitive software is now increasingly written in memory-safe languages, as is the case for Guix and GuixSD. Projects that have been using C are even considering a complete rewrite, as is the case for Tor. Of course the switch away from memory-unsafe languages won’t happen overnight, but it’s good to see a consensus emerging.

The operating system side of things is less bright. Although the principle of least authority (POLA) has been well-known in operating system circles for a long time, it remains foreign to Unix and GNU/Linux. Processes run with the full authority of their user. On top of that, until recent changes to the Linux kernel, resources were global and there was essentially a single view of the file system, of the process hierarchy, and so on. So when a remote-code-execution vulnerability affects a system service—like in the BitlBee instant messaging gateway (CVE-2016-10188) running on my laptop—an attacker could potentially do a lot on your machine.

Fortunately, many daemons have built-in mechanisms to work around this operating system defect. For instance, BitlBee and Tor can be told to switch to a separate unprivileged user, while avahi-daemon and ntpd can do that and also change root. These techniques do reduce the privileges of those processes, but they are still imperfect and ad hoc.

Increasing process isolation with containers

The optimal solution to this problem would be to honor POLA in the first place. As an example, the venerable GNU/Hurd is a capability-based operating system. Thus, GNU/Hurd has supported fine-grained virtualization from the start: a newly-created process can be given a capability to its own proc server (which implements the POSIX notion of processes), to a specific TCP/IP server, etc. In addition, its POSIX personality offers interesting extensions, such as the fact that processes run with the authority of zero or more UIDs. For instance, the Hurd’s login program starts off with zero UIDs and gains a UID when someone has been authenticated.

Back to GNU/Linux, “namespaces” have been introduced as a way to retrofit per-process views of the system resources, and thus improve isolation among processes. Each process can run in a separate namespace and thus have a different view of the file system, process tree, and so on (a process running in separate namespaces is often referred to as a “container”, although that term is sometimes used to denote much larger tooling and practices built around namespaces.) Why not use that to better isolate system services?
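
To make the notion of per-process views concrete, here is a minimal sketch that uses the unshare tool from util-linux rather than anything specific to Guix. The helper name run_in_namespaces is made up for illustration, and the sketch assumes a kernel that permits unprivileged user namespaces (hence the --user and --map-root-user options, which are only there so that it can run without root).

```shell
#!/bin/sh
# Hypothetical helper: run a command in fresh user, mount, and PID
# namespaces using unshare(1) from util-linux.
# Assumption: unprivileged user namespaces are enabled; if they are not,
# unshare fails and its exit status is returned to the caller.
run_in_namespaces() {
    # --fork makes the command a child of unshare, so that it becomes the
    # first process of the new PID namespace; --mount-proc mounts a fresh
    # /proc that only shows the processes of that namespace.
    unshare --user --map-root-user --mount --pid --fork --mount-proc "$@"
}

# Example (not run here):
# run_in_namespaces sh -c 'echo /proc/[0-9]*'
```

When namespaces are available, the example command should list only the processes of the new PID namespace instead of the whole process tree of the host.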

Apparently this idea has been floating around. systemd has been considering extending its “unit files” with directives instructing systemd to run daemons in separate namespaces. GuixSD uses the Shepherd instead of systemd, but running system services in separate namespaces is something we had been considering for a while.

In fact, adding the ability to run system services in containers was a low-hanging fruit: we already had call-with-container to run code in containers, so all we needed to do was to provide a containerized service starter that uses call-with-container.

The Shepherd itself remains unaware of namespaces: it simply ends up calling make-forkexec-constructor/container instead of make-forkexec-constructor and that’s it. The changes to the service definitions of BitlBee and Tor are minimal. The end result, for Tor, looks like this:

(let ((torrc (tor-configuration->torrc config)))
  (with-imported-modules (source-module-closure
                          '((gnu build shepherd)
                            (gnu system file-systems)))
    (list (shepherd-service
           (provision '(tor))
           (requirement '(user-processes loopback syslogd))

           (modules '((gnu build shepherd)
                      (gnu system file-systems)))

           (start #~(make-forkexec-constructor/container
                     (list #$(file-append tor "/bin/tor") "-f" #$torrc)

                     #:mappings (list (file-system-mapping
                                       (source "/var/lib/tor")
                                       (target source)
                                       (writable? #t))
                                      (file-system-mapping
                                       (source "/dev/log") ;for syslog
                                       (target source)))))
           (stop #~(make-kill-destructor))
           (documentation "Run the Tor anonymous network overlay.")))))

The with-imported-modules form above instructs Guix to import our (gnu build shepherd) library, which provides make-forkexec-constructor/container, into PID 1. The start method of the service specifies the command to start the daemon, as well as file systems to map in its mount namespace (“bind mounts”). Here all we need is write access to /var/lib/tor and to /dev/log (for logging via syslogd). In addition to these two mappings, make-forkexec-constructor/container automatically adds /gnu/store and a bunch of files in /etc as we will see below.

Containerized services in action

So what do these containerized services look like when they’re running? When we run herd status bitlbee, disappointingly, we don’t see anything special:

charlie@guixsd ~$ sudo herd status bitlbee
Status of bitlbee:
  It is started.
  Running value is 487.
  It is enabled.
  Provides (bitlbee).
  Requires (user-processes networking).
  Conflicts with ().
  Will be respawned.
charlie@guixsd ~$ ps -f 487
bitlbee    487     1  0 Apr11 ?        Ss     0:00 /gnu/store/pm05bfywrj2k699qbxpjjqfyfk3grz2i-bitlbee-3.5.1/sbin/bitlbee -n -F -u bitlbee -c /gnu/store/y4jfxya56i1hl9z0a2h4hdar2wm

Again this is because the Shepherd has no idea what a namespace is, so it just displays the daemon’s PID in the global namespace, 487. The process is running as user bitlbee, as requested by the -u bitlbee command-line option.

We can invoke nsenter and take a look at what the BitlBee process “sees” in its namespace:

charlie@guixsd ~$ sudo nsenter -t 487 -m -p -i -u $(readlink -f $(type -P bash))
root@guixsd /# echo /*
/dev /etc /gnu /proc /tmp /var
root@guixsd /# echo /proc/[0-9]*
/proc/1 /proc/5
root@guixsd /# read line < /proc/1/cmdline
root@guixsd /# echo $line
root@guixsd /# echo /etc/*
/etc/hosts /etc/nsswitch.conf /etc/passwd /etc/resolv.conf /etc/services
root@guixsd /# echo /var/*
/var/lib /var/run
root@guixsd /# echo /var/lib/*
root@guixsd /# echo /var/run/*
/var/run/bitlbee.pid /var/run/nscd

There’s no /home and generally very little in BitlBee’s mount namespace. Notably, the namespace lacks /run/setuid-programs, which is where setuid programs live in GuixSD. Its /etc directory contains the minimal set of files needed for proper operation rather than the complete /etc of the host. /var contains nothing but BitlBee’s own state files, as well as the socket to libc’s name service cache daemon (nscd), which runs in the host system and performs name lookups on behalf of applications.

As can be seen in /proc, there’s only a couple of processes in there and “PID 1” in that namespace is the bitlbee daemon. Finally, the /tmp directory is a private tmpfs:

root@guixsd /# : > /tmp/hello-bitlbee
root@guixsd /# echo /tmp/*
root@guixsd /# exit
charlie@guixsd ~$ ls /tmp/*bitlbee
ls: cannot access '/tmp/*bitlbee': No such file or directory

Our bitlbee process runs in a separate mount, PID, and IPC namespace, but it runs in the global user namespace. The reason for this is that we want the -u bitlbee option (which instructs bitlbee to setuid to an unprivileged user at startup) to work as expected. It also shares the network namespace because obviously it needs to access the network.
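
One way to check which namespaces a process shares with another is to compare the symbolic links under /proc/PID/ns, which Linux exposes for every process. The sketch below does that. The function name compare_namespaces is hypothetical, and reading the links of a process owned by another user generally requires root.

```shell
#!/bin/sh
# Hypothetical helper: for each namespace type, report whether process $1
# is in the same namespace as process $2 (default: the current shell).
# It relies on the /proc/PID/ns/* symlinks provided by Linux; two processes
# share a namespace when the corresponding links have the same target.
compare_namespaces() {
    target=$1
    reference=${2:-$$}
    for ns in mnt pid ipc uts net user; do
        a=$(readlink "/proc/$target/ns/$ns") || return 1
        b=$(readlink "/proc/$reference/ns/$ns") || return 1
        if [ "$a" = "$b" ]; then
            echo "$ns: shared"
        else
            echo "$ns: separate"
        fi
    done
}

# Example (as root, with the BitlBee PID reported by herd):
# compare_namespaces 487
```

Going by the description above, one would expect such a check to report at least mnt, pid, and ipc as separate for the bitlbee process, and net and user as shared with the host.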

A nice side-effect of these fully-specified execution environments is that services become more likely to behave in a reproducible fashion across machines, just like fully-specified build environments help achieve reproducible builds.


GuixSD master and its upcoming release include this feature and a couple of containerized services, and it works like a charm! Yet, there are still open questions as to the way forward.

First, we only looked at “simple” services so far, with simple static file system mappings. Good candidates for increased isolation are HTTP servers such as NGINX. However, for these, it’s more difficult to determine the set of file system mappings that must be made. GuixSD has the advantage that it knows how NGINX is configured and could potentially derive file system mappings from that information. Getting it right may be trickier than it seems, though, so this is something we’ll have to investigate.

Another open question is how the service isolation work should be split between the distro, the init system, and the upstream service author. Authors of daemons already do part of the work via setuid and sometimes chroot. Going beyond that would often hamper portability (the namespace interface is specific to the kernel Linux) or even functionality if the daemon ends up lacking access to resources it needs.

The init system alone also lacks information to decide what goes into the namespaces of the service. For instance, neither the upstream author nor the init system “knows” whether the distro is running nscd and thus they cannot tell whether the nscd socket should be bind-mounted in the service’s namespace. A similar issue is that of D-Bus policy files discussed in this LWN article. Moving D-Bus functionality into the init system itself to solve this problem, as the article suggests, seems questionable, notably because it would add more code to this critical process. Instead, on GuixSD, a service author can make the right policy files available in the sandbox; in fact, GuixSD already knows which policy files are needed thanks to its service framework so we might even be able to automate it.

At this point it seems that tight integration between the distro and the init system is the best way to precisely define system service sandboxes. GuixSD’s declarative approach to system services along with tight Shepherd integration help a lot here, but it remains to be seen how difficult it is to create sandboxes for complex system services such as NGINX.

About GNU Guix

GNU Guix is a transactional package manager for the GNU system. The Guix System Distribution or GuixSD is an advanced distribution of the GNU system that relies on GNU Guix and respects the user's freedom.

In addition to standard package management features, Guix supports transactional upgrades and roll-backs, unprivileged package management, per-user profiles, and garbage collection. Guix uses low-level mechanisms from the Nix package manager, except that packages are defined as native Guile modules, using extensions to the Scheme language. GuixSD offers a declarative approach to operating system configuration management, and is highly customizable and hackable.

GuixSD can be used on an i686 or x86_64 machine. It is also possible to use Guix on top of an already installed GNU/Linux system, including on mips64el, armv7, and aarch64.

Read the whole story
41 days ago
Cluj-Napoca, România
Share this story

Unity is dead. Long live Ubuntu!

Rollercoaster ... of Linux. Again. In this article, I discuss the recent announcement by Canonical to stop the development for phone and convergence, why this happened and what it implies, the technological and strategic directions and challenges, Gnome 3 alternative, fragmentation, uncertain future, and more. Take a look.
46 days ago

The High Cost of On Premises Infrastructure

IT Infrastructure is a challenge for any company and especially companies that are not large enough to implement their own, full scale datacenters. Like many things in IT, major challenges come in the form of lacking specific, seldom used expertise as well as lacking the scale to utilize singular resources effectively. This lack of scale …
68 days ago
2 public comments
67 days ago
After over a year at a colocation facility as a jack-of-all-trades systems guy, I can say, with a certainty that I did not possess before, that there is nothing easy or cheap when it comes to building even just a passable data center. Every aspect has layers of complexity that make small but important details easy to lose or forget.
Seymour, Indiana
67 days ago
I'd still lean cloud for the kind of small businesses this article is speaking to. For example, MySQL replication and backups are all automated in the cloud, but a pain in the ass to build and maintain on your own servers without knowledgeable staff.
Bend, Oregon

Why every business should consider an open source point of sale system


Point of sale (POS) systems have come a long way from the days of simple cash registers that rang up purchases. Today, POS systems can be all-in-one solutions that include payment processing, inventory management, marketing tools, and more. Retailers can receive daily reports on their cash flow and labor costs, often from a mobile device.

128 days ago

Jehanne: A Plan 9 based OS

Jehanne is a new distributed operating system designed for programmers. The core values that lead the development are simplicity and security. Jehanne is a fork of Harvey (which in turn is a fork of Plan 9 from Bell Labs merged with Nix's kernel sources) but diverges from the design and conventions of its ancestors whenever they are at odds with its goals. Read about development progress made in 2016.
135 days ago

Jim Hall: The importance of the press kit

I'd like to share a few lessons I've learned about creating a press kit. This helped us spread the word about our recent FreeDOS 1.2 release, and it can help your open source software project to get more attention.
I'm part of several open source software projects, but probably the one that I'll be remembered for is FreeDOS. Because FreeDOS is an open source software implementation of DOS, you might not think that it would get much attention in today's tech news. Yet when we released FreeDOS 1.2 a few weeks ago, we got a ton of news coverage.

Slashdot was the first to write about FreeDOS 1.2, but we also saw coverage from Engadget Germany, LWN, Heise Online, PC Forum Hungary, FOSS Bytes, ZDNet Germany, PC Welt, Tom's Hardware, and Open Source Feed. And that's just a sample of the news! There were articles from the US, Germany, Japan, Hungary, Ukraine, Italy, and others.

In reading the articles people had written about FreeDOS 1.2, I realized something that was both cool and insightful: most tech news sites re-used material from our press kit.

You see, in the weeks leading up to FreeDOS 1.2, I assembled additional information and resources about the FreeDOS 1.2 release, including a bunch of screenshots and other images of FreeDOS in action. In an article posted to our website, I highlighted the press kit, and added "If you are writing an article about FreeDOS, feel free to use this information to help you." And they did!

We track a complete timeline of interesting events on our FreeDOS History page, including links to articles. Comparing the press coverage from FreeDOS 1.0, FreeDOS 1.1 and FreeDOS 1.2, we definitely saw the most articles about FreeDOS 1.2. And unlike previous releases where only a few tech news websites wrote articles about FreeDOS and other news outlets mostly referenced the first few sites, the coverage of FreeDOS 1.2 was mostly original articles. Only a small handful were references to news items from other news sites.

I put that down to the press kit. With the press kit, journalists were able to quickly pull interesting information and quotes about FreeDOS, and find images they could use in their articles. For a busy journalist who doesn't have much time to write about a free DOS implementation in 2016, our press kit made it easy to create something fresh. And news sites love to write their own stories rather than link to other news sites. That means more eyeballs for them.

Here are a few lessons I learned from creating our press kit:

Include basic information about your open source software project.

What is your project about? What does it do? How is it useful? Who uses it? What are the new features in this release? These are the basic questions any journalist will want to answer in their article, if they choose to write about you. In the FreeDOS press kit, I also included a history about FreeDOS, discussing how we got started in 1994 and some highlights from our timeline.

Write in a casual, conversational tone that's easy to quote.

In writing about your project, pretend you are writing an email to someone you know. Or if you prefer, write like you are posting something to a personal blog. Keep it informal. Avoid jargon. If your language is too stuffy or too technical, journalists will have a hard time quoting from you. In writing the FreeDOS press kit, I started by listing a few common questions that people usually ask me about FreeDOS, then I just responded to them like I was answering an email. My answers were often long, but the paragraphs were short, so they were easier to skim.

Provide lots of screenshots of your project doing different things.

Whether your program runs from the command line or in a graphical environment, screenshots are key. And tech news sites like to use images; they are a cheap way to draw attention. So take lots of screenshots and include them in your press kit. Show all the major features through these screenshots. But be wary of background images and other branding that might distract from your screenshots. In particular, if the screenshot will show your desktop, set your wallpaper to the default for your operating system, or use a solid color in the range medium- to light-blue. For the FreeDOS press kit, I took a ton of screenshots of every step in the install process. I also grabbed screenshots of FreeDOS at the command line, running utilities and tools, and playing some of the games we installed.

Organize your material so it's easy to read.

You may find your press kit will become quite long. That's okay, as long as this doesn't make it difficult for someone to figure out what's there. Put the important stuff first. Use a table of contents, if you have a lot of information to share. Use headings and sections to break things up. If a journalist can't find the information they need to write an article about your project, they may skip it and write about something else. I organized our press kit like a simple website. An index page provided some basic information, with a list of links to other material contained in the press kit. I arranged our screenshots in separate "pages." And every page of screenshots started with a brief context, then listed the screenshots without much fanfare. But every screenshot included a description of what you were seeing. For example, I had over forty screenshots from installing FreeDOS, and I wrote a one-sentence description for each.

Be your own editor.

No matter how much work you put into it, no one will want to use your press kit if it is riddled with spelling errors and poor grammar. Consider writing your press kit material in a word processor and running a spell check against it. Read your text aloud and see if it makes sense to you. When you're done, try to look at your press kit from the perspective of someone who hasn't used your project before. Can they easily understand what it's about? To help you in this step, ask a friend to review the material for you.

Advertise, advertise, advertise!

Don't assume that tech news sites will seek you out. You need to reach out to them to let them know you have a new release coming up. Create your press kit well in advance, and about a week or two before your release, individually email every journalist or tech news website that might be interested in you. Most news sites have a "Contact us" link or list of editor "beats" where you can direct yourself to the writer or editor most likely to write about your topic. Craft a short email that lets them know who you are, what project you're from, when the next release will happen, and what new features it will include. Give them a link to the press kit directly in your email. But make the press kit easy to see in the email. Use the full URL to the press kit, and make it clickable. Also link to the press kit from your website, so anyone else who visits your project can quickly find the information they need to write an article.

By doing a little prep work before your next major release, you can increase the likelihood that others will write about you. And that means you'll get more people who discover your project, so your open source software project can grow.
137 days ago