Hack 15 Restrict System Calls with Systrace

Keep your programs from performing tasks they weren't meant to do.

One of the more exciting new features in NetBSD and OpenBSD is systrace, a system call access manager. With systrace, a system administrator can specify which programs can make which system calls, and how those calls can be made. Proper use of systrace can greatly reduce the risks inherent in running poorly written or exploitable programs. Systrace policies can confine users in a manner completely independent of Unix permissions. You can even define the errors that system calls return when access is denied, so that programs fail more gracefully. Using systrace well requires a practical understanding of system calls and of the functionality a program needs in order to work.

First of all, what exactly are system calls? A system call is a function that lets you talk to the operating-system kernel. If you want to allocate memory, open a TCP/IP port, or perform input/output on the disk, you'll need to use a system call. System calls are documented in section 2 of the manpages.

Unix also supports a wide variety of C library calls. These are often confused with system calls but are actually just standardized routines for things that could be written within a program. For example, you could easily write a function to compute square roots within a program, but you could not write a function to allocate memory without using a system call. If you're in doubt whether a particular function is a system call or a C library function, check the online manual.
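To make the distinction concrete, here is a minimal Python sketch. The function name newton_sqrt is hypothetical and chosen only for illustration. It computes a square root with plain arithmetic, which needs no help from the kernel, and then asks for the process ID, which is only obtainable through a system call (Python's os.getpid() is a thin wrapper around getpid(2)). The text's own example is memory allocation; getpid() is used here only because it is simpler to show.

```python
import os


def newton_sqrt(x, iterations=30):
    """Approximate sqrt(x) with Newton's method: pure arithmetic, no kernel involved."""
    if x < 0:
        raise ValueError("x must be non-negative")
    if x == 0:
        return 0.0
    guess = x if x >= 1 else 1.0
    for _ in range(iterations):
        # Average the guess with x / guess; this converges on sqrt(x).
        guess = 0.5 * (guess + x / guess)
    return guess


# Library-style work: could be written entirely inside the program.
root = newton_sqrt(2.0)

# Kernel work: the process ID lives in the kernel, so a system call is required.
pid = os.getpid()
```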

You may find an occasional system call that is not documented in the online manual, such as break(). You'll need to dig into other resources to identify these calls (break() in particular is a very old system call used within libc, but not by programmers, so it seems to have escaped being documented in the manpages).

Systrace denies all actions that are not explicitly permitted and logs the rejection using syslog. If a program running under systrace has a problem, you can find out which system call the program wants to use and decide if you want to add it to your policy, reconfigure the program, or live with the error.
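The default-deny behavior can be pictured with a small Python sketch. This is not systrace's implementation; the decide function and the permitted set are hypothetical, and Python's logging module stands in for syslog.

```python
import logging

logger = logging.getLogger("systrace-sketch")


def decide(permitted, syscall):
    """Return "permit" only for calls listed in the policy; deny and log the rest."""
    if syscall in permitted:
        return "permit"
    # Anything not explicitly permitted is rejected, and the rejection is logged.
    logger.warning("deny %s", syscall)
    return "deny"


# A hypothetical two-entry policy.
permitted = {"native-accept", "native-bind"}
```

With this policy, decide(permitted, "native-accept") is permitted, while a call the policy never mentions, such as "native-execve", is denied and leaves a log entry you can use to decide whether the policy needs another rule.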

Systrace has several important pieces: policies, the policy generation tools, the runtime access management tool, and the sysadmin real-time interface. This hack gives a brief overview of policies; in [Hack #16], we'll learn about the systrace tools.

The systrace(1) manpage includes a full description of the syntax used for policy descriptions, but I generally find it easier to look at some examples of a working policy and then go over the syntax in detail. Since named has been a subject of recent security discussions, let's look at the policy that OpenBSD 3.2 provides for named.

Before reviewing the named policy, let's review some commonly known facts about the name server daemon's system-access requirements. Zone transfers and large queries occur on port 53/TCP, while basic lookup services are provided on port 53/UDP. OpenBSD chroots named into /var/named by default and logs everything to /var/log/messages.

Each systrace policy is stored in a file named after the full path of the program, with the slashes replaced by underscores. The policy file usr_sbin_named contains quite a few entries that allow access beyond binding to port 53 and writing to the system log. The file starts with:

# Policy for named that uses named user and chroots to /var/named

# This policy works for the default configuration of named.

Policy: /usr/sbin/named, Emulation: native

The Policy statement gives the full path to the program this policy is for. You can't fool systrace by giving the same name to a program elsewhere on the system. The Emulation entry shows which ABI this policy is for. Remember, BSD systems expose ABIs for a variety of operating systems. Systrace can theoretically manage system-call access for any ABI, although only native and Linux binaries are supported at the moment.
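The file-naming rule described earlier (full path, slashes replaced by underscores) can be sketched in a few lines of Python. The helper name policy_filename is hypothetical, and dropping the leading slash is inferred from the usr_sbin_named example rather than from any systrace documentation.

```python
def policy_filename(program_path):
    """Derive a policy file name from a program's absolute path."""
    if not program_path.startswith("/"):
        raise ValueError("expected an absolute path")
    # Drop the leading slash, then turn the remaining slashes into underscores.
    return program_path.lstrip("/").replace("/", "_")


name = policy_filename("/usr/sbin/named")
```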

The remaining lines define a variety of system calls that the program may or may not use. The sample policy for named includes 73 lines of system-call rules. The most basic look like this:

native-accept: permit

When /usr/sbin/named tries to use the accept() system call to accept a connection on a socket, under the native ABI, it is allowed. Other rules are far more restrictive. Here's a rule for bind(), the system call that lets a program request a TCP/IP port to attach to:

native-bind: sockaddr match "inet-*:53" then permit

sockaddr is the name of an argument taken by the bind() system call. The match keyword tells systrace to compare the given variable with the string inet-*:53, according to the standard shell pattern-matching (globbing) rules. So, if the variable sockaddr matches the string inet-*:53, the bind request is permitted. This program can bind to port 53, over both TCP and UDP protocols. If an attacker had an exploit to make named attach a command prompt on a high-numbered port, this systrace policy would prevent that exploit from working.
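Shell-style matching of this kind is easy to experiment with in Python, whose fnmatch module implements the same family of glob rules. In the sketch below, the bind_allowed helper is hypothetical, and the address strings are assumed examples of how a socket address might be rendered as text; check systrace's own output for the exact format.

```python
from fnmatch import fnmatchcase

PATTERN = "inet-*:53"


def bind_allowed(sockaddr_text):
    """Return True if the textual socket address matches the glob in the policy."""
    return fnmatchcase(sockaddr_text, PATTERN)


# Assumed textual renderings of socket addresses, for illustration only.
dns_port = bind_allowed("inet-[0.0.0.0]:53")
high_port = bind_allowed("inet-[0.0.0.0]:31337")
lookalike = bind_allowed("inet-[0.0.0.0]:5353")
```

Only the first address matches. The pattern must match the whole string, so a port that merely ends in the digits 53 does not slip through.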

At first glance, this seems wrong:

native-chdir: filename eq "/" then permit

native-chdir: filename eq "/namedb" then permit

The eq keyword compares one string to another and requires an exact match. If the program tries to go to the root directory, or to the directory /namedb, systrace will allow it. Why would you possibly want to allow named to access the root directory? The next entry explains why:

native-chroot: filename eq "/var/named" then permit

We can use the native chroot() system call to change our root directory to /var/named, but to no other directory. At this point, the /namedb directory is actually /var/named/namedb. We also know that named logs to syslog. To do this, it will need access to /dev/log:

native-connect: sockaddr eq "/dev/log" then permit

This program can use the native connect() system call to talk to /dev/log and only /dev/log. The system logger listens on that socket and hands the messages off elsewhere.
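To see why the chdir rules shown earlier are safe, it helps to translate paths as seen inside the chroot back to paths on the real filesystem. The following Python sketch does only that string translation; the helper name path_outside_chroot is hypothetical, and no actual chroot() call is made.

```python
import posixpath


def path_outside_chroot(new_root, path_inside):
    """Map a path as seen inside a chroot to the corresponding real path."""
    # Strip the leading slash so join() treats the inner path as relative.
    joined = posixpath.join(new_root, path_inside.lstrip("/"))
    return posixpath.normpath(joined)


inner_root = path_outside_chroot("/var/named", "/")
inner_namedb = path_outside_chroot("/var/named", "/namedb")
```

After the chroot to /var/named, the "/" that named is allowed to enter is really /var/named, and /namedb is really /var/named/namedb.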

We'll also see some entries for system calls that do not exist:

native-fsread: filename eq "/" then permit

native-fsread: filename eq "/dev/arandom" then permit

native-fsread: filename eq "/etc/group" then permit

Systrace aliases certain system calls with very similar functions into groups. You can disable this functionality with a command-line switch and only use the exact system calls you specify, but in most cases these aliases are quite useful and shrink your policies considerably. The two aliases are fsread and fswrite. fsread is an alias for stat(), lstat(), readlink(), and access() under the native and Linux ABIs. fswrite is an alias for unlink(), mkdir(), and rmdir(), in both the native and Linux ABIs. As open() can be used to either read or write a file, it is aliased by both fsread and fswrite, depending on how it is called. So named can list the contents of the root directory, read /dev/arandom, and read certain files in /etc, such as the group file.
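The grouping can be sketched as a lookup in Python. The alias_for function is hypothetical, and the rule used for open() (read-only access counts as fsread, anything else as fswrite) is an assumption about how "depending on how it is called" works out in practice, not a statement of systrace's exact logic.

```python
import os

# Alias membership as listed in the text.
FSREAD = {"stat", "lstat", "readlink", "access"}
FSWRITE = {"unlink", "mkdir", "rmdir"}


def alias_for(syscall, open_flags=None):
    """Return "fsread", "fswrite", or None for a system call name."""
    if syscall == "open":
        if open_flags is None:
            raise ValueError("open() needs its flags to be classified")
        # Assumption: only a read-only open counts as a filesystem read.
        access_mode = open_flags & os.O_ACCMODE
        return "fsread" if access_mode == os.O_RDONLY else "fswrite"
    if syscall in FSREAD:
        return "fsread"
    if syscall in FSWRITE:
        return "fswrite"
    return None
```

For example, alias_for("open", os.O_RDONLY) falls under fsread, while alias_for("open", os.O_WRONLY | os.O_CREAT) falls under fswrite.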

Systrace supports two optional keywords at the end of a policy statement, errorcode and log. The errorcode is the error returned to the program when its attempt to use the system call is denied. Programs behave differently depending on the error they receive: named will react differently to a "permission denied" error than it will to an "out of memory" error. You can get a complete list of error codes from the errno manpage. Use the error name, not the error number. For example, here we return an error for nonexistent files:

filename sub "<non-existent filename>" then deny[enoent]
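From the program's side, a denial with a chosen error code looks like any other failed system call. The Python sketch below does not involve systrace at all; it triggers a genuine ENOENT by opening a file that does not exist, to show how the error name (rather than the number) identifies what went wrong. The function name is hypothetical.

```python
import errno
import os
import tempfile


def error_name_for_missing_file():
    """Open a file that cannot exist and return the symbolic name of the error."""
    with tempfile.TemporaryDirectory() as tmp:
        missing = os.path.join(tmp, "no-such-file")
        try:
            with open(missing, "rb"):
                pass
        except OSError as exc:
            # errno.errorcode maps the numeric code back to its name.
            return errno.errorcode[exc.errno]
    return None


name = error_name_for_missing_file()
```

A program that already copes with a missing file will cope equally well when a policy denies the call with the same error.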

If you put the word log at the end of your rule, successful system calls will be logged. For example, if we wanted to log each time named attached to port 53, we could edit the policy statement for the bind() call to read:

native-bind: sockaddr match "inet-*:53" then permit log

You can also filter rules based on user ID and group ID, as this example demonstrates:

native-setgid: gid eq "70" then permit

This very brief overview covers the vast majority of the rules you will see. For full details on the systrace grammar, read the systrace manpage. If you want some help with creating your policies, you can also use systrace's automated mode [Hack #16].

The original article that this hack is based on is available online at http://www.onlamp.com/pub/a/bsd/2003/01/30/Big_Scary_Daemons.html.

-Michael Lucas