Section 8.5. Special Topics

The following discussions involve not only CGI script security, but also Apache and Linux configuration and administration.

8.5.1 Authentication

Your web site may have some restricted content, such as premium pages for registered customers or administrative functions for web site maintainers. Use authentication to establish the identity of the visitor.

8.5.1.1 Basic authentication

The simplest authentication method in Apache is basic authentication. It requires a password file on the web server and a Require directive in a config file:

<Location /auth_demo_dir>
AuthName "My Authorization"
AuthType Basic
# Note: Keep the password files in their own directory
AuthUserFile /usr/local/apache/auth_dir/auth_demo_password
Order deny,allow
Require valid-user
</Location>

I suggest storing password files in their own directories, outside the document root. You may use subdirectories to segregate files by user or virtual host. This is more manageable than .htaccess files all over the site, and it keeps Apache running faster.

You can specify any matching user, a list of users, or a list of groups:

require valid-user
require user user1 user2 ...
require group group1 group2 ...

Where are the names and passwords stored? The simplest, specified by AuthUserFile in the example, is a flat text file on the server. To create the password file and add its first user, type the following:

htpasswd -c /usr/local/apache/auth_dir/auth_demo_password raoul

To add or update entries in the existing password file, omit the -c flag:

htpasswd /usr/local/apache/auth_dir/auth_demo_password raoul

... (prompt for password for raoul) ...

When a visitor attempts to access /auth_demo_dir on this site, a dialog box pops up and prompts him for his name and password. These will be sent with the HTTP stream to the web server. Apache will read the password file /usr/local/apache/auth_dir/auth_demo_password, get the encrypted password for the user raoul, and see if they match.

Don't put the password file anywhere under your DocumentRoot! Use one or more separate directories, readable by the user or group that Apache runs as and closed to everyone else.

An authentication method connects with a particular storage implementation (DBM, DB, MySQL, LDAP) by matching Apache modules and configuration directives. For example, mod_auth_mysql is configured with the table and column names in a customer table in a MySQL database. After the name and password are sent to Apache from the browser, mod_auth_mysql queries the database and Apache allows access if the query succeeds and the username and password were found.

Browsers typically cache this authentication information and send it to the web server as part of each HTTP request header for the same realm (a string specified to identify this resource). What if the user changes her password during her session? Or what if the server wants to log the client off after some period of inactivity? In either case, the cached credentials could become invalid, but the browser still holds them tight. Unfortunately, HTTP has no way for a server to expire credentials in the client. It may be necessary to clear all browser caches (memory and disk) to clear the authentication data, forcing the server to request reauthentication and causing the client to open a new dialogue box. Sessions and cookies are often used to limit login times.

One problem with basic authentication is that it is not encrypted. A sniffer can and will pick up the name and password. You can use SSL for the initial authentication (a URL starting with https://) and then use normal (http://) URLs thereafter, with the session ID in a cookie or appended to the URL. This gives some privacy during login and better performance afterwards.

Direct authentication with a scripting language gives more flexibility than the built-in browser dialogue box. The script writes the proper HTTP server headers to the client, and it processes the reply as though it came from the standard dialogue box.
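
In PHP under mod_php, for example, a minimal sketch looks like the following. The check_password() function is a placeholder for whatever lookup you actually use (a flat file, a database, and so on); only PHP_AUTH_USER, PHP_AUTH_PW, and the header() calls are standard.

<?php
// script_auth.php -- a minimal sketch of script-driven basic authentication
if (!isset($_SERVER['PHP_AUTH_USER']) ||
    !check_password($_SERVER['PHP_AUTH_USER'], $_SERVER['PHP_AUTH_PW'])) {
    // Ask the browser to pop up its login dialog for this realm:
    header('WWW-Authenticate: Basic realm="My Authorization"');
    header('HTTP/1.0 401 Unauthorized');
    echo "Authorization required.";
    exit;
}
echo "Welcome, " . htmlspecialchars($_SERVER['PHP_AUTH_USER']);

// Placeholder: look the user up in your own password file or database.
function check_password($user, $pass)
{
    return false;   // replace with a real check
}
?>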

8.5.1.2 Digest authentication

The second HTTP client authentication method, digest authentication , is more secure, since it uses an MD5 hash of data rather than clear-text passwords. RFC 2617 documents basic and digest authentication. The Apache server and Mozilla implement the standard correctly. Microsoft did not, so digest authentication in IE 5 and IIS 5 does not currently interoperate with other web servers and browsers.

8.5.1.3 Safer authentication

It's surprisingly tricky to create secure client authentication. User input can be forged, HTTP referrals are unreliable, and even the client's apparent IP can change from one access to the next if the user is behind a proxy farm. It would be beneficial to have a method that's usable within and across sites. For cross-site authentication, the authenticating server must convey its approval or disapproval in a way that can't be easily forged and that will work even if the servers aren't homogeneous and local.

A simple adaptation of these ideas follows. It uses a public variable with unique values to prevent a replay attack. A timestamp is useful since it can also be used to expire old logins. This value is combined with a constant string that is known only by the cooperating web servers to produce another string. That string is run through a one-way hash function. The timestamp and hashed string are sent from the authenticating web server (A) to the target web server (B).

Let's walk through the process. First, the client form gets the username and password and submits them to Server A:

<!-- Client form -->
<form method="get" action="https://a.test.com/auth.php">
User: <input type="text" name="user">
Password: <input type="password" name="password">
<input type="submit">
</form>

On Server A, get the timestamp, combine it with the secret string, hash the result, and redirect to Server B:

<?php
// a.test.com/auth.php
$time_arg = time();
$secret_string = "babaloo";
$hash_arg = md5($time_arg . $secret_string);
$url = "http://b.test.com/login.php" .
    "?" .
    "t=" . urlencode($time_arg) .
    "&h=" . urlencode($hash_arg);
header("Location: $url");
?>

On Server B, confirm the input from Server A:

<?php
// b.test.com/login.php
// Get the CGI variables:
$time_arg = $_GET['t'];
$hash_arg = $_GET['h'];

// Servers A and B both know the secret string,
// the variable(s) it is combined with, and their
// order:
$secret_string = "babaloo";
$hash_calc = md5($time_arg . $secret_string);

// Use a strict comparison to avoid PHP type-juggling surprises:
if ($hash_calc === $hash_arg)
    {
    // Check $time_arg against the current time.
    // If it's too old, this input may have come from a
    // bookmarked URL, or may be a replay attack; reject it.
    // If it's recent and the strings match, proceed with the login...
    }
else
    {
    // Otherwise, reject with some error message.
    }
?>

This is a better-than-nothing method, simplified beyond recognition from the following sources, which should be consulted for greater detail and security:

  • Example 16-2 in Web Security, Privacy and Commerce by Simson Garfinkel and Gene Spafford (O'Reilly).

  • Dos and Don'ts of Client Authentication on the Web (http://www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TR-818.pdf) describes how a team at MIT cracked the authentication schemes of a number of commercial sites, including the Wall Street Journal. Visit http://cookies.lcs.mit.edu/ for links to the Perl source code of their Kooky Authentication Scheme.

8.5.2 Access Control and Authorization

Once authenticated, what is the visitor allowed to do? This is the authorization or access control step. You can control access by a hostname or address, the value of an environment variable, or by a person's ID and password.

8.5.2.1 Host-based access control

This grants or blocks access based on a hostname or IP address. Here is a sample directive to prevent everyone at evil.com from viewing your site:

<Location />
# With "order allow,deny", the deny is applied after the blanket allow:
order allow,deny
allow from all
deny from .evil.com
</Location>

The . before evil.com is necessary. If I said:

deny from evil.com

I would also be excluding anything that ends with evil.com, such as devil.com or www.bollweevil.com.

You may also specify addresses in several forms (see the sketch after this list):

  • full IP (200.201.202.203)

  • subnet (200.201.202.)

  • explicit netmask (200.201.202.203/255.255.255.0)

  • CIDR (200.201.202.203/24).
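
For example, here's a sketch that combines these forms to restrict a (hypothetical) /intranet area to a handful of networks; the addresses are placeholders:

<Location /intranet>
order deny,allow
deny from all
# a single host, a partial (subnet) address, and a CIDR range:
allow from 200.201.202.203
allow from 200.201.202.
allow from 200.201.0.0/16
</Location>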

8.5.2.2 Environment-variable access control

This is a very flexible solution to some tricky problems. Apache's configuration file can set new environment variables based on patterns in the information it receives in HTTP headers. For example, here's how to serve images from /image_dir on http://www.hackenbush.com, but keep people from linking to the images from their own sites or stealing them:

# The Referer header includes the scheme, so the pattern must too:
SetEnvIf Referer "^https?://www\.hackenbush\.com/" local
<Location /image_dir>
order deny,allow
deny from all
allow from env=local
</Location>

SetEnvIf defines the environment variable local if the referring page was from the same site.

8.5.2.3 User-based access control

If you allow any .htaccess files in your Apache configuration, Apache must check for a possible .htaccess file in every directory leading to every file that it serves, on every access. This is slow: attach to a running httpd process sometime (try strace -p pid) and watch the stat() calls from all these lookups. Also, .htaccess files can be anywhere, modified by anyone, and very easy to overlook. You can get surprising interactions between your directives and those in these far-flung files. So let's fling them even farther and consider them a hazard.

Try to put your access-control directives directly in your Apache configuration file (httpd.conf or access.conf). Disallow overrides for your whole site with the following:

<Directory />
# AllowOverride is a <Directory> directive; "None" turns off .htaccess processing:
AllowOverride None
</Directory>

Any exceptions must be made in httpd.conf or access.conf, including granting the ability to use .htaccess files. You might do this if you serve many independent virtual hosts and want to let them specify their own access control and CGI scripts. But be aware that you're increasing your server's surface area.
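
If you do grant such an exception, scope it as narrowly as you can. A sketch, with a hypothetical document root for one virtual host:

<Directory /home/vhost1/htdocs>
# Let this host's .htaccess files set authentication and access rules,
# but nothing else:
AllowOverride AuthConfig Limit
</Directory>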

8.5.2.4 Combined access control

Apache's configuration mechanism has surprising flexibility, allowing you to handle some tricky requirements. For instance, to allow anyone from good.com or a registered user:

<Location />
order deny,allow
deny from all

# Here's the required domain:
allow from .good.com

# Any user in the password file:
require valid-user

# This does an "or" instead of an "and":
satisfy any
</Location>

If you leave out satisfy any, the meaning changes from or to and, a much more restrictive setting.

8.5.3 SSL

SSL (Secure Sockets Layer) encrypts traffic between the browser and an SSL-enabled web server, reached through an https: URL. It protects sensitive data in transit, including login names, passwords, personal information, and, of course, credit card numbers. SSL encryption is computationally expensive and dramatically slows down a web server without a hardware SSL accelerator. Therefore, it's common to use SSL while logging in or filling in an order form and then to use standard HTTP the rest of the time.

Until recently, people tended to buy a commercial server to offer SSL. RSA Data Security owned a patent on a public-key encryption method used by SSL, and they licensed it to companies. After the patent expired in September 2000, free implementations of Apache+SSL emerged. Two modules, Apache-SSL and mod_ssl, have competed for the lead position. mod_ssl is more popular and easier to install, and it can be integrated as an Apache DSO. It's included with Apache 2 as a standard module. For Apache 1.x, you need to get mod_ssl from http://www.modssl.org and OpenSSL from http://www.openssl.org.

Early in the SSL handshake, Apache presents a server certificate to authenticate the site's identity to the browser. Browsers have built-in lists of certificate authorities (CAs) and their credentials. If your server certificate was issued by one of these authorities, the browser will silently accept it and establish an SSL connection. The process of obtaining a server certificate involves proving your identity to a CA and paying a license fee. If the server certificate comes from an unrecognized CA or is self-signed, the browser will prompt the user to confirm or reject it. Large commercial sites pay the annual CA fees to avoid this extra step, as well as to avoid the appearance of being somehow less trustworthy.
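
For testing or internal use you can generate a self-signed certificate yourself; visitors will still see the confirmation prompt described above. A minimal sketch (the file locations are just a common mod_ssl layout, not a requirement):

# Generate an RSA key and a self-signed certificate good for one year:
openssl req -new -x509 -nodes -days 365 \
    -keyout /usr/local/apache/conf/ssl.key/server.key \
    -out /usr/local/apache/conf/ssl.crt/server.crt

# Then point mod_ssl at them in httpd.conf:
SSLCertificateFile /usr/local/apache/conf/ssl.crt/server.crt
SSLCertificateKeyFile /usr/local/apache/conf/ssl.key/server.key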

8.5.4 Sessions and Cookies

Once a customer has been authenticated for your site, you want to keep track of her. You don't want to force a login on every page, so you need a way to maintain state over time and multiple page visits.

Since HTTP is stateless, visits need to be threaded together. If a person adds items to a shopping cart, they should stay there even if the user takes side trips through the site.

A session is a sequence of interactions. It has a session ID (a unique identifier), data, and a time span. A good session ID should be difficult to guess or reverse-engineer. It may be calculated from some input variables, such as the user's IP or the time. PHP, Perl, and other languages have code to create and manage web sessions.
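
As a minimal PHP sketch (the cart variable is just an illustration), a script can start or resume a session and stash data in it, letting PHP generate and track the session ID:

<?php
// cart.php -- a minimal session sketch
session_start();                      // creates or resumes the session
if (!isset($_SESSION['cart'])) {
    $_SESSION['cart'] = array();      // new session: start an empty cart
}
$_SESSION['cart'][] = 'item42';       // survives across page visits
echo "Items in cart: " . count($_SESSION['cart']);
?>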

If the web user allows cookies in her browser, the web script may write the session ID as a variable in a cookie for your web site. If cookies are not allowed, you need to propagate the session ID with every URL. Every GET URL needs an extra variable, and every POST URL needs some hidden field to house this ID.

8.5.4.1 PHP

PHP can be configured to check every URL on a page and tack on the session ID, if needed. In php.ini, add the following:

session.use_trans_sid = 1

This is slower, since PHP needs to examine every URL on every page. It doesn't recognize URLs that are constructed within JavaScript or PHP.

Without this, you need to track the sessions yourself. If cookies are enabled in the browser, PHP defines the constant SID to be an empty string. If cookies are disabled, SID is defined as PHPSESSID=id, where id is the 32-character session ID string. To handle either case in your script, append SID to your links:

<a href="sample_link.html?<?=SID?>">link</a>

If cookies are enabled, the HTML created by the previous example would be as follows:

<a href="sample_link.html?">link</a>

If cookies are disabled, the session ID becomes part of the URL:

<a href="sample_link.html?PHPSESSID=379d65e3921501cc79df7d02cfbc24c3">link</a>

By default, session variables are written to /tmp/sess_id. Anyone who can list the contents of /tmp can hijack a session ID, or possibly forge a new one. To avoid this, change the session directory to a more secure location (outside of DocumentRoot, of course):

# in php.ini:
session.save_path = /usr/local/apache/sessions

# or in Apache's httpd.conf:
php_admin_value session.save_path /usr/local/apache/sessions

The directory and files should be owned by the web-server user ID and hidden from others:

chmod 700 /usr/local/apache/sessions
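
Assuming Apache runs as the user nobody (check the User directive in your httpd.conf for the real account), something like:

chown -R nobody /usr/local/apache/sessions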

You can also tell PHP to store session data in shared memory, a database, or some other storage method.

8.5.4.2 Perl

The Apache::Session module provides session functions for mod_perl. The session ID can be saved in a cookie or manually appended to URLs. Session storage may use the filesystem, a database, or RAM. See the documentation at http://www.perldoc.com/cpan/Apache/Session.html.

Apache provides its own language-independent session management with mod_session. This works with or without cookies (appending the session ID to the URL in the QUERY_STRING environment variable) and can exempt certain URLs, file types, and clients from session control.

8.5.5 Site Management: Uploading Files

As you update your web site, you will be editing and copying files. You may also allow customers to upload files for some purposes. How can you do this securely?

Tim Berners-Lee originally envisioned the Web as a two-way medium, where browsers could easily be authors. Unfortunately, as the Web commercialized, the emphasis was placed on browsing. Even today, the return path is somewhat awkward, and the issue of secure site management is not often discussed.

8.5.5.1 Not-so-good ideas

I mentioned form-based file uploads earlier. Although you can use this for site maintenance, it only handles one file at a time and forces you to choose it from a list or type its name.

Although FTP is readily available and simple to use, it is not recommended for many reasons. It still seems too difficult to secure FTP servers: account names and passwords are passed in the clear.

Network filesystems like NFS or SAMBA are appealing for web-site developers, since they can develop on their client machines and then drag and drop files to network folders. They are still too difficult to secure across the public Internet and are not recommended. At one time, Sun was promoting WebNFS as the next-generation, Internet-ready filesystem, but there has been little public discussion on this in the past few years. It might be possible to create a VPN using any of the available technologies, such as IPsec or PPTP.

The HTTP PUT method is not usually available in web browsers. HTML authoring tools, such as Netscape Composer and AOLPress, use PUT to upload or modify files. PUT has security implications similar to form-based file uploads, and it now looks as if it's being superseded by DAV.

Microsoft's FrontPage server extensions define web-server extensions for file uploading and other tasks. The web server and FrontPage client communicate with a proprietary RPC over HTTP. The extensions are available for Apache and Linux (http://www.rtr.com/fpsupport/index.html), but only as binaries.

FrontPage has had serious security problems in the past. The author of the presentation Apache and FrontPage at ApacheCon 2001 recommended: "If at all possible, don't use FrontPage at all." There is now an independent mod_frontpage DSO for Apache and some indications of improved security. See Features of Improved mod_frontpage (http://home.edo.uni-dortmund.de/~chripo/about/features.html) and FrontPage Server Extensions 2002 Security Under Unix (http://www.microsoft.com/TechNet/prodtechnol/sharepnt/proddocs/admindoc/owsa05.asp).

8.5.5.2 Better ideas: ssh, scp, sftp, rsync

scp and sftp are good methods for encrypted file transfer. Command-line clients are freely available for Unix/Linux, and Windows clients are available (WinSCP is free; SecureCRT is commercial). To copy many files, rsync over ssh provides an incremental, compressed, encrypted data transfer. This is especially useful when mirroring or backing up a web site. I do most of my day-to-day work on live systems with ssh, vi, scp, and rsync.
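
For example (the hostname and paths are placeholders), a compressed, incremental mirror of a local copy to the live server over ssh might look like this:

rsync -avz -e ssh --delete /home/me/site/ webhost.example.com:/usr/local/apache/htdocs/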

8.5.5.3 WebDAV

Distributed Authoring and Versioning (DAV or WebDAV) is a recent standard for remote web-based file management. DAV lets you upload, rename, delete, and modify files on a web server. It's supported in Apache (as mod_dav) and by popular client software:

  • Microsoft provides web folders with IE 5 and Windows 95 and up. These look like local directories under Explorer, but they are directories on a web server under DAV management.

  • Macromedia Dreamweaver UltraDev.

  • Adobe GoLive, InDesign, and FrameMaker.

  • Apple MacOS X iDisk.

  • OpenOffice.

To add WebDAV support to Apache, ensure that mod_dav is included:

  • Download the source from http://www.webdav.org/mod_dav/.

  • Build the module:

    ./configure --with-apxs=/usr/local/apache/bin/apxs
    make && make install
  • Add these lines to httpd.conf:

    LoadModule dav_module libexec/libdav.so
    AddModule mod_dav.c
  • Create a password file:

    htpasswd -c -s /usr/local/apache/passwords/dav.htpasswd user

In httpd.conf, enable DAV for the directories you want to make available. If you'll allow file upload, you should have some access control as well:

# The directory part of this must be writeable
# by the user ID running apache:
DAVLockDB /usr/local/apache/davlock/
DAVMinTimeout 600

# Use a Location or Directory for each DAV area.
# Here, let's try "/DAV":
<Location /DAV>
# Authentication:
AuthName "DAV"
AuthUserFile /usr/local/apache/passwords/dav.htpasswd
AuthType Basic
# Some extra protection
AllowOverride None
# Allow file listing
Options Indexes
# Don't forget this one!:
DAV On
# Let anyone read, but
# require authentication to do anything dangerous:
<LimitExcept GET HEAD OPTIONS>
require valid-user
</LimitExcept>
</Location>

The security implications of DAV are the same as for basic authentication: the name and password are passed as plain text, and you need to protect the name/password files.

DAV is easy to use and quite flexible. A new extension called DELTA-V will handle versioning, so DAV could eventually provide a web-based source-control system.

8.5.6 New Frameworks: SOAP, Web Services, and REST

The Simple Object Access Protocol (SOAP) and XML-RPC are protocols for remote procedure calls using XML over HTTP. HTTP was chosen because it usually passes through corporate firewalls, and it would be difficult to establish a new specialized protocol. With other proposed standards like Web Services Description Language (WSDL) and Universal Description, Discovery, and Integration (UDDI), some large corporations are promoting a new field called web services.

There are some concerns about this. You construct a firewall based on your knowledge that server A at port B can do C and D. But with SOAP and similar protocols, HTTP becomes a conduit for remote procedure calls. Even a stateful firewall cannot interpret the protocol to see which way the data flows or the implications of the data. That would require a packet analyzer that knows the syntax and semantics of the XML stream, which is a difficult and higher-level function.

In his Crypto-Gram web newsletter (http://www.counterpane.com/crypto-gram-0202.html#2), Bruce Schneier criticizes Microsoft's "feature-above-security mindset" for statements like these, taken from Microsoft's documentation:

Currently, developers struggle to make their distributed applications work across the Internet when firewalls get in the way...Since SOAP relies on HTTP as the transport mechanism, and most firewalls allow HTTP to pass through, you'll have no problem invoking SOAP endpoints from either side of a firewall.

Microsoft designed Outlook to execute email attachments before thinking through the security implications, and customers have spent much time purging and patching their systems after infection by a relentless stream of viruses and worms. Schneier and others feel that similar problems will emerge as attackers probe this new RPC-over-HTTP architecture.

IBM, Microsoft, and others founded the Web Services Interoperability Group (http://www.ws-i.org) to create web-services standards outside of the IETF and W3C. Security was not addressed until the first draft of Web Services Security (http://www-106.ibm.com/developerworks/webservices/library/ws-secure/) appeared in April 2002. It describes an extensible XML format for secure SOAP message exchanges. This addresses the integrity of the message, but still doesn't guarantee that the message's contents aren't harmful.

An alternative to XML-based web services is Representational State Transfer (REST), which uses only traditional web components: HTTP and URIs. A clear description is found in Second Generation Web Services (http://www.xml.com/pub/a/2002/02/20/rest.html). Its proponents argue that REST can do anything that SOAP can do, but more simply and securely. All the techniques described in this chapter, as well as functions like caching and bookmarking, could be applied, since current web standards are well established. For instance, a GET has no side effects and never modifies server state. A SOAP method may read or write, but this is a semantic agreement between the server and client that cannot be determined from the syntax of a SOAP message. See Some Thoughts About SOAP Versus REST on Security (http://www.prescod.net/rest/security.html).

As these new web services roll out, the Law of Unintended Consequences will get a good workout. Expect major surprises.

8.5.7 Robots and Spiders

A well-behaved robot is supposed to read the robots.txt file in your site's home directory. This file tells it which files and directories may be searched by web spiders to help the search engines. You should have a robots.txt file in the top directory of each web site. Exclude all directories with CGI scripts (anything marked as ScriptAlias, like /cgi-bin), images, access-controlled content, or any other content that should not be exposed to the world. Here's a simple example:

User-agent: *
Disallow: /image_dir
Disallow: /cgi-bin

Many robots are spiders, used by web search engines to help catalogue the Web's vast expanses. Good ones obey the robots.txt rules and have other indexing heuristics. They try to examine only static content and ignore things that look like CGI scripts (such as URLs containing ? or /cgi-bin). Web scripts can use the PATH_INFO environment variable and Apache rewriting rules to make CGI scripts search-engine friendly.
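
As a sketch (the script name and URL layout are hypothetical), a mod_rewrite rule can present a CGI catalog under a static-looking path, passing the rest of the path to the script via PATH_INFO:

RewriteEngine on
# /catalog/widgets looks static to a spider, but is handled by a CGI script:
RewriteRule ^/catalog/(.+)$ /cgi-bin/catalog.cgi/$1 [PT,L]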

The robot exclusion standard is documented at http://www.robotstxt.org/wc/norobots.html. More details can be found at http://www.robotstxt.org/wc/robots.html.

If a robot behaves impolitely, you can exclude it with environment variables and access control:

BrowserMatch ^evil_robot_name begone
<Location />
order allow,deny
allow from all
deny from env=begone
</Location>

An evil robot may lie about its identity in the UserAgent HTTP request header and then make a beeline to the directories it's supposed to ignore. You can craft your robots.txt file to lure it into a tarpit, which is described in the next section.

8.5.8 Detecting and Deflecting Attackers

The more attackers know about you, the more vulnerable you are. Some use port 80 fingerprinting to determine what kind of server you're running. They can also pass a HEAD request to your web server to get its version number, modules, etc.
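
You can reduce (though not eliminate) this leakage with Apache's own directives; a minimal httpd.conf sketch:

# Send only "Server: Apache" rather than version and module details:
ServerTokens Prod
# Don't append server version information to error pages and listings:
ServerSignature Off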

Script kiddies are not known for their precision, so they will often fling IIS attacks such as Code Red and Nimda at your Apache server. Look at your error_log to see how often these turn up. You can exclude them from your logs with Apache configuration tricks. A more active approach is to send email to the administrator of the offending site, using a script like NimdaNotifyer (see http://www.digitalcon.ca/nimda/). You may even decide to exclude these visitors from your site. Visit http://www.snort.org to see how to integrate an IP blocker with their intrusion detector.
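
One way to keep these probes out of your access log (the patterns are illustrative; adjust them to what actually shows up in your logs) is to mark them with SetEnvIf and log conditionally:

# Mark requests that look like IIS worm probes:
SetEnvIf Request_URI "(cmd\.exe|root\.exe|default\.ida)" worm
# Log everything else normally; marked requests are simply not logged here:
CustomLog /usr/local/apache/logs/access_log combined env=!worm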

The harried but defiant administrator might enjoy building a tarpit. This is a way to turn your network's unused IP addresses into a TCP-connection black hole. Attempts to connect to these addresses instead connect with something that will not let go. See http://www.hackbusters.net/LaBrea/ for details of a tarpit implementation.

8.5.9 Caches, Proxies, and Load Balancers

A proxy is a man in the middle. A caching proxy is a man in the middle with a memory. All the security issues of email apply to web pages as they stream about: they can be read, copied, forged, stolen, etc. The usual answer is to apply end-to-end cryptography.

If you use sessions that are linked to a specific server (stored in temporary files or shared memory rather than a database), you must somehow get every request with the same session ID directed to the same server. Some load balancers offer session affinity to do this. Without it, you'll need to store the sessions in some shared medium, like an NFS-mounted filesystem or a database.

8.5.10 Logging

The Apache log directories should be owned by root and visible to no one else. Logs can reveal sensitive information in the URLs (GET parameters) and in the Referer header. Also, an attacker can plant cross-site scripting exploits in the URLs he requests; these are written to the logs and may be triggered later when a web-based log analyzer displays them.

Logs also grow like crazy and fill up the disk. One of the more common ways to clobber a web server is to fill up the disk with log files. Use logrotate and cron to rotate them daily.
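
A minimal logrotate sketch, assuming a source-built Apache under /usr/local/apache (adjust the paths for your layout), dropped into /etc/logrotate.d/:

/usr/local/apache/logs/*_log {
    daily
    rotate 14
    compress
    missingok
    notifempty
    sharedscripts
    postrotate
        /usr/local/apache/bin/apachectl graceful > /dev/null 2>&1 || true
    endscript
}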