1.3 The Development of mod_perl 1.0

Of the various attempts to improve on mod_cgi's shortcomings, mod_perl has proven to be one of the better solutions and has been widely adopted by CGI developers. Doug MacEachern wrote the core code of this Apache module and licensed it under the Apache Software License, which is a certified open source license.

mod_perl embeds the Perl interpreter into Apache's child processes, thus avoiding the fork that mod_cgi needs in order to run each Perl program. In this new model, the child process doesn't exit when it has processed a request. The Perl interpreter is loaded only once, when the process is started. Since the interpreter is persistent throughout the process's lifetime, all code is loaded and compiled only once, the first time it is needed. All subsequent requests run much faster, because everything is already loaded and compiled. Response processing is reduced to simply running the code, which improves response times by a factor of 10-100, depending on the code being executed.

But Doug's real accomplishment was adding a mod_perl API to the Apache core. This made it possible to write complete Apache modules in Perl, a feat that used to require coding in C. From then on, mod_perl enabled the programmer to handle all phases of request processing in Perl.

The mod_perl API also allows complete server configuration in Perl. This has made the lives of many server administrators much easier, as they now benefit from dynamically generating the configuration and are freed from hunting for bugs in huge configuration files full of similar directives for virtual hosts and the like.^[8]

^[8] mod_vhost_alias offers similar functionality.
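As a rough illustration of what generating configuration from Perl can look like, the sketch below builds the settings for several virtual hosts in a loop instead of repeating near-identical directives by hand. In mod_perl 1.0, code of this kind would normally sit inside a <Perl> section of httpd.conf, where mod_perl reads package variables such as %VirtualHost. Here it is written as an ordinary script so that it runs on its own. The host names, addresses, paths, and the build_vhosts helper are all invented for the example.

```perl
use strict;
use warnings;

# Hypothetical site list; in practice it might come from a file or database.
my @sites = qw(alpha beta gamma);

# Inside a <Perl> section mod_perl would read this hash; in this standalone
# sketch it is just a normal package variable.
our %VirtualHost;

# Build one configuration entry per site, keyed by a made-up IP address.
sub build_vhosts {
    my @names = @_;
    my %vhosts;
    for my $i (0 .. $#names) {
        my $name = $names[$i];
        my $addr = "10.0.0." . ($i + 1);
        $vhosts{$addr} = {
            ServerName   => "$name.example.com",
            DocumentRoot => "/home/httpd/$name/htdocs",
            ErrorLog     => "/home/httpd/$name/logs/error_log",
        };
    }
    return %vhosts;
}

%VirtualHost = build_vhosts(@sites);

# Show what was generated.
for my $addr (sort keys %VirtualHost) {
    print "$addr => $VirtualHost{$addr}{ServerName}\n";
}
```

Adding a host then means adding one name to the list rather than copying a block of directives.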

To provide backward compatibility for plain CGI scripts that used to be run under mod_cgi, while still benefiting from a preloaded Perl interpreter and modules, a few special handlers were written, each staying closer to or further from pure mod_perl functionality. Some take full advantage of mod_perl, while others do not.

mod_perl embeds a copy of the Perl interpreter into the Apache httpd executable, providing complete access to Perl functionality within Apache. This enables a set of mod_perl-specific configuration directives, all of which start with the string Perl. Most, but not all, of these directives are used to specify handlers for various phases of the request.

It might occur to you that sticking a large executable (Perl) into another large executable (Apache) creates a very, very large program. mod_perl certainly makes httpd significantly bigger, and you will need more RAM on your production server to be able to run many mod_perl processes. However, in reality, the situation is not as bad as it first appears. mod_perl processes requests much faster, so the number of processes needed to handle the same request rate is much lower relative to the mod_cgi approach. Generally, you need slightly more available memory, but the speed improvements you will see are well worth every megabyte of memory you can add. Techniques that can reduce memory requirements are covered in Chapter 10.

According to http://netcraft.com/, as of January 2003, mod_perl was in use on more than four million web sites. Some of these sites have been using mod_perl since its early days. You can see an extensive list of sites that use mod_perl at http://perl.apache.org/outstanding/sites.html or http://perl.apache.org/outstanding/success_stories/. The latest usage statistics can be viewed at http://perl.apache.org/outstanding/stats/.

1.3.1 Running CGI Scripts with mod_perl

Since many web application developers are interested in the content delivery phase and come from a CGI background, mod_perl includes packages designed to make the transition from CGI simple and painless. Apache::PerlRun and Apache::Registry run unmodified CGI scripts, albeit much faster than mod_cgi.^[9]

^[9] Apache::RegistryNG and Apache::RegistryBB are two new experimental modules that you may want to try as well.

The difference between Apache::Registry and Apache::PerlRun is that Apache::Registry caches all scripts, and Apache::PerlRun doesn't. To understand why this matters, remember that if one of mod_perl's benefits is added speed, another is persistence. Just as the Perl interpreter is loaded only once, at child process startup, your scripts are loaded and compiled only once, when they are first used. This can be a double-edged sword: persistence means global variables aren't reset to initial values, and file and database handles aren't closed when the script ends. This can wreak havoc in badly written CGI scripts.

Whether you should use Apache::Registry or Apache::PerlRun for your CGI scripts depends on how well written your existing Perl scripts are. Some scripts initialize all variables, close all file handles, use taint mode, and give only polite error messages. Others don't.

Apache::Registry compiles scripts on first use and keeps the compiled scripts in memory. On subsequent requests, all the needed code (the script and the modules it uses) is already compiled and loaded in memory. This gives you enormous performance benefits, but it requires that scripts be well behaved.

Apache::PerlRun, on the other hand, compiles scripts at each request. The script's namespace is flushed and is fresh at the start of every request. This allows scripts to enjoy the basic benefit of mod_perl (i.e., not having to load the Perl interpreter) without requiring poorly written scripts to be rewritten.

A typical problem some developers encounter when porting from mod_cgi to Apache::Registry is the use of uninitialized global variables. Consider the following script:

use CGI;
$q = CGI->new( );
$topsecret = 1 if $q->param("secret") eq 'Muahaha';
# ...
if ($topsecret) {
    display_topsecret_data( );
}
else {
    security_alert( );
}

This script will always do the right thing under mod_cgi: if secret=Muahaha is supplied, the top-secret data will be displayed via display_topsecret_data( ), and if the authentication fails, the security_alert( ) function will be called. This works only because under mod_cgi, all globals are undefined at the beginning of each request.

Under Apache::Registry, however, global variables preserve their values between requests. Now imagine a situation where someone has successfully authenticated, setting the global variable $topsecret to a true value. From now on, anyone whose request is served by the same process can access the top-secret data without knowing the secret phrase, because $topsecret will stay true until the process dies or the variable is modified elsewhere in the code.

This is an example of sloppy code. It will do the right thing under Apache::PerlRun, since all global variables are undefined before each iteration of the script. However, under Apache::Registry and mod_perl handlers, all global variables must be initialized before they can be used.
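The leak can be reproduced without a web server. The following sketch is only a simulation, under the assumption that a subroutine called repeatedly in one process behaves like a script cached by Apache::Registry. The handle_request function and the plain hash of parameters are stand-ins invented for the example, replacing the compiled script body and the CGI object.

```perl
use strict;
use warnings;

# The global from the sloppy script; nothing resets it between calls.
our $topsecret;

# Stand-in for the cached script body: compiled once, called per request.
sub handle_request {
    my (%param) = @_;
    $topsecret = 1 if ($param{secret} // '') eq 'Muahaha';
    return $topsecret ? 'top-secret data' : 'security alert';
}

my @results = (
    handle_request(secret => 'wrong'),     # nobody has authenticated yet
    handle_request(secret => 'Muahaha'),   # sets the global
    handle_request(secret => 'wrong'),     # the global is still true
);
print "$_\n" for @results;
```

The third call is granted access even though it supplied the wrong phrase, which is exactly the behavior described above.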

The example can be fixed in a few ways. It's a good idea to always enable strict mode, which requires global variables to be declared before they are used:

use strict;
use CGI;
use vars qw($topsecret $q);
# init globals
$topsecret = 0;
$q = undef;
# code
$q = CGI->new( );
$topsecret = 1 if $q->param("secret") eq 'Muahaha';
# ...

But of course, the simplest solution is to avoid using globals where possible. Let's look at the example rewritten without globals:

use strict;
use CGI;
my $q = CGI->new( );
my $topsecret = $q->param("secret") eq 'Muahaha' ? 1 : 0;
# ...

The last two versions of the example will run perfectly under Apache::Registry.

Here is another example that won't work correctly under Apache::Registry. This example presents a simple search engine script:

use CGI;
my $q = CGI->new( );
print $q->header('text/plain');
my @data = read_data( );
my $pat = $q->param("keyword");
foreach (@data) {
    print if /$pat/o;
}

The example retrieves some data using read_data( ) (e.g., the lines of a text file), tries to match the keyword submitted by a user against this data, and prints the matching lines. The /o regular expression modifier is used to compile the regular expression only once, to speed up the matches. Without it, the regular expression will be recompiled once for every element of the @data array.

Now consider that someone is using this script to search for something inappropriate. Under Apache::Registry, the pattern will be cached and won't be recompiled in subsequent requests, meaning that the next person using this script (running in the same process) may receive something quite unexpected as a result. Oops.

The proper solution to this problem is discussed in Chapter 6, but Apache::PerlRun provides an immediate workaround, since it resets the regular expression cache before each request.
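The stale-pattern effect is easy to observe in a standalone script. The sketch below is an illustration only: the sample data and the functions search_once and search_fresh are invented, and the qr// variant is one common way to avoid the problem, not necessarily the approach taken in Chapter 6.

```perl
use strict;
use warnings;

my @data = ("apple pie\n", "banana split\n", "cherry tart\n");

# With /o the pattern is interpolated and compiled only the first time this
# match operator runs; later values of $pat are silently ignored.
sub search_once {
    my ($pat) = @_;
    return grep { /$pat/o } @data;
}

# Alternative: compile the pattern once per call with qr//, so each
# request gets its own pattern but still avoids per-line recompilation.
sub search_fresh {
    my ($pat) = @_;
    my $re = qr/$pat/;
    return grep { $_ =~ $re } @data;
}

print "first:  ", search_once('apple');
print "second: ", search_once('banana');   # still matches "apple"
print "fresh:  ", search_fresh('banana');
```

The second call to search_once returns the line matched by the first keyword, just as the next user of the cached script would see results for someone else's search.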

So why bother to keep your code clean? Why not use Apache::PerlRun all the time? As we mentioned earlier, the convenience provided by Apache::PerlRun comes at a price of performance deterioration.

In Chapter 9, we show in detail how to benchmark the code and server configuration. Based on the results of the benchmark, you can tune the service for the best performance. For now, let's just show the benchmark of the short script in Example 1-6.

Example 1-6. readdir.pl

use strict;

use CGI ( );
use IO::Dir ( );

my $q = CGI->new;
print $q->header("text/plain");
my $dir = IO::Dir->new(".");
print join "\n", $dir->read;

The script loads two modules (CGI and IO::Dir), prints the HTTP header, and prints the contents of the current directory. If we compare the performance of this script under mod_cgi, Apache::Registry, and Apache::PerlRun, we get the following results:

  Mode                Requests/sec
  --------------------------------
  Apache::Registry             473
  Apache::PerlRun              289
  mod_cgi                       10

Because the script does very little, the fixed per-request overhead dominates, and the performance differences between the three modes are very pronounced. Apache::Registry thoroughly outperforms mod_cgi, and you can see that Apache::PerlRun is also much faster than mod_cgi, although it handles only about 60% as many requests per second as Apache::Registry. The performance gap usually shrinks a bit as more code is added, because the overhead of fork( ) and code compilation becomes less significant compared to execution time. But the overall ranking won't change significantly.

Jumping ahead, if we convert the script in Example 1-6 into a mod_perl handler, we can reach 517 requests per second under the same conditions, which is a bit faster than Apache::Registry. In Chapter 13, we discuss why running the code under the Apache::Registry handler is a bit slower than using a pure mod_perl content handler.

This benchmark makes it clear that Apache::Registry is what you should use for your scripts to get the most out of mod_perl. But Apache::PerlRun is still quite useful for making an easy transition to mod_perl. With Apache::PerlRun, you can get a significant performance improvement over mod_cgi with minimal effort.

Later, we will see that Apache::Registry's caching mechanism is implemented by compiling each script in its own namespace. Apache::Registry builds a unique package name using the script's name, the current URI, and the current virtual host (if any). Apache::Registry prepends a package statement to your script, then compiles it using Perl's eval function. In Chapter 6, we will show how exactly this is done.
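Until then, a much simplified sketch can convey the idea. It is not Apache::Registry's actual code: the Sketch::ROOT prefix, the escaping rule, and the helper names package_for and compile_script are assumptions made for illustration, and only the URI is used to build the package name.

```perl
use strict;
use warnings;

# Turn a URI such as /perl/hello.pl into a legal Perl package name by
# replacing every non-word character with an underscore and its hex code.
sub package_for {
    my ($uri) = @_;
    (my $name = $uri) =~ s/([^A-Za-z0-9_])/sprintf("_%02x", ord $1)/ge;
    return "Sketch::ROOT::$name";
}

# Prepend a package statement, wrap the script source in a subroutine,
# and compile the result once with a string eval.
sub compile_script {
    my ($uri, $source) = @_;
    my $package = package_for($uri);
    my $wrapped = "package $package; sub handler { $source\n} 1;";
    eval $wrapped or die "compile error: $@";
    return $package->can('handler');
}

my $handler = compile_script('/perl/hello.pl',
                             'return "hello from " . __PACKAGE__;');
print $handler->(), "\n";
```

Because each script ends up in a package of its own, two scripts that both define a global named $q or a subroutine named init do not collide, and the compiled handler can be called again for later requests without recompiling.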

What happens if you modify the script's file after it has been compiled and cached? Apache::Registry checks the file's last-modification time, and if the file has changed since the last compile, it is reloaded and recompiled.
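The check can be sketched as a small cache keyed by file path. This is a minimal illustration under stated assumptions, not the module's real implementation: the %cache layout and the load_script function are hypothetical, the script is compiled as an anonymous subroutine, and the demonstration uses a temporary file whose modification time is pushed forward by hand.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my %cache;   # path => { mtime => ..., code => ... }

# Return a compiled subroutine for the file, recompiling only when the
# file's modification time differs from the one recorded at last compile.
sub load_script {
    my ($path) = @_;
    my $mtime = (stat $path)[9];
    die "cannot stat $path: $!" unless defined $mtime;

    my $entry = $cache{$path};
    return $entry->{code} if $entry && $entry->{mtime} == $mtime;

    open my $in, '<', $path or die "cannot open $path: $!";
    my $source = do { local $/; <$in> };
    close $in;

    my $code = eval "sub { $source\n}" or die "compile error: $@";
    $cache{$path} = { mtime => $mtime, code => $code };
    return $code;
}

# Demonstration: write a script, run it, modify it, run it again.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} 'return "version 1";';
close $fh;
print load_script($path)->(), "\n";

open $fh, '>', $path or die "cannot rewrite $path: $!";
print {$fh} 'return "version 2";';
close $fh;
my $later = time() + 10;
utime $later, $later, $path;   # make sure the mtime differs
print load_script($path)->(), "\n";
```

Calling load_script twice without touching the file returns the same cached subroutine, so the cost of the check is a single stat call per request.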

If a compilation or execution error occurs, the error is logged to the server's error log, and a server error is returned to the client.