12.5 Adding a Proxy Server in httpd Accelerator Mode

We have already presented a solution with two servers: one plain Apache server, which is very light and configured to serve static objects, and the other with mod_perl enabled (very heavy) and configured to serve mod_perl scripts and handlers. We named them httpd_docs and httpd_perl, respectively.

In the dual-server setup presented earlier, the two servers coexist at the same IP address by listening to different ports: httpd_docs listens to port 80 (e.g., http://www.example.com/images/test.gif) and httpd_perl listens to port 8000 (e.g., http://www.example.com:8000/perl/test.pl). Note that we did not write http://www.example.com:80 for the first example, since port 80 is the default port for the HTTP service. Later on, we will change the configuration of the httpd_docs server to make it listen to port 81.

This section will attempt to convince you that you should really deploy a proxy server in httpd accelerator mode. This is a special mode that, in addition to providing the normal caching mechanism, accelerates your CGI and mod_perl scripts by taking the responsibility of pushing the produced content to the client, thereby freeing your mod_perl processes. Figure 12-3 shows a configuration that uses a proxy server, a standalone Apache server, and a mod_perl-enabled Apache server.

Figure 12-3. A proxy server, standalone Apache, and mod_perl-enabled Apache

The advantages of using the proxy server in conjunction with mod_perl are:

You get all the benefits of the usual use of a proxy server that serves static objects from the proxy's cache. You get less I/O activity reading static objects from the disk (the proxy serves the most "popular" objects from RAM; of course, you benefit more if you allow the proxy server to consume more RAM), and since you do not wait for the I/O to be completed, you can serve static objects much faster.
You get the extra functionality provided by httpd accelerator mode, which makes the proxy server act as a sort of output buffer for the dynamic content. The mod_perl server sends the entire response to the proxy and is then free to deal with other requests. The proxy server is responsible for sending the response to the browser. This means that if the transfer is over a slow link, the mod_perl server is not waiting around for the data to move.
This technique allows you to hide the details of the server's implementation. Users will never see ports in the URLs (more on that topic later). You can have a few boxes serving the requests and only one serving as a frontend, which spreads the jobs between the servers in a way that you can control. You can actually shut down a server without the user even noticing, because the frontend server will dispatch the jobs to other servers. This is called load balancing. It's too big an issue to cover here, but there is plenty of information available on the Internet (refer to Section 12.16 at the end of this chapter).
For security reasons, using an httpd accelerator (or a proxy in httpd accelerator mode) is essential because it protects your internal server from being directly attacked by arbitrary packets. The httpd accelerator and internal server communicate only expected HTTP requests, and usually only specific URI namespaces get proxied. For example, you can ensure that only URIs starting with /perl/ will be proxied to the backend server. Assuming that there are no vulnerabilities that can be triggered via some resource under /perl, this means that only your public "bastion" accelerating web server can get hosed in a successful attack; your backend server will be left intact. Of course, don't consider your web server to be impenetrable because it's accessible only through the proxy. Proxying it reduces the number of ways a cracker can get to your backend server; it doesn't eliminate them all.
Your server will be effectively impenetrable if it listens only on ports on your localhost (127.0.0.1), which makes it impossible to connect to your backend machine from the outside. But you don't need to connect from the outside anymore, as you will see when you proceed to this technique's implementation notes.

In addition, if you use some sort of access control, authentication, and authorization at the frontend server, it's easy to forget that users can still access the backend server directly, bypassing the frontend protection. By making the backend server directly inaccessible you prevent this possibility.
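The two security points above (proxying only the /perl/ namespace and binding the backend to localhost) can be sketched as the following configuration fragment. This is only an illustration under stated assumptions, not the setup developed later in the chapter: it assumes that the frontend is an Apache server with mod_proxy available, that both servers run on the same machine, and that the backend keeps port 8000 as in the dual-server example.

```apache
# Frontend httpd.conf (assumes mod_proxy is compiled in or loaded).
# Only URIs under /perl/ are handed to the backend; everything else
# is served by the frontend itself.
ProxyPass        /perl/ http://127.0.0.1:8000/perl/
ProxyPassReverse /perl/ http://127.0.0.1:8000/perl/

# Backend (httpd_perl) httpd.conf.
# Accept connections on the loopback interface only, so the backend
# cannot be reached directly from the outside.
Listen 127.0.0.1:8000
```

With a layout like this, any access control configured on the frontend cannot be bypassed by connecting to port 8000 from another machine, because the backend never listens on an external interface.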

Of course, there are drawbacks. Luckily, these are not functionality drawbacks; they are more administration hassles. The disadvantages are:

You have another daemon to worry about, and while proxies are generally stable, you have to make sure to prepare proper startup and shutdown scripts, which are run at boot and reboot as appropriate. This is something that you do once and never come back to again. Also, you might want to set up the crontab to run a watchdog script that will make sure that the proxy server is running and restart it if it detects a problem, reporting the problem to the administrator on the way. Chapter 5 explains how to develop and run such watchdogs.
Proxy servers can be configured to be light or heavy. The administrator must decide which gives the highest performance for the application. A proxy server such as Squid is light in the sense of having only one process serving all requests, but it can consume a lot of memory when it loads objects into memory for faster service.
If you use the default logging mechanism for all requests on the front- and backend servers, the requests that are proxied to the backend server will be logged twice, which makes it tricky to merge the two log files, should you want to. Therefore, if all accesses to the backend server are done via the frontend server, it's best to turn off logging on the backend server.
If the backend server is also accessed directly, bypassing the frontend server, you want to log only the requests that don't go through the frontend server. One way to tell whether a request was proxied or not is to use mod_proxy_add_forward, presented later in this chapter, which sets the HTTP header X-Forwarded-For for all proxied requests. So if the default logging is turned off, you can add a custom PerlLogHandler that logs only requests made directly to the backend server.

If you still decide to log proxied requests at the backend server, they might not contain all the information you need, since instead of the real remote IP of the user, you will always get the IP of the frontend server. Again, mod_proxy_add_forward, presented later, provides a solution to this problem.

Let's look at a real-world scenario that shows the importance of the proxy httpd accelerator mode for mod_perl.

First, let's explain an abbreviation used in the networking world. If someone claims to have a 56-kbps connection, it means that the connection is made at 56 kilobits per second (~56,000 bits/sec). It's not 56 kilobytes per second, but 7 kilobytes per second, because 1 byte equals 8 bits. So don't let the merchants fool you: your modem gives you a 7 kilobytes-per-second connection at most, not 56 kilobytes per second, as one might think.

Another convention used in computer literature is that 10 Kb usually means 10 kilobits and 10 KB means 10 kilobytes. An uppercase B generally refers to bytes, and a lowercase b refers to bits (K of course means kilo and equals 1,024 or 1,000, depending on the field in which it's used). Remember that this convention is not followed everywhere, so use this knowledge with care.

In the typical scenario (as of this writing), users connect to your site with 56-kbps modems. This means that the speed of the user's network link is 56/8 = 7 KB per second. Let's assume that an average generated HTML page is 42 KB and that an average mod_perl script generates this response in 0.5 seconds. How many responses could this script produce during the time it takes for the output to be delivered to the user? A simple calculation reveals pretty scary numbers:

42 KB / (0.5 s × 7 KB/s) = 12

Twelve other dynamic requests could be served at the same time, if we could let mod_perl do only what it's best at: generating responses.

This very simple example shows us that we need only one-twelfth the number of children running, which means that we will need only one-twelfth of the memory.
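The arithmetic above can be checked with a short script. This is a minimal sketch: the function names are hypothetical and chosen for illustration, and the inputs are the figures assumed in the text (a 56-kbps link, a 42 KB page, and 0.5 seconds of generation time).

```python
def link_speed_kbytes(kbps: float) -> float:
    """Convert a link speed from kilobits per second to kilobytes per second."""
    return kbps / 8


def responses_per_delivery(page_kb: float, gen_seconds: float, link_kbps: float) -> float:
    """Number of responses that could be generated while one is being delivered."""
    delivery_seconds = page_kb / link_speed_kbytes(link_kbps)
    return delivery_seconds / gen_seconds


if __name__ == "__main__":
    print(link_speed_kbytes(56))                # 7.0
    print(responses_per_delivery(42, 0.5, 56))  # 12.0
```

Delivering the page takes 42 / 7 = 6 seconds, while generating it takes only 0.5 seconds, which is where the factor of twelve comes from.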

But you know that nowadays scripts often return pages that are blown up with JavaScript and other code, which can easily make them 100 KB in size. Can you calculate what the download time for a file that size would be?

Furthermore, many users like to open multiple browser windows and do several things at once (e.g., download files and browse graphically heavy sites). So the speed of 7 KB/sec we assumed before may in reality be 5-10 times slower. This is not good for your server.
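To get a feeling for these numbers, here is a small sketch that estimates the delivery time of a 100 KB page over the 7 KB/sec link, first at full speed and then when the effective speed is 5 or 10 times lower. The function name and the slowdown parameter are hypothetical, introduced only for this illustration.

```python
def download_seconds(page_kb: float, link_kbytes_per_sec: float, slowdown: float = 1.0) -> float:
    """Seconds needed to deliver page_kb over a link whose effective speed
    is link_kbytes_per_sec divided by slowdown."""
    effective_speed = link_kbytes_per_sec / slowdown
    return page_kb / effective_speed


if __name__ == "__main__":
    # 100 KB page, 7 KB/sec link, at full speed and 5x / 10x slower
    for slowdown in (1, 5, 10):
        seconds = download_seconds(100, 7, slowdown)
        print(f"{slowdown:>2}x slower: {seconds:6.1f} s")
```

At full speed the page takes roughly 14 seconds to deliver; with the link 5 to 10 times slower, it takes somewhere between about 71 and 143 seconds, during which a mod_perl process without a proxy in front of it would stay tied up.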

Considering the last example and taking into account all the other advantages that the proxy server provides, we hope that you are convinced that despite a small administration overhead, using a proxy is a good thing.

Of course, if you are on a very fast local area network (LAN), meaning that all your users connect from this network and not from the outside, the big benefit of the proxy buffering the output and feeding a slow client is gone. You are probably better off sticking with a straight mod_perl server in this case.

Two proxy implementations are known to be widely used with mod_perl: the Squid proxy server and the mod_proxy Apache module. We'll discuss these in the next sections.