Introduction

Changes in the environment or the availability of food can make certain species more successful than others at finding food or avoiding predators. Many scientists believe a comet struck the Earth millions of years ago, throwing an enormous cloud of dust into the atmosphere. Subsequent radical changes to the environment proved too much for some organisms, say dinosaurs, and hastened their extinction. Other creatures, such as mammals, found new food supplies and freshly exposed habitats to compete in.

Much as the comet altered the environment for prehistoric species, the Web has altered the environment for modern programming languages. It's opened up new vistas, and although some languages have found themselves eminently unsuited to this new world order, Perl has positively thrived. Because of its strong background in text processing and system glue, Perl has readily adapted itself to the task of providing information using text-based protocols.

Architecture

The Web is driven by plain text. Web servers and web browsers communicate using a text protocol called HTTP (Hypertext Transfer Protocol). Many of the documents exchanged are encoded in a text markup system called HTML (Hypertext Markup Language). This grounding in text is the source of much of the Web's flexibility, power, and success. The only notable exception to the predominance of plain text is the Secure Sockets Layer (SSL) protocol, which encrypts other protocols like HTTP into binary data that snoopers can't decode.

Web pages are identified using the Uniform Resource Locator (URL) naming scheme. URLs look like this:

http://www.perl.com/CPAN/
http://www.perl.com:8001/bad/mojo.html
ftp://gatekeeper.dec.com/pub/misc/netlib.tar.Z
ftp://anonymous:myplace@gatekeeper.dec.com/pub/misc/netlib.tar.Z
file:///etc/motd

The first part (http, ftp, file) is called the scheme, which identifies how the file is retrieved. The next part (://) means a hostname will follow, whose interpretation depends on the scheme. After the hostname comes the path identifying the document. This path information is also called a partial URL.
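
To make the pieces concrete, here is a minimal sketch that splits a URL into its scheme, host part, and path. It uses only core Perl and the generic pattern from Appendix B of RFC 3986; the function name split_url is our own invention for illustration. Note that the "host part" it returns still includes any port number or user information, and that nothing is validated.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a URL into scheme, host part, and path, using the generic
# pattern from Appendix B of RFC 3986. No validation is attempted, and
# the host part keeps any "user@" prefix or ":port" suffix.
# (split_url is a hypothetical helper name, not a library function.)
sub split_url {
    my ($url) = @_;
    my ($scheme, $host_part, $path) =
        $url =~ m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)};
    return ($scheme, $host_part, $path);
}

for my $url (qw(
    http://www.perl.com/CPAN/
    http://www.perl.com:8001/bad/mojo.html
    file:///etc/motd
)) {
    my ($scheme, $host_part, $path) = split_url($url);
    printf "%-6s host=%-22s path=%s\n", $scheme, "'$host_part'", $path;
}
```

For the file URL, the host part comes back as an empty string, which is why three slashes appear in a row. Production code would more likely rely on a dedicated URL-handling module than on a hand-rolled pattern.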

The Web is a client-server system. Client browsers like Netscape and Lynx request documents (identified by a partial URL) from web servers like Apache. This browser-to-server dialog is governed by HTTP. Most of the time, the server merely sends back the file contents. Sometimes, however, the web server runs another program to return a document that could be HTML text, a binary image, or any other document type.

The server-to-program dialog can be handled in two ways. Either the code to handle the request is part of the web server process, or else the web server runs an external program to generate a response. The first scenario is the model of Java servlets and mod_perl (covered in Chapter 21). The second is governed by the Common Gateway Interface (CGI) protocol, so the server runs a CGI program (sometimes known as a CGI script). This chapter deals with CGI programs.

The server tells the CGI program what page was requested, what values (if any) came in through HTML forms, where the request came from, whom they authenticated as (if they authenticated at all), and much more. The CGI program's reply has two parts: headers to say "I'm sending back an HTML document," "I'm sending back a GIF image," or "I'm not sending you anything; go to this page instead," and a document body, perhaps containing image data, plain text, or HTML.
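
The two-part reply can be sketched without any modules at all. The example below is an illustration rather than a recommended way to write CGI programs: it assumes the server passes request details in environment variables named REQUEST_METHOD, SCRIPT_NAME, REMOTE_ADDR, and REMOTE_USER (the names the CGI specification uses), and the function names cgi_reply and cgi_redirect are ours.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a complete CGI reply: header lines, one empty line, then the body.
# $env stands in for %ENV so the function can be tried outside a server.
# (cgi_reply and cgi_redirect are hypothetical names for illustration.)
sub cgi_reply {
    my ($env) = @_;
    my $method = $env->{REQUEST_METHOD} // 'unknown';
    my $script = $env->{SCRIPT_NAME}    // 'unknown';
    my $from   = $env->{REMOTE_ADDR}    // 'unknown';
    my $user   = $env->{REMOTE_USER}    // 'nobody (not authenticated)';

    my $body = "Method:  $method\n"
             . "Program: $script\n"
             . "Client:  $from\n"
             . "User:    $user\n";

    # The empty line marks the end of the headers.
    return "Content-Type: text/plain\n\n" . $body;
}

# A reply that sends no document, only "go to this page instead".
sub cgi_redirect {
    my ($url) = @_;
    return "Location: $url\n\n";
}

print cgi_reply(\%ENV);
```

Run from the command line, the program reports "unknown" for everything, because no web server has filled in the environment. In real programs, the CGI module discussed next does this bookkeeping for you.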

The CGI protocol is easy to implement wrong and hard to implement right, which is why we recommend using Lincoln Stein's excellent CGI.pm module. It provides convenient functions for accessing the information the server sends you, and for preparing the CGI response the server expects. It's so useful, it's included in the standard Perl distribution, along with helper modules such as CGI::Carp and CGI::Fast. We show it off in Recipe 19.1.

Some web servers come with a Perl interpreter embedded in them. This lets Perl generate documents without starting a new process. The system overhead of serving an unchanging page is rarely noticeable, even when it happens several times a second. CGI accesses, however, start a new process for every request and can bog down the machine running the web server. Chapter 21 shows how to use mod_perl, the Perl interpreter embedded in the Apache web server, to get the benefits of CGI programs without the overhead.

Behind the Scenes

CGI programs are called each time the web server needs a dynamic document generated. It is important to understand that your CGI program doesn't run continuously, with the browser calling different parts of the program. Each request for a partial URL corresponding to your program starts a new copy. Your program generates a page for that request, then quits.

A browser can request a document in several distinct ways called methods. (Don't confuse HTTP methods with the methods of object-orientation. They have nothing to do with each other.) The GET method is the most common, indicating a simple request for a document. The HEAD method supplies information about the document without actually fetching it. The POST method submits form values.

Form values can be encoded in both GET and POST methods. With the GET method, values are encoded directly in the URL, leading to ugly URLs like this:

http://www.perl.com/cgi-bin/program?name=Johann&born=1685

With the POST method, values are encoded in a separate part of the HTTP request that the client browser sends the server. If the form values in the previous example URL were sent with a POST request, the user, server, and CGI script would all see the URL:

http://www.perl.com/cgi-bin/program
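
Either way, the values arrive in the same URL-encoded form; only the place the program finds them differs. The following sketch decodes that form by hand so you can see what is involved. It assumes the server supplies QUERY_STRING for GET and CONTENT_LENGTH plus standard input for POST, as the CGI specification describes. The function names are hypothetical, repeated field names keep only their last value, and the CGI module's own parsing is what we'd actually use in practice.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Decode one URL-encoded component: "+" means space, %XX is a hex byte.
sub url_decode {
    my ($text) = @_;
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

# Turn "name=Johann&born=1685" into a hash. In this sketch a repeated
# name keeps only its last value.
sub parse_form {
    my ($encoded) = @_;
    my %value;
    for my $pair (split /[&;]/, $encoded) {
        my ($name, $val) = split /=/, $pair, 2;
        next unless defined $name && length $name;
        $value{ url_decode($name) } = url_decode($val // '');
    }
    return %value;
}

# Fetch the raw encoded string: from QUERY_STRING for GET, or from the
# request body (read from $in) for POST. $env stands in for %ENV.
sub raw_form_data {
    my ($env, $in) = @_;
    if (($env->{REQUEST_METHOD} // 'GET') eq 'POST') {
        my $data = '';
        read($in, $data, $env->{CONTENT_LENGTH} // 0);
        return $data;
    }
    return $env->{QUERY_STRING} // '';
}

my %form = parse_form(raw_form_data(\%ENV, \*STDIN));
print "$_ = $form{$_}\n" for sort keys %form;
```

Calling parse_form("name=Johann&born=1685") yields a hash with the keys name and born, whichever method delivered the string.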

The GET and POST methods differ in another respect: idempotency. This simply means that making a GET request for a particular URL once or multiple times should make no difference. The HTTP protocol definition says that a GET request may be cached by the browser, the server, or an intervening proxy. POST requests cannot be cached, because each request is independent and matters. Typically, POST requests change or depend on the state of the server (querying or updating a database, sending mail, or purchasing a computer).

Most servers log requests to a file (the access log) for later analysis by the webmaster. Error messages produced by CGI programs don't by default go to the browser. Instead they are logged to a file on the server (the error log), and the browser simply gets a "500 Server Error" message, which means that the CGI program didn't uphold its end of the CGI bargain.
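
The split between what the browser sees and what the error log records can be illustrated with a small wrapper. This is a sketch under assumptions, not the approach of the recipes that follow: it assumes the server routes the program's standard error to its error log (the usual arrangement), it uses the CGI "Status" header to report the failure, and guarded_reply is a name we made up.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run $work, a code reference that returns the page body. On success,
# return a normal reply. On failure, write the details to STDERR (which
# the web server normally appends to its error log) and return a terse
# 500 reply that reveals nothing about the cause.
# (guarded_reply is a hypothetical helper, not part of any module.)
sub guarded_reply {
    my ($work) = @_;
    my $body = eval { $work->() };
    if (!defined $body) {
        my $err = $@ || "unknown error\n";
        warn "$0: $err";
        return "Status: 500 Internal Server Error\n"
             . "Content-Type: text/plain\n\n"
             . "Something went wrong. The details are in the server's error log.\n";
    }
    return "Content-Type: text/plain\n\n" . $body;
}

print guarded_reply(sub { "All is well.\n" });
```

Passing a code reference that calls die produces the 500 reply, while the message given to die goes only to standard error.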

Error messages are useful in debugging any program, but they are especially so with CGI scripts. Sometimes, though, the authors of CGI programs either don't have access to the error log or don't know where it is. Sending error messages to a more convenient location is discussed in Recipe 19.2. Tracking down errors is covered in Recipe 19.3.

Recipe 19.8 shows how to learn what your browser and server are really saying to one another. Unfortunately, some browsers do not implement the HTTP specification correctly, and this recipe helps you determine whether your program or your browser is the cause of a problem.

Security

CGI programs let anyone run a program on your system. Sure, you get to pick the program, but the anonymous user from Out There can send unexpected values, hoping to trick it into doing the wrong thing. Thus security is a big concern on the Web.

Some sites address this concern by banning CGI programs. Sites that can't do without the power and utility of CGI programs must find ways to secure their CGI programs. Recipe 19.4 gives a checklist of considerations for writing a secure CGI script, briefly covering Perl's tainting mechanism for guarding against accidental use of unsafe data. Recipe 19.5 shows how your CGI program can safely run other programs.
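
As a small taste of what guarding against unsafe data looks like, here is a sketch that accepts a form value only if it matches a strict pattern. The rule itself (1 to 32 letters, digits, underscores, or hyphens) and the name clean_username are assumptions chosen for illustration; your own program must decide what counts as acceptable. When Perl runs with the -T switch, a regular expression capture like the one below is also how a checked value becomes untainted.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Accept a value only if it looks like a plain username: 1 to 32
# letters, digits, underscores, or hyphens. Returns the captured text,
# or undef if the value contains anything else.
# Under taint mode (-T), the capture $1 is the untainted copy.
# (clean_username and its rule are hypothetical, for illustration only.)
sub clean_username {
    my ($input) = @_;
    return undef unless defined $input;
    return $input =~ /\A([A-Za-z0-9_-]{1,32})\z/ ? $1 : undef;
}

for my $try ('gnat', 'tchrist; rm -rf /', '') {
    my $clean = clean_username($try);
    printf "%-22s => %s\n", "'$try'",
        defined $clean ? "ok ($clean)" : 'rejected';
}
```

The pattern lists what is allowed rather than what is forbidden, so characters you never thought of are rejected along with the ones you did.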

HTML and Forms

Some HTML tags let you create forms, where the user can fill in values to submit to the server. The forms are composed of widgets, such as text entry fields and check boxes. CGI programs commonly return HTML, so the CGI module has helper functions to create HTML for everything from tables to form widgets.

In addition to Recipe 19.6, this chapter has Recipe 19.10, which shows how to create forms that retain values over multiple calls. Recipe 19.11 shows how to make a single CGI script that produces and responds to a set of pages, such as a product catalog and ordering system.

Web-Related Resources

Unsurprisingly, some of the best references on the Web are found on the Web:

WWW Security FAQ

http://www.w3.org/Security/Faq/

Web FAQ

http://www.boutell.com/faq/

CGI FAQ

http://www.webthing.com/tutorials/cgifaq.html

HTTP Specification

http://www.w3.org/pub/WWW/Protocols/HTTP/

HTML Specification

http://www.w3.org/TR/REC-html40/

http://www.w3.org/pub/WWW/MarkUp/

CGI Specification

http://www.w3.org/CGI/

CGI Security FAQ

http://www.go2net.com/people/paulp/cgi-security/safe-cgi.txt

We recommend CGI Programming with Perl, by Scott Guelich, Shishir Gundavaram, and Gunther Birznieks (O'Reilly); HTML & XHTML: The Definitive Guide, by Chuck Musciano and Bill Kennedy (O'Reilly); and HTTP: The Definitive Guide, by David Gourley, Brian Totty, et al. (O'Reilly).