D.2 Hypertext Transfer Protocol

As discussed in Chapter 1, HTTP is the standard that allows documents to be communicated and shared over the Web. From a network perspective, HTTP is an application-layer protocol that is built on top of TCP/IP. Since the original version, HTTP/0.9, there have only been two revisions of the HTTP standard. HTTP/1.0 was released as RFC-1945[1] in May 1996 and HTTP/1.1 as RFC-2616 in June 1999.

[1] Request for Comments, or RFCs, are submitted to the RFC editor (http://www.rfc-editor.org) usually by authors attached to organizations such as the Internet Engineering Task Force (IETF at http://www.ietf.org). RFCs date back to the early ARPAnet days and are used to present networking protocols, procedures, programs, and concepts. They also include meeting notes, opinions, bad poems, and other humor: RFC-2324 describes the Hypertext Coffee Pot Control Protocol.

In Chapter 1, we told you that HTTP is very simple: a client (most conspicuously a web browser) sends a request for some resource to a web (HTTP) server, and the server sends back a response. The HTTP response carries the resource (the HTML document or image or whatever) as its payload back to the client.

Continuing our analogy from the previous section, HTTP is a kind of cover letter, like a fax cover sheet, that is stored in an envelope and tells the receiver what language the document is in, how to read the letter, and how to reply.

D.2.1 Uniform Resource Locators

Uniform resource locators, more commonly known as URLs, are used as the primary naming and addressing method of the Web. URLs belong to the larger class of uniform resource identifiers (URIs); both identify resources, but URLs include specific host details that allow connection to a server that holds the resource.

A URL can be broken into three basic parts: first, the protocol identifier; second, the host and service identifier; and, last, a resource identifier that contains a path with optional parameters and an optional query that identifies the resource. The following example shows a URL that identifies an HTTP resource:
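As a rough illustration of these three parts, the following sketch uses PHP's parse_url( ) function to split a URL into its components. The URL is a hypothetical one made up for this sketch so that every part is present.

```php
<?php
// Hypothetical URL, made up for this sketch so that every part is present.
$url = 'http://example.com:8080/marketing/home.html?q=red';

// parse_url() returns an associative array of the components it finds.
$parts = parse_url($url);

echo 'Protocol: ', $parts['scheme'], "\n";                          // http
echo 'Host and port: ', $parts['host'], ':', $parts['port'], "\n"; // example.com:8080
echo 'Path: ', $parts['path'], "\n";                                // /marketing/home.html
echo 'Query: ', $parts['query'], "\n";                              // q=red
```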


The HTTP standard doesn't place any limit on the length of a URL, but some older browsers and proxy servers do. The structure of a URL is formally described by RFC-2396: Uniform Resource Identifiers (URI): Generic Syntax.

D.2.1.1 Protocol

The first part of the URL identifies the application protocol. HTTP URLs start with the familiar http://. Other applications that use URLs to locate resources identify different protocols; for example, URLs used with the File Transfer Protocol (FTP) begin with ftp://. URLs that identify HTTP resources served over connections that are encrypted using the Secure Sockets Layer start with https://. We discuss the use of the Secure Sockets Layer to protect data transmitted over the Internet in Chapter 11.

D.2.1.2 Host and service identification

The next part of the HTTP URL identifies the host on which the web server is running, and the port on which the server listens for HTTP requests. Either the domain name or the IP address can identify the host. Using the domain name allows user-friendly web addresses such as http://www.oreilly.com; the equivalent URL replaces the domain name with the IP address of the host.


Domain names are not case sensitive.

D.2.1.3 Nonstandard TCP ports

By default, an HTTP server listens for requests on port 80. So, for example, requests for the URL http://www.oreilly.com are made to the host machine www.oreilly.com on port 80. When a nonstandard port is used, the URL must include the port number so the browser can successfully connect to the service. For example, the URL http://example.com:8080 connects to the web server running on port 8080 on the host example.com.
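The default-port rule can be sketched in a few lines of PHP. The function name effectivePort( ) is a hypothetical helper introduced for this sketch, not a PHP built-in, and it only considers http:// URLs.

```php
<?php
// effectivePort() is a hypothetical helper: it returns the TCP port a
// client would connect to for an http:// URL.
function effectivePort(string $url): int
{
    // parse_url() yields null when the URL carries no explicit port.
    $port = parse_url($url, PHP_URL_PORT);

    // Fall back to the HTTP default when no port is given.
    return is_int($port) ? $port : 80;
}

echo effectivePort('http://www.oreilly.com'), "\n";  // 80
echo effectivePort('http://example.com:8080'), "\n"; // 8080
```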

D.2.1.4 Resource identification

The remaining URL components help locate a specific resource. The path, with optional parameters, and an optional query are processed by the web server to locate or compute a response.

The path often corresponds to an actual file path on the host's filesystem. For example, an Apache web server running on a Unix machine that hosts example.com may store all the web content under the directory /usr/local/apache2/htdocs and be configured to use the path component of the URL relative to that directory. In this case, the HTTP response to the URL http://example.com/marketing/home.html contains the file /usr/local/apache2/htdocs/marketing/home.html.
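The mapping from URL path to file can be sketched as follows. The helper resolveDocumentPath( ) is hypothetical and deliberately minimal: a real web server also deals with aliases, directory index files, and checks against paths that escape the document root, none of which are shown here.

```php
<?php
// resolveDocumentPath() is a hypothetical helper that maps the path
// component of a URL onto a file beneath a document root directory.
function resolveDocumentPath(string $url, string $documentRoot): string
{
    $path = parse_url($url, PHP_URL_PATH);

    // A URL without a path refers to the root of the site.
    if (!is_string($path) || $path === '') {
        $path = '/';
    }

    // Join the two, avoiding a doubled slash at the boundary.
    return rtrim($documentRoot, '/') . $path;
}

echo resolveDocumentPath(
    'http://example.com/marketing/home.html',
    '/usr/local/apache2/htdocs'
), "\n"; // /usr/local/apache2/htdocs/marketing/home.html
```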

In contrast to domain names, the resource identification component is usually case sensitive. This is because it typically refers to a directory or file on the web server, and the filesystems on Unix servers (which host the majority of web sites) are case sensitive.

D.2.1.5 Parameters and queries

The path component of a URL can include parameters and queries that are used by the web server. A common example is to include a query as part of the URL that runs a search script. The following example shows the string q=red as a query that the script search.php can use:

http://example.com/search.php?q=red

Multiple query terms can be encoded using the & character as a separator:


Parameters allow other information not related to a query to be encoded. For example, consider the parameter lines=10 in the URL:

http://example.com/search.php;lines=10?q=red

This can be used by the search.php script to modify the number of lines to display in a result screen.

HTTP provides the distinction between parameters and queries, but parameters are more complex than described here and are not commonly used in practice. We discussed how PHP can use query variables encoded into URLs in Chapter 6.
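To make the distinction concrete, here is a minimal sketch that separates a parameter from the query terms. The URL is hypothetical, and the second query term (region=barossa) is invented purely for illustration. Note that parse_url( ) leaves a ;parameter attached to the path, so the sketch splits it off by hand.

```php
<?php
// Hypothetical URL with a parameter (lines=10) and two query terms.
$url = 'http://example.com/search.php;lines=10?q=red&region=barossa';

$parts = parse_url($url);

// parse_url() keeps ";lines=10" as part of the path, so split it manually.
[$script, $parameter] = explode(';', $parts['path'], 2);

// parse_str() decodes the "&"-separated query terms into an array.
parse_str($parts['query'], $query);

echo $script, "\n";          // /search.php
echo $parameter, "\n";       // lines=10
echo $query['q'], "\n";      // red
echo $query['region'], "\n"; // barossa
```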

D.2.1.6 Fragment identifiers

A URL can include a fragment identifier that is interpreted by the client once a requested resource has been received. A fragment identifier is included at the end of a URL, separated from the rest of the URL by the # character. The meaning of the fragment identifier depends on the type of the resource. For example, the following URL includes the fragment identifier tannin for an HTML document:


When a web browser receives the HTML resource, it then positions the rendered document in the display to start at the anchor element <a name="tannin"> if the named anchor exists.
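Because the fragment is used only by the client, it can be separated from the part of the URL that identifies the resource on the server. The sketch below shows one way to do this in PHP; the URL is hypothetical, and only the fragment name tannin comes from the discussion above.

```php
<?php
// Hypothetical URL; only the fragment name "tannin" comes from the text.
$url = 'http://example.com/glossary.html#tannin';

// The fragment is kept by the client for use after the response arrives.
$fragment = parse_url($url, PHP_URL_FRAGMENT);

// Everything before the "#" identifies the resource requested from the server.
$requestUrl = explode('#', $url, 2)[0];

echo $requestUrl, "\n"; // http://example.com/glossary.html
echo $fragment, "\n";   // tannin
```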

D.2.1.7 Absolute and relative URLs

The URL general syntax allows a resource to be specified as an absolute or a relative URL. Absolute URLs identify the protocol http://, the host, and the path of the resource, and can be used alone to locate a resource. Here's an example absolute URL:


Relative URLs don't contain all the components and are always considered with respect to a base URL. A relative URL is resolved to an absolute URL, with respect to the base URL. Typically, a relative URL contains the path components of a resource and allows related sets of resources to reference each other in a relative way. This allows path hierarchies to be readily changed without the need to change every URL embedded in a set of documents.

A web browser has two ways to set base URLs when resolving relative URLs. The first method allows a base URL to be encoded into the HTML using the <base> element. The second method sets the base URL to that of the current document; this is done in the absence of a <base> element. For example, the following HTML document contains three relative URLs embedded into <a> elements:

  <p>Read my <a href="cv.html">Curriculum Vitae</a>

  <p>Read my <a href="work/emp.html">employment history</a>

  <p>Visit <a href="/admin/fred.html">Fred's home page</a>

Consider what happens if the page that contains the example is requested with the following URL:


The three relative URLs are resolved to the following absolute URLs by the browser:




Table D-1 shows several relative URLs and how they are resolved to the corresponding absolute URLs given the base URL http://example.com/a/b/c.html?foo=bar.

Table D-1. Example relative URLs resolved to absolute URLs

Relative URL

Absolute URL with respect to http://example.com/a/b/c.html?foo=bar
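The resolution process can be sketched in PHP. The function resolveUrl( ) below is a hypothetical, simplified helper rather than a complete implementation: it ignores user information and network-path references such as //host/path, and it treats query-only references the way the later RFC 3986 (and current browsers) do. The relative URLs in the usage loop are illustrative choices, not the rows of Table D-1.

```php
<?php
// resolveUrl() is a hypothetical helper covering common cases only. It
// ignores user info and network-path references such as "//host/path".
function resolveUrl(string $base, string $relative): string
{
    // A reference that starts with a scheme is already absolute.
    if (preg_match('/^[a-z][a-z0-9+.-]*:/i', $relative)) {
        return $relative;
    }

    $b = parse_url($base);
    $origin = $b['scheme'] . '://' . $b['host']
            . (isset($b['port']) ? ':' . $b['port'] : '');
    $basePath = $b['path'] ?? '/';
    $baseQuery = isset($b['query']) ? '?' . $b['query'] : '';

    // Empty or fragment-only references keep the base path and query.
    if ($relative === '' || $relative[0] === '#') {
        return $origin . $basePath . $baseQuery . $relative;
    }
    // Query-only references keep the base path but replace the query.
    if ($relative[0] === '?') {
        return $origin . $basePath . $relative;
    }

    // An absolute path replaces the base path; otherwise the reference
    // is appended to the directory part of the base path.
    $path = $relative[0] === '/'
        ? $relative
        : substr($basePath, 0, strrpos($basePath, '/') + 1) . $relative;

    // Split off any query or fragment before removing "." and ".." segments.
    preg_match('/^([^?#]*)(.*)$/', $path, $m);
    $segments = explode('/', $m[1]);
    $last = count($segments) - 1;
    $out = [];
    foreach ($segments as $i => $segment) {
        if ($segment === '.' || $segment === '..') {
            if ($segment === '..' && count($out) > 1) {
                array_pop($out); // step up one directory, never above the root
            }
            if ($i === $last) {
                $out[] = ''; // a trailing "." or ".." names a directory
            }
        } else {
            $out[] = $segment;
        }
    }
    return $origin . implode('/', $out) . $m[2];
}

$base = 'http://example.com/a/b/c.html?foo=bar';
foreach (['d.html', '../d.html', '/d.html', '?x=y', '#top'] as $relative) {
    echo str_pad($relative, 12), resolveUrl($base, $relative), "\n";
}
```

A relative URL that names only a file replaces the last path segment of the base, ../ steps up one directory, and a leading / discards the base path entirely.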















D.2.1.8 URL encoding

The characters used in resource names, query strings, and parameters must not conflict with the characters that have special meanings or aren't allowed in a URL. For example, a question mark character identifies the beginning of a query, and an ampersand (&) character separates multiple terms in a query.

These characters can be escaped using a hexadecimal encoding consisting of the percent character (%) followed by the two hexadecimal digits representing the ASCII encoding of the character. For example, an ampersand (&) character is encoded as %26.

The characters that need to be escape-encoded are the control, space, and reserved characters:

; / ? : @ & = + $ ,

Delimiter characters must also be encoded:

< > # % "

The following characters can cause problems with gateways and network agents, and should also be encoded:

{} | \ ^ [ ] `

PHP provides the rawurlencode( ) function to encode special characters. For example, rawurlencode( ) can build the href attribute of an embedded link:

echo '<a href="search.php?q=' . rawurlencode("100% + more") . '">';

The result is an <a> element with an embedded URL correctly encoded:

<a href="search.php?q=100%25%20%2B%20more">
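For comparison, the following sketch shows rawurlencode( ) alongside two related PHP functions: urlencode( ), which encodes a space as + rather than %20, and http_build_query( ), which encodes a whole array of query terms at once. The lines term is added here only to show a second name and value pair.

```php
<?php
$term = '100% + more';

// rawurlencode() uses %20 for spaces; urlencode() uses "+" instead.
echo rawurlencode($term), "\n"; // 100%25%20%2B%20more
echo urlencode($term), "\n";    // 100%25+%2B+more

// rawurldecode() reverses the encoding.
echo rawurldecode('100%25%20%2B%20more'), "\n"; // 100% + more

// http_build_query() encodes every name and value in an array of terms.
// PHP_QUERY_RFC3986 selects the same %20 style that rawurlencode() uses.
$queryString = http_build_query(
    ['q' => $term, 'lines' => 10],
    '',
    '&',
    PHP_QUERY_RFC3986
);
echo 'search.php?' . $queryString, "\n"; // search.php?q=100%25%20%2B%20more&lines=10
```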

D.2.2 HTTP Requests

The model used for HTTP requests is to apply methods to identified resources. An HTTP request message contains a method name, a URL to which the method is to be applied, and header fields. Some requests can include a body (for example, the data collected in a form) that is referred to in the HTTP standard as the entity-body.

The following is the example HTTP request we showed you in Chapter 1:

GET /~hugh/index.html HTTP/1.1
Host: goanna.cs.rmit.edu.au
From: hugh@hughwilliams.com (Hugh Williams)
User-agent: Hugh-fake-browser/version-1.0
Accept: text/plain, text/html

The request applies the GET method to the /~hugh/index.html resource. The action is to retrieve the HTML document stored in the file index.html.

The first line of the message is the request and contains the method name GET, the request URL /~hugh/index.html, and the HTTP version HTTP/1.1, each separated by a space character. The request is followed by a list of header fields. Each field is represented as a name and value pair separated with a colon character, and each field is on a separate line.

The header fields are followed by a blank line and then by the optional body of the message. A POST method request usually contains a body of text, as we discuss in the next section.
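The structure just described can be sketched as a small parser. The function parseRequest( ) is a hypothetical helper that assumes a well-formed message with lines terminated by a carriage return and line feed; it is meant to show the layout of a request, not to be a robust HTTP parser.

```php
<?php
// parseRequest() is a hypothetical helper that splits a raw HTTP request
// into its request line, header fields, and optional body. It assumes a
// well-formed message.
function parseRequest(string $raw): array
{
    // A blank line separates the header fields from the body.
    [$head, $body] = array_pad(explode("\r\n\r\n", $raw, 2), 2, '');
    $lines = explode("\r\n", $head);

    // Request line: method, request URL, and HTTP version, space separated.
    [$method, $url, $version] = explode(' ', array_shift($lines), 3);

    $headers = [];
    foreach ($lines as $line) {
        // Each field is a name and value pair separated by a colon.
        [$name, $value] = explode(':', $line, 2);
        // Field names are case insensitive, so store them in lowercase.
        $headers[strtolower(trim($name))] = trim($value);
    }

    return compact('method', 'url', 'version', 'headers', 'body');
}

$raw = "GET /~hugh/index.html HTTP/1.1\r\n"
     . "Host: goanna.cs.rmit.edu.au\r\n"
     . "Accept: text/plain, text/html\r\n"
     . "\r\n";

$request = parseRequest($raw);
echo $request['method'], ' ', $request['url'], "\n"; // GET /~hugh/index.html
echo $request['headers']['host'], "\n";              // goanna.cs.rmit.edu.au
```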

D.2.2.1 Request methods

Six request methods are described here (HTTP/1.1 also defines OPTIONS and CONNECT), but only the first three are used in practice:

GET
Retrieves a resource. A query can be used to add extra information to the GET request and, as we discussed in our introduction to URLs, the query is appended to the URL itself. A database search is a good example of an application of the GET request: the resource is likely to be a web script, and the query component of the URL carries the search conditions.

POST
Sends data to a server. Rather than appending data to the URL, the data is sent in the body of the HTTP request.

HEAD
Requests only the header fields as a response, not the resource itself. This can be used for lightweight retrieval, so that the modification date of a resource can be checked before the full resource is retrieved with GET.

DELETE
Allows a resource identified by the URL to be deleted from a server. This is the counterpart to the PUT method discussed next, and it allows an author to remove a resource from the specified URL. It's usually not implemented by web servers.

PUT
Similar to the POST method, this method is designed to put a resource onto a server. Some HTML editors and web servers support the PUT method, allowing authors to put resources onto a web site at the specified URL. However, it's usually not implemented by web servers.

TRACE
Produces diagnostic information.

The HTTP standard divides these methods into those that are safe and those that aren't. The safe methods (GET and HEAD) don't have any persistent side effects on the server. The unsafe methods (POST, PUT, and DELETE) are designed to have persistent effects on the server. The standard allows for clients to warn users that a request may be unsafe and, for example, most browsers won't resend a request with the POST method without user confirmation.

The HTTP standard further classifies methods as idempotent when a request can be repeated many times and have the same effect as if the method was called once. The GET, HEAD, PUT, and DELETE methods are classified as idempotent. The POST method isn't.
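The two classifications can be captured in a short lookup. The constant and function names below are hypothetical; the method lists themselves come straight from the discussion above.

```php
<?php
// Classification taken from the text; the helper names are hypothetical.
const SAFE_METHODS = ['GET', 'HEAD'];
const IDEMPOTENT_METHODS = ['GET', 'HEAD', 'PUT', 'DELETE'];

// True when the method has no persistent side effects on the server.
function isSafe(string $method): bool
{
    return in_array(strtoupper($method), SAFE_METHODS, true);
}

// True when repeating the request has the same effect as sending it once.
function isIdempotent(string $method): bool
{
    return in_array(strtoupper($method), IDEMPOTENT_METHODS, true);
}

foreach (['GET', 'HEAD', 'POST', 'PUT', 'DELETE'] as $method) {
    printf(
        "%-6s safe=%s idempotent=%s\n",
        $method,
        isSafe($method) ? 'yes' : 'no',
        isIdempotent($method) ? 'yes' : 'no'
    );
}
```

A client could use a check like isIdempotent( ) to decide whether a failed request can be repeated without asking the user first.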

D.2.2.2 GET versus POST

Both the GET and POST methods send data to the server, but which method should you use?

The HTTP standard includes the two methods to achieve different goals. The POST method was intended to create a resource. The contents of the resource would be encoded into the body of the HTTP request. For example, an order form might be processed and a new row in a database created.

The GET method is used when a request has no side effects (such as performing a search) and the POST method is used when a request has side effects (such as adding a new row to a database). A more practical issue is that the GET method may result in long URLs, and may even exceed some browser and server limits on URL length.

Use the POST method if any of the following are true:

  • The result of the request has persistent side effects such as adding a new database row.

  • The data collected on the form is likely to result in a long URL if you used the GET method.

  • The data to be sent is in any encoding other than seven-bit ASCII.

Use the GET method if all the following are true:

  • The request is to find a resource, and HTML form data is used to help that search.

  • The result of the request has no persistent side effects.

  • The data collected and the input field names in an HTML form total fewer than 1,024 characters.
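The two checklists above can be expressed as a small decision function. This is only a sketch of the guidelines: chooseMethod( ) is a hypothetical helper, and it assumes the caller already knows whether the request has persistent side effects.

```php
<?php
// chooseMethod() is a hypothetical helper that applies the checklists
// above to a set of form fields (name => value).
function chooseMethod(array $fields, bool $hasSideEffects): string
{
    // Requests with persistent side effects should use POST.
    if ($hasSideEffects) {
        return 'POST';
    }

    $length = 0;
    foreach ($fields as $name => $value) {
        $name = (string) $name;
        $value = (string) $value;

        // Any byte outside seven-bit ASCII calls for POST.
        if (preg_match('/[^\x00-\x7F]/', $name . $value)) {
            return 'POST';
        }
        $length += strlen($name) + strlen($value);
    }

    // Short, side-effect-free requests can use GET.
    return $length < 1024 ? 'GET' : 'POST';
}

echo chooseMethod(['q' => 'red'], false), "\n";                  // GET
echo chooseMethod(['name' => 'Hugh', 'qty' => '2'], true), "\n"; // POST
```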

D.2.3 HTTP Responses

When a web server processes a request from a browser, it attempts to apply the method to the identified resource and create a response. The action of the request may succeed or fail, but the web server always sends a response message back to the browser.

An HTTP response message contains a status line, header fields, and (usually) the requested entity as the body of the message. For example, the following is the result of a GET method request for a small HTML file:

HTTP/1.1 200 OK
Date: Sun, 19 Dec 2004 02:54:37 GMT
Server: Apache/2.0.48
Last-Modified: Fri, 19 Dec 2003 02:53:08 GMT
ETag: "4445f-bf-39f4f994"
Content-Length: 321
Accept-Ranges: bytes
Connection: close
Content-Type: text/html

<!DOCTYPE HTML PUBLIC
   "-//W3C//DTD HTML 4.0 Transitional//EN"
   "http://www.w3.org/TR/html4/loose.dtd" >
<html>
<head><title>Grapes and Glass</title></head>
<body>
<img src="http://example.com/grapes.gif">
<p>Welcome to my simple page
<p><img src="http://example.com/glass.gif">
</body>
</html>

The first line, the status line, begins with the protocol version of the message, followed by a status code and a reason phrase, each separated by a space character. The status code is a number and the reason phrase describes its meaning; these are discussed in the next section. The status line is then followed by the header fields. As with the request, each field is represented as a name and value pair separated with a colon character. A blank line separates the header fields from the body of the response, in this case an HTML document.
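The layout of the status line can be sketched as follows. The function parseStatusLine( ) is a hypothetical helper that assumes a well-formed line; the split is limited to three parts because the reason phrase may itself contain spaces.

```php
<?php
// parseStatusLine() is a hypothetical helper that splits a status line
// into protocol version, status code, and reason phrase.
function parseStatusLine(string $line): array
{
    // Limit the split to three parts: the reason phrase can contain spaces.
    [$version, $code, $reason] = explode(' ', rtrim($line), 3);

    return [
        'version' => $version,
        'code'    => (int) $code,
        'reason'  => $reason,
    ];
}

$status = parseStatusLine("HTTP/1.1 200 OK\r\n");
echo $status['version'], "\n"; // HTTP/1.1
echo $status['code'], "\n";    // 200
echo $status['reason'], "\n";  // OK
```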

D.2.3.1 Status codes

HTTP status codes are used to classify responses to requests. The HTTP status code system is extensible, with a set of codes described in the standard that are "generally recognized in current practice". HTTP defines a status code as a three-digit number, where the first digit is the class of response. The following list shows the five classes of codes defined by HTTP:

1xx
Informational. HTTP/1.1 uses codes in this class to indicate the request has been received by the server and that processing is continuing.

2xx
Success. The request was successfully received, and the action successfully performed.

3xx
Redirection. When a response has a redirection code, the client needs to make a further request to get the specified resource. The URL of the actual resource is included in the response header field Location. When the status code is set to 301, the browser automatically makes the request for the URL specified in the Location header field. The use of the Location header field is discussed further in Chapter 6, and used in many examples throughout this book.

4xx
Client error. The request can't be processed because of bad syntax of the message, the sender is unauthorized or forbidden to access the resource, or the resource can't be found.

5xx
Server error. The server failed to fulfill a valid request.
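Because the class is carried by the first digit, classifying a status code takes only an integer division. The function statusClass( ) below is a hypothetical helper written for this sketch.

```php
<?php
// statusClass() is a hypothetical helper that maps a three-digit status
// code to the name of its class, using the first digit.
function statusClass(int $code): string
{
    $classes = [
        1 => 'Informational',
        2 => 'Success',
        3 => 'Redirection',
        4 => 'Client error',
        5 => 'Server error',
    ];

    // intdiv() drops the last two digits, leaving the class digit.
    return $classes[intdiv($code, 100)] ?? 'Unknown';
}

echo statusClass(200), "\n"; // Success
echo statusClass(301), "\n"; // Redirection
echo statusClass(404), "\n"; // Client error
```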

D.2.4 Caching

Most user agents, such as web browsers, allow HTTP responses to be cached. HTTP responses are cached by saving a response to a request in memory or on disk. When a browser considers a request, it first looks to its local cache to see if it has an up-to-date copy of the response before sending the request to the web server. This can significantly reduce the number of requests sent to a web server, improving the performance of the web application and responsiveness to users.

Consider a web site that includes a company logo on the top of each HTML page:

<img src="/images/logo.gif">

When the browser requests a page that contains the image, a separate request is sent to retrieve the image /images/logo.gif. If the image resource is cacheable, and browser caching is enabled, the browser saves the response. A subsequent request for the image is recognized, and the local copy from the cache is used rather than sending another request to the web server.

A browser uses a cached response until the response becomes stale, or the cache becomes full and the response is displaced by the resources from other requests. The primary mechanism for determining if a response is stale is comparing the date and time set in the Expires header field with the date and time of the machine running the browser. If the date and time are incorrectly set on the machine, a cached response may expire immediately or be cached longer than intended.
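The staleness test based on the Expires header field can be sketched as a comparison between two timestamps. The function isStale( ) is a hypothetical helper; it accepts the current time as an optional argument so that the effect of an incorrectly set clock is easy to see.

```php
<?php
// isStale() is a hypothetical helper that compares the date in an Expires
// header field with the clock of the machine holding the cache.
function isStale(string $expires, ?int $now = null): bool
{
    $expiresAt = strtotime($expires);

    // A date that can't be parsed is treated as already expired.
    if ($expiresAt === false) {
        return true;
    }

    return ($now ?? time()) >= $expiresAt;
}

$expires = 'Sun, 19 Dec 2004 02:54:37 GMT';
$expiresAt = strtotime($expires);

// A clock one hour behind the expiry time still trusts the cached copy...
var_dump(isStale($expires, $expiresAt - 3600)); // bool(false)

// ...while a clock one hour ahead treats the same copy as stale.
var_dump(isStale($expires, $expiresAt + 3600)); // bool(true)
```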

HTTP describes the conditions that allow a user agent to cache a response. However, there are many situations in which an application may wish to prevent a page from being cached, particularly when the content of a response is dynamically generated, such as in a web database application.

HTTP/1.1 uses the Cache-Control header field as its basic caching control mechanism. For example, setting the Cache-Control header field to no-cache in an HTTP response prevents an HTTP/1.1 user agent from reusing the cached response without first revalidating it with the server. The header can be used in requests and responses, but we consider only responses here.

Some HTTP/1.1 Cache-Control settings are directed to user agents that maintain caches for more than one user, such as proxy servers. Proxy servers are used to achieve several goals, the most important of which is to provide caching of responses for a group of users. A local network, such as that found in a university department, can be configured to send all HTTP requests to a proxy server. The proxy server forwards requests to the destination web server and passes back the responses to the originating client.

Proxy servers can cache responses and thus reduce requests sent outside the local network. Setting the Cache-Control header field to public allows a user agent to make the cached response available to any request. Setting the Cache-Control header field to private allows a user agent to make the cached response available only to the client who made the initial request.

Setting the Cache-Control header field to no-store prevents a user agent from storing the response on disk. This prevents sensitive information from being inadvertently saved beyond the life of a browser session. HTTP/1.1 defines several other Cache-Control settings not described here.
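In a PHP script, these settings are sent with the header( ) function before any body output. The sketch below wraps the four settings discussed above in a hypothetical helper, cacheHeaders( ); the policy names (dynamic, sensitive, shared, per-user) are labels invented for this example, not part of HTTP.

```php
<?php
// cacheHeaders() is a hypothetical helper; the policy names are labels
// invented for this sketch and map onto the settings discussed above.
function cacheHeaders(string $policy): array
{
    switch ($policy) {
        case 'dynamic':   // generated content that should not be reused blindly
            return ['Cache-Control: no-cache'];
        case 'sensitive': // content that must not be written to disk
            return ['Cache-Control: no-store'];
        case 'shared':    // content a proxy may serve to any user
            return ['Cache-Control: public'];
        case 'per-user':  // content reusable only by the original client
            return ['Cache-Control: private'];
        default:
            throw new InvalidArgumentException("Unknown policy: $policy");
    }
}

// Header fields must be sent before any part of the body is output.
foreach (cacheHeaders('dynamic') as $line) {
    if (!headers_sent()) {
        header($line);
    }
}
```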