13.2 access.log

Squid saves key information about HTTP transactions in access.log. This file is line-based, such that each line corresponds to one client request. Squid records the client IP address (or hostname), requested URI, response size, and other information.

Squid records all HTTP accesses in access.log, except for those that disconnect before sending any data. Squid also records all ICP (but not HTCP) transactions unless you disable them with the log_icp_queries directive. Section 13.2.4 describes the other squid.conf directives that affect the access log.

The default access.log format contains 10 fields. Here are some examples, with long lines split and indented:

1066037222.011  126389 9.121.105.207 TCP_MISS/503 1055

        GET http://home.gigigaga.com/n8342133/Miho.DAT.019 -

        DIRECT/203.187.1.180 -

1066037222.011   19120 12.83.179.11 TCP_MISS/200 359

        GET http://ads.x10.com/720x300/Z2FtZ3JlZXRpbmcxLmRhd/7/AMG -

        DIRECT/63.211.210.20 text/html

1066037222.011   34173 166.181.33.71 TCP_MISS/200 559

        GET http://coursesites.blackboard.com:8081/service/collab/../1010706448190/ -

        DIRECT/216.200.107.101 application/octet-stream

1066037222.011   19287 41.51.105.27 TCP_REFRESH_MISS/200 500

        GET http://fn.yam.com/include/tsemark/show.js -

        DIRECT/210.59.224.59 application/x-javascript

1066037222.011   19395 41.51.105.27 TCP_MISS/304 274

        GET http://fnasp.yam.com/image/coin3.gif -

        DIRECT/211.72.254.133 -

1066037222.011   19074 30.208.85.76 TCP_CLIENT_REFRESH_MISS/304 197

        GET http://ads.icq.com/content/B0/0/..bC6GygEYNeHGjBUin5Azfe68m5hD1jLk$/aol -

        DIRECT/64.12.184.121 -

1066037222.011   19048 12.83.179.11 TCP_MISS/200 261

        GET http://ads.adsag.com/js.ng/...ne&cat=friendship&subcat=girltalk -

        DIRECT/209.225.54.119 application/x-javascript

1066037222.118     106 41.51.105.27 TCP_HIT/200 536

        GET http://rcm-images.amazon.com/images/G/01/rcm/privacy.gif -

        NONE/- image/gif

1066037222.352   19475 27.34.49.248 TCP_MISS/200 12387

        GET http://espanol.geocities.com/lebastias/divulgacion/budismo-tarot.html -

        DIRECT/209.1.225.139 text/html

1066037222.352     132 144.157.100.17 TCP_MISS/504 1293

        GET http://ar.atwola.com/image/93101912/aol -

        NONE/- -

Here are the definitions for all fields:

1: timestamp

The completion time of the request, expressed as the number of seconds since the Unix epoch (Thu Jan 1 00:00:00 UTC 1970), with millisecond resolution. Squid uses this format, instead of something more human-friendly, to simplify the work of various log file processing programs.

You can use a simple Perl command to convert the Unix timestamps into local time. For example:

perl -pe 's/^\d+\.\d+/localtime($&)/e;' access.log

2: response time

For HTTP transactions, this field indicates how much time it took to process the request. The timer starts when Squid receives the HTTP request and stops when the response has been fully delivered. The response time is given in milliseconds.

The response time is usually 0 for ICP queries. This is because Squid answers ICP queries very quickly. Furthermore, Squid doesn't update the process clock between receiving an ICP query and sending the reply.

While time values are reported with millisecond resolution, the precision of those entries is probably about 10 milliseconds. Timing becomes even less precise when Squid is heavily loaded.

3: client address

This field contains the client's IP address, or hostname if you enable log_fqdn. For security or privacy reasons, you may want to mask a part of client's address out using the client_netmask directive. However, that also makes it impossible to group requests coming from the same client.

4: result/status codes

This field consists of two tokens separated by a slash. The first token, result code, classifies the protocol and the result of a transaction (e.g., TCP_HIT or UDP_DENIED). These are Squid-specific codes, defined in Section 13.2.1. The codes that begin with TCP_ refer to HTTP requests, while UDP_ refers to ICP queries.

The second token is the HTTP response status code (e.g, 200, 304, 404, etc.). The status code normally comes from the origin server. In some cases, however, Squid may be responsible for selecting the status code. These codes, defined by the HTTP RFC, are summarized later in Table 13-1.

5: transfer size

This field indicates the number of bytes transferred to the client. Strictly speaking, it is the number of bytes that Squid told the TCP/IP stack to send to the client. Thus, it doesn't include overheads from TCP/IP headers. Also note that the transfer size is normally larger than the response's Content-Length. This value includes the HTTP response headers, while Content-Length does not.

These properties make the transfer size field useful for approximate bandwidth usage analysis but not for exact HTTP entity size calculations. If you need to know a response's Content-Length, you can find it in the store.log file.

6: request method

This field contains the request method. Because Squid clients may use ICP or HTTP, the request method is either HTTP- or ICP-specific. The most common HTTP request method is GET. ICP queries are always logged with ICP_QUERY. See Section 6.1.2.8 for a list of HTTP methods Squid knows about.

7: URI

This field contains the URI from the client's request. The vast majority of logged URIs are actually URLs (i.e., they have hostnames).

Squid uses a special format for certain failures. These are cases when Squid can't parse the HTTP request or otherwise determine the URI. Instead of a URI/URL, you'll see a string such as "error:invalid-request." For example:

1066036250.603 310 192.0.34.70 NONE/400 1203 GET error:invalid-request - NONE/- -

Also in this field look out for whitespace characters in the URI. Depending on your uri_whitespace setting, Squid may print the URI in the log file with whitespace characters. When this happens, the tools that read access.log files may become confused by the extra fields.

When logging, Squid strips all URI characters after the first question mark unless the strip_query_terms directive is disabled.

8: client identity

Squid can determine a user's identity in two different ways. One is with the RFC 1413 ident protocol; the other is from HTTP authentication headers.

Squid attempts ident lookups based on the ident_lookup_access rules, if any (see Section 6.2). Alternatively, if you use proxy authentication (or regular server authentication in surrogate mode), Squid places the given username in this field. If both methods provide Squid with a username, and you're using the native access.log format, the HTTP authentication name is logged, and the RFC 1413 name is ignored. The common log file format has separate fields for both names.

9: peering code/peerhost

The peering information consists of two tokens, separated by a slash. It is relevant only for requests that are cache misses. The first token indicates how the next hop was chosen. The second token is the address of that next hop. The peering codes are listed in Section 13.2.3.

When Squid sends a request to a neighbor cache, the peerhost address is the neighbor's hostname. If the request is sent directly to the origin server, however, Squid writes the origin server's IP address or its hostname if log_ip_on_direct is disabled. The value NONE/- indicates that Squid didn't forward this request to any other servers.

10: content type

The final field of the default, native access.log is the content type of the HTTP response. Squid obtains the content type value from the response's Content-Type header. If that header is missing, Squid uses a hyphen (-).

If you enable the log_mime_headers directive, Squid appends two additional fields to each line:

11: HTTP request headers

Squid encodes the HTTP request headers and prints them between a pair of square brackets. The brackets are necessary because Squid doesn't encode space characters. The encoding scheme is a little strange. Carriage return (ASCII 13) and newline (ASCII 10) are printed as \r and \n, respectively. Other non-printable characters are encoded with the RFC 1738 style, such that Tab (ASCII 9) becomes %09.

12: HTTP response headers

Squid encodes the HTTP response headers and prints them between a pair of square brackets. Note that these are the headers sent to the client, which may be different from headers received from the origin server.

Squid writes to access.log only after the entire response has been sent to the client. This allows Squid to include both request and response information in the log file. However, transactions that take minutes, or even hours, to complete aren't visible in access.log at the time of the request. When these types of transactions present a performance or policy concern, the access.log may be unable help you. Instead, use the cache manager to view a list of pending transactions (see Section 14.2.1.37).

13.2.1 access.log Result Codes

The following labels may appear in the fourth field of the access.log file in response to HTTP requests:

TCP_HIT: Squid found a likely fresh copy of the requested resource and sent it immediately to the client.
TCP_MISS: Squid didn't have a cached copy of the requested resource.
TCP_REFRESH_HIT: Squid found a likely stale copy of the requested resource and sent a validation request to the origin server. The origin server sent a 304 (Not Modified) response, indicating that Squid's copy is still fresh.
TCP_REF_FAIL_HIT: Squid found a likely stale copy of the requested resource and sent a validation request to the origin server. However, the origin server failed to respond or sent a response that Squid didn't understand. In any case, Squid sent the cached (and likely stale) copy to the client.
TCP_REFRESH_MISS: Squid found a likely stale copy of the requested resource and sent a validation request to the origin server. The server responded with new content, indicating the cached response was indeed stale.
TCP_CLIENT_REFRESH_MISS: Squid found a copy of the requested resource, but the client's request included a Cache-Control: no-cache directive. Squid forwarded the client's request to the origin server, forcing a cache validation.
TCP_IMS_HIT: The client sent a validation request, and Squid found a more recent, and likely fresh, copy of the requested resource. Squid sent the newer content to the client, without contacting the origin server.
TCP_SWAPFAIL_MISS: Squid found a valid copy of the requested resource but failed to load it from disk. Squid then sent the request to the origin server as though it were a cache miss.
TCP_NEGATIVE_HIT: When a request to an origin server results in an HTTP error, Squid may cache the response anyway. Repeated requests for these resources, within a short amount of time, result in negative hits. The negative_ttl directive controls the amount of time these errors may be cached. Also note that errors are cached only in memory and never written to disk. The following HTTP status codes may be negatively cached, subject to additional constraints: 204, 305, 400, 403, 404, 405, 414, 500, 501, 502, 503, 504.
TCP_MEM_HIT: Squid found a valid copy of the requested resource in the memory cache and sent it immediately to the client. Note that this doesn't accurately represent all responses served from memory. For example, responses that are cached in memory, but require validation, are logged with TCP_REFRESH_HIT, TCP_REFRESH_MISS, etc.
TCP_DENIED: The client's request was denied, due to either the http_access or http_reply_access rules. Note that requests denied by http_access have NONE/- in the ninth field, whereas those denied by http_reply_access have a valid entry.
TCP_OFFLINE_HIT: When offline_mode is enabled, Squid returns cache hits for almost any cached response, without considering its freshness.
TCP_REDIRECT: A redirector program told Squid to generate an HTTP redirect to a new URI (see Section 11.1). Normally, Squid doesn't log these redirects. To do so, you must manually define the LOG_TCP_REDIRECTS preprocessor directive before compiling Squid.
NONE: Unclassified result used for certain errors, such as invalid hostnames.

The following labels may appear in the fourth field of the access.log file in response to ICP queries:

UDP_HIT: Squid found a likely fresh copy of the requested resource in the cache.
UDP_MISS: Squid didn't find a likely fresh copy of the requested resource in the cache. If the same object is requested via HTTP, it would probably be a cache miss. Compare with UDP_MISS_NOFETCH.
UDP_MISS_NOFETCH: Like UDP_MISS, except that this also indicates Squid's reluctance to handle the corresponding HTTP request. If you use the -Y command-line option, Squid returns this, instead of UDP_MISS, while rebuilding its in-memory indexes at startup.
UDP_DENIED: The ICP query is denied due to the icp_access rules. If more than 95% of the ICP replies to a client are UDP_DENIED, and the client database is enabled (see Appendix A), Squid stops sending any ICP replies to the client for an hour. When this happens you'll also see a warning in cache.log.
UDP_INVALID: Squid received an invalid query (e.g., truncated message, invalid protocol version, whitespace in the URI, etc.). Squid sent an ICP_INVALID reply back to the client.

13.2.2 HTTP Response Status Codes

Table 13-1 lists the numerical HTTP response codes and reason phrases. Note that Squid and other HTTP agents care only about the numeric value. The reason phrase is purely informational and doesn't affect the meaning of the response. For each status code, I also provide a reference to the particular section in RFC 2616 that describes it. Note that status codes 0 and 600 are nonstandard values used by Squid, and aren't mentioned in the RFC.

Table 13-1. HTTP response status codes
Code	Reason phrase	RFC 2616 section
0	No Response Received (Squid-specific)	N/A
1xx	Informational	10.1
100	Continue	10.1.1
101	Switching Protocols	10.1.2
2xx	Successful	10.2
200	OK	10.2.1
201	Created	10.2.2
202	Accepted	10.2.3
203	Non-Authoritative Information	10.2.4
204	No Content	10.2.5
205	Reset Content	10.2.6
206	Partial Content	10.2.7
3xx	Redirection	10.3
300	Multiple Choices	10.3.1
301	Moved Permanently	10.3.2
302	Found	10.3.3
303	See Other	10.3.4
304	Not Modified	10.3.5
305	Use Proxy	10.3.6
306	(Unused)	10.3.7
307	Temporary Redirect	10.3.8
4xx	Client Error	10.4
400	Bad Request	10.4.1
401	Unauthorized	10.4.2
402	Payment Required	10.4.3
403	Forbidden	10.4.4
404	Not Found	10.4.5
405	Method Not Allowed	10.4.6
406	Not Acceptable	10.4.7
407	Proxy Authentication Required	10.4.8
408	Request Timeout	10.4.9
409	Conflict	10.4.10
410	Gone	10.4.11
411	Length Required	10.4.12
412	Precondition Failed	10.4.13
413	Request Entity Too Large	10.4.14
414	Request-URI Too Long	10.4.15
415	Unsupported Media Type	10.4.16
416	Requested Range Not Satisfiable	10.4.17
417	Expectation Failed	10.4.18
5xx	Server Error	10.5
500	Internal Server Error	10.5.1
501	Not Implemented	10.5.2
502	Bad Gateway	10.5.3
503	Service Unavailable	10.5.4
504	Gateway Timeout	10.5.5
505	HTTP Version Not Supported	10.5.6
6xx	Proxy Error	N/A
600	Unparseable Response Headers (Squid-specific)	N/A

You'll see status code 0 in the access.log if Squid doesn't receive any response from the origin server. You'll see status code 600 if Squid received a response but couldn't find any HTTP headers. In a small fraction of cases, certain origin servers send only the response body and omit any headers.

13.2.3 access.log Peering Codes

The following codes may appear in the ninth field of the access.log. Refer to Section 10.10 for a description of how Squid selects the next-hop for cache misses.

NONE

This indicates that Squid didn't communicate with any other servers (neighbors, origin) for this request. You'll see it in association with various types of cache hits, denied requests, cache manager requests, errors, and all ICP queries.

DIRECT

Squid forwarded the request directly to the origin server. The second half of the field shows the origin server's IP address, or hostname if you've disabled log_ip_on_direct.

SIBLING_HIT

Squid sent the request to this sibling cache after the sibling returned an ICP or HTCP hit.

PARENT_HIT

Squid sent the request to this parent cache after the parent returned an ICP or HTCP hit.

DEFAULT_PARENT

Squid selected this parent because it was marked as default on the cache_peer line in squid.conf.

FIRST_UP_PARENT

Squid forwarded the request to this parent because it is the first parent in the list known to be alive.

FIRST_PARENT_MISS

Squid forwarded the request to the parent cache that was first to respond with an ICP/HTCP miss message. In other words, for this particular ICP/HTCP query, at this particular time, the selected parent had the best round-trip time. Note that measured RTTs may be artificially adjusted by the weight option to the cache_peer directive.

CLOSEST_PARENT_MISS

Squid selected this parent because it reports the lowest RTT to the origin server. This occurs only if both caches have netdb enabled (see Section 10.5), and the origin server (or other servers on its subnet) returns ICMP pings.

CLOSEST_PARENT

This is similar to CLOSEST_PARENT_MISS, except that the RTT measurements don't come from the ICP/HTCP reply messages. Instead, they come from older measurements saved by Squid, such as the netdb exchange feature.

CLOSEST_DIRECT

Squid forwarded the request to the origin server based on netdb measurements. This happens if any of these conditions occur:

The RTT between Squid and the origin server is less than the configured minimum_direct_rtt value.
The measured number of router hops between Squid and the origin server is less than the configured minimum_direct_hops value.
The RTT values returned in ICP/HTCP replies indicate that Squid is closer to the origin server than any of its neighbors.

ROUNDROBIN_PARENT

Squid forwarded the request to this parent because the round-robin option was set, and it had the lowest usage counter.

CD_PARENT_HIT

Squid forwarded the request to this parent based on the Cache Digest algorithm (see Section 10.7).

CD_SIBLING_HIT

Squid forwarded the request to this sibling based on the Cache Digest algorithm.

CARP

Squid selected this parent based on the Cache Array Routing Protocol algorithm (see Section 10.9).

ANY_PARENT

Squid selected this parent as a last resort because none of the other methods resulted in a viable next-hop.

Note that most of these codes may be preceded by TIMEOUT_ to indicate that a timeout occurred while waiting for ICP/HTCP replies. For example:

1066038165.382    345 193.233.46.21 TCP_MISS/200 2836

        GET http://www.caida.org/home/images/home.jpg

        TIMEOUT_CLOSEST_DIRECT/213.219.122.19 image/jpeg

You can adjust the timeout with the icp_query_timeout directive.

13.2.4 Configuration Directives That Affect access.log

Following are the configuration file directives that affect the access.log in one way or another.

13.2.4.1 log_icp_queries

This directive, enabled by default, causes Squid to log all ICP queries. If you're running a busy parent cache, this may make your access.log files huge. To save space, disable this directive:

log_icp_queries off

If you disable ICP query logging, I suggest that you monitor the number of queries, either through the cache manager or with SNMP.

13.2.4.2 emulate_httpd_log

The access.log file has two formats: common and native. The common format is the same as most HTTP servers (e.g., Apache) use. It contains less information than Squid's native format. However, you might want to use the common log-file format if you use Squid as a surrogate (see Chapter 15). The common format may also be useful if you have log-file analysis tools that know how to parse it. Use this directive to enable the common format:

emulate_httpd_log on

See the site http://www.w3.org/Daemon/User/Config/Logging.html#common-logfile-format, for a description of this format.

13.2.4.3 log_mime_hdrs

Use the log_mime_hdrs directive to make Squid log the HTTP request and response headers:

log_mime_headers on

When enabled, Squid appends the request and response headers to access.log. This adds two fields to each line. Each field is surrounded by square brackets to make parsing easier. Certain characters are encoded to keep the log file readable. Table 13-2 shows the encoding scheme.

Table 13-2. Character encoding rules for HTTP headers in access.log
Character	Encoding
Newline	\n
Carriage return	\r
Backslash	\\
[	%5b
]	%5d
%	%25
ASCII 0-31	`%xx` (hexadecimal value)
ASCII 127-255	`%xx` (hexadecimal value)

13.2.4.4 log_fqdn

By default, Squid puts client IP addresses in the access.log. You can record hostnames, when available, by enabling this directive:

log_fqdn on

This causes Squid to make reverse DNS lookups for the client's address when it receives a request. If an answer is available by the time the request is complete, Squid places it in the third field.

13.2.4.5 ident_lookup_access

This access rule list determines whether or not Squid makes an RFC 1413 ident query for the client's TCP connection. By default, Squid doesn't issue ident queries. To enable this feature, simply add one or more rules:

acl All src 0/0

ident_lookup_access allow All

If an answer is available by the time the request is complete, Squid places it in the eighth field. If you are also using HTTP authentication, that username is written instead of the ident answer.

13.2.4.6 log_ip_on_direct

When Squid forwards a cache miss to an origin server, it records the origin server's IP address in the ninth field. You can disable this directive so that Squid writes the hostname instead:

log_ip_on_direct off

In this case, the hostname comes from the URI. If the URI contains an IP address, Squid doesn't convert it to a hostname.

13.2.4.7 client_netmask

This directive exists to provide some level of privacy for your users. Rather than logging the entire client IP address, you can mask off some bits. For example:

client_netmask 255.255.255.0

With this setting, all client IP addresses in access.log have 0 as the last octet:

1066036246.918     35 163.11.255.0 TCP_IMS_HIT/304 266 GET http://...

1066036246.932     16 163.11.255.0 TCP_IMS_HIT/304 266 GET http://...

1066036247.616    313 140.132.252.0 TCP_MISS/200 1079 GET http://...

1066036248.598  44459 140.132.252.0 TCP_MISS/500 1531 GET http://...

1066036249.230     17 170.210.173.0 TCP_IMS_HIT/304 265 GET http://...

1066036249.752   2135 140.132.252.0 TCP_MISS/200 50230 GET http://...

1066036250.467      4 170.210.173.0 TCP_IMS_HIT/304 265 GET http://...

1066036250.762    102 163.11.255.0 TCP_IMS_HIT/304 265 GET http://...

1066036250.832     20 163.11.255.0 TCP_IMS_HIT/304 266 GET http://...

1066036251.026     74 203.91.150.0 TCP_CLIENT_REFRESH_MISS/304 267 GET http://...

13.2.4.8 strip_query_terms

This directive is another privacy feature. It removes query terms from URIs before logging them. If your log files somehow fall into the wrong hands, they won't be able to find any usernames and passwords. When this directive is enabled, all characters after a question mark (?) are removed. For example, a URI like this:

http://auto.search.msn.com/response.asp?MT=www.kimo.com.yw&srch=3

&prov=&utf8

is logged like this:

http://auto.search.msn.com/response.asp?

13.2.4.9 uri_whitespace

Earlier, I mentioned the problem with whitespace appearing in some URIs. The RFCs state that URIs must not contain whitespace, but in reality it happens all too often. The uri_whitespace directive dictates how Squid should handle such cases. The allowed settings are: strip (default), deny, allow, encode, and chop. Of these, strip, encode, and chop ensure that the URI field doesn't contain any whitespace (thus adding more fields to access.log).

The allow setting allows the request to pass through Squid unmodified. It is likely to cause trouble for redirectors and log file parsers. The deny setting, on the other hand, causes Squid to deny the request. The user receives an error message, but the request is still written to access.log with the whitespace characters.

If you set it to encode, Squid changes the whitespace characters to their RFC 1738 equivalents. This is probably what the user-agent should have done in the first place. The chop setting causes Squid to cut off the URI at the first whitespace character.

The default setting is strip, which makes Squid remove the whitespace characters from the URI. It ensures that your log-file parsers and redirectors will be happy, but it might break certain things, such as improperly encoded search engine queries.

13.2.4.10 buffered_logs

By default, Squid disables buffering for the cache.log file, which allows you to run tail -f and watch log file entries appear in real time. If you think this will cause an unnecessary overhead, you can disable buffering:

buffered_logs off

However, it probably doesn't matter unless you are running Squid with full debugging. Note that this option affects only cache.log. The others always use unbuffered writes.

13.2.5 access.log Analysis Tools

The access.log file contains a wealth of information, much more than you can see by just browsing through it. In order to get the big picture view, you'll need to use a third-party log-file analysis package. You can find a long list of them linked from the Squid web page, or by going directly to http://www.squid-cache.org/Scripts/.

One of the most popular tools is Calamarisa Perl script that parses the log file and generates either text or HTML-based reports. It provides a breakdown of traffic by request method, client IP address, origin server domain name, content types, filename extensions, reply size, and more. Calamaris also reports on ICP query traffic and even understands log files from other caching products. Check it out by visiting http://calamaris.cord.de/.

Squeezer, and its derivative, Squeezer2, are Squid-specific analysis tools. They provide many statistics that can help you understand Squid's performance, especially when you have neighbors. Both generate HTML pages as output. Visit the Logfile Analysis page on the squid-cache.org site for links to these programs.

Webalyzer is another good utility. It is designed to be fast and produces HTML pages with tables and bar charts. It was originally designed for origin server access logs. Although it can parse Squid's logs, it doesn't report on such things as hit ratios and response times. It also uses some terms differently than I do. For example, Webalyzer calls any request a "hit," which isn't the same as a cache hit. It also makes a distinction between "pages" and "files." For more information, visit the Webalyzer home page at http://www.mrunix.net/webalyzer/.