3.10 Preventing Cross-Site Scripting

3.10.1 Problem

You are developing a web-based application, and you want to ensure that an attacker cannot exploit it in an effort to steal information from the browsers of other people visiting the same site.

3.10.2 Solution

When you are generating HTML that must contain external input, be sure to escape that input so that if it contains embedded HTML tags, the tags are not treated as HTML by the browser.

3.10.3 Discussion

Cross-site scripting attacks (often called CSS, but more frequently XSS in an effort to avoid confusion with cascading style sheets) are a general class of attacks with a common root cause: insufficient input validation. The goal of many cross-site scripting attacks is to steal information (usually the contents of some specific cookie) from unsuspecting users. Other times, the goal is to get an unsuspecting user to launch an attack on himself. These attacks are especially a problem for sites that store sensitive information, such as login data or session IDs, in cookies. Cookie theft could allow an attacker to hijack a session or glean other information that is intended to be private.

Consider, for example, a web-based message board, where many different people visit the site to read the messages that other people have posted, and to post messages themselves. When someone posts a new message to the board, if the message board software does not properly validate the input, the message could contain malicious HTML that, when viewed by other people, performs some unexpected action. Usually an attacker will attempt to embed some JavaScript code that steals cookies, or something similar.

Often, an attacker has to go to greater lengths to exploit a cross-site scripting vulnerability; the example described above is simplistic. An attacker can exploit any page that includes unescaped user input, but usually the attacker has to trick the user into displaying that page somehow. Attackers use many methods to accomplish this goal, such as fake pages that look like part of the site from which the attacker wishes to steal cookies, or embedded links in innocent-looking email messages.

It is generally not a good idea to allow users to embed HTML in any input accepted from them, but many sites allow simple tags in some input, such as those that enable bold or italics on text. Disallowing HTML altogether is the right solution in most cases, and it is the only solution that will guarantee that cross-site scripting will be prevented. Other common attempts at a solution, such as checking the HTTP Referer header on every request, do not work, because that header is easily forged.

To disallow HTML in user input, you can do one of the following:

  • Refuse to accept anything that looks as if it may be HTML

  • Escape the special characters that enable a browser to interpret data as HTML

Attempting to recognize HTML and refuse it can be error-prone, unless you only look for the use of the greater-than (>) and less-than (<) symbols. Trying to match tags that will not be allowed (i.e., a blacklist) is not a good idea because it is difficult to do, and future revisions of HTML are likely to introduce new tags. Instead, if you are going to allow some tags to pass through, you should take the whitelist approach and only allow tags that you know are safe.
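The simplest form of the "refuse" approach, looking only for the less-than and greater-than symbols, can be sketched as follows. The function name looks_like_html() is hypothetical and is not part of the recipe's code.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if the input contains '<' or '>' and
 * should therefore be refused outright, 0 otherwise.
 */
int looks_like_html(const char *input) {
  return strpbrk(input, "<>") != NULL;
}
```

The price of this simplicity is that harmless text such as "2 > 1" is refused as well, which is one reason escaping is often the more user-friendly choice.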

JavaScript code injection does not require a <script> tag; many other tags can contain JavaScript code as well. For example, most tags support attributes such as "onclick" and "onmouseover" that can contain JavaScript code.
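To illustrate why matching forbidden tags falls short, here is a deliberately naive blacklist check, shown only as a counterexample. The function name naive_contains_script_tag() is hypothetical; it flags input containing "<script" in any letter case and nothing else.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical, deliberately naive blacklist: flags input only if it
 * contains "<script" (case-insensitive).  Shown to illustrate the
 * weakness of blacklists, not for real use.
 */
int naive_contains_script_tag(const char *input) {
  static const char needle[] = "<script";
  size_t            i, j;

  for (i = 0;  input[i];  i++) {
    /* Compare needle against input starting at offset i.  A NUL in the
     * input never equals a needle character, so we stop before reading
     * past the end of the string.
     */
    for (j = 0;  needle[j];  j++)
      if (tolower((unsigned char)input[i + j]) != needle[j]) break;
    if (!needle[j]) return 1;
  }
  return 0;
}
```

Input such as <b onmouseover="alert(1)">hi</b> contains no <script> tag, so this check lets it through even though it carries JavaScript in an event-handler attribute.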

The following spc_escape_html() function will replace occurrences of special HTML characters with their escape sequences. For example, input that contains something like "<script>" will be replaced with "&lt;script&gt;", which no browser should ever interpret as HTML.

Our function will escape most HTML tags, but it will also allow some through. Those that it allows through are contained in a whitelist, and it will only allow them if the tags are used without any attributes. In addition, the a (anchor) tag will be allowed with a heavily restricted href attribute. The attribute's value must begin with "http://", and href must be the only attribute. The character set allowed in the attribute's value is also heavily restricted, which means that not all valid URLs will make it through. In particular, if the URL contains "#", "?", or "&", which are certainly valid and all have special meaning, the tag will not be allowed.

If you do not want to allow any HTML through at all, you can simply remove the call to spc_allow_tag() in spc_escape_html(), and force all possible HTML to be properly escaped. In many cases, this will actually be the behavior that you'll want.
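As a rough illustration of that stricter behavior, the following standalone sketch escapes every special character and has no whitelist at all. The name escape_all_html() is hypothetical, and the code is a simplified variant written for this example rather than the recipe's own function.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: escapes every '<', '>', '&', and '"' in input.
 * Returns a malloc()'d string that the caller must free(), or NULL if
 * memory cannot be allocated.
 */
char *escape_all_html(const char *input) {
  const char *c, *rep;
  char       *output, *ptr;
  size_t     outputlen = 0;

  /* First pass: compute the exact length of the escaped output. */
  for (c = input;  *c;  c++) {
    switch (*c) {
      case '<': case '>':  outputlen += 4; break; /* &lt; or &gt; */
      case '&':            outputlen += 5; break; /* &amp; */
      case '"':            outputlen += 6; break; /* &quot; */
      default:             outputlen += 1; break;
    }
  }

  if (!(output = ptr = malloc(outputlen + 1))) return NULL;

  /* Second pass: copy, substituting an entity for each special character. */
  for (c = input;  *c;  c++) {
    switch (*c) {
      case '<':  rep = "&lt;";   break;
      case '>':  rep = "&gt;";   break;
      case '&':  rep = "&amp;";  break;
      case '"':  rep = "&quot;"; break;
      default:   rep = NULL;     break;
    }
    if (rep) {
      size_t n = strlen(rep);
      memcpy(ptr, rep, n);
      ptr += n;
    } else {
      *ptr++ = *c;
    }
  }
  *ptr = '\0';
  return output;
}
```

With this sketch, even harmless formatting such as <b> is rendered literally in the browser, which is the trade-off for having no tag-parsing logic to get wrong.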

spc_escape_html() expects a C-style string containing the text to filter as its only argument. It returns a C-style string dynamically allocated with malloc(), which the caller is responsible for deallocating with free(). If memory cannot be allocated, it returns NULL.

#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strncasecmp() on POSIX systems */
#include <ctype.h>

/* These are HTML tags that do not take arguments.  We special-case the <a> tag
 * since it takes an argument.  We will allow the tag as-is, or we will allow a
 * closing tag (e.g., </p>).  Additionally, we process tags in a
 * case-insensitive way.  Only letters and numbers are allowed in tags we can
 * allow.  Note that we do a linear search of the tags.  A binary search is more
 * efficient (log n time instead of linear), but more complex to implement.
 * The efficiency hit shouldn't matter in practice.
 */
static const char *allowed_formatters[] = {
  "b", "big", "blink", "i", "s", "small", "strike", "sub", "sup", "tt", "u",
  "abbr", "acronym", "cite", "code", "del", "dfn", "em", "ins", "kbd", "samp",
  "strong", "var", "dir", "li", "dl", "dd", "dt", "menu", "ol", "ul", "hr",
  "br", "p", "h1", "h2", "h3", "h4", "h5", "h6", "center", "bdo", "blockquote",
  "nobr", "plaintext", "pre", "q", "spacer",
  /* include "a" here so that </a> will work */
  "a"
};

/* The cast avoids undefined behavior when char is signed and *p is negative. */
#define SKIP_WHITESPACE(p) while (isspace((unsigned char)*(p))) (p)++

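/* Validates the text that follows "<a".  Returns 1 only if it consists of a
 * single href attribute whose value starts with "http://" and uses the
 * restricted character set, followed by the closing '>'.
 */
static int spc_is_valid_link(const char *input) {
  static const char *href = "href";
  static const char *http = "http://";
  int               quoted_string = 0;

  /* At least one whitespace character must separate "a" from "href". */
  if (!isspace((unsigned char)*input)) return 0;
  SKIP_WHITESPACE(input);
  if (strncasecmp(href, input, strlen(href))) return 0;
  input += strlen(href);
  SKIP_WHITESPACE(input);
  if (*input++ != '=') return 0;
  SKIP_WHITESPACE(input);
  if (*input == '"') {
    quoted_string = 1;
    input++;
  }
  if (strncasecmp(http, input, strlen(http))) return 0;
  for (input += strlen(http);  *input && *input != '>';  input++) {
    switch (*input) {
      case '.': case '/': case '-': case '_':
        break;
      case '"':
        /* A closing quote may be followed only by whitespace and '>'. */
        if (!quoted_string) return 0;
        input++;
        SKIP_WHITESPACE(input);
        return (*input == '>');
      default:
        if (isspace((unsigned char)*input)) {
          /* Whitespace is tolerated inside a quoted value. */
          if (quoted_string) break;
          /* Unquoted value: whitespace ends it, and only '>' may follow. */
          SKIP_WHITESPACE(input);
          return (*input == '>');
        }
        if (!isalnum((unsigned char)*input)) return 0;
        break;
    }
  }
  /* We stopped at '>' or at the end of the string.  Reaching '>' is valid
   * only for an unquoted value; a quoted value must end with its quote.
   */
  return (*input && !quoted_string);
}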

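/* Returns 1 if the tag starting at input (just past the '<') may be passed
 * through unescaped, 0 otherwise.
 */
static int spc_allow_tag(const char *input) {
  size_t     i, len;
  const char *tag, *tmp;

  /* An opening <a> tag must carry a valid href.  Checking the character after
   * the 'a' keeps other tags that begin with 'a' (abbr, acronym) from being
   * mistaken for an anchor.
   */
  if ((*input == 'a' || *input == 'A') && !isalnum((unsigned char)input[1]))
    return spc_is_valid_link(input + 1);
  if (*input == '/') {
    input++;
    SKIP_WHITESPACE(input);
  }
  for (i = 0;  i < sizeof(allowed_formatters) / sizeof(allowed_formatters[0]);
       i++) {
    tag = (const char *)allowed_formatters[i];
    len = strlen(tag);
    if (strncasecmp(tag, input, len)) continue;
    /* The tag name must be followed by nothing but whitespace and '>'. */
    tmp = input + len;
    SKIP_WHITESPACE(tmp);
    if (*tmp == '>') return 1;
  }
  return 0;
}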

/* Note: This interface expects a C-style NUL-terminated string. */
char *spc_escape_html(const char *input) {
  char       *output, *ptr;
  size_t     outputlen = 0;
  const char *c;

  /* This is a worst-case length calculation */
  for (c = input;  *c;  c++) {
    switch (*c) {
      case '<':  outputlen += 4; break; /* &lt; */
      case '>':  outputlen += 4; break; /* &gt; */
      case '&':  outputlen += 5; break; /* &amp; */
      case '"':  outputlen += 6; break; /* &quot; */
      default:   outputlen += 1; break;
    }
  }

  if (!(output = ptr = (char *)malloc(outputlen + 1))) return NULL;
  for (c = input;  *c;  c++) {
    switch (*c) {
      case '<':
        if (!spc_allow_tag(c + 1)) {
          *ptr++ = '&';  *ptr++ = 'l';  *ptr++ = 't';  *ptr++ = ';';
        } else {
          /* Copy the whitelisted tag verbatim.  spc_allow_tag() has already
           * confirmed that a closing '>' is present.
           */
          do {
            *ptr++ = *c;
          } while (*++c != '>');
          *ptr++ = '>';
        }
        break;
      case '>':
        *ptr++ = '&';  *ptr++ = 'g';  *ptr++ = 't';  *ptr++ = ';';
        break;
      case '&':
        *ptr++ = '&';  *ptr++ = 'a';  *ptr++ = 'm';  *ptr++ = 'p';
        *ptr++ = ';';
        break;
      case '"':
        *ptr++ = '&';  *ptr++ = 'q';  *ptr++ = 'u';  *ptr++ = 'o';
        *ptr++ = 't';  *ptr++ = ';';
        break;
      default:
        *ptr++ = *c;
        break;
    }
  }
  *ptr = 0;
  return output;
}