18.3 Tokenizing Rules

The sendmail program views the text that makes up rules and addresses as being composed of individual tokens. Rules are tokenized (divided into individual parts) while the configuration file is being read and while they are being normalized. Addresses are tokenized at another time (as we'll show later), but the process is the same for both.

The text our.domain, for example, is composed of three tokens: our, a dot, and domain. Tokens are separated by special characters that are defined by the OperatorChars option, or by the $o macro prior to V8.7:

define(`confOPERATORS', `.:%@!^/[  ]+')  m4 configuration
O OperatorChars=.:%@!^/[  ]+         V8.7 and above
Do.:%@!^=/[  ]                       prior to V8.7

When any of these separation characters is recognized in text, it is considered an individual token. Any leftover text is then combined into the remaining tokens:

xxx@yyy;zzz    becomes  xxx  @   yyy;zzz

@ is defined to be a token, but ; is not. Therefore, the text xxx@yyy;zzz is divided into three tokens.

In addition to the characters in the OperatorChars option, sendmail also defines 10 tokenizing characters internally:

( )<>,;"\r\n

This internal list, and the list defined by the OperatorChars option, are combined into one master list that is used for all tokenizing. The previous example, when divided by using this master list, becomes five tokens instead of just three:

xxx@yyy;zzz    becomes  xxx  @   yyy  ;  zzz
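
The effect of the master list can be sketched in a few lines of Python. This is only an illustrative model, not sendmail's own code: the function name tokenize and its parameters are our invention, the operator characters are the V8.7 defaults shown in this section, and we assume that the space is handled separately (as described in 18.3.2), so it is left out of the internal list here.

```python
# A simplified model of tokenizing, not sendmail's actual implementation.
OPERATOR_CHARS = ".:%@!^/[]"       # the V8.7 defaults shown in this section
INTERNAL_CHARS = "()<>,;\"\r\n"    # internal list; the space is left out (see 18.3.2)


def tokenize(text, operator_chars=OPERATOR_CHARS, internal_chars=INTERNAL_CHARS):
    """Split text so that every separation character is a token of its own."""
    master = set(operator_chars) | set(internal_chars)
    tokens, current = [], ""
    for ch in text:
        if ch in master:
            if current:                # flush the text collected so far
                tokens.append(current)
                current = ""
            tokens.append(ch)          # the separator itself is a token
        else:
            current += ch
    if current:
        tokens.append(current)
    return tokens


print(tokenize("our.domain"))
print(tokenize("xxx@yyy;zzz", internal_chars=""))   # OperatorChars only: 3 tokens
print(tokenize("xxx@yyy;zzz"))                      # master list: 5 tokens
```

Passing an empty internal_chars reproduces the three-token split shown earlier, and the default call reproduces the five-token split.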

In rules, quotation marks can be used to override the meaning of tokenizing characters defined in the master list. For example:

"xxx@yyy";zzz    becomes  "xxx@yyy"  ;  zzz

Here, three tokens are produced because the @ appears inside quotation marks. Note that the quotation marks are retained.
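
A rough Python sketch of this quoting behavior follows. It is a model under stated assumptions rather than sendmail's real logic: the name tokenize_quoted is hypothetical, the separator list combines the V8.7 defaults with the internal list, and we assume that a quotation mark simply toggles a "quoted" state and stays attached to the text it surrounds. How sendmail treats unbalanced quotes is not modeled.

```python
# Illustrative model only; the separator set and function name are assumptions.
SEPARATORS = set(".:%@!^/[]()<>,;\r\n")


def tokenize_quoted(text, separators=SEPARATORS):
    """Tokenize text, ignoring separation characters inside quotation marks."""
    tokens, current, in_quotes = [], "", False
    for ch in text:
        if ch == '"':
            in_quotes = not in_quotes  # toggle the quoted state
            current += ch              # the quotation marks are retained
        elif ch in separators and not in_quotes:
            if current:
                tokens.append(current)
                current = ""
            tokens.append(ch)
        else:
            current += ch              # ordinary text, or a quoted separator
    if current:
        tokens.append(current)
    return tokens


print(tokenize_quoted('"xxx@yyy";zzz'))
print(tokenize_quoted('xxx@yyy;zzz'))
```

The first call yields three tokens with the quotation marks kept, and the second (unquoted) call yields five.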

Because the configuration file is read sequentially from start to finish, the OperatorChars option should be defined before any rules are declared. Note, however, that beginning with V8.7 sendmail, omitting this option causes the separation characters to default to:

. : % @ ! ^ / [ ]

Also note that beginning with V8.10, if you declare the OperatorChars option after any rule, the following warning will be produced:

Warning: OperatorChars is being redefined.
         It should only be set before ruleset definitions.

To prevent this warning, declare the OperatorChars option in your mc configuration file only with the confOPERATORS m4 macro:

define(`confOPERATORS', `.:%@!^/[  ]-')

Here, we have added a dash character (-) to the default list. Note that you should not define your own operator characters unless you first create and examine a configuration file with the default settings. That way you can be sure you always augment the actual defaults you find, and avoid the risk that you might miss new defaults in the future.

18.3.1 $-operators Are Tokens

As we progress into the details of rules, you will see that certain characters become operators when prefixed with a $ character. Operators cause sendmail to perform actions, such as looking for a match ($* is a wildcard operator) or replacing tokens with others by position ($1 is a replacement operator).

For tokenizing purposes, operators always divide one token from another, just as the characters in the master list did. For example:

xxx$*zzz    becomes  xxx  $*  zzz
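
To make this concrete, here is a small Python sketch that treats a $ and the character after it as a single token. It is a deliberate simplification: the name tokenize_with_operators is hypothetical, and longer operator forms (those that take a further name after the second character) are not modeled.

```python
# Illustrative model only: every "$x" pair is treated as one operator token.
SEPARATORS = set(".:%@!^/[]()<>,;\"\r\n")


def tokenize_with_operators(text, separators=SEPARATORS):
    """Tokenize text, keeping each $-operator together as a single token."""
    tokens, current = [], ""
    i = 0
    while i < len(text):
        ch = text[i]
        is_operator = ch == "$" and i + 1 < len(text)
        if is_operator or ch in separators:
            if current:                    # an operator or separator ends the text
                tokens.append(current)
                current = ""
            width = 2 if is_operator else 1
            tokens.append(text[i:i + width])
            i += width
        else:
            current += ch
            i += 1
    if current:
        tokens.append(current)
    return tokens


print(tokenize_with_operators("xxx$*zzz"))
print(tokenize_with_operators("$*@$+"))
```

The first call gives the three tokens shown above; the second shows two operators divided by an ordinary separation character.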

18.3.2 The Space Character Is Special

The space character is special for two reasons. First, although the space character is not in the master list, it always separates one token from another:

xxx zzz    becomes  xxx  zzz

Second, although the space character separates tokens, it is not itself a token. That is, in this example the seven characters on the left (the fourth is the space in the middle) become two tokens of three letters each, not three tokens. Therefore, the space character can be used inside the LHS or RHS of rules for improved clarity but does not itself become a token or change the meaning of the rule.
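
Both properties of the space can be seen in the following Python sketch, which uses a regular expression as a stand-in for sendmail's tokenizer. The pattern and the name tokenize are our own assumptions: it recognizes a $ plus one character as an operator, each separation character as a token, and any other run of non-whitespace as text. Whitespace matches nothing, so it divides tokens without becoming one.

```python
import re

# Illustrative model only. Alternatives, in order:
#   \$\S             a $-operator (simplified to "$" plus one character)
#   [...]            one separation character from the master list
#   [^...]+          a run of simple text (no whitespace, "$", or separators)
TOKEN_RE = re.compile(
    r'\$\S'
    r'|[.:%@!^/\[\]()<>,;"\r\n]'
    r'|[^\s$.:%@!^/\[\]()<>,;"]+'
)


def tokenize(text):
    """Return the tokens in text; spaces separate tokens but are dropped."""
    return TOKEN_RE.findall(text)


print(tokenize("xxx zzz"))
print(tokenize("$* @ $+") == tokenize("$*@$+"))
```

The second line of output is True: adding spaces around the @ changes nothing about the resulting tokens, which is why spaces can be used freely for readability between operators and separation characters.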

18.3.3 Pasting Addresses Back Together

After an address has passed through all the rules (and has been modified by rewriting), the tokens that form it are pasted back together to form a single string. The pasting process is very straightforward in that it mirrors the tokenizing process:

xxx  @  yyy   becomes   xxx@yyy

The only exception to this straightforward pasting process occurs when two adjoining tokens are both simple text. Simple text is anything other than the separation characters (defined by the OperatorChars option and internally by sendmail) or the operators (characters prefixed by a $ character). The xxx and yyy in the preceding example are both simple text.

When two tokens of simple text are pasted together, the character defined by the BlankSub option is inserted between them.[5] Usually, that option is defined as a dot, so two tokens of simple text would have a dot inserted between them when they are joined:

xxx  yyy   becomes   xxx.yyy

[5] In the old days (RFC733), usernames to the left of the @ could contain spaces. But Unix also uses spaces as command-line argument separators, so the BlankSub option was introduced.

Note that the improper use of a space character in the LHS or RHS of rules can lead to addresses that have a dot (or other character) inserted where one was not intended.
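
The pasting rule, including the BlankSub exception, can be modeled with a short Python sketch. As before, this is an approximation under assumptions and not sendmail's implementation: the names paste and is_simple_text are hypothetical, the separator set combines the V8.7 defaults with the internal list, and blank_sub defaults to a dot because that is the usual setting described above.

```python
# Illustrative model only; names and the separator set are assumptions.
SEPARATORS = set(".:%@!^/[]()<>,;\"\r\n")


def is_simple_text(token):
    """Simple text is neither a separation character nor a $-operator."""
    return token not in SEPARATORS and not token.startswith("$")


def paste(tokens, blank_sub="."):
    """Join tokens, inserting blank_sub between adjoining simple-text tokens."""
    pieces = []
    previous_was_simple = False
    for token in tokens:
        simple = is_simple_text(token)
        if simple and previous_was_simple:
            pieces.append(blank_sub)   # the BlankSub character fills the gap
        pieces.append(token)
        previous_was_simple = simple
    return "".join(pieces)


print(paste(["xxx", "@", "yyy"]))
print(paste(["xxx", "yyy"]))
print(paste(["xxx", "yyy"], blank_sub="_"))
```

The first call shows ordinary pasting, the second shows the dot inserted between two simple-text tokens, and the third shows the same join with a different, arbitrarily chosen substitute character.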
