sed and awk

So far, we’ve looked at some fairly simple examples of text processing. However, the real power of Solaris-style text processing lies with the advanced tools sed and awk. sed is a noninteractive, command-line editing program that can perform search-and-replace operations, as well as other kinds of edits, on very large files. awk, by contrast, is a complete text processing programming language with a C-like syntax, and it can be used in conjunction with sed to automate repetitive text processing and editing operations on large files. Such combined operations include double- and triple-spacing files, printing line numbers, left- and right-justifying text, extracting and substituting fields, and filtering on specific strings and pattern specifications. We’ll examine some of these applications below.

To start this example, we’ll create a set of customer address records stored in a flat text, tab-delimited database file called test.dat:

$ cat test.dat
Bloggs  Joe     24 City Rd      Richmond        VA      23227
Lee     Yat Sen 72 King St      Amherst MA      01002
Rowe    Sarah   3454 Capitol St Los Angeles     CA      90074
Sakura  Akira   1 Madison Ave   New York        NY      10017

This is a fairly common type of record, storing a customer’s surname, first name, street address, city, state, and ZIP code. For presentation, we can double-space the records in this file by redirecting the contents of the test.dat file through sed with the G command, which appends the (empty) hold space, preceded by a newline, to each line:

$ sed G < test.dat
Bloggs  Joe     24 City Rd      Richmond        VA      23227

Lee     Yat Sen 72 King St      Amherst MA      01002

Rowe    Sarah   3454 Capitol St Los Angeles     CA      90074

Sakura  Akira   1 Madison Ave   New York        NY      10017
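
The same idea extends naturally: each G appends one blank line, so chaining two G commands triple-spaces the output. A minimal sketch, using an inline two-line sample rather than test.dat:

```shell
# Each G command appends the (empty) hold space plus a newline,
# so 'G;G' adds two blank lines after every record.
printf 'line1\nline2\n' | sed 'G;G'
```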

The power of sed lies in its ability to be used in pipelines, so one operation can be combined with many others. For example, to insert double spacing and then remove it again, we simply invoke sed twice with the appropriate commands: the n command prints the current line and reads the next, which d then deletes, removing every second (blank) line:

$ sed G < test.dat | sed 'n;d'
Bloggs  Joe     24 City Rd      Richmond        VA      23227
Lee     Yat Sen 72 King St      Amherst MA      01002
Rowe    Sarah   3454 Capitol St Los Angeles     CA      90074
Sakura  Akira   1 Madison Ave   New York        NY      10017

When printing reports, you’ll probably need line numbers at some point to uniquely identify records. You can generate line numbers dynamically for display by using sed:

$ sed '/./=' test.dat | sed '/./N; s/\n/ /'
1 Bloggs        Joe     24 City Rd      Richmond        VA      23227
2 Lee   Yat Sen 72 King St      Amherst MA      01002
3 Rowe  Sarah   3454 Capitol St Los Angeles     CA      90074
4 Sakura        Akira   1 Madison Ave   New York        NY      10017
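
The same numbering can also be produced in a single awk pass, since awk tracks the current record number in its built-in NR variable. A brief sketch against the sample test.dat:

```shell
# NR holds the number of the record being processed, so line
# numbering needs no second sed pass.
awk '{print NR, $0}' test.dat
```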

For large files, it’s often useful to be able to count the number of lines. While the wc command can be used for this purpose, sed can also be used in situations where wc is not available:

$ cat test.dat | sed -n '$='
4
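
awk offers an equally compact line counter: in an END block, NR holds the total number of records read, which mirrors the output of wc -l. A minimal sketch:

```shell
# After the last record, NR equals the total line count.
awk 'END {print NR}' test.dat
```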

When you’re printing databases for display, you might want comments and titles left-justified, but each record indented by two spaces. This can be achieved by using sed:

$ cat test.dat | sed 's/^/  /'
  Bloggs        Joe     24 City Rd      Richmond        VA      23227
  Lee   Yat Sen 72 King St      Amherst                 MA      01002
  Rowe  Sarah   3454 Capitol St Los Angeles             CA      90074
  Sakura        Akira   1 Madison Ave   New York        NY      10017
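
To leave comments and titles left-justified while indenting the data records, the substitution can be restricted with a negated address. A sketch, assuming (hypothetically) that comment and title lines begin with a # character:

```shell
# The address /^#/ with ! negation applies the indentation only
# to lines that do NOT start with '#'.
printf '# Customer report\nBloggs\tJoe\n' | sed '/^#/!s/^/  /'
```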

Imagine that, due to some municipal reorganization, all cities currently located in MA were being reassigned to CT. sed would be the perfect tool to identify all instances of MA in the data file and replace them with CT:

$ cat test.dat | sed 's/MA/CT/g'
Bloggs  Joe     24 City Rd      Richmond        VA      23227
Lee     Yat Sen 72 King St      Amherst         CT      01002
Rowe    Sarah   3454 Capitol St Los Angeles     CA      90074
Sakura  Akira   1 Madison Ave   New York        NY      10017

If a data file has been entered as a first in, first out (FIFO) queue, then you’ll generally read records from the top of the file to the bottom. However, if the data file is to be treated as a last in, first out (LIFO) stack, then it would be useful to be able to reorder the records from the last to the first:

$ cat test.dat | sed '1!G;h;$!d'
Sakura  Akira   1 Madison Ave   New York        NY      10017
Rowe    Sarah   3454 Capitol St Los Angeles     CA      90074
Lee     Yat Sen 72 King St      Amherst MA      01002
Bloggs  Joe     24 City Rd      Richmond        VA      23227
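
The same reversal can be written more transparently in awk, at the cost of buffering the whole file in memory. A minimal sketch:

```shell
# Store each record in an array keyed by line number, then walk
# the array backwards in the END block.
awk '{line[NR] = $0} END {for (i = NR; i >= 1; i--) print line[i]}' test.dat
```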

Some data hiding applications require that data be encoded so that it is nontrivial for another application to detect the file’s contents. One way to foil such programs is to reverse the character strings that comprise each record, which can be achieved by using sed:

$ cat test.dat | sed '/\n/!G;s/\(.\)\(.*\n\)/&\2\1/;//D;s/.//'
72232   AV      dnomhciR        dR ytiC 42      eoJ     sggolB
20010   AM      tsrehmA tS gniK 27      neS taY eeL
47009   AC      selegnA soL     tS lotipaC 4543 haraS   ewoR
71001   YN      kroY weN        evA nosidaM 1   arikA   arukaS
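
The sed one-liner above is famously cryptic; the same character reversal is easier to follow in awk, where substr() can walk each record backwards. A sketch of that alternative:

```shell
# Build the reversed string one character at a time, reading the
# record from its last character back to its first.
awk '{ out = ""
       for (i = length($0); i >= 1; i--)
           out = out substr($0, i, 1)
       print out }' test.dat
```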

Some reporting applications might require that only the first line of a file be printed. Although the head command can be used for this purpose, sed’s q (quit) command achieves the same result:

$ sed q < test.dat
Bloggs  Joe     24 City Rd      Richmond        VA      23227

Alternatively, if a certain number of lines are to be printed, sed can be used to extract the first n lines; here, the first two lines are printed before the q command quits:

$ sed 2q < test.dat
Bloggs  Joe     24 City Rd      Richmond        VA      23227
Lee     Yat Sen 72 King St      Amherst MA      01002

The grep command is often used to detect strings within files. However, sed can also be used for this purpose, as shown in the following example where the string CA (representing California) is searched for:

$ cat test.dat | sed '/CA/!d'
Rowe    Sarah   3454 Capitol St Los Angeles     CA      90074

However, this is a fairly crude and inaccurate method, because CA might match a street address like “1 CALGARY Rd” or “23 Green CAPE.” Thus, it’s necessary to use the field extraction features of awk. In the following example, we use awk to extract and print the fifth column in the data file, representing the state:

$ cat test.dat | awk 'BEGIN {FS = "\t"}{print $5}'
VA
MA
CA
NY

Note that the tab character “\t” is specified as the field delimiter. Now, if we combine the field extraction capability of awk with the string searching facility of sed, we should be able to print out a list of all occurrences of the state CA:

$ cat test.dat | awk 'BEGIN {FS = "\t"}{print $5}' | sed '/CA/!d'
CA
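
In fact, awk can perform the exact-match filtering on its own, without a trailing sed stage: comparing the fifth field against the literal string "CA" avoids any chance of matching a street address. A minimal sketch:

```shell
# Compare the state field for exact equality rather than
# searching the whole record for the substring "CA".
awk 'BEGIN {FS = "\t"} $5 == "CA" {print $0}' test.dat
```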

Alternatively, we could simply count the number of records that contained CA in the State field:

$ cat test.dat | awk 'BEGIN {FS = "\t"}{print $5}' | sed '/CA/\!d' | sed -n '$='
1

When producing reports, it’s useful to be able to selectively display fields in a different order. For example, while surname is typically used as a primary key, and is generally the first field, most reports would display the first name before the surname, which can be achieved by using awk:

$ cat test.dat | awk 'BEGIN {FS = "\t"}{print $2,$1}'
Joe Bloggs
Yat Sen Lee
Sarah Rowe
Akira Sakura

It’s also possible to split such reordered fields across different lines and use different format specifiers. For example, the following script prints the first name and surname on one line and the state on the following line. Such code is the basis of many mail merge and bulk printing programs:

$ cat test.dat | awk 'BEGIN {FS = "\t"}{print $2,$1,"\n"$5}'
Joe Bloggs
VA
Yat Sen Lee
MA
Sarah Rowe
CA
Akira Sakura
NY
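
For more control over such mail-merge layouts, awk’s printf statement lets you place fields and newlines with explicit format specifiers; the example above can be rewritten as a sketch like this:

```shell
# printf gives explicit control over spacing and line breaks:
# name on one line, state on the next.
awk 'BEGIN {FS = "\t"} {printf "%s %s\n%s\n", $2, $1, $5}' test.dat
```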

Since awk is a complete programming language, it contains many common constructs, like if/then/else evaluation of logical states, which can be used to test business logic. In a mailing program, for example, the bounds of valid ZIP codes could be checked by determining whether the ZIP code lies within a valid range. The following routine treats a ZIP code as valid if its numeric value is less than 9999 and rejects it as invalid otherwise:

$ cat test.dat | awk 'BEGIN {FS = "\t"}{print $2,$1}{if($6<9999) {print "Valid zipcode"} else {print "Invalid zipcode"}}'
Joe Bloggs
Invalid zipcode
Yat Sen Lee
Valid zipcode
Sarah Rowe
Invalid zipcode
Akira Sakura
Invalid zipcode
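
A numeric bound is a weak test for ZIP codes, since a valid US ZIP code is exactly five digits (including leading zeros, which a numeric comparison discards). A stricter sketch uses a regular expression match instead; the five repeated character classes are used because POSIX awk does not guarantee {5} interval support:

```shell
# Match the ZIP field against exactly five digits; this keeps
# leading zeros significant, unlike a numeric comparison.
awk 'BEGIN {FS = "\t"}
     {if ($6 ~ /^[0-9][0-9][0-9][0-9][0-9]$/)
          print $6, "valid"
      else
          print $6, "invalid"}' test.dat
```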

sed

The standard options for sed are shown here:

  • -n Suppresses automatic printing of the pattern space.

  • -e script Adds the commands in script to the set of commands to be executed.

  • -f filename Executes the script contained in the file filename.

  • -V Displays the version number.

awk

The standard options for awk are shown here (the -W options are implementation-specific extensions):

  • -f filename Where filename is the name of the awk script file to process.

  • -F field Where field is the field separator.

  • -v x=y Where x is a variable and y is a value.

  • -W lint Turns on lint checking.

  • -W lint-old Uses old-style lint checking.

  • -W traditional Enforces traditional usage.

  • -W version Displays the version number.


