Recipe 7.14 Writing a Unix-Style Filter Program

7.14.1 Problem

You want to write a program that takes a list of filenames on the command line and reads from STDIN if no filenames were given. You'd like the user to be able to give the file "-" to indicate STDIN or "someprogram |" to indicate the output of another program. You might want your program to modify the files in place or to produce output based on its input.

7.14.2 Solution

Read lines with <>:

while (<>) {
    # do something with the line
}

7.14.3 Discussion

When you say:

while (<>) {
    # ...
}

Perl translates this into the following (except that the code as written here won't work, because ARGV has internal magic):

unshift(@ARGV, "-") unless @ARGV;
while ($ARGV = shift @ARGV) {
    unless (open(ARGV, $ARGV)) {
        warn "Can't open $ARGV: $!\n";
        next;
    }
    while (defined($_ = <ARGV>)) {
        # ...
    }
}

You can access ARGV and $ARGV inside the loop to read more from the filehandle or to find the filename currently being processed. Let's look at how this works.

7.14.3.1 Behavior

If the user supplies no arguments, Perl sets @ARGV to a single string, "-". This is shorthand for STDIN when opened for reading and STDOUT when opened for writing. It's also what lets the user of your program specify "-" as a filename on the command line to read from STDIN.

Next, the file-processing loop removes one argument at a time from @ARGV and copies the filename into the global variable $ARGV. If the file cannot be opened, Perl goes on to the next one. Otherwise, it processes a line at a time. When the file runs out, the loop goes back and opens the next one, repeating the process until @ARGV is exhausted.

The open statement didn't say open(ARGV, "<", $ARGV). There's no extra less-than sign supplied. This allows for interesting effects, like passing the string "gzip -dc file.gz |" as an argument, to make your program read the output of the command "gzip -dc file.gz". See Recipe 16.6 for more about this use of magic open.
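As a rough illustration of that magic open, the sketch below hands a command string to the <> loop. The helper name count_lines and the echo commands are invented for this example, and it assumes a Unix-like shell is available:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_lines (hypothetical helper): count the lines that <> reads from
# each source. A source may be a filename, "-", or a "command |" string.
sub count_lines {
    my @sources = @_;
    die "count_lines needs at least one source\n" unless @sources;
    local @ARGV = @sources;   # <> shifts names off @ARGV, so use a copy
    my %count;
    while (<>) {
        $count{$ARGV}++;      # $ARGV is the source currently being read
    }
    return %count;
}

# Assumes a Unix-like shell with echo. The trailing "|" makes Perl run
# the command and read its output instead of opening a file.
my %lines = count_lines("echo alpha; echo beta |");
print "$_ => $lines{$_}\n" for sort keys %lines;
```

Localizing @ARGV inside the helper keeps the caller's argument list intact, which matters if the rest of the program still wants to process it.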

You can change @ARGV before or inside the loop. Let's say you don't want the default behavior of reading from STDIN if there aren't any arguments; instead, you want it to default to all C or C++ source and header files in the current directory. Insert this line before you start processing <ARGV>:

@ARGV = glob("*.[Cch]") unless @ARGV;

Process options before the loop, either with one of the Getopt libraries described in Chapter 15 or manually:

# arg demo 1: Process optional -c flag
if (@ARGV && $ARGV[0] eq "-c") {
    $chop_first++;
    shift;
}

# arg demo 2: Process optional -NUMBER flag
if (@ARGV && $ARGV[0] =~ /^-(\d+)$/) {
    $columns = $1;
    shift;
}

# arg demo 3: Process clustering -a, -i, -n, or -u flags
while (@ARGV && $ARGV[0] =~ /^-(.+)/ && (shift, ($_ = $1), 1)) {
    next if /^$/;
    s/a// && (++$append,      redo);
    s/i// && (++$ignore_ints, redo);
    s/n// && (++$nostdout,    redo);
    s/u// && (++$unbuffer,    redo);
    die "usage: $0 [-ainu] [filenames] ...\n";
}
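For comparison, here is a minimal sketch of the same clustered -a, -i, -n, and -u flags handled by the core Getopt::Std module instead of by hand. The helper name parse_flags and the sample arguments are made up for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Std;    # core module; getopts() understands clustered flags

# parse_flags (hypothetical helper): split an argument list into a hash of
# the -a, -i, -n, and -u switches that were set, plus the leftover filenames.
sub parse_flags {
    local @ARGV = @_;                 # getopts() works on @ARGV
    my %opt;
    getopts("ainu", \%opt)
        or die "usage: $0 [-ainu] [filenames] ...\n";
    return (\%opt, [@ARGV]);          # what remains are the filenames
}

my ($opt, $files) = parse_flags("-ai", "-u", "notes.txt");
print "flags: ", join("", sort keys %$opt), "\n";
print "files: @$files\n";
```

In a real filter you would call getopts directly on @ARGV before the while (<>) loop, so that only filenames are left for <> to open.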

Other than its implicit looping over command-line arguments, <> is not special. The special variables controlling I/O still apply; see Chapter 8 for more on them. You can set $/ to set the line terminator, and $. contains the current line (record) number. If you undefine $/, you don't get the concatenated contents of all files at once; you get one complete file each time:

undef $/;             
while (<>) {  
    # $_ now has the complete contents of     
    # the file whose name is in $ARGV
}

If you localize $/, the old value is automatically restored when the enclosing block exits:

{                         # create block for local
    local $/;             # record separator now undef
    while (<>) {
        # do something; called functions still have
        # undeffed version of $/
    }
}                         # $/ restored here
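The two ideas combine naturally. The following sketch, which assumes nothing beyond core Perl, slurps each file whole while keeping the change to $/ confined to one subroutine. The name slurp_each and the scratch files are hypothetical:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# slurp_each (hypothetical helper): return a hash mapping each filename
# to the complete contents of that file.
sub slurp_each {
    my @files = @_;
    die "slurp_each needs at least one file\n" unless @files;
    local @ARGV = @files;
    local $/;                    # undef until this sub returns
    my %content;
    while (<>) {
        $content{$ARGV} = $_;    # one iteration per file, not per line
    }
    return %content;
}

# Build two small scratch files so the example runs on its own.
my @names;
for my $text ("one\ntwo\n", "three\n") {
    my ($fh, $name) = tempfile(UNLINK => 1);
    print $fh $text;
    close $fh or die "close $name: $!";
    push @names, $name;
}

my %content = slurp_each(@names);
printf "%s holds %d bytes\n", $_, length $content{$_} for @names;
```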

Because processing <ARGV> never explicitly closes filehandles, the record number in $. is not reset. If you don't like that, you can explicitly close the file yourself to reset $.:

while (<>) {  
    print "$ARGV:$.:$_";      
    close ARGV if eof;
}

The eof function defaults to checking the end-of-file status of the last file read. Since the last handle read was ARGV, eof reports whether we're at the end of the current file. If so, we close it and reset the $. variable. On the other hand, the special notation eof() with parentheses but no argument checks if we've reached the end of all files in the <ARGV> processing.
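To see the eof() form in action, consider this small sketch modeled on the example in Perl's own eof documentation. It inserts a rule just before the last line of the last file. The helper name mark_last_line and the scratch files are assumptions made for the demonstration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# mark_last_line (hypothetical helper): return every line from the given
# files, with a rule inserted just before the final line of the final file.
sub mark_last_line {
    my @files = @_;
    die "mark_last_line needs at least one file\n" unless @files;
    local @ARGV = @files;
    my @out;
    while (my $line = <>) {
        push @out, "-----\n" if eof();   # true only at the end of all input
        push @out, $line;
    }
    return @out;
}

# Build two small scratch files so the example runs on its own.
my @names;
for my $text ("a\nb\n", "c\nd\n") {
    my ($fh, $name) = tempfile(UNLINK => 1);
    print $fh $text;
    close $fh or die "close $name: $!";
    push @names, $name;
}

print mark_last_line(@names);
```

Had the test been a bare eof without parentheses, the rule would appear before the last line of every file rather than only the last one.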

7.14.3.2 Command-line options

Perl has command-line options, -n, -p, -a, and -i, to make writing filters and one-liners easier.

The -n option adds the while (<>) loop around your program text. It's normally used for filters like grep or programs that summarize the data they read. Example 7-2 shows such a filter written with an explicit loop.

Example 7-2. findlogin1

  #!/usr/bin/perl
  # findlogin1 - print all lines containing the string "login"
  while (<>) {                  # loop over files on command line
      print if /login/;
  }

The program in Example 7-2 could be written as shown in Example 7-3.

Example 7-3. findlogin2

  #!/usr/bin/perl -n
  # findlogin2 - print all lines containing the string "login"
  print if /login/;

You can combine the -n and -e options to run Perl code from the command line:

% perl -ne 'print if /login/'

The -p option is like -n but adds a print right before the end of the loop. It's normally used for programs that translate their input, such as the program shown in Example 7-4.

Example 7-4. lowercase1

  #!/usr/bin/perl
  # lowercase1 - turn all lines into lowercase
  while (<>) {                  # loop over lines of files on command line
      s/(\p{Letter})/\l$1/g;    # change all letters to lowercase
      print;
  }

The program in Example 7-4 could be written as shown in Example 7-5.

Example 7-5. lowercase2

  #!/usr/bin/perl -p
  # lowercase2 - turn all lines into lowercase
  s/(\p{Letter})/\l$1/g;        # change all letters to lowercase

Or it could be written from the command line as:

% perl -pe 's/(\p{Letter})/\l$1/g'

While using -n or -p for implicit input looping, the special label LINE: is silently created for the whole input loop. That means that from an inner loop, you can skip to the following input record by using next LINE (which is like awk's next statement), or go on to the next file by closing ARGV (which is like awk's nextfile statement). This is shown in Example 7-6.

Example 7-6. countchunks

  #!/usr/bin/perl -n
  # countchunks - count how many words are used.
  # skip comments, and bail on file if __END__
  # or __DATA__ seen.
  for (split ' ') {             # split on whitespace so a leading "#" survives
      next LINE if /^#/;
      close ARGV if /__(DATA|END)__/;
      $chunks++;
  }
  END { print "Found $chunks chunks\n" }
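Written out without -n, the same idea looks roughly like the sketch below, with the LINE label spelled out by hand. This is an illustration rather than a transcription of Example 7-6: the name count_chunks is hypothetical, it splits on whitespace so that a chunk can begin with "#", and it skips straight to the next input line after closing ARGV.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# count_chunks (hypothetical helper): count whitespace-separated chunks,
# ignoring "#" comments and everything after an __END__ or __DATA__ marker.
sub count_chunks {
    my @files = @_;
    die "count_chunks needs at least one file\n" unless @files;
    local @ARGV = @files;
    my $chunks = 0;
    LINE: while (<>) {                      # the loop and label -n supplies
        for my $chunk (split ' ') {
            next LINE if $chunk =~ /^#/;    # rest of the line is a comment
            if ($chunk =~ /^__(?:DATA|END)__$/) {
                close ARGV;                 # give up on the current file
                next LINE;                  # <> then opens the next one
            }
            $chunks++;
        }
    }
    return $chunks;
}

# Build two small scratch files so the example runs on its own.
my @names;
for my $text ("one two # three\nfour\n__END__\nfive six\n", "seven\n") {
    my ($fh, $name) = tempfile(UNLINK => 1);
    print $fh $text;
    close $fh or die "close $name: $!";
    push @names, $name;
}

print "Found ", count_chunks(@names), " chunks\n";
```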

The tcsh shell keeps a .history file in which every other line is a commented-out timestamp in Epoch seconds:

#+0894382237
less /etc/motd
#+0894382239
vi ~/.exrc
#+0894382242
date
#+0894382242
who
#+0894382288
telnet home

A simple one-liner can render that legible:

% perl -pe 's/^#\+(\d+)\n/localtime($1) . " "/e'
Tue May  5 09:30:37 1998 less /etc/motd
Tue May  5 09:30:39 1998 vi ~/.exrc
Tue May  5 09:30:42 1998 date
Tue May  5 09:30:42 1998 who
Tue May  5 09:31:28 1998 telnet home
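The substitution works because it replaces the timestamp line's newline as well, so the date ends up on the same output line as the command that follows. Here is a sketch of the same substitution applied to a list of lines; the helper name legible_history is invented, and gmtime stands in for localtime so that the result does not depend on the local time zone:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# legible_history (hypothetical helper): given the lines of a tcsh history
# file, return them with each timestamp line folded into its command line.
sub legible_history {
    my @lines = @_;
    for (@lines) {
        # Replacing the newline too is what joins the two lines.
        # gmtime is an assumption made here; the one-liner uses localtime.
        s/^#\+(\d+)\n/gmtime($1) . " "/e;
    }
    return join "", @lines;
}

print legible_history("#+0894382237\n", "less /etc/motd\n");
```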

The -i option edits each file named on the command line in place. It is described in Recipe 7.16 and is normally used in conjunction with -p.
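Inside a script, in-place editing can be switched on by setting the $^I variable, which is what -i sets behind the scenes. The following is only a sketch under that assumption; the helper name lowercase_in_place, the ".orig" backup suffix, and the scratch file are all made up for the demonstration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# lowercase_in_place (hypothetical helper): lowercase each named file in
# place, keeping the original under the same name plus ".orig".
sub lowercase_in_place {
    my @files = @_;
    return unless @files;
    local ($^I, @ARGV) = (".orig", @files);   # $^I is what -i sets
    while (<>) {
        print lc($_);   # with $^I set, print goes to the rewritten file
    }
    return;
}

# Build a scratch file so the example runs on its own.
my ($fh, $name) = tempfile(UNLINK => 1);
print $fh "Hello, World\n";
close $fh or die "close $name: $!";

lowercase_in_place($name);
unlink "$name.orig";    # tidy up the backup copy
```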