Recipe 8.27 Program: Flat File Indexes

It sometimes happens that you need to jump directly to a particular line number in a file, but the lines vary in length, so you can't use Recipe 8.12. Although you could start at the beginning of the file and read every line, this is inefficient if you're making multiple queries.

The solution is to build an index of fixed-width records, one per line. Each record contains the offset in the data file of the corresponding line. The subroutine in Example 8-10 takes the data file and a filehandle to send the index to. It reads a record at a time and prints the current offset in the file to the index, packed into a big-ending unsigned 32-bit integer; see the documentation for the pack function in perlfunc(1) for alternative storage types.

Example 8-10. build_index
  # usage: build_index(*DATA_HANDLE, *INDEX_HANDLE)
  sub build_index {
      my $data_file  = shift;
      my $index_file = shift;
      my $offset     = 0;
      while (<$data_file>) {
          print $index_file pack("N", $offset);
          $offset = tell($data_file);

Once you have an index, it becomes easy to read a particular line from the data file. Jump to that record in the index, read the offset, and jump to that position in the data file. The next line you read will be the one you want. Example 8-11 returns the line, given the line number and the index and data file handles.

Example 8-11. line_with_index
  # usage: line_with_index(*DATA_HANDLE, *INDEX_HANDLE, $LINE_NUMBER)
  # returns line or undef if LINE_NUMBER was out of range
  sub line_with_index {
      my $data_file   = shift;
      my $index_file  = shift;
      my $line_number = shift;
      my $size;               # size of an index entry
      my $i_offset;           # offset into the index of the entry
      my $entry;              # index entry
      my $d_offset;           # offset into the data file
      $size = length(pack("N", 0));
      $i_offset = $size * ($line_number-1);
      seek($index_file, $i_offset, 0) or return;
      read($index_file, $entry, $size);
      $d_offset = unpack("N", $entry);
      seek($data_file, $d_offset, 0);
      return scalar(<$data_file>);

To use these subroutines, just say:

open($fh,  "<", $file)         or die "Can't open $file for reading: $!\n";
open($index, "+>", $file.idx)
  or die "Can't open $file.idx for read/write: $!\n";
build_index($fh, $index);
$line = line_with_index($file, $index, $seeking);

The next step is to cache the index file between runs of the program, so you're not building it each time. This is shown in Example Recipe 8.12. Then add locking for concurrent access, and check time stamps on the files to see whether a change to the data file has made an old index file out of date.

Example 8-12. cache_line_index
  #!/usr/bin/perl -w
  # cache_line_index - index style
  # build_index and line_with_index from above
  @ARGV =  = 2 or
      die "usage: print_line FILENAME LINE_NUMBER";
  ($filename, $line_number) = @ARGV;
  open(my $orig, "<", $filename) 
          or die "Can't open $filename for reading: $!";
  # open the index and build it if necessary
  # there's a race condition here: two copies of this
  # program can notice there's no index for the file and
  # try to build one.  This would be easily solved with
  # locking
  $indexname = "$filename.index";
  sysopen(my $idx, $indexname, O_CREAT|O_RDWR)
           or die "Can't open $indexname for read/write: $!";
  build_index($orig, $idx) if -z $indexname;  # XXX: race unless lock
  $line = line_with_index($orig, $idx, $line_number);
  die "Didn't find line $line_number in $filename" unless defined $line;
  print $line;