4.3 SeqFileIO.pm: Sequence File Formats

Our primary interest is bioinformatics.Can we extend the FileIO class to handle biological sequence datafiles? For example, can a class be written that takes a GenBank file and writes the sequence out in FASTA format?

Using the technique of inheritance, in this section I present a module for a new class SeqFileIO that performs several basic functions on sequence files of various formats. When you call this module's read method, in addition to reading the file's contents and setting the name, date, and write mode of the file, it automatically determines the format of the sequence file, extracts the sequence, and when available, extracts the annotation, ID, and accession number. In addition, a set of put methods makes it easy to present the sequence and annotation in other formats.[1]

[1] Don Gilbert's readseq package (see http://iobio.bio.indiana.edu/soft/molbio/readseq and ftp://ftp.bio.indiana.edu/molbio/readseq/classic/src) is the classic program (written in C) for reading and writing multiple sequence file formats.

4.3.1 Analysis of SeqFileIO.pm

The first part of the module SeqFileIO.pm contains the block with definitions of the new, or revised, class data and methods.

The first thing you should notice is the use command:

use base ( "FileIO" );

This Perl command tells the current package SeqFileIO it's inheriting from the base class FileIO. Here's another statement that's often used for this purpose:

@ISA = ( "FileIO" );

The @ISA predefined variable tells a package that it "is a" version of some base class; it then can inherit methods from that base class. The use base directive sets the @ISA array to the base class(es), plus a little else besides. (Check perldoc base for the whole story.) Without getting bogged down in details use base works a little more robustly than just setting the @ISA array, so that's what I'll use here:

package SeqFileIO;

use base ( "FileIO" );

use strict;
use warnings;
#use vars '$AUTOLOAD';
use Carp;

# Class data and methods
{
    # A list of all attributes with defaults and read/write/required/noinit properties
    my %_attribute_properties = (
        _filename    => [ '',   'read.write.required'],
        _filedata    => [ [ ],  'read.write.noinit'],
        _date        => [ '',   'read.write.noinit'],
        _writemode   => [ '>',  'read.write.noinit'],
        _format      => [ '',   'read.write'],
        _sequence    => [ '',   'read.write'],
        _header      => [ '',   'read.write'],
        _id          => [ '',   'read.write'],
        _accession   => [ '',   'read.write'],
    );
        
    # Return a list of all attributes
    sub _all_attributes {
            keys %_attribute_properties;
    }

    # Check if a given property is set for a given attribute
    sub _permissions {
        my($self, $attribute, $permissions) = @_;
        $_attribute_properties{$attribute}[1] =~ /$permissions/;
    }

    # Return the default value for a given attribute
    sub _attribute_default {
            my($self, $attribute) = @_;
        $_attribute_properties{$attribute}[0];
    }

    my @_seqfileformats = qw(
        _raw
        _embl
        _fasta
        _gcg
        _genbank
        _pir
        _staden
    );
    sub isformat {
         my($self) = @_;

        for my $format (@_seqfileformats) {
            my $is_format = "is$format";

            if($self->$is_format) {
                return $format;
            }
        }
        return 'unknown';
    }

}
4.3.1.1 The power of inheritance

A comparison with the opening block in the base class FileIO is instructive. You'll see that I'm redefining the %_attribute_properties hash and the methods in the block that access the hash as closures. This is necessary to extend the new class to handle new attributes that relate specifically to sequence datafiles:

_format

The format of the sequence datafile, such as FASTA or GenBank

_sequence

The sequence data extracted from the sequence datafile as a scalar string

_header

The header part of the annotation; defined somewhat loosely in this module

_id

The ID of the sequence datafile, such as a gene name or other identifier

_accession

The accession number of the sequence datafile, when provided

You'll notice, in comparing this opening block with the opening block of the base class FileIO, that the methods and variables relating to counting the number of objects has been omitted in this new class.

Here's where you see the power of inheritance for the first time. When the code in the new SeqFileIO.pm module tries to call, say, the get_count method, and Perl sees that it's not defined in the module, it will go on a hunt for the method in the base class and use the definition it finds there. On the other hand, if it finds the method in SeqFileIO.pm, it just uses that; if the get_count method appeared in the base class but is redefined in SeqFileIO.pm, it uses that redefinition and ignores the older definition in the base class.

So, you don't have to rewrite methods in the base class(es); you can just call them. Perl not only finds them, it also calls them as if they were defined in the new class. For example, ref($self) returns SeqFileIO, not FileIO, without regard to whether the method is defined in the new class or inherited from the old class.

Finally, there are some new definitions in this new version of the opening block. An array @_seqfileformats lists the sequence file formats the module knows about, and a method isformat calls the methods associated with each format (such as is_fasta defined later in the module) until it either finds what format the file is in or returns unknown.

The @_seqfileformats array uses the qw operator. This splits the words on whitespace (including newlines) and returns a list of quoted words. It's a convenience for giving lists of words without having to quote each one or separate them by commas. (Check the perlop manpage for the section on "RegExp Quote-Like Operators" for all the variations and alternative quoting operators.)

Following the opening block, notice that there is no new method defined in this class. The simple, bare-bones new method from the FileIO class serves just as well for this new class; thanks to the inheritance mechanism, there's no need to write a new one. A program that calls SeqFileIO->new will use the same method from the FileIO class, but the class name will be properly set to SeqFileIO for the new object created.

4.3.1.2 A new read method

The next part of the program has a rewrite of the read method. As a result, the read method from the parent FileIO class is being overridden:

# Called from object, e.g. $obj->read(  );
sub read {
    my ($self, %arg) = @_;

    # Set attributes
    foreach my $attribute ($self->_all_attributes(  )) {
        # E.g. attribute = "_filename",  argument = "filename"
        my($argument) = ($attribute =~ /^_(.*)/);

        # If explicitly given
        if (exists $arg{$argument}) {
            # If initialization is not allowed
            if($self->_permissions($attribute, 'noinit')) {
                croak("Cannot set $argument from read: use set_$argument");
            }
            $self->{$attribute} = $arg{$argument};
        # If not given, but required
        }elsif($self->_permissions($attribute, 'required')) {
            croak("No $argument attribute as required");
        # Set to the default
        }else{
            $self->{$attribute} = $self->_attribute_default($attribute);
        }
    }

    # Read file data
    unless( open( FileIOFH, $self->{_filename} ) ) {
        croak("Cannot open file " .  $self->{_filename} );
    }
    $self->{'_filedata'} = [ <FileIOFH> ];
    $self->{'_date'} = localtime((stat FileIOFH)[9]);
    $self->{'_format'} = $self->isformat;
    my $parsemethod = 'parse' . $self->{'_format'};
    $self->$parsemethod;

    close(FileIOFH);
}

The new read method just shown is almost exactly the same as the read function in the FileIO class. There are only three new lines of code, just before the end of the subroutine, preceding the call to the close function:

$self->{'_format'} = $self->isformat;
my $parsemethod = 'parse' . $self->{'_format'};
$self->$parsemethod;

These three new lines initialize the new attributes that were defined for this class. First, a call to isformat determines the format of the file and sets the _format attribute appropriately. The appropriate parse_ method name is then constructed, and finally, a call to that method is made. As you're about to see, the parse_ methods extract the sequence, header, ID, and accession number (or as many of these as possible) from the file data and set the object's attributes accordingly.

So, by the end of the read method, the file data has been read into the object, the format determined, and the interesting parts of the data (such as the sequence data) parsed out.

4.3.2 New Methods: is, parse, and put

The rest of the module consists of three groups of methods that handle the different sequence file formats. These methods didn't appear in the more simple and generic FileIO module and must be defined here:

is_

Tests to see if data is in a particular sequence file format

parse_

Parses out the sequence, and when possible the header, ID, and accession number

put_

Takes the sequence attribute, and when available the header, ID, and accession number attributes, and writes them in the sequence file format named in the format attribute

4.3.2.1 is_ methods

The first group of methods tests an object's file data to see if it conforms to a particular sequence file format:

sub is_raw {
  my($self) = @_;

  my $seq = join('', @{$self->{_filedata}} );
  ($seq =~ /^[ACGNT\s]+$/) ? return 'raw' : 0;
}

sub is_embl {
  my($self) = @_;

  my($begin,$seq,$end) = (0,0,0);

  foreach( @{$self->{_filedata}} ) {
    /^ID\s/ && $begin++;
    /^SQ\s/ && $seq++;
    /^\/\// && $end++;
    (($begin =  = 1) && ($seq =  = 1) && ($end =  = 1)) && return 'embl';
  }
  return;
}

sub is_fasta {
  my($self) = @_;

  my($flag) = 0;

  for(@{$self->{_filedata}}) {
    #This to avoid confusion with Primer, which can have input beginning ">"
    /^\*seq.*:/i && ($flag = 0) && last;
    if( /^>/ && $flag =  = 1) {
      last;
    }elsif( /^>/ && $flag =  = 0) {
      $flag = 1;
    }elsif( (! /^>/) && $flag =  = 0) { #first line must start with ">"
      last;
    }
  }
  $flag ? return 'fasta' : return;
}

sub is_gcg {
  my($self) = @_;

  my($i,$j) = (0,0);

  for(@{$self->{_filedata}}) {
    /^\s*$/ && next;
    /Length:.*Check:/ && ($i += 1);
    /^\s*\d+\s*[a-zA-Z\s]+/ && ($j += 1);
    ($i =  = 1) && ($j =  = 1) && return('gcg');
  }
  return;
}


sub is_genbank {
  my($self) = @_;

  my $Features = 0;

  for(@{$self->{_filedata}}) {
    /^LOCUS/ && ($Features += 1);
    /^DEFINITION/ && ($Features += 2);
    /^ACCESSION/ && ($Features += 4);
    /^ORIGIN/ &&  ($Features += 8);
    /^\/\// && ($Features += 16);
    ($Features =  = 31) && return 'genbank';
  }
  return;
}

sub is_pir {
  my($self) = @_;

  my($ent,$ti,$date,$org,$ref,$sum,$seq,$end) = (0,0,0,0,0,0,0,0);

  for(@{$self->{_filedata}}) {
    /ENTRY/ && $ent++;
    /TITLE/ && $ti++;
    /DATE/ && $date++;
    /ORGANISM/ && $org++;
    /REFERENCE/ && $ref++;
    /SUMMARY/ && $sum++;
    /SEQUENCE/ && $seq++;
    /\/\/\// && $end++;
    $ent =  = 1 && $ti =  = 1 && $date >= 1 && $org >= 1
      && $ref >= 1 && $sum =  = 1 && $seq =  = 1 && $end =  = 1
      && return 'pir';
  }
  return;
}

sub is_staden {
  my($self) = @_;
  for(@{$self->{_filedata}}) {
    /<-+([^-]*)-+>/ && return 'staden';
  }
  0;
}

is_ methods are designed to be fast. They don't check to see if the file conforms to the official file format in every detail. Instead, they look to see if, given that the file is supposed to be a sequence file, there is a good indication that it is a particular file format. In other words, these methods perform heuristic, not rigorous, tests. If they are well conceived, these methods correctly identify the different formats without confusion and with a minimum of code and computation. However, they don't ensure that a format is conforming in every respect to the official format definition. (See the exercises for other approaches to file format recognition.)

Also, note that these are fairly simple tests; for example, the is_raw method doesn't check for the IUB ambiguity codes but only the four bases A, C, G, T, plus N for an undetermined base. Furthermore, some software systems have more than one format, for which only one format is shown here, such as PIR and GCG. The bottom line is that this code does what it is intended to do, but not more. It is, however, fairly short and easy to modify and extend, and I encourage you to try your hand at doing so in the exercises.

An interesting Perl note: the "leaning toothpick" regular expression notation for three forward slashes /// is a bit forbidding because each forward slash needs a backslash escape in front of it:

    /\/\/\// && $end++;

An alternative would be to use m and a separator other than /, which leads to more readable code:

    m!///! && $end++;

One more interesting Perl note: in this code, I return a false value from a subroutine like so:

return;

This is a bit confusing because unless you already know or check in the documentation for return, it's not clear what's happening. In a scalar context, return; returns the undefined value undef; in a list context, return; returns an empty list. So, no matter what context the subroutine is called from, a valid value is returned.

4.3.2.2 put_ methods

The next group of put_ methods returns an object's sequence and annotation data in a particular sequence file format:

sub put_raw {
  my($self) = @_;

  my($out);
  ($out = $self->{_sequence}) =~ tr/a-z/A-Z/;
  return($out);
}

sub put_embl {
  my($self) = @_;

  my(@out,$tmp,$len,$i,$j,$a,$c,$g,$t,$o);

  $len = length($self->{_sequence});
  $a=($self->{_sequence} =~ tr/Aa//);
  $c=($self->{_sequence} =~ tr/Cc//);
  $g=($self->{_sequence} =~ tr/Gg//);
  $t=($self->{_sequence} =~ tr/Tt//);
  $o=($len - $a - $c - $g - $t);
  $i=0;
  $out[$i++] = sprintf("ID   %s %s\n",$self->{_header}, $self->{_id} );
  $out[$i++] = "XX\n";
  $out[$i++] = sprintf("SQ   sequence %d BP; %d A; %d C; %d G; %d T; %d other;\n",
              $len, $a, $c, $g, $t, $o);
  for($j = 0 ; $j < $len ; ) {
      $out[$i] .= sprintf("%s",substr($self->{_sequence},$j,10));
      $j += 10;
      if( $j < $len && $j % 60 != 0 ) {
        $out[$i] .= " ";
      }elsif ($j % 60 =  = 0 ) {
        $out[$i++] .= "\n";
      }
  }
  if($j % 60 != 0 ) {
    $out[$i++] .= "\n";
  }
  $out[$i] = "//\n";
  return @out;
}

sub put_fasta {
  my($self) = @_;

  my(@out,$len,$i,$j);

  $len = length($self->{_sequence});
  $i = 0;
  $out[$i++] = "> " . $self->{_header} . "\n";
  for($j=0; $j<$len ; $j += 50) {
    $out[$i++]=sprintf("%.50s\n",substr($self->{_sequence},$j,50));
  }
  return @out;
}

sub put_gcg {
  my($self) = @_;

  my(@out,$len,$i,$j,$cnt,$sum);
  $len = length($self->{_sequence});

  #calculate Checksum
  for($i=0; $i<$len ;$i++) {
    $cnt++;
    $sum += $cnt * ord(substr($self->{_sequence},$i,1));
    ($cnt =  = 57)&& ($cnt=0);
  }
  $sum %= 10000;

  $i = 0;             
  $out[$i++] = sprintf("%s\n",$self->{_header});
  $out[$i++] = sprintf("    %s Length: %d (today)  Check: %d  ..\n", $self->{_id},
                      $len, $sum);
  for($j = 0 ; $j < $len ; ) {
      if( $j % 50 =  = 0) {
        $out[$i] = sprintf("%8d  ",$j+1);
      }
      $out[$i] .= sprintf("%s",substr($self->{_sequence},$j,10));
      $j += 10;
      if( $j < $len && $j % 50 != 0 ) {
        $out[$i] .= " ";
      }elsif ($j % 50 =  = 0 ) {
        $out[$i++] .= "\n";
      }
  }
  if($j % 50 != 0 ) {
    $out[$i] .= "\n";
  }
  return @out;
}

sub put_genbank {
  my($self) = @_;

  my(@out,$len,$i,$j,$cnt,$sum);
  my($seq) = $self->{_sequence};

  $seq =~ tr/A-Z/a-z/;
  $len = length($seq);
  for($i=0; $i<$len ;$i++) {
    $cnt++;
    $sum += $cnt * ord(substr($seq,$i,1));
    ($cnt =  = 57) && ($cnt=0);
  }
  $sum %= 10000;
  $i = 0;
  $out[$i++] = sprintf("LOCUS       %s       %d bp\n",$self->{_id}, $len);
  $out[$i++] = sprintf("DEFINITION  %s , %d bases, %d sum.\n", $self->{_header},
                       $len, $sum);
  $out[$i++] = sprintf("ACCESSION  %s\n", $self->{_accession}, );
  $out[$i++] = sprintf("ORIGIN\n");
  for($j = 0 ; $j < $len ; ) {
      if( $j % 60 =  = 0) {
        $out[$i] = sprintf("%8d  ",$j+1);
      }
      $out[$i] .= sprintf("%s",substr($seq,$j,10));
      $j += 10;
      if( $j < $len && $j % 60 != 0 ) {
        $out[$i] .= " ";
      }elsif($j % 60 =  = 0 ) {
        $out[$i++] .= "\n";
      }
  }
  if($j % 60 != 0 ) {
    $out[$i] .= "\n";
    ++i;
  }
  $out[$i] = "//\n";
  return @out;
}

sub put_pir {
  my($self) = @_;

  my($seq) = $self->{_sequence};
  my(@out,$len,$i,$j,$cnt,$sum);
  $len = length($seq);
  for($i=0; $i<$len ;$i++) {
    $cnt++;
    $sum += $cnt * ord(substr($seq,$i,1));
    ($cnt=  =57) && ($cnt=0);
  }
  $sum %= 10000;
  $i = 0;
  $out[$i++] = sprintf("ENTRY           %s\n",$self->{_id});
  $out[$i++] = sprintf("TITLE           %s\n",$self->{_header});
  #JDT ACCESSION out if defined
  $out[$i++] = sprintf("DATE            %s\n",'');
  $out[$i++] = sprintf("REFERENCE       %s\n",'');
  $out[$i++] = sprintf("SUMMARY #Molecular-weight %d  #Length %d  #Checksum %d\
                      n",0,$len,$sum);
  $out[$i++] = sprintf("SEQUENCE\n");
  $out[$i++] = sprintf("              5      10      15      20      25      30\n");
  for($j=1; $seq && $j < $len ; $j += 30) {
    $out[$i++] = sprintf("%7d ",$j);
    $out[$i++] = sprintf("%s\n", join(' ',split(//,substr($seq, $j - 1,length($seq) 
                      < 30 ? length($seq) : 30))) );
  }
  $out[$i++] = sprintf("///\n");
  return @out;
}


sub put_staden {
  my($self) = @_;

  my($seq) = $self->{_sequence};
  my($i,$j,$len,@out);

  $i = 0;
  $len = length($self->{_sequence});
  $out[$i] = ";\<------------------\>\n";
  substr($out[$i],int((20-length($self->{_id}))/2),length($self->{_id})) = $self->
                          {_id};
  $i++;
  for($j=0; $j<$len ; $j+=60) {
    $out[$i++]=sprintf("%s\n",substr($self->{_sequence},$j,60));
  }
  return @out;
}

The put_ methods are, by necessity, more cognizant of the detailed rules of a particular file format.

Note the Perl function ord in put_gcg. This built-in function gives the numeric value of a character. Other Perl functions such as sprintf and substr can be reviewed as necessary by typing perldoc -f substr (for example) or by visiting the http://www.perdoc.com web page.

4.3.2.3 parse_ methods

parse_, the third and final group of methods, parses the contents of a file in a particular format, extracting the sequence and, if possible, some additional header information.

sub parse_raw {
  my($self) = @_;

## Header and ID should be set in calling program after this 
  my($seq);
  
  $seq = join('',@{$self->{_filedata}});
  if( ($seq =~ /^([acgntACGNT\-\s]+)$/)) {
    ($self->{_sequence} = $seq) =~ s/\s//g;
  }else{
    carp("parse_raw failed");
  }
}


sub parse_embl {
  my($self) = @_;

  my($begin,$seq,$end,$count) = (0,0,0,0);
  my($sequence,$head,$acc,$id);

  for(@{$self->{_filedata}}) {
    ++$count;
    if(/^ID/) {
      $begin++;
      /^ID\s*(.*\S)\s*/ && ($id = ($head = $1)) =~ s/\s.*//;
    }elsif(/^SQ\s/){
      $seq++;
    }elsif(/^\/\//){
      $end++;
    }elsif($seq =  = 1){
      $sequence .= $_;
    }elsif(/^AC\s*(.*(;|\S)).*/){ #put this here - AC could be sequence
      $acc .= $1;
    }
    if($begin =  = 1 && $seq =  = 1 && $end =  = 1) {
      $sequence =~ tr/a-zA-Z//cd;
      $sequence =~ tr/a-z/A-Z/;
      $self->{_sequence} = $sequence;
      $self->{_header} = $head; 
      $self->{_id} = $id;
      $self->{_accession} = $acc;
      return 1;
    }
  }
  return;
}

sub parse_fasta {
  my($self) = @_;

  my($flag,$count) = (0,0);
  my($seq,$head,$id);

  for(@{$self->{_filedata}}) {
    #avoid confusion with Primer, which can have input beginning ">"
    /^\*seq.*:/i && ($flag =  = 0) && last;

    if(/^>/ && $flag =  = 1) { 
      last;
    }elsif(/^>/ && $flag =  = 0){
      /^>\s*(.*\S)\s*/ && ($id=($head=$1)) =~ s/\s.*//;
      $flag=1;
    }elsif( (! /^>/) && $flag =  = 1) { 
      $seq .= $_;
    }elsif( (! /^>/) && $flag =  = 0) {
       last;
    }
    ++$count;
  }
  if($flag) {
    $seq =~ tr/a-zA-Z-//cd;
    $seq =~ tr/a-z/A-Z/;

    $self->{_sequence} = $seq;
    $self->{_header} = $head;
    $self->{_id} = $id;
  }
}

sub parse_gcg {
  my($self) = @_;

  my($seq,$head,$id);
  my($count,$flag) = (0,0);

  for(@{$self->{_filedata}}) {
    if(/^\s*$/) { 
      ;
    }elsif($flag =  = 0 && /Length:.*Check:/){
      /^\s*(\S+).*Length:.*Check:/;
      $flag=1;
      ($id=$1) =~ s/\s.*//;
    }elsif($flag =  = 0 && /^\S/) {
      ($head = $_) =~ s/\n//; 
    }elsif($flag =  = 1 && /^\s*[^\d\s]/) {
      last;
    }elsif($flag =  = 1 && /^\s*\d+\s*[a-zA-Z \t]+$/) {
       $seq .= $_;
    }
    $count++;
  }
  $seq =~ tr/a-zA-Z//cd;
  $seq =~ tr/a-z/A-Z/;
  $head = $id unless $head;

  $self->{_sequence} = $seq;
  $self->{_header} = $head;
  $self->{_id} = $id;

  return 1;
}
#
#
sub parse_genbank {
  my($self) = @_;

  my($count,$features,$flag,$seqflag) = (0,0,0,0);
  my($seq,$head,$id,$acc);

  for(@{$self->{_filedata}}) {
    if( /^LOCUS/ && $flag =  = 0 ) {
      /^LOCUS\s*(.*\S)\s*$/;
      ($id=($head=$1)) =~ s/\s.*//;
      $features += 1;
      $flag = 1;
    }elsif( /^DEFINITION\s*(.*)/ && $flag =  = 1) {
      $head .= " $1";
      $features += 2;
    }elsif( /^ACCESSION/ && $flag =  = 1 ) {
      /^ACCESSION\s*(.*\S)\s*$/;
      $head .= " ".($acc=$1);
      $features += 4;
    }elsif( /^ORIGIN/ ) {
      $seqflag = 1;
      $features += 8;
    }elsif( /^\/\// ) {
      $features += 16;
    }elsif( $seqflag =  = 0 ) { 
      ; 
    }elsif($seqflag =  = 1) { 
      $seq .= $_;
    }
    ++$count;
    if($features =  = 31) {
      $seq =~ tr/a-zA-Z//cd;
      $seq =~ tr/a-z/A-Z/;

      $self->{_sequence} = $seq;
      $self->{_header} = $head;
      $self->{_id} = $id;
      $self->{_accession} = $acc;

      return 1;
    }
  }
  return;
}

sub parse_pir {
  my($self) = @_;

  my($begin,$tit,$date,$organism,$ref,$summary,$seq,$end,$count) = 
                     (0,0,0,0,0,0,0,0,0);
  my($flag,$seqflag) = (0,0);
  my($sequence,$header,$id,$acc);

  for(@{$self->{_filedata}}) {
    ++$count;
    if( /^ENTRY\s*(.*\S)\s*$/ && $flag =  = 0 ) {
      $header=$1;
      $flag=1;
      $begin++;
    }elsif( /^TITLE\s*(.*\S)\s*$/ && $flag =  = 1 ) {
      $header .= $1;
      $tit++;
    }elsif( /ORGANISM/ ) {
      $organism++;
    }elsif( /^ACCESSIONS\s*(.*\S)\s*$/ && $flag =  = 1 ) {
      ($id=($acc=$1)) =~ s/\s*//;
    }elsif( /DATE/ ) {
      $date++;
    }elsif( /REFERENCE/ ) {
      $ref++;
    }elsif( /SUMMARY/ ) {
      $summary++;
    }elsif( /^SEQUENCE/ ) {
      $seqflag = 1;
      $seq++;
    }elsif( /^\/\/\// && $flag =  = 1 ) {
      $end++;
    }elsif( $seqflag =  = 0) {
      next;
    }elsif( $seqflag =  = 1 && $flag =  = 1) {
      $sequence .= $_;
    }
    if( $begin =  = 1 && $tit =  = 1 && $date >= 1 && $organism >= 1
        && $ref >= 1 && $summary =  = 1 && $seq =  = 1 && $end =  = 1 
      ) {
      $sequence =~ tr/a-zA-Z//cd;
      $sequence =~ tr/a-z/A-Z/;

      $self->{_sequence} = $seq;
      $self->{_header} = $header;
      $self->{_id} = $id;
      $self->{_accession} = $acc;

      return 1;
    }
  }
  return;
}

sub parse_staden {
  my($self) = @_;

  my($flag,$count) = (0,0);
  my($seq,$head,$id);
  for(@{$self->{_filedata}}) {
    if( /<---*\s*(.*[^-\s])\s*-*--->(.*)/ && $flag =  = 0 ) {
      $id = $head = $1;
      $seq .= $2;
      $flag = 1;
    }elsif( /<---*(.*)-*--->/ && $flag =  = 1 ) {
      $count--;
      last;
    }elsif( $flag =  = 1 ) {
      $seq .= $_;
    }
    ++$count;
  }
  if( $flag ) {
    $seq =~ s/-/N/g;
    $seq =~ tr/a-zA-Z-//cd;
    $seq =~ tr/a-z/A-Z/;

    $self->{_sequence} = $seq;
    $self->{_header} = $head;
    $self->{_id} = $id;

    return 1;
  }
  return;
}

1;

That's the end of the SeqFileIO.pm module that defines the SeqFileIO class.

You can see that to add the capability to handle a new sequence file format, you simply have to write new is_, put_, and parse_ methods, and add the name of the new format to the @_seqfiletypes array. So, extending this software to handle more sequence file formats is relatively easy. (See the exercises.)

To end this chapter, here is a small test program testSeqFileIO that exercises the main parts of the SeqFileIO.pm module, followed by its output. See the exercises for a discussion of ways to improve this test.

4.3.3 Testing SeqFileIO.pm

Here is the program testSeqFileIO to test the SeqFileIO module:

#!/usr/bin/perl

use strict;
use warnings;
use lib "/home/tisdall/MasteringPerlBio/development/lib";

use SeqFileIO;

#
# First test basic FileIO operations
#  (plus filetype attribute)
#

my $obj = SeqFileIO->new(  );

$obj->read(
  filename => 'file1.txt'
);

print "The file name is ", $obj->get_filename, "\n";
print "The contents of the file are:\n", $obj->get_filedata;
print "\nThe date of the file is ", $obj->get_date, "\n";
print "The filetype of the file is ", $obj->get_filetype, "\n";

$obj->set_date('today');
print "The reset date of the file is ", $obj->get_date, "\n";

print "The write mode of the file is ", $obj->get_writemode, "\n";

print "\nResetting the data and filename\n";
my @newdata = ("line1\n", "line2\n");
$obj->set_filedata( \@newdata );

print "Writing a new file \"file2.txt\"\n";
$obj->write(filename => 'file2.txt');

print "Appending to the new file \"file2.txt\"\n";
$obj->write(filename => 'file2.txt', writemode => '>>');

print "Reading and printing the data from \"file2.txt\":\n";

my $file2 = SeqFileIO->new(  );

$file2->read(
  filename => 'file2.txt'
);

print "The file name is ", $file2->get_filename, "\n";
print "The contents of the file are:\n", $file2->get_filedata;
print "The filetype of the file is ", $file2->get_filetype, "\n";

print <<'HEADER';

##########################################
#
# Test file format recognizing and reading
#
##########################################

HEADER

my $genbank = SeqFileIO->new(  );
$genbank->read(
  filename => 'record.gb'
);
print "The file name is ", $genbank->get_filename, "\n";
print "\nThe date of the file is ", $genbank->get_date, "\n";
print "The filetype of the file is ", $genbank->get_filetype, "\n";
print "The contents of the file are:\n", $genbank->get_filedata;

print "\n####################\n####################\n####################\n";

my $raw = SeqFileIO->new(  );
$raw->read(
  filename => 'record.raw'
);
print "The file name is ", $raw->get_filename, "\n";
print "\nThe date of the file is ", $raw->get_date, "\n";
print "The filetype of the file is ", $raw->get_filetype, "\n";
print "The contents of the file are:\n", $raw->get_filedata;

print "\n####################\n####################\n####################\n";

my $embl = SeqFileIO->new(  );
$embl->read(
  filename => 'record.embl'
);
print "The file name is ", $embl->get_filename, "\n";
print "\nThe date of the file is ", $embl->get_date, "\n";
print "The filetype of the file is ", $embl->get_filetype, "\n";
print "The contents of the file are:\n", $embl->get_filedata;

print "\n####################\n####################\n####################\n";

my $fasta = SeqFileIO->new(  );
$fasta->read(
  filename => 'record.fasta'
);
print "The file name is ", $fasta->get_filename, "\n";
print "\nThe date of the file is ", $fasta->get_date, "\n";
print "The filetype of the file is ", $fasta->get_filetype, "\n";
print "The contents of the file are:\n", $fasta->get_filedata;

print "\n####################\n####################\n####################\n";

my $gcg = SeqFileIO->new(  );
$gcg->read(
  filename => 'record.gcg'
);
print "The file name is ", $gcg->get_filename, "\n";
print "\nThe date of the file is ", $gcg->get_date, "\n";
print "The filetype of the file is ", $gcg->get_filetype, "\n";
print "The contents of the file are:\n", $gcg->get_filedata;

print "\n####################\n####################\n####################\n";

my $staden = SeqFileIO->new(  );
$staden->read(
  filename => 'record.staden'
);
print "The file name is ", $staden->get_filename, "\n";
print "\nThe date of the file is ", $staden->get_date, "\n";
print "The filetype of the file is ", $staden->get_filetype, "\n";
print "The contents of the file are:\n", $staden->get_filedata;

print "\n####################\n####################\n####################\n";

print <<'REFORMAT';

##########################################
#
# Test file format reformatting and writing
#
##########################################

REFORMAT

print "At this point there are ", $staden->get_count, " objects.\n\n";

print "######\n###### Testing put methods\n######\n\n";

print "\nPrinting staden data in raw format:\n";
print $staden->put_raw;

print "\nPrinting staden data in embl format:\n";
print $staden->put_embl;

print "\nPrinting staden data in fasta format:\n";
print $staden->put_fasta;

print "\nPrinting staden data in gcg format:\n";
print $staden->put_gcg;

print "\nPrinting staden data in genbank format:\n";
print $staden->put_genbank;

print "\nPrinting staden data in PIR format:\n";
print $staden->put_pir;

The test program depends on certain files being present on the system; the contents of the files are clear from the program output, and the files can be downloaded from this book's web site, along with the rest of the programs for this book.

4.3.4 Results

Here is the output from testSeqFileIO:

The file name is file1.txt
The contents of the file are:
> sample dna  (This is a typical fasta header.)
agatggcggcgctgaggggtcttgggggctctaggccggccacctactgg
tttgcagcggagacgacgcatggggcctgcgcaataggagtacgctgcct
gggaggcgtgactagaagcggaagtagttgtgggcgcctttgcaaccgcc
tgggacgccgccgagtggtctgtgcaggttcgcgggtcgctggcgggggt
cgtgagggagtgcgccgggagcggagatatggagggagatggttcagacc
cagagcctccagatgccggggaggacagcaagtccgagaatggggagaat
acacctgagccactctcagatgaggaccta

The date of the file is Thu Dec  5 11:22:56 2002
The filetype of the file is _fasta
The reset date of the file is today
The write mode of the file is >

Resetting the data and filename
Writing a new file "file2.txt"
Appending to the new file "file2.txt"
Reading and printing the data from "file2.txt":
The file name is file2.txt
The contents of the file are:
line1
line2
line1
line2
The filetype of the file is _unknown

##########################################
#
# Test file format recognizing and reading
#
##########################################

The file name is record.gb

The date of the file is Sun Mar 30 14:30:09 2003
The filetype of the file is _genbank
The contents of the file are:
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Sequence severely truncated for demonstration.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
KEYWORDS    .
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
            Takano,T.
  TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
            is regulated by proteolysis
  JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
  MEDLINE   20261256
REFERENCE   2  (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:fujino@microb.med.keio.ac.jp,
            Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
                     /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
                     NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
                     RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
                     QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
                     FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
                     EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
                     KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
                     DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
                     FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
                     YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
                     PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
                     AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"
BASE COUNT      564 a    715 c    768 g    440 t
ORIGIN      
        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
      121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
      181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
      241 aaaaaaaaaa aaaaaaaaaa aaaaaaa
//

####################
####################
####################
The file name is record.raw

The date of the file is Sun Mar 30 14:30:39 2003
The filetype of the file is _raw
The contents of the file are:
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGGAGACGACGCATGGGGCCTGCGCAAT
AGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTC
TGTGCAGGTTCGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAAAAAAA
AAAAAAAAAAAA
####################
####################
####################
The file name is record.embl

The date of the file is Sun Mar 30 14:31:23 2003
The filetype of the file is _embl
The contents of the file are:
ID   AB031069     2487 bp    mRNA            PRI       27-MAY-2000 Sequence severely 
truncated for demonstration. AB031069 AB031069
XX
SQ   sequence 267 BP; 65 A; 54 C; 106 G; 42 T; 0 other;
AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG TTTGCAGCGG
AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT GGGAGGCGTG ACTAGAAGCG
GAAGTAGTTG TGGGCGCCTT TGCAACCGCC TGGGACGCCG CCGAGTGGTC TGTGCAGGTT
CGCGGGTCGC TGGCGGGGGT CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT
AAAAAAAAAA AAAAAAAAAA AAAAAAA
//

####################
####################
####################
The file name is record.fasta

The date of the file is Sun Mar 30 14:31:40 2003
The filetype of the file is _fasta
The contents of the file are:
> AB031069     2487 bp    mRNA            PRI       27-MAY-2000 Sequence severely 
truncated for demonstration. AB031069
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGG
TTTGCAGCGGAGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCT
GGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCC
TGGGACGCCGCCGAGTGGTCTGTGCAGGTTCGCGGGTCGCTGGCGGGGGT
CGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAA
AAAAAAAAAAAAAAAAA

####################
####################
####################
The file name is record.gcg

The date of the file is Sun Mar 30 14:31:47 2003
The filetype of the file is _gcg
The contents of the file are:
AB031069     2487 bp    mRNA            PRI       27-MAY-2000 Sequence severely
truncated for demonstration. AB031069
    AB031069 Length: 267 (today)  Check: 4285  ..
       1  AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG
      51  TTTGCAGCGG AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT
     101  GGGAGGCGTG ACTAGAAGCG GAAGTAGTTG TGGGCGCCTT TGCAACCGCC
     151  TGGGACGCCG CCGAGTGGTC TGTGCAGGTT CGCGGGTCGC TGGCGGGGGT
     201  CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT AAAAAAAAAA
     251  AAAAAAAAAA AAAAAAA

####################
####################
####################
The file name is record.staden

The date of the file is Sun Mar 30 14:32:01 2003
The filetype of the file is _staden
The contents of the file are:
;<----AB031069------>
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGG
AGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCG
GAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTCTGTGCAGGTT
CGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGAT
AAAAAAAAAAAAAAAAAAAAAAAAAAA

####################
####################
####################

##########################################
#
# Test file format reformatting and writing
#
##########################################

At this point there are 8 objects.

######
###### Testing put methods
######

Printing staden data in raw format:
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGGTTTGCAGCGGAGACGACGCATGGGGCCTGCGCAAT
AGGAGTACGCTGCCTGGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCCTGGGACGCCGCCGAGTGGTC
TGTGCAGGTTCGCGGGTCGCTGGCGGGGGTCGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAAAAAAAA
AAAAAAAAAAA
Printing staden data in embl format:
ID   AB031069 AB031069
XX
SQ   sequence 267 BP; 65 A; 54 C; 106 G; 42 T; 0 other;
AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG TTTGCAGCGG
AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT GGGAGGCGTG ACTAGAAGCG
GAAGTAGTTG TGGGCGCCTT TGCAACCGCC TGGGACGCCG CCGAGTGGTC TGTGCAGGTT
CGCGGGTCGC TGGCGGGGGT CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT
AAAAAAAAAA AAAAAAAAAA AAAAAAA
//

Printing staden data in fasta format:
> AB031069
AGATGGCGGCGCTGAGGGGTCTTGGGGGCTCTAGGCCGGCCACCTACTGG
TTTGCAGCGGAGACGACGCATGGGGCCTGCGCAATAGGAGTACGCTGCCT
GGGAGGCGTGACTAGAAGCGGAAGTAGTTGTGGGCGCCTTTGCAACCGCC
TGGGACGCCGCCGAGTGGTCTGTGCAGGTTCGCGGGTCGCTGGCGGGGGT
CGTGAGGGAGTGCGCCGGGAGCGGAGATATGGAGGGAGATAAAAAAAAAA
AAAAAAAAAAAAAAAAA

Printing staden data in gcg format:
AB031069
    AB031069 Length: 267 (today)  Check: 4285  ..
       1  AGATGGCGGC GCTGAGGGGT CTTGGGGGCT CTAGGCCGGC CACCTACTGG
      51  TTTGCAGCGG AGACGACGCA TGGGGCCTGC GCAATAGGAG TACGCTGCCT
     101  GGGAGGCGTG ACTAGAAGCG GAAGTAGTTG TGGGCGCCTT TGCAACCGCC
     151  TGGGACGCCG CCGAGTGGTC TGTGCAGGTT CGCGGGTCGC TGGCGGGGGT
     201  CGTGAGGGAG TGCGCCGGGA GCGGAGATAT GGAGGGAGAT AAAAAAAAAA
     251  AAAAAAAAAA AAAAAAA

Printing staden data in genbank format:
LOCUS       AB031069       267 bp
DEFINITION  AB031069 , 267 bases, 829 sum.
ACCESSION  
ORIGIN
       1  agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
      61  agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
     121  gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
     181  cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
     241  aaaaaaaaaa aaaaaaaaaa aaaaaaa
//

Printing staden data in PIR format:
ENTRY           AB031069
TITLE           AB031069
DATE            
REFERENCE       
SUMMARY         #Molecular-weight 0  #Length 267  #Checksum 4285
SEQUENCE
                5        10        15        20        25        30
      1 A G A T G G C G G C G C T G A G G G G T C T T G G G G G C T
     31 C T A G G C C G G C C A C C T A C T G G T T T G C A G C G G
     61 A G A C G A C G C A T G G G G C C T G C G C A A T A G G A G
     91 T A C G C T G C C T G G G A G G C G T G A C T A G A A G C G
    121 G A A G T A G T T G T G G G C G C C T T T G C A A C C G C C
    151 T G G G A C G C C G C C G A G T G G T C T G T G C A G G T T
    181 C G C G G G T C G C T G G C G G G G G T C G T G A G G G A G
    211 T G C G C C G G G A G C G G A G A T A T G G A G G G A G A T
    241 A A A A A A A A A A A A A A A A A A A A A A A A A A A
///