9.7 bptutorial.pl: sequence_manipulation Demo

In this section, I'll go through the code for the demo subroutine sequence_manipulation that was shown in the last section.

The subroutine is actually an anonymous subroutine; a reference to the subroutine is saved in the scalar reference variable $sequence_manipulation:

$sequence_manipulations = sub {

The first few lines of code declare some variables with my. Notice that these are not being passed in as arguments; this method uses no arguments but does occasionally use global variables such as $dna_seq_file, which, as you've just seen, contain the pathname of the input sequence file the demo will use:

my ($infile, $in, $out, $seqobj);
$infile = $dna_seq_file;

print "\nBeginning sequence_manipulations and SeqIO example... \n";

The code is cross-referenced to the tutorial sections of the file. The next comment line refers to the part of the document:

# III.3.1 Transforming sequence files (SeqIO)

which can be looked up in the table of contents to the document for further reading:

III.3 Manipulating sequences
III.3.1 Manipulating sequence data with Seq methods (Seq)

Now, I'll take a look at the first section of example code in the sequence_manipulations method:

# III.3.1 Transforming sequence files (SeqIO)

$in  = Bio::SeqIO->new('-file' => $infile ,'-format' => 'Fasta');
$seqobj = $in->next_seq(  );

# perl "tied filehandle" syntax is available to SeqIO,
# allowing you to use the standard <> and print operations
# to read and write sequence objects, eg:
#$out = Bio::SeqIO->newFh('-format' => 'EMBL');

$out = Bio::SeqIO->newFh('-format' => 'fasta');

print "First sequence in fasta format... \n";
print $out $seqobj;

The code starts with a call to the new object constructor of the Bio::SeqIO class. The new method is being passed the pathname to a FASTA file in $infile, and told that the format is FASTA.

A quick look at the Bio::SeqIO documentation explains that the call to Bio::SeqIO->new returns a stream object for the specified format. So, $out is a stream object (a stream is input or output of data) for FASTA-formatted data, and $in is a stream object for FASTA-formatted input from the file named in the $infile variable. These $in and $out objects are also filehandles.

After the $in object is initialized on the FASTA file named in $infile, it calls the next_seq method, which gets the next (in this case, the first and perhaps only) FASTA record from the file, and it creates a sequence object $seqobj. The output $out object is created. The Perl print statement is then called, using $out as a filehandle, and printing $seqobj. This prints the first FASTA record from that file, as shown in the demo output in the last section and repeated here:

First sequence in fasta format... 

The next part of the sequence_manipulation demo code shows many of the methods for extracting information from a sequence object:

# III.4 Manipulating individual sequences

# The following methods return strings

print "Seq object display id is ",
$seqobj->display_id(  ), "\n"; # the human read-able id of the sequence
print "Sequence is ",
$seqobj->seq(  )," \n";        # string of sequence
print "Sequence from 5 to 10 is ",
$seqobj->subseq(5,10)," \n"; # part of the sequence as a string
print "Acc num is ",
$seqobj->accession_number(  ), " \n"; # when there, the accession number
print "Moltype is ",
$seqobj->alphabet(  ), " \n";    # one of 'dna','rna','protein'
print "Primary id is ", $seqobj->primary_seq->primary_id(  )," \n";
# a unique id for this sequence irregardless
#print "Primary id is ", $seqobj->primary_id(  ), " \n";
# a unique id for this sequence irregardless
# of its display_id or accession number

# The following methods return an array of  Bio::SeqFeature objects
$seqobj->top_SeqFeatures; # The 'top level' sequence features
$seqobj->all_SeqFeatures; # All sequence features, including sub
# seq features

# The following methods returns new sequence objects,
# but do not transfer features across
# truncation from 5 to 10 as new object
print "Truncated Seq object sequence is ",
$seqobj->trunc(5,10)->seq(  ), " \n";
# reverse complements sequence
print "Reverse complemented sequence 5 to 10  is ",
$seqobj->trunc(5,10)->revcom->seq, "  \n";
# translation of the sequence
print "Translated sequence 6 to 15 is ",
$seqobj->translate->subseq(6,15), " \n";

This section of the tutorial produced the following output (slightly reformatted to fit the pages of this book):

Seq object display id is Test1
Sequence from 5 to 10 is TTTCAT 
Acc num is unknown 
Moltype is dna 
Primary id is Test1 
Truncated Seq object sequence is TTTCAT 
Reverse complemented sequence 5 to 10  is ATGAAA  
Translated sequence 6 to 15 is LQRAICLCVD

This part of the demo code is self-explanatory. As you can see, the methods return:

  • The FASTA ID (the part immediately following the > sign in the first line)

  • The sequence alone as a string (reformatted with line breaks to fit this book)

  • The accession number if it is known (FASTA format won't have this, but Genbank format will, for instance)

  • The type of molecule (dna, rna, or protein)

  • A unique ID compared to other IDs in the running program but drawn from the file itself if possible

  • The sequence of a new truncated sequence object defined from the original sequence object; a reverse complement

  • The peptide resulting from a translation of a specified part of a sequence

The final section of the sequence_manipulation demo demonstrates the ability to translate nucleotides into all six reading frames using alternate codon translation tables if desired:

my $c = shift;
$c ||= 'ctgagaaaataa';

print "\nBeginning 3-frame and alternate codon translation example... \n";

my $seq = new Bio::PrimarySeq('-SEQ' => $c, '-ID' => 'no.One');
print "$c translated using method defaults   : ",
$seq->translate->seq, "\n";

# Bio::Seq uses same sequence methods as PrimarySeq
my $seq2 = new Bio::Seq('-SEQ' => $c, '-ID' => 'no.Two');
print "$c translated as a coding region (CDS): ",
$seq2->translate(undef, undef, undef, undef, 1)->seq, "\n";

print "\nTranslating in all six frames:\n";
my @frames = (0, 1, 2);
foreach my $frame (@frames) {
    print  " frame: ", $frame, " forward: ",
    $seq->translate(undef, undef, $frame)->seq, "\n";
    print  " frame: ", $frame, " reverse-complement: ",
    $seq->revcom->translate(undef, undef, $frame)->seq, "\n";

print "Translating with all codon tables using method defaults:\n";
my @codontables = qw( 1 2 3 4 5 6 9 10 11 12 13 14 15 16 21 );
foreach my $ct (@codontables) {
    print $ct, " : ",
    $seq->translate(undef, undef, undef, $ct)->seq, "\n";

This produces the output:

Beginning 3-frame and alternate codon translation example... 
ctgagaaaataa translated using method defaults   : LRK*
ctgagaaaataa translated as a coding region (CDS): MRK

Translating in all six frames:
 frame: 0 forward: LRK*
 frame: 0 reverse-complement: LFSQ
 frame: 1 forward: *EN
 frame: 1 reverse-complement: YFL
 frame: 2 forward: EKI
 frame: 2 reverse-complement: IFS
Translating with all codon tables using method defaults:
1 : LRK*
2 : L*K*
3 : TRK*
4 : LRK*
5 : LSK*
6 : LRKQ
9 : LSN*
10 : LRK*
11 : LRK*
12 : SRK*
13 : LGK*
14 : LSNY
15 : LRK*
16 : LRK*
21 : LSN*

Again, this code is fairly easy to follow, although you may find yourself turning to the documentation for some of the finer points when you try to use these objects and methods in your own code.

This part of the code starts by getting an argument of sequence data if one is available to be shifted off of the @_ argument array and into the variable $c; otherwise, it sets the variable $c to a preset sequence:

my $c = shift;
$c ||= 'ctgagaaaataa';

Briefly, the methods that follow in this part of the sequence_manipulation demo do the following:

  • Call the PrimarySeq::new constructor to create a lightweight sequence object from $c that doesn't have the extra annotation often present in a standard sequence file (see perldoc Bio::PrimarySeq for the whole story)

  • Translate the sequence (from $c)

  • Call the Bio::Seq constructor to create the $seq2 sequence object (from the same sequence data in $seq)

  • Translate the CDS in the sequence object $seq2

  • Translate the sequence in all six reading frames

  • Translate the sequence using all 21 defined codon tables

The translate method seems to have lots of interesting options. However, if you try to look up the documentation for this method, you may have difficulty finding it. The problem is that Bioperl classes often make considerable use of inheritance. Say that the code is calling a method on a Bio::PrimarySeq object, as do the following lines from the demo (somewhat separated in the original):

my $seq = new Bio::PrimarySeq('-SEQ' => $c, '-ID' => 'no.One');
$seq->translate(undef, undef, $frame)->seq, "\n";

You may try perldoc Bio::PrimarySeq and not find a discussion of the translate method because it is being inherited from some other class. And since multiple inheritance is possible, it can take considerable effort to track down this method documentation by figuring out the parent classes and searching the documentation.

There's a very convenient way to find the documentation for a method, even if it's only inherited?it's built into the bptutorial.pl script. In the example just mentioned, you ask for the list of methods used by the Bio::PrimarySeq method.

[tisdall]$ perl bptutorial.pl 100 Bio::PrimarySeq
 ***Methods for Object Bio::PrimarySeq ********  
 Methods taken from package Bio::DescribableI  
 description   display_name   

 Methods taken from package Bio::IdentifiableI  
 authority   lsid_string   namespace   namespace_string   object_id   version   
 Methods taken from package Bio::PrimarySeq  
 accession   direct_seq_set   validate_seq   

 Methods taken from package Bio::PrimarySeqI  
 accession_number   alphabet   can_call_new   desc   display_id   id   
 is_circular   length   moltype   primary_id   revcom   seq   
 subseq   translate   trunc   

 Methods taken from package Bio::Root::Root  
 DESTROY   debug   verbose   

 Methods taken from package Bio::Root::RootI  
 carp   confess   deprecated   new   stack_trace   stack_trace_dump   
 throw   throw_not_implemented   warn   warn_not_implemented  

Sure enough, there under the heading "Methods taken from package Bio::PrimarySeqI" appears the method translate. You should then look at the Bio::PrimarySeqI documentation and find the following method description:


Title   : translate
Usage   : $protein_seq_obj = $dna_seq_obj->translate
          #if full CDS expected:
          $protein_seq_obj = $cds_seq_obj->translate(undef,undef,undef,undef,1);

          Provides the translation of the DNA sequence using full
          IUPAC ambiguities in DNA/RNA and amino acid codes.

          The full CDS translation is identical to EMBL/TREMBL
          database translation. Note that the trailing terminator
          character is removed before returning the translation

          Note: if you set $dna_seq_obj->verbose(1) you will get a
          warning if the first codon is not a valid initiator.

Returns : A Bio::PrimarySeqI implementing object
Args    : character for terminator (optional) defaults to '*'
          character for unknown amino acid (optional) defaults to 'X'
          frame (optional) valid values 0, 1, 2, defaults to 0
          codon table id (optional) defaults to 1
          complete coding sequence expected, defaults to 0 (false)
          boolean, throw exception if not complete CDS (true) or
          defaults to warning (false)

You will want to explore at least a few more of the tutorials embedded in bptutorial.pl, as we've done here with the first of the tutorials.