9.6 bptutorial.pl

I've already shown you a little of the bptutorial.pl document. I ran and discussed a few of the short example programs in the preceding sections.

As you know, one of the easiest ways to get started with a programming system is to find some working and fairly generic programs in that system. You can read and run the programs, and then proceed to alter them using them as templates for your own programming development.

Bioperl comes with a directory of example programs, but the best place to begin looking for starting-off program code is right in the bptutorial.pl document itself. That .pl suffix on the name is the giveaway; the document is actually itself a program, cleverly designed so that you can read, and run, example programs that exercise the core parts of the Bioperl project.

The following explanation of the runnable programs that are part of bptutorial.pl appears at the end of the document (when you view it on the Web or as the output of perldoc bptutorial.pl).

       V.2 Appendix: Tutorial demo scripts

       The following scripts demonstrate many of the features of
       bioperl. To run all the core demos, run:

        > perl -w  bptutorial.pl 0

       To run a subset of the scripts do

        > perl -w  bptutorial.pl

       and use the displayed help screen.

       It may be best to start by just running one or two demos
       at a time. For example, to run the basic sequence manipu-
       lation demo, do:

        > perl -w  bptutorial.pl 1

       Some of the later demos require that you have an internet
       connection and/or that you have an auxilliary bioperl
       library and/or external cpan module and/or external pro-
       gram installed.  They may also fail if you are not running
       under Linux or Unix.  In all of these cases, the script
       should fail "gracefully" simply saying the demo is being
       skipped.  However if the script "crashes", simply run the
       other demos individually (and perhaps send an email to
       bioperl-l@bioperl.org detailing the problem :-).

(Recall that the -w flag to Perl turns on warnings in almost the same manner as a use warnings; directive.)

To test my Bioperl installation, I started by running the basic sequence manipulation demo as suggested.

First, I thought I might copy the bptutorial.pl program file into my own working directory from the Bioperl distribution directory where I'd unpacked the source code. I wanted to put it in my own directory so as not to muddy up the Bioperl distribution directory with my own extraneous files. However, I discovered that the tutorial demo programs rely on a number of datafiles that are found in the t/data/ subdirectory of the Bioperl distribution. Running the program in my own directory gave me an error because the program evidently requires a FASTA-formatted file called dna1.fa:

[tisdall]$ perl -w bptutorial.pl 1

Beginning sequence_manipulations and SeqIO example... 

------------- EXCEPTION  -------------
MSG: Could not open t/data/dna1.fa: No such file or directory
STACK Bio::Root::IO::_initialize_io /usr/local/lib/perl5/site_perl/5.8.0/Bio/Root/IO.
pm:264
STACK Bio::SeqIO::_initialize /usr/local/lib/perl5/site_perl/5.8.0/Bio/SeqIO.pm:437
STACK Bio::SeqIO::fasta::_initialize /usr/local/lib/perl5/site_perl/5.8.0/Bio/SeqIO/
fasta.pm:80
STACK Bio::SeqIO::new /usr/local/lib/perl5/site_perl/5.8.0/Bio/SeqIO.pm:355
STACK Bio::SeqIO::new /usr/local/lib/perl5/site_perl/5.8.0/Bio/SeqIO.pm:368
STACK main::_ _ANON_ _ bptutorial.pl:2758
STACK toplevel bptutorial.pl:3933

--------------------------------------
[tisdall]$

I decided it would be easier to run bptutorial.pl from the distribution directory. I entered that directory; on my Linux machine, I unpacked the Bioperl source code into /usr/local/src/bioperl-1.2.1. I then tried to run the demo:

[tisdall]$ cd /usr/local/src/bioperl-1.2.1
[tisdall]$ perl -w bptutorial.pl 1

Beginning sequence_manipulations and SeqIO example... 
First sequence in fasta format... 
>Test1
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTC
TGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGG
TCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATTACAGAGTAC
ACAACATCCATGAAACGCATTAGCACCACC
Seq object display id is Test1
Sequence is 
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTAC
CTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAACCAATATAGGCATAGCGCACAGACAGATAAAAATT
ACAGAGTACACAACATCCATGAAACGCATTAGCACCACC 
Sequence from 5 to 10 is TTTCAT 
Acc num is unknown 
Moltype is dna 
Primary id is Test1 
Truncated Seq object sequence is TTTCAT 
Reverse complemented sequence 5 to 10  is ATGAAA  
Translated sequence 6 to 15 is LQRAICLCVD 

Beginning 3-frame and alternate codon translation example... 
ctgagaaaataa translated using method defaults   : LRK*
ctgagaaaataa translated as a coding region (CDS): MRK

Translating in all six frames:
 frame: 0 forward: LRK*
 frame: 0 reverse-complement: LFSQ
 frame: 1 forward: *EN
 frame: 1 reverse-complement: YFL
 frame: 2 forward: EKI
 frame: 2 reverse-complement: IFS
Translating with all codon tables using method defaults:
1 : LRK*
2 : L*K*
3 : TRK*
4 : LRK*
5 : LSK*
6 : LRKQ
9 : LSN*
10 : LRK*
11 : LRK*
12 : SRK*
13 : LGK*
14 : LSNY
15 : LRK*
16 : LRK*
21 : LSN*
[tisdall]$

That seemed to run without error. Now I wanted to see the Perl code that had run and produced that output.

Exploring a bit, I found that the POD documentation part of bptutorial.pl is roughly the first half of the file, and that the second half of the file contains the Perl code for the demos.

The author attributions and the libraries that are loaded are at the beginning of the Perl code section, somewhere near the middle of the bptutorial.pl file:

#!/usr/bin/perl

# PROGRAM  : bptutorial.pl
# PURPOSE  : Demonstrate various uses of the bioperl package
# AUTHOR   : Peter Schattner schattner@alum.mit.edu
# CREATED  : Dec 15 2000
# REVISION : $Id: ch09.xml,v 1.2 2003/10/03 19:16:55 becki Exp chodacki $

use strict;
use Bio::SimpleAlign;
use Bio::AlignIO;
use Bio::SeqIO;
use Bio::Seq;

That seems like a good bet for a list of the most important Bioperl modules. Actually, some other modules are loaded for individual demos; at various points, the following come into play:

use Bio::SearchIO;
use Bio::Root::IO;
use Bio::MapIO;
use Bio::TreeIO;
use Bio::Perl

Following those use Module directives comes the following:

# subroutine references

my ($access_remote_db, $index_local_db, $fetch_local_db,
    $sequence_manipulations, $seqstats_and_seqwords,
    $restriction_and_sigcleave, $other_seq_utilities, $run_remoteblast,
    $run_standaloneblast,  $blast_parser, $bplite_parsing, $hmmer_parsing,
    $run_clustalw_tcoffee, $run_psw_bl2seq, $simplealign,
    $gene_prediction_parsing, $sequence_annotation, $largeseqs,
    $run_tree, $run_map, $run_struct, $run_perl, $searchio_parsing,
    $liveseqs, $demo_variations, $demo_xml, $display_help, $bpinspect1 );

# global variable file names.  Edit these if you want to try
#out a tutorial script on a different file

 Bio::Root::IO->catfile("t","data","ecolitst.fa");

# used in $sequence_manipulations
my $dna_seq_file = Bio::Root::IO->catfile("t","data","dna1.fa");     

# used in $other_seq_utilities and in $run_perl and $sequence_annotation
my $amino_seq_file = Bio::Root::IO->catfile("t","data","cysprot1.fa"); 

# used in $blast_parser
my $blast_report_file = Bio::Root::IO->catfile("t","data","blast.report");  

# used in $bplite_parsing
my $bp_parse_file1 = Bio::Root::IO->catfile("t","data","blast.report");

# used in $bplite_parsing
my $bp_parse_file2 = Bio::Root::IO->catfile("t","data","psiblastreport.out");

# used in $bplite_parsing
my $bp_parse_file3 = Bio::Root::IO->catfile("t","data","bl2seq.out");       

# used in $run_clustalw_tcoffee
my $unaligned_amino_file = Bio::Root::IO->catfile("t","data","cysprot1a.fa");

# used in $simplealign
my $aligned_amino_file = Bio::Root::IO->catfile("t","data","testaln.pfam");

# other global variables
my (@runlist, $n );

A look at the documentation (perldoc Bio::Root::IO) shows that the catfile method returns the pathname of a file, which is being saved in a scalar variable such as $dna_seq_file. This method is there for portability between operating systems, because operating systems each have their own syntax for specifying pathnames.

After that comes the code for the help screen, the output of which you've already seen; next comes the individual demo subroutines; and finally the code for the main subroutine, which you saw in the last section.

At the very end of the bptutorial.pl file, I found the part of the code that launches all the demos:

## "main" program follows
#no strict 'refs';

    @runlist = @ARGV;
    # display help if no option
    if (scalar(@runlist)=  =0) {&$display_help;};
    # argument = 0 means run tests 1 thru 22
    if ($runlist[0] =  = 0) {@runlist = (1..22); };
    foreach $n  (@runlist) {
        if ($n =  =100) {my $object = $runlist[1]; &$bpinspect1($object); last;}
        if ($n =  =1) {&$sequence_manipulations; next;}
        if ($n =  =2) {&$seqstats_and_seqwords; next;}
        if ($n =  =3) {&$restriction_and_sigcleave; next;}
        if ($n =  =4) {&$other_seq_utilities; next;}
        if ($n =  =5) {&$run_perl; next;}
        if ($n =  =6) {&$searchio_parsing; next;}
        if ($n =  =7) {&$bplite_parsing; next;}
        if ($n =  =8) {&$hmmer_parsing; next;}
        if ($n =  =9) {&$simplealign ; next;}
        if ($n =  =10) {&$gene_prediction_parsing; next;}
        if ($n =  =11) {&$access_remote_db; next;}
        if ($n =  =12) {&$index_local_db; next;}
        if ($n =  =13) {&$fetch_local_db; next;}
        if ($n =  =14) {&$sequence_annotation; next;}
        if ($n =  =15) {&$largeseqs; next;}
        if ($n =  =16) {&$liveseqs; next;}
        if ($n =  =17) {&$run_struct; next;}
        if ($n =  =18) {&$demo_variations; next;}
        if ($n =  =19) {&$demo_xml; next;}
        if ($n =  =20) {&$run_tree; next;}
        if ($n =  =21) {&$run_map; next;}
        if ($n =  =22) {&$run_remoteblast; next;}
        if ($n =  =23) {&$run_standaloneblast; next;}
        if ($n =  =24) {&$run_clustalw_tcoffee; next;}
        if ($n =  =25) {&$run_psw_bl2seq; next;}
         &$display_help;
    }

## End of "main"

So, I searched for sequence_manipulation and found the following code for this, the first of the bptutorial.pl demos:

#################################################
# sequence_manipulations  (  ):
#

$sequence_manipulations = sub {

    my ($infile, $in, $out, $seqobj);
    $infile = $dna_seq_file;

    print "\nBeginning sequence_manipulations and SeqIO example... \n";


    # III.3.1 Transforming sequence files (SeqIO)

    $in  = Bio::SeqIO->new('-file' => $infile ,
                           '-format' => 'Fasta');
    $seqobj = $in->next_seq(  );

    # perl "tied filehandle" syntax is available to SeqIO,
    # allowing you to use the standard <> and print operations
    # to read and write sequence objects, eg:
    #$out = Bio::SeqIO->newFh('-format' => 'EMBL');

    $out = Bio::SeqIO->newFh('-format' => 'fasta');

    print "First sequence in fasta format... \n";
    print $out $seqobj;

    # III.4 Manipulating individual sequences

    # The following methods return strings

    print "Seq object display id is ",
    $seqobj->display_id(  ), "\n"; # the human read-able id of the sequence
    print "Sequence is ",
    $seqobj->seq(  )," \n";        # string of sequence
    print "Sequence from 5 to 10 is ",
    $seqobj->subseq(5,10)," \n"; # part of the sequence as a string
    print "Acc num is ",
    $seqobj->accession_number(  ), " \n"; # when there, the accession number
    print "Moltype is ",
    $seqobj->alphabet(  ), " \n";    # one of 'dna','rna','protein'
    print "Primary id is ", $seqobj->primary_seq->primary_id(  )," \n";
    # a unique id for this sequence irregardless
    #print "Primary id is ", $seqobj->primary_id(  ), " \n";
    # a unique id for this sequence irregardless
    # of its display_id or accession number

    # The following methods return an array of  Bio::SeqFeature objects
    $seqobj->top_SeqFeatures; # The 'top level' sequence features
    $seqobj->all_SeqFeatures; # All sequence features, including sub
    # seq features

    # The following methods returns new sequence objects,
    # but do not transfer features across
    # truncation from 5 to 10 as new object
    print "Truncated Seq object sequence is ",
    $seqobj->trunc(5,10)->seq(  ), " \n";
    # reverse complements sequence
    print "Reverse complemented sequence 5 to 10  is ",
    $seqobj->trunc(5,10)->revcom->seq, "  \n";
    # translation of the sequence
    print "Translated sequence 6 to 15 is ",
    $seqobj->translate->subseq(6,15), " \n";

    my $c = shift;
    $c ||= 'ctgagaaaataa';

    print "\nBeginning 3-frame and alternate codon translation example... \n";

    my $seq = new Bio::PrimarySeq('-SEQ' => $c,
                                  '-ID' => 'no.One');
    print "$c translated using method defaults   : ",
    $seq->translate->seq, "\n";

    # Bio::Seq uses same sequence methods as PrimarySeq
    my $seq2 = new Bio::Seq('-SEQ' => $c, '-ID' => 'no.Two');
    print "$c translated as a coding region (CDS): ",
    $seq2->translate(undef, undef, undef, undef, 1)->seq, "\n";

    print "\nTranslating in all six frames:\n";
    my @frames = (0, 1, 2);
    foreach my $frame (@frames) {
        print  " frame: ", $frame, " forward: ",
        $seq->translate(undef, undef, $frame)->seq, "\n";
        print  " frame: ", $frame, " reverse-complement: ",
        $seq->revcom->translate(undef, undef, $frame)->seq, "\n";
    }

    print "Translating with all codon tables using method defaults:\n";
    my @codontables = qw( 1 2 3 4 5 6 9 10 11 12 13 14 15 16 21 );
    foreach my $ct (@codontables) {
        print $ct, " : ",
        $seq->translate(undef, undef, undef, $ct)->seq, "\n";
    }

    return 1;
} ;

#################################################