9.3 Testing Bioperl

To check that things were working okay with my new Bioperl installation, I first wrote a little test to see if a Perl program could find the Bio::Perl module:

use Bio::Perl;

I ran it by putting it in a file bp0.pl and giving it to the Perl interpreter:

[tisdall]# perl bp0.pl
[tisdall]#

As you see, it didn't complain, which means Perl found the Bio::Perl module. If it can't find it, it will complain. When I copied the bp0.pl program to a file called bp0.pl.broken and changed the module call of Bio::Perl to a call for the nonexistent module Bio::Perrl, I got the following (slightly truncated) error output:

[tisdall@coltrane development]$ perl bp0.pl.broken
Can't locate Bio/Perrl.pm in @INC 
BEGIN failed--compilation aborted at bp0.pl.broken line 1.

9.3.1 Second Test

Now I knew that Bio::Perl could be found and loaded. I next tried a couple of test programs that are given in the bptutorial.pl document. I found a link to that tutorial document on the Web from the http://bioperl.org page. Alternately, I could have opened a window and typed at the command prompt:

perldoc bptutorial.pl

I went to the section near the beginning of the document called "I.2 Quick getting started scripts" and created a file tut1.pl on my computer by pasting in the text of the first tutorial script:

  use Bio::Perl;

  # this script will only work with an internet connection
  # on the computer it is run on
  $seq_object = get_sequence('swissprot',"ROA1_HUMAN");

  write_sequence(">roa1.fasta",'fasta',$seq_object);

I then ran the program and looked at the output file:

[tisdall]$ perl tut1.pl
[tisdall]$ cat roa1.fasta
>ROA1_HUMAN Heterogeneous nuclear ribonucleoprotein A1 (Helix-destabilizing protein)
(Single-strand binding protein) (hnRNP core protein A1).
SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVT
YATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHL
RDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKAL
SKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSG
DGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDSYNNGGGRGFGGG
SGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSS
SSSSYGSGRRF
[tisdall]$

That seemed to work perfectly.

9.3.2 Third Test

I tried the next short script from the same section of the tutorial, pasting it into a file called tut2.pl:

[tisdall]$ cat tut2.pl
use Bio::Perl;

# this script will only work with an internet connection
# on the computer it is run on

$seq_object = get_sequence('swissprot',"ROA1_HUMAN");

# uses the default database - nr in this case
$blast_result = blast_sequence($seq);

write_blast(">roa1.blast",$blast_report);
[tisdall]$ perl tut2.pl

-------------------- WARNING ---------------------
MSG: req was POST http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
User-Agent: libwww-perl/5.69
Content-Length: 178
Content-Type: application/x-www-form-urlencoded

...
---------------------------------------------------
Submitted Blast for [blast-sequence-temp-id] 
[tisdall]$ cat roa1.blast
[tisdall]$ ls -l roa1.blast
-rw-rw-r--    1 tisdall  tisdall         0 Apr 30 11:28 roa1.blast
[tisdall]$

Here, I experienced a problem. Running the program created a page of error output, mostly formatted in HTML, which I've truncated in the output. When I tried to cat the output file, there was nothing in it, which I verified (on Linux) by examining the file with ls -l, which showed that it had 0 bytes in it.

This wasn't very encouraging; the second of the two short example scripts was failing.

Looking at the two programs tut1.pl and tut2.pl, you can see that the second is an extension of the first; after getting the sequence, the second program also tries to blast it, and this is where it is failing. It's running (printing those ... dots took a while because the program was waiting for a reply from NCBI over the Internet), but it's not printing out the BLAST report to the file roa1.blast. What's going wrong?

I started my investigation by looking at the documentation for the Bio::Perl module by typing:

perldoc Bio::Perl

and searching for the function that is failing, blast_sequence. Here's what I found:

blast_sequence

 Title   : blast_sequence
 Usage   : $blast_result = blast_sequence($seq)
           $blast_result = blast_sequence('MFVEGGTFASEDDDSASAEDE');

 Function: If the computer has Internet accessibility, blasts
           the sequence using the NCBI BLAST server against nrdb.

           It choose the flavour of BLAST on the basis of the sequence.

           This function uses Bio::Tools::Run::RemoteBlast, which itself
           use Bio::SearchIO - as soon as you want to more, check out
           these modules
 Returns : Bio::Search::Result::GenericResult.pm

 Args    : Either a string of protein letters or nucleotides, or a
           Bio::Seq object

That last sentence about the Args gave me a clue. I went back and looked at the failing program, which I placed in a file called tut2.pl, and, sure enough, the argument for the blast_sequence function, $seq, is neither a string of sequence nor a Bio::Seq object. In fact, it's not even defined! Clearly, what the tutorial writer meant instead of $seq, was $seq_object, which is defined and populated by the earlier get_sequence method call.

I was visited by a brainwave, and I decided to put a use strict and a use warnings into the program and to declare variables with my to see what ensued:

[tisdall]$ cat tut2.pl
use Bio::Perl;
use strict;
use warnings;

# this script will only work with an internet connection
# on the computer it is run on

my $seq_object = get_sequence('swissprot',"ROA1_HUMAN");

# uses the default database - nr in this case
my $blast_result = blast_sequence($seq);

write_blast(">roa1.blast",$blast_report);
[tisdall]$ perl tut2.pl
Global symbol "$seq" requires explicit package name at tut2.pl line 11.
Global symbol "$blast_report" requires explicit package name at tut2.pl line 13.
Execution of tut2.pl aborted due to compilation errors.
[tisdall]$

Well, that's pretty clear. The variables $seq and $blast_report are wrong; apparently, the author intended to reuse the variables $seq_object and $blast_result instead. So, I edited the file and ran it again:

[tisdall]$ cat tut2.pl
use Bio::Perl;
use strict;
use warnings;

# this script will only work with an internet connection
# on the computer it is run on

my $seq_object = get_sequence('swissprot',"ROA1_HUMAN");

# uses the default database - nr in this case
my $blast_result = blast_sequence($seq_object);

write_blast(">roa1.blast",$blast_result);
[tisdall]$ perl tut2.pl
[tisdall]$ perl tut2.pl
Submitted Blast for [ROA1_HUMAN] ................... 
[tisdall]$ ls -l roa1.blast 
-rw-rw-r--    1 tisdall  tisdall     56888 May  5 15:05 roa1.blast
[tisdall]$

Examining the roa1.blast file convinced me that the program had successfully called blast. I decided that was a success, albeit a qualified one: this was a spot in the documentation that could use a little attention, clearly. By the time you're reading this, it may well have been fixed.

9.3.3 Fourth Test

Now, let me show you one more test of my new Bioperl installation.

Looking at the Bio::Perl documentation, I found the following example and discussion at the very beginning.

Bio::Perl(3)   User Contributed Perl Documentation   Bio::Perl(3)

NAME
       Bio::Perl - Functional access to BioPerl for people who
       don't know objects

SYNOPSIS
         use Bio::Perl;

         # will guess file format from extension
         $seq_object = read_sequence($filename);

         # forces genbank format
         $seq_object = read_sequence($filename,'genbank');

         # reads an array of sequences
         @seq_object_array = read_all_sequences($filename,'fasta');

         # sequences are Bio::Seq objects, so the following methods work
         # for more info see L<Bio::Seq>, or do 'perldoc Bio/Seq.pm'

         print "Sequence name is ",$seq_object->display_id,"\n";
         print "Sequence acc  is ",$seq_object->accession_number,"\n";
         print "First 5 bases is ",$seq_object->subseq(1,5),"\n";

         # get the whole sequence as a single string

         $sequence_as_a_string = $seq_object->seq(  );

         # writing sequences

         write_sequence(">$filename",'genbank',$seq_object);

         write_sequence(">$filename",'genbank',@seq_object_array);

         # making a new sequence from just strings you have
         # from something else

         $seq_object = new_sequence("ATTGGTTTGGGGACCCAATTTGTGTGTTATATGTA",
             "myname","AL12232");

         # getting a sequence from a database (assumes internet connection)

         $seq_object = get_sequence('swissprot',"ROA1_HUMAN");

         $seq_object = get_sequence('embl',"AI129902");

         $seq_object = get_sequence('genbank',"AI129902");

         # BLAST a sequence (assummes an internet connection)

         $blast_report = blast_sequence($seq_object);

         write_blast(">blast.out",$blast_report);

DESCRIPTION
       Easy first time access to BioPerl via functions

FEEDBACK
       Mailing Lists

       User feedback is an integral part of the evolution of this
       and other Bioperl modules. Send your comments and sugges-
       tions preferably to one of the Bioperl mailing lists.
       Your participation is much appreciated.

         bioperl-l@bio.perl.org

       Reporting Bugs

       Report bugs to the Bioperl bug tracking system to help us
       keep track the bugs and their resolution. Bug reports can
       be submitted via email or the web:

         bioperl-bugs@bio.perl.org
         http://bugzilla.bioperl.org/

AUTHOR - Ewan Birney
       Email bioperl-l@bio.perl.org

       Describe contact details here

APPENDIX
       The rest of the documentation details each of the object
       methods.  Internal methods are usually preceded with a _
...

I took the example code from the SYNOPSIS section and put it in a file called bp1.pl. I noticed that the code was not meant to be a running program. The variable $filename refers to some sequence file in the first line of code without being initialized, and then the same variable name is used in several other lines for different purposes. This is not an uncommon situation in such SYNOPSIS sections; the point is to show how individual calls can be made to methods in the class, not to demonstrate a complete working program (although sometimes the code can be run exactly as is).

I had to make some changes to make this code a working, runnable program. I declared the strict and warnings pragmas and declared each variable with my. I created variables for the different sequence filenames I needed as both input and output, and I made sure that the input sequence files were on disk and of the correct variety (for example, I created the array.fasta file with three FASTA headers and sequences one after the other). In the end I had this code:

use Bio::Perl;
use strict;
use warnings;

my $gbfilename = 'AI129902.genbank';

# will guess file format from extension
my $seq_object0 = read_sequence($gbfilename);

# forces genbank format
my $seq_object1 = read_sequence($gbfilename,'genbank');

my $fastafilename = 'array.fasta';

# reads an array of sequences
my @seq_object_array = read_all_sequences($fastafilename,'fasta');

# sequences are Bio::Seq objects, so the following methods work
# for more info see L<Bio::Seq>, or do 'perldoc Bio/Seq.pm'

print "Sequence name is ",$seq_object1->display_id,"\n";
print "Sequence acc  is ",$seq_object1->accession_number,"\n";
print "First 5 bases is ",$seq_object1->subseq(1,5),"\n";

# get the whole sequence as a single string

my $sequence_as_a_string = $seq_object1->seq(  );

# writing sequences

my $gbfilenameout = 'bpout.genbank';

write_sequence(">$gbfilenameout",'genbank',$seq_object1);

write_sequence(">$gbfilenameout",'genbank',@seq_object_array);

# making a new sequence from just strings you have
# from something else

my $seq_object2 = new_sequence("ATTGGTTTGGGGACCCAATTTGTGTGTTATATGTA",
    "myname","AL12232");

# getting a sequence from a database (assumes internet connection)

my $seq_object3 = get_sequence('swissprot',"ROA1_HUMAN");

my $seq_object4 = get_sequence('embl',"AI129902");

my $seq_object5 = get_sequence('genbank',"AI129902");

# BLAST a sequence (assummes an internet connection)

my $blast_report = blast_sequence($seq_object3);

write_blast(">blast.out",$blast_report);

As you see, I didn't check on the output of all the method calls, but the real purpose of this test of my newly installed Bioperl modules was just to see if all the modules and methods could be found and would run when called.

As mentioned previously, I also added use strict; and use warnings; and declared all the variables with my. Often, you don't see that usage in this kind of documentation; it's not required in Perl, and it may distract some readers of the documentation from the main point of showing how the methods are called. This is not an unusual omission to find in Perl class documentation, but of course, it's recommended and often very helpful, as you saw in the last section.

So, I ran my slightly edited version of the example code from the beginning of the Bio::Perl manpage, with the following results:

[tisdall]$ perl bp1.pl
Sequence name is AI129902
Sequence acc  is AI129902
First 5 bases is CTCCG
Submitted Blast for [ROA1_HUMAN] .........
[tisdall]$

I looked at each of the input and output files and verified the expected contents. The following output from a listing of the files will give you an idea of what to expect (although you may get different results by running the program with different input files or database lookups):

-rw-rw-r--    1 tisdall  tisdall      2391 May  5 10:37 AI129902.genbank
-rw-rw-r--    1 tisdall  tisdall      2485 May  5 13:00 array.fasta
-rw-rw-r--    1 tisdall  tisdall      1513 May  5 13:02 bp1.pl
-rw-rw-r--    1 tisdall  tisdall      3653 May  5 13:02 bpout.genbank
-rw-rw-r--    1 tisdall  tisdall     56888 May  5 13:04 blast.out

Notice that at the end of the program I call the blast_sequence method on the protein sequence object $seq_object3. I discovered that trying to run the blast_sequence method on a nucleotide sequence object (such as $seq_object5) failed. Although the documentation for the method said that the sequence type would be examined and the appropriate BLAST program called (for example, blastp for protein sequence and blastx for nucleotide sequence, against the nr nonredundant protein database), it always seemed to call blastp no matter what the input sequence, and therefore it failed unless called with protein sequence as I had stored in $seq_object3. Perhaps this bug, a disconnect between the code and the documentation, has been fixed by the time you read this.