To check that things were working okay with my new Bioperl installation, I first wrote a little test to see if a Perl program could find the Bio::Perl module:
use Bio::Perl;
I ran it by putting it in a file bp0.pl and giving it to the Perl interpreter:
[tisdall]# perl bp0.pl [tisdall]#
As you see, it didn't complain, which means Perl found the Bio::Perl module. If it can't find it, it will complain. When I copied the bp0.pl program to a file called bp0.pl.broken and changed the module call of Bio::Perl to a call for the nonexistent module Bio::Perrl, I got the following (slightly truncated) error output:
[tisdall@coltrane development]$ perl bp0.pl.broken Can't locate Bio/Perrl.pm in @INC BEGIN failed--compilation aborted at bp0.pl.broken line 1.
Now I knew that Bio::Perl could be found and loaded. I next tried a couple of test programs that are given in the bptutorial.pl document. I found a link to that tutorial document on the Web from the http://bioperl.org page. Alternately, I could have opened a window and typed at the command prompt:
perldoc bptutorial.pl
I went to the section near the beginning of the document called "I.2 Quick getting started scripts" and created a file tut1.pl on my computer by pasting in the text of the first tutorial script:
use Bio::Perl; # this script will only work with an internet connection # on the computer it is run on $seq_object = get_sequence('swissprot',"ROA1_HUMAN"); write_sequence(">roa1.fasta",'fasta',$seq_object);
I then ran the program and looked at the output file:
[tisdall]$ perl tut1.pl [tisdall]$ cat roa1.fasta >ROA1_HUMAN Heterogeneous nuclear ribonucleoprotein A1 (Helix-destabilizing protein) (Single-strand binding protein) (hnRNP core protein A1). SKSESPKEPEQLRKLFIGGLSFETTDESLRSHFEQWGTLTDCVVMRDPNTKRSRGFGFVT YATVEEVDAAMNARPHKVDGRVVEPKRAVSREDSQRPGAHLTVKKIFVGGIKEDTEEHHL RDYFEQYGKIEVIEIMTDRGSGKKRGFAFVTFDDHDSVDKIVIQKYHTVNGHNCEVRKAL SKQEMASASSSQRGRSGSGNFGGGRGGGFGGNDNFGRGGNFSGRGGFGGSRGGGGYGGSG DGYNGFGNDGGYGGGGPGYSGGSRGYGSGGQGYGNQGSGYGGSGSYDSYNNGGGRGFGGG SGSNFGGGGSYNDFGNYNNQSSNFGPMKGGNFGGRSSGPYGGGGQYFAKPRNQGGYGGSS SSSSYGSGRRF [tisdall]$
That seemed to work perfectly.
I tried the next short script from the same section of the tutorial, pasting it into a file called tut2.pl:
[tisdall]$ cat tut2.pl use Bio::Perl; # this script will only work with an internet connection # on the computer it is run on $seq_object = get_sequence('swissprot',"ROA1_HUMAN"); # uses the default database - nr in this case $blast_result = blast_sequence($seq); write_blast(">roa1.blast",$blast_report); [tisdall]$ perl tut2.pl -------------------- WARNING --------------------- MSG: req was POST http://www.ncbi.nlm.nih.gov/blast/Blast.cgi User-Agent: libwww-perl/5.69 Content-Length: 178 Content-Type: application/x-www-form-urlencoded ... --------------------------------------------------- Submitted Blast for [blast-sequence-temp-id] [tisdall]$ cat roa1.blast [tisdall]$ ls -l roa1.blast -rw-rw-r-- 1 tisdall tisdall 0 Apr 30 11:28 roa1.blast [tisdall]$
Here, I experienced a problem. Running the program created a page of error output, mostly formatted in HTML, which I've truncated in the output. When I tried to cat the output file, there was nothing in it, which I verified (on Linux) by examining the file with ls -l, which showed that it had 0 bytes in it.
This wasn't very encouraging; the second of the two short example scripts was failing.
Looking at the two programs tut1.pl and tut2.pl, you can see that the second is an extension of the first; after getting the sequence, the second program also tries to blast it, and this is where it is failing. It's running (printing those ... dots took a while because the program was waiting for a reply from NCBI over the Internet), but it's not printing out the BLAST report to the file roa1.blast. What's going wrong?
I started my investigation by looking at the documentation for the Bio::Perl module by typing:
perldoc Bio::Perl
and searching for the function that is failing, blast_sequence. Here's what I found:
blast_sequence Title : blast_sequence Usage : $blast_result = blast_sequence($seq) $blast_result = blast_sequence('MFVEGGTFASEDDDSASAEDE'); Function: If the computer has Internet accessibility, blasts the sequence using the NCBI BLAST server against nrdb. It choose the flavour of BLAST on the basis of the sequence. This function uses Bio::Tools::Run::RemoteBlast, which itself use Bio::SearchIO - as soon as you want to more, check out these modules Returns : Bio::Search::Result::GenericResult.pm Args : Either a string of protein letters or nucleotides, or a Bio::Seq object
That last sentence about the Args gave me a clue. I went back and looked at the failing program, which I placed in a file called tut2.pl, and, sure enough, the argument for the blast_sequence function, $seq, is neither a string of sequence nor a Bio::Seq object. In fact, it's not even defined! Clearly, what the tutorial writer meant instead of $seq, was $seq_object, which is defined and populated by the earlier get_sequence method call.
I was visited by a brainwave, and I decided to put a use strict and a use warnings into the program and to declare variables with my to see what ensued:
[tisdall]$ cat tut2.pl use Bio::Perl; use strict; use warnings; # this script will only work with an internet connection # on the computer it is run on my $seq_object = get_sequence('swissprot',"ROA1_HUMAN"); # uses the default database - nr in this case my $blast_result = blast_sequence($seq); write_blast(">roa1.blast",$blast_report); [tisdall]$ perl tut2.pl Global symbol "$seq" requires explicit package name at tut2.pl line 11. Global symbol "$blast_report" requires explicit package name at tut2.pl line 13. Execution of tut2.pl aborted due to compilation errors. [tisdall]$
Well, that's pretty clear. The variables $seq and $blast_report are wrong; apparently, the author intended to reuse the variables $seq_object and $blast_result instead. So, I edited the file and ran it again:
[tisdall]$ cat tut2.pl use Bio::Perl; use strict; use warnings; # this script will only work with an internet connection # on the computer it is run on my $seq_object = get_sequence('swissprot',"ROA1_HUMAN"); # uses the default database - nr in this case my $blast_result = blast_sequence($seq_object); write_blast(">roa1.blast",$blast_result); [tisdall]$ perl tut2.pl [tisdall]$ perl tut2.pl Submitted Blast for [ROA1_HUMAN] ................... [tisdall]$ ls -l roa1.blast -rw-rw-r-- 1 tisdall tisdall 56888 May 5 15:05 roa1.blast [tisdall]$
Examining the roa1.blast file convinced me that the program had successfully called blast. I decided that was a success, albeit a qualified one: this was a spot in the documentation that could use a little attention, clearly. By the time you're reading this, it may well have been fixed.
Now, let me show you one more test of my new Bioperl installation.
Looking at the Bio::Perl documentation, I found the following example and discussion at the very beginning.
Bio::Perl(3) User Contributed Perl Documentation Bio::Perl(3) NAME Bio::Perl - Functional access to BioPerl for people who don't know objects SYNOPSIS use Bio::Perl; # will guess file format from extension $seq_object = read_sequence($filename); # forces genbank format $seq_object = read_sequence($filename,'genbank'); # reads an array of sequences @seq_object_array = read_all_sequences($filename,'fasta'); # sequences are Bio::Seq objects, so the following methods work # for more info see L<Bio::Seq>, or do 'perldoc Bio/Seq.pm' print "Sequence name is ",$seq_object->display_id,"\n"; print "Sequence acc is ",$seq_object->accession_number,"\n"; print "First 5 bases is ",$seq_object->subseq(1,5),"\n"; # get the whole sequence as a single string $sequence_as_a_string = $seq_object->seq( ); # writing sequences write_sequence(">$filename",'genbank',$seq_object); write_sequence(">$filename",'genbank',@seq_object_array); # making a new sequence from just strings you have # from something else $seq_object = new_sequence("ATTGGTTTGGGGACCCAATTTGTGTGTTATATGTA", "myname","AL12232"); # getting a sequence from a database (assumes internet connection) $seq_object = get_sequence('swissprot',"ROA1_HUMAN"); $seq_object = get_sequence('embl',"AI129902"); $seq_object = get_sequence('genbank',"AI129902"); # BLAST a sequence (assummes an internet connection) $blast_report = blast_sequence($seq_object); write_blast(">blast.out",$blast_report); DESCRIPTION Easy first time access to BioPerl via functions FEEDBACK Mailing Lists User feedback is an integral part of the evolution of this and other Bioperl modules. Send your comments and sugges- tions preferably to one of the Bioperl mailing lists. Your participation is much appreciated. bioperl-l@bio.perl.org Reporting Bugs Report bugs to the Bioperl bug tracking system to help us keep track the bugs and their resolution. Bug reports can be submitted via email or the web: bioperl-bugs@bio.perl.org http://bugzilla.bioperl.org/ AUTHOR - Ewan Birney Email bioperl-l@bio.perl.org Describe contact details here APPENDIX The rest of the documentation details each of the object methods. Internal methods are usually preceded with a _ ...
I took the example code from the SYNOPSIS section and put it in a file called bp1.pl. I noticed that the code was not meant to be a running program. The variable $filename refers to some sequence file in the first line of code without being initialized, and then the same variable name is used in several other lines for different purposes. This is not an uncommon situation in such SYNOPSIS sections; the point is to show how individual calls can be made to methods in the class, not to demonstrate a complete working program (although sometimes the code can be run exactly as is).
I had to make some changes to make this code a working, runnable program. I declared the strict and warnings pragmas and declared each variable with my. I created variables for the different sequence filenames I needed as both input and output, and I made sure that the input sequence files were on disk and of the correct variety (for example, I created the array.fasta file with three FASTA headers and sequences one after the other). In the end I had this code:
use Bio::Perl; use strict; use warnings; my $gbfilename = 'AI129902.genbank'; # will guess file format from extension my $seq_object0 = read_sequence($gbfilename); # forces genbank format my $seq_object1 = read_sequence($gbfilename,'genbank'); my $fastafilename = 'array.fasta'; # reads an array of sequences my @seq_object_array = read_all_sequences($fastafilename,'fasta'); # sequences are Bio::Seq objects, so the following methods work # for more info see L<Bio::Seq>, or do 'perldoc Bio/Seq.pm' print "Sequence name is ",$seq_object1->display_id,"\n"; print "Sequence acc is ",$seq_object1->accession_number,"\n"; print "First 5 bases is ",$seq_object1->subseq(1,5),"\n"; # get the whole sequence as a single string my $sequence_as_a_string = $seq_object1->seq( ); # writing sequences my $gbfilenameout = 'bpout.genbank'; write_sequence(">$gbfilenameout",'genbank',$seq_object1); write_sequence(">$gbfilenameout",'genbank',@seq_object_array); # making a new sequence from just strings you have # from something else my $seq_object2 = new_sequence("ATTGGTTTGGGGACCCAATTTGTGTGTTATATGTA", "myname","AL12232"); # getting a sequence from a database (assumes internet connection) my $seq_object3 = get_sequence('swissprot',"ROA1_HUMAN"); my $seq_object4 = get_sequence('embl',"AI129902"); my $seq_object5 = get_sequence('genbank',"AI129902"); # BLAST a sequence (assummes an internet connection) my $blast_report = blast_sequence($seq_object3); write_blast(">blast.out",$blast_report);
As you see, I didn't check on the output of all the method calls, but the real purpose of this test of my newly installed Bioperl modules was just to see if all the modules and methods could be found and would run when called.
As mentioned previously, I also added use strict; and use warnings; and declared all the variables with my. Often, you don't see that usage in this kind of documentation; it's not required in Perl, and it may distract some readers of the documentation from the main point of showing how the methods are called. This is not an unusual omission to find in Perl class documentation, but of course, it's recommended and often very helpful, as you saw in the last section.
So, I ran my slightly edited version of the example code from the beginning of the Bio::Perl manpage, with the following results:
[tisdall]$ perl bp1.pl Sequence name is AI129902 Sequence acc is AI129902 First 5 bases is CTCCG Submitted Blast for [ROA1_HUMAN] ......... [tisdall]$
I looked at each of the input and output files and verified the expected contents. The following output from a listing of the files will give you an idea of what to expect (although you may get different results by running the program with different input files or database lookups):
-rw-rw-r-- 1 tisdall tisdall 2391 May 5 10:37 AI129902.genbank -rw-rw-r-- 1 tisdall tisdall 2485 May 5 13:00 array.fasta -rw-rw-r-- 1 tisdall tisdall 1513 May 5 13:02 bp1.pl -rw-rw-r-- 1 tisdall tisdall 3653 May 5 13:02 bpout.genbank -rw-rw-r-- 1 tisdall tisdall 56888 May 5 13:04 blast.out
Notice that at the end of the program I call the blast_sequence method on the protein sequence object $seq_object3. I discovered that trying to run the blast_sequence method on a nucleotide sequence object (such as $seq_object5) failed. Although the documentation for the method said that the sequence type would be examined and the appropriate BLAST program called (for example, blastp for protein sequence and blastx for nucleotide sequence, against the nr nonredundant protein database), it always seemed to call blastp no matter what the input sequence, and therefore it failed unless called with protein sequence as I had stored in $seq_object3. Perhaps this bug, a disconnect between the code and the documentation, has been fixed by the time you read this.