11.2 BLAST Databases

The mechanics of creating a BLAST database are quite simple: just run formatdb or xdformat with the proper syntax. Chapter 10 discussed this topic, and you'll find the command summaries in Chapter 13 and Chapter 14. There are, however, subtleties that make this process more complicated than it may appear.

11.2.1 Large Databases

One of the most common database complications involves large files. Most computers today use 32-bit operating systems and 32-bit filesystems. This puts a physical limit of 4 GB on the amount of RAM and 4 GB on the size of any particular file. (You may find that you're actually limited to less than 4 GB in both cases; a 2-GB limit is quite common.) Few computers currently have or need 4 GB of RAM, but most hard disks are quite a bit larger than 4 GB, and files can sometimes exceed these limits. Therefore, many operating systems offer 64-bit filesystems as an option. Unfortunately, you can't just change the filesystem and expect everything to work. Making software applications aware of large files often means recompiling them with special flags, and the migration to a 64-bit filesystem can be painful because applications don't tell you useful things like "I'm not large-file-aware." Instead, they just sit there quietly burning CPU time while they run in endless loops.

11.2.1.1 Large NCBI databases

The standard protocol for formatting a database is to run formatdb on a FASTA database:

formatdb -p F -i fasta_db -o

NCBI-BLAST databases are physically limited to 4 GB of sequence, which corresponds to about 4 billion amino acids or 16 billion nucleotides (nucleotides are compressed 4:1). On a 32-bit filesystem, the previous approach won't let you use all of this space because the FASTA file itself can't contain more than 2 or 4 billion letters. Creating a database larger than that requires piping sequence to formatdb:

cat fasta1 fasta2 fasta3 | formatdb -p F -i stdin -n my_db -o

But what if you happen to have more than 16 billion letters? This isn't a problem because formatdb automatically segments large databases into volumes of at most 16 billion nucleotides each and creates something called an alias database that stitches them all together. This is really convenient because it means you can search enormous databases even on 32-bit filesystems. Alias databases are discussed in more detail later in this chapter.

It's still possible to run into file size issues by piping FASTA files to formatdb because the filesystem maximum may be 2 GB while the implicit BLAST maximum is 4 GB. Fortunately, formatdb lets you set the size of each database volume with the -v parameter. The following example sets this size to 2 billion and adds a bit more realism by piping the FASTA files from a compressed format:

zcat file*.gz | formatdb -i stdin -p F -o -n my_db -v 2000000000

A word of caution: be precise. If you accidentally leave off one of the zeroes, you can create 10 times as many files. For large databases, this can be a problem because the maximum number of volumes is 100.

11.2.1.2 Large WU-BLAST databases

WU-BLAST doesn't use alias databases, so the only way to create a database larger than 4 GB is on a 64-bit filesystem. If you're accessing or distributing databases over a network, the network must also be 64-bit aware. To index a large number of compressed files, use a command such as the following:

zcat file* | xdformat -n -I -o ESTs -- -

Following typical Unix command-line syntax, the double dash indicates the end of the command-line options, and the single dash denotes standard input rather than a file.

If you're stuck with a 32-bit filesystem and need to search large BLAST databases, you can use virtual databases, which are explained next. The free version of WU-BLAST, however, has no large-file support and no virtual database mechanism. Your best solution is to create several databases within the limits of your filesystem, search each one independently, and then merge the results.
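For example, assuming the sequence is split across three compressed FASTA files (all names here are hypothetical), the procedure might look like this:

zcat part1.fa.gz | xdformat -n -o ESTs1 -- -
zcat part2.fa.gz | xdformat -n -o ESTs2 -- -
zcat part3.fa.gz | xdformat -n -o ESTs3 -- -
for db in ESTs1 ESTs2 ESTs3
do
    blastn $db query > query_vs_$db
done

Remember that each search computes its statistics from the size of its own piece, so before merging the reports you may want to force a common effective database size (for example, with WU-BLAST's Z parameter) so that E-values are comparable across the pieces.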

11.2.2 Virtual Databases

Virtual databases let you combine multiple databases and use them as if they were one. It's as simple as including the various databases in quotes on the command line. Here's how it looks for NCBI-BLAST:

blastall -p blastp -d "db1 db2 db3" -i query

And here is the equivalent WU-BLAST command line:

blastp "db1 db2 db3" query

Virtual databases are useful for grouping related searches. Let's say you want to search individually against an EST database and an mRNA database, as well as a transcripts database that combines the two. You could create all three databases, but the transcripts database would duplicate the data in the other two. Alternatively, you can create just the ESTs and mRNAs databases and use the virtual database "ESTs mRNAs" for transcripts. Virtual databases behave just like normal databases. You can even retrieve sequences from virtual databases with fastacmd (this feature isn't yet available for xdget, but see the end of this chapter for a workaround).
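For example, given databases named ESTs and mRNAs, the combined transcripts search with NCBI-BLAST is simply:

blastall -p blastn -d "ESTs mRNAs" -i query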

One thing you probably don't want to do is to combine databases with redundant sequences. For example, you wouldn't want to group the NCBI nr database with SWISS-PROT because nr already includes SWISS-PROT. Duplicated sequences decrease the statistical significance of matches and can be confusing in the output.

11.2.3 Alias Databases

Alias databases are a unique and powerful feature of NCBI-BLAST. You've already seen that formatdb creates alias databases when splitting large files, but alias databases have other uses as well. They can serve as static virtual databases for any combination of databases; all you have to do is create a file with the proper name and syntax. Here's a simple alias file, transcripts.nal, that combines the ESTs and mRNAs databases from the previous example into a transcripts database:

TITLE transcripts
DBLIST ESTs mRNAs

The TITLE is the name of the database and the DBLIST is simply a list of the databases to merge. Using alias databases you can, for example, organize sequences by organism. All you have to do is create the individual databases and combine them in various ways with alias files to create more comprehensive sets.
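For example, an alias file along the following lines (the database names here are hypothetical) could gather several organism-specific databases into a single comprehensive set:

TITLE vertebrates
DBLIST human mouse zebrafish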

Not only can you join databases, but you can also use alias files to restrict searches to particular sequences from a database. Let's say you create a comprehensive EST database and then want to create a human-only EST database. The alias file for such a database looks something like this:

TITLE humanESTs
DBLIST ESTs
GILIST human.gi

The GILIST specifies a file that contains the GI numbers of the sequences to search. There are a few complexities when working with GI lists. First, using a GI list assumes that your sequences have GI numbers; if your original FASTA identifiers don't include GI numbers, you can't use this feature. It's unfortunate that the file of GI numbers isn't a file of accession numbers, but that's the way it is. If you want to use GI lists for sequences without GI numbers, you have to add fake GI numbers to your identifiers. The following script counts backward from the maximum possible GI number and thus minimizes the potential for conflict with real GI numbers:

#!/usr/bin/perl
# fake-gi.pl: add fake GI numbers to FASTA definition lines

$i = 2147483648;    # 2^31; the first fake GI printed is 2147483647
while (<>) {
    if (/^>/) {
        $i--;       # count backward from the maximum GI number
        print ">gi|$i (fake-gi) ", substr($_, 1);
    }
    else {
        print;
    }
}

To use this script, fake-gi.pl, simply place it upstream of formatdb in your pipe:

zcat file*.gz | fake-gi.pl | formatdb -i stdin -p F -o -n my_db -v 2000000000

Creating an alias database restricted to GI numbers is a bit more complicated than just merging databases. First you need to get a list of GI numbers for your sequences of interest. There are a number of ways to do this, but the easiest is to use NCBI Entrez. If you have a GI list called human.gi.list, the next step is to convert it to binary form using formatdb (this step isn't actually required, but it does improve performance):

formatdb -F human.gi.list -B human.gi

Finally, create the alias file. You can write it yourself, but you might as well let formatdb do it for you because it also adds the number of sequences and their total length to the information in the alias file:

formatdb -p F -i ESTs -L humanESTs -F human.gi

The -L option gives the name of the alias database to create, and -F names the binary file of GI numbers.
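The resulting database is searched like any other. For example:

blastall -p blastn -d humanESTs -i query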

As you can see, alias databases are a powerful way to join and split databases. As BLAST development continues, you should expect to see more structure in BLAST databases and greater control over sequence subsets.

11.2.4 Removing Redundancy

One way to improve the efficiency of your BLAST searches is to remove redundant sequences. Consider what happens when you search a redundant database. The statistical significance of a database hit depends on the size of the database. Each redundant sequence artificially increases the size of the database and therefore reduces the statistical significance of any hit. In addition, redundancy makes the search slower.

The nrdb program that comes with the licensed version of WU-BLAST concatenates the definition lines of all identical sequences. This program isn't available with the free version, and the NCBI BLAST distribution doesn't include an equivalent. WU-BLAST also includes patdb, which is a bit more aggressive because it concatenates identical subsequences. Both programs are very efficient. You can find examples of their use in Chapter 10. You can also write your own database purifier using Bioperl tools. (Some entertaining discussions about the "best" way to do this can be found in the Bioperl mailing list archives at http://bioperl.org.) Here's one way:

#!/usr/bin/perl
use Bio::SeqIO;

# Index definition lines by sequence so that identical sequences collapse
my %NR;
my $file = Bio::SeqIO->new(-fh => \*ARGV, -format => 'fasta');
while (my $fasta = $file->next_seq) {
    my $def = $fasta->id . " " . $fasta->desc;
    $NR{$fasta->seq}{$def} = 1;
}

# Print each unique sequence once, joining its definition lines with
# Control-A, the separator NCBI uses for merged deflines
for my $seq (keys %NR) {
    print ">", join(chr(1), keys %{$NR{$seq}}), "\n", $seq, "\n";
}
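To use this script (call it nr.pl, for example), place it upstream of formatdb just as with fake-gi.pl; the file and database names here are hypothetical:

zcat proteins*.gz | nr.pl | formatdb -i stdin -n my_nr -o

Note that the script holds every sequence in memory, which is one reason this approach is more practical for protein sets than for large nucleotide collections.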

It's more common to collapse redundant proteins than redundant nucleotide sequences. One reason is that nucleotide sequences are rarely identical to one another. Another is that nucleotide sequences can be very large, and the procedure becomes impractical with off-the-shelf hardware. Most importantly, it's better to assemble genomic fragments into chromosomes and ESTs into full-length transcripts. Because these tasks are complicated and compute-intensive, these feats of bioinformagic are best left to the experts.

11.2.5 Standard BLAST Databases

Every BLAST search is an experiment and should be planned as such. Just as you wouldn't want to use the same query for every BLAST search, you wouldn't want to use the same database for every BLAST search. However, a few databases are used so frequently that they have become standards. All databases described in this section are available from the NCBI at ftp://ftp.ncbi.nih.gov/blast/db/. The most important is the nonredundant protein database, nr. This database combines all translations from GenBank records (including RefSeq) with proteins from the SWISS-PROT, PIR, and PDB databases. If you want to do a comprehensive search against all known proteins, this is the database to use. Not all of the protein sequences have been verified experimentally, so you should expect some errors.

There are several essential nucleotide databases. The ecoli and vector databases may sound uninteresting, but they're actually quite important. The procedures used to sequence DNA rely on various molecular biology techniques involving the E. coli bacterium and various vector sequences for carrying DNA. Because of this, E. coli and vector sequences are among the most common sources of data contamination, and screening nucleotide sequences against these databases is a good way to detect these pollutants. Another database that is useful for detecting contaminants in genomic DNA is the mito database of mitochondrial sequences. The est database is another of the most popular; it contains all the expressed sequence tags from DDBJ/EMBL/GenBank. Many undiscovered proteins lurk in the est database.
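For example, a quick contamination check of a clone sequence (the query name here is hypothetical) is just a blastn search against the vector database:

blastall -p blastn -i my_clone -d vector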

The NCBI FTP site includes several other databases. For those interested in the business side of bioinformatics, the pataa and patnt databases contain patented amino acid and nucleotide sequences. The sts database contains sequence-tagged sites, which are mostly PCR amplimers that uniquely identify a region of a genome and are used in genome mapping. The gss database corresponds to genome survey sequences: random-ish sequences from various organisms. Some may not have been processed to remove sequencing vectors or low-quality reads, so the quality of the sequences varies. Finally, the htg database contains all high-throughput genomic sequences, which correspond to large genomic fragments from various organisms.

11.2.6 Custom BLAST Databases

In many cases, using a custom database rather than a standard database is more efficient. For example, if you're only interested in searching against human sequences, there's no point in including the rest of the public database. But the total cost of a BLAST search also includes creating the database, so it isn't always more efficient to use a custom database. Suppose you've just cloned a dog gene with mutations that bear some similarity to a human disease, and you want to know which human proteins correspond to the dog protein. You can build a human protein database and search it, or you could search against nr and ignore anything that isn't human. If this is the only experiment you're going to perform, it's probably more efficient to search nr than build a custom database.

If you occasionally want to make custom databases, it's worth getting to know one of the batch retrieval systems available on the Internet. You can make custom databases easily using these web-based systems. If you find yourself making custom databases frequently and find limitations with using web-based systems, you will probably want to have an in-house database. This can be a nontrivial task involving many hours of work and expensive computers, or it can be a relatively simple operation. It depends on the kind of performance and features you want. You will learn more about this topic in just a bit, but let's first discuss sequence databases in general.