10.6 Integration of BLAST into NeoCore XMS

  Previous section   Next section

Sequence alignment tools are fundamental in the bioinformatics arena. A sequence alignment tool compares two or more sequences in order to determine how similar the sequences are or if subsections of the sequences are similar. A common use of the tool is for a researcher who sequences a section of DNA to then access a database and search for other DNA sequences that are similar to the one being studied. Finding other sequences in the database that are similar is a first step in trying to determine the functionality of the gene or to determine other organisms that may have a similar gene. Comparisons do not have to be comparing nucleotide sequence against nucleotide sequence or amino acid sequence against amino acid sequence. Since we know that nucleotide sequences code for amino acid sequences, we can convert nucleotide sequences to their equivalent amino acid sequence and compare sequences on those terms. This is often done because proteins are the functional element in life, and functional information is typically being sought.

Needleman-Wunsch are credited with coming up with the first global sequence alignment algorithm in 1970 (Needleman and Wunsch 1970). A global alignment compares the sequences over the entire length of both sequences. Gaps are added to force the two sequences to be the same length. Dynamic programming, a well-known optimization technique, is used to find the alignment(s) that produce the highest score. A more computationally efficient version was found by Gotoh in 1982.

Smith and Waterman generalized the Needleman-Wunsch algorithm in 1981 to find the highest-scoring subsequence matches between two sequences (Smith and Waterman 1981). Their method includes gap analysis and also uses a dynamic programming approach to find the best matches. The Smith-Waterman algorithm is still in popular use today.

The main problem with the Smith-Waterman sequence alignment algorithm is that it is computationally expensive and therefore very time consuming when comparing a given sequence against a large database of sequences. To improve search speed, a number of heuristic sequence alignment methods have been developed, such as those used in BLAST, FASTA, and SALSA. Of these heuristic methods, BLAST (Basic Local Alignment Search Tool) seems to be the most popular and well known (Altschul et al. 1990; Altschul et al. 1997). The heuristic algorithms are not guaranteed to find optimal alignment solutions, but they are far more practical for searching a large database of sequences, and they do a very good job of finding similar sequences.

It is beyond the scope of this chapter to go into the details of the various search algorithms. Numerous books are available that do cover these details, including (Durbin et al. 1998).

NeoCore developed its own sequence search plug-in module to go with NeoCore XMS. The NeoCore sequence search plug-in is similar to BLAST in operation but is based on NeoCore's patented digital pattern processing technology. The plug-in gives the power of a BLAST sequence search embedded (SSE) within a query of the XML database, which has the added advantage of being able to do targeted BLAST searches. In addition to looking for sequences that are similar to a query sequence, the query can restrict the search to DNA sequences from a particular organism or sequences involved in a particular metabolic pathway. Any set of sequences that can be targeted with an XPath query can have a BLAST search performed on it.

The NeoCore SSE plug-in is accessed via a function call within an XPath query. An example XPath query that calls the SSE is

/ND/dna_sequence_entry/dna_sequence{func("Blastn,-w11,-r1,-q3, -i actgcggt . . . ")}/..

This query searches all DNA sequence entries within NeoCore XMS, targeted by /ND/dna_sequence_entry/dna_sequence. NeoCore XMS wraps all documents within the tags <ND>, hence the /ND at the beginning of the XPath query. The query returns all nucleotide sequences that are similar to the query sequence actgcggt . . .. Similarity is based on the BLAST scoring algorithm. The addition to the standard XPath query is the function call section starting and ending with curly brackets, { func("Blastn, . . . ")}. Five separate function calls and a variety of parameters are supported by the NeoCore SSE. The five functions mimic those of BLAST and for convenience are named the same as BLAST calls.

10.6.1 Sequence Search Types

This section describes the various sequence search types supported by NeoCore SSE. The five functions supported are

  • Protein-Protein Search (Blastp)

    This search compares an amino acid query sequence against amino acid sequences stored in the database.

  • Nucleotide-Nucleotide Search (Blastn)

    This search compares a nucleotide query sequence against nucleotide sequences stored in the database.

  • Nucleotide-Protein Search (Blastx)

    This search compares a nucleotide query sequence against amino acid sequences stored in the database. The nucleotide query sequence is first translated into a set of six possible amino acid sequences. The six amino acid sequences come from the six possible reading frames of the nucleotide sequence.

  • Protein-Nucleotide Search (tBlastn)

    This search compares an amino acid query sequence against nucleotide sequences stored in the database. The nucleotide sequences in the database are dynamically translated in all six reading frames during the sequence search to their equivalent amino acid sequences and then compared to the protein sequence.

  • Translated Nucleotide-Nucleotide Search (tBlastx)

    This search compares a nucleotide query sequence, translated into six possible amino acid sequences, against nucleotide sequences stored in the database. The nucleotide sequences in the database are dynamically translated in all six reading frames during the sequence search and then compared to the six translated protein sequences.

Determining the set of nucleotides that make up a DNA sequence is but one step in the discovery process. The question arises as to how one goes about finding a gene, or a section of the DNA sequence that codes for a protein. It is not at all obvious from sequencing one side of a DNA strand where the genes lie. Similarity searches play a role in this process of discovering genes. If the sequence information that comes from one side of a DNA strand is translated into its various possible amino acid sequences via the genetic code table, a database of proteins can be searched looking for possible matches. If matches are found, then the corresponding section of the DNA may be a gene.

If the human genome may be thought of as the book of life that contains the blueprints for our construction, a mechanism must exist to indicate starting and stopping points. During protein construction within the cell, as genes are read, certain amino acids act as initiators and others act as terminators. The amino acid Methionine is the initiator or starting point for a gene and is coded for by the DNA sequence A-T-G in the genetic code table. As with any construction process, instructions for the ending or completion of the project must exist. Three stop codons serve this purpose. The DNA sequences T-A-A, T-G-A, or T-A-G do not code for an amino acid but act as signals to complete the protein under construction. The newly constructed protein begins with Methionine and ends just before one of the stop codons. Regions bounded by start and stop codons are known as Open Reading Frames (ORF). An ORF represents a possible gene-coding region in the nucleotide sequence. A DNA sequence may have six potential overlapping reading frames that must be evaluated.

As previously mentioned, DNA is a double-stranded molecule. Genes may be found on either the forward or reverse strand. The forward and reverse strands are related because of the pairing of the nucleotides across from one another. A Cytosine will pair only with a Guanine, and a Thymine will pair only with an Adenine.

Forward strand 
C A T G C A T G C A T C
G T A C G T A C G T A G
Reverse strand 

In translating the forward sequence, the first three frames are found (remember that it takes three nucleotides to code for one amino acid, and we are not sure where the starting point is):

Frame 2:

C

ATG

CAT

GCA

TC

Begin translating at the second nucleotide

   

M

H

A

S

 

Frame 3:

CA

TGC

ATG

CAT

C

Begin translating at the third nucleotide

   

C

M

H

   

To translate the reverse frames for evaluation the strand becomes:

G A T G C A T G C A T G

Frame 4:

GAT

GCA

TGC

ATG

Begin translation with the first nucleotide

 

D

A

C

M

 

Frame 5:

G

ATG

CAT

GCA

TG

Begin translation with the second nucleotide

   

M

H

A

   

Frame 6:

GA

TGC

ATG

CAT

G

Begin translation with the third nucleotide

   

C

M

H

   

Frames 2 and 5 begin with the Methionine initiator (M) or start signal. If one of these two strands ends with one of the stop codons, this would tend to indicate a gene-coding region that results in a manufactured protein. Another point that arises from this example is illustrated in Frame 2. Only two nucleotides are known at the end. No matter what the third nucleotide is, this codon will translate to amino acid S or serine. As previously mentioned, the genetic code contains redundancies in coding. Serine may be coded for by any of the following codons: TCA, TCG, TCT, TCC, AGT, and AGC. Methionine, the initiator of the protein chain, is coded for by the unique codon ATG. Except for BLASTp and BLASTn, examining the six reading frames is at the core of performing comprehensive BLAST searches. This search capability is an example of the ability to include powerful plug-ins within NeoCore XMS.

A couple of examples will demonstrate the use of the BLAST search engine with NeoCore XMS. NeoCore XMS was loaded with several bacterial genomes downloaded from NCBI. The XML format of the sequence entries can be assumed to be that shown in earlier sections of this chapter. The first example is a BLASTp. BLASTp compares an amino acid query sequence against amino acid sequences in the database. The XPath query is of the form shown in Listing 10.4.

Listing 10.4 XPath Query
/ND/dna_sequence_entry/features/gene/protein/amino_acid_sequence
{func("Blastn, -i  . . . KFGGTSVANAERFLRVADILESNARQGQVA . . . ")}/..

This query searches for all proteins similar to the query sequence ". . . KFGGTSVANA . . ." and targeted by XPath:

/ND/dna_sequence_entry/features/gene/protein/amino_acid_sequence

The trailing / . . . will cause the return XML to be at the <protein> element level. The return from this query will be of the form shown in Listing 10.5.

Listing 10.5 Result of the Query in Listing 10.4
<query_results>
      <protein>
            <gene>aspartokinase II</gene>
            <protein_id>AAC76922.1</protein_id>
            <db_xref>GI:1790376</db_xref>
            <amino_acid_sequence>
                  ...KFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAA...
            <amino_acid_sequence>
      </protein>
</query_results>

NeoCore XMS always returns well-formed XML. To view details on the sequence matches, there is an option to output a file of search results. One of the results output from this search is shown in Listing 10.6.

Listing 10.6 File Containing Search Results for Listing 10.5
protein_id: AAC76922.1
DB Seq. Offset:   15
QuerySeq. Offset:  4
SA Length:        35
SA Score:         61
DB Sequence:      KFGGSSLADVKCYLRVAGIMAEYSQPDDMMVVSAA
Query Sequence:   KFGGTSVANAERFLRVADILESNARQGQVATVLSA

In this case a subsequence of 35 amino acids from query sequence starting at offset 4 from the beginning of the query sequence, matched a subsequence from the database, starting at offset 15, with a score of 61. The score is built by adding up alignment scores for each pair of amino acids over the subsequence at hand. Exact matches give the highest score. Over the set of 20 amino acids, there are amino acids that have similar properties to one another. Alignments with amino acids that have similar properties will still add to the score in a positive way. Alignments with amino acids that have dissimilar properties will cause a negative penalty to be added to the score. Sequence substitution matrices provide alignment scores for all possible combinations of two amino acids. It is beyond the scope of this chapter to go into how the sequence substitution matrix scoring is generated. The first couple of chapters in Durbin et al. 1998 provide a very readable development of sequence alignment methodologies and the generation of scoring matrices.

A more complex BLAST search can be performed using the tBLASTx option. The query format is shown in Listing 10.7.

Listing 10.7 A Complex Query Using the tBLASTx Option
/ND/dna_sequence_entry[source/taxonomy/cell_type="Eukaryota"]/dna_
sequence {func("tBlastx, -i 
gtaccctgtttacatcattttgccattttcgcgtactgcaaccggcggg . . . 
")}/../header/accession_no

This query performs a tBLASTx search against all DNA sequences in that database that come from a taxonomy cell type of Eukaryota. The query will return the accession number of all database sequences that meet the tBLASTx criteria. A typical result will look like Listing 10.8.

Listing 10.8 Result of the Query in Listing 10.7
<query_results>
      <accession_no>
            <base_no>AE000333</base_no>
            <version>AE000333.1</version>
            <GI>GI:1788805</GI>
      </accession_no>
</query_results>

A tBLASTx search first converts the query sequence into the six amino acid frames. The sequence search engine then compares the six query frames to each database sequence that is of the cell type Eukaryota. Before the comparison can be done, the database DNA sequences must first be dynamically converted to its six amino acid frames. The comparison is then performed at the amino acid sequence level. Using NeoCore's DPP technology, all six query sequences are compared to each database sequence in a single pass over the database sequences. The result is a very fast and accurate search. An example of one of the database matches is given in Listing 10.9.

Listing 10.9 Example Database Match for Amino Acid Sequence Comparison
Accession_no: AE000140.1
DB Seq Frame No:  2
Query Frame No:   6
DB Seq. Offset:   2728
QuerySeq. Offset: 592
SA Length:        32
SA Score:         53
DB Sequence:      FRVVAPSTHRRAGTGGVAGIQRNLLHSLHFRH
Query Sequence:   FRLVTQSEQGRAGAGDTARYRADFLRAMHHQQ

Notice how query sequence frame number 6 matched with a database sequence frame number 2. Since the query sequence match was on the reverse frame 6, the query sequence offset, 592, would be referenced from the tail end of the query nucleotide sequence. The database sequence matched on a forward frame 2, so the offset is from the beginning of the database nucleotide sequence. Nucleotide sequences tend to be much longer than amino acid sequences (at least by a factor of 3), hence the large offset numbers. The match subsequence is shown in the amino acid domain. The score is generated in the same fashion as a BLASTp type of search.

Sequence search and alignment tools are the heart and soul of data-mining in the high throughput genomics world. These are primary tools for the biologist today. The incorporation of a BLAST search engine into NeoCore XMS greatly enhances the database's ability to meet the information requirements of the exploding life sciences market.


Top

Part IV: Applications of XML