13.9 blastclust Parameters

blastclust clusters a database of protein or nucleotide sequences. It outputs rows of sequence identifiers from the database, with the members of each cluster on the same row and the clusters sorted from largest to smallest. The program can generate a list of clusters for input into another program (e.g., an alignment program such as PHRAP); however, it should be used only on a relatively small number of sequences (10-1000) because it runs on a single computer and its RAM requirements quickly exceed the memory available on most machines.
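
Because each output row is one cluster, the file is easy to post-process before it goes to another program. The following is a minimal sketch in Python, assuming the identifiers on a row are separated by whitespace; the function name parse_clusters and the sample identifiers are hypothetical.

```python
def parse_clusters(text):
    """Return a list of clusters, each a list of sequence identifiers.

    Assumption: every non-empty row is one cluster and the identifiers
    on a row are separated by whitespace.
    """
    clusters = []
    for row in text.splitlines():
        ids = row.split()
        if ids:  # skip blank rows
            clusters.append(ids)
    return clusters


# Hypothetical output: three clusters, largest first.
example_output = "seqA seqB seqC\nseqD seqE\nseqF\n"
clusters = parse_clusters(example_output)
sizes = [len(cluster) for cluster in clusters]
print(sizes)  # [3, 2, 1]
```

For a real run, the text would be read from the file named with -o, for example my_nucdb.clusters.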

Here are a few sample command lines:

blastclust -i my_nucdb -p F -o my_nucdb.clusters 
blastclust -i my_pepdb -o my_pepdb.clusters -L 0.7 -S 90

The following reference describes parameters used with blastclust.

-a [integer]

Default: 1

Specifies the number of CPUs to use on a multiprocessor machine.

-b [T/F]

Default: T

Requires coverage on both sequences. If set to T, the program requires both sequences to pass the coverage criteria set with -L before they are called neighbors and clustered together.
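
The interaction between -b and -L can be pictured as a small decision rule. This is only an illustrative sketch, not the program's actual code: it assumes coverage is expressed as a fraction of sequence length, that a sequence passes when its coverage is at least the threshold, and that with -b set to F it is enough for one of the two sequences to pass. It also ignores any other criteria the program applies. The function name are_neighbors is hypothetical.

```python
def are_neighbors(coverage_a, coverage_b, threshold, require_both=True):
    """Decide whether two sequences count as neighbors on coverage alone.

    coverage_a, coverage_b: fraction of each sequence covered by the alignment.
    threshold: the value given with -L.
    require_both: mirrors -b (True for T, False for F).
    """
    passes_a = coverage_a >= threshold  # assumption: ">=" rather than ">"
    passes_b = coverage_b >= threshold
    if require_both:
        return passes_a and passes_b
    # Assumption: with -b F, one passing sequence is sufficient.
    return passes_a or passes_b


# A long sequence that is only 60% covered by its alignment to a short one.
print(are_neighbors(0.95, 0.60, 0.7))                      # False
print(are_neighbors(0.95, 0.60, 0.7, require_both=False))  # True
```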

-c [file]

Default: Optional

Specifies a configuration file with advanced options. The configuration file is simply a list of the options that you commonly use.

-C [T/F]

Default: F

The crash recovery option. Set it to T to complete an unfinished clustering run, together with the -r option naming the file from which to restore the clustering. Use the same command line as the crashed run, including the same -s file, and add only -C T and -r. The run restarts from the hit list file specified by -r and then continues appending to it (as specified by -s).
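
The recovery recipe amounts to a mechanical change to the original command line, which the following sketch illustrates. It assumes the crashed run was started with -s; the helper name recovery_command and the file name my_pepdb.hits are hypothetical.

```python
import shlex


def recovery_command(original_args):
    """Return the argument list for resuming a crashed run.

    The original arguments are kept unchanged, and -C T plus -r are
    appended, with -r pointing at the hit list file named by -s.
    """
    try:
        hit_list = original_args[original_args.index("-s") + 1]
    except (ValueError, IndexError):
        raise ValueError("the crashed run must have saved its hit list with -s") from None
    return list(original_args) + ["-C", "T", "-r", hit_list]


crashed = ["blastclust", "-i", "my_pepdb", "-o", "my_pepdb.clusters",
           "-s", "my_pepdb.hits"]
resumed = recovery_command(crashed)
print(shlex.join(resumed))
# blastclust -i my_pepdb -o my_pepdb.clusters -s my_pepdb.hits -C T -r my_pepdb.hits
```

Building the command from the original argument list, rather than retyping it, helps guarantee that nothing else differs between the two runs.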

-d [file]

Default: Optional

Specifies that the input is a BLAST database rather than a FASTA file.

-e [T/F]

Default: F

Enables ID parsing in the database-formatted report.

-i [file]

Default: stdin

Specifies the FASTA input file for clustering.

-l [file]

Default: Optional

Restricts the reclustering to the IDs specified in [file]. It can be useful when you have a very large FASTA database and wish to cluster a subset of sequences.

-L [real number]

Specifies the length coverage threshold, that is, the fraction of a sequence's length that must be covered (for example, 0.7).

-p [T/F]

Default: T

Input sequences are proteins. Set to F for nucleotides.

-r [file]

Default: Optional

Specifies the file used to restore neighbors for reclustering. Set -C to T when using this option. The file is created by the -s option of a previous run. Use it if the program crashes during a run.

-s [file]

Default: Optional

Specifies the file in which to save the hit list. This file can restore a crashed run and is the input file specified by -r.

-v [file]

Default: stdout

Prints progress messages. Progress is reported to standard output if no file is specified.

-W [integer]

Default: Protein 3, Nucleotide 32

Specifies the word size. The parameter has the same meaning as in blastall.
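
When blastclust is driven from a script, the options described above can be assembled programmatically. The sketch below is one possible approach, using only options that appear in this section; the function name blastclust_args and its parameters are hypothetical. It passes -W only when the caller overrides the default, so the program's own defaults (3 for protein, 32 for nucleotide) otherwise apply.

```python
def blastclust_args(fasta, out, molecule="protein", word_size=None, cpus=1):
    """Assemble a blastclust argument list from a few common settings."""
    if molecule not in ("protein", "nucleotide"):
        raise ValueError("molecule must be 'protein' or 'nucleotide'")
    args = [
        "blastclust",
        "-i", fasta,                                # FASTA input file
        "-o", out,                                  # cluster output file
        "-p", "T" if molecule == "protein" else "F",
        "-a", str(cpus),                            # number of CPUs
    ]
    if word_size is not None:
        args += ["-W", str(word_size)]              # override the default word size
    return args


print(blastclust_args("my_nucdb", "my_nucdb.clusters", molecule="nucleotide"))
```

The resulting list can be handed to subprocess.run on a machine where blastclust is installed.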