blastp: Command-line Utility for Protein Sequence Search
blastp is a command-line utility from the NCBI BLAST toolkit that performs protein-protein sequence similarity searches using the BLAST algorithm.
blastp compares a query protein sequence against a protein BLAST database to identify homologous protein sequences. To compare a nucleotide sequence against a nucleotide BLAST database, see the blastn tool.
The general syntax of blastp looks like this:
# basic command
blastp -query query_fasta -db blast_protein_db -outfmt output_format -out output_file
# command with advanced regularly used options
blastp -query query_fasta -db blast_protein_db -evalue 1e-05 -max_target_seqs 5 \
-num_threads 10 -outfmt output_format -out output_file
Where,
Parameter | Description |
---|---|
-query | Input protein sequences in FASTA format to search against a protein BLAST database |
-db | Formatted protein BLAST database. See makeblastdb for creating a formatted BLAST database. |
-evalue | Expectation value (E-value) threshold for the search (default 10). Matches with lower E-values are more significant |
-max_target_seqs | Maximum number of aligned sequences to be reported in the output (default 500). A value of >=5 is recommended |
-num_threads | Number of threads (CPU cores) for the search (default 1). More threads generally give a faster search, up to the number of available cores. |
-outfmt | Numerical value representing a predefined output format, or a custom string specifying the fields to include in the BLAST output (default 0, pairwise) |
-out | Name of the output file where results will be saved |
In addition to the frequently used parameters above, you can list all parameters and their usage with the blastp -help command.
Note: blastp requires a formatted BLAST database. You can create one using the makeblastdb command, or you can download a preformatted BLAST database from NCBI.
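As a minimal sketch of the database step mentioned in the note, the snippet below builds a protein database with makeblastdb. The FASTA file name target_proteins.fasta is a hypothetical placeholder, and the guard around the command is only there so the snippet does nothing harmful when BLAST or the input file is missing.

```shell
# Hypothetical input: a FASTA file of the protein sequences to search against.
db_fasta="target_proteins.fasta"
# Database name that is later passed to blastp via -db.
db_name="target_protein_db"

if command -v makeblastdb >/dev/null 2>&1 && [ -f "$db_fasta" ]; then
    # -dbtype prot marks the database as protein; -out sets the database name.
    makeblastdb -in "$db_fasta" -dbtype prot -out "$db_name"
    db_status="built"
else
    db_status="skipped"
    echo "makeblastdb or $db_fasta not found; skipping database build" >&2
fi
```

If the build succeeds, the resulting target_protein_db can be used as the -db value in the examples on this page.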
The following examples explain how to use blastp on the command line for protein-protein sequence similarity searches.
Let’s say you have an input query protein sequence (input.fasta) and a formatted protein database (target_protein_db).
Run basic blastp command
blastp -query input.fasta -db target_protein_db -outfmt 6 -out blastp_output.txt
The above blastp command compares the protein sequences in input.fasta against the formatted target_protein_db and saves the results in tabular format (-outfmt 6) in the blastp_output.txt file.
The output should look like this:
head -n5 blastp_output.txt
seq1 seq1 100.000 70 0 0 1 70 1 70 5.11e-48 133
seq1 seq2 100.000 25 0 0 13 37 32 56 1.18e-15 51.6
seq1 seq2 100.000 13 0 0 41 53 1 13 0.029 17.3
seq1 seq3 76.744 43 0 1 21 53 1 43 3.31e-12 43.1
seq1 seq3 100.000 13 0 0 7 19 57 69 3.64e-08 32.7
The columns in the output file (with -outfmt 6) represent query id, target id, % identical matches, alignment length, mismatches, gap openings, query start, query end, target start, target end, E-value, and bit score.
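Because the tabular format has no header row, downstream filtering has to rely on column positions. The sketch below assumes the default 12-column layout described above, where the E-value is column 11, and reuses the sample rows from the output shown earlier. The function name filter_hits and the 1e-05 cutoff are illustrative choices, not part of BLAST.

```shell
# Keep only hits whose E-value (column 11) is at or below the given cutoff.
# Assumes the default 12-column -outfmt 6 layout; filter_hits is a hypothetical helper.
filter_hits() {
    awk -v cutoff="$1" '($11 + 0) <= (cutoff + 0)'
}

# Sample rows copied from the output above (real output is tab-separated;
# awk splits on tabs and spaces alike).
sample_hits='seq1 seq1 100.000 70 0 0 1 70 1 70 5.11e-48 133
seq1 seq2 100.000 25 0 0 13 37 32 56 1.18e-15 51.6
seq1 seq2 100.000 13 0 0 41 53 1 13 0.029 17.3
seq1 seq3 76.744 43 0 1 21 53 1 43 3.31e-12 43.1
seq1 seq3 100.000 13 0 0 7 19 57 69 3.64e-08 32.7'

filtered=$(printf '%s\n' "$sample_hits" | filter_hits 1e-05)
printf '%s\n' "$filtered"
```

On a real result file the same helper could be applied as filter_hits 1e-05 < blastp_output.txt. Passing -evalue to blastp achieves the same cutoff at search time, so a post-hoc filter like this is mainly useful when the search has already been run with a looser threshold.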
Run blastp command with customized options
blastp -query input.fasta -db target_protein_db -evalue 1e-05 -max_target_seqs 5 -num_threads 10 \
-outfmt "6 qseqid qlen sseqid slen qstart qend sstart send nident pident length mismatch gaps qcovs evalue bitscore" \
-out blastp_output.txt
The above blastp command compares the protein sequences in input.fasta against the target_protein_db using the given parameter cut-offs and saves the results in tabular format with customized fields in the blastp_output.txt file.
The output should look like this:
head -n5 blastp_output.txt
seq1 70 seq1 70 1 70 1 70 70 100.000 70 0 0 100 5.11e-48 133
seq1 70 seq2 56 13 37 32 56 25 100.000 25 0 0 36 1.18e-15 51.6
seq1 70 seq3 69 21 53 1 43 33 76.744 43 0 10 66 3.31e-12 43.1
seq1 70 seq3 69 7 19 57 69 13 100.000 13 0 0 66 3.64e-08 32.7
seq2 56 seq2 56 1 56 1 56 56 100.000 56 0 0 100 1.85e-37 105
The columns in the output file correspond, in order, to the custom fields listed in the -outfmt parameter.
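Since the custom output also lacks a header, one option is to add the field names yourself and drop self-hits (rows where a query matches itself) at the same time. The sketch below assumes the 16-field order used in the -outfmt string above, with qseqid in column 1 and sseqid in column 3, and reuses the sample rows shown earlier. The helper name label_hits is hypothetical.

```shell
# Field names in the same order as the custom -outfmt string used above.
fields="qseqid qlen sseqid slen qstart qend sstart send nident pident length mismatch gaps qcovs evalue bitscore"

# Print a tab-separated header, then every row that is not a self-hit
# (query ID in column 1 differs from subject ID in column 3).
# label_hits is a hypothetical helper name.
label_hits() {
    printf '%s\n' "$fields" | tr ' ' '\t'
    awk 'BEGIN { OFS = "\t" } $1 != $3 { $1 = $1; print }'
}

# Sample rows copied from the output above.
sample_hits='seq1 70 seq1 70 1 70 1 70 70 100.000 70 0 0 100 5.11e-48 133
seq1 70 seq2 56 13 37 32 56 25 100.000 25 0 0 36 1.18e-15 51.6
seq1 70 seq3 69 21 53 1 43 33 76.744 43 0 10 66 3.31e-12 43.1
seq1 70 seq3 69 7 19 57 69 13 100.000 13 0 0 66 3.64e-08 32.7
seq2 56 seq2 56 1 56 1 56 56 100.000 56 0 0 100 1.85e-37 105'

labeled=$(printf '%s\n' "$sample_hits" | label_hits)
printf '%s\n' "$labeled"
```

Keeping the field list in one variable means the header stays in sync if you reuse the same string to build the -outfmt argument.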
This work is licensed under a Creative Commons Attribution 4.0 International License