How to Use E-utilities for Downloading Sequences from NCBI
- The Entrez Programming Utilities (E-utilities) are a set of server-side programs that let you download biomedical data, including nucleotide and protein sequences and molecular structures, from the National Center for Biotechnology Information (NCBI) programmatically.
- E-utilities access the Entrez database (molecular biology database system) for downloading biomedical data.
- E-utilities are helpful when you need to download a large number of nucleotide or protein sequences from NCBI, for example all plant protein sequences. The web (GUI) download may not always work as expected for very large result sets.
- Entrez Direct (EDirect), which accesses the Entrez database through E-utilities, provides an option to download the nucleotide or protein sequences from a Linux/Unix command line.
- In addition to E-utilities, the ncbi-genome-download Python package can be used specifically to download genome sequences from the NCBI database.
Download nucleotide or protein sequences based on the GI list
- If you have a list of nucleotide or protein GenInfo Identifiers (GIs), you can download the sequences in FASTA format using the following program (see original code here)
- To run the following Perl scripts, you need Perl and the LWP::Simple Perl module installed. Because the requests go over HTTPS, the LWP::Protocol::https module is also required.
use LWP::Simple;
# Download protein records corresponding to a list of GI numbers.
# nucleotide or protein database
$db = 'protein';
$id_list = '2026800804,2026800803,2026800802,2026800801,2026800800';
# assemble the epost URL
$base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'; # basic URL to make all E-utility requests
$url = $base . "epost.fcgi?db=$db&id=$id_list";
# post the epost URL
$output = get($url);
# parse WebEnv and QueryKey
$web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
$key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);
# get the sequences in FASTA (rettype)
# Retrieval mode (retmode) in plain text format
$url = $base . "efetch.fcgi?db=$db&query_key=$key&WebEnv=$web";
$url .= "&rettype=fasta&retmode=text";
$data = get($url);
print "$data";
# save this code in a file and run using perl command
- Download the above code and run as
perl gi_download.pl
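The same EPost then EFetch flow can also be sketched in Python. This is a minimal illustration rather than a port of the script above: the function names are hypothetical, and the sample XML (including its WebEnv value) is made up so the parsing step can be tried without contacting NCBI.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URL for all E-utility requests (same as in the Perl script)
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

def epost_url(db, ids):
    """Build the EPost URL that uploads a list of UIDs to the History server."""
    return BASE + "epost.fcgi?" + urlencode({"db": db, "id": ",".join(ids)}, safe=",")

def parse_history(xml_text):
    """Extract (WebEnv, QueryKey) from an EPost or ESearch XML response."""
    root = ET.fromstring(xml_text)
    return root.findtext(".//WebEnv"), root.findtext(".//QueryKey")

def efetch_url(db, web_env, query_key, rettype="fasta", retmode="text"):
    """Build the EFetch URL that retrieves the records stored under WebEnv/QueryKey."""
    params = {"db": db, "query_key": query_key, "WebEnv": web_env,
              "rettype": rettype, "retmode": retmode}
    return BASE + "efetch.fcgi?" + urlencode(params)

# Illustrative response only; the WebEnv value is invented for this example
sample = "<ePostResult><QueryKey>1</QueryKey><WebEnv>MCID_example123</WebEnv></ePostResult>"
web_env, query_key = parse_history(sample)

print(epost_url("protein", ["2026800804", "2026800803"]))
print(efetch_url("protein", web_env, query_key))
```

To perform the real requests, each URL could be passed to `urllib.request.urlopen(url).read().decode()`; the response of the EPost call is what `parse_history` expects as input.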
Download a large number of nucleotide or protein sequences
- E-utilities are helpful for downloading all protein or nucleotide sequences for a particular organism or a whole taxonomic branch.
- See here for how to build the query (the $query variable) used to retrieve the sequences.
- NCBI recommends running large E-utility jobs on weekends or outside office hours (between 9:00 PM and 5:00 AM Eastern time on weekdays).
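Because the query term ends up inside a URL, spaces in it have to be replaced with plus signs. A small helper can do this consistently; the sketch below is an assumption-laden illustration (`entrez_term` is a hypothetical name, not part of any NCBI library) and keeps square brackets and colons unescaped so the result looks like the hand-written queries used in this article.

```python
from urllib.parse import quote_plus

def entrez_term(*clauses, operator="AND"):
    """Join Entrez clauses with a Boolean operator and make the result URL-safe."""
    term = f" {operator} ".join(clauses)
    # quote_plus turns spaces into '+'; brackets and colons are left readable
    return quote_plus(term, safe="[]:")

# All Arabidopsis thaliana records (taxonomy ID 3702)
print(entrez_term("txid3702[Organism:noexp]"))
# All plant records
print(entrez_term("all[filter]", "plants[filter]"))
```

The second call produces `all[filter]+AND+plants[filter]`, which can be used directly as the `term` parameter of an ESearch URL.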
use LWP::Simple;
# nucleotide or protein database
$db = 'nucleotide';
# download all nucleotide sequences from Arabidopsis thaliana plant
# avoid spaces in queries. if there are spaces, replace them with a plus sign (+)
$query = 'txid3702[Organism:noexp]';
# use following query to download all plant sequences
# $query = 'all[filter]+AND+plants[filter]';
# assemble the esearch URL
$base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/'; # basic URL to make all E-utility requests
$url = $base . "esearch.fcgi?db=$db&term=$query&usehistory=y";
# post the esearch URL
$output = get($url);
# parse WebEnv, QueryKey and Count (# records retrieved)
$web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
$key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);
$count = $1 if ($output =~ /<Count>(\d+)<\/Count>/);
# open output file for writing
# all sequences will be saved in this file
open(OUT, ">Athaliana.fasta") || die "Can't open file!\n";
# retrieve data in batches of 500 Entrez Unique Identifiers (UIDs)
# EFetch accepts a retmax of at most 10,000 records per request
$retmax = 500;
for ($retstart = 0; $retstart < $count; $retstart += $retmax) {
$efetch_url = $base ."efetch.fcgi?db=$db&WebEnv=$web";
$efetch_url .= "&query_key=$key&retstart=$retstart";
$efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
$efetch_out = get($efetch_url);
print OUT "$efetch_out";
}
close OUT;
# save this code in a file and run using perl command
- Download the above code and run as
perl bulk_download.pl
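The batching loop in the script above can be expressed as a small generator, which also makes it easy to pause between requests. The sketch below is hypothetical: `fetch_batch` stands in for whatever function performs the actual EFetch call, and the default delay assumes NCBI's limit of three requests per second for clients without an API key.

```python
import time

def batch_windows(count, retmax=500):
    """Yield (retstart, size) pairs that together cover `count` records."""
    for retstart in range(0, count, retmax):
        yield retstart, min(retmax, count - retstart)

def fetch_all(count, fetch_batch, retmax=500, delay=0.34):
    """Call fetch_batch(retstart, retmax) once per window, pausing between calls."""
    chunks = []
    for retstart, _size in batch_windows(count, retmax):
        chunks.append(fetch_batch(retstart, retmax))
        time.sleep(delay)  # stay under the request rate limit
    return "".join(chunks)

# Stand-in for a real EFetch request, used only to show the call pattern
def fake_fetch(retstart, retmax):
    return f"batch@{retstart}\n"

print(list(batch_windows(1200, 500)))
print(fetch_all(1200, fake_fetch, delay=0), end="")
```

With 1,200 records and a batch size of 500, the windows start at 0, 500, and 1000, and the last one holds only the remaining 200 records.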
Download nucleotide or protein sequences from Linux/Unix command line
- The EDirect tool (E-utilities for the command line) supports programmatic download of nucleotide or protein sequences from a terminal. It works on Linux and macOS; on Windows, it can be run under Cygwin (a Unix-like environment).
- EDirect provides navigation functions (esearch, elink, efilter, and efetch) for retrieving records from NCBI's sequence databases. esearch performs an Entrez search for a given query and database, elink looks up records in other databases that are associated with the query results, efilter filters the results, and efetch downloads the records in a specified format. You can combine these commands using Unix pipes (|).
Check how to install EDirect
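When EDirect pipelines need to be generated from a script (for example, one per accession), quoting the query safely matters. The following is a rough sketch with a hypothetical helper, `edirect_pipeline`, that only assembles the command string; it assumes Python 3.8 or later for `shlex.join` and does not run anything itself.

```python
import shlex

def edirect_pipeline(db, query, fmt="fasta", link_target=None):
    """Return an `esearch | [elink |] efetch` shell pipeline as a string."""
    stages = [["esearch", "-db", db, "-query", query]]
    if link_target:
        # optional hop to a linked database, e.g. nuccore -> protein
        stages.append(["elink", "-target", link_target])
    stages.append(["efetch", "-format", fmt])
    # shlex.join quotes any argument that contains shell-special characters
    return " | ".join(shlex.join(stage) for stage in stages)

print(edirect_pipeline("nuccore", "AF105064.1"))
print(edirect_pipeline("nuccore", "AF105064.1", link_target="protein"))
print(edirect_pipeline("nuccore", "txid3702[Organism:noexp]", fmt="gb"))
```

With EDirect installed, a string built this way could be executed with `subprocess.run(cmd, shell=True, capture_output=True, text=True, check=True)`.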
Download nucleotide sequences using gi and GenBank accession in FASTA format,
# using gi
esearch -db nuccore -query 6002679 | efetch -format fasta
# using GenBank accession
esearch -db nuccore -query AF105064.1 | efetch -format fasta
# both commands will return same output
>AF105064.1 Arabidopsis thaliana GIGANTEA (GI) mRNA, complete cds
CAGGGTTTAGCTGTTTGATTCAGCTTCGATTTAGTGTACAGTGTGTTGATTAGTATAAAAAGGATTTAAA
.
.
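Once FASTA output has been saved or captured, it often needs to be split into individual records. The parser below is a minimal sketch, assuming the standard FASTA layout (a header line starting with `>` followed by one or more sequence lines); `parse_fasta` is a hypothetical helper, and the sample uses truncated fragments of the sequences shown in this article.

```python
def parse_fasta(text):
    """Return a dict mapping each record ID to its full sequence."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # the record ID is the first word of the header line
            current = (line[1:].split() or [""])[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {rid: "".join(parts) for rid, parts in records.items()}

# Truncated fragments, for illustration only
sample = """>AF105064.1 Arabidopsis thaliana GIGANTEA (GI) mRNA, complete cds
CAGGGTTTAGCTGTTTGATT
CAGCTTCGATTTAGTGTACA
>AAF00092.1 GIGANTEA [Arabidopsis thaliana]
MASSSSSERWIDGLQFSSLL
"""

for record_id, sequence in parse_fasta(sample).items():
    print(record_id, len(sequence))
```

For production work, an established parser such as the one in Biopython is likely a better choice than this hand-rolled version.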
Download protein sequences using GenBank accession,
esearch -db protein -query AAF00092.1 | efetch -format fasta
# output
>AAF00092.1 GIGANTEA [Arabidopsis thaliana]
MASSSSSERWIDGLQFSSLLWPPPRDPQQHKDQVVAYVEYFGQFTSEQFPDDIAELVRHQYPSTEKRLLD
.
.
Download nucleotide sequences in GenBank (gb) format,
esearch -db nuccore -query AF105064.1 | efetch -format gb
# output
LOCUS AF105064 4001 bp mRNA linear PLN 01-OCT-1999
DEFINITION Arabidopsis thaliana GIGANTEA (GI) mRNA, complete cds.
ACCESSION AF105064
.
.
Download protein sequences associated with nucleotide accessions,
# using GenBank accession
esearch -db nuccore -query AF105064.1 | elink -target protein | efetch -format fasta
>AAF00092.1 GIGANTEA [Arabidopsis thaliana]
MASSSSSERWIDGLQFSSLLWPPPRDPQQHKDQVVAYVEYFGQFTSEQFPDDIAELVRHQYPSTEKRLLD
.
.
Get SRA accessions associated with BioSample accessions,
esearch -db sra -query SAMN07304757 | efetch -format runinfo | cut -f1 -d','
# output
Run
SRR5790106
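If the runinfo output is processed in a script rather than with `cut`, reading it as CSV by column name is less fragile than relying on column position. The sketch below assumes only that the table has a `Run` column, as in the output above; `run_accessions` is a hypothetical helper, and the sample is trimmed to two columns for illustration.

```python
import csv
import io

def run_accessions(runinfo_csv):
    """Return the values of the 'Run' column from runinfo CSV text."""
    reader = csv.DictReader(io.StringIO(runinfo_csv))
    # skip empty values and any repeated header rows in concatenated output
    return [row["Run"] for row in reader
            if row.get("Run") and row["Run"] != "Run"]

# Trimmed sample; real runinfo output contains many more columns
sample = "Run,BioSample\nSRR5790106,SAMN07304757\n"
print(run_accessions(sample))
```

The text passed to `run_accessions` would be the standard output of the `efetch -format runinfo` command.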
Download genome sequences using ncbi-genome-download
The ncbi-genome-download Python package provides various options to download genome sequences from the NCBI RefSeq database
Install ncbi-genome-download
pip install ncbi-genome-download
Download the Arabidopsis thaliana genome sequence using plant names,
ncbi-genome-download --genera "Arabidopsis thaliana" plant
# multiple plant species
ncbi-genome-download --genera "Arabidopsis thaliana,Sorghum bicolor" plant
Download the Arabidopsis thaliana genome sequence using NCBI taxonomy ID (3702)
ncbi-genome-download -t 3702 plant
Download multiple genome sequences [Arabidopsis thaliana (3702) and Sorghum bicolor (4558)]
ncbi-genome-download -t 3702,4558 plant
By default, genome sequences will be saved in GenBank format. To save in FASTA format,
ncbi-genome-download -t 3702,4558 -F fasta plant
Download all plant genome sequences,
ncbi-genome-download -F fasta plant
Download all plant genome sequences with completed genome assemblies,
ncbi-genome-download -F fasta --assembly-levels complete plant
Download all plant genome sequences with completed and chromosome level genome assemblies,
ncbi-genome-download -F fasta --assembly-levels complete,chromosome plant
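When these downloads are driven from a Python script, the command line can be assembled as an argument list instead of a hand-typed string. The helper below is a hypothetical sketch (`ngd_command` is not part of the package) and uses only the options shown in the examples above.

```python
def ngd_command(group, taxids=None, file_format=None, assembly_levels=None):
    """Build an ncbi-genome-download argument list from the options used above."""
    cmd = ["ncbi-genome-download"]
    if taxids:
        # species taxonomy IDs, comma-separated (the -t option)
        cmd += ["-t", ",".join(str(t) for t in taxids)]
    if file_format:
        cmd += ["-F", file_format]
    if assembly_levels:
        cmd += ["--assembly-levels", ",".join(assembly_levels)]
    cmd.append(group)  # taxonomic group, e.g. "plant"
    return cmd

print(" ".join(ngd_command("plant", taxids=[3702, 4558], file_format="fasta")))
print(" ".join(ngd_command("plant", file_format="fasta",
                           assembly_levels=["complete", "chromosome"])))
```

With the package installed, the list could be passed to `subprocess.run(cmd, check=True)` to start the download.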
This work is licensed under a Creative Commons Attribution 4.0 International License