How to Use bedtools getfasta to Extract DNA Sequences (With Example)
bedtools getfasta is a command-line utility for extracting DNA sequences from a reference FASTA file based on genomic coordinates given in a BED, GFF, or VCF file.
The general syntax of bedtools getfasta looks like this:
# default
bedtools getfasta -fi reference.fasta -bed regions.bed -fo output.fasta
# Extract sequences using name column
bedtools getfasta -fi reference.fasta -bed regions.bed -fo output.fasta -name
# Extract strand-aware sequences (reverse-complement regions on the - strand)
bedtools getfasta -s -fi reference.fasta -bed regions.bed -fo output.fasta
Where,
Parameter | Description
---|---
-fi | Input FASTA file from which the sequences are extracted
-bed | BED, GFF, or VCF file containing the regions to extract
-fo | Output file in which the extracted sequences are saved in FASTA format
-name | Use the name (fourth column of the BED file) in the headers of the output FASTA file
-s | Use the strand (sixth column of the BED file): regions on the minus strand are reverse-complemented, and the strand is added to the headers of the output FASTA file
In addition to the above parameters, bedtools getfasta has several other parameters for extracting sequences from the reference FASTA file.
The following examples demonstrate how to use bedtools getfasta to extract DNA sequences and other information from a FASTA file.
Extract the sequence from the BED file (default behavior)
The following example shows how to use bedtools getfasta to extract DNA sequences for the genomic coordinates provided in a BED file.
The first three columns in BED format are chr, start, and end (BED3 file).
# input sequence
head reference.fasta
>chr1
ATGGCCTTAAATTTTAAA
# input BED file (three columns)
head regions.bed
chr1 4 7
# extract the sequence based on BED genomic coordinates
bedtools getfasta -fi reference.fasta -bed regions.bed -fo output.fasta
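The above command extracts the sequence from chr1 for the interval 4 to 7. BED start coordinates are 0-based and the end coordinate is not included, so this interval covers the 5th through 7th bases of chr1 (CCT).
The output (extracted sequence) will be saved in output.fasta. By default, the sequence name in the output FASTA file will be written as chr:start-end.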
head output.fasta
>chr1:4-7
CCT
Troubleshooting Tip: The sequence name in the BED file’s first column should exactly match the sequence name in the reference FASTA file. The BED file should be TAB separated. FASTA and BED files should have Unix line endings (use the dos2unix command to convert them if needed).
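The checks in the tip above can be sketched in Python before running bedtools. This is only an illustrative helper, not part of bedtools: the function name check_bed_against_fasta is hypothetical, and it assumes that a FASTA sequence name is the first whitespace-delimited word after the > character.

```python
def check_bed_against_fasta(fasta_text, bed_text):
    """Return a list of human-readable problems found in the inputs."""
    problems = []

    # Carriage returns indicate Windows (CRLF) line endings.
    if "\r" in fasta_text or "\r" in bed_text:
        problems.append("Windows line endings found; convert with dos2unix")

    # Assumption: the sequence name is the first word after '>'.
    fasta_names = {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].split()
    }

    for number, line in enumerate(bed_text.splitlines(), start=1):
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments, and header lines
        fields = line.split("\t")
        if len(fields) < 3:
            problems.append(f"line {number}: fewer than 3 TAB-separated columns")
            continue
        if fields[0] not in fasta_names:
            problems.append(
                f"line {number}: '{fields[0]}' is not a sequence name in the FASTA file"
            )
    return problems


fasta_text = ">chr1\nATGGCCTTAAATTTTAAA\n"
# 'Chr1' has the wrong case, and the second line uses spaces instead of TABs.
for problem in check_bed_against_fasta(fasta_text, "Chr1\t4\t7\nchr1 4 7\n"):
    print(problem)
```

An empty list means none of these three problems were detected; it does not guarantee that the files are otherwise valid.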
Similarly, you can also use seqtk subseq or Python for extracting the sequences from specific regions of the FASTA file.
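As a rough illustration of the Python route, the sketch below reproduces the default example using only the standard library. The helper names read_fasta and extract_bed3 are hypothetical, the inputs are held in strings rather than files to keep the example self-contained, and the code assumes BED coordinates are 0-based with an exclusive end, which maps directly onto Python slicing.

```python
def read_fasta(text):
    """Parse FASTA text into {name: sequence}; the name is the first word after '>'."""
    chunks = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def extract_bed3(fasta_text, bed_text):
    """Yield (header, sequence) for each BED interval (0-based start, exclusive end)."""
    genome = read_fasta(fasta_text)
    for line in bed_text.splitlines():
        if not line.strip():
            continue
        chrom, start, end = line.split("\t")[:3]
        start, end = int(start), int(end)
        # A Python slice is also 0-based and end-exclusive, like BED.
        yield f"{chrom}:{start}-{end}", genome[chrom][start:end]


fasta_text = ">chr1\nATGGCCTTAAATTTTAAA\n"
bed_text = "chr1\t4\t7\n"
for header, sequence in extract_bed3(fasta_text, bed_text):
    print(f">{header}")
    print(sequence)
```

For large genomes, loading every sequence into memory like this may not be practical; an indexed reader is usually a better fit there.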
Extract the sequence from the BED file (assign the value in the name column to the sequence header)
If you use the -name parameter with bedtools getfasta, it will assign a sequence header based on the value in the name column of the BED file.
To use the -name parameter of bedtools getfasta, you should have a BED file with at least four columns.
The first four columns in BED format are chr, start, end, and name.
# input sequence
head reference.fasta
>chr1
ATGGCCTTAAATTTTAAA
# input BED file (four columns)
head regions.bed
chr1 4 7 geneA
# extract the sequence based on BED genomic coordinates
bedtools getfasta -fi reference.fasta -bed regions.bed -fo output.fasta -name
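The output (extracted sequence) will be saved in output.fasta. The sequence name in the output FASTA file will be written as geneA (as defined by the -name parameter).
Note that in recent bedtools releases, -name also appends the coordinates to the header (for example, geneA::chr1:4-7), and the -nameOnly parameter writes only the name. Run bedtools getfasta -h to check the behavior of your installed version.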
head output.fasta
>geneA
CCT
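The way the name column ends up in the header can be sketched in Python as follows. The function fasta_header and its style argument are hypothetical names used only for illustration; the name_with_coords style mirrors the name::chr:start-end form that recent bedtools releases are understood to write for -name, which is worth confirming against your installed version.

```python
def fasta_header(bed_fields, style="name_only"):
    """Build a FASTA header line from the first four BED columns."""
    if len(bed_fields) < 4:
        raise ValueError("a name-based header needs at least four BED columns")
    chrom, start, end, name = bed_fields[:4]
    if style == "name_only":
        return f">{name}"
    if style == "name_with_coords":
        return f">{name}::{chrom}:{start}-{end}"
    raise ValueError(f"unknown style: {style}")


fields = "chr1\t4\t7\tgeneA".split("\t")
print(fasta_header(fields))                      # >geneA
print(fasta_header(fields, "name_with_coords"))  # >geneA::chr1:4-7
```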
Extract the sequence from the BED file (with sequence and strand information)
You can use the -s parameter with bedtools getfasta to make the extraction strand-aware: regions on the minus strand are reverse-complemented, and the strand is added to the sequence header in the output FASTA file.
To use the strand information, you need a six-column BED file (BED6). The sixth column in the BED file holds the strand (+ or -).
# input sequence
head reference.fasta
>chr1
ATGGCCTTAAATTTTAAA
# input BED file (six columns)
head regions.bed
chr1 4 7 geneA 0 +
chr1 9 11 geneB 0 -
# extract the sequence based on BED genomic coordinates
bedtools getfasta -s -fi reference.fasta -bed regions.bed -fo output.fasta -name
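The output (extracted sequence) will be saved in output.fasta. The sequence name and strand in the output FASTA file will be written as geneA(+). The geneB region lies on the minus strand, so its sequence (AA on the forward strand) is written as its reverse complement (TT).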
head output.fasta
>geneA(+)
CCT
>geneB(-)
TT
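The strand handling in this example can be reproduced with a few lines of Python. This is a minimal sketch, not the bedtools implementation: the names reverse_complement and extract_stranded are hypothetical, and only the bases A, C, G, T, and N (in either case) are complemented.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Complement every base, then reverse the result."""
    return sequence.translate(COMPLEMENT)[::-1]


def extract_stranded(reference, start, end, strand):
    """Slice a 0-based, end-exclusive interval; reverse-complement it on the - strand."""
    piece = reference[start:end]
    return reverse_complement(piece) if strand == "-" else piece


reference = "ATGGCCTTAAATTTTAAA"
print(extract_stranded(reference, 4, 7, "+"))   # CCT
print(extract_stranded(reference, 9, 11, "-"))  # TT
```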
This work is licensed under a Creative Commons Attribution 4.0 International License