How to Extract Sequences from FASTA in Python
In the Python bioinfokit package (v2.1.3), the extract_seq() function extracts complete sequences or subsequences from a FASTA file based on sequence IDs and, optionally, region coordinates.
The general syntax of extract_seq() looks like this:
# load package
from bioinfokit.analys import Fasta
# extract sequences based on sequence ID and region coordinates
Fasta.extract_seq(file="input.fasta", id="ids.txt")
Here, input.fasta is the name of your input FASTA file and ids.txt contains the list of sequence IDs (one ID per line) to extract from the FASTA file.
The ids.txt file can also contain a sequence ID followed by specific region coordinates, similar to a three-column BED file.
The following examples illustrate how to extract sequences from FASTA files using the extract_seq() function from bioinfokit.
Extract sequences from FASTA
Suppose you have the following FASTA file:
cat input.fasta
>KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCA
AGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>GU056837.1
CTAATTTTATTTTTTTATAATAATTATTGGAGGAACTAAAACATTAATGAAATAATAATTATCATAATTA
TTAATTACATATTTATTAGGTATAATATTTAAGGAAAAATATATTTTATGTTAATTGTAATAATTAGAAC
>CP097510.1
CGATTTAGATCGGTGTAGTCAACACACATCCTCCACTTCCATTAGGCTTCTTGACGAGGACTACATTGAC
AGCCACCGAGGGAACCGACCTCCTCAATGAAGTCAGACGCCAAGAGCCTATCAACTTCCTTCTGCACAGC
>JAMFTS010000002.1
CCTAAACCCTAAACCCTAAACCCCCTACAAACCTTACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA
ACCCGAAACCCTATACCCTAAACCCTAAACCCTAAACCCTAAACCCTAACCCAAACCTAATCCCTAAACC
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTC
AAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG
To extract sequences by ID, create a file called ids.txt that lists the sequence IDs, with each ID on a separate line:
cat ids.txt
GU056837.1
MH150936.1
KU562861.1
Now extract the sequences from input.fasta based on sequence IDs using extract_seq():
# load package
from bioinfokit.analys import Fasta
# extract sequences
Fasta.extract_seq(file="input.fasta", id="ids.txt")
# output (saved in output.fasta)
>KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCAAGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>GU056837.1
CTAATTTTATTTTTTTATAATAATTATTGGAGGAACTAAAACATTAATGAAATAATAATTATCATAATTATTAATTACATATTTATTAGGTATAATATTTAAGGAAAAATATATTTTATGTTAATTGTAATAATTAGAAC
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTCAAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG
The extracted sequence FASTA file (output.fasta) will be saved in the same directory.
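To make clear what ID-based extraction does, here is a minimal pure-Python sketch that does not use bioinfokit. It is not how extract_seq() is implemented; read_fasta, extract_by_ids, and the output file name are hypothetical names chosen for illustration. The sketch assumes the sequence ID is the first whitespace-delimited token of each header line and writes matching records in FASTA file order.

```python
def read_fasta(path):
    """Return a dict mapping each sequence ID to its full sequence."""
    records = {}
    current_id = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Assumption: the ID is the first token of the header line.
                header = line[1:].split()
                current_id = header[0] if header else ""
                records[current_id] = []
            elif current_id is not None:
                # Sequence lines are collected and joined later.
                records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def extract_by_ids(fasta_path, ids_path, out_path):
    """Write records whose ID is listed in ids_path; return the IDs written."""
    with open(ids_path) as handle:
        wanted = {line.strip() for line in handle if line.strip()}
    written = []
    with open(out_path, "w") as out:
        # Dicts preserve insertion order, so output follows FASTA order.
        for seq_id, seq in read_fasta(fasta_path).items():
            if seq_id in wanted:
                out.write(f">{seq_id}\n{seq}\n")
                written.append(seq_id)
    return written
```

Calling extract_by_ids("input.fasta", "ids.txt", "subset.fasta") would write the matching records to subset.fasta and return the list of IDs that were found, which makes it easy to spot requested IDs that are missing from the FASTA file.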
Extract subsequences from specific regions of FASTA
In addition, extract_seq() can extract subsequences from specific regions. For example, suppose ids.txt contains the sequence ID followed by the start and end coordinates of each region (separated by TAB):
cat ids.txt
GU056837.1 1 50
MH150936.1 10 40
KU562861.1 30 80
Now extract the specific region sequences from input.fasta based on sequence IDs and region coordinates using extract_seq():
# load package
from bioinfokit.analys import Fasta
# extract sequences
Fasta.extract_seq(file="input.fasta", id="ids.txt")
# output (saved in output.fasta)
>KU562861.1
GTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCAAGGGGCAGCA
>GU056837.1
CTAATTTTATTTTTTTATAATAATTATTGGAGGAACTAAAACATTAATGA
>MH150936.1
ATGAAAACTTTTCCTTTACTAAAAACCGTCA
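The extracted sequence FASTA file (output.fasta) will be saved in the same directory.
Note that the coordinates in this output behave as 1-based and inclusive at both ends (for example, the region 10 40 returns 31 bases), unlike the 0-based, half-open intervals used by standard BED files.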
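The region arithmetic can be sketched in plain Python. This is an illustration only, not bioinfokit's implementation; it assumes 1-based, inclusive coordinates, which is the interpretation consistent with the output shown above. The function names extract_region and parse_region_line are hypothetical.

```python
def extract_region(sequence, start, end):
    """Return bases start..end, treating both as 1-based and inclusive."""
    if start < 1 or end < start:
        raise ValueError(f"invalid region: {start}-{end}")
    # Python slices are 0-based and half-open, so shift only the start.
    return sequence[start - 1:end]


def parse_region_line(line):
    """Split an 'ID start end' line (TAB or space separated) into its fields."""
    seq_id, start, end = line.split()
    return seq_id, int(start), int(end)


# First line of MH150936.1 from input.fasta
seq = "TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTC"
seq_id, start, end = parse_region_line("MH150936.1\t10\t40")
print(extract_region(seq, start, end))  # ATGAAAACTTTTCCTTTACTAAAAACCGTCA
```

The printed subsequence matches the MH150936.1 record in the output above.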
Similarly, you can use bedtools getfasta or seqtk subseq to extract sequences from specific regions of a FASTA file.
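One thing to watch when switching tools: bedtools getfasta reads regions from a BED file, which uses 0-based, half-open coordinates, whereas the extract_seq() output above is consistent with 1-based, inclusive coordinates. The sketch below converts region lines under that assumption; to_bed_line, convert_regions, and the regions.bed file name are hypothetical.

```python
def to_bed_line(seq_id, start, end):
    """Convert a 1-based inclusive region to a 0-based, half-open BED line."""
    # Only the start shifts; the inclusive end equals the exclusive BED end.
    return f"{seq_id}\t{start - 1}\t{end}"


def convert_regions(lines):
    """Convert 'ID start end' lines into BED lines, skipping blank lines."""
    bed_lines = []
    for line in lines:
        if not line.strip():
            continue
        seq_id, start, end = line.split()
        bed_lines.append(to_bed_line(seq_id, int(start), int(end)))
    return bed_lines


regions = ["GU056837.1\t1\t50", "MH150936.1\t10\t40", "KU562861.1\t30\t80"]
for bed_line in convert_regions(regions):
    print(bed_line)
```

After saving the converted lines to regions.bed, a command such as bedtools getfasta -fi input.fasta -bed regions.bed -fo output.fasta should return the same regions.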
This work is licensed under a Creative Commons Attribution 4.0 International License