How to Use seqtk subseq to Extract Sequences from FASTA/FASTQ Files
Seqtk is a lightweight command-line utility for fast manipulation of sequences in FASTA or FASTQ format.
The seqtk subseq command extracts sequences (complete sequences or subsequences) from FASTA/FASTQ files based on the provided sequence IDs and region coordinates.
The general syntax of seqtk subseq looks like this:
# extract sequences from FASTA
seqtk subseq input.fasta ids.txt > seq_subset.fasta
# extract sequences from FASTQ
seqtk subseq input.fastq ids.txt > seq_subset.fastq
Here, input.fasta or input.fastq is the name of your input FASTA/FASTQ file, and ids.txt contains the list of sequence IDs (one ID per line) to extract from it.
The ids.txt file can also contain sequence IDs together with specific sequence regions, in the same layout as a three-column BED file (name, start, end).
How to install seqtk? If you don't have seqtk installed, there are a few ways to install it:
1) using bioconda: conda install -c bioconda seqtk
2) using brew on a Mac: brew install seqtk
3) from source: obtain the source code from the GitHub repository and compile it.
The following examples explain how to use seqtk subseq to extract sequences from FASTA/FASTQ files.
Extract sequences from FASTA
For example, if you have the following FASTA file,
cat input.fasta
>KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCA
AGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>GU056837.1
CTAATTTTATTTTTTTATAATAATTATTGGAGGAACTAAAACATTAATGAAATAATAATTATCATAATTA
TTAATTACATATTTATTAGGTATAATATTTAAGGAAAAATATATTTTATGTTAATTGTAATAATTAGAAC
>CP097510.1
CGATTTAGATCGGTGTAGTCAACACACATCCTCCACTTCCATTAGGCTTCTTGACGAGGACTACATTGAC
AGCCACCGAGGGAACCGACCTCCTCAATGAAGTCAGACGCCAAGAGCCTATCAACTTCCTTCTGCACAGC
>JAMFTS010000002.1
CCTAAACCCTAAACCCTAAACCCCCTACAAACCTTACCCTAAACCCTAAACCCTAAACCCTAAACCCTAA
ACCCGAAACCCTATACCCTAAACCCTAAACCCTAAACCCTAAACCCTAACCCAAACCTAATCCCTAAACC
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTC
AAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG
If you want to extract specific sequences from the FASTA file based on their sequence IDs, you should generate an ids.txt file. This file should list the sequence IDs, one per line, as shown below:
cat ids.txt
KU562861.1
MH150936.1
CP097510.1
Now extract the sequences from input.fasta based on their sequence IDs using seqtk subseq:
seqtk subseq input.fasta ids.txt
# output
>KU562861.1
GGAGCAGGAGAGTGTTCGAGTTCAGAGATGTCCATGGCGCCGTACGAGAAGGTGATGGATGACCTGGCCAAGGGGCAGCAGTTCGCGACGCAGCTGCAGGGCCTCCTCCGGGACTCCCCCAAGGCCGGCCACATCATGGA
>CP097510.1
CGATTTAGATCGGTGTAGTCAACACACATCCTCCACTTCCATTAGGCTTCTTGACGAGGACTACATTGACAGCCACCGAGGGAACCGACCTCCTCAATGAAGTCAGACGCCAAGAGCCTATCAACTTCCTTCTGCACAGC
>MH150936.1
TAGAAGCTAATGAAAACTTTTCCTTTACTAAAAACCGTCAAACACGGTAAGAAACGCTTTTAATCATTTCAAAAGCAATCCCAATAGTGGTTACATCCAAACAAAACCCATTTCTTATATTTTCTCAAAAACAGTGAGAG
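To make the ID-based selection concrete, here is a minimal Python sketch of the same idea. It is not how seqtk is implemented; the helper names read_fasta and subset_by_id are hypothetical, and the sequences are shortened to a few bases for brevity. It reproduces two things visible in the output above: records come out in the order they appear in the input file (not the order in ids.txt), and each sequence is written on a single line.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subset_by_id(fasta_lines, wanted_ids):
    """Keep records whose ID (first word of the header) is in wanted_ids."""
    wanted = set(wanted_ids)
    kept = []
    for header, seq in read_fasta(fasta_lines):
        name = header.split()[0] if header.strip() else ""
        if name in wanted:
            kept.append(f">{header}\n{seq}")
    return kept


# Shortened stand-in for input.fasta (sequences truncated for illustration).
fasta = """>KU562861.1
GGAGCAGGAG
AGTGTTCGAG
>GU056837.1
CTAATTTTAT
>CP097510.1
CGATTTAGAT
>MH150936.1
TAGAAGCTAA
""".splitlines()

# Same IDs, in the same order, as the ids.txt shown above.
ids = ["KU562861.1", "MH150936.1", "CP097510.1"]

records = subset_by_id(fasta, ids)
print("\n".join(records))
```

For real files, seqtk itself (or an established parser library) is the safer choice; the sketch only illustrates the selection logic.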
Extract subsequences from specific regions of FASTA
seqtk subseq can also extract subsequences from specific regions. For example, suppose you have the following ids.txt file containing the sequence name and region coordinates (TAB-separated):
cat ids.txt
KU562861.1 1 10
MH150936.1 1 5
CP097510.1 10 20
Now extract the region sequences from input.fasta based on the sequence IDs and region coordinates using seqtk subseq:
# extract the regions listed in ids.txt
seqtk subseq input.fasta ids.txt
>KU562861.1:2-10
GAGCAGGAG
>CP097510.1:11-20
CGGTGTAGTC
>MH150936.1:2-5
AGAA
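seqtk treats the coordinates in ids.txt as BED-style intervals: the start is 0-based and the end is exclusive. In the output headers, it reports the extracted range as 1-based, inclusive positions. This is why the region 1 10 is labelled 2-10 and yields nine bases.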
Similarly, you can also use bedtools getfasta or Python to extract sequences from specific regions of a FASTA file.
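As a rough Python illustration of the coordinate convention seen in the seqtk output above, the sketch below slices a sequence with a 0-based, end-exclusive interval and builds a 1-based, inclusive label. The function names bed_region and one_based_to_bed are hypothetical helpers, not part of seqtk or any library, and the sequence is the first 20 bases of KU562861.1.

```python
def bed_region(name, seq, start, end):
    """Slice seq with a 0-based, end-exclusive interval.

    Returns (label, subsequence), where the label uses 1-based,
    inclusive positions, matching the headers in the output above.
    """
    label = f"{name}:{start + 1}-{end}"
    return label, seq[start:end]


def one_based_to_bed(first, last):
    """Convert 1-based inclusive positions to a BED-style (start, end) pair."""
    return first - 1, last


# First 20 bases of KU562861.1 from the example FASTA file.
seq = "GGAGCAGGAGAGTGTTCGAG"

# The ids.txt line "KU562861.1 1 10" from the example above.
label, sub = bed_region("KU562861.1", seq, 1, 10)
print(f">{label}\n{sub}")

# To get bases 1 through 10 instead, the BED line would need start 0.
start, end = one_based_to_bed(1, 10)
label_full, sub_full = bed_region("KU562861.1", seq, start, end)
print(f">{label_full}\n{sub_full}")
```

The practical consequence is that a region meant to begin at the first base of a sequence needs a start of 0 in ids.txt, not 1.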
Extract sequences from FASTQ
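For example, if you have the following FASTQ file (only the first four reads are shown here; the file also contains the read SRR22309490.5 used below),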
cat input.fastq
@SRR22309490.1 1 length=101
CTGTTTTGTCTATTTTTGTTTGGTGCATTAGCTCCAATTGTGAACGTTAATTATGGAGGAATTAGTGGTGCTTTTTATGGGAACTATAGATCTAATTATAT
+SRR22309490.1 1 length=101
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@SRR22309490.2 2 length=101
ACCGTATATGTTTTCTATGTTCTCCACCGCAACATACTCTCCTTGTGAGAGTTTAAAGATATTCTTCTTCCTGTCAATTATCTTCATGCTTCCATCTGGTT
+SRR22309490.2 2 length=101
<AAF<J7<<JJJJJJJJFJFF<FJFFJJJJJJJJJJJFJ-FJJFJJJJJJJJJJJFJJF<FJJJJJJJJFJJJJJJJJJJFFJJFFAJJFJFFJJ<FF-FA
@SRR22309490.3 3 length=101
CTCCACTACTATCTCTTCTTCTTTGGAATATCTCCACGGAAAATCATCTTCACAAAAGCGAGATATTCCATTATCGCACCAAAAGTGTCTATGTGAACCCA
+SRR22309490.3 3 length=101
AAAFA7AJFJ<FJ<<FFJJJJJJJJJJJJJJJAJAJJJFJJJJJJJJJJJJJJJJJAF-JJF<FFJJJJJJAFJJJFJFJJJJJJ<<AJJJJJF<A<FAJJ
@SRR22309490.4 4 length=101
CCATGACCTTGGATACAACTTGCCTAGTGGGTCATGGAGATCGGAAGAGCACACGTCTGAACTCCAGTCACAGTTCCGTATCTCGTATGCCGTCTTCTGCT
+SRR22309490.4 4 length=101
A<AFFJJFJJFJJAFFFJJJJAJAJJJJJFJJFFFFJ<F7FJJJAAAJJFFJJJJ-AFA-<JJF77FF<7A<J-A777AFFAJFJFFJJFFJ7JA-AJF-<
If you want to extract specific reads from the FASTQ file based on their read IDs, you should generate an ids.txt file. This file should list the read IDs, one per line, as shown below:
cat ids.txt
SRR22309490.1
SRR22309490.5
Now extract the reads from input.fastq based on their read IDs using seqtk subseq:
seqtk subseq input.fastq ids.txt
@SRR22309490.1 1 length=101
CTGTTTTGTCTATTTTTGTTTGGTGCATTAGCTCCAATTGTGAACGTTAATTATGGAGGAATTAGTGGTGCTTTTTATGGGAACTATAGATCTAATTATAT
+
AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
@SRR22309490.5 5 length=101
CTCGCAGTTGACTCATACTTAGCTCTATCGGTTTTGTACATGTGAGCAATCTCTGGAACCAATGGATCATCTGGGTTTGGGTCCGTTAACAATGAACATAT
+
AAAFFJJJJJJFJJJJF-FJFAFFAFFJJFF<FJFFJFJFFJJFFJJJJJJJFJJJJJJJJJJJJFFFJJJJJFJJJJJJJ<FFJFJJFJJFJFFFJJJJF
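The same selection can be sketched in Python for FASTQ, under the assumption that every record occupies exactly four lines. The helper names read_fastq and subset_fastq_by_id are hypothetical, the reads are truncated to their first 10 bases, and the length tag is left out of the headers because it would no longer match. As in the output above, the separator line is written as a bare + without repeating the read name.

```python
def read_fastq(lines):
    """Yield (header, sequence, quality) from strictly 4-line FASTQ records."""
    stripped = [line.rstrip("\n") for line in lines if line.strip()]
    # Group the lines four at a time: header, sequence, separator, quality.
    for header, seq, _plus, qual in zip(*[iter(stripped)] * 4):
        yield header[1:], seq, qual


def subset_fastq_by_id(fastq_lines, wanted_ids):
    """Keep reads whose ID (first word of the header) is in wanted_ids."""
    wanted = set(wanted_ids)
    kept = []
    for header, seq, qual in read_fastq(fastq_lines):
        if header.split()[0] in wanted:
            kept.append(f"@{header}\n{seq}\n+\n{qual}")
    return kept


# Truncated stand-in for input.fastq (first 10 bases of three reads).
fastq = """@SRR22309490.1 1
CTGTTTTGTC
+SRR22309490.1 1
AAFFFJJJJJ
@SRR22309490.2 2
ACCGTATATG
+SRR22309490.2 2
<AAF<J7<<J
@SRR22309490.5 5
CTCGCAGTTG
+SRR22309490.5 5
AAAFFJJJJJ
""".splitlines()

ids = ["SRR22309490.1", "SRR22309490.5"]

reads = subset_fastq_by_id(fastq, ids)
print("\n".join(reads))
```

Multi-line FASTQ records would break the four-line assumption, which is one reason to prefer seqtk for real data.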
Extract subsequences from specific regions of FASTQ
seqtk subseq can also extract subsequences from specific regions of reads. For example, suppose you have the following ids.txt file containing the read name and region coordinates (TAB-separated):
cat ids.txt
SRR22309490.1 1 10
SRR22309490.5 10 20
Now extract the region sequences from input.fastq based on the read IDs and region coordinates using seqtk subseq:
# extract the regions listed in ids.txt
seqtk subseq input.fastq ids.txt
@SRR22309490.1:2-10 1 length=101
TGTTTTGTC
+
AFFFJJJJJ
@SRR22309490.5:11-20 5 length=101
ACTCATACTT
+
JFJJJJF-FJ
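For FASTQ regions, the sequence and the quality string have to be cut with the same interval so that each base keeps its own quality score. The sketch below shows this in Python with a hypothetical helper, fastq_region. The header layout (range appended to the read name, followed by the original comment) is copied from the output above rather than from any seqtk specification, and the read is truncated to its first 20 bases while keeping its original header text.

```python
def fastq_region(header, seq, qual, start, end):
    """Cut a read and its qualities with a 0-based, end-exclusive interval.

    The returned record labels the range with 1-based, inclusive
    positions, placed between the read name and the header comment.
    """
    name, _, comment = header.partition(" ")
    label = f"{name}:{start + 1}-{end}"
    if comment:
        label += " " + comment
    # The same slice is applied to bases and qualities to keep them aligned.
    return f"@{label}\n{seq[start:end]}\n+\n{qual[start:end]}"


# First 20 bases and qualities of read SRR22309490.5 from the example.
header = "SRR22309490.5 5 length=101"
seq = "CTCGCAGTTGACTCATACTT"
qual = "AAAFFJJJJJJFJJJJF-FJ"

# The ids.txt line "SRR22309490.5 10 20" from the example above.
record = fastq_region(header, seq, qual, 10, 20)
print(record)
```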
This work is licensed under a Creative Commons Attribution 4.0 International License