Samtools: Extract Reads from Specific Genomic Regions
In genomics and bioinformatics, samtools is widely used for extracting sequence reads from BAM file that fall within specific genomic regions.
samtools view
command can be used as shown below to extract reads from single or multiple regions from the BAM file.
# extract reads from single region
samtools view -b input.bam "chr:start-end" > regions.bam
# extract reads from multiple regions
samtools view -b -L region.bed input.bam > regions.sam
To extract reads from a BAM file using samtools, you need to first create an index file (
.bai
) for the BAM file usingsamtools index input.bam
. This index file is necessary for extracting reads within specific genomic regions.
Before we begin, ensure that samtools is installed on your system. If not, you can read this article on installing samtools.
The following examples demonstrate how to extract reads within a specified region from the BAM file using samtools.
Extract reads from single region
If you want to extract the reads from chromosome 1 spanning positions 3000 to 5000, you can use the samtools as:
samtools view -b input.bam "chr1:3000-5000" > regions.bam
Where, -b
parameter specify the output should be in BAM format
The above command will create a new BAM file regions.bam
which will contain the sequence reads that overlap the
given region (chr1:3000-5000).
If you want to include header in the output, you can add -h
parameter to the above command.
If you want to create an output file in SAM format, you can use the following command.
samtools view input.bam "chr1:3000-5000" > regions.sam
Extract reads from multiple regions
You can extract the sequence reads from the multiple regions using the samtools. You need to provide the multiple region coordinates as a BED file.
Suppose you have multiple region coordinates given in the BED file as below,
# region.bed
chr1 3000 5000
chr10 5000 9000
You can use this BED file to extract the sequences reads that overlaps genomic regions in BED file.
samtools view -b -L region.bed input.bam > regions.bam
Where, -b
parameter specifies the output should be in BAM format, -L
parameter specifies to extract the reads
overlapping genomic region given in BED file.
The above command will create a new file regions.bam
which will contain reads overlapping the multiple regions given
in the BED file.
If you want to create an output file in SAM format, you can use the following command.
samtools view -L region.bed input.bam > regions.sam
Enhance your skills with courses on genomics and bioinformatics
- Genomic Data Science Specialization
- Biology Meets Programming: Bioinformatics for Beginners
- Python for Genomic Data Science
- Bioinformatics Specialization
- Command Line Tools for Genomic Data Science
- Introduction to Genomic Technologies
This work is licensed under a Creative Commons Attribution 4.0 International License
Some of the links on this page may be affiliate links, which means we may get an affiliate commission on a valid purchase. The retailer will pay the commission at no additional cost to you.