Parsing and analyzing BAM files
Introduction
- Binary Alignment/Map (BAM) file (.bam) is a compressed binary format of Sequence Alignment/Map (SAM) file (.sam), which is used for storing the sequence alignment information. BAM file is compressed by the BGZF library and it takes less disk space as compared to text-based SAM file. For example, the 6 GB SAM file can be stored as ~800 MB BAM file.
- The majority of downstream bioinformatics analyses, including sequence assembly, read quantification, alignment viewer (IGV), and so on, require BAM files.
- SAMtools, a software package, allows parsing and analyzing the SAM/BAM files. Here, I am using SAMtools v1.15.
How to install SAMtools
Using Source
To install SAMtools, you need to first install HTSlib. Download the latest version of SAMtools (v1.15) and HTSlib from
here. You can directly download from the website or using wget
,
# download
wget https://github.com/samtools/samtools/releases/download/1.15/samtools-1.15.tar.bz2
# install
tar xvjf samtools-1.15.tar.bz2
cd samtools-1.15/
./configure
make
make install
Follow the similar steps to install the HTSlib.
Using Git
Using Git
you need to first install HTSlib and then SAMtools as below,
git clone --recurse-submodules https://github.com/samtools/htslib
cd htslib
autoreconf -i
./configure
make
make install
git clone https://github.com/samtools/samtools
cd samtools
autoheader
autoconf -Wno-syntax
./configure
make
make install
You can use samtools version
or which samtools
commands to check if you have installed the right version of
samtools.
SAMtools utilities
Interconversion of SAM and BAM files
Convert SAM to BAM,
samtools view -bS PC14_L001_R1.sam > PC14_L001_R1.bam
Convert BAM to SAM,
samtools view -h PC14_L001_R1.bam > out.sam
Convert BAM to CRAM,
CRAM is a highly compressed alternative to the BAM file. CRAM file uses indexed reference sequences for the compression. It is fully compatible with BAM. For example, the 800 MB BAM file can be stored as a ~400 MB CRAM file.
samtools view -C -T potato_dm_v404_all_pm_un.fasta PC14_L001_R1_pos_sorted.bam > PC14_L001_R1_pos_sorted.cram
# convert CRAM to BAM (slow as compared to BAM to CRAM)
samtools view -b -T potato_dm_v404_all_pm_un.fasta PC14_L001_R1_pos_sorted.cram > out.bam
Sorting and indexing BAM files
BAM files can be unsorted. Sorted (position-based) BAM files are essential for several Bioinformatics applications such as for StringTie transcript assembly, viewing the alignment using IGV, etc.
# sort BAM file by chromosomal position
# @: number of threads
samtools sort -@ 16 -o PC14_L001_R1_pos_sorted.bam PC14_L001_R1.bam
# sort BAM file by read names (QNAME field)
samtools sort -@ 16 -n -o PC14_L001_R1_read_sorted.bam PC14_L001_R1.bam
# index sorted BAM file (generate .bai file)
# indexing allows fast random access of records of sorted BAM files
samtools index -@ 16 PC14_L001_R1_pos_sorted.bam
You can also use the Picard SortSam
command to sort the BAM file by chromosomal position and read name. Read more
here
If you have genome in FASTA format, you can index it using samtools faidx
,
samtools faidx genome.fasta
The indexed genome file will be saved as genome.fasta.fai
View BAM files on terminal
In addition to the IGV browser, the binary BAM files can be viewed on a terminal using the samtools view
command. You can view
alignments or specific alignment regions from the BAM file.
View BAM file,
# view BAM file
samtools view PC14_L001_R1.bam
# use pipe operator to view first few alignment record (e.g. using head)
samtools view PC14_L001_R1.bam | head
You can also view the specific alignment region from the BAM file (For example, reads mapped to chr1 in between some regions). The BAM file must be indexed to view the specific alignment region.
# view alignmnet region in between 5000 to 10000 bp on chr1
samtools view PC14_L001_R1.bam Chr1:5000-10000
Split BAM file by chromosome
Split the BAM file based on chromosome name
samtools view -b PC14_L001_R1.bam chr00 > chr00.bam
Merge BAM files
Merge multiple BAM files into a single BAM file,
samtools merge merge_out.bam chr00.bam chr01.bam
# add RG tag to each alignment record for source file name (e.g., RG:Z:chr00)
samtools merge -r merge_out.bam chr00.bam chr01.bam
# merge large number of BAM files
samtools merge merge_out.bam *.bam
Calculate sequence coverage or depth
Get sequence coverage or depth from genome mapped data
# get read depth for each position on chromosome
# use -a parameter to get read depth for all positions
samtools depth PC14_L001_R1.bam > read_depth.txt
# get overall read depth
awk '{sum+=$3;} END {print sum/NR;}' read_depth.txt
14.6749
# $3 means read depth at each position of chromosome (third column from read_depth.txt)
# NR means total rows in read_depth.txt [total mapped chromosome size (with -a option, you will get whole chromosome
# size)]
Extract reads from BAM files
Specific region alignment reads can be extracted from the BAM file by providing chromosome name and region coordinates
# extract alignment records from chr01 between specific regions
samtools view PC14_L001_R1.bam chr01:1322100-1332100
Check bedtools for converting reads from BAM to FASTQ and BED format
BAM to FASTA and FASTQ
Convert BAM file into FASTA and FASTQ file format,
# get all mapped reads
# 4 is SAM flag for reads unmapped (0x4)
samtools fasta -F 4 PC14_L001_R1.bam > PC14_L001_R1.fasta
# get unmapped reads
samtools fasta -f 4 PC14_L001_R1.bam > PC14_L001_R1.fasta
# similarly replace fasta with fastq for FASTQ format file
Count total, mapped, and unmapped reads from BAM
Get total number of alignment in a BAM file (mapped and unmapped). This count may also include secondary, supplementary, and duplicate alignments. For paired-end read, both reads are counted together.
# Get total number of alignment
samtools view -c PC14_L001_R1.bam
# output
69815336
Get total mapped reads (includes secondary alignments),
Parameter -F 0x04
or -F 4
(excludes the unmapped reads),
# total mapped reads
samtools view -c -F 4 PC14_L001_R1.bam
# output
66420111
Get total unmapped reads,
Parameter -f 0x04
or -f 4
includes the unmapped reads,
# total unmapped reads
samtools view -c -f 4 PC14_L001_R1.bam
# output
3395225
Get total count of single or paired reads (which is used in FASTQ file for mapping to genome),
For paired-end reads, sum of the counts of both reads is provided
# read counts
samtools view -c -f 1 -F 3328 PC14_L001_R1.bam
# output
62074528
Get primary mapped read counts ,
Parameter -F 0x04,0x100
or -F 260
(excludes unmapped and secondary alignments),
# primary mapped read counts
samtools view -c -F 260 PC14_L001_R1.bam
# output
58679303
Get complete alignment statistics for each flag,
samtools flagstat PC14_L001_R1.bam
# output
69815336 + 0 in total (QC-passed reads + QC-failed reads)
7740808 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
66420111 + 0 mapped (95.14% : N/A)
62074528 + 0 paired in sequencing
31037264 + 0 read1
31037264 + 0 read2
58679160 + 0 properly paired (94.53% : N/A)
58679160 + 0 with itself and mate mapped
143 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
Enhance your skills with courses on genomics and bioinformatics
- Genomic Data Science Specialization
- Biology Meets Programming: Bioinformatics for Beginners
- Python for Genomic Data Science
- Bioinformatics Specialization
- Command Line Tools for Genomic Data Science
- Introduction to Genomic Technologies
References
- Samtools
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R. The sequence alignment/map format and SAMtools. Bioinformatics. 2009 Aug 15;25(16):2078-9.
- Bonfield JK, Marshall J, Danecek P, Li H, Ohan V, Whitwham A, Keane T, Davies RM. HTSlib: C library for reading/writing high-throughput sequencing data. Gigascience. 2021 Feb;10(2):giab007.
- Danecek P, Bonfield JK, Liddle J, Marshall J, Ohan V, Pollard MO, Whitwham A, Keane T, McCarthy SA, Davies RM, Li H. Twelve years of SAMtools and BCFtools. Gigascience. 2021 Feb;10(2):giab008.
If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com
This work is licensed under a Creative Commons Attribution 4.0 International License
Some of the links on this page may be affiliate links, which means we may get an affiliate commission on a valid purchase. The retailer will pay the commission at no additional cost to you.