Annotation of genetic variants (SNPs/InDels) in VCF file
- The Variant Call Format (VCF) file produced by variant calling software (e.g. GATK, FreeBayes, BCFtools) contains the information for polymorphic loci present in the sample or population.
- These genetic variants reported in VCF files generally lack the important variant annotation information such as genomic region-based annotation (CDS, introns, intergenic, etc.), associated genes, and their functions, etc.
- Therefore, I have added the
vcf_anot
function inbioinfokit
to add the variant annotation for variant location in the genome, associated genes, and gene functions. Currently,vcf_anot
does not predict the variant effect on amino acid changes.
First, let’s check variant calls in test VCF file without adding annotation
##fileformat=VCFv4.2
##filedate=20190219
##source="beagle.21Jan17.6cc.jar (version 4.1)"
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated ALT Allele Frequencies">
##INFO=<ID=AR2,Number=1,Type=Float,Description="Allelic R-Squared: estimated squared correlation between most probable REF dose and true REF dose">
##INFO=<ID=DR2,Number=1,Type=Float,Description="Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2*P(RR)] and true REF dose">
##INFO=<ID=IMP,Number=0,Type=Flag,Description="Imputed marker">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=A,Type=Float,Description="estimated ALT dose [P(RA) + P(AA)]">
##FORMAT=<ID=GP,Number=G,Type=Float,Description="Estimated Genotype Probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 100_S85 101_S86
SL3.0ch00 2157586 . A T . PASS AR2=0.94;DR2=0.95;AF=0.82 GT:DS:GP 1/1:1.92:0,0.07,0.93 1/1:1.98:0,0.02,0.98
SL3.0ch00 2157587 . G A . PASS AR2=0.91;DR2=0.93;AF=0.46 GT:DS:GP ./. 0/1:0.95:0.05,0.94,0
SL3.0ch00 2157604 . A G . PASS AR2=0.91;DR2=0.93;AF=0.47 GT:DS:GP ./. 0/1:0.96:0.04,0.95,0
SL3.0ch00 2157617 . A C . PASS AR2=0.95;DR2=0.96;AF=0.82 GT:DS:GP 1/1:1.94:0,0.06,0.94 1/1:1.99:0,0.01,0.99
SL3.0ch00 2157620 . G A . PASS AR2=0.91;DR2=0.93;AF=0.48 GT:DS:GP ./. 0/1:0.97:0.04,0.96,0
SL3.0ch00 2157622 . G T . PASS AR2=0.95;DR2=0.96;AF=0.82 GT:DS:GP 1/1:1.94:0,0.06,0.94 1/1:1.99:0,0.01,0.99
SL3.0ch00 2157626 . C T . PASS AR2=0.86;DR2=0.89;AF=0.17 GT:DS:GP ./. 0/0:0.06:0.94,0.06,0
SL3.0ch00 2157627 . A G . PASS AR2=0.95;DR2=0.96;AF=0.82 GT:DS:GP 1/1:1.95:0,0.05,0.95 1/1:1.99:0,0.01,0.99
SL3.0ch00 2157650 . C A . PASS AR2=0.96;DR2=0.96;AF=0.81 GT:DS:GP 1/1:1.95:0,0.05,0.95 1/1:1.99:0,0.01,0.99
SL3.0ch00 10737602 . A G . PASS AR2=0.81;DR2=0.85;AF=0.79 GT:DS:GP 1/1:1.92:0,0.08,0.92 1/1:2:0,0,1
SL3.0ch00 10905876 . A G . PASS AR2=0.86;DR2=0.87;AF=0.21 GT:DS:GP ./. 0/0:0.09:0.91,0.09,0
How to add annotation to variants (SNPs/InDels) in VCF file?
- We will use
bioinfokit v0.9.4
or later - Check bioinfokit documentation for installation and documentation
- Download test genotype data in VCF format from tomato GBS study (Kandel et al., 2019). For test analysis purposes, I have only used two genotypes presented in the test VCF file instead of all genotypes.
- Download test tomato gff3 file for adding an annotation to the variants
# I am using interactive python interpreter (Python 3.8.2)
>>> from bioinfokit.analys import marker
>>> marker.vcf_anot(file='SL3.0ch00_imputed.vcf', gff_file='itag30_chr0.gff3', anot_attr='Note')
# annotated file will be saved in the same directory (SL3.0ch00_imputed_anot.txt)
Annotated output file SL3.0ch00_imputed_anot will have additional columns including genomic region, and transcript IDs and name for variants in CDS or exon or UTR regions
Now check variant calls in annotated text file,
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 100_S85 101_S86 genomic region transcript ID transcript name strand
SL3.0ch00 2157586 . A T . PASS AR2=0.94;DR2=0.95;AF=0.82 GT:DS:GP 1/1:1.92:0,0.07,0.93 1/1:1.98:0,0.02,0.98 Intergenic None None None
SL3.0ch00 2157587 . G A . PASS AR2=0.91;DR2=0.93;AF=0.46 GT:DS:GP ./. 0/1:0.95:0.05,0.94,0 Intergenic None None None
SL3.0ch00 2157604 . A G . PASS AR2=0.91;DR2=0.93;AF=0.47 GT:DS:GP ./. 0/1:0.96:0.04,0.95,0 Intergenic None None None
SL3.0ch00 2157617 . A C . PASS AR2=0.95;DR2=0.96;AF=0.82 GT:DS:GP 1/1:1.94:0,0.06,0.94 1/1:1.99:0,0.01,0.99 Intergenic None None None
SL3.0ch00 2157620 . G A . PASS AR2=0.91;DR2=0.93;AF=0.48 GT:DS:GP ./. 0/1:0.97:0.04,0.96,0 Intergenic None None None
SL3.0ch00 2157622 . G T . PASS AR2=0.95;DR2=0.96;AF=0.82 GT:DS:GP 1/1:1.94:0,0.06,0.94 1/1:1.99:0,0.01,0.99 Intergenic None None None
SL3.0ch00 2157626 . C T . PASS AR2=0.86;DR2=0.89;AF=0.17 GT:DS:GP ./. 0/0:0.06:0.94,0.06,0 Intergenic None None None
SL3.0ch00 2157627 . A G . PASS AR2=0.95;DR2=0.96;AF=0.82 GT:DS:GP 1/1:1.95:0,0.05,0.95 1/1:1.99:0,0.01,0.99 Intergenic None None None
SL3.0ch00 2157650 . C A . PASS AR2=0.96;DR2=0.96;AF=0.81 GT:DS:GP 1/1:1.95:0,0.05,0.95 1/1:1.99:0,0.01,0.99 Intergenic None None None
SL3.0ch00 10737602 . A G . PASS AR2=0.81;DR2=0.85;AF=0.79 GT:DS:GP 1/1:1.92:0,0.08,0.92 1/1:2:0,0,1 CDS mRNA:Solyc00g102000.2.1 LOW QUALITY:Unknown protein (AHRD V3.3 ) +
SL3.0ch00 10905876 . A G . PASS AR2=0.86;DR2=0.87;AF=0.21 GT:DS:GP ./. 0/0:0.09:0.91,0.09,0 CDS mRNA:Solyc00g020540.2.1 Aminotransferase-like protein (AHRD V3.3 *-* Q6EQM2_ORYSJ) +
If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com
This work is licensed under a Creative Commons Attribution 4.0 International License