Convert GFF3 to GTF file format

Renesh Bedre

What is GFF3 file format?

  • GFF3 (Generic Feature Format Version 3) represents genomic features in a simple, tab-delimited text file
  • A GFF3 file has nine fields (seqid, source, type, start, end, score, strand, phase, and attributes)
  • Lines that start with ‘##’ provide meta-information about the file, and lines that start with ‘#’ are human-readable comments
Field name   Description
seqid        Identifier of the chromosome or scaffold on which the feature (gene, mRNA, exon, CDS, or UTR) is located
source       Name of the source that generated the feature. It can be the name of a software program or a database.
type         Type of the feature (e.g. gene, mRNA or transcript, exon, CDS, or UTR)
start        1-based integer start coordinate of the feature
end          1-based integer end coordinate of the feature
score        Score of the feature (a floating-point number), or ‘.’ if no score is available
strand       Plus (+) or minus (-) strand of the feature
phase        Position of the first base of the next codon relative to the 5’ end of a CDS feature. If phase=0, the codon begins at the first nucleotide of the CDS; if phase=1, the codon begins at the second nucleotide; if phase=2, the codon begins at the third nucleotide. Phase is required for all CDS features.
attributes   Feature annotation in tag=value format (tags include ID, Parent, Name, etc.). Multiple annotations are separated by ‘;’. The values do not need to be quoted.
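
To make the nine-field layout concrete, here is a minimal Python sketch that splits one GFF3 line into named fields and turns the attributes column into a dictionary. The names `Gff3Record` and `parse_gff3_line` are hypothetical helpers made up for this illustration (they are not part of bioinfokit), and the sketch assumes one feature per line with exactly nine tab-separated columns.

```python
from typing import NamedTuple, Optional
from urllib.parse import unquote


class Gff3Record(NamedTuple):
    """One GFF3 feature line (hypothetical helper type for this sketch)."""
    seqid: str
    source: str
    type: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    phase: Optional[int]
    attributes: dict


def parse_gff3_line(line: str) -> Optional[Gff3Record]:
    """Parse one GFF3 line; return None for blank, '##', and '#' lines."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    attributes = {}
    for pair in attrs.split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            # GFF3 escapes reserved characters with URL-style percent-encoding
            attributes[tag] = unquote(value)
    return Gff3Record(
        seqid, source, ftype, int(start), int(end),
        None if score == "." else float(score),   # '.' means no score
        strand,
        None if phase == "." else int(phase),     # '.' means no phase
        attributes,
    )


example = ("Chr1\tphytozomev10\tCDS\t3760\t3913\t.\t+\t0\t"
           "ID=AT1G01010.1.TAIR10.CDS.1;Parent=AT1G01010.1.TAIR10;pacid=19656964")
record = parse_gff3_line(example)
print(record.type, record.start, record.end, record.phase, record.attributes["Parent"])
```

Returning `None` for directive and comment lines keeps the caller's loop simple: it can skip anything that is not a feature.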

Representation of the genomic features of the plant Arabidopsis thaliana in GFF3 format (only one gene is shown), downloaded from the Phytozome database

##gff-version 3
##annot-version TAIR10
Chr1	phytozomev10	gene	3631	5899	.	+	.	ID=AT1G01010.TAIR10;Name=AT1G01010
Chr1	phytozomev10	mRNA	3631	5899	.	+	.	ID=AT1G01010.1.TAIR10;Name=AT1G01010.1;pacid=19656964;longest=1;Parent=AT1G01010.TAIR10
Chr1	phytozomev10	five_prime_UTR	3631	3759	.	+	.	ID=AT1G01010.1.TAIR10.five_prime_UTR.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1	phytozomev10	CDS	3760	3913	.	+	0	ID=AT1G01010.1.TAIR10.CDS.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1	phytozomev10	CDS	3996	4276	.	+	2	ID=AT1G01010.1.TAIR10.CDS.2;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1	phytozomev10	CDS	4486	4605	.	+	0	ID=AT1G01010.1.TAIR10.CDS.3;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1	phytozomev10	CDS	4706	5095	.	+	0	ID=AT1G01010.1.TAIR10.CDS.4;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1	phytozomev10	CDS	5174	5326	.	+	0	ID=AT1G01010.1.TAIR10.CDS.5;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1	phytozomev10	CDS	5439	5630	.	+	0	ID=AT1G01010.1.TAIR10.CDS.6;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1	phytozomev10	three_prime_UTR	5631	5899	.	+	.	ID=AT1G01010.1.TAIR10.three_prime_UTR.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
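
The phase column of the CDS rows above can be recomputed from the CDS coordinates alone. The following Python sketch is only an illustration: `cds_phases` is a hypothetical helper, and it assumes the first CDS segment in translation order starts a complete codon (phase 0).

```python
def cds_phases(cds_intervals, strand="+"):
    """Return the phase of each CDS segment, in the same order as the input.

    cds_intervals: list of (start, end) tuples, 1-based and inclusive.
    Assumption: the first segment in translation order has phase 0.
    """
    # Walk the segments in 5'->3' order: ascending on '+', descending on '-'
    ordered = sorted(cds_intervals, reverse=(strand == "-"))
    phases = {}
    phase = 0
    for start, end in ordered:
        phases[(start, end)] = phase
        length = end - start + 1
        # After skipping `phase` bases, (length - phase) % 3 bases are left over
        # at the 3' end; the next segment must supply the rest of that codon.
        phase = (3 - (length - phase) % 3) % 3
    return [phases[interval] for interval in cds_intervals]


# CDS coordinates of AT1G01010.1 from the GFF3 example
cds = [(3760, 3913), (3996, 4276), (4486, 4605),
       (4706, 5095), (5174, 5326), (5439, 5630)]
print(cds_phases(cds, strand="+"))
```

For this gene, the first CDS is 154 bases long, which leaves one base of an unfinished codon, so the second CDS needs two more bases before the next codon starts (phase 2).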

Download the Arabidopsis thaliana GFF3 file containing all genomic features of Chr 1

What is GTF file format?

  • GTF (Gene Transfer Format) is similar in structure to the GFF2 format
  • Like GFF3, GTF represents genomic features in a simple, tab-delimited text file
  • GTF has the same fields as described above for GFF3, with some changes in the attributes field
  • GTF requires the CDS, start_codon, and stop_codon feature types in the feature field. The UTR (five_prime_UTR and three_prime_UTR), inter (intergenic region), and exon feature types are optional.
  • In the attributes field, the gene_id and transcript_id tags are required for each feature type
  • In the attributes field, the text value of each tag must be double-quoted, which is not necessary in GFF3
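
The attribute rules listed above can be sketched in a few lines of Python. Both `format_gtf_attributes` and `missing_required_features` are hypothetical helper names invented for this illustration; the sketch double-quotes every value, which is a simplification.

```python
# Feature types that the list above names as required in a GTF file
REQUIRED_FEATURES = {"CDS", "start_codon", "stop_codon"}


def format_gtf_attributes(gene_id, transcript_id, **extra):
    """Build a GTF attributes column: tag "value"; pairs separated by spaces.

    gene_id and transcript_id come first; every value is double-quoted.
    """
    pairs = [("gene_id", gene_id), ("transcript_id", transcript_id)]
    pairs.extend(extra.items())
    return " ".join(f'{tag} "{value}";' for tag, value in pairs)


def missing_required_features(feature_types):
    """Return the required GTF feature types absent from feature_types, sorted."""
    return sorted(REQUIRED_FEATURES - set(feature_types))


print(format_gtf_attributes("AT1G01010", "AT1G01010.1", exon_number="1"))
print(missing_required_features(["gene", "transcript", "exon", "CDS"]))
```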

Representation of the genomic features of the plant Arabidopsis thaliana in GTF format (only one gene is shown), downloaded from the Ensembl Plants database

#!genome-build TAIR10
#!genome-version TAIR10
#!genome-date 2008-04
#!genome-build-accession GCA_000001735.1
#!genebuild-last-updated 2010-09
1	araport11	gene	3631	5899	.	+	.	gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding";
1	araport11	transcript	3631	5899	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1	araport11	exon	3631	3913	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1";
1	araport11	CDS	3760	3913	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1	araport11	start_codon	3760	3762	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1	araport11	exon	3996	4276	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon2";
1	araport11	CDS	3996	4276	.	+	2	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1	araport11	exon	4486	4605	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon3";
1	araport11	CDS	4486	4605	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1	araport11	exon	4706	5095	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon4";
1	araport11	CDS	4706	5095	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1	araport11	exon	5174	5326	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon5";
1	araport11	CDS	5174	5326	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1	araport11	exon	5439	5899	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon6";
1	araport11	CDS	5439	5627	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1	araport11	stop_codon	5628	5630	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1	araport11	five_prime_utr	3631	3759	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1	araport11	three_prime_utr	5631	5899	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
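
The exon rows in this GTF example cover the same bases as the UTR and CDS rows of the GFF3 example taken together. The Python sketch below shows that relationship by merging directly adjacent intervals. `merge_into_exons` is a hypothetical helper written for this illustration, and it is not meant to describe how any particular converter derives exons.

```python
def merge_into_exons(segments):
    """Merge overlapping or directly adjacent (start, end) intervals.

    Coordinates are 1-based and inclusive, so (3631, 3759) and (3760, 3913)
    touch and are merged into (3631, 3913).
    """
    exons = []
    for start, end in sorted(segments):
        if exons and start <= exons[-1][1] + 1:
            # Extend the previous exon instead of starting a new one
            exons[-1] = (exons[-1][0], max(exons[-1][1], end))
        else:
            exons.append((start, end))
    return exons


# UTR and CDS coordinates of AT1G01010.1 from the GFF3 example
utr_and_cds = [
    (3631, 3759),                              # five_prime_UTR
    (3760, 3913), (3996, 4276), (4486, 4605),  # CDS 1-3
    (4706, 5095), (5174, 5326), (5439, 5630),  # CDS 4-6
    (5631, 5899),                              # three_prime_UTR
]
for exon in merge_into_exons(utr_and_cds):
    print(exon)
```

The six merged intervals match the six exon rows of the GTF example, from 3631-3913 for the first exon to 5439-5899 for the last.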

Convert GFF3 to GTF file format

# You can run this in the interactive Python interpreter, a Jupyter notebook, Google Colab, Spyder, or a Python script
# I am using the interactive Python interpreter (Python 3.8.2)
>>> from bioinfokit.analys import gff
>>> gff.gff_to_gtf(file="Athaliana_167_TAIR10.gene_chr1.gff3")
# The converted GTF file will be saved in the same directory (Athaliana_167_TAIR10.gene_chr1.gtf)

# Note: if the mRNA feature type (column 3) has a name other than 'mRNA' or 'transcript' in your GFF3 file, you
# can use the trn_feature_name option to pass that feature name to gff_to_gtf
# See https://reneshbedre.github.io/blog/howtoinstall.html#gff3-to-gtf-file-format-conversion
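
Before converting, it can help to check which feature names actually appear in column 3 of your GFF3 file, so you know whether you need trn_feature_name. The sketch below uses only the Python standard library. `count_feature_types` is a hypothetical helper and not part of bioinfokit.

```python
from collections import Counter


def count_feature_types(lines):
    """Count the values in column 3 (feature type) of GFF3 lines.

    Blank lines, '##' directives, and '#' comments are skipped, as are lines
    that do not have exactly nine tab-separated fields.
    """
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 9:
            counts[fields[2]] += 1
    return counts


# Small in-memory demonstration using rows from the GFF3 example
demo_lines = [
    "##gff-version 3\n",
    "Chr1\tphytozomev10\tgene\t3631\t5899\t.\t+\t.\tID=AT1G01010.TAIR10;Name=AT1G01010\n",
    "Chr1\tphytozomev10\tmRNA\t3631\t5899\t.\t+\t.\tID=AT1G01010.1.TAIR10;Parent=AT1G01010.TAIR10\n",
    "Chr1\tphytozomev10\tCDS\t3760\t3913\t.\t+\t0\tID=AT1G01010.1.TAIR10.CDS.1;Parent=AT1G01010.1.TAIR10\n",
]
print(count_feature_types(demo_lines))

# With a real file, pass the open file handle instead:
# with open("Athaliana_167_TAIR10.gene_chr1.gff3") as handle:
#     print(count_feature_types(handle))
```

If neither 'mRNA' nor 'transcript' shows up in the counts, the name that plays that role in your file is the one to pass as trn_feature_name.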

References

  • https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
  • https://mblab.wustl.edu/GTF22.html
  • https://useast.ensembl.org/info/website/upload/gff.html

This work is licensed under a Creative Commons Attribution 4.0 International License