Convert GFF3 to GTF file format
What is GFF3 file format?
- GFF3 (Generic Feature Format Version 3) represents genomic features in a simple, text-based, tab-delimited file
 - A GFF3 file has nine fields (seqid, source, type, start, end, score, strand, phase, and attributes)
 - Lines that start with ‘##’ provide meta-information about the file, and lines that start with ‘#’ are human-readable comments
 
| Field name | Description | 
|---|---|
| seqid | Chromosome or scaffold identifier on which the feature (gene, mRNA, exon, CDS, or UTR) is located | 
| source | Name of the source that generated the feature. It can be the name of a software tool or a database. | 
| type | Type of the feature (e.g. gene, mRNA or transcript, exon, CDS, or UTR) | 
| start | 1-based integer start coordinate of the feature | 
| end | 1-based integer end coordinate of the feature | 
| score | Score of the feature (‘.’ if no score is defined) | 
| strand | Plus (+) or minus (-) strand of the feature | 
| phase | For CDS features, the phase indicates where the next codon begins relative to the 5’ end of the feature. If phase=0, the codon begins at the first base of the CDS feature; if phase=1, the codon begins at the second base; if phase=2, the codon begins at the third base. Phase is required for all CDS features. | 
| attributes | Feature annotation as tag=value pairs (tags e.g. ID, Parent, Name). Multiple pairs are separated by ‘;’. It is not necessary to quote the values. | 
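
To make the nine-column layout concrete, here is a minimal sketch that splits one GFF3 feature line into named fields and turns the attributes column into a dictionary. The function name `parse_gff3_line` and the constant `GFF3_FIELDS` are illustrative choices, not part of any library, and the sketch assumes a well-formed feature line (not a comment or directive).

```python
from urllib.parse import unquote

# Field names in column order, as listed in the table above
GFF3_FIELDS = ("seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes")

def parse_gff3_line(line):
    """Split one GFF3 feature line into a dict keyed by field name (hypothetical helper)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(columns)}")
    record = dict(zip(GFF3_FIELDS, columns))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Attributes are tag=value pairs separated by ';'
    attributes = {}
    for pair in record["attributes"].strip().split(";"):
        if pair:
            tag, _, value = pair.partition("=")
            # GFF3 escapes reserved characters with URL-style percent-encoding
            attributes[tag.strip()] = unquote(value.strip())
    record["attributes"] = attributes
    return record

line = ("Chr1\tphytozomev10\tCDS\t3760\t3913\t.\t+\t0\t"
        "ID=AT1G01010.1.TAIR10.CDS.1;Parent=AT1G01010.1.TAIR10;pacid=19656964")
rec = parse_gff3_line(line)
print(rec["type"], rec["start"], rec["end"], rec["attributes"]["Parent"])
# prints: CDS 3760 3913 AT1G01010.1.TAIR10
```
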
Representation of genomic features of the plant Arabidopsis thaliana in GFF3 format (only one gene is shown), downloaded from the Phytozome database
##gff-version 3
##annot-version TAIR10
Chr1	phytozomev10	gene	3631	5899	.	+	.	ID=AT1G01010.TAIR10;Name=AT1G01010
Chr1	phytozomev10	mRNA	3631	5899	.	+	.	ID=AT1G01010.1.TAIR10;Name=AT1G01010.1;pacid=19656964;longest=1;Parent=AT1G01010.TAIR10
Chr1	phytozomev10	five_prime_UTR	3631	3759	.	+	.	ID=AT1G01010.1.TAIR10.five_prime_UTR.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1	phytozomev10	CDS	3760	3913	.	+	0	ID=AT1G01010.1.TAIR10.CDS.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1	phytozomev10	CDS	3996	4276	.	+	2	ID=AT1G01010.1.TAIR10.CDS.2;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1	phytozomev10	CDS	4486	4605	.	+	0	ID=AT1G01010.1.TAIR10.CDS.3;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1	phytozomev10	CDS	4706	5095	.	+	0	ID=AT1G01010.1.TAIR10.CDS.4;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1	phytozomev10	CDS	5174	5326	.	+	0	ID=AT1G01010.1.TAIR10.CDS.5;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1	phytozomev10	CDS	5439	5630	.	+	0	ID=AT1G01010.1.TAIR10.CDS.6;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1	phytozomev10	three_prime_UTR	5631	5899	.	+	.	ID=AT1G01010.1.TAIR10.three_prime_UTR.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
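The phase values in column 8 of the CDS rows above can be checked by hand. As a rough sketch, assuming a plus-strand transcript whose CDS segments are listed in ascending order, the phase of each segment follows from the length and phase of the previous one. The helper name `next_phase` is illustrative.

```python
def next_phase(start, end, phase):
    """Phase of the CDS segment that follows one with the given coordinates and phase."""
    length = end - start + 1  # coordinates are 1-based and inclusive
    # Bases left over after whole codons belong to a codon that is
    # completed at the start of the next segment
    return (3 - (length - phase) % 3) % 3

# CDS coordinates of AT1G01010.1 taken from the GFF3 rows above
cds = [(3760, 3913), (3996, 4276), (4486, 4605),
       (4706, 5095), (5174, 5326), (5439, 5630)]

phase = 0  # the first CDS segment starts with a complete codon
phases = []
for start, end in cds:
    phases.append(phase)
    phase = next_phase(start, end, phase)

print(phases)
# prints: [0, 2, 0, 0, 0, 0]
```

The result matches the phase column in the example. For a minus-strand gene the segments would have to be processed from the highest coordinate downwards, which this sketch does not handle.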
Download the Arabidopsis thaliana GFF3 file containing all genomic features of Chr 1
What is GTF file format?
- GTF (Gene Transfer Format) is similar in structure to the GFF2 format
 - Like GFF3, GTF represents genomic features in a simple, text-based, tab-delimited file
 - GTF has the same fields as described above for GFF3, with some changes in the attributes field
 - GTF requires CDS, start_codon, and stop_codon in the feature field. UTR (5UTR and 3UTR in the GTF2.2 specification; five_prime_utr and three_prime_utr in Ensembl files), inter (intergenic region), and exon feature types are optional.
 - In the attributes field, the gene_id and transcript_id tags are required for each feature
 - In the attributes field, the text value of each tag must be double-quoted, which is not necessary in GFF3
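
The attribute rules above can be illustrated with a small sketch that builds a GTF attributes column: gene_id and transcript_id come first, every value is double-quoted, and each pair ends with a semicolon. The function `format_gtf_attributes` and its `extra` parameter are hypothetical names used only for illustration; this is not how any particular converter is implemented.

```python
def format_gtf_attributes(gene_id, transcript_id, extra=None):
    """Build a GTF attributes string (hypothetical helper for illustration)."""
    # gene_id and transcript_id are the two required tags
    pairs = [("gene_id", gene_id), ("transcript_id", transcript_id)]
    # Any further tags follow in insertion order
    pairs.extend((extra or {}).items())
    # Every value is double-quoted and every pair is terminated by ';'
    return " ".join(f'{tag} "{value}";' for tag, value in pairs)

print(format_gtf_attributes("AT1G01010", "AT1G01010.1", {"exon_number": 1}))
# prints: gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1";
```
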
 
Representation of genomic features of the plant Arabidopsis thaliana in GTF format (only one gene is shown), downloaded from the Ensembl Plants database
#!genome-build TAIR10
#!genome-version TAIR10
#!genome-date 2008-04
#!genome-build-accession GCA_000001735.1
#!genebuild-last-updated 2010-09
1	araport11	gene	3631	5899	.	+	.	gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding";
1	araport11	transcript	3631	5899	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1	araport11	exon	3631	3913	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1";
1	araport11	CDS	3760	3913	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1	araport11	start_codon	3760	3762	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1	araport11	exon	3996	4276	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon2";
1	araport11	CDS	3996	4276	.	+	2	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1	araport11	exon	4486	4605	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon3";
1	araport11	CDS	4486	4605	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1	araport11	exon	4706	5095	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon4";
1	araport11	CDS	4706	5095	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1	araport11	exon	5174	5326	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon5";
1	araport11	CDS	5174	5326	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1	araport11	exon	5439	5899	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon6";
1	araport11	CDS	5439	5627	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1	araport11	stop_codon	5628	5630	.	+	0	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1	araport11	five_prime_utr	3631	3759	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1	araport11	three_prime_utr	5631	5899	.	+	.	gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
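Going the other way, the attributes column of the GTF rows above can be read back into a dictionary. This is a minimal sketch that assumes every value is double-quoted, as in the Ensembl example; the names `GTF_ATTRIBUTE` and `parse_gtf_attributes` are illustrative, and a tag that appears more than once would keep only its last value.

```python
import re

# One attribute looks like: tag "value";
GTF_ATTRIBUTE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')

def parse_gtf_attributes(text):
    """Return the tag/value pairs of a GTF attributes column as a dict."""
    return dict(GTF_ATTRIBUTE.findall(text))

attrs = parse_gtf_attributes(
    'gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1";'
)
print(attrs["gene_id"], attrs["transcript_id"], attrs["exon_number"])
# prints: AT1G01010 AT1G01010.1 1
```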
Convert GFF3 to GTF file format
- We will use bioinfokit v0.9.8 or later. Check the bioinfokit documentation for the latest version.
 - Download the dataset
 
# You can use the interactive Python interpreter, Jupyter Notebook, Google Colab, Spyder, or a Python script
# I am using the interactive Python interpreter (Python 3.8.2)
>>> from bioinfokit.analys import gff
>>> gff.gff_to_gtf(file="Athaliana_167_TAIR10.gene_chr1.gff3")
# The converted GTF file will be saved in the same directory (Athaliana_167_TAIR10.gene_chr1.gtf)
# Note: if the mRNA feature type (column 3) has a name other than 'mRNA' or 'transcript' in your GFF3 file, you
# can use the trn_feature_name option to pass that feature name to gff_to_gtf
# See https://reneshbedre.github.io/blog/howtoinstall.html#gff3-to-gtf-file-format-conversion
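After the conversion, a quick sanity check is to count how many rows of each feature type ended up in the GTF file. The sketch below uses only the standard library and is independent of bioinfokit; `count_feature_types` is a hypothetical helper, and the output file name is the one mentioned in the comments above.

```python
import os
from collections import Counter

def count_feature_types(lines):
    """Count rows per feature type (column 3), skipping comments and blank lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9:
            counts[columns[2]] += 1
    return counts

# Output file name as stated above; the check is skipped if the file is absent
gtf_path = "Athaliana_167_TAIR10.gene_chr1.gtf"
if os.path.exists(gtf_path):
    with open(gtf_path) as handle:
        for feature, n in sorted(count_feature_types(handle).items()):
            print(feature, n)
```

If a feature type you expect is missing from the counts, check that the transcript-level feature name in the input matches what the converter expects (see the note on trn_feature_name above).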
References
- https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
 - https://mblab.wustl.edu/GTF22.html
 - https://useast.ensembl.org/info/website/upload/gff.html
 
This work is licensed under a Creative Commons Attribution 4.0 International License