Convert GFF3 to GTF file format
What is GFF3 file format?
- GFF3 (Generic Feature Format Version 3) file format represents the genomic features in a simple text-based tab-delimited file
- GFF3 file has nine fields (seqid, source, feature, start, end, score, strand, phase, and attributes)
- The lines which starts with ‘##’ provides the meta-information of the file and ‘#’ represents the human-readable comments
Field name | Description |
---|---|
seqid | Chromosome or scaffold identifier for a given feature type (gene, mRNA, exon, CDS, or UTR) |
source | name of the source from where the feature is generated. It can be name of software or databases. |
type | Type of the feature (e.g. gene, mRNA or transcript, exon, CDS, or UTR) |
start | 1-based start integer coordinate of given feature type |
end | 1-based end integer coordinate of given feature type |
score | score of the feature type |
strand | plus (+) or minus (-) strand of the feature type |
phase | phase indicates the first base of the codon relative to the 5’ end in CDS feature. If phase=0, the codon begin at the first base of CDS nucleotide; if phase=1 the codon begin at the second base of CDS nucleotide; if phase=2 the codon begin at the third base of CDS nucleotide. Phase is required for all CDS features. |
attributes | feature annotation in the format of tag=value (tag e.g. ID, Parent, Name etc.). Multiple annotation are separated by ‘;’. It is not necessary to quote the values. |
Representation of genomic features of plant Arabidopsis thaliana in GFF3 format (only one gene has shown) downloaded from Phytozome database
##gff-version 3
##annot-version TAIR10
Chr1 phytozomev10 gene 3631 5899 . + . ID=AT1G01010.TAIR10;Name=AT1G01010
Chr1 phytozomev10 mRNA 3631 5899 . + . ID=AT1G01010.1.TAIR10;Name=AT1G01010.1;pacid=19656964;longest=1;Parent=AT1G01010.TAIR10
Chr1 phytozomev10 five_prime_UTR 3631 3759 . + . ID=AT1G01010.1.TAIR10.five_prime_UTR.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1 phytozomev10 CDS 3760 3913 . + 0 ID=AT1G01010.1.TAIR10.CDS.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1 phytozomev10 CDS 3996 4276 . + 2 ID=AT1G01010.1.TAIR10.CDS.2;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1 phytozomev10 CDS 4486 4605 . + 0 ID=AT1G01010.1.TAIR10.CDS.3;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1 phytozomev10 CDS 4706 5095 . + 0 ID=AT1G01010.1.TAIR10.CDS.4;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1 phytozomev10 CDS 5174 5326 . + 0 ID=AT1G01010.1.TAIR10.CDS.5;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1 phytozomev10 CDS 5439 5630 . + 0 ID=AT1G01010.1.TAIR10.CDS.6;Parent=AT1G01010.1.TAIR10;pacid=19656964
Chr1 phytozomev10 three_prime_UTR 5631 5899 . + . ID=AT1G01010.1.TAIR10.three_prime_UTR.1;Parent=AT1G01010.1.TAIR10;pacid=19656964
Download Arabidopsis thaliana GFF3 file for all genomic features of Chr 1 in GFF3 format
What is GTF file format?
- GTF (Gene Transfer Format) file format is similar in structure as in GFF2 format
- As GFF3, GTF also represents the genomic features in a simple text-based tab-delimited file
- GTF has similar field information as described above for GFF3 with some changes in the attributes field
- GTF requires CDS, start_codon, and stop_codon in the feature field. UTR (five_prime_UTR and three_prime_UTR), inter (intergenic region), and exon feature types are optional.
- In the attribute field, gene_id and transcript_id tags are required for each feature type
- In the attribute field, the text values of each tag must be double-quoted, which is not necessary for GFF3
Representation of genomic features of plant Arabidopsis thaliana in GTF format (only one gene has shown) downloaded from Ensembl Plants database
#!genome-build TAIR10
#!genome-version TAIR10
#!genome-date 2008-04
#!genome-build-accession GCA_000001735.1
#!genebuild-last-updated 2010-09
1 araport11 gene 3631 5899 . + . gene_id "AT1G01010"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding";
1 araport11 transcript 3631 5899 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1 araport11 exon 3631 3913 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon1";
1 araport11 CDS 3760 3913 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1 araport11 start_codon 3760 3762 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1 araport11 exon 3996 4276 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon2";
1 araport11 CDS 3996 4276 . + 2 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "2"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1 araport11 exon 4486 4605 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon3";
1 araport11 CDS 4486 4605 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "3"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1 araport11 exon 4706 5095 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon4";
1 araport11 CDS 4706 5095 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "4"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1 araport11 exon 5174 5326 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon5";
1 araport11 CDS 5174 5326 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "5"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1 araport11 exon 5439 5899 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; exon_id "AT1G01010.1.exon6";
1 araport11 CDS 5439 5627 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding"; protein_id "AT1G01010.1";
1 araport11 stop_codon 5628 5630 . + 0 gene_id "AT1G01010"; transcript_id "AT1G01010.1"; exon_number "6"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1 araport11 five_prime_utr 3631 3759 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
1 araport11 three_prime_utr 5631 5899 . + . gene_id "AT1G01010"; transcript_id "AT1G01010.1"; gene_name "NAC001"; gene_source "araport11"; gene_biotype "protein_coding"; transcript_source "araport11"; transcript_biotype "protein_coding";
Convert GFF3 to GTF file format
- We will use
bioinfokit v0.9.8
or later - Check bioinfokit documentation for latest version.
- Download dataset
# you can use interactive python interpreter, jupyter notebook, google colab, spyder or python code
# I am using interactive python interpreter (Python 3.8.2)
>>> from bioinfokit.analys import gff
>>> gff.gff_to_gtf(file="Athaliana_167_TAIR10.gene_chr1.gff3")
# converted gtf file will be saved in same directory (Athaliana_167_TAIR10.gene_chr1.gtf)
# Note: if mRNA feature type (column 3) is defined by other names than 'mRNA' or 'transcript' in your GFF3 file, you
# can use the option trn_feature_name to pass that feature name to gff_to_gtf
# see here https://reneshbedre.github.io/blog/howtoinstall.html#gff3-to-gtf-file-format-conversion
References
- https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
- https://mblab.wustl.edu/GTF22.html
- https://useast.ensembl.org/info/website/upload/gff.html
This work is licensed under a Creative Commons Attribution 4.0 International License