Bioinformatics file readers and processing (FASTA, FASTQ, and VCF)
Bioinformatics file readers
- We will discuss how to read and process common bioinformatics files
- We will use
bioinfokit v2.0.4
or later - Check bioinfokit documentation for installation and documentation
FASTA reader
# you can use interactive python interpreter, jupyter notebook, google colab, spyder or python code
# I am using interactive python interpreter (Python 3.8.2)
from bioinfokit.analys import Fasta
fasta_iter = Fasta.fasta_reader(file='fasta_file')
# read fasta file
for record in fasta_iter:
header, sequence = record
print(header, sequence)
# get sequence length
sequence_len = len(sequence)
# count A bases
a_base = sequence.count('A')
print(a_base)
# reverse complementary
rev_com_seq = fasta.rev_com(seq=sequence)
print(rev_com_seq)
Split FASTA file into multiple FASTA files
Split one big FASTA into multiple smaller FASTA files
from bioinfokit.analys import Fasta
# split one big FASTA into 3 subset FASTA files
Fasta.split_fasta(file='fasta_file', n=3)
FASTQ reader
from bioinfokit.analys import fastq
fastq_iter = fastq.fastq_reader(file='fastq_file')
# read fastq file
for record in fastq_iter:
# get sequence headers, sequence, and quality values
header_1, sequence, header_2, qual = record
# get sequence length
sequence_len = len(sequence)
# count A bases
a_base = sequence.count('A')
print(sequence, qual, a_base, sequence_len)
VCF (variant call format) reader
from bioinfokit.analys import marker
# id is for column name with chromosome identifier in VCF file
vcf_iter = marker.vcfreader(file='vcf_file', id='#CHROM')
# read vcf file
for record in vcf_iter:
# get info lines, header, and variant records
headers, info_lines, marker_lines = record
# get variant records with specified chromosome name
# change chr name as per need
if marker_lines[0] == 'chr':
print(marker_lines)
# retrieve markers within specified positions
# change chr, pos1, and pos2 variable as per need
if marker_lines[0]=='SL4.0ch00' and int(marker_lines[1]>=764654) and int(marker_lines[1])<=1038399:
print(marker_lines)
This work is licensed under a Creative Commons Attribution 4.0 International License