Week 1, Day 4: 10:30am DNA Biology and Bioinformatics Camp

Bioinformatics: Gene Annotations


First, here is an example of a gene model. In this representation, the thick red bars denote the Open Reading Frame (ORF), which is the part that codes for the protein. The other part of the gene, the white bars denotes "untranslated regions" called UTRs:

Thioredoxin H
This gene is Poplar, the tree. Poplar is a model tree system. It is a famous example of a very large organism with a very small genome.

Here is the web portal for the Gramene database for Poplar: Poplar Gramene Database There is a link in the left hand side for the Genome Assembly fasta files. It takes you to this page of the fasta file here. A fasta file is a format for storing biological sequences. Here is an example:
>sequence1
TGCTGAGATCGTACGATCGATCGACTGACTGCAGCTCA
>sequence2
ACGTTTGGACACATATCTACATCATAC
So a fasta file consists of a definition line "defline" for each sequence consisting of a greater than sign ">" and a unique ID, followed by the sequence.

We can download a chromosome fasta file from the Genome Assembly page. Since the gene Thioredoxin H is on chromosome 6, let's download that file. Let's not download to your computer, but rather download to the server with the command "wget". Paste this command into your terminal: wget ftp://ftp.ensemblgenomes.org/pub/plants/release-27/fasta/populus_trichocarpa/dna/Populus_trichocarpa.JGI2.0.27.dna.chromosome.6.fa.gz

We'll need to unzip this file with the command:

gunzip Populus_trichocarpa.JGI2.0.27.dna.chromosome.6.fa.gz

Next, let's look at how the annotations are stored. One example of a gene annotation file is the GFF file. Here is an example of what a GFF file looks like:

##gff-version 3
ctg123  .  exon  1300  1500  .  +  .  ID=exon00001
ctg123  .  exon  1050  1500  .  +  .  ID=exon00002
ctg123  .  exon  3000  3902  .  +  .  ID=exon00003
ctg123  .  exon  5000  5500  .  +  .  ID=exon00004
ctg123  .  exon  7000  9000  .  +  .  ID=exon00005
Here is the Gramene Poplar database of GFF annotations:GFF database. Let's download the annotation for chromosome 6, but again let's not download to your computer, but rather to the server with "wget":

wget ftp://ftp.ensemblgenomes.org/pub/plants/release-27/gff3/populus_trichocarpa/Populus_trichocarpa.JGI2.0.27.chromosome.6.gff3.gz

We need to unzip the file first:

gunzip Populus_trichocarpa.JGI2.0.27.chromosome.6.gff3.gz

You can look at this huge file with "less" using the command:

less Populus_trichocarpa.JGI2.0.27.chromosome.6.gff3

However, for simplicity, let's get out the lines corresponding to the gene ID for Thioredoxin H and redirect the contents into a new file:

grep POPTR_0006s11060.1 Populus_trichocarpa.JGI2.0.27.chromosome.6.gff3 > Trxh4.gff3

Now, let's open a python shell by typing "python":

python

Let's import the SeqIO object: >>> from Bio import SeqIO We need to first read in the chromosome sequence "chr6" into a SeqRecord object:

>>> chr6 = SeqIO.read(open("Populus_trichocarpa.JGI2.0.27.dna.chromosome.6.fa"),"fasta")

What does this object look like? You can see by typing "chr6". Now let's get a substring corresponding to the CDS regions:

>>> cds1 = chr6[8388442-1:8388462].reverse_complement()

Note the -1 term for the first postion. This is to convert from 1-based to 0-based coordinates. We don't do this for the second position because substrings in python exclude the last position.

>>> cds2 = chr6[8386527-1:8386649].reverse_complement()

>>> cds3 = chr6[8386068-1:8386190].reverse_complement()

>>> cds4 = chr6[8385582-1:8385734].reverse_complement()

>>> cds = cds1 + cds2 + cds3 + cds4

>>> cds

You will see the following:

SeqRecord(seq=Seq('ATGGGACTTTGCTTGGATAAGCATAAACATGATGCAGACAATGATGAGCTGCAT...TGA', SingleLetterAlphabet()), id='', name='', description='', dbxrefs=[])

This is weird, but the cds is SeqRecord object, but we want to convert to a Seq object. The SeqRecord object contains a Seq object as the first entry labeled "seq=...". This means we can get the Seq object by this command:

>>> DNA = cds.seq

Now let's transcribe and translate like we did yesterday:

>>> RNA = DNA.transcribe()

>>> Protein = RNA.translate()

Now let's download a python script that puts all this together:

wget dnacamp.cgrb.oregonstate.edu/translateGFF.py

Move the script into your scripts directory:

mv translateGFF.py scripts/.

Let's take a look at this script. open it with "less":

less scripts/translateGFF.py

Here is what the script looks like:

from Bio.Seq import Seq
from Bio import SeqIO
import sys

if len(sys.argv) == 2 or "-h" in sys.argv or "--help" in sys.argv:
    print >> sys.stderr, "Usage: parseGFF.py  "
    sys.exit()

fasta = sys.argv[1]
gff = sys.argv[2]

chrom = SeqIO.read(open(fasta),"fasta")

rc = False
coords = []
for line in open(gff,"r"):
    if "#" in line: continue
    chromname,source,seqtype,start,stop,score,strand,frame,info = line.strip().split("\t")
    if "CDS" in seqtype:
        coords.append((int(start),int(stop)))
    rc = strand is "-"    
    
coords.sort(reverse = rc)

ORF = Seq('')
for start,stop in coords:
    CDS = chrom[start-1:stop]
    if rc: 
        CDS = chrom[start-1:stop].reverse_complement()
    ORF += CDS.seq

RNA = ORF.transcribe()
Protein = RNA.translate()

print Protein


Finally, let's run the script!

python ~/scripts/translateGFF.py Populus_trichocarpa.JGI2.0.27.dna.chromosome.6.fa Trxh4.gff3