First, here is an example of a gene model. In this representation, the thick red bars denote the Open Reading Frame (ORF), which is the part that codes for the protein. The other part of the gene, the white bars denotes "untranslated regions" called UTRs:
Thioredoxin H
This gene is Poplar, the tree. Poplar is a model tree system. It is a famous example of a very large organism with a very small genome.
Here is the web portal for the Gramene database for Poplar:
Poplar Gramene Database
There is a link in the left hand side for the Genome Assembly fasta files. It takes you to this page of the fasta file
here. A fasta file is a format for storing biological sequences. Here is an example:
>sequence1
TGCTGAGATCGTACGATCGATCGACTGACTGCAGCTCA
>sequence2
ACGTTTGGACACATATCTACATCATAC
So a fasta file consists of a definition line "defline" for each sequence consisting of a greater than sign ">" and a unique ID, followed by the sequence.
We can download a chromosome fasta file from the Genome Assembly page. Since the gene Thioredoxin H is on chromosome 6, let's download that file. Let's not download to your computer, but rather download to the server with the command "wget". Paste this command into your terminal:
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-27/fasta/populus_trichocarpa/dna/Populus_trichocarpa.JGI2.0.27.dna.chromosome.6.fa.gz
We'll need to unzip this file with the command:
gunzip Populus_trichocarpa.JGI2.0.27.dna.chromosome.6.fa.gz
Next, let's look at how the annotations are stored. One example of a gene annotation file is the GFF file. Here is an example of what a GFF file looks like:
##gff-version 3
ctg123 . exon 1300 1500 . + . ID=exon00001
ctg123 . exon 1050 1500 . + . ID=exon00002
ctg123 . exon 3000 3902 . + . ID=exon00003
ctg123 . exon 5000 5500 . + . ID=exon00004
ctg123 . exon 7000 9000 . + . ID=exon00005
Here is the Gramene Poplar database of GFF annotations:
GFF database. Let's download the annotation for chromosome 6, but again let's not download to your computer, but rather to the server with "wget":
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-27/gff3/populus_trichocarpa/Populus_trichocarpa.JGI2.0.27.chromosome.6.gff3.gz
We need to unzip the file first:
gunzip Populus_trichocarpa.JGI2.0.27.chromosome.6.gff3.gz
You can look at this huge file with "less" using the command:
less Populus_trichocarpa.JGI2.0.27.chromosome.6.gff3
However, for simplicity, let's get out the lines corresponding to the gene ID for Thioredoxin H and redirect the contents into a new file:
grep POPTR_0006s11060.1 Populus_trichocarpa.JGI2.0.27.chromosome.6.gff3 > Trxh4.gff3
Now, let's open a python shell by typing "python":
python
Let's import the SeqIO object:
>>> from Bio import SeqIO
We need to first read in the chromosome sequence "chr6" into a SeqRecord object:
>>> chr6 = SeqIO.read(open("Populus_trichocarpa.JGI2.0.27.dna.chromosome.6.fa"),"fasta")
What does this object look like? You can see by typing "chr6". Now let's get a substring corresponding to the CDS regions:
>>> cds1 = chr6[8388442-1:8388462].reverse_complement()
Note the -1 term for the first postion. This is to convert from 1-based to 0-based coordinates. We don't do this for the second position because substrings in python exclude the last position.
>>> cds2 = chr6[8386527-1:8386649].reverse_complement()
>>> cds3 = chr6[8386068-1:8386190].reverse_complement()
>>> cds4 = chr6[8385582-1:8385734].reverse_complement()
>>> cds = cds1 + cds2 + cds3 + cds4
>>> cds
You will see the following:
SeqRecord(seq=Seq('ATGGGACTTTGCTTGGATAAGCATAAACATGATGCAGACAATGATGAGCTGCAT...TGA', SingleLetterAlphabet()), id='', name='', description='', dbxrefs=[])
This is weird, but the cds is SeqRecord object, but we want to convert to a Seq object. The SeqRecord object contains a Seq object as the first entry labeled "seq=...". This means we can get the Seq object by this command:
>>> DNA = cds.seq
Now let's transcribe and translate like we did yesterday:
>>> RNA = DNA.transcribe()
>>> Protein = RNA.translate()
Now let's download a python script that puts all this together:
wget dnacamp.cgrb.oregonstate.edu/translateGFF.py
Move the script into your scripts directory:
mv translateGFF.py scripts/.
Let's take a look at this script. open it with "less":
less scripts/translateGFF.py
Here is what the script looks like:
from Bio.Seq import Seq
from Bio import SeqIO
import sys
if len(sys.argv) == 2 or "-h" in sys.argv or "--help" in sys.argv:
print >> sys.stderr, "Usage: parseGFF.py "
sys.exit()
fasta = sys.argv[1]
gff = sys.argv[2]
chrom = SeqIO.read(open(fasta),"fasta")
rc = False
coords = []
for line in open(gff,"r"):
if "#" in line: continue
chromname,source,seqtype,start,stop,score,strand,frame,info = line.strip().split("\t")
if "CDS" in seqtype:
coords.append((int(start),int(stop)))
rc = strand is "-"
coords.sort(reverse = rc)
ORF = Seq('')
for start,stop in coords:
CDS = chrom[start-1:stop]
if rc:
CDS = chrom[start-1:stop].reverse_complement()
ORF += CDS.seq
RNA = ORF.transcribe()
Protein = RNA.translate()
print Protein
Finally, let's run the script!
python ~/scripts/translateGFF.py Populus_trichocarpa.JGI2.0.27.dna.chromosome.6.fa Trxh4.gff3