DNA Biology and Bioinformatics Camp

Annotate bacterial genome sequences (predict genes and their functions) using Artemis and Prokka

Today we will be learning about genome annotation. First we will use the program Artemis to look at how to annotate a gene. Then we will use the program Prokka, which is designed for annotating bacterial genomes, to annotate our Agrobacterium tumefaciens genome assemblies from yesterday.

First, let’s copy your genome assembly from crick to your desktop using FileZilla. Connect to crick with FileZilla, navigate to the folder genome_assembly/kmer75assembly/, and download the file contigs.fa by dragging it to your desktop.

Next, open the program Artemis by double-clicking its icon on your desktop. Click on the File menu and click “Open ...”

Next, find your contigs.fa file in your desktop folder in the file open window and click “Open”.

A new window will appear showing the regions of the genome.

This is your assembled genome sequence. All of the assembled contigs will be displayed next to each other (but not in the correct order). Scroll around to see the different sequences.

Now let’s try to find an open reading frame in your genome sequence.

Click on the “Create” menu and select the option “Mark Open Reading Frames ...”

Change the minimum open reading frame size to 100 if it isn’t already and click OK.

This will display all of the open reading frames (regions starting with “ATG” and ending with a stop codon) in your genome sequence as blue boxes.
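To make the idea of an open reading frame concrete, here is a minimal sketch of an ORF scan written with awk. It is not how Artemis works internally. The function name find_orfs, the toy sequence, and the minimum length (counted here in bases) are all made up for illustration. The sketch only scans the three forward reading frames; a complete search also checks the three frames on the reverse strand.

```shell
# Toy sequence (hypothetical, not from the assembly).
seq="CCATGAAACCCGGGTAATT"

# find_orfs SEQUENCE MIN_LENGTH
# Prints one line per ORF: reading frame, start position, end position, ORF sequence.
find_orfs() {
  printf '%s\n' "$1" | awk -v min="$2" '
    {
      s = toupper($0); n = length(s)
      for (frame = 0; frame < 3; frame++) {
        start = 0                          # 0 means we are not inside an ORF
        for (p = frame + 1; p + 2 <= n; p += 3) {
          codon = substr(s, p, 3)
          if (start == 0 && codon == "ATG") {
            start = p                      # the first ATG opens an ORF
          } else if (start > 0 && (codon == "TAA" || codon == "TAG" || codon == "TGA")) {
            len = p + 2 - start + 1        # length in bases, stop codon included
            if (len >= min) print frame + 1, start, p + 2, substr(s, start, len)
            start = 0                      # the stop codon closes the ORF
          }
        }
      }
    }'
}

orfs=$(find_orfs "$seq" 9)
echo "$orfs"
```

The toy sequence contains one ORF, in the third reading frame: ATG AAA CCC GGG TAA.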

Let’s find out what one of these regions might be. Pick an open reading frame and click on it to select it.

Click the menu option “Run”, then select “NCBI Searches” and then “blastx”. Then click “OK” in the window that appears.

This will perform a BLAST search of a translation of your chosen open reading frame sequence against a database of known protein sequences.

The search may take a while to run. When it finishes, a new window will pop up with the results of your BLAST search.

Look at the first few results of your BLAST search. What function might your chosen sequence have?

It might not be a gene at all. If you just see genome sequences or no results, try again with a different open reading frame.

Try looking around the genome sequence using Artemis and see what you find.

As you can see, it would take a long time to annotate every gene in your assembly this way. Fortunately there are programs that automate all of the annotation steps for you. Let’s try annotating our genome assemblies from yesterday with Prokka.

First, close Artemis, then open PuTTY and log in to crick.

Change directory to the folder containing your genome assembly from yesterday:

cd genome_assembly/kmer75assembly/

You should be in a folder that contains your genome assembly contigs (“contigs.fa”). Run the ls command to see all of the files.
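It can also be useful to check how many contigs your assembly contains. The sketch below builds a tiny stand-in file with made-up contents so that it runs anywhere; the sequence IDs in your own file may look different. On crick you would run the grep commands directly on your own contigs.fa.

```shell
# Build a tiny stand-in FASTA file (hypothetical contents) in a temporary folder.
workdir=$(mktemp -d)
printf '>NODE_1_length_12_cov_5.0\nACGTACGTACGT\n>NODE_2_length_8_cov_3.5\nGGCCGGCC\n' > "$workdir/contigs.fa"

# Every sequence in a FASTA file starts with a ">" line,
# so counting those lines counts the contigs.
n_contigs=$(grep -c '^>' "$workdir/contigs.fa")
echo "Number of contigs: $n_contigs"

# Show the first few sequence IDs.
grep '^>' "$workdir/contigs.fa" | head -n 5

# Remove the temporary folder.
rm -rf "$workdir"
```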

First, we need to rename the sequence IDs in our assembly so that they will work with Prokka, which requires short sequence IDs. Run the following command:

awk '/^>/{print ">contig_" ++i; next}{print}' < contigs.fa > contigs_number.fa

This will make a new file called contigs_number.fa that contains all of your assembled sequences, with their IDs replaced by numbered names (contig_1, contig_2, etc.).
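If you want to confirm that the renaming worked, you can compare the two files. The sketch below runs the same awk command on a small made-up file (the sequence IDs and sequences are hypothetical) and checks that only the ID lines changed. On crick you would run the grep commands on your own contigs.fa and contigs_number.fa.

```shell
# Build a tiny stand-in FASTA file (hypothetical contents) in a temporary folder.
workdir=$(mktemp -d)
printf '>NODE_1_length_12_cov_5.0\nACGTACGTACGT\n>NODE_2_length_8_cov_3.5\nGGCCGGCC\n' > "$workdir/contigs.fa"

# Same awk command as above: each ">" line becomes ">contig_N",
# and every other line is copied unchanged.
awk '/^>/{print ">contig_" ++i; next}{print}' < "$workdir/contigs.fa" > "$workdir/contigs_number.fa"

# List the new sequence IDs.
new_ids=$(grep '^>' "$workdir/contigs_number.fa")
echo "$new_ids"

# Both files should contain the same number of sequences.
before=$(grep -c '^>' "$workdir/contigs.fa")
after=$(grep -c '^>' "$workdir/contigs_number.fa")
echo "Sequences before: $before, after: $after"

# The sequence lines themselves should be identical.
seqs_before=$(grep -v '^>' "$workdir/contigs.fa")
seqs_after=$(grep -v '^>' "$workdir/contigs_number.fa")

# Remove the temporary folder.
rm -rf "$workdir"
```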

To run Prokka, run the following command:

prokka contigs_number.fa --outdir annotated_genome --prefix Atumefaciens_assembly --force

Next, change directories to the new folder Prokka created:

cd annotated_genome

and list the files in the directory:

ls

You should see several files, including Atumefaciens_assembly.faa and Atumefaciens_assembly.fna. The “.faa” file contains the protein sequences of all of the genes that Prokka predicted and annotated in your assembly. The “.fna” file contains the DNA sequences of your contigs. The “.gff” file contains the annotations themselves (the location and predicted function of each gene).

Let’s take a look at a few of the (probable) proteins Prokka found in your genome (press q to quit less when you are done):

less Atumefaciens_assembly.faa

What is the predicted function of the first protein found?
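You can also summarize the protein file without scrolling through it. The sketch below uses a small made-up stand-in for the “.faa” file; the sequence IDs, protein names and sequences are hypothetical. On crick you would run the grep commands on Atumefaciens_assembly.faa instead.

```shell
# Build a tiny stand-in protein FASTA file (hypothetical contents) in a temporary folder.
workdir=$(mktemp -d)
printf '>EXAMPLE_00001 hypothetical protein\nMKT\n>EXAMPLE_00002 Chromosomal replication initiator protein DnaA\nMSL\n>EXAMPLE_00003 hypothetical protein\nMAA\n' > "$workdir/example.faa"

# Count all predicted proteins (one ">" line per protein).
total=$(grep -c '^>' "$workdir/example.faa")

# Count the proteins that have no predicted function.
hypothetical=$(grep -c '^>.*hypothetical protein' "$workdir/example.faa")

echo "Predicted proteins: $total"
echo "Hypothetical proteins: $hypothetical"

# Show the ID and predicted function of the first few proteins.
grep '^>' "$workdir/example.faa" | head -n 5

# Remove the temporary folder.
rm -rf "$workdir"
```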

If you have time, you can copy the Atumefaciens_assembly.gff file (which contains both your contig sequences and Prokka’s annotations) to your desktop using FileZilla and load it into Artemis to look at your annotations.