Week 2, Day 4: 10:00am DNA Biology and Bioinformatics Camp

Assembly of bacterial genome sequences

PowerPoint Slides

Today we will be assembling the sequencing reads from our Agrobacterium tumefaciens genomic DNA extractions from last week. The samples were sequenced on an Illumina MiSeq over the weekend, and today we will be assembling the short reads into longer genome sequences using the program Velvet.

First, look at the files in the current directory using the ls command:


There is a new folder called agro_reads. This folder contains the forward and reverse sequencing reads from the MiSeq run of your Agrobacterium genome samples. These reads have been trimmed so that only high quality reads remain. Change into that directory with the command:

cd agro_reads

and type


You should see two files, named Atumefaciens_####_R1.fastq and Atumefaciens_####_R2.fastq, where #### is your username without “_ws”. For example, my username is “weisbeal_ws”, therefore my file would be Atumefaciens_weisbeal_R1.fastq. These files contain the your quality-trimmed forward and reverse sequencing reads from the MiSeq run.

Let’s look at what some of these reads look like by typing the following command:

less Atumefaciens_####_R1.fastq

where #### is part of your username. Each sequence has 4 lines, the first starts with “@” and is the identifier for that sequence. The second line contains the actual DNA sequence. The fourth line contains the quality score for each base in the sequence. Each character corresponds to a number that represents the quality of that DNA base being correct.

Lets make a new directory for our genome assembly. First, go back to your home directory

cd ..

and make a new folder called genome_assembly:

mkdir genome_assembly

and change directory into that folder

cd genome_assembly

To assemble these reads into longer genome sequences, we will use the program Velvet. This program can take a long time to run since it has to piece together millions of reads. Velvet has two parts, velveth and velvetg.

Here are the commands to run Velvet:

velveth ./kmer75assembly 75 -shortPaired -fastq -separate ../agro_reads/Atumefaciens_####_R1.fastq ../agro_reads/Atumefaciens_####_R2.fastq

where #### is part of your username. Once velveth is done running, then run:

velvetg ./kmer75assembly

If there is an error message, please ask one of the TAs for help and they will help you get it running again.

Once the velvet commands have finished running, let’s check and see what output it produced.

Type the ls command again to see what files are in your current directory.


You should see a folder called kmer75assembly. This folder contains your assembled genome and other files from the assembly process. Change directories to that folder and list the files inside.

cd kmer75assembly


You should see a file named contigs.fa. This file contains your assembled genome sequences.

Let’s look at the very beginning of this file to see what they look like. Type the following command:

less ./contigs.fa

These are your assembled contigs. Unfortunately we cannot piece together all of the contigs together because the sequencing reads are so short. To do that we would need to do several more sequencing runs with longer reads. However this gives us a majority of the genome of our organism.

Let’s take a look now at some of the statistics from our genome assembly. Run the following command:

assemblathon_stats.pl ./contigs.fa

This shows various statistics of the results of your Velvet assembly. Look for the line that says “Number of scaffolds”. This number is how many pieces (or contigs) your final genome assembly was assembled into. The lower this number is the better.

Look for the line that says “Total number of bases in assembly”. This is the total length of your assembled genome sequence in nucleotide bases. This number should be around 5 million, the estimated size of the Agrobacterium tumefaciens genome.

Also look at the line that says “Longest scaffold”. This is the length of your longest contig.

If you finish early, try assembling your genome again using a different “kmer” size with the velveth command and then the velvetg command. That is, try replacing the number 75 with an odd number between 19 and 149 and see if the number of scaffolds you get changes for better or for worse. A kmer is the length of DNA sequence that velvet will split your reads into before aligning them. For example, a kmer size of 75 (or “75-mer”) will align pieces of your reads together that are 75 bases long.

Be sure to go back up a folder (cd ..) before starting again, and pick a new name for your assembly folder in the velveth and velvetg commands (ie. “kmer149assembly”) so that your old assembly does not get overwritten.

Which kmer size gave you the best assembly?