Week 2, Day 3: 11:00am DNA Biology and Bioinformatics Camp

Writing Software for Bioinformatics: BLAST and Scatter Plots

BlobCommands Word Doc

Today we will use unix, so today we will connect to the CGRB servers. Click on PuTTY icon and for "Host name" type:


Click Open.

At the Security Alert, click Yes.

Once the terminal (black window opens), at "login as:" type YourName_ws and then type your password. (or if you are new, use passwd to reset it from the default)

Now make a directory for today's work:

mkdir YourName_Blob

cd YourName_Blob

Just to be sure you are in your new directory, type:


Now copy the files from our teaching folder to your directory. Use "cp" for copy, and then the full path to the file, with "*" for any characters and "space ." to say put it here.

cp /nfs1/Teaching/CGRB/rnaseq_analysis_spring_2015/*.fa .

cp /nfs1/Teaching/CGRB/rnaseq_analysis_spring_2015/*.pl .

cp /nfs1/Teaching/CGRB/rnaseq_analysis_spring_2015/BlobCom.txt .

* handy unix tips: try using "Tab" above to complete e.g."/nfs1/T" and "Tab"

* handy unix tips: try using the "up arrow" rather than retyping each time

Now, to see that you got all your files transferred, type:


Now take a quick (5 seconds) look at your files to see what we're working with.

less Ppenetrans_genome.fa

type q to quit

less Xamericanum_genome.fa

type q to quit

less Bacteria_genomes.fa

type q to quit

To see all the bacteria names, use "grep" and search for ">", a character before each name:

grep '>' Bacteria_genomes.fa

For the remaining steps, you might like to "copy" and "paste" from "BlobCom.txt".

To save time, half of the group will analyze Ppenetrans and the other half will analyze Xamericanum.

We will pause here and assign groups to Pp and Xa.

Now we will calculate %GC and coverage (see Glossary at bottom). We do this with a simple program in perl. This perl script reads a big "sam" file containing reads an outputs our table:

perl sam_len_cov_gc.pl -s /nfs1/Teaching/CGRB/rnaseq_analysis_spring_2015/Pp.sam -f Ppenetrans_genome.fa -o Pp

perl sam_len_cov_gc.pl -s /nfs1/Teaching/CGRB/rnaseq_analysis_spring_2015/Xa.sam -f Xamericanum_genome.fa -o Xa

Now look at your output files:

less Pp.lencovgc.txt

type q to quit

less Xa.lencovgc.txt

type q to quit

The columns are "name, contig name, contig length, coverage, proportion GC".

Now we find matches to bacteria using BLAST to align them.

First we make the bacterial database:

makeblastdb -dbtype nucl -in Bacteria_genomes.fa

Then we BLAST the nematode genome against the bacterial database:

blastn -db Bacteria_genomes.fa -query Ppenetrans_genome.fa -evalue 1e-4 -max_target_seqs 1 -num_threads 1 -outfmt 6 -out Pp.BLAST.txt

blastn -db Bacteria_genomes.fa -query Xamericanum_genome.fa -evalue 1e-4 -max_target_seqs 1 -num_threads 1 -outfmt 6 -out Xa.BLAST.txt

Now take a quick (5 second) look at your output files:

less Pp.BLAST.txt

type q to quit

less Xa.BLAST.txt

type q to quit

The table columns are "contig name, bacterial species match, % similarity, match length, ..." and other information.

Now, we filter the table, keeping only longer matches.

(Explanation: "cat" feeds the file into the "|" pipe which runs a simple unix mini-program "awk" taking rows with column 4 values above 80)

cat Pp.BLAST.txt | awk '{if($4>80){print $0}}' > Pp.BIG.txt

cat Xa.BLAST.txt | awk '{if($4>80){print $0}}' > Xa.BIG.txt

Take a quick look at the output using "less".

Finally, combine the two tables (1) blast output and (2) the coverage and %GC values using another unix mini-program "awk".

(Explanation: this takes the second column $2 (contig names) in our perl output file and matches it with the first column (contig names) from our blast output file $1 and appends column $2 values from the blast (taxon names).)

awk 'NR==FNR{a[$1]=$2;next};{print$0,$2 in a?a[$2]:"A"}' Pp.BIG.txt Pp.lencovgc.txt | column -t > Pp_plot.txt

awk 'NR==FNR{a[$1]=$2;next};{print$0,$2 in a?a[$2]:"A"}' Xa.BIG.txt Xa.lencovgc.txt | column -t > Xa_plot.txt Sort the output by reverse order on the last column (so it will plot better).

sort -k 6 Pp_plot.txt > Pp_plot.sort

sort -k 6 Xa_plot.txt > Xa_plot.sort

Finally, we can plot this in RStudio. But we must get it off crick and onto your computer.

Use Filezilla to do this by sescure ftp. Open Filezilla from the desktop, and for "Host", type:


for "Username", type: Yourlogin_ws , Then enter your Password and for "Port" enter 732. Then hit "Quickconnect".

Navigate to your own folder on the crick server, until you see your file. Drag the file(s) you made Pp_plot.sort or Xa_plot.sort onto your Desktop.


To find RStudio, go to to the bottom left START -> All Programs -> Statistics -> RStudio -> RStudio. Click on RStudio to open it.

On the upper right window pane called "Environment" select "Import Dataset" and then "From Text File" and browse to your computer for Pp_plot.sort or Xa_plot.sort.

You will find these under "Desktop" -> Computer# [your computer today] -> Desktop. But before you click Import, change the "File name" at the top to "Pp" or "Xa" (depending on which species you analyzed).

Now, in RStudio's "Console" window pane on the bottom left, rename the columns like this:

names(Pp) <- c("name", "contig", "length", "cov","gc", "taxon")

names(Xa) <- c("name", "contig", "length", "cov","gc", "taxon")

Again, in "Console" create a your first black and white blob plot (scatter plot) with "gc" on the x-axis (%GC content of each contig) and "cov" on the y-axis (COVERAGE), by typing:

plot(Pp$gc, Pp$cov, ylim=c(0.5,60), xlim=c(0.15,0.79), pch=1,cex=0.6)

plot(Xa$gc, Xa$cov, ylim=c(0.5,60), xlim=c(0.15,0.79), pch=1,cex=0.6)

Now, add colors to the scatter plot for each bacteria:

plot(Pp$gc, Pp$cov, ylim=c(0.5,60), xlim=c(0.15,0.79), col=(Pp$taxon), pch=1, cex=0.6)

Finally, add a legend:

legend(0.15,60,unique(Pp$taxon),col=unique(Pp$taxon),pch=1,cex=0.6) or

plot(Xa$gc, Xa$cov, ylim=c(0.5,700), xlim=c(0.15,0.79), col=(Xa$taxon), pch=1, cex=0.6)


Questions: What could cause organisms to have different %GC?

What could cause organisms to have different coverage?

Did you have any idea about the endosymbionts before we added color?

We will talk about everyone's results at this point.

Glossary (See Figure):

Genome – the true chromosomal sequence(s)

Contig – our best guess at the true genome, based on overlapping reads – usually the genome is in pieces or fragments.

Read – our raw data (sequences) from any sequencing technology, e.g. Illumina

Coverage – the average number of nucleotides (reads) we have for each position in the genome (or contig)

%GC – the percent of G+C (i.e. #Gs + #Cs / #nucleotides)