
7. Example A: Using RescueNet in Annotation

The following example shows how RescueNet can be used as part of an annotation process to give an indication of where coding regions may exist. In this example, we are working with the unannotated, contiguous genome sequence of B. suis (13). Note that we did not rely in any way on BLAST searches or any gene prediction software other than RescueNet.

In order to sample the prevalent codon usage patterns in the genome, we wish to train a SOM using a sample of genes from B. suis. This initial training set can come from the output of other gene prediction programs, or from a list of previously known genes in the organism, or even from a list of all ORFs existing in the genome. Within this set of genes, we are choosing to train the SOM using only those ORFs which are over 400 codons in length. These longer ORFs are more likely to be real genes, and should be representative of the codon usage patterns that exist in the coding regions of the genome. The sequences will become the training set for the SOM using the following command:

RescueNet -t Bsuis.seq -sep 3 400 -size 10 -epochs 500 -saveas B.suis10x10.som

Analysing this command, we see that “Bsuis.seq” (available for download, 3MB) is the FASTA format file that holds our training set of ORFs. The ‘-sep 3 400’ option tells the program to use only samples of length 400 codons or more. The SOM is a 10x10 node map (‘-size 10’), is trained for 500 cycles (‘-epochs 500’), and is saved as “B.suis10x10.som” (available for download).
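To make the length filter concrete, the following is a minimal Python sketch of selecting only long ORFs from FASTA-formatted input. It is not RescueNet's own code: the function names `read_fasta` and `long_orfs` are hypothetical, and it assumes that length is measured as whole codons (sequence length divided by 3) and that "400 codons or more" is the cut-off.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def long_orfs(records, min_codons=400):
    """Keep records with at least min_codons whole codons (assumed cut-off)."""
    return [(h, s) for h, s in records if len(s) // 3 >= min_codons]


# Toy input: one ORF of 450 codons and one of 120 codons.
example = [">orf1", "ATG" * 450, ">orf2", "ATG" * 120]
kept = long_orfs(read_fasta(example))
print([h for h, _ in kept])  # prints ['orf1']
```

In practice the lines would come from an open file handle such as `open("Bsuis.seq")` rather than from an in-memory list.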

Once training is complete, we wish to analyse the contiguous genome sequence of B. suis using the SOM we have just trained. RescueNet does this by dividing the contiguous sequence up into smaller samples in all 6 reading frames, so the user must ensure that they have sufficient hard disk space for running this option (see System Requirements). Each sample is tested by the SOM and results are saved in tab-delimited files. The command used is:

RescueNet -annot B_suis_chr1 -som B.suis10x10.som -b M Bsuis.seq -ws 100 -wo 50

In this command, the file “B_suis_chr1” (available for download) is the contiguous genome sequence of B. suis chromosome 1. We are loading the SOM (“B.suis10x10.som”) that we trained in the previous step. The random dataset (see Section 5.1.2) is generated from the SOM’s training set using the ‘-b M’ option. The contiguous sequence is divided into samples of length 100 triplets (‘-ws’), offset by 50 triplets from one another (‘-wo’). The program proceeds by producing 6 sequence files (one for each reading frame) and converting the samples into RSCU values. Note that these files are not automatically removed after the program exits. Each of the 6 sample files is then tested by the SOM, and short format results (see Section 5.1.1) are generated and saved into 6 separate files (each ending in “res.txt”).
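The windowing step can be illustrated with a short Python sketch. This is only an approximation of the idea, not RescueNet's implementation: the function names are hypothetical, frames 4 to 6 are assumed to be read from the reverse complement, and any trailing partial window is assumed to be dropped.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def frame_windows(seq, window_codons=100, offset_codons=50):
    """Return {frame: [(start, window), ...]} for reading frames 1..6.

    start is a 0-based position on the strand being read: the given
    sequence for frames 1-3, its reverse complement for frames 4-6.
    """
    size, step = window_codons * 3, offset_codons * 3
    strands = (seq, reverse_complement(seq))
    result = {}
    for strand_index, strand in enumerate(strands):
        for shift in range(3):
            frame = strand_index * 3 + shift + 1
            # Only full-length windows are kept (an assumption).
            result[frame] = [
                (start, strand[start:start + size])
                for start in range(shift, len(strand) - size + 1, step)
            ]
    return result


genome = "ACGT" * 250  # toy 1,000 bp sequence
windows = frame_windows(genome)
print({frame: len(w) for frame, w in windows.items()})
# prints {1: 5, 2: 5, 3: 5, 4: 5, 5: 5, 6: 5}
```

With 100-triplet windows offset by 50 triplets, each base (away from the sequence ends) is covered by two windows per frame, which is why the number of samples, and the disk space needed to store them, grows quickly with genome size.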

As usual, these short format results are tab-delimited, and so are best opened in a spreadsheet program. Section 5.1.1 described the difference between Cosine and Probability scores. If we plot both scores over a short segment of the file, we can visualise the difference between the two scales. Figure 7.1 shows both scores plotted for the first 10,000 bp in the first forward reading frame. The peaks are the same for both types of score, but because probability scores are re-scaled, it is much easier to see high-scoring genomic areas using probability scores alone.

So in the case of our B. suis analysis, the probability scores are the most effective measure to use. We can combine the probability score columns from the 6 results files within a spreadsheet program, and graph sections of the data. A sample 25,000 bp region is shown in Figure 7.2. The published gene predictions from TIGR are shown for comparison. As may be seen from the sample, most high-scoring regions correspond well to gene-coding regions. In all, approximately 90% of known-function B. suis genes are predicted correctly using our method. However, some short genes, and many ORFs annotated as ‘hypothetical proteins’, are not. This is understandable, given that there are many reasons for atypical codon usage.

Once all the samples are processed, some simple post-processing is carried out. Naturally, all same-frame concurrent predictions are merged. Predictions that are totally overlapped by another prediction are deleted if they are less than 75% of the length of the other. Similarly, any prediction with more than half of its length overlapped is deleted if it is less than half as long as the other prediction, or if it is less than 90% as long as the other and receives a lower score. Finally, any prediction that is overlapped on both ends to a total of at least 70% overlap is also deleted.

While these rules aim to delete smaller erroneous predictions, it is recognised that the loose rules leave room for many other overlaps. However, it was found that in many overlapping cases it was difficult to decide which prediction to delete. Therefore, the best solution is to leave both predictions rather than misleading an annotator by giving one possibly erroneous prediction.
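The post-processing rules described above can be sketched in Python as follows. This is one possible reading of the rules, not RescueNet's actual code. The `Prediction` class and all function names are hypothetical, and several details are assumptions: "concurrent" is taken to mean overlapping or directly adjacent, a merged prediction keeps the higher of the two scores, the 70% figure is measured against the length of the prediction being tested, and every prediction is tested against the full merged set rather than sequentially.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prediction:
    start: int    # 0-based, inclusive
    end: int      # exclusive
    frame: int    # reading frame, 1..6
    score: float

    @property
    def length(self):
        return self.end - self.start


def overlap(a, b):
    """Number of positions shared by two predictions."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def merge_same_frame(preds):
    """Merge overlapping or adjacent predictions that share a frame."""
    merged = []
    for p in sorted(preds, key=lambda p: (p.frame, p.start)):
        last = merged[-1] if merged else None
        if last is not None and last.frame == p.frame and p.start <= last.end:
            merged[-1] = Prediction(last.start, max(last.end, p.end),
                                    p.frame, max(last.score, p.score))
        else:
            merged.append(p)
    return merged


def should_delete(p, others):
    """Apply the three deletion rules to p against every other prediction."""
    left = right = 0
    for q in others:
        ov = overlap(p, q)
        if ov == 0:
            continue
        contained = q.start <= p.start and p.end <= q.end
        # Totally overlapped and under 75% of the other's length.
        if contained and p.length < 0.75 * q.length:
            return True
        # More than half overlapped, and much shorter or shorter with lower score.
        if ov > 0.5 * p.length and (
            p.length < 0.5 * q.length
            or (p.length < 0.9 * q.length and p.score < q.score)
        ):
            return True
        # Track the largest overlap at each end for the both-ends rule.
        if not contained:
            if q.start <= p.start:
                left = max(left, ov)
            elif q.end >= p.end:
                right = max(right, ov)
    return left > 0 and right > 0 and left + right >= 0.7 * p.length


def postprocess(preds):
    merged = merge_same_frame(preds)
    return [p for p in merged
            if not should_delete(p, [q for q in merged if q is not p])]


preds = [
    Prediction(0, 300, 1, 0.9),
    Prediction(150, 450, 1, 0.8),   # same frame, overlaps the first
    Prediction(100, 250, 2, 0.95),  # lies entirely inside the merged region
    Prediction(400, 1000, 5, 0.7),  # small overlap only, so it is kept
]
for p in postprocess(preds):
    print(p.frame, p.start, p.end, p.score)
```

Because the rules are deliberately loose, pairs such as the frame 1 and frame 5 predictions in this toy input both survive, which matches the stated preference for leaving ambiguous overlaps to the annotator.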

Figure 7.2: Gene prediction in the B. suis genome (region 450Kbp-475Kbp) using a SOM trained on ORFs of over 1200bp in length. Probability scores for reading frames 4, 5 & 6 were multiplied by –1 for clarity. The yellow bars above and below the graph are regions that TIGR has annotated as being protein-coding regions.

Note that from version 0.91 onwards, a General Feature Format (GFF) file is also generated in the annotation process. This file holds predictions of where genes lie in the contiguous sequence and is based on areas that hold scores higher than a certain threshold. The GFF file is saved with the extension “.gff” and can be viewed in sequence viewers such as Artemis.
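As an illustration of how thresholded scores can be turned into GFF records, here is a minimal Python sketch. It does not reproduce the file that RescueNet writes: the threshold value, the `source` and `feature` labels, the use of "." for the phase and attribute columns, and the function names are all assumptions made for the example. Only the nine tab-separated GFF columns and 1-based inclusive coordinates follow the general GFF convention.

```python
def regions_above(windows, threshold):
    """Merge consecutive windows scoring at or above threshold.

    windows: list of (start, end, score), sorted by start,
    using 1-based inclusive coordinates.
    """
    regions = []
    for start, end, score in windows:
        if score < threshold:
            continue
        if regions and start <= regions[-1][1] + 1:
            prev_start, prev_end, prev_score = regions[-1]
            regions[-1] = (prev_start, max(prev_end, end), max(prev_score, score))
        else:
            regions.append((start, end, score))
    return regions


def gff_lines(seqid, regions, strand, source="RescueNet", feature="gene"):
    """Format regions as tab-separated, nine-column GFF lines."""
    return [
        "\t".join([seqid, source, feature, str(start), str(end),
                   f"{score:.2f}", strand, ".", "."])
        for start, end, score in regions
    ]


# Toy probability scores for overlapping windows in one forward frame.
windows = [(1, 300, 0.2), (151, 450, 0.8), (301, 600, 0.9),
           (451, 750, 0.3), (901, 1200, 0.7)]
regions = regions_above(windows, threshold=0.5)  # 0.5 is an arbitrary example
for line in gff_lines("B_suis_chr1", regions, "+"):
    print(line)
```

A file built from lines like these, saved with the “.gff” extension, is the kind of input that sequence viewers such as Artemis can display alongside the genome sequence.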


Department of Information Technology,
National University of Ireland, Galway, University Road, Galway, Ireland.
Phone: +353 (0)91 524411 ext 3549, E-mail: aaron.golden(AT)