
8. Example B: Analysis of codon usage variation

In this example, we study the causes of codon usage variation in the genome of Neisseria meningitidis and, in doing so, identify the genes whose codon usage pattern is atypical of the genome. Unlike the previous example, this example uses RescueNet’s interactive text-based menus, which are started simply by calling RescueNet from the command line.

The first step is again to train a new SOM. We do this by choosing option 2 from the main menu. We have chosen to analyse the list of predicted genes downloaded from TIGR (15). Within this dataset, we choose to train the SOM using only genes of known function. This leaves 1124 of the 2156 genes in the dataset in the training set. As shown in fig. 8.1, we then choose to train a 10x10 node SOM for 3000 epochs.
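For readers unfamiliar with what the training step does, the following is a minimal sketch of a Kohonen-style SOM trained on codon frequency vectors. It is not RescueNet's implementation: the function names (`codon_frequencies`, `best_node`, `train_som`), the Euclidean distance, the linear decay of the learning rate and neighbourhood radius, and the reading of an "epoch" as one pass over the training set are all assumptions made for illustration.

```python
import math
import random
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]  # all 64 codons


def codon_frequencies(seq):
    """Relative frequency of each of the 64 codons in one coding sequence."""
    counts = dict.fromkeys(CODONS, 0)
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3].upper()
        if codon in counts:  # ignore codons containing ambiguous bases
            counts[codon] += 1
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[c] / total for c in CODONS]


def best_node(weights, vector):
    """Grid position (row, col) of the node closest to `vector`.

    Assumption: squared Euclidean distance is used to pick the winner.
    """
    best, best_dist = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((a - b) ** 2 for a, b in zip(w, vector))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(vectors, rows=10, cols=10, epochs=3000, lr0=0.5, seed=0):
    """Train a rows x cols SOM; returns weights[row][col] -> weight vector."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    radius0 = max(rows, cols) / 2
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1 - frac)                    # assumed linear decay
        radius = max(radius0 * (1 - frac), 0.5)  # assumed linear shrink
        for v in rng.sample(vectors, len(vectors)):  # shuffled pass
            br, bc = best_node(weights, v)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-d2 / (2 * radius * radius))
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += lr * h * (v[k] - w[k])
    return weights
```

Pure Python is used here only to keep the sketch self-contained. At the scale described above (1124 genes, 100 nodes, 3000 epochs) it would be very slow, so try it on a small grid and a handful of epochs.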

Figure 8.1: RescueNet interactive mode: Training

Once training is complete, the SOM will be saved in a file named “savedNet10x10-3000e.som”. This is a default name and may be changed. We now find ourselves back at the main menu. This time, we choose Option 3 (Load a Saved SOM) and enter the name of our newly trained SOM. Next, we choose the option to “Analyse Data using SOM”, upon which we are asked to give the name of the file which contains the sequences we wish to analyse.

The first time we do this, we wish to identify the N. meningitidis genes that have an unusual codon usage pattern. We therefore tell RescueNet to analyse the genes in GNM.seq, splitting the dataset into genes of known function and hypothetical genes (see fig. 8.2).

Figure 8.2: Data Analysis
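The split into genes of known function and hypothetical genes can be pictured with the short sketch below. The criterion used (the word "hypothetical" appearing in the annotation) and the function name `split_by_annotation` are assumptions for illustration; RescueNet's own rule and the layout of GNM.seq are not described here. The entry `GENE_X` is a placeholder, not a real gene.

```python
def split_by_annotation(genes):
    """Split (gene_id, description) pairs into known-function and hypothetical IDs."""
    known, hypothetical = [], []
    for gene_id, description in genes:
        # Assumed criterion: the annotation text contains "hypothetical".
        if "hypothetical" in description.lower():
            hypothetical.append(gene_id)
        else:
            known.append(gene_id)
    return known, hypothetical


genes = [
    ("NMB1757", "hypothetical protein"),
    ("NMB0819", "hypothetical protein"),
    ("GENE_X", "placeholder annotation for a gene of known function"),
]
known, hypothetical = split_by_annotation(genes)
```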

We are now presented with the Data Analysis Submenu. Using option 1 in this menu, we can generate probability scores for each gene in the datasets, and thereby identify the genes that display an atypical codon usage pattern. When asked whether to change how the random genes are generated, we choose to generate random genes that use the same mutational bias as the genes in the dataset. We choose long format results.
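As a rough illustration of random genes that share a mutational bias with the real ones, the sketch below draws random sequences whose base composition matches that of a set of genes. Reading "same mutational bias" as "same A/C/G/T composition" is an assumption, as are the names `base_composition` and `random_gene`; how RescueNet actually builds its random genes is not specified in this section. The sketch also makes no attempt to exclude in-frame stop codons.

```python
import random
from collections import Counter


def base_composition(sequences):
    """Fraction of A, C, G and T across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        counts.update(b for b in seq.upper() if b in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no A/C/G/T bases found")
    return {b: counts[b] / total for b in "ACGT"}


def random_gene(composition, n_codons, rng=random):
    """Random sequence of n_codons codons drawn with the given base composition."""
    bases = list(composition)
    weights = [composition[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=3 * n_codons))
```

For example, `random_gene(base_composition(real_genes), 76, random.Random(1))` would give one random sequence of 76 codons, where `real_genes` stands for a list of sequences you supply.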

The end of the results file is shown in fig. 8.3. From the results summary, we see that quite a high proportion of the hypothetical genes (247 of 1032, roughly 24%) receive scores below 0.1 from the SOM, and these genes (such as NMB1757 and NMB0819 shown in the figure) can be easily identified from the results file.

NMB1757 hypothetical protein (Codons: 76 )
Winning Node: 0, 2 Cosine Score: 0.221433
Z-Score: -7.111150, Probability that pattern is not random: 0.0000000000

NMB0819 hypothetical protein (Codons: 130 )
Winning Node: 0, 3 Cosine Score: 0.174694
Z-Score: -7.904694, Probability that pattern is not random: 0.0000000000

*********** Overal Stats for Hypothetical Genes***********

0.0 -> 0.1: 247 Genes
0.1 -> 0.2: 42 Genes
0.2 -> 0.3: 27 Genes
0.3 -> 0.4: 27 Genes
0.4 -> 0.5: 43 Genes
0.5 -> 0.6: 25 Genes
0.6 -> 0.7: 26 Genes
0.7 -> 0.8: 23 Genes
0.8 -> 0.9: 56 Genes
0.9 -> 1.0: 516 Genes

Figure 8.3: Long format results file
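Low-scoring genes can also be pulled out of a long format results file with a short script. The sketch below assumes each record has the three-part layout shown in figure 8.3; the regular expression `RECORD` and the function `parse_results` are illustrative names, and a real results file may differ in whitespace or contain other sections. The last lines recompute the proportion of hypothetical genes in the lowest bin from the summary counts in the figure.

```python
import re

# One record: gene line, winning node / cosine score, Z-score / probability.
RECORD = re.compile(
    r"(?P<gene>\S+)\s+(?P<desc>.*?)\s*\(Codons:\s*(?P<codons>\d+)\s*\)\s*"
    r"Winning Node:\s*(?P<row>\d+),\s*(?P<col>\d+)\s*"
    r"Cosine Score:\s*(?P<cosine>-?[\d.]+)\s*"
    r"Z-Score:\s*(?P<z>-?[\d.]+),\s*"
    r"Probability that pattern is not random:\s*(?P<prob>[\d.]+)"
)


def parse_results(text):
    """Return one dict per gene record found in `text`."""
    records = []
    for m in RECORD.finditer(text):
        records.append({
            "gene": m.group("gene"),
            "description": m.group("desc"),
            "codons": int(m.group("codons")),
            "node": (int(m.group("row")), int(m.group("col"))),
            "cosine": float(m.group("cosine")),
            "z": float(m.group("z")),
            "prob": float(m.group("prob")),
        })
    return records


sample = """NMB1757 hypothetical protein (Codons: 76 )
Winning Node: 0, 2 Cosine Score: 0.221433
Z-Score: -7.111150, Probability that pattern is not random: 0.0000000000

NMB0819 hypothetical protein (Codons: 130 )
Winning Node: 0, 3 Cosine Score: 0.174694
Z-Score: -7.904694, Probability that pattern is not random: 0.0000000000
"""

records = parse_results(sample)
atypical = [r["gene"] for r in records if r["prob"] < 0.1]

# Summary counts for the hypothetical genes, copied from figure 8.3.
bins = [247, 42, 27, 27, 43, 25, 26, 23, 56, 516]
fraction_lowest = bins[0] / sum(bins)
print(atypical, f"{fraction_lowest:.1%}")
```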

The next analysis we wish to carry out is to visualise the amount of codon usage variation within the N. meningitidis genome. We should at this stage find ourselves back at the Data Analysis Submenu. Choosing option 3 here brings the user into the Cluster Analysis function (see Section 5.3).

The user is first asked whether they want to change the threshold value. At this stage we do not, and so enter ‘n’. We then choose to analyse only the genes of known function in the dataset, in order to map the various groups of codon usage that exist among them. After we enter an output file name, the program maps the genes. Since we do not wish to perform Significance of Difference tests between the groups, the Cluster Analysis function is now complete.

Opening the Cluster Analysis output file, we see two separate 10x10 number maps (fig. 8.4). The first of these maps shows the number of genes that group with each SOM node. Displaying this map using a spreadsheet program or MATLAB gives a representation like figure 8.5i. This gives the user an initial indication of the number of distinct clusters of codon usage that occur in the dataset. The Significance of Difference between these clusters may of course be tested at any time.
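The first map is simply a tally of how many genes land on each node. A minimal sketch of that tally is given below, assuming you already have a (row, column) winning node for each gene; `count_map` and `to_tsv` are illustrative helper names rather than part of RescueNet, and the tab-separated output is just one convenient form for loading into a spreadsheet.

```python
def count_map(winning_nodes, rows=10, cols=10):
    """Number of genes whose winning node is each (row, col) position."""
    counts = [[0] * cols for _ in range(rows)]
    for r, c in winning_nodes:
        counts[r][c] += 1
    return counts


def to_tsv(grid):
    """Render a grid as tab-separated text, one row per line."""
    return "\n".join("\t".join(str(v) for v in row) for row in grid)


# Winning nodes of the two genes shown in figure 8.3, used as toy input.
grid = count_map([(0, 2), (0, 3)])
print(to_tsv(grid))
```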

Figure 8.4: Cluster Analysis results

The second map in the output file represents the similarities found between the nodes themselves at the chosen threshold value. Each number represents a pattern that is sufficiently different from the others at that threshold. Experimenting with the threshold value builds up a picture of which nodes on the output layer are similar to one another (figure 8.5ii).
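One way to picture how such a map could be built is sketched below: neighbouring nodes whose weight vectors are at least as similar as the threshold share a group number, and nodes with no similar neighbour are marked ‘~’, as in the legend of figure 8.5ii. This is a guess at the general idea, not RescueNet's documented procedure. The use of cosine similarity, the four-neighbour flood fill, and the names `cosine` and `group_nodes` are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two weight vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def group_nodes(weights, threshold):
    """Label grid nodes: similar neighbours share a number, isolated nodes get '~'."""
    rows, cols = len(weights), len(weights[0])
    labels = [[None] * cols for _ in range(rows)]
    next_label = 1
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] is not None:
                continue
            labels[r][c] = next_label
            size, stack = 1, [(r, c)]
            while stack:  # flood fill over sufficiently similar neighbours
                cr, cc = stack.pop()
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and labels[nr][nc] is None
                            and cosine(weights[cr][cc], weights[nr][nc]) >= threshold):
                        labels[nr][nc] = next_label
                        size += 1
                        stack.append((nr, nc))
            if size == 1:
                labels[r][c] = "~"  # outlying node, dissimilar to its neighbours
            else:
                next_label += 1
    return labels


# Toy 1x3 output layer: the first two nodes are similar, the third is not.
toy = [[[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]]
print(group_nodes(toy, threshold=0.9))
```

Lowering the threshold merges more nodes into the same group, which mirrors the experimentation with threshold values described above.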

The whole process can be repeated using subsets of the dataset in order to test whether these subsets use similar codon usage patterns. In this way, we can discover the sources of codon usage variation within a genome. The choice of subset is up to the user. In our case, we chose to test separate functional groups in N. meningitidis. Lists of the genes in each subset were downloaded from TIGR (16). Using the Cluster Analysis function on the various subsets, we showed that, to a certain extent, different functional groups have different codon usage patterns in N. meningitidis. Some examples are shown in figure 8.5.

Figure 8.5: Cluster Analysis in N. meningitidis
(i) Nodes on the output layer responding to all known genes. Legend shows number of times each node responds to the dataset.
(ii) Groups of similar weight vectors on the output layer. The symbol ‘~’ denotes outlying nodes that are dissimilar to their neighbours.
(iii) Distribution of the energy metabolism functional group genes.
(iv) Distribution of the protein synthesis functional group genes.
(v) Distribution of the cellular processes functional group genes.
(vi) Distribution of the high scoring hypothetical genes.


Department of Information Technology,
National University of Ireland, Galway, University Road, Galway, Ireland.
Phone: +353 (0)91 524411 ext 3549, E-mail: aaron.golden(AT)