8. Example B: Analysis of codon usage variation
In this example, we study the causes of
codon usage variation in the genome of Neisseria meningitidis
and, in doing so, identify the genes whose codon
usage pattern is atypical of the genome. Unlike the previous
example, this example uses RescueNet’s interactive text-based
menus, which are started simply by calling RescueNet from the command line.
The first step is again to train a new SOM. We
do this by choosing option 2 from the main menu. We have chosen
to analyse the list of predicted genes downloaded from TIGR (15).
Within this dataset, we then choose to train the SOM using only
genes of known function. This leaves 1124 of the 2156 genes in the
dataset in the training set. As shown in fig. 8.1, we then choose
to train a 10x10 node SOM for 3000 epochs.
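To make the training step more concrete, the following is a minimal sketch of how a rectangular SOM can be trained on a set of numeric vectors such as per-gene codon usage frequencies. It is not RescueNet's implementation: the function names, the linearly decaying learning rate, the Gaussian neighbourhood, the use of cosine similarity to pick the winning node, and the reading of one "epoch" as one pass over the training set are all assumptions made for illustration.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_node(weights, vec):
    """Return (row, col) of the node whose weight vector is most similar to vec."""
    coords = [(r, c) for r in range(len(weights)) for c in range(len(weights[0]))]
    return max(coords, key=lambda rc: cosine(weights[rc[0]][rc[1]], vec))


def train_som(vectors, rows=10, cols=10, epochs=3000, lr0=0.5, seed=0):
    """Train a rows x cols SOM; returns weights[row][col] as a list of floats.

    Assumption: one epoch is one shuffled pass over all training vectors.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        decay = 1.0 - epoch / epochs
        lr = lr0 * decay                    # learning rate shrinks over time
        radius = max(radius0 * decay, 0.5)  # neighbourhood shrinks over time
        for vec in rng.sample(vectors, len(vectors)):
            br, bc = best_node(weights, vec)
            for r in range(rows):
                for c in range(cols):
                    d2 = (r - br) ** 2 + (c - bc) ** 2
                    if d2 > radius * radius:
                        continue  # node lies outside the current neighbourhood
                    h = math.exp(-d2 / (2.0 * radius * radius))
                    node = weights[r][c]
                    for i in range(dim):
                        node[i] += lr * h * (vec[i] - node[i])
    return weights


# Toy data standing in for codon usage vectors (hypothetical values).
toy = [[0.9, 0.1, 0.0, 0.0], [1.0, 0.0, 0.1, 0.0],
       [0.0, 0.1, 0.0, 0.9], [0.0, 0.0, 0.1, 1.0]]
som = train_som(toy, rows=3, cols=3, epochs=20)
print(best_node(som, toy[0]), best_node(som, toy[3]))
```

The defaults mirror the 10x10 map and 3000 epochs chosen above, but the toy call uses a much smaller map so that the sketch runs instantly.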
Figure 8.1: RescueNet interactive mode:
Once training is complete, the SOM will be saved
in a file named “savedNet10x10-3000e.som”. This is a
default name and may be changed. We now find ourselves back at the
main menu. This time, we choose Option 3 (Load a Saved SOM) and
enter the name of our newly trained SOM. Next, we choose the option
to “Analyse Data using SOM”, whereupon we are asked
for the name of the file containing the sequences we wish
to analyse.
The first time we do this, we wish to identify
the N. meningitidis genes that have an unusual codon usage
pattern. We therefore tell RescueNet to analyse the genes
in GNM.seq, splitting the dataset into genes of known function
and hypothetical genes (see fig. 8.2).
Figure 8.2: Data Analysis
We are now presented with the Data Analysis Submenu.
Using option 1 in this menu, we can now generate probability scores
for each gene in the datasets, and thereby identify the genes that
display an atypical codon usage pattern. When prompted to change
the random genes, we choose to generate random genes that use the same
mutational bias as the genes being analysed. We choose long format results.
The end of the results file is shown in fig. 8.3.
From the results summary, we see that quite a high proportion of
the hypothetical genes (247 of 1032, about 24%) receive scores below
0.1 from the SOM, and these genes (such as NMB1757 and NMB0819, shown
in the figure) can be easily identified from the results file.
NMB1757 hypothetical protein (Codons: 76 )
Winning Node: 0, 2 Cosine Score: 0.221433
Z-Score: -7.111150, Probability that pattern is not random:
NMB0819 hypothetical protein (Codons: 130 )
Winning Node: 0, 3 Cosine Score: 0.174694
Z-Score: -7.904694, Probability that pattern is not random:
*********** Overal Stats for Hypothetical Genes***********
0.0 -> 0.1: 247 Genes
0.1 -> 0.2: 42 Genes
0.2 -> 0.3: 27 Genes
0.3 -> 0.4: 27 Genes
0.4 -> 0.5: 43 Genes
0.5 -> 0.6: 25 Genes
0.6 -> 0.7: 26 Genes
0.7 -> 0.8: 23 Genes
0.8 -> 0.9: 56 Genes
0.9 -> 1.0: 516 Genes
Figure 8.3: Long format results file
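The summary at the end of the results file can also be tallied programmatically. The sketch below parses the ten score bins reproduced in figure 8.3 and reports what share of the hypothetical genes falls in the lowest bin. The line format is taken from the excerpt above; the helper name `parse_bins` is hypothetical, and a complete results file may contain other lines that this pattern simply skips.

```python
import re

# Score bins copied from the summary shown in figure 8.3.
SUMMARY = """\
0.0 -> 0.1: 247 Genes
0.1 -> 0.2: 42 Genes
0.2 -> 0.3: 27 Genes
0.3 -> 0.4: 27 Genes
0.4 -> 0.5: 43 Genes
0.5 -> 0.6: 25 Genes
0.6 -> 0.7: 26 Genes
0.7 -> 0.8: 23 Genes
0.8 -> 0.9: 56 Genes
0.9 -> 1.0: 516 Genes
"""

BIN_RE = re.compile(r"^\s*(\d\.\d)\s*->\s*(\d\.\d):\s*(\d+)\s+Genes\s*$")


def parse_bins(text):
    """Return a list of (lower, upper, gene_count) tuples, one per summary line."""
    bins = []
    for line in text.splitlines():
        match = BIN_RE.match(line)
        if match:
            bins.append((float(match.group(1)), float(match.group(2)),
                         int(match.group(3))))
    return bins


bins = parse_bins(SUMMARY)
total = sum(count for _, _, count in bins)
low = sum(count for _, upper, count in bins if upper <= 0.1)
print(f"{low} of {total} genes ({100 * low / total:.1f}%) score below 0.1")
```

The total of 1032 hypothetical genes agrees with the 2156 genes in the dataset minus the 1124 genes of known function.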
The next analysis we wish to carry out is to visualise
the amount of codon usage variation within the N. meningitidis
genome. We should at this stage find ourselves back at the Data
Analysis Submenu. Choosing option 3 here brings the user into the
Cluster Analysis function (see Section 5.3).
The user is first asked whether they want to change
the threshold value. At this stage, we do not wish to, and so enter
‘n’. We now choose to analyse only the genes of known
function in the dataset, in order to map the various
groups of codon usage that exist among them.
After we enter an output file name, the program maps the genes.
Since we do not wish to perform Significance of Difference tests
between the groups, the Cluster Analysis function is now complete.
Opening the Cluster Analysis output file, we see
two separate 10x10 number maps (fig. 8.4). The first of these maps
shows the number of genes that group with each SOM node. Displaying
this map using a spreadsheet program or MATLAB gives a representation
like figure 8.5i. This gives the user an initial indication
of the number of distinct clusters of codon usage
in the dataset. The Significance of Difference between these clusters
may of course be tested at any time.
Figure 8.4: Cluster Analysis results
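As an illustration of what the first map contains, the sketch below rebuilds a comparable per-node count grid from the "Winning Node" lines of a long format results file and writes it as tab-separated text that a spreadsheet can open. The helper names are hypothetical, and treating the first number as the row and the second as the column is an assumption. The two sample records are the ones shown in figure 8.3 and serve only to demonstrate the line format.

```python
import re
from collections import Counter

NODE_RE = re.compile(r"Winning Node:\s*(\d+)\s*,\s*(\d+)")


def hit_map(results_text, size=10):
    """Count how many genes have each node as their winning node.

    Assumption: "Winning Node: a, b" means row a, column b.
    """
    counts = Counter((int(a), int(b)) for a, b in NODE_RE.findall(results_text))
    return [[counts[(row, col)] for col in range(size)] for row in range(size)]


def to_tsv(grid):
    """Render the grid as tab-separated text, one map row per line."""
    return "\n".join("\t".join(str(n) for n in row) for row in grid)


sample = """\
NMB1757 hypothetical protein (Codons: 76 )
Winning Node: 0, 2 Cosine Score: 0.221433
NMB0819 hypothetical protein (Codons: 130 )
Winning Node: 0, 3 Cosine Score: 0.174694
"""

grid = hit_map(sample)
print(to_tsv(grid))
```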
The second map in the output file represents the similarities
found between the nodes themselves at the chosen threshold value. Each
distinct number represents a pattern that is sufficiently different
from the others at that threshold.
Experimenting with the threshold value builds up a picture of which
nodes are similar to other nodes on the output layer (figure 8.5ii).
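One way to picture how a threshold can turn node weight vectors into a numbered map is sketched below. This is an assumed procedure, not the one RescueNet documents: neighbouring nodes are placed in the same group when the cosine similarity of their weight vectors reaches the threshold, and each group receives its own number. The function names and the toy weights are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def group_nodes(weights, threshold):
    """Label each node with a group number via flood fill over similar neighbours."""
    rows, cols = len(weights), len(weights[0])
    labels = [[0] * cols for _ in range(rows)]  # 0 means "not yet labelled"
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            if labels[r][c]:
                continue
            next_label += 1
            labels[r][c] = next_label
            stack = [(r, c)]
            while stack:
                cr, cc = stack.pop()
                for nr, nc in ((cr + 1, cc), (cr - 1, cc), (cr, cc + 1), (cr, cc - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols and not labels[nr][nc]
                            and cosine(weights[cr][cc], weights[nr][nc]) >= threshold):
                        labels[nr][nc] = next_label
                        stack.append((nr, nc))
    return labels


# Toy 2x3 map of 2-dimensional weight vectors (hypothetical values).
demo = [[[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]],
        [[1.0, 0.0], [0.9, 0.1], [1.0, 1.0]]]

for threshold in (0.95, 0.7):
    print(threshold, group_nodes(demo, threshold))
```

With the strict threshold the two right-hand nodes each form their own group, comparable to the outlying nodes marked ‘~’ in figure 8.5ii; with the looser threshold all six nodes merge into one group.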
The whole process can be repeated using subsets
of the dataset in order to test whether these subsets use similar
codon usage patterns. In this way, we can discover the sources of
codon usage variation within a genome. Choice of subset is up to
the user. In our case, we chose to test separate functional groups
in N. meningitidis. Lists of the genes in each subset were downloaded
from TIGR (16). Using the cluster analysis function on the various
subsets, we showed that, to a certain extent, different functional
groups have different codon usage patterns in N. meningitidis.
Some examples are shown in figure 8.5.
Figure 8.5: Cluster Analysis in N. meningitidis
(i) Nodes on the output layer responding to all known genes.
Legend shows number of times each node responds to the dataset.
(ii) Groups of similar weight vectors on the output layer.
The symbol ‘~’ denotes outlying nodes that are
dissimilar to their neighbours.
(iii) Distribution of the energy metabolism functional group
(iv) Distribution of the protein synthesis functional group
(v) Distribution of the cellular processes functional group
(vi) Distribution of the high scoring hypothetical genes