
5. Analysing Data Using a Saved SOM

A previously saved SOM may be loaded using option 3 under the main menu. Saved SOM files should be compatible across all supported platforms. SOMs that have been trained using a variety of organisms are available for download from the RescueNet website (

It is possible to train a saved SOM further using the Continue Training option. Doing so reconfigures the trained SOM, however, and it is difficult to foresee a use for this option.

Option 2 in the submenu brings the user into the data analysis part of the program. The user will be prompted for the name of the RSCU file that holds the data to be tested. The user will then be asked if this dataset is to be split (according to the keywords ‘putative’ or ‘hypothetical’ in the gene’s annotation). The user may choose to split the dataset into genes of known function vs hypothetical genes in order to see explicitly what proportion of each subset fits in well with the codon usage patterns of the genome (see Section 8). Once this choice is made, the Data Analysis Submenu is presented.
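The keyword split can be pictured with a short Python sketch. It assumes each gene is held as an (annotation, values) pair and that the keyword match is case-insensitive; neither detail is confirmed for RescueNet, and the name split_by_annotation is hypothetical.

```python
# Keywords named in the manual; case-insensitive matching is an assumption.
HYPOTHETICAL_KEYWORDS = ("putative", "hypothetical")

def split_by_annotation(genes):
    """Split (annotation, values) pairs into known and hypothetical genes."""
    known, hypothetical = [], []
    for annotation, values in genes:
        text = annotation.lower()
        if any(keyword in text for keyword in HYPOTHETICAL_KEYWORDS):
            hypothetical.append((annotation, values))
        else:
            known.append((annotation, values))
    return known, hypothetical
```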

5.1 Submenu Option 1: Generating Probability Scores for Genes
This option will use the SOM to assign a probability score to every gene in the dataset under test. This value will be the probability that the gene uses a similar codon usage pattern to one which the SOM is trained to recognise, as opposed to the likelihood that the gene uses a more random codon usage pattern.

5.1.1 Output file formats: The user is firstly prompted for an output filename. Even if the dataset has been separated after loading, all results will be stored in the same output file. The user is also prompted as to whether the results should be in short or long format.

Short format results are unordered and consist of the first word in the gene’s annotation (usually an accession number) followed by the cosine distance score and the probability score on the same line (Fig. 5.1). Fields are tab-delimited, which makes it easy to analyse the results in a spreadsheet program for some applications (e.g. annotation). The difference between cosine scores and probability scores is a subtle one. While cosine scores are an absolute score given by the SOM to a gene, probability scores are very dependent on the choice of random gene dataset (Section 5.1.2). The probability scores are in fact cosine scores that have been re-scaled from the mean score received by the random genes. In some cases, random genes can score very well (e.g. if codon usage in a genome is heavily governed by the mutational bias of that genome). In this case, the probability scores are almost useless and cosine scores should be used. That said, probability scores are the best measure to take in most cases. See Section 7 for a graphic example of the difference between cosine scores and probability scores.
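As an alternative to a spreadsheet, a short-format file could be read with a few lines of Python. This is a minimal sketch that assumes exactly three tab-delimited fields per line (identifier, cosine score, probability score) and no header row; the function name read_short_format and the sample values are made up for illustration.

```python
import csv
import io

def read_short_format(handle):
    """Parse tab-delimited lines of: identifier, cosine score, probability score."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        rows.append((fields[0], float(fields[1]), float(fields[2])))
    return rows

# Illustrative data only, not real RescueNet output.
example = io.StringIO("geneA\t0.91\t0.98\ngeneB\t0.62\t0.40\n")
rows = read_short_format(example)

# Short-format output is unordered, so sort by cosine score if a ranking is needed.
ranked = sorted(rows, key=lambda row: row[1], reverse=True)
```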

Long format results (Fig. 5.2) are ordered with the highest scoring gene first and give more information about the gene, including the winning node, the cosine score, a z-score (based on the random dataset), as well as the probability score. A summary of results in the dataset is given at the end.
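The z-score can be understood as the number of standard deviations a gene's cosine score lies above the mean score of the random genes. The sketch below uses the standard definition with the sample standard deviation; whether RescueNet uses the sample or the population form is not stated, so treat that choice as an assumption.

```python
from statistics import mean, stdev

def z_scores(gene_scores, random_scores):
    """Standardise gene cosine scores against the random-gene score distribution."""
    mu = mean(random_scores)
    sigma = stdev(random_scores)  # sample standard deviation (assumed)
    return [(score - mu) / sigma for score in gene_scores]
```

Because the transformation is a shift followed by division by a positive constant, it does not change the ranking of the genes.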

5.1.2 Random Sequence Dataset: Next, the user will be asked if they wish to change the dataset of random sequences that are used in probability score generation. If the user chooses to do so, s/he will be asked to choose between:
· a neutral nucleotide bias
· the mutational bias found in the genes in the file-under-test
· a custom defined nucleotide bias
While the mutational bias option is the recommended one, the user should only choose this option if the current input file is the one which was used to train the SOM. If in doubt, it is safer to stick to the neutral bias option. The third option (defining your own bias) would be useful if the nucleotide percentages are available for the genome under test. This information is usually readily available.
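To illustrate what a nucleotide bias means for the random dataset, the following Python sketch draws random sequences from a set of per-base frequencies. It is not RescueNet's generator: the function name, the sequence length, and the handling of stop codons (none here) are all assumptions, and the custom frequencies are illustrative values only.

```python
import random

def random_sequence(n_codons, bias, rng):
    """Draw a nucleotide sequence of n_codons codons from per-base frequencies."""
    bases = "ACGT"
    weights = [bias[base] for base in bases]
    return "".join(rng.choices(bases, weights=weights, k=3 * n_codons))

# Neutral bias: every base is equally likely.
neutral = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
# Custom bias: illustrative GC-rich percentages, as might be entered by hand.
custom = {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15}

rng = random.Random(0)  # fixed seed so the example is repeatable
random_genes = [random_sequence(300, custom, rng) for _ in range(1000)]
```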

It should be noted that the probability scores will vary significantly depending on whether a neutral or a mutationally biased random dataset is used. Because of this, the probability scores should never be thought of as absolute scores, but rather as a relative scale. No matter what random dataset is used, the order of the scores will not change (i.e. the highest scoring gene will always be the highest scoring gene), but the scores themselves will. As for the number of random genes to be generated, 1000 to 2000 is a reasonable number, and gives a good statistical spread.

Results will now be generated and the user will be returned to the Data Analysis Submenu.

5.2 Submenu Option 2: Cosine Distribution Graphs
This option prints the distribution of cosine scores for a dataset to a file. If the user has split the dataset into known & hypothetical genes, then both distributions will be printed. If a dataset of random genes exists, then the distribution of their cosine scores is also printed. These graphs may then be viewed in a spreadsheet program. The use of this option is to allow the user to see what proportion of each displayed dataset scores well in comparison to the scores received by the random gene dataset.
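A comparable table can be built by hand from any list of cosine scores. The sketch below assumes scores lie in the range 0 to 1 and uses equal-width bins; the bin count, column layout, and function names are assumptions rather than the format RescueNet writes.

```python
def score_distribution(scores, n_bins=10):
    """Return the proportion of scores falling in each equal-width bin over [0, 1]."""
    counts = [0] * n_bins
    if not scores:
        return [0.0] * n_bins
    for score in scores:
        # Clamp so that a score of exactly 1.0 lands in the last bin.
        index = min(max(int(score * n_bins), 0), n_bins - 1)
        counts[index] += 1
    return [count / len(scores) for count in counts]

def distribution_table(datasets, n_bins=10):
    """Return tab-delimited text: one row per bin, one column per dataset."""
    names = list(datasets)
    columns = [score_distribution(datasets[name], n_bins) for name in names]
    lines = ["bin_start\t" + "\t".join(names)]
    for i in range(n_bins):
        cells = [f"{column[i]:.3f}" for column in columns]
        lines.append(f"{i / n_bins:.2f}\t" + "\t".join(cells))
    return "\n".join(lines)
```

Using proportions rather than raw counts keeps the known, hypothetical, and random columns comparable even when the datasets differ in size.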

5.3 Submenu Option 3: Cluster Analysis
This option prints to a file the clusters of similar nodes found on the SOM, as well as how many genes in the dataset are recognised by each node. This option is very useful for identifying trends in codon usage in the dataset, because we can see exactly how many genes are grouped in each area of the SOM output layer. By splitting a dataset into smaller subsets (according to function or chromosomal position), a map can be built up of the codon usage patterns used by each subgroup (see Section 8).

After choosing this option, the user will first be asked if s/he wishes to change the threshold level at which node clusters are automatically generated. This option comes into use when the user wishes to explore the variation between the patterns the actual nodes have been trained to recognise. Moving this value towards 1 should show more node clusters on the output layer, because a higher threshold value raises the level at which the clustering algorithm says that two nodes are similar to each other. Conversely, lowering the value towards 0 should show fewer clusters, and the difference between each cluster will be more significant.
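The effect of the threshold can be demonstrated with a small sketch. The manual does not describe the clustering algorithm itself, so this example swaps in simple single-linkage grouping: two nodes join the same cluster whenever the cosine similarity of their weight vectors meets the threshold. The function names and the use of cosine similarity between node weights are assumptions, and grid adjacency on the output layer is ignored.

```python
import math

def cosine(u, v):
    """Cosine similarity between two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cluster_nodes(weights, threshold):
    """Group node indices whose similarity is >= threshold (single linkage)."""
    parent = list(range(len(weights)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(weights)):
        for j in range(i + 1, len(weights)):
            if cosine(weights[i], weights[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(weights)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

With three toy nodes [1, 0], [0.9, 0.1] and [0, 1], a threshold of 0.9 groups the first two nodes together and leaves the third alone, while a threshold close to 1 separates all three.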

The user is then asked which subgroup they wish to analyse. While this may seem like a repeat of the question asked when the data file was loaded, it actually refers to which subset of the data (if the data was earlier split) should be mapped in the output file. The user is also prompted for a name for the output file at this point. The output file will also contain a list of genes (highest scoring first) that are grouped with each node on the output layer.

5.4 Significance of Difference Tests
Once clustering has been completed, the user will be asked if they wish to perform significance of difference (SoD) tests between the node clusters. There are two types of SoD test to choose between. The first compares every node on the output layer to every other node. So much data is produced in this way that this option is only of use when the user is trying to establish whether two individual nodes are significantly different from each other. The second type is more useful. Entire clusters of nodes are compared to other clusters in order to establish if they use specific codons in a significantly different manner to the other clusters. In this case, the user can choose either to let RescueNet automatically identify clusters (according to the previously mentioned threshold level) or to input lists of nodes manually as clusters. Manual input is obviously more arduous, but it can be worthwhile for users looking for differences in codon usage between groups of genes.
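The manual does not say which statistical test RescueNet applies. Purely to illustrate the idea of a per-codon comparison between two clusters, the sketch below computes Welch's t statistic on the RSCU values of the genes in each cluster. The choice of test, the function names, and the representation of a gene as a dictionary mapping codons to RSCU values are all assumptions.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

def compare_clusters(cluster_a, cluster_b, codons):
    """Return {codon: t statistic} comparing RSCU values between two gene groups."""
    return {
        codon: welch_t(
            [gene[codon] for gene in cluster_a],
            [gene[codon] for gene in cluster_b],
        )
        for codon in codons
    }
```

A large absolute value for a codon suggests that the two clusters use that codon differently; converting the statistic to a p-value would additionally require the t distribution, which is outside this sketch.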


Department of Information Technology,
National University of Ireland, Galway, University Road, Galway, Ireland.
Phone: +353 (0)91 524411 ext 3549, E-mail: aaron.golden(AT)