
4. Training the SOM

4.1 The Training Set
Before data can be analysed using RescueNet, a Self-Organizing Map (SOM) must be trained on whatever sequence data the user deems appropriate. Option 2 from the main menu begins the training process. The user is first prompted for the training set filename. Only FASTA format sequence data or RSCU value files are acceptable input. An incorrect filename or incorrectly formatted data will cause the program to exit.

There is no limit to the number of sequences that can be part of the training set, but the larger the training set, the longer training will take. The choice of training set depends heavily on the application for which RescueNet is being used. For example, a user may wish to train the SOM using only ORFs over a certain length. On the other hand, a user wishing to explore codon usage trends in an annotated genome may wish to include only genes of known function, as opposed to hypothetical genes. To go some way towards facilitating this choice, the user is asked to choose between using:
0) All genes in the file
1) Only genes of known function (i.e. those without the words ‘putative’ or ‘hypothetical’ in their annotation)
2) Known function & putative genes (i.e. those without the word ‘hypothetical’ in their annotation)
3) All genes greater than a certain length in triplets
Note that options 1 & 2 depend on having an annotated dataset. Note also that option 3 will not find ORFs in a contiguous genome sequence. In this case, a dataset of ORFs may first be generated using NCBI’s ORF Finder (14) or a similar program.
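The four filtering options above can be illustrated with a short Python sketch. This is not RescueNet's own code: the function names (`parse_fasta`, `keep_record`, `filter_training_set`) are hypothetical, the keyword match is assumed to be case-insensitive, and "greater than a certain length" is assumed to mean strictly greater. It may be useful for preparing a filtered FASTA file by hand before training.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_record(header, sequence, option, min_triplets=0):
    """Return True if a record passes filter option 0, 1, 2 or 3."""
    annotation = header.lower()  # assumption: matching ignores case
    if option == 0:  # all genes in the file
        return True
    if option == 1:  # known function only
        return "putative" not in annotation and "hypothetical" not in annotation
    if option == 2:  # known function and putative genes
        return "hypothetical" not in annotation
    if option == 3:  # assumption: strictly longer than min_triplets
        return len(sequence) // 3 > min_triplets
    raise ValueError(f"unknown option: {option}")


def filter_training_set(text, option, min_triplets=0):
    """Return the (header, sequence) pairs that pass the chosen filter."""
    return [(h, s) for h, s in parse_fasta(text)
            if keep_record(h, s, option, min_triplets)]
```

As in the note above, options 1 and 2 in this sketch only make sense when the FASTA headers carry annotation text.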

Finally, a user may manually construct a training set using whatever sequences they wish (e.g. mixing sequences from different genomes together, or only training the SOM on genes of specific function), as long as all sequences are in the same file.

See the examples in Sections 7 & 8 for more on training sets.

4.2 SOM lattice size
RescueNet uses a square output lattice topology. During training, the user will be prompted for the length of one side of the lattice. Increasing the size will increase training time, so the user should be careful in the choice of lattice size. This choice will be governed by many influences, such as the problem being tackled, time constraints, and the size of the training set.

For example, if RescueNet is being used as a visualisation tool for codon usage trends, the user may be tempted to choose a large lattice to give increased resolution. However, the user should be aware that a larger SOM, which takes longer to train, will not necessarily give more accurate results. Too small a lattice may miss some weak trends in codon usage, but too large a lattice may lead to an over-specialised SOM (one that loses the ability to generalise about similar patterns in the data analysis phase). As a very rough estimate, I suggest that the total number of nodes in the lattice should be approximately the number of genes in the training set divided by 10 (and one side of the lattice should be the square root of this, rounded to the nearest whole number). The user is encouraged to experiment as much as possible with lattice size.
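The rough estimate above is simple arithmetic, sketched here in Python. The function name `suggested_lattice_side` and the `genes_per_node` parameter are hypothetical and are not part of RescueNet. The result is only a starting point for experimentation.

```python
import math


def suggested_lattice_side(n_genes, genes_per_node=10):
    """Side length of a square lattice with roughly n_genes / 10 nodes."""
    if n_genes < 1:
        raise ValueError("the training set must contain at least one gene")
    total_nodes = n_genes / genes_per_node
    # Round the square root to the nearest integer, but never go below 1.
    return max(1, round(math.sqrt(total_nodes)))
```

For instance, a training set of 4000 genes gives a target of 400 nodes, so the sketch suggests a side of 20 (a 20 x 20 lattice).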

4.3 The Number of Epochs
This is essentially the measure of how long the SOM should be trained for, and it is another value with which to experiment. Note, however, that a very short training time will lead to an inaccurate SOM that does not recognise subtle patterns, while a very long training time can lead to an over-specialised SOM that may also be inaccurate. A value of between 500 and 3000 is recommended for most codon usage analysis applications, but the user may want to choose a lower number if SOM training time is becoming a bottleneck in an annotation process. Training of the SOM begins once the number of epochs has been entered.
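To show why training time grows with both the number of epochs and the lattice size, here is a minimal, generic SOM training loop in Python. It is a textbook-style sketch and not RescueNet's implementation. It assumes that one epoch is one pass over the training set, and the linearly decaying learning rate, the Gaussian neighbourhood and the function names are all illustrative assumptions.

```python
import math
import random


def best_matching_unit(weights, vec):
    """Return (row, col) of the node whose weights are closest to vec."""
    best, best_dist = (0, 0), float("inf")
    for i, row in enumerate(weights):
        for j, node in enumerate(row):
            dist = sum((a - b) ** 2 for a, b in zip(node, vec))
            if dist < best_dist:
                best, best_dist = (i, j), dist
    return best


def train_som(vectors, side, epochs, seed=0):
    """Train a side x side SOM on the input vectors; return the weights."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(side)]
               for _ in range(side)]
    for epoch in range(epochs):
        # Assumed schedules: rate and radius shrink linearly over training.
        progress = epoch / epochs
        rate = 0.5 * (1.0 - progress)
        radius = max(1.0, (side / 2.0) * (1.0 - progress))
        for vec in vectors:
            bi, bj = best_matching_unit(weights, vec)
            for i in range(side):
                for j in range(side):
                    dist2 = (i - bi) ** 2 + (j - bj) ** 2
                    influence = math.exp(-dist2 / (2.0 * radius ** 2))
                    node = weights[i][j]
                    # Pull the node towards the input, scaled by distance
                    # from the best matching unit on the lattice.
                    for k in range(dim):
                        node[k] += rate * influence * (vec[k] - node[k])
    return weights
```

In this sketch the work per epoch is proportional to the number of training vectors multiplied by the number of lattice nodes, which is consistent with the advice that larger training sets, larger lattices and more epochs all lengthen training.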

4.4 Saving the SOM
Once training is complete, the SOM is automatically saved into a file with a default name beginning with the word “savedNet”. This file may be renamed and used at a later date for data analysis (see Section 5).


Department of Information Technology,
National University of Ireland, Galway, University Road, Galway, Ireland.
Phone: +353 (0)91 524411 ext 3549, E-mail: aaron.golden(AT)