Training the SOM
4.1 The Training Set
Before data may be analysed using RescueNet, a Self-Organizing Map
must be trained using whatever sequence data the user deems appropriate.
Option 2 from the main menu begins the training process. The
user is first prompted for the training set filename. Only
FASTA format sequence data or RSCU value files are acceptable input.
An incorrect filename or incorrectly formatted data will cause
the program to exit.
There is no limit to the number of sequences that
can be part of the training set, but the larger the training
set, the longer training will take. The choice of training
set depends heavily on the application for which RescueNet is
being used. For example, a user may wish to train the SOM using
only ORFs which are over a certain length. On the other hand, a
user wishing to explore codon usage trends in an annotated genome
may wish to include only genes of known function, as opposed to
hypothetical genes. To help with this choice,
the user is asked to choose between using:
0) All genes in the file
1) Only genes of known function (i.e. those without the words ‘putative’
or ‘hypothetical’ in their annotation)
2) Known function & putative genes (i.e. those without the word
‘hypothetical’ in their annotation)
3) All genes greater than a certain length in triplets
Note that options 1 & 2 depend on having an annotated dataset.
Note also that option 3 will not find ORFs in a contiguous genome
sequence. In this case, a dataset of ORFs may first be generated
using NCBI’s ORF Finder (14) or a similar program.
Finally, a user may manually construct a training
set using whatever sequences they wish (e.g. mixing sequences from
different genomes together, or only training the SOM on genes of
specific function), as long as all sequences are in the same file.
See the examples in Sections 7 & 8 for more on training sets.
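The four training set options above can be illustrated with a short Python sketch. This is not RescueNet's own code: the function names `read_fasta` and `keep_record`, the case-insensitive keyword match, and the example records are all assumptions made for illustration. It may be useful when preparing a training file by hand before loading it into RescueNet.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def keep_record(header, sequence, option, min_triplets=0):
    """Return True if a record passes training set option 0, 1, 2 or 3."""
    text = header.lower()  # assumption: keywords are matched case-insensitively
    if option == 0:        # all genes in the file
        return True
    if option == 1:        # known function only
        return "putative" not in text and "hypothetical" not in text
    if option == 2:        # known function and putative genes
        return "hypothetical" not in text
    if option == 3:        # longer than min_triplets, measured in triplets
        return len(sequence) // 3 > min_triplets
    raise ValueError(f"unknown option: {option}")


# Hypothetical annotated records, for illustration only.
example = """>gene1 DNA polymerase III
ATGAAACCCGGGTTTTAA
>gene2 putative transporter
ATGCCCTAA
>gene3 hypothetical protein
ATGAAATTTCCCGGGAAATAG
""".splitlines()

records = list(read_fasta(example))
for option in (0, 1, 2):
    kept = [h.split()[0] for h, s in records if keep_record(h, s, option)]
    print(option, kept)
```

With these example records, option 1 keeps only gene1 and option 2 keeps gene1 and gene2. Option 3 with `min_triplets=5` would keep gene1 and gene3, since their sequences are 6 and 7 triplets long.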
4.2 SOM lattice size
RescueNet uses a square output lattice topology. During training,
the user will be prompted for the length of one side of the lattice.
Increasing the size will increase training time, so the user should
be careful in the choice of lattice size. This choice will be governed
by many influences, such as the problem being tackled, time constraints,
and the size of the training set.
For example, if RescueNet is being used as a visualisation
tool for codon usage trends, the user may be tempted to choose a
large lattice to give increased resolution. However, the user
should be aware that a larger SOM, which takes longer to train, will
not necessarily lead to more accurate results. Too small a lattice
may miss some weak trends in codon usage, but too large a lattice
may lead to an over-specialised SOM (which loses the ability to
generalise about similar patterns in the data analysis phase). As
a very rough estimate, I suggest that the total number of nodes in the
lattice should be approximately the number of genes in the training set
divided by 10, and that one side of the lattice should be the square
root of this, rounded to the nearest whole number. The user is encouraged to experiment as
much as possible with lattice size.
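The rough estimate above can be worked through as a small Python sketch. The function name `suggested_lattice_side` is hypothetical, and the result is only a starting point for experimentation, not a value that RescueNet computes itself.

```python
import math


def suggested_lattice_side(n_genes):
    """Side of a square lattice with roughly n_genes / 10 nodes in total."""
    if n_genes < 1:
        raise ValueError("training set must contain at least one gene")
    total_nodes = n_genes / 10
    # Round the square root to the nearest integer; never go below 1.
    return max(1, round(math.sqrt(total_nodes)))


for n in (400, 1000, 4000):
    side = suggested_lattice_side(n)
    print(n, "genes ->", side, "x", side, "lattice")
```

For instance, a training set of 1000 genes gives 100 nodes in total, so the suggested value to enter for one side of the lattice is 10.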
4.3 The Number of Epochs
The number of epochs is essentially a measure of how long the SOM is
trained. This value is also one with which to experiment. Note, however,
that a very short training time will lead to a SOM that is
inaccurate and does not recognise subtle patterns, while a very
long training time will lead to an over-specialised SOM that
may also be inaccurate. A value of between 500 and 3000 is recommended
for most codon usage analysis applications, but the user may want
to choose a lower number if SOM training time is becoming a bottleneck
in an annotation process. Training of the SOM begins once the number
of epochs has been entered.
4.4 Saving the SOM
Once training is complete, the SOM is automatically saved into a
file with a default name beginning with the word “savedNet”.
This file may be renamed and used at a later date for data analysis
(see Section 5).