2.1 Generating the background
Before running Sombrero to find regulatory motifs,
you must first generate a background model for whatever chromosome
or genome you are working with. The "BackExtract" Perl
script is provided with Sombrero for this purpose.
To use this script, you will need the genomic sequence
of your genome in FastA format. You will also need the locations
of the genes in that genome. The BackExtract script currently handles
two gene location file formats: General Feature Format
(GFF, as output by Artemis)
and ProTein Table format (PTT, usually provided by GenBank).
The BackExtract script looks at these gene locations and extracts
the regions of the genome between the genes: the "intergenic" regions.
Alternatively, you can provide a FastA file of
intergenic regions or other sequences that you wish to base your
background model on.
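As an illustration of what "extracting the regions between the genes" means, here is a minimal Python sketch. It is not the BackExtract implementation: the function name, the toy genome and the gene coordinates are made up for this example. It assumes gene locations are given as 1-based, inclusive (start, end) pairs, as in GFF and PTT files, ignores strand, and also returns the sequence before the first gene and after the last one, which BackExtract may or may not do.

```python
def intergenic_regions(genome, genes):
    """Return the stretches of `genome` that lie between annotated genes.

    `genes` is a list of (start, end) pairs using 1-based, inclusive
    coordinates. Overlapping genes are merged implicitly.
    """
    regions = []
    covered_to = 0  # number of leading bases already covered by a gene
    for start, end in sorted(genes):
        if start - 1 > covered_to:
            # There is a gap between the previous gene and this one.
            regions.append(genome[covered_to:start - 1])
        covered_to = max(covered_to, end)
    if covered_to < len(genome):
        # Sequence after the last gene (an assumption of this sketch).
        regions.append(genome[covered_to:])
    return regions


# Toy data, purely hypothetical.
genome = "AAAACCCCGGGGTTTTAAAA"
genes = [(5, 8), (7, 12), (17, 18)]
print(intergenic_regions(genome, genes))
```

The resulting list of sequences is the kind of input that could equally be supplied directly as a FastA file of intergenic regions.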
Finally, you also need to tell the script what
order the background Markov model should be. This should depend
on the amount of intergenic sequence that your genome has. If you
are generating a background model using a small amount of sequence,
it is best to use the default setting (3rd order model). 3rd to
5th order models are sufficient in most cases.
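The reason the order should depend on the amount of sequence is that a model of order k has 4^k possible contexts, each with 4 possible following bases, so the number of probabilities to estimate grows quickly with k. The Python sketch below shows the general idea of counting contexts and turning the counts into probabilities. It is only an illustration under simple assumptions (no smoothing, ambiguous bases skipped); the function names are invented here, and it does not reproduce BackExtract's code or its output file format.

```python
from collections import Counter, defaultdict


def markov_counts(sequences, order):
    """Count, for every context of `order` bases, which base follows it."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if set(context + base) <= set("ACGT"):  # skip ambiguous bases
                counts[context][base] += 1
    return counts


def transition_probabilities(counts):
    """Turn raw counts into P(next base | context), without smoothing."""
    return {
        context: {base: n / sum(following.values())
                  for base, n in following.items()}
        for context, following in counts.items()
    }


# Number of conditional probabilities a model of each order must estimate.
for order in (3, 5):
    print(order, "-> parameters to estimate:", 4 ** (order + 1))

# Toy sequence, purely hypothetical.
probs = transition_probabilities(markov_counts(["ACGACGACGT"], order=2))
print(probs["CG"])
```

With only a small amount of sequence, many contexts of a high-order model are seen rarely or never, which is why a lower order is the safer choice in that case.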
The BackExtract script arguments are as follows:
-g [genome_file]: The name of the contiguous
genome sequence file
-gff [GFF_file]: The GFF file of gene locations for the genome
-ptt [PTT_file]: The PTT file of gene locations for the genome
-seq [FastA_file]: FastA format sequences of intergenic regions
(-seq is an alternative to -g and -gff/-ptt)
-x [len]: The order of the background Markov model (default = 3)
-out [name]: Output filename (Default: out.back)
Therefore, an example use would be as follows:
perl BackExtract.pl -g chr12.1con -gff chr12.gff -x 5 -out back12
2.2 Sombrero command line
Sombrero is a command-line driven program. The
command line options are each explained below:
-t [training_file]: Here, "training_file" is the FastA format file that
Sombrero 'trains' on. This file should contain the sequences that
you want to look for regulatory elements in. Required.
-b [background_file]: The "background_file" is the Markov model file that
you would have generated using the BackExtract script described
in Section 2.1. Required.
-l [x]: Here 'x' gives
the length of the motif that the SOM should find. This is the
'L' in 'L-mer'. Default = 8, Max = 40.
-lm [x][y]: This option
is an alternative to '-l'. It allows the SOM to be trained
over multiple L-mer lengths between 'x' and 'y'. For example,
the option '-lm 8 20' would train separate SOMs for all
even motif lengths between 8 and 20 (inclusive). This is the recommended
usage of Sombrero if you do not know the length of the motif you
are looking for. Since most regulatory elements are between 8
and 20bp long, '-lm 8 20' is a good argument to use.
Sombrero can collate the results from the various trained SOMs
and present you with the most likely transcription factor binding
sites across all L-mer lengths. See Section 2.3 and Section 3
for examples of Sombrero being trained across multiple lengths.
-size [R][C]: The size
of the SOM to train in Rows and Columns. This is an important
choice that will affect the accuracy of motif finding. Our experiments
have shown that it is best to keep the ratio of SOM nodes
(RxC) to input dataset base pairs on the order of 1:10. For example,
if you have a training set with a total of 8000bp in it, then
a 40x20 SOM would be most effective. This would be achieved using
the command line option '-size 40 20'. Default: Auto
sizing using the 1:10 heuristic.
-time [T]: The number
of training cycles to train the SOM for. The default option is
usually sufficient in our experience. Default = 100 cycles.
-forward: If this option
is used, then Sombrero only looks for motifs in the forward direction
of the training sequences and ignores the complementary strands.
Default: Sombrero checks both strands.
-cmplx [C]: This option
allows the user to adjust the default complexity threshold, thus
allowing simpler or more complex motifs to be reported in the list
of most significant motifs found. Note that this option does not
affect the actual motifs that are found, just the list of which
ones are reported in the viewer. Default = 0.15.
-mask: If this option
is set, Sombrero will mask out all lowercase characters and all
L-mers containing "N".
-prior [prior file]:
This option will initialise Sombrero using a SOM that has previously
been trained on a database of transcription factor binding matrices.
This initialisation improves accuracy if one of the members of
the TF matrix database is present in the input sequences.
-autoprior [prior base filename]:
Assuming that you have downloaded a set of priors, providing the
base name of the prior files (e.g. "Fly") with this
option allows the size of the prior SOM to be chosen automatically.
-random: If set, Sombrero
is initialised entirely randomly, rather than the gradient-random
initialisation used by default.
-out [output_name]: Here "output_name" is the root name of the output files
that will be generated. Several output files may be generated
for a particular Sombrero session, but all will begin with the
name provided as "output_name". Required.
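To make the 1:10 sizing rule described under '-size' concrete, here is a small Python sketch that suggests a grid for a given amount of training sequence. The function name and the 2:1 rows-to-columns shape are assumptions taken only from the 40x20 example above; Sombrero's own auto sizing may choose the shape differently.

```python
import math


def suggest_som_size(total_bp, bp_per_node=10, aspect=2):
    """Suggest (rows, columns) so that rows * columns is about total_bp / bp_per_node.

    `aspect` is the assumed ratio of rows to columns (hypothetical choice).
    """
    target_nodes = max(1, total_bp // bp_per_node)
    columns = max(1, round(math.sqrt(target_nodes / aspect)))
    rows = max(1, round(target_nodes / columns))
    return rows, columns


# 8000 bp of training sequence, as in the example for '-size'.
rows, columns = suggest_som_size(8000)
print(f"-size {rows} {columns}")
```

For 8000bp this gives 800 nodes arranged as 40 rows by 20 columns, matching the example given for the '-size' option.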
An example use of Sombrero would be:
./sombrero -t crp_set.seq -b e_coli.back -lm 8 20 -size 10 10 -out CRP
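The following Python sketch illustrates what the '-lm', '-forward' and '-mask' options mean in terms of the L-mers that a SOM is trained on. It is not Sombrero's code: the function names are invented, and it interprets masking as simply skipping any L-mer that contains a lowercase base or an "N", which is an assumption about the exact behaviour.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence of the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def lmers(seq, length, forward_only=False, mask=False):
    """Yield every L-mer of `seq`, by default from both strands.

    With mask=True, L-mers containing a lowercase base or an "N" are
    skipped (this sketch's reading of '-mask').
    """
    strands = [seq] if forward_only else [seq, reverse_complement(seq)]
    for strand in strands:
        for i in range(len(strand) - length + 1):
            word = strand[i:i + length]
            if mask and (word != word.upper() or "N" in word):
                continue
            yield word.upper()


# Even motif lengths covered by '-lm 8 20'.
lengths = list(range(8, 21, 2))
print(lengths)

# Toy sequence, purely hypothetical: both strands, no masking.
print(list(lmers("AACG", 3)))
```

With forward_only=True only the first strand is scanned, which corresponds to the behaviour described for '-forward'.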
2.3 Viewing the results
At present, the Sombrero results viewer is a Perl/Tk
program, so in order to look at Sombrero's results, you'll need
to have Perl installed, as well as the Tk module (which usually
comes with the Perl installation, so you probably won't have to
worry about this). The viewer is the file called "SOMBREROView.pl"
that came with your distribution. The viewer is easy to run; it
only takes one argument and that is the Sombrero output file root
name. For example, if you had trained Sombrero using the option
'-out CRP', then "CRP" is your root name, and the results
viewer would be called as follows:
perl SOMBREROView.pl CRP
The Sombrero results viewer looks like this:
This particular 20x10 node SOM is the result of
training Sombrero with gal4 binding sequences from S. cerevisiae.
This example is more fully explored in Section 3. However, the above
screen shot can illustrate the basics of the Sombrero results viewer.
First of all, the coloured blocks are the nodes
of the SOM. Each block is colour-coded according to how "over-represented"
the node's motif is. The redder the block, the higher the z-score
that node's motif has. What we call "over-representation"
or z-score is in fact a good indicator that a motif is a regulatory
element. Therefore, the reddest block on the display holds the motif
that is most likely to be a regulatory element.
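For readers unfamiliar with z-scores, the sketch below shows the general idea: compare how often a motif is observed with how often the background model would lead you to expect it, in units of standard deviations. This is a generic binomial z-score written in Python for illustration only. The exact statistic Sombrero computes is not specified in this section, and the function name and numbers are hypothetical.

```python
import math


def z_score(observed, trials, p_background):
    """Z-score of an observed count under a binomial background expectation."""
    expected = trials * p_background
    std_dev = math.sqrt(trials * p_background * (1.0 - p_background))
    return (observed - expected) / std_dev


# Hypothetical numbers: a motif seen 12 times among 7000 candidate
# positions, where the background gives it probability 0.0005 per position.
print(round(z_score(12, 7000, 0.0005), 2))
```

A motif that occurs about as often as the background predicts scores near zero, while a strongly over-represented motif gets a large positive score and would be shown as a redder node.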
All nodes on the display can be clicked using the
mouse pointer, and doing so brings up information about the node
and the node's motif. For example, clicking on a node brings up
the motif consensus sequence, the motif's complexity and the information
content of the motif, to name a few. In the bottom right hand corner
of the window, you will also see a list of occurrences of the motif
throughout the training sequences. These instances can be displayed
on a simple sequence viewer using the "Display Occurrences"
button. You can also copy the instances to the clipboard. Ticking
the "Enforce Threshold" box can eliminate some weaker
matches to the motif, leaving only sequences that we are fairly
confident are instances of the motif.
You will also notice the "L-mer Length"
drop-down menu beside the Exit button at the top of the window.
If you trained Sombrero using the '-lm' option, this drop-down
menu will allow you to browse between all the different SOMs that
were trained for these sequences. However, you won't have to go
through every SOM looking for the reddest node and comparing them
all; we've done that for you! The top right hand corner of the window
holds the "Best Patterns" box. This is an ordered list
of the best motifs found across all SOMs. Therefore, the number
1 pattern in this box is the most likely regulatory element found
in any SOM trained for these sequences. You can click on any of
these patterns and the relevant SOM and motif information will be
displayed.