SOMBRERO Demonstration using the Gal4 dataset
The following example demonstrates Sombrero's operation
when finding motifs in real genomic data. The dataset used in this
example is a set of 4 promoter sequences taken from S. cerevisiae.
The 4 sequences are known to harbour 14 binding sites for the gal4
transcription factor. The dataset contains 3100 bp in total.
All sequences and binding site locations were downloaded from the
Promoter Database of
In this example, we will apply Sombrero to the
yeast promoter sequences known to harbour gal4 binding sites.
However, we first need to generate a background model of the yeast
intergenic regions. Here, we generate a 3rd order Markov model of
all yeast intergenic regions using the "BackExtract" script
described in Section 2.1.
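To make the idea of a 3rd order Markov background model concrete, the following is a minimal Python sketch of how such a model could be estimated from a set of sequences. It is not the "BackExtract" script and does not reproduce its output file format; the function name `markov_background` and the `pseudocount` parameter are assumptions made for illustration only.

```python
from collections import defaultdict

def markov_background(sequences, order=3, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from DNA sequences.

    Illustrative sketch only; not the BackExtract implementation.
    """
    counts = defaultdict(lambda: dict.fromkeys("ACGT", 0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows containing ambiguous characters such as N.
            if set(context + base) <= set("ACGT"):
                counts[context][base] += 1

    # Convert counts to probabilities, smoothing with a pseudocount
    # (an assumption) so that unseen transitions are not given zero.
    # Contexts never observed in the input are simply absent from the model.
    model = {}
    for context, base_counts in counts.items():
        total = sum(base_counts.values()) + 4 * pseudocount
        model[context] = {b: (c + pseudocount) / total
                          for b, c in base_counts.items()}
    return model
```

For a 3rd order model, each probability is conditioned on the 3 preceding bases, so `model["ACG"]["T"]` is the estimated probability of seeing T immediately after ACG.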
Next, we need to train Sombrero on the sequences.
We do this using the following command:
./sombrero -t yeast_prom.seq -b yeast.back
-size 20 10 -lm 16 22 -out gal4
Here, "yeast_prom.seq" is the FastA file
containing the 4 yeast promoter sequences, and "yeast.back"
is the file containing the background Markov model of yeast intergenic
regions. A 20 x 10 node SOM is trained and training repeats across
all even L-mer lengths between 16 and 22 inclusive. The root name
for the output files is given as "gal4".
If you wish to repeat this example yourself, the
required data files, as well as a list of known gal4 binding sites,
can be downloaded here. Training will
take anywhere between 10 and 30 minutes, depending on the computer
you are using.
Once training completes, we can view the
results and the potential motifs that Sombrero finds in the data.
If you didn't have time to train Sombrero, but would still like
to see the results, the required files are provided here.
We can open the Sombrero results viewer using the following simple command:
perl SOMBREROView.pl gal4
This command should open the viewer display below:
Clicking on the L-mer length drop-down menu will
show that 4 SOMs are available for viewing (L-mer lengths 16, 18,
20 & 22). The current view is on the 18-mer SOM, and the currently
selected node is node 19,9, the node with the highest z-score in
the entire training run. The Sombrero viewer tells us that this
node contains a motif with the consensus sequence TCGGSSSNSNNTNNTCCG.
There are 17 matches to this motif found by Sombrero in the input
sequences, and these are presented in the bottom right hand corner
of the viewer.
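The consensus sequence above uses IUPAC ambiguity codes (S stands for C or G, N for any base). As a rough illustration of what such a consensus describes, the Python sketch below converts an IUPAC consensus into a regular expression and scans both strands of a sequence for exact matches. This is a simplification: Sombrero reports the L-mers assigned to a SOM node, which need not all match the consensus exactly, so this scan should not be expected to reproduce the viewer's match list. The function names are hypothetical.

```python
import re

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[c] for c in consensus.upper())

def revcomp(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_matches(consensus, seq):
    """Return (start, strand) for every exact consensus match.

    Starts are 0-based positions on the forward strand. A lookahead is
    used so that overlapping matches are also reported.
    """
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    width = len(consensus)
    seq = seq.upper()
    hits = [(m.start(), "+") for m in pattern.finditer(seq)]
    # Scan the reverse complement and map positions back to the forward strand.
    for m in pattern.finditer(revcomp(seq)):
        hits.append((len(seq) - (m.start() + width), "-"))
    return sorted(hits)
```

For example, `find_matches("TCGGSSSNSNNTNNTCCG", promoter)` would list the positions in the string `promoter` (a hypothetical variable holding one promoter sequence) that match the consensus on either strand.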
The motif in node 19,9 is shown in Logo format
in Figure A below. The experimentally verified gal4 binding
motif is shown in Figure B. As you can see, Sombrero finds a close
match to the gal4 binding motif.
In terms of how successful Sombrero is at finding
the actual binding site locations, we can compare the list of occurrences
of the above Sombrero motif to the known gal4 binding sites.
Doing so shows that Sombrero correctly finds 13 of the 14 known
gal4 binding sites. It also finds 4 other matches to the
above motif that are not known to be gal4 binding sites.
Whether or not these extra occurrences actually bind gal4
in the cell is a matter for debate.
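In other words, 13 of the 14 known sites are recovered (about 93%), and 13 of the 17 reported matches coincide with known sites (about 76%). The Python sketch below shows one way such a comparison could be automated. The overlap criterion (at least one shared base on the same sequence) and the `(sequence id, start, end)` site representation are assumptions; the text above does not state how matches were compared with known sites.

```python
def overlaps(a, b, min_overlap=1):
    """True if two (seq_id, start, end) sites share >= min_overlap bases.

    Coordinates are treated as half-open intervals [start, end).
    """
    return a[0] == b[0] and min(a[2], b[2]) - max(a[1], b[1]) >= min_overlap

def score_predictions(predicted, known, min_overlap=1):
    """Compare predicted motif occurrences against known binding sites."""
    # Known sites hit by at least one prediction.
    found = sum(any(overlaps(p, k, min_overlap) for p in predicted)
                for k in known)
    # Predictions that hit at least one known site.
    supported = sum(any(overlaps(p, k, min_overlap) for k in known)
                    for p in predicted)
    return {
        "found": found,
        "extra": len(predicted) - supported,
        "sensitivity": found / len(known),
        "ppv": supported / len(predicted),
    }

# Toy coordinates, invented purely to show the call; not the gal4 data.
known_sites = [("s1", 10, 28), ("s1", 50, 68)]
predicted_sites = [("s1", 12, 30), ("s2", 50, 68), ("s1", 100, 118)]
result = score_predictions(predicted_sites, known_sites)
```

With the list of known gal4 binding sites and the occurrences reported for the node, the same function would yield the counts discussed above, provided the overlap criterion matches the one used in the original comparison.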
Compare this performance to tests we carried out
with two other popular motif-finding methods, MEME and AlignACE.
Using program settings chosen to be as close as possible to the
Sombrero settings, AlignACE finds only 11 of the known gal4
sites, whereas MEME finds only 10.