3. SOMBRERO Demonstration using the Gal4 dataset

The following example demonstrates Sombrero's operation when finding motifs in real genomic data. The dataset is a set of 4 promoter sequences taken from S. cerevisiae, 3100 bp in total. The 4 sequences are known to harbour 14 binding sites for the gal4 transcription factor. All sequences and binding site locations were downloaded from the Promoter Database of Saccharomyces cerevisiae.

In this example, we apply Sombrero to the yeast promoter sequences known to harbour gal4 binding sites. First, however, we need a background model of the yeast intergenic regions. We generate a 3rd-order Markov model of all yeast intergenic regions using the "BackExtract" script described in Section 2.1.
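
To make the idea of a 3rd-order Markov background concrete, the following is a minimal Python sketch that estimates the probability of each base given the three bases before it. It is not the "BackExtract" script and does not reproduce its file format; the function name `markov_background`, the pseudocount, and the toy sequences are assumptions made purely for illustration.

```python
from collections import Counter, defaultdict

def markov_background(sequences, order=3, alphabet="ACGT", pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from DNA sequences.

    Illustrative sketch only: not the BackExtract implementation.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            # Skip windows that contain ambiguous characters such as N.
            if set(context + base) <= set(alphabet):
                counts[context][base] += 1

    model = {}
    for context, ctr in counts.items():
        # Add a pseudocount so unseen transitions keep a non-zero probability.
        total = sum(ctr.values()) + pseudocount * len(alphabet)
        model[context] = {b: (ctr[b] + pseudocount) / total for b in alphabet}
    return model

# Toy input standing in for the yeast intergenic regions (not real data).
toy = ["ACGTACGTACGT", "TTTTACGTAAAA"]
model = markov_background(toy, order=3)
print(model["ACG"]["T"])  # context ACG is followed by T four times -> 0.625
```

A higher-order model captures longer-range sequence composition at the cost of needing more data per context, which is presumably why the model is built from all intergenic regions rather than from the 4 promoters alone.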

Next, we need to train Sombrero on the sequences. We do this using the following command:

./sombrero -t yeast_prom.seq -b yeast.back -size 20 10 -lm 16 22 -out gal4

Here, "yeast_prom.seq" is the FASTA file containing the 4 yeast promoter sequences, and "yeast.back" is the file containing the background Markov model of yeast intergenic regions. A 20 x 10 node SOM is trained, and training is repeated for every even L-mer length between 16 and 22 inclusive. The root name for the output files is given as "gal4".
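
As a rough illustration of what "all even L-mer lengths between 16 and 22" implies for the input, the sketch below lists those lengths and cuts a sequence into overlapping windows of each length. The helper names `lmer_lengths` and `lmers` and the toy sequence are hypothetical; how Sombrero itself extracts L-mers (for example, whether it also uses the reverse strand) is not described here and is not assumed.

```python
def lmer_lengths(lo, hi, step=2):
    """Lengths from lo to hi inclusive, in steps of `step`."""
    return list(range(lo, hi + 1, step))

def lmers(seq, length):
    """All overlapping substrings of the given length (forward strand only)."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]

lengths = lmer_lengths(16, 22)
print(lengths)  # [16, 18, 20, 22]

# Toy 40 bp sequence, not taken from the gal4 dataset.
toy_seq = "ACGT" * 10
counts = {length: len(lmers(toy_seq, length)) for length in lengths}
print(counts)  # {16: 25, 18: 23, 20: 21, 22: 19}
```

One SOM is trained per length, which matches the four SOMs that appear in the viewer's L-mer length menu.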

If you wish to repeat this example yourself, the required data files, as well as a list of known gal4 binding sites, can be downloaded here. Training will take anywhere between 10 and 30 minutes, depending on the computer you are using.

Once training completes, we can view the results and the potential motifs that Sombrero finds in the data. If you didn't have time to train Sombrero but would still like to see the results, the required files are provided here. We can open the Sombrero results viewer using the following simple command:

perl gal4

This command should open the viewer display below:

Clicking on the L-mer length drop-down menu shows that 4 SOMs are available for viewing (L-mer lengths 16, 18, 20 and 22). The current view is of the 18-mer SOM, and the currently selected node is node 19,9, the node with the highest z-score in the entire training run. The Sombrero viewer tells us that this node contains a motif with the consensus sequence TCGGSSSNSNNTNNTCCG. Sombrero finds 17 matches to this motif in the input sequences, and these are presented in the bottom right-hand corner of the viewer.

The motif in node 19,9 is shown in Logo format in Figure A below. The experimentally verified gal4 binding motif is shown in Figure B. As you can see, Sombrero finds a close match to the gal4 binding motif.
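
The consensus sequence reported for node 19,9 uses IUPAC nucleotide codes (S means C or G, N means any base). A minimal sketch of how such a consensus can be turned into a regular expression and scanned against a sequence is given below. This is only a way of reading the consensus string: the matches listed in the viewer come from the SOM node itself, not from a pattern search, and the function names and toy sequence here are assumptions.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regex pattern string."""
    return "".join(IUPAC[c] for c in consensus.upper())

def find_matches(consensus, seq):
    """Return (start, matched_text) pairs, including overlapping matches."""
    # A lookahead lets the scan report matches that overlap each other.
    pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq.upper())]

consensus = "TCGGSSSNSNNTNNTCCG"
# Toy sequence built by hand to contain one match (not from the dataset).
toy = "AAATCGGCGCACATTTGTCCGAAA"
print(find_matches(consensus, toy))  # [(3, 'TCGGCGCACATTTGTCCG')]
```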

To assess how successful Sombrero is at finding the actual binding site locations, we can compare the list of occurrences of the above Sombrero motif to the known gal4 binding sites. Doing so shows that Sombrero correctly finds 13 of the 14 known gal4 binding sites. It also finds 4 other matches to the above motif that are not known to be gal4 binding sites. Whether or not these extra occurrences actually bind gal4 in the cell is a matter for debate.
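
One simple way to carry out this kind of comparison is to count a known site as found when any predicted occurrence overlaps it. The sketch below does this for intervals on a single sequence. The overlap criterion, the function names, and the coordinates are all assumptions for illustration; the criterion actually used to arrive at the figures above is not stated here.

```python
def overlaps(a, b):
    """True if half-open intervals a=(start, end) and b=(start, end) share a base."""
    return a[0] < b[1] and b[0] < a[1]

def score_predictions(predicted, known):
    """Count known sites found or missed, and predictions matching no known site."""
    found = [k for k in known if any(overlaps(k, p) for p in predicted)]
    extra = [p for p in predicted if not any(overlaps(p, k) for k in known)]
    return {
        "found": len(found),
        "missed": len(known) - len(found),
        "extra": len(extra),
    }

# Hypothetical coordinates on one sequence (not the real gal4 site positions).
known = [(10, 27), (50, 67), (120, 137)]
predicted = [(9, 27), (52, 70), (200, 218)]
result = score_predictions(predicted, known)
print(result)  # {'found': 2, 'missed': 1, 'extra': 1}
```

In these terms, the result reported above corresponds to 13 found, 1 missed and 4 extra.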

Compare this performance to tests we carried out with two other popular motif-finding methods: MEME and AlignACE. Using program settings chosen to be as close as possible to the Sombrero settings, AlignACE finds only 11 of the 14 known gal4 sites, whereas MEME finds only 10.
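
The counts quoted above can be expressed as sensitivity (the fraction of known sites recovered), with precision added for Sombrero, the only method for which the number of extra matches is given. The short calculation below uses only the figures reported in this section; the variable names are illustrative.

```python
KNOWN_SITES = 14

# Known gal4 sites recovered by each method, as reported in the text.
sites_found = {"Sombrero": 13, "AlignACE": 11, "MEME": 10}

sensitivity = {tool: n / KNOWN_SITES for tool, n in sites_found.items()}
for tool, value in sensitivity.items():
    print(f"{tool:9s} {sites_found[tool]:2d}/{KNOWN_SITES}  {value:.1%}")
# Sombrero  13/14  92.9%
# AlignACE  11/14  78.6%
# MEME      10/14  71.4%

# Sombrero reported 17 matches in total, 13 of them at known sites.
sombrero_precision = 13 / 17
print(f"Sombrero precision: {sombrero_precision:.1%}")  # 76.5%
```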


Department of Information Technology,
National University of Ireland, Galway, University Road, Galway, Ireland.
Phone: +353 (0)91 524411 ext 3549, E-mail: aaron.golden(AT)