
2. Using Sombrero

2.1 Generating the background model

Before running Sombrero to find regulatory motifs, you must first generate a background model for whatever chromosome or genome you are working with. The "BackExtract" Perl script is provided with Sombrero for this purpose.

To use this script, you will need the genomic sequence of your genome in FastA format. You will also need the locations of the genes in that genome. The BackExtract script currently handles two gene location file formats: General Feature Format (GFF, as output by Artemis) and ProTein Table format (PTT, usually provided by GenBank). The script uses these gene locations to extract the regions of the genome between the genes, the "intergenic" regions.

Alternatively, you can provide a FastA file of intergenic regions or other sequences that you wish to base your background model on.
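To make the idea of "intergenic" regions concrete, here is a minimal Python sketch of the extraction step. It is not the BackExtract code: the function name, the toy genome and the gene coordinates are all hypothetical, and it assumes 1-based inclusive coordinates (as used in GFF) and ignores strand information.

```python
def intergenic_regions(genome, genes):
    """Return (start, end, sequence) for each gap between annotated genes.

    `genes` holds (start, end) pairs, 1-based and inclusive (an assumption
    matching GFF). Overlapping genes are handled by tracking the furthest
    gene end seen so far.
    """
    regions = []
    cursor = 1  # first position not yet covered by a gene
    for start, end in sorted(genes):
        if start > cursor:
            regions.append((cursor, start - 1, genome[cursor - 1:start - 1]))
        cursor = max(cursor, end + 1)
    if cursor <= len(genome):
        regions.append((cursor, len(genome), genome[cursor - 1:]))
    return regions


genome = "AAACCCGGGTTTAAACCC"         # toy 18 bp sequence
genes = [(4, 6), (10, 12), (11, 14)]   # hypothetical gene coordinates
for start, end, seq in intergenic_regions(genome, genes):
    print(start, end, seq)
```

The sequences returned by a function like this are the kind of input that the FastA alternative expects, should you prefer to prepare the intergenic regions yourself.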

Finally, you also need to tell the script what order the background Markov model should be. The appropriate order depends on the amount of intergenic sequence available, because each additional order multiplies the number of contexts to be estimated by four. If you are generating a background model from a small amount of sequence, it is best to use the default setting (3rd order model). 3rd to 5th order models are sufficient in most cases.
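The following Python sketch illustrates what an order-k Markov background model counts, and why small datasets call for a low order. It is only an illustration under stated assumptions: the function names and the add-one smoothing are not taken from BackExtract, and the format of the background file that BackExtract writes is not reproduced here.

```python
from collections import Counter, defaultdict


def markov_counts(sequences, order=3):
    """Count how often each base follows each context of `order` bases."""
    counts = defaultdict(Counter)
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if set(context + base) <= set("ACGT"):  # skip ambiguous bases
                counts[context][base] += 1
    return counts


def transition_probability(counts, context, base, pseudocount=1.0):
    """P(base | context); the add-one smoothing is an assumption."""
    total = sum(counts[context].values()) + 4 * pseudocount
    return (counts[context][base] + pseudocount) / total


seqs = ["ACGTACGTACGT", "TTTTACGT"]   # toy sequences
model = markov_counts(seqs, order=3)
print(model["ACG"]["T"])
print(transition_probability(model, "ACG", "T"))

# Number of distinct contexts the model has to estimate for each order.
for k in (3, 4, 5):
    print(k, 4 ** k)
```

A 3rd order model has 64 contexts while a 5th order model has 1024, so the higher orders need correspondingly more intergenic sequence before each context is observed often enough.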

The BackExtract script arguments are as follows:

-g [genome_file]: The name of the contiguous genome sequence file
-gff [GFF_file]: The GFF file of gene locations for the genome
-ptt [PTT_file]: The PTT file of gene locations for the genome
-seq [FastA_file]: FastA format sequences of intergenic regions
(-seq is an alternative to -g and -gff/-ptt)
-x [len]: The length of the Markov Chain (default = 3)
-out [name]: Output filename (Default: out.back)

Therefore, an example use would be as follows:

perl -g chr12.1con -gff chr12.gff -x 5 -out back12


2.2 Sombrero command line arguments

Sombrero is a command-line driven program. The command line options are each explained below:

-t [training_file]: Here, "training_file" is the FastA format file that Sombrero 'trains' on. This file should contain the sequences that you want to look for regulatory elements in. Required.

-b [background_file]: The "background_file" is the Markov model file that you would have generated using the BackExtract script described in Section 2.1. Required.

-l [x]: Here 'x' gives the length of the motif that the SOM should find. This is the 'L' in 'L-mer'. Default = 8, Max = 40.

-lm [x] [y]: This option is an alternative to '-l'. It allows the SOM to be trained over multiple L-mer lengths between 'x' and 'y'. For example, the option '-lm 8 20' would train separate SOMs for all even motif lengths between 8 and 20 (inclusive). This is the recommended usage of Sombrero if you do not know the length of the motif you are looking for. Since most regulatory elements are between 8 and 20bp long, '-lm 8 20' is a good argument to use. Sombrero collates the results from the various trained SOMs and presents you with the most likely transcription factor binding sites across all L-mer lengths. See Section 2.3 and Section 3 for examples of Sombrero being trained across multiple lengths.

-size [R] [C]: The size of the SOM to train, in rows and columns. This is an important choice that will affect the accuracy of motif finding. Our experiments have shown that it is best to keep the ratio of SOM nodes (RxC) to input dataset base pairs in the order of 1:10. For example, if you have a training set with a total of 8000bp in it, then a 40x20 SOM (800 nodes) would be most effective. This would be achieved using the command line option '-size 40 20'. Default: automatic sizing using the same 1:10 heuristic.

-time [T]: The number of training cycles to train the SOM for. The default option is usually sufficient in our experience. Default = 100 cycles.

-forward: If this option is used, then Sombrero only looks for motifs in the forward direction of the training sequences and ignores the complementary strands. Default: Sombrero checks both strands.

-cmplx [C]: This option allows the user to adjust the default complexity threshold, so that simpler or more complex motifs are reported in the list of most significant motifs found. Note that this option does not affect the actual motifs that are found, just the list of which ones are reported in the viewer. Default = 0.15.

-mask: If this option is set, Sombrero will mask out all lowercase characters and all L-mers containing "N".

-prior [prior file]: This option will initialise Sombrero using a SOM that has previously been trained on a database of transcription factor binding matrices. This initialisation improves accuracy if one of the members of the TF matrix database is present in the input sequences.

-autoprior [prior base filename]: Assuming that you have downloaded a set of priors, providing the base name of the prior files (e.g. "Fly") with this option allows the size of the prior SOM to be chosen automatically.

-random: If set, Sombrero is initialised entirely randomly, rather than the gradient-random initialisation used by default.

-out [output_name]: Here "output_name" is the root name of the output files that will be generated. Several output files may be generated for a particular Sombrero session, but all will begin with the name provided as "output_name". Required.
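The sizing heuristic described under '-size' is simple arithmetic, and the following Python sketch works through it. The helper names are hypothetical, and the sketch does not reproduce how Sombrero's automatic sizing picks the actual row and column counts; it only computes the node count implied by one node per 10 base pairs.

```python
def total_bases(fasta_text):
    """Sum the sequence lengths in FastA text, ignoring header lines."""
    return sum(len(line.strip()) for line in fasta_text.splitlines()
               if line.strip() and not line.startswith(">"))


def target_nodes(total_bp, bp_per_node=10):
    """Number of SOM nodes suggested by the 1:10 node-to-base-pair ratio."""
    return max(1, total_bp // bp_per_node)


def ratio_for_size(total_bp, rows, cols):
    """Base pairs per node for a given '-size rows cols' choice."""
    return total_bp / (rows * cols)


toy_fasta = ">s1\nACGTACGT\nACGT\n>s2\nGGGG\n"   # toy input, 16 bases
print(total_bases(toy_fasta))

print(target_nodes(8000))            # node count for the 8000bp example
print(ratio_for_size(8000, 40, 20))  # base pairs per node for '-size 40 20'
```

For the 8000bp example in the text this gives 800 nodes, which matches the suggested 40x20 SOM.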

An example use of Sombrero would be:

./sombrero -t crp_set.seq -b e_coli.back -lm 8 20 -size 10 10 -out CRP
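To make the '-l', '-forward' and '-mask' options more concrete, the following Python sketch enumerates the L-mers of a sequence. It is an illustration rather than Sombrero's implementation: the function names are hypothetical, and it assumes that masking means skipping any L-mer that contains a lowercase character or an "N".

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def lmers(seq, length, forward_only=False, mask=False):
    """Yield the L-mers of `seq`, with reverse complements unless forward_only."""
    for i in range(len(seq) - length + 1):
        window = seq[i:i + length]
        if mask and (window != window.upper() or "N" in window):
            continue  # assumed masking rule: lowercase or N drops the L-mer
        window = window.upper()
        yield window
        if not forward_only:
            yield reverse_complement(window)


seq = "ACGTTnGA"   # toy sequence with one lowercase (masked) base
print(list(lmers(seq, 3, forward_only=True, mask=True)))
print(list(lmers(seq, 3, mask=True)))
```

With both strands enabled, each window contributes two L-mers, which is why the default setting doubles the number of L-mers relative to '-forward'.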


2.3 Viewing the results

At present, the Sombrero results viewer is a Perl/Tk program, so to look at Sombrero's results you will need to have Perl installed, along with the Tk module (which usually comes with the Perl installation). The viewer is the file called "" that came with your distribution. The viewer is easy to run: its only argument is the Sombrero output file root name. For example, if you had trained Sombrero using the option '-out CRP', then "CRP" is your root name, and the results viewer would be called as follows:

perl CRP

The Sombrero results viewer looks like this:

[Screenshot of the Sombrero results viewer]

This particular 20x10 node SOM is the result of training Sombrero with gal4 binding sequences from S. cerevisiae. This example is more fully explored in Section 3. However, the above screen shot can illustrate the basics of the Sombrero results viewer.

First of all, the coloured blocks are the nodes of the SOM. Each block is colour-coded according to how "over-represented" the node's motif is. The redder the block, the higher the z-score that node's motif has. What we call "over-representation" or z-score is in fact a good indicator that a motif is a regulatory element. Therefore, the reddest block on the display holds the motif that is most likely to be a regulatory element.

All nodes on the display can be clicked using the mouse pointer, and doing so brings up information about the node and the node's motif. For example, clicking on a node brings up the motif consensus sequence, the motif's complexity and the information content of the motif, among other details. In the bottom right hand corner of the window, you will also see a list of occurrences of the motif throughout the training sequences. These instances can be displayed on a simple sequence viewer using the "Display Occurrences" button. You can also copy the instances to the clipboard. Ticking the "Enforce Threshold" box can eliminate some weaker matches to the motif, leaving only sequences that we are fairly confident are instances of the motif.

You will also notice the "L-mer Length" drop-down menu beside the Exit button at the top of the window. If you trained Sombrero using the '-lm' option, this drop-down menu will allow you to browse between all the different SOMs that were trained for these sequences. However, you won't have to go through every SOM looking for the reddest node and comparing them all; we've done that for you! The top right hand corner of the window holds the "Best Patterns" box. This is an ordered list of the best motifs found across all SOMs. Therefore, the number 1 pattern in this box is the most likely regulatory element found in any SOM trained for these sequences. You can click on any of these patterns and the relevant SOM and motif information will be displayed.


Department of Information Technology,
National University of Ireland, Galway, University Road, Galway, Ireland.
Phone: +353 (0)91 524411 ext 3549, E-mail: aaron.golden(AT)