DRIMust - Discovering Ranked Imbalanced Motifs using suffix tree

DRIMust Manual

Input
Results
Examples and sample data

Input

Input type: The input to DRIMust is a list of sequences in FASTA format (DNA, RNA and protein alphabets are supported). The input list can be of two types:

Ranked list: One list of sequences, ordered by a parameter of interest (for example: expression level). DRIMust will search for motifs enriched at the top of the list compared to the rest of the list (the top of the list is dynamically determined by DRIMust).
Target and background lists: Two lists of sequences - a target list and a background list (the order within each list is not important). DRIMust will search for motifs enriched in the target list compared to the background list.

In both options, the total number of sequences must not exceed 40,000 and the total number of characters is limited to 4,000,000. The lists may contain 'AGCTUN' characters for DNA and RNA sequences and 'ACDEFGHIKLMNPQRSTVWYX' for protein sequences. Sequences that contain more than 5% 'N' or 'X' will be omitted from the analysis.
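The limits above can be checked locally before submitting a job. The following Python sketch is illustrative only and is not DRIMust's own code: the function names `parse_fasta` and `check_input` are hypothetical, and it assumes that lowercase letters are accepted and that the size limits are applied before the 5% filter, neither of which is stated in this manual.

```python
# Limits quoted from the manual; all names here are hypothetical helpers.
MAX_SEQUENCES = 40_000
MAX_CHARACTERS = 4_000_000
MAX_AMBIGUOUS_FRACTION = 0.05
NUCLEOTIDE_ALPHABET = set("AGCTUN")
PROTEIN_ALPHABET = set("ACDEFGHIKLMNPQRSTVWYX")


def parse_fasta(text):
    """Return a list of (title, sequence) pairs from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:].strip(), []
        elif title is not None:
            # Assumption: lowercase input is treated like uppercase.
            chunks.append(line.upper())
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


def check_input(records, protein=False):
    """Apply the documented limits and return the records that would be kept."""
    alphabet = PROTEIN_ALPHABET if protein else NUCLEOTIDE_ALPHABET
    wildcard = "X" if protein else "N"
    if len(records) > MAX_SEQUENCES:
        raise ValueError("more than 40,000 sequences")
    if sum(len(seq) for _, seq in records) > MAX_CHARACTERS:
        raise ValueError("more than 4,000,000 characters")
    kept = []
    for title, seq in records:
        unsupported = set(seq) - alphabet
        if unsupported:
            raise ValueError(f"{title}: unsupported characters {sorted(unsupported)}")
        if seq and seq.count(wildcard) / len(seq) > MAX_AMBIGUOUS_FRACTION:
            continue  # more than 5% 'N' (or 'X'): omitted from the analysis
        kept.append((title, seq))
    return kept
```

For a target and background submission, the same check would be run on the two lists combined, since the limits refer to the total input.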

Search mode:

Single-strand search mode: DRIMust treats the input sequences as the plus strand and searches for over-represented motifs at the top of the list. This mode is suitable for DNA, RNA and protein sequences.
Double-strand search mode: DRIMust takes into account both the given input sequences (as the plus strand) and their reverse-complement sequences (as the minus strand) and searches for motifs that are enriched in both strands. This mode is suitable for DNA and RNA sequences only; the dataset may contain 'AGCTUN' characters only.

* Note that the double-strand search will require longer running time than the single-strand mode.
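To make the minus strand concrete, here is a minimal Python sketch of a reverse complement over the 'AGCTUN' alphabet. The function name `reverse_complement` and the rule for choosing between the DNA and RNA tables are assumptions made for illustration; the manual does not describe how DRIMust derives the minus strand internally.

```python
# Complement tables for DNA and RNA; 'N' stays 'N'.
_DNA_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")
_RNA_COMPLEMENT = str.maketrans("ACGUN", "UGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA or RNA sequence."""
    seq = seq.upper()
    # Assumption: a sequence containing 'U' is RNA, anything else is DNA.
    table = _RNA_COMPLEMENT if "U" in seq else _DNA_COMPLEMENT
    return seq.translate(table)[::-1]


print(reverse_complement("AACGT"))  # ACGTT
```

In double-strand mode, a k-mer and its reverse complement describe the same site, which is why the k-mers table lists both strings.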

Search parameters

Motif length range: DRIMust can search for motifs of a specific length or within a range of lengths. The maximal length range allowed is 4-20 characters. The default is 5-10 characters for single-strand search mode and 10 characters for double-strand search mode. To select a specific length, enter the same value in both the 'Min. length' and 'Max. length' boxes.

Statistical significance threshold: DRIMust will report motifs whose P-value is more significant (smaller) than this threshold. The default threshold is 10^-6. The user can choose other thresholds between 10^-2 and 10^-15.

General parameters

Job name: An optional parameter that enables you to give your job an informative name. Otherwise, it will get a unique number identifier.

E-mail address: If provided, a link to the results is sent to you by e-mail. This is useful when submitting very long jobs: calculation time depends on the number of sequences, their length and the motif length range, and double-strand search mode requires a longer running time than single-strand mode. If you choose not to provide an e-mail address, it is recommended to bookmark the results page.

Results

The DRIMust motif search process is divided into two phases. In the first phase, DRIMust searches for k-mers that are over-represented at the top of the input sequence list. In the second phase, DRIMust heuristically expands the most promising k-mers and creates motifs represented by PSSMs.
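As a rough illustration of the input to the first phase, the Python sketch below records, for every k-mer, which sequences of the ranked list contain it. This is a deliberate simplification: DRIMust uses a suffix tree, whereas this sketch uses a plain dictionary, and the function name `kmer_occurrence_vectors` is hypothetical.

```python
from collections import defaultdict


def kmer_occurrence_vectors(sequences, k):
    """Map each k-mer to a 0/1 list marking which ranked sequences contain it.

    `sequences` is assumed to be ordered as in the ranked input list.
    """
    vectors = defaultdict(lambda: [0] * len(sequences))
    for index, seq in enumerate(sequences):
        for start in range(len(seq) - k + 1):
            vectors[seq[start:start + k]][index] = 1
    return dict(vectors)


ranked = ["ACGT", "CGTA", "TTTT"]
print(kmer_occurrence_vectors(ranked, 3)["CGT"])  # [1, 1, 0]
```

A k-mer whose 1s cluster near the start of its vector is a candidate for enrichment at the top of the list.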
The significant motifs can be viewed on the results page at three levels of detail:

PSSM motifs view
The PSSM motifs are displayed in a graphical representation using the WebLogo software.
The P-value of the motif is displayed above each logo.
There is an option to download high-resolution printable versions of the logos in PNG or PDF formats.
Motif occurrences in the input sequences
The occurrences of each motif are graphically represented in two views:
- Occurrences alignment: a presentation of the PSSM motif occurrences in the input sequences, aligned to each other. The occurrences are colored by the logo color scheme and the flanking positions are colored in black. Each occurrence is identified by the title of the sequence in which it occurs, the index of the sequence in the query list, the starting position of the occurrence in the sequence and the strand (in double-strand search mode only).
- Occurrences distribution: a schematic presentation of the distribution of the motif occurrences in the input sequences. Each line represents an input sequence (order as in the original list provided by the user), where a bold line is a sequence containing the motif. The occurrences are represented by colored blocks (fuchsia for the occurrences in the plus strand and blue for the occurrences in the minus strand). Placing the cursor on a motif occurrence reveals the title of the sequence in which it occurs, the occurrence string, the starting position of the occurrence in the sequence and the strand (in double-strand search mode only).
K-mers view
At the top of the page, it is possible to view the significant exact k-mers found by DRIMust. The exact k-mers table can also be downloaded as a text file.
Table information:

K-mer: The motif string color-coded by the logo color scheme.
In double-strand search mode, the reverse complement k-mer is also provided.

P-value: The value presented is the mHG score, corrected for multiple testing, which is a tight bound for the P-value (P-value ≤ corrected mHG score).
For more information about the mHG statistics, please refer to: Eden et al. (2007).

N: Total number of input sequences.

B: Total number of sequences containing the motif.

n: The index at which dividing the input list into target (the top n sequences) and background (the rest), as determined by the mHG statistics, gives the optimal enrichment of the motif at the top of the list.

b: The number of sequences containing the motif among the n top sequences.

Enrichment: Measures the extent to which the motif is concentrated at the top of the list compared to the whole list. Defined as: (b/n) / (B/N).
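The quantities N, B, n, b and the enrichment can be tied together with a small Python sketch of the uncorrected mHG score, following the definition in Eden et al. (2007): the minimum, over all cutoffs n, of the hypergeometric tail probability of seeing at least b motif-containing sequences among the top n. The function names are hypothetical, the multiple-testing correction applied by DRIMust is not included, and this direct computation is far slower than what a production tool would use.

```python
from math import comb


def hypergeom_tail(N, B, n, b):
    """P(X >= b) when drawing n of N sequences, B of which contain the motif."""
    favourable = sum(comb(B, i) * comb(N - B, n - i)
                     for i in range(b, min(n, B) + 1))
    return favourable / comb(N, n)


def mhg(occurrence):
    """Uncorrected mHG score of a ranked 0/1 occurrence list.

    Returns (score, n, b), where n is the cutoff that minimizes the tail.
    """
    N, B = len(occurrence), sum(occurrence)
    best = (1.0, 0, 0)
    b = 0
    for n in range(1, N):
        b += occurrence[n - 1]
        p = hypergeom_tail(N, B, n, b)
        if p < best[0]:
            best = (p, n, b)
    return best


def enrichment(N, B, n, b):
    """(b/n) / (B/N), as defined in the table above."""
    return (b / n) / (B / N)


# Ten ranked sequences; the motif occurs in the top three only.
score, n, b = mhg([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
print(n, b, score, enrichment(10, 3, n, b))
```

In this toy example the optimal cutoff is n = 3 with b = 3, giving an uncorrected score of 1/120 and an enrichment of about 3.33.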

* Please note that the results are kept on our server for one month.

Examples and sample data

Ranked list - single-strand search mode

The dataset in this example contains RNA sequences bound to the human pumilio 2 (PUM2) RNA-binding protein, obtained by the PAR-CLIP technique. The list comprises 9995 sequences (each of length 100), ranked according to cluster abundance, as published by Hafner et al., 2010 [1]. DRIMust was run in single-strand search mode and the rest of the parameters were set to default. DRIMust found one motif with a P-value of 4.9e-394, which is the experimentally verified PUM2 consensus motif [1].

1. Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, Rothballer A, Ascano M Jr, Jungkamp AC, Munschauer M, Ulrich A, Wardle GS, Dewell S, Zavolan M, Tuschl T (2010) Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell, 141, 129-141.

Download sample data
View results

Ranked list - double-strand search mode

In this example, 8245 Hoxa2-binding regions from the ChIP-seq experiment by Donaldson et al., 2012 [2] were analyzed. The binding regions were defined by Donaldson et al., 2012 [2] based on the summit region coordinates (200 bp centred upon the MACS-defined summit). The DNA sequences were ranked according to their binding P-values (as defined by Donaldson et al.). DRIMust was run using the double-strand search mode and the rest of the parameters were set to default. DRIMust found one motif with a P-value of 1.70e-80, which is the known Hoxa2 consensus motif [2].

2. Donaldson IJ, Amin S, Hensman JJ, Kutejova E, Rattray M, Lawrence N, Hayes A, Ward CM, Bobola N. Genome-wide occupancy links Hoxa2 to Wnt-β-catenin signaling in mouse embryonic development. Nucleic Acids Res. 2012; 40:3990-4001.

Download sample data
View results

Target and background lists - double-strand search mode

The following example comprises TP53 high-confidence binding sites reported by Smeenk et al., 2008 [3], containing 1546 loci in the human genome (target set). The sequences in the target set contain 200 bp upstream and downstream of the proposed binding site. The background set contains 1546 random sequences taken arbitrarily from the human genome. DRIMust was run using the double-strand search mode, with target and background lists, and the rest of the parameters were set to default. DRIMust found one motif with a P-value of 2.22e-266, which is the TP53 consensus motif [3].

3. Riley T, Sontag E, Chen P, Levine A. Transcriptional control of human p53-regulated genes. Nat. Rev. Mol. Cell Biol. 2008;9:402-412.

Download target list Download background list
View results