The RF MotifDiscovery module identifies significantly enriched sequence motifs, such as RNA-binding protein or RNA post-transcriptional modification consensus sequences, in peaks called by RF PeakCall from RNA immunoprecipitation experiments.

Important

This is a beta version. As such, it may be subject to changes and improvements.

Usage

To list the required parameters, simply type:

$ rf-motifdiscovery -h

Parameter	Type	Description
-p or --processors	int	Number of processors (threads) to use for shuffling (Default: 1) Note: this parameter has no effect when specified without `-s` (or `--shuffle`)
-b or --peaks	string	Peaks BED file (mandatory)
-nb or --negative-peaks	string	A BED file containing negative peak sequences (optional) Note: when no negative peaks file is specified, a set of negative sequences will be generated by `-np` (or `--neg-samplings`) rounds of random sampling from reference transcripts, or of random shuffling if `-s` (or `--shuffle`) has been specified
-f or --fasta	string	A FASTA file containing the reference transcript sequences (mandatory)
-o or --output-dir	string	Output directory (Default: rf_motifdiscovery/)
-ow or --overwrite		Overwrites the output directory if it already exists
-w or --window	int	Size of the window, centered on the center of each peak, in which motif discovery should be performed (≥3, Default: 50)
-np or --neg-samplings	int	Number of negative sequences to generate/sample for each peak (Default: 20)
-s or --shuffle		Negative sequences will be generated by randomly shuffling peak sequences Note: default is to sample `--neg-samplings` random windows from reference transcripts, for each peak in the dataset
-ns or --nuc-shuffling		Performs random shuffling of nucleotides without preserving dinucleotide frequencies
-k or --kmer	int	K-mer size (≥4, Default: 5)
-v or --pvalue	float	P-value threshold to consider an enrichment significant (0-1, Default: 1e-3)
-nm or --n-motifs	int	Maximum number of motifs to report (≥1, Default: 3)
-ops or --one-per-seq		K-mers are counted only once per peak
-t or --tollerance	float	Fractional tolerance to consider a position degenerate (0-1, Default: 0.2)
-sk or --save-kmer-table		Saves the list of k-mers, and their associated p-values

Understanding the algorithm

The algorithm starts by identifying significantly enriched k-mers of length --kmer in peak sequences. By default, the analysis is limited to a --window nt-long window centered on the center of each peak. Significance is assessed using a Fisher test, comparing the number of occurrences of each k-mer within peak sequences to that within a set of negative sequences. By default, all occurrences of the same k-mer within each peak are counted; if the --one-per-seq parameter has been specified, each k-mer is instead counted only once per peak.

A user-defined list of negative peaks can be supplied in BED format via the --negative-peaks parameter. If no negative peaks file is provided, up to --neg-samplings negative sequences are randomly sampled from the reference transcripts for each peak in the dataset. If the --shuffle parameter is specified, negative sequences are instead generated by --neg-samplings random shufflings of the original peak sequences. By default, shuffling preserves the original dinucleotide frequencies; this can be overridden via the --nuc-shuffling parameter, which enables fully random shuffling.
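The enrichment step can be illustrated with a minimal Python sketch. This is not the module's actual implementation: the function names (`central_window`, `count_kmers`, `fisher_right_tail`, `enriched_kmers`) are hypothetical, the one-sided direction of the test is an assumption, and the layout of the 2x2 table (occurrences of a k-mer versus occurrences of all other k-mers, in peaks versus negatives) is only one plausible way to set up the comparison. Negative sequences are assumed to be already window-sized.

```python
from collections import Counter
from math import comb


def central_window(seq, window):
    """Return the window-nt stretch centered on the middle of seq."""
    if len(seq) <= window:
        return seq
    start = (len(seq) - window) // 2
    return seq[start:start + window]


def count_kmers(seqs, k, one_per_seq=False):
    """Count k-mer occurrences; with one_per_seq, at most once per sequence."""
    counts = Counter()
    for seq in seqs:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        counts.update(set(kmers) if one_per_seq else kmers)
    return counts


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables whose top-left
    cell is >= a, i.e. tests for over-representation in the first row.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    tail = sum(comb(row1, x) * comb(total - row1, col1 - x)
               for x in range(a, min(row1, col1) + 1))
    return tail / comb(total, col1)


def enriched_kmers(peaks, negatives, k=5, window=50, pvalue=1e-3,
                   one_per_seq=False):
    """Return {k-mer: p-value} for k-mers over-represented in peaks."""
    pos = count_kmers([central_window(s, window) for s in peaks], k, one_per_seq)
    neg = count_kmers(negatives, k, one_per_seq)
    pos_total, neg_total = sum(pos.values()), sum(neg.values())
    hits = {}
    for kmer, a in pos.items():
        c = neg.get(kmer, 0)
        p = fisher_right_tail(a, pos_total - a, c, neg_total - c)
        if p <= pvalue:
            hits[kmer] = p
    return hits


# Toy data (made up for illustration only)
peaks = ["UUGGACUUU"] * 20
negatives = ["UUUUUUUUU"] * 20
hits = enriched_kmers(peaks, negatives, k=5, window=9)
print(sorted(hits))
```

In this sketch, passing `one_per_seq=True` mirrors the effect described for --one-per-seq: repeated occurrences of a k-mer within the same peak no longer inflate its count.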
Once significantly enriched k-mers have been identified, motifs are built as follows:

The most significant k-mer is selected (in case of multiple k-mers having the same p-value, the one with the highest enrichment is selected)
Significant k-mers within a maximum Hamming distance of 1 from the selected k-mer are identified
A core motif is built by using this initial set of k-mers
The core motif is extended by identifying significant k-mers overlapping by at least 75% with the core motif
This expanded set of sequences is then used to build a Position Frequency Matrix for the motif
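The first steps of the procedure above (seed selection, Hamming-distance neighbors, and matrix construction) can be sketched in Python as follows. All function names and the toy p-values and enrichments are hypothetical, and the sketch assumes that each k-mer contributes a single count per position, which may differ from how the module weights k-mers. The extension step based on k-mers overlapping the core motif is deliberately left out.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def pick_seed(pvalues, enrichment):
    """Most significant k-mer; p-value ties are broken by highest enrichment."""
    return min(pvalues, key=lambda kmer: (pvalues[kmer], -enrichment[kmer]))


def core_kmers(seed, significant):
    """Significant k-mers within Hamming distance 1 of the seed (seed included)."""
    return [kmer for kmer in significant if hamming(seed, kmer) <= 1]


def position_frequency_matrix(kmers, alphabet="ACGU"):
    """Per-position base counts over aligned, equal-length k-mers."""
    pfm = [dict.fromkeys(alphabet, 0) for _ in range(len(kmers[0]))]
    for kmer in kmers:
        for pos, base in enumerate(kmer):
            pfm[pos][base] += 1
    return pfm


# Toy values (made up for illustration only)
pvalues = {"GGACU": 1e-9, "GAACU": 1e-9, "AGACU": 1e-6, "UUUUU": 1e-4}
enrichment = {"GGACU": 8.0, "GAACU": 5.0, "AGACU": 3.0, "UUUUU": 2.0}

seed = pick_seed(pvalues, enrichment)
core = core_kmers(seed, pvalues)
pfm = position_frequency_matrix(core)
for position, row in enumerate(pfm, start=1):
    print(position, row)
```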

Significantly enriched motifs will be reported as Position Frequency Matrices (PFMs), in TRANSFAC format:

ID Motif_DRACW
BF RF_MotifDiscovery
P0 A C G U
1 34 9 23 34 W
2 35 0 36 29 D
3 49 0 51 0 R
4 100 0 0 0 A
5 0 65 0 35 C
6 35 24 0 40 W
7 21 20 32 28 K
XX
//
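
For downstream processing, a record like the one above can be read with a few lines of Python. This is a minimal sketch that only handles the fields shown in the example (ID, BF, P0, numbered rows, XX and //); the function names `parse_transfac` and `to_fractions` are hypothetical and are not part of RF MotifDiscovery.

```python
RECORD = """\
ID Motif_DRACW
BF RF_MotifDiscovery
P0 A C G U
1 34 9 23 34 W
2 35 0 36 29 D
3 49 0 51 0 R
4 100 0 0 0 A
5 0 65 0 35 C
6 35 24 0 40 W
7 21 20 32 28 K
XX
//
"""


def parse_transfac(text):
    """Parse a single TRANSFAC record into (motif_id, alphabet, rows)."""
    motif_id, alphabet, rows = None, [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] in ("XX", "//", "BF"):
            continue
        if fields[0] == "ID":
            motif_id = fields[1]
        elif fields[0] in ("P0", "PO"):
            alphabet = fields[1:]
        elif fields[0].isdigit():
            # Columns after the position index follow the alphabet order;
            # the trailing consensus letter is ignored.
            counts = [float(x) for x in fields[1:1 + len(alphabet)]]
            rows.append(dict(zip(alphabet, counts)))
    return motif_id, alphabet, rows


def to_fractions(rows):
    """Convert per-position counts to fractions that sum to 1."""
    fractions = []
    for row in rows:
        total = sum(row.values()) or 1.0
        fractions.append({base: n / total for base, n in row.items()})
    return fractions


motif_id, alphabet, rows = parse_transfac(RECORD)
fractions = to_fractions(rows)
print(motif_id, alphabet, len(rows))
```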

Currently, RF MotifDiscovery does not generate sequence logos. However, output PFMs can be directly used as input for WebLogo 3, either via the web interface, or locally:

$ weblogo -a RNA -s large -c classic --format PDF < motif_DRACW.mat > motif_DRACW.pdf