The RF MotifDiscovery module identifies significantly enriched sequence motifs, such as consensus sequences for RNA-binding proteins or RNA post-transcriptional modifications, in peaks called by RF PeakCall from RNA immunoprecipitation experiments.

Important

This is a beta version. As such, it may be subject to changes and improvements.

Usage

To list the required parameters, simply type:

$ rf-motifdiscovery -h
Parameter Type Description
-p or --processors int Number of processors (threads) to use for shuffling (Default: 1)
Note: this parameter has no effect when specified without -s (or --shuffle)
-b or --peaks string Peaks BED file (mandatory)
-nb or --negative-peaks string A BED file containing negative peak sequences (optional)
Note: when no negative peaks file is specified, a set of negative sequences will be generated by -np (or --neg-samplings) rounds of random sampling from reference transcripts, or by random shuffling if -s (or --shuffle) has been specified
-f or --fasta string A FASTA file containing the reference transcript sequences (mandatory)
-o or --output-dir string Output directory for writing counts in RC (RNA Count) format (Default: rf_motifdiscovery/)
-ow or --overwrite Overwrites the output directory if it already exists
-w or --window int Size of the window, centered on the center of each peak, in which motif discovery should be performed (≥3, Default: 50)
-np or --neg-samplings int Number of negative sequences to generate/sample for each peak (Default: 20)
-s or --shuffle Negative sequences will be generated by randomly shuffling peak sequences
Note: default is to sample --neg-samplings random windows from reference transcripts, for each peak in the dataset
-ns or --nuc-shuffling Performs random shuffling of nucleotides without preserving dinucleotide frequencies
-k or --kmer int K-mer size (≥4, Default: 5)
-v or --pvalue float P-value threshold to consider an enrichment significant (0-1, Default: 1e-3)
-nm or --n-motifs int Maximum number of motifs to report (≥1, Default: 3)
-ops or --one-per-seq K-mers are counted only once per peak
-t or --tollerance float Fractional tolerance to consider a position degenerate (0-1, Default: 0.2)
-sk or --save-kmer-table Saves the list of k-mers, and their associated p-values


Understanding the algorithm

The algorithm starts by identifying significantly enriched k-mers of length --kmer in peak sequences. By default, the analysis is limited to a window of --window nt, centered on the center of each peak. Significance is assessed using a Fisher test, by comparing the number of occurrences of each k-mer within peak sequences to its number of occurrences within a set of negative sequences. By default, all occurrences of the same k-mer within each peak are counted; if the --one-per-seq parameter is specified, each k-mer is counted only once per peak.

A user-provided list of negative peaks can be supplied in BED format via the --negative-peaks parameter. If no negative peaks file is provided, up to --neg-samplings negative sequences are randomly sampled from the reference transcripts for each peak in the dataset. If the --shuffle parameter is specified, negative sequences are instead generated by --neg-samplings random shufflings of the original peak sequences. By default, shuffling preserves the original dinucleotide frequencies; this can be overridden via the --nuc-shuffling parameter, which enables fully random shuffling.
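
The k-mer counting and significance test described above can be sketched in a few lines of Python. This is only an illustration, not the module's actual code: the function names are hypothetical, the "Fisher test" is assumed to be a one-sided Fisher's exact test, and the 2x2 table (occurrences of a k-mer versus occurrences of all other k-mers, in peaks versus negatives) is an assumption, since the exact contingency table is not described here.

```python
from collections import Counter
from math import comb


def window_around_center(seq, start, end, window=50):
    """Return the slice of `seq` of length `window` centered on the peak midpoint."""
    center = (start + end) // 2
    left = max(0, center - window // 2)
    return seq[left:left + window]


def count_kmers(seqs, k=5, one_per_seq=False):
    """Count k-mer occurrences (overlapping ones included) across sequences.

    With one_per_seq=True, each k-mer contributes at most 1 per sequence.
    """
    counts = Counter()
    for seq in seqs:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        counts.update(set(kmers) if one_per_seq else kmers)
    return counts


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Probability of observing a top-left cell >= a with all margins fixed.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(row1, x) * comb(n - row1, col1 - x) for x in range(a, upper + 1))
    return tail / comb(n, col1)


def enriched_kmers(peak_seqs, neg_seqs, k=5, pvalue=1e-3, one_per_seq=False):
    """Return {kmer: p-value} for k-mers enriched in peaks versus negatives.

    Assumption: each k-mer is tested against the pool of all other k-mers.
    """
    pos = count_kmers(peak_seqs, k, one_per_seq)
    neg = count_kmers(neg_seqs, k, one_per_seq)
    pos_total, neg_total = sum(pos.values()), sum(neg.values())
    hits = {}
    for kmer, a in pos.items():
        c = neg.get(kmer, 0)
        p = fisher_greater(a, pos_total - a, c, neg_total - c)
        if p < pvalue:
            hits[kmer] = p
    return hits
```

For instance, `enriched_kmers(["GGACUGGACU"] * 20, ["AUAUAUAUAU"] * 20)` reports GGACU among the enriched k-mers, because it occurs repeatedly in the toy peak sequences and never in the toy negatives.
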
Once significantly enriched k-mers have been identified, motifs are built as follows:

  1. The most significant k-mer is selected (in case of multiple k-mers having the same p-value, the one with the highest enrichment is selected)
  2. Significant k-mers within a maximum Hamming distance of 1 are identified
  3. A core motif is built by using this initial set of k-mers
  4. The core motif is extended by identifying significant k-mers overlapping by at least 75% with the core motif
  5. This expanded set of sequences is then used to build a Position Frequency Matrix for the motif
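
A minimal Python sketch of steps 1, 2, 3 and 5 is given below, under stated assumptions. The function names and the `pvalues`/`enrichment` dictionaries are hypothetical, each k-mer is assumed to contribute a single count per position to the matrix (the actual weighting is not described here), and the extension in step 4 is left out because the details of the overlap rule are not specified.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def core_kmers(pvalues, enrichment):
    """Steps 1-2: pick the seed k-mer, then collect k-mers within Hamming distance 1.

    Ties on p-value are broken by the highest enrichment.
    """
    seed = min(pvalues, key=lambda kmer: (pvalues[kmer], -enrichment[kmer]))
    return seed, [kmer for kmer in pvalues if hamming(kmer, seed) <= 1]


def position_frequency_matrix(kmers, alphabet="ACGU"):
    """Steps 3 and 5 (simplified): per-position base counts over aligned k-mers.

    Assumption: all k-mers have the same length and are already aligned.
    """
    length = len(kmers[0])
    pfm = [{base: 0 for base in alphabet} for _ in range(length)]
    for kmer in kmers:
        for pos, base in enumerate(kmer):
            pfm[pos][base] += 1
    return pfm


# Toy input: significant k-mers with made-up p-values and enrichments
pvalues = {"GGACU": 1e-10, "GAACU": 1e-6, "AGACU": 1e-10, "UUUUU": 1e-4}
enrichment = {"GGACU": 8.0, "GAACU": 3.0, "AGACU": 5.0, "UUUUU": 2.0}

seed, core = core_kmers(pvalues, enrichment)
pfm = position_frequency_matrix(core)
```

In this toy example GGACU and AGACU share the lowest p-value, so the tie is resolved in favor of GGACU, which has the higher enrichment; UUUUU is excluded from the core because it differs from the seed at four positions.
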

Significantly enriched motifs will be reported as Position Frequency Matrices (PFMs), in TRANSFAC format:
ID Motif_DRACW
BF RF_MotifDiscovery
P0 A C G U
1 34 9 23 34 W
2 35 0 36 29 D
3 49 0 51 0 R
4 100 0 0 0 A
5 0 65 0 35 C
6 35 24 0 40 W
7 21 20 32 28 K
XX
//
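
For downstream processing, a record like the one above can be read with a few lines of Python. The parser below is a hedged sketch rather than a full TRANSFAC reader: the function names are hypothetical, and it only handles the fields shown in this example (ID, the P0 header, numbered count rows, and the XX and // separators).

```python
RECORD = """\
ID Motif_DRACW
BF RF_MotifDiscovery
P0 A C G U
1 34 9 23 34 W
2 35 0 36 29 D
3 49 0 51 0 R
4 100 0 0 0 A
5 0 65 0 35 C
6 35 24 0 40 W
7 21 20 32 28 K
XX
//
"""


def parse_transfac(text):
    """Parse a single TRANSFAC-style record into (motif_id, alphabet, rows).

    Each row is a dict mapping a base to its count at that position.
    """
    motif_id, alphabet, rows = None, [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] in ("XX", "//", "BF"):
            continue
        if fields[0] == "ID":
            motif_id = fields[1]
        elif fields[0] in ("P0", "PO"):
            alphabet = fields[1:]
        elif fields[0].isdigit():
            # Ignore the trailing consensus letter, keep one count per base
            counts = [float(x) for x in fields[1:1 + len(alphabet)]]
            rows.append(dict(zip(alphabet, counts)))
    return motif_id, alphabet, rows


def to_frequencies(rows):
    """Convert per-position counts to fractions (all-zero rows stay at 0.0)."""
    freqs = []
    for row in rows:
        total = sum(row.values())
        freqs.append({b: (c / total if total else 0.0) for b, c in row.items()})
    return freqs


motif_id, alphabet, rows = parse_transfac(RECORD)
freqs = to_frequencies(rows)
```

Normalizing each row by its own total is a deliberate choice here, since the counts in a row need not add up to exactly 100.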

Currently, RF MotifDiscovery does not generate sequence logos. However, output PFMs can be directly used as input for WebLogo 3, either via the web interface, or locally:

$ weblogo -a RNA -s large -c classic --format PDF < motif_DRACW.mat > motif_DRACW.pdf