The RF MotifDiscovery module identifies significantly enriched sequence motifs, such as RNA-binding protein or RNA post-transcriptional modification consensus sequences, in peaks called by RF PeakCall from RNA immunoprecipitation experiments.
Important
This is a beta version. As such, it may be subject to changes and improvements.
Usage
To list the required parameters, simply type:
$ rf-motifdiscovery -h
Parameter | Type | Description |
---|---|---|
-p or --processors | int | Number of processors (threads) to use for shuffling (Default: 1) Note: this parameter has no effect when specified without -s (or --shuffle ) |
-b or --peaks | string | Peaks BED file (mandatory) |
-nb or --negative-peaks | string | A BED file containing negative peak sequences (optional) Note: when no negative peaks file is specified, a set of negative sequences will be generated by -np (or --neg-samplings ) rounds of random sampling from reference transcripts, or of random shuffling if -s (or --shuffle ) has been specified |
-f or --fasta | string | A FASTA file containing the reference transcript sequences (mandatory) |
-o or --output-dir | string | Output directory for writing counts in RC (RNA Count) format (Default: rf_motifdiscovery/) |
-ow or --overwrite | | Overwrites the output directory if it already exists |
-w or --window | int | Size of the window, centered on the center of each peak, in which motif discovery should be performed (≥3, Default: 50) |
-np or --neg-samplings | int | Number of negative sequences to generate/sample for each peak (Default: 20) |
-s or --shuffle | | Negative sequences will be generated by randomly shuffling peak sequences Note: default is to sample --neg-samplings random windows from reference transcripts, for each peak in the dataset |
-ns or --nuc-shuffling | | Performs random shuffling of nucleotides without preserving dinucleotide frequencies |
-k or --kmer | int | K-mer size (≥4, Default: 5) |
-v or --pvalue | float | P-value threshold to consider an enrichment significant (0-1, Default: 1e-3) |
-nm or --n-motifs | int | Maximum number of motifs to report (≥1, Default: 3) |
-ops or --one-per-seq | | K-mers are counted only once per peak |
-t or --tollerance | float | Fractional tolerance to consider a position degenerate (0-1, Default: 0.2) |
-sk or --save-kmer-table | | Saves the list of k-mers and their associated p-values |
Understanding the algorithm
The algorithm starts by identifying significantly enriched k-mers of length --kmer in peak sequences. By default, the analysis is limited to a --window nt-long window centered on the center of each peak. Significance is assessed using a Fisher test that compares the number of occurrences of each k-mer in peak sequences with that in a set of negative sequences. By default, all occurrences of the same k-mer within each peak are counted; if the --one-per-seq parameter is specified, each k-mer is counted only once per peak. A user-provided list of negative peaks can be supplied in BED format via the --negative-peaks parameter. If no negative peaks file is provided, then by default, for each peak in the dataset, up to --neg-samplings negative sequences are randomly sampled from the reference transcripts. If the --shuffle parameter is specified, however, negative sequences are instead generated by --neg-samplings random shufflings of the original peak sequences. By default, shuffling preserves the original dinucleotide frequencies; this can be overridden via the --nuc-shuffling parameter, which enables fully random shuffling.
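The k-mer enrichment step described above can be illustrated with a minimal Python sketch. It is not the module's actual implementation: the function names (`count_kmers`, `fisher_greater`, `enriched_kmers`) are hypothetical, and the layout of the 2x2 contingency table and the use of a one-sided test are assumptions, since the exact form of the Fisher test is not specified here.

```python
from collections import Counter
from math import comb


def count_kmers(seqs, k, one_per_seq=False):
    """Count (overlapping) k-mer occurrences across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        # With one_per_seq, each k-mer contributes at most once per sequence
        counts.update(set(kmers) if one_per_seq else kmers)
    return counts


def fisher_greater(a, b, c, d):
    """One-sided Fisher's exact test on the 2x2 table [[a, b], [c, d]].

    Returns P(X >= a) under the hypergeometric null distribution.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    favorable = sum(comb(col1, x) * comb(total - col1, row1 - x)
                    for x in range(a, upper + 1))
    return favorable / comb(total, row1)


def enriched_kmers(peaks, negatives, k=5, pvalue=1e-3, one_per_seq=False):
    """Return {k-mer: p-value} for k-mers enriched in peaks vs. negatives.

    Assumed table layout: rows are peak / negative sequences, columns are
    occurrences of this k-mer / occurrences of all other k-mers.
    """
    pos = count_kmers(peaks, k, one_per_seq)
    neg = count_kmers(negatives, k, one_per_seq)
    pos_total, neg_total = sum(pos.values()), sum(neg.values())
    hits = {}
    for kmer, a in pos.items():
        c = neg.get(kmer, 0)
        p = fisher_greater(a, pos_total - a, c, neg_total - c)
        if p <= pvalue:
            hits[kmer] = p
    return hits
```

In this sketch, `peaks` and `negatives` are plain lists of RNA strings that have already been trimmed to the analysis window; window extraction and the generation of negative sequences are left out.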
Once significantly enriched k-mers have been identified, motifs are built as follows:
- The most significant k-mer is selected (in case of multiple k-mers having the same p-value, the one with the highest enrichment is selected)
- Significant k-mers within a maximum Hamming distance of 1 from the selected k-mer are identified
- A core motif is built by using this initial set of k-mers
- The core motif is extended by identifying significant k-mers overlapping by at least 75% with the core motif
- This expanded set of sequences is then used to build a Position Frequency Matrix for the motif
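The first steps of motif building (seed selection, Hamming-distance neighbours, and the PFM) can be sketched as follows. This is an illustrative approximation under stated assumptions, not the module's code: the function names are hypothetical, weighting each k-mer by an occurrence count is an assumption, and the extension step based on 75% overlap is deliberately omitted.

```python
from collections import Counter

ALPHABET = "ACGU"


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def core_kmers(pvalues, enrichment):
    """Pick the seed k-mer and the significant k-mers within Hamming distance 1.

    pvalues and enrichment map each significant k-mer to its p-value and
    enrichment; ties on p-value are broken by the highest enrichment.
    """
    seed = min(pvalues, key=lambda kmer: (pvalues[kmer], -enrichment[kmer]))
    return seed, [kmer for kmer in pvalues if hamming(seed, kmer) <= 1]


def position_frequency_matrix(kmers, weights):
    """Build a PFM (one [A, C, G, U] row per position) from aligned k-mers.

    Assumption: each k-mer contributes its weight (e.g. its occurrence
    count in peaks) to the column of the base it carries at each position.
    """
    rows = []
    for i in range(len(kmers[0])):
        column = Counter()
        for kmer in kmers:
            column[kmer[i]] += weights[kmer]
        rows.append([column[base] for base in ALPHABET])
    return rows


# Toy input: values are made up purely for illustration
pvalues = {"GGACU": 1e-9, "GAACU": 1e-6, "AGACU": 1e-9, "UUUUU": 1e-4}
enrichment = {"GGACU": 4.0, "GAACU": 2.0, "AGACU": 3.0, "UUUUU": 2.0}
occurrences = {"GGACU": 30, "GAACU": 10, "AGACU": 20, "UUUUU": 5}

seed, core = core_kmers(pvalues, enrichment)
pfm = position_frequency_matrix(core, occurrences)
```

Here `GGACU` and `AGACU` share the lowest p-value, so the one with the higher enrichment becomes the seed, and `UUUUU` is excluded from the core because it differs from the seed at more than one position.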
Significantly enriched motifs will be reported as Position Frequency Matrices (PFMs), in TRANSFAC format:
ID Motif_DRACW
BF RF_MotifDiscovery
P0 A C G U
1 34 9 23 34 W
2 35 0 36 29 D
3 49 0 51 0 R
4 100 0 0 0 A
5 0 65 0 35 C
6 35 24 0 40 W
7 21 20 32 28 K
XX
//
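For downstream processing, a record like the one above can be read with a few lines of Python. The sketch below is a hypothetical helper, not part of RNA Framework; it assumes a single record per input and only handles the fields shown in this example.

```python
RECORD = """\
ID Motif_DRACW
BF RF_MotifDiscovery
P0 A C G U
1 34 9 23 34 W
2 35 0 36 29 D
3 49 0 51 0 R
4 100 0 0 0 A
5 0 65 0 35 C
6 35 24 0 40 W
7 21 20 32 28 K
XX
//
"""


def parse_transfac(text):
    """Parse one TRANSFAC-style record into (motif_id, alphabet, rows)."""
    motif_id, alphabet, rows = None, [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] in ("XX", "//", "BF"):
            continue  # separators and fields this sketch does not use
        if fields[0] == "ID":
            motif_id = fields[1]
        elif fields[0] in ("P0", "PO"):
            alphabet = fields[1:]  # column order of the matrix
        elif fields[0].isdigit():
            # Position line: index, one value per base, consensus letter
            rows.append([float(x) for x in fields[1:1 + len(alphabet)]])
    return motif_id, alphabet, rows


def to_frequencies(rows):
    """Normalize each PFM row so that it sums to 1 (all-zero rows stay zero)."""
    return [[value / (sum(row) or 1) for value in row] for row in rows]


motif_id, alphabet, rows = parse_transfac(RECORD)
frequencies = to_frequencies(rows)
```

Each row is normalized by its own sum, so small rounding differences between rows do not affect the resulting frequencies.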
Currently, RF MotifDiscovery does not generate sequence logos. However, output PFMs can be directly used as input for WebLogo 3, either via the web interface, or locally:
weblogo -a RNA -s large -c classic --format PDF < motif_DRACW.mat > motif_DRACW.pdf