The RF StructExtract module allows extracting (portions of) individual structure elements from a structure model generated using rf-fold, on the basis of specific selection criteria such as size, median reactivity, median Shannon entropy, presence of a multiway junction, or thermodynamic stability higher than expected by chance.
Usage
To list the required parameters, simply type:
$ rf-structextract -h
Parameter | Type | Description |
---|---|---|
-p or --processors | int | Number of processors (threads) to use (Default: 1) |
-ro or --rffoldOut | string | Path to the output folder generated by rf-fold, containing the structures to be parsed |
-xf or --xmlFolder | string | Path to the output folder generated by rf-norm, containing the reactivities in XML format |
-o or --output | string | Output folder (Default: rf_structextract/) |
-ow or --overwrite | | Overwrites the output directory if it already exists |
-w or --winSize | int | Window size (in nt) for calculating the median reactivity and Shannon entropy (Default: 50) |
-ml or --minTranscriptLen | int | Low reactivity - low Shannon calculation will be skipped for transcripts below this length (Default: 500) |
-ir or --ignoreReact | | Skips low reactivity evaluation |
-is or --ignoreShannon | | Skips low Shannon evaluation |
-mv or --minValueFrac | float | Windows in which less than this fraction of bases is covered will be set to NaN (Default: 0.4 [40%]) |
-mb or --minBelowMedian | float | Structure elements having less than this fraction of bases whose Shannon entropy and reactivity are below the global transcript median will be discarded (Default: 0.7 [70%]) |
-mp or --minPairedFrac | float | Structure elements having less than this fraction of paired bases will be discarded (Default: 0.45 [45%]) |
-mm or --minMotifLen | int | Structure elements below this length will be discarded (Default: 50) |
-xm or --maxMotifLen | int | Structure elements above this length will be discarded (Default: no limit) |
-xl or --maxLoopSize | int | Structure elements encompassing a loop larger than this number of bases will be discarded (Default: no limit) |
-mo or --multiwayOnly | | Only report structure elements encompassing multiway junctions |
-opf or --onePerFile | | Extracted structure elements belonging to the same transcript are reported in separate files |
-ee or --evalEnergy | | Only structures having a free energy significantly lower than expected by chance will be reported. Note #1: this is estimated by randomly shuffling the underlying sequence N times (where N is controlled via the --nShufflings parameter) and by calculating the probability associated with the corresponding Z-score. Note #2: this procedure will significantly slow down the analysis |
-v or --pvalue | float | P-value threshold for considering the energy of a structure significantly lower than expected by chance (0-1, Default: 0.05) |
-ns or --nShufflings | int | Number of times a sequence must be shuffled (>=1, Default: 100) |
-ds or --dinuclShuffle | | Sequences are shuffled taking care to preserve their dinucleotide frequencies (slower) |
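
The significance test behind --evalEnergy, --nShufflings and --pvalue can be sketched in Python as follows. This is a minimal illustration, not the module's code: `toy_energy` is a hypothetical stand-in for a real folding-energy calculation, the Z-score is converted to a probability under the assumption of a standard normal distribution, and only a plain (mononucleotide) shuffle is shown, not the dinucleotide-preserving one enabled by --dinuclShuffle.

```python
import math
import random
import statistics

# Hypothetical stand-in for a folding-energy calculation (NOT a real energy
# model): the sequence is folded back on itself as a single hairpin and each
# complementary pair contributes -1.
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def toy_energy(seq):
    half = len(seq) // 2
    return -float(sum((seq[i], seq[-1 - i]) in PAIRS for i in range(half)))


def shuffle_pvalue(seq, energy_fn, n_shufflings=100, seed=0):
    """Probability that a shuffled sequence is at least as stable as `seq`."""
    rng = random.Random(seed)
    observed = energy_fn(seq)
    shuffled = []
    for _ in range(n_shufflings):
        bases = list(seq)
        rng.shuffle(bases)  # mononucleotide shuffle only
        shuffled.append(energy_fn("".join(bases)))
    mean = statistics.fmean(shuffled)
    sd = statistics.pstdev(shuffled)
    if sd == 0:
        return 1.0  # no spread among shuffles: treat as not significant
    z = (observed - mean) / sd
    # Lower tail of the standard normal distribution, P(Z <= z)
    return 0.5 * math.erfc(-z / math.sqrt(2))


hairpin = "GGGGGGGGAAAACCCCCCCC"
p = shuffle_pvalue(hairpin, toy_energy, n_shufflings=100)
print(f"p = {p:.4g}; keep motif: {p < 0.05}")
```

A motif would be kept only when the returned probability is below the chosen --pvalue threshold; increasing the number of shufflings gives a more stable estimate of the mean and standard deviation at the cost of run time.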
Understanding the algorithm
The aim of the module is to extract high-confidence RNA structure elements that are more likely to be functionally relevant. The algorithm first identifies independently folded structural domains (i.e., regions of the transcript whose folding is independent of that of the rest of the transcript) and then, starting from the innermost loop, identifies the individual structure motifs by bidirectional extension. The extension is stopped when one or more of the following conditions occur, each controlled by a user-defined parameter:
- If the motif falls within a region of high Shannon entropy - high reactivity. Briefly, the smoothed Shannon entropy is calculated along the entire transcript by sliding a centered window of ± --winSize / 2 nucleotides in 1 nt increments and by calculating the median Shannon entropy within each window. If the fraction of bases in the window having non-NaN values is < --minValueFrac, the window median is set to NaN. The same operation is repeated to calculate a smoothed reactivity. When a structure motif is extracted, the algorithm compares the smoothed Shannon entropy and reactivity across all the bases encompassed by the motif to the median Shannon entropy and reactivity of the entire transcript. The motif is retained and the extension continued if the fraction of bases falling below the transcript's median is ≥ --minBelowMedian.
  It is essential to note that, in order to evaluate the Shannon entropy, the --rffoldOut directory passed to the module must contain the shannon/ folder. This folder is only generated when invoking the rf-fold module with the --shannon-entropy flag (more details can be found in the manual page of rf-fold). Evaluation of the Shannon entropy can be turned off by enabling the --ignoreShannon flag. Similarly, in order to evaluate the reactivity, the --xmlFolder of XML reactivity profiles must be provided. Evaluation of the reactivity can be turned off by enabling the --ignoreReact flag. Both Shannon entropy and reactivity evaluation are automatically skipped for transcripts shorter than --minTranscriptLen.
- If the fraction of base-paired positions in the motif is < --minPairedFrac
- If the motif is shorter than --minMotifLen (in which case the extension continues, if possible)
- If the motif is longer than --maxMotifLen
- If the motif encompasses a loop larger than --maxLoopSize (Note: in the case of junctions, the size of the loop is calculated as the number of unpaired residues residing in the junction loop)
- If the motif's free energy is not significantly lower than expected by chance. Briefly, if the --evalEnergy flag is enabled, the sequence of the motif is randomly shuffled --nShufflings times and the probability of obtaining by chance a structure having a free energy ≤ that of the original motif is calculated from the corresponding Z-score. If the probability is ≥ --pvalue, the motif is discarded. When the --dinuclShuffle flag is enabled, the sequence of the motif is shuffled in such a way that the dinucleotide frequencies are preserved.
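The purely structural criteria in the list above (paired fraction, motif length, loop size) can be illustrated on a dot-bracket string. This is a minimal Python sketch, not the module's implementation: the function names are hypothetical, pseudoknots are not handled, and the filters are applied once to a finished motif rather than during the bidirectional extension. Following the note above, the size of a junction loop is taken as the number of unpaired residues residing in the junction.

```python
def pair_table(db):
    """Map every paired position of a dot-bracket string to its partner."""
    stack, partner = [], {}
    for i, ch in enumerate(db):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            j = stack.pop()
            partner[i], partner[j] = j, i
    return partner


def loop_sizes(db):
    """Unpaired bases in each loop closed by a base pair (5' to 3' order)."""
    partner = pair_table(db)
    sizes = []
    for i in sorted(partner):
        j = partner[i]
        if j < i:
            continue  # visit each pair once, from its 5' side
        unpaired, branches, k = 0, 0, i + 1
        while k < j:
            if k in partner:  # enclosed helix: count it and jump past it
                branches += 1
                k = partner[k] + 1
            else:
                unpaired += 1
                k += 1
        if not (branches == 1 and unpaired == 0):  # skip stacked pairs
            sizes.append(unpaired)
    return sizes


def passes_structural_filters(db, min_paired_frac=0.45, min_motif_len=50,
                              max_motif_len=None, max_loop_size=None):
    """Apply the length, paired-fraction and loop-size criteria to one motif."""
    if len(db) < min_motif_len:
        return False
    if max_motif_len is not None and len(db) > max_motif_len:
        return False
    paired_frac = (db.count("(") + db.count(")")) / len(db)
    if paired_frac < min_paired_frac:
        return False
    if max_loop_size is not None and any(s > max_loop_size for s in loop_sizes(db)):
        return False
    return True


# A 40-nt three-way junction: one closing helix and two hairpins
example = ("((((" + "..." + "((((" + "...." + "))))" + ".."
           + "((((" + "...." + "))))" + "..." + "))))")
print(loop_sizes(example))
print(passes_structural_filters(example, min_motif_len=40, max_loop_size=6))
```

The defaults mirror those listed in the parameter table; the example motif is only 40 nt long, so it passes only when the minimum length is lowered, and it fails as soon as the maximum loop size is set below the size of its junction.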
In the example figure, the effect of smoothing reactivities and Shannon entropy is shown. The red dashed lines correspond to the median reactivity and median Shannon entropy along the entire transcript. The inset further shows an independently folded structural domain. The green dots mark the loops that represent the possible starting points for the bidirectional extension and motif extraction. In this example, only the two motifs colored in green (marked #1 and #2, respectively) will be reported, as they fall inside regions of low reactivity and low Shannon entropy. The base-pairs marked in red will not be part of motif #1 because, if they were included, the reactivity would exceed the global median reactivity for more than a fraction of 1 - --minBelowMedian of the motif's bases.
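The smoothing and median comparison illustrated by this example can be sketched in Python as follows. This is a hedged illustration under stated assumptions, not the module's code: function names are hypothetical, windows are simply truncated at the transcript ends, the transcript-wide median is taken over the raw non-NaN values, and a base counts as "low" only when both its smoothed reactivity and its smoothed Shannon entropy are below the respective medians (NaN positions never count as low).

```python
import math
import statistics


def smooth_median(values, win_size=50, min_value_frac=0.4):
    """Median in a centered window of +/- win_size/2 nt, sliding by 1 nt."""
    half = win_size // 2
    smoothed = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]  # truncated at the ends
        valid = [v for v in window if not math.isnan(v)]
        if not valid or len(valid) / len(window) < min_value_frac:
            smoothed.append(math.nan)  # too few covered bases in this window
        else:
            smoothed.append(statistics.median(valid))
    return smoothed


def raw_median(values):
    """Transcript-wide median, ignoring NaN positions."""
    return statistics.median([v for v in values if not math.isnan(v)])


def fraction_low(react, shannon, start, end, win_size=50, min_value_frac=0.4):
    """Fraction of bases in [start, end) that are low in both smoothed tracks."""
    react_s = smooth_median(react, win_size, min_value_frac)
    shannon_s = smooth_median(shannon, win_size, min_value_frac)
    react_med, shannon_med = raw_median(react), raw_median(shannon)
    low = sum(r < react_med and s < shannon_med  # NaN comparisons are False
              for r, s in zip(react_s[start:end], shannon_s[start:end]))
    return low / (end - start)


# Toy transcript: a quiet first half followed by a noisy second half
track = [0.0] * 10 + [1.0] * 10
for start, end in [(0, 10), (5, 15)]:
    frac = fraction_low(track, track, start, end, win_size=4)
    print(start, end, frac, "keep" if frac >= 0.7 else "stop")
```

Under the default --minBelowMedian of 0.7, the first toy motif would be retained, whereas the second one, which extends into the noisy half, would not.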