The RF StructExtract module allows extracting (portions of) individual structure elements from a structure model generated using rf-fold, on the basis of specific selection criteria such as size, median reactivity, median Shannon entropy, presence of a multiway junction, or thermodynamic stability higher than expected by chance.

Usage

To list the required parameters, simply type:

$ rf-structextract -h
| Parameter | Type | Description |
| --- | --- | --- |
| -p or --processors | int | Number of processors (threads) to use (Default: 1) |
| -ro or --rffoldOut | string | Path to the output folder generated by rf-fold, containing the structures to be parsed |
| -xf or --xmlFolder | string | Path to the output folder generated by rf-norm, containing the reactivities in XML format |
| -o or --output | string | Output folder (Default: rf_structextract/) |
| -ow or --overwrite | | Overwrites the output directory if it already exists |
| -w or --winSize | int | Window size (in nt) for calculating the median reactivity and Shannon entropy (Default: 50) |
| -ml or --minTranscriptLen | int | Low reactivity - low Shannon calculation will be skipped for transcripts below this length (Default: 500) |
| -ir or --ignoreReact | | Skips low reactivity evaluation |
| -is or --ignoreShannon | | Skips low Shannon evaluation |
| -mv or --minValueFrac | float | Windows in which less than this fraction of bases is covered will be set to NaN (Default: 0.4 [40%]) |
| -mb or --minBelowMedian | float | Structure elements having less than this fraction of bases whose Shannon and reactivity are below the global transcript median will be discarded (Default: 0.7 [70%]) |
| -mp or --minPairedFrac | float | Structure elements having less than this fraction of paired bases will be discarded (Default: 0.45 [45%]) |
| -mm or --minMotifLen | int | Structure elements below this length will be discarded (Default: 50) |
| -xm or --maxMotifLen | int | Structure elements above this length will be discarded (Default: no limit) |
| -xl or --maxLoopSize | int | Structure elements encompassing a loop larger than this number of bases will be discarded (Default: no limit) |
| -mo or --multiwayOnly | | Only report structure elements encompassing multiway junctions |
| -opf or --onePerFile | | Extracted structure elements belonging to the same transcript are reported in separate files |
| -ee or --evalEnergy | | Only structures having a free energy significantly lower than expected by chance will be reported. Note #1: this is estimated by randomly shuffling the underlying sequence N times (where N is controlled via the --nShufflings parameter) and by calculating the probability associated with the corresponding Z-score. Note #2: this procedure will significantly slow down the analysis |
| -v or --pvalue | float | P-value threshold for considering the energy of a structure significantly lower than expected by chance (0-1, Default: 0.05) |
| -ns or --nShufflings | int | Number of times a sequence must be shuffled (>=1, Default: 100) |
| -ds or --dinuclShuffle | | Sequences are shuffled taking care to preserve their dinucleotide frequencies (slower) |
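
As an illustration of how the parameters above combine, the following minimal sketch assembles an rf-structextract command line from Python. The helper name build_structextract_command and the folder names rf_fold/ and rf_norm/ are hypothetical placeholders, not part of the module; only the option names are taken from the table. The sketch only builds and prints the command, it does not run it.

```python
import shlex


def build_structextract_command(rffold_out, xml_folder, output="rf_structextract/",
                                processors=1, win_size=50, min_motif_len=50,
                                multiway_only=False, eval_energy=False,
                                n_shufflings=100, pvalue=0.05):
    """Assemble an rf-structextract argument list (hypothetical helper)."""
    cmd = [
        "rf-structextract",
        "--processors", str(processors),
        "--rffoldOut", rffold_out,    # folder produced by rf-fold
        "--xmlFolder", xml_folder,    # folder produced by rf-norm
        "--output", output,
        "--winSize", str(win_size),
        "--minMotifLen", str(min_motif_len),
    ]
    if multiway_only:
        cmd.append("--multiwayOnly")
    if eval_energy:
        # --nShufflings and --pvalue only matter when --evalEnergy is enabled
        cmd += ["--evalEnergy",
                "--nShufflings", str(n_shufflings),
                "--pvalue", str(pvalue)]
    return cmd


# Placeholder input folders; replace with the actual rf-fold / rf-norm outputs.
cmd = build_structextract_command("rf_fold/", "rf_norm/", eval_energy=True)
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once the input folders exist.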


Understanding the algorithm

The aim of the module is to extract high-confidence RNA structure elements, which are more likely to be functionally relevant. The algorithm first identifies independently folded structural domains (that is, regions of the transcript whose folding is independent of the rest of the transcript) and then, starting from the innermost loop, identifies the individual structure motifs by bidirectional extension. The extension is stopped in any of the following cases, each controlled by user-defined parameters:

  1. If the motif falls within a region of high Shannon entropy - high reactivity. Briefly, the smoothed Shannon entropy is calculated along the entire transcript by sliding a centered window of ±(--winSize / 2) nucleotides in 1 nt increments and by calculating the median Shannon entropy within each window. If the fraction of bases in the window having non-NaN values is < --minValueFrac, the window median is set to NaN. The same operation is repeated to calculate a smoothed reactivity. When a structure motif is extracted, the algorithm compares the smoothed Shannon entropy and reactivity across all the bases encompassed by the motif to the median Shannon entropy and reactivity of the entire transcript. The motif is retained and the extension continued if the fraction of bases falling below the transcript's median is ≥ --minBelowMedian.

    It is essential to note that, in order to evaluate the Shannon entropy, the --rffoldOut directory passed to the module must contain the shannon/ folder. This folder is only generated when invoking the rf-fold module with the --shannon-entropy flag (more details can be found in the manual page of rf-fold). Evaluation of the Shannon entropy can be turned off by enabling the --ignoreShannon flag. Similarly, in order to evaluate the reactivity, the --xmlFolder containing the XML reactivity profiles must be provided. Evaluation of the reactivity can be turned off by enabling the --ignoreReact flag. Both Shannon entropy and reactivity evaluation are automatically skipped for transcripts shorter than --minTranscriptLen.

  2. If the fraction of base-paired positions in the motif is < --minPairedFrac

  3. If the motif is shorter than --minMotifLen (in which case the extension continues, if possible)
  4. If the motif is longer than --maxMotifLen
  5. If the motif encompasses a loop larger than --maxLoopSize (Note: in the case of junctions, the size of the loop is calculated as the number of unpaired residues residing in the junction loop)
  6. If the motif has a free energy higher than expected by chance. Briefly, if the flag --evalEnergy is enabled, the sequence of the motif is randomly shuffled --nShufflings times and the probability of obtaining by chance a structure having a free energy ≤ that of the original motif is calculated from the corresponding Z-score. If the probability is ≥ --pvalue, the motif is discarded. When the --dinuclShuffle flag is enabled, the sequence of the motif is shuffled in such a way that the dinucleotide frequencies are preserved.
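
The energy test in criterion 6 can be sketched as follows. This is only an illustration under stated assumptions, not the module's actual code: the probability is assumed to be the lower tail of a normal distribution evaluated at the Z-score, the shuffling is a plain mononucleotide shuffle (the --dinuclShuffle variant is not covered), and toy_energy is a made-up stand-in for a real folding engine so that the example runs without external software.

```python
import math
import random
from statistics import mean, stdev

# Pairs counted by the toy scoring function below
PAIRS = {("G", "C"), ("C", "G"), ("A", "U"), ("U", "A"), ("G", "U"), ("U", "G")}


def toy_energy(seq):
    """Toy stand-in for a folding engine (NOT a real free energy):
    minus the number of complementary pairs between mirrored positions."""
    n = len(seq)
    return -float(sum((seq[i], seq[n - 1 - i]) in PAIRS for i in range(n // 2)))


def energy_pvalue(seq, energy_fn, n_shufflings=100, seed=0):
    """Return (Z-score, probability) of the observed energy against shuffled
    sequences. Assumes a normal approximation for the shuffled energies."""
    rng = random.Random(seed)
    observed = energy_fn(seq)
    shuffled = []
    for _ in range(n_shufflings):
        bases = list(seq)
        rng.shuffle(bases)                 # mononucleotide shuffle
        shuffled.append(energy_fn("".join(bases)))
    mu = mean(shuffled)
    sd = stdev(shuffled) if n_shufflings > 1 else 0.0
    if sd == 0:
        return float("nan"), 1.0           # no spread: treat as not significant
    z = (observed - mu) / sd
    p = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # lower-tail normal CDF
    return z, p


motif = "GGGGGGAAAACCCCCC"                 # made-up hairpin-like sequence
z, p = energy_pvalue(motif, toy_energy, n_shufflings=100)
keep = p < 0.05                            # discarded when p >= --pvalue
print(f"z={z:.2f} p={p:.4f} keep={keep}")
```

In practice energy_fn would wrap a real free energy calculation, which is also why enabling --evalEnergy slows the analysis down: each motif is folded --nShufflings additional times.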

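The structural criteria (2 to 5) and the --multiwayOnly option can be illustrated on a dot-bracket string. The sketch below is a simplified post-hoc filter with hypothetical function names, assuming a pseudoknot-free structure; it does not reproduce the bidirectional extension itself. Loop size is counted as the number of unpaired residues lying directly in the loop, as described for junctions above.

```python
def pair_table(db):
    """Map each position of a dot-bracket string to its partner (-1 if unpaired)."""
    stack, partner = [], [-1] * len(db)
    for i, ch in enumerate(db):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            j = stack.pop()
            partner[i], partner[j] = j, i
    if stack:
        raise ValueError("unbalanced structure")
    return partner


def loops(db):
    """For every base pair, return (unpaired bases, inner helices) of the loop it closes."""
    partner = pair_table(db)
    result = []
    for i, j in enumerate(partner):
        if j > i:                          # i opens the pair (i, j)
            unpaired = branches = 0
            k = i + 1
            while k < j:
                if partner[k] > k:         # inner helix: count it and jump over it
                    branches += 1
                    k = partner[k] + 1
                else:
                    unpaired += 1
                    k += 1
            result.append((unpaired, branches))
    return result


def paired_fraction(db):
    return sum(ch in "()" for ch in db) / len(db)


def passes_filters(db, min_paired_frac=0.45, min_motif_len=50,
                   max_motif_len=None, max_loop_size=None, multiway_only=False):
    """Apply the length, paired fraction, loop size and multiway criteria."""
    found = loops(db)
    if len(db) < min_motif_len:
        return False
    if max_motif_len is not None and len(db) > max_motif_len:
        return False
    if paired_fraction(db) < min_paired_frac:
        return False
    if max_loop_size is not None and any(u > max_loop_size for u, _ in found):
        return False
    # A multiway junction closes a loop containing two or more inner helices
    if multiway_only and not any(b >= 2 for _, b in found):
        return False
    return True


# Made-up 40 nt motif: a three-way junction with two hairpins
motif = "((((...((((....))))..((((....))))...))))"
largest_loop = max(u for u, _ in loops(motif))
print(len(motif), paired_fraction(motif), largest_loop)
print(passes_filters(motif, min_motif_len=30, max_loop_size=8, multiway_only=True))
```

With the default --minMotifLen of 50 this 40 nt example would be rejected, which is why the usage line lowers the threshold to 30.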

Overview

In the above example, the effect of smoothing reactivities and Shannon entropy is shown. The red dashed lines correspond to the median reactivity and median Shannon entropy along the entire transcript. The inset further shows an independently folded structural domain. The green dots mark the loops that represent the possible starting points for the bidirectional extension and motif extraction. In this example, only the two motifs colored in green (marked #1 and #2, respectively) will be reported, as they fall inside regions of low reactivity and low Shannon entropy. The base-pairs marked in red will not be part of motif #1 because, when included, the reactivity would exceed the global median reactivity for more than a fraction of (1 - --minBelowMedian) of the bases.
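
The smoothing and the below-median test described for criterion 1 can be sketched as follows. This is a minimal illustration, not the module's implementation, and it rests on several assumptions: windows are truncated at the transcript ends, the global median is taken over the raw non-NaN values, a base counts as "below" only if both its smoothed reactivity and its smoothed Shannon entropy are strictly below the respective medians, and NaN windows never count as below. The function names and the toy profiles are made up.

```python
import math
from statistics import median

NAN = float("nan")


def smooth(values, win_size=50, min_value_frac=0.4):
    """Sliding centered-window median; NaN where too few bases are covered."""
    half = win_size // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]   # truncated at the ends
        valid = [v for v in window if not math.isnan(v)]
        if not valid or len(valid) / len(window) < min_value_frac:
            out.append(NAN)
        else:
            out.append(median(valid))
    return out


def global_median(values):
    """Median over all non-NaN positions of the transcript."""
    return median(v for v in values if not math.isnan(v))


def fraction_below(sm_react, sm_shannon, med_react, med_shannon, start, end):
    """Fraction of bases in [start, end) below both transcript medians."""
    below = sum(
        1 for i in range(start, end)
        if sm_react[i] < med_react and sm_shannon[i] < med_shannon  # NaN -> False
    )
    return below / (end - start)


# Toy 200 nt transcript with a low reactivity / low Shannon region at 50..99
react = [0.1 if 50 <= i < 100 else 1.0 for i in range(200)]
shannon = [0.05 if 50 <= i < 100 else 0.5 for i in range(200)]

sm_react = smooth(react, win_size=10)
sm_shannon = smooth(shannon, win_size=10)
med_react, med_shannon = global_median(react), global_median(shannon)

min_below_median = 0.7
inner = fraction_below(sm_react, sm_shannon, med_react, med_shannon, 50, 100)
outer = fraction_below(sm_react, sm_shannon, med_react, med_shannon, 30, 120)
print(inner >= min_below_median, outer >= min_below_median)
```

In this toy case the motif confined to the low region satisfies the threshold, while the one extended into the flanking high region does not, which mirrors why the base-pairs marked in red are left out of motif #1.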