The RF StructExtract module allows extracting (portions of) individual structure elements from a structure model generated using rf-fold, on the basis of specific selection criteria such as size, median reactivity, median Shannon entropy, presence of a multiway junction, or thermodynamic stability higher than expected by chance.

Usage

To list the required parameters, simply type:

$ rf-structextract -h

Parameter	Type	Description
-p or --processors	int	Number of processors (threads) to use (Default: 1)
-ro or --rffoldOut	string	Path to the output folder generated by `rf-fold`, containing the structures to be parsed
-xf or --xmlFolder	string	Path to the output folder generated by `rf-norm`, containing the reactivities in XML format
-o or --output	string	Output folder (Default: rf_structextract/)
-ow or --overwrite		Overwrites the output directory if it already exists
-w or --winSize	int	Window size (in nt) for calculating the median reactivity and Shannon entropy (Default: 50)
-ml or --minTranscriptLen	int	Low reactivity - low Shannon calculation will be skipped for transcripts below this length (Default: 500)
-ir or --ignoreReact		Skips low reactivity evaluation
-is or --ignoreShannon		Skips low Shannon evaluation
-mv or --minValueFrac	float	Windows in which less than this fraction of bases is covered will be set to NaN (Default: 0.4 [40%])
-mb or --minBelowMedian	float	Structure elements in which less than this fraction of bases has both Shannon entropy and reactivity below the global transcript median will be discarded (Default: 0.7 [70%])
-mp or --minPairedFrac	float	Structure elements having less than this fraction of paired bases will be discarded (Default: 0.45 [45%])
-mm or --minMotifLen	int	Structure elements below this length will be discarded (Default: 50)
-xm or --maxMotifLen	int	Structure elements above this length will be discarded (Default: no limit)
-xl or --maxLoopSize	int	Structure elements encompassing a loop larger than this number of bases will be discarded (Default: no limit)
-mo or --multiwayOnly		Only report structure elements encompassing multiway junctions
-opf or --onePerFile		Extracted structure elements belonging to the same transcript are reported in separate files
-ee or --evalEnergy		Only structures having a free energy significantly lower than expected by chance will be reported. Note #1: this is estimated by randomly shuffling the underlying sequence N times (where N is controlled via the `--nShufflings` parameter) and by calculating the probability associated with the corresponding Z-score. Note #2: this procedure will significantly slow down the analysis
-v or --pvalue	float	P-value threshold for considering the energy of a structure significantly lower than expected by chance (0-1, Default: 0.05)
-ns or --nShufflings	int	Number of times a sequence must be shuffled (>=1, Default: 100)
-ds or --dinuclShuffle		Sequences are shuffled taking care to preserve their dinucleotide frequencies (slower)
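As a minimal sketch of how these parameters combine on the command line, the snippet below assembles an invocation from Python. Only flags listed in the table above are used; the helper name `structextract_command` and the folder names `rf_fold/` and `rf_norm/` are placeholders chosen for illustration, not names taken from this page.

```python
import shlex


def structextract_command(rffold_out, xml_folder, output="rf_structextract/",
                          processors=1, eval_energy=False,
                          n_shufflings=100, pvalue=0.05):
    """Build an rf-structextract argument list (hypothetical helper)."""
    cmd = [
        "rf-structextract",
        "--rffoldOut", rffold_out,    # output folder of rf-fold
        "--xmlFolder", xml_folder,    # output folder of rf-norm
        "--output", output,
        "--processors", str(processors),
    ]
    if eval_energy:
        # Energy evaluation is optional because it slows down the analysis.
        cmd += ["--evalEnergy",
                "--nShufflings", str(n_shufflings),
                "--pvalue", str(pvalue)]
    return cmd


# Placeholder folder names; replace them with the real rf-fold / rf-norm outputs.
cmd = structextract_command("rf_fold/", "rf_norm/", eval_energy=True)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list (rather than a single string) avoids shell quoting problems if a folder name contains spaces.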

Understanding the algorithm

The aim of the module is to extract high-confidence RNA structure elements, which are more likely to be functionally relevant. The algorithm first identifies independently folded structural domains (that is, regions of the transcript whose folding is independent of that of the rest of the transcript) and then, starting from the innermost loop, identifies the individual structure motifs by bidirectional extension. The extension is stopped when one or more of the following user-defined conditions occur:

If the motif falls within a region of high Shannon entropy - high reactivity. Briefly, the smoothed Shannon entropy is calculated along the entire transcript by sliding a centered window of ± --winSize / 2 nucleotides in 1 nt increments and by calculating the median Shannon entropy within each window. If the fraction of bases in the window having non-NaN values is < --minValueFrac, the window median is set to NaN. The same operation is repeated to calculate a smoothed reactivity. When a structure motif is extracted, the algorithm compares the smoothed Shannon entropy and reactivity across all the bases encompassed by the motif to the median Shannon entropy and reactivity of the entire transcript. The motif is retained, and the extension continued, if the fraction of bases falling below the transcript's median is ≥ --minBelowMedian.

It is essential to note that, in order to evaluate the Shannon entropy, the --rffoldOut directory passed to the module must contain the shannon/ folder. This folder is only generated when invoking the rf-fold module with the --shannon-entropy flag (more details can be found in the manual page of rf-fold). Evaluation of the Shannon entropy can be turned off by enabling the --ignoreShannon flag. Similarly, in order to evaluate the reactivity, the --xmlFolder of XML reactivity profiles must be provided. Evaluation of the reactivity can be turned off by enabling the --ignoreReact flag. Both Shannon entropy and reactivity evaluation are automatically skipped for transcripts shorter than --minTranscriptLen.
If the fraction of base-paired positions in the motif is < --minPairedFrac
If the motif is shorter than --minMotifLen (in which case the extension continues, if possible)
If the motif is longer than --maxMotifLen
If the motif encompasses a loop larger than --maxLoopSize (Note: in the case of junctions, the size of the loop is calculated as the number of unpaired residues residing in the junction loop)
If the motif has a free energy higher than expected by chance. Briefly, if the flag --evalEnergy is enabled, the sequence of the motif is randomly shuffled --nShufflings times, and the probability of obtaining by chance a structure having a free energy ≤ that of the original motif is calculated from the corresponding Z-score. If the probability is ≥ --pvalue, the motif is discarded. When the --dinuclShuffle flag is enabled, the sequence of the motif is shuffled in such a way that the dinucleotide frequencies are preserved.
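The energy criterion can be illustrated with a short sketch. It assumes that the Z-score is converted to a probability through the lower tail of a standard normal distribution, which this page does not state explicitly, and it takes the folding routine as a parameter: `fold_energy` is a hypothetical callable returning the free energy of a sequence, and `toy_energy` is only a stand-in so that the example runs without a folding package. Dinucleotide-preserving shuffling is not shown, and the sketch needs at least two shufflings to estimate a standard deviation.

```python
import random
from statistics import NormalDist, mean, stdev


def energy_pvalue(seq, fold_energy, n_shufflings=100, rng=None):
    """Probability of a shuffled sequence folding as stably as `seq` (sketch)."""
    if n_shufflings < 2:
        raise ValueError("this sketch needs at least 2 shufflings")
    rng = rng or random.Random()
    observed = fold_energy(seq)
    shuffled = []
    for _ in range(n_shufflings):
        bases = list(seq)
        rng.shuffle(bases)  # mononucleotide shuffle, composition preserved
        shuffled.append(fold_energy("".join(bases)))
    sd = stdev(shuffled)
    if sd == 0:
        return 1.0  # no spread among shuffles: treat as not significant
    z = (observed - mean(shuffled)) / sd
    # Assumption: lower-tail probability of the Z-score under a normal model.
    return NormalDist().cdf(z)


def toy_energy(seq):
    """Stand-in for a real folding routine: rewards 'GC' steps (illustrative)."""
    return -float(seq.count("GC"))


p = energy_pvalue("GC" * 10, toy_energy, n_shufflings=100, rng=random.Random(0))
keep_motif = p < 0.05  # the motif is discarded when p >= --pvalue
print(p, keep_motif)
```

In a real analysis `fold_energy` would wrap an actual structure prediction call, which is why the number of shufflings dominates the running time.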

[Figure: Overview]

In the above example, the effect of smoothing reactivities and Shannon entropy is shown. The red dashed lines correspond to the median reactivity and median Shannon entropy along the entire transcript. The inset further shows an independently folded structural domain. The green dots mark the loops that represent the possible starting points for the bidirectional extension and motif extraction. In this example, only the two motifs colored in green (marked #1 and #2, respectively) will be reported, as they fall inside regions of low reactivity and low Shannon entropy. The base-pairs marked in red will not be part of motif #1 because, if they were included, the reactivity would exceed the global median reactivity for more than a fraction 1 - --minBelowMedian of the bases.
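The smoothing and the --minBelowMedian check described above can be sketched as follows. This is an illustration under stated assumptions, not the module's actual code: windows are truncated at the transcript ends, the global median is taken over the raw non-NaN values, and a base counts as "low" only when both smoothed signals are below their medians. The function names `smooth`, `nan_median` and `fraction_low_low` are hypothetical.

```python
import math
from statistics import median


def smooth(values, win_size=50, min_value_frac=0.4):
    """Centered sliding-window median; NaN marks poorly covered windows."""
    half = win_size // 2
    smoothed = []
    for i in range(len(values)):
        # Assumption: windows are truncated at the transcript ends.
        window = values[max(0, i - half): i + half + 1]
        covered = [v for v in window if not math.isnan(v)]
        if len(covered) / len(window) < min_value_frac:
            smoothed.append(math.nan)
        else:
            smoothed.append(median(covered))
    return smoothed


def nan_median(values):
    """Median of the non-NaN values (used here as the transcript median)."""
    return median(v for v in values if not math.isnan(v))


def fraction_low_low(react, shannon, start, end, react_median, shannon_median):
    """Fraction of motif bases [start, end) with both signals below the median.

    Comparisons with NaN are False, so NaN bases never count as 'low'.
    """
    low = sum(1 for i in range(start, end)
              if react[i] < react_median and shannon[i] < shannon_median)
    return low / (end - start)


# Toy profile: a quiet first half followed by a noisy second half.
raw = [0.1] * 10 + [1.0] * 10
sm = smooth(raw, win_size=4)
med = nan_median(raw)
frac = fraction_low_low(sm, sm, 0, 10, med, med)
keep_motif = frac >= 0.7  # compare against --minBelowMedian
```

Using the same toy profile for both signals keeps the example short; in practice the reactivity and Shannon entropy tracks are smoothed separately and compared with their own medians.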