The RF Fold module is designed to allow transcriptome-wide reconstruction of RNA structures, starting from XML files generated using the RF Norm tool.
This tool can process a single, or an entire directory of XML files, and produces the inferred secondary structures (either in dot-bracket notation, or CT format) and their graphical representation (either in Postscript, or SVG format).
Folding inference can be performed using 2 different algorithms:
1. ViennaRNA
2. RNAstructure
Prediction can be performed either on the whole transcript, or through a windowed approach (see next paragraph).
Structure modelling
The structure modelling approach is inspired by the original method described in Siegfried et al., 2014 (PMID: 25028896). Since version 2.8.0, the underlying logic of the windowed approach has been slightly changed, by performing the detection of pseudoknots as the last step. The procedure is outlined below:
In step I, a window is slid along the RNA (if the -w option is enabled, otherwise the entire RNA is used), and partition function is calculated. If provided, soft-constraints from structure probing are applied. Predicted base-pair probabilities are averaged across all windows in which they have appeared, and base-pairs with >99% probability are retained, and hard-constrained to be paired in step III.
In step II, a window is slid along the RNA, and MFE folding is performed, including (where present) soft-constraints from probing data, and base-pairs from step I. Predicted base-pairs are retained if they appear in >50% of analyzed windows.
In step III (optional), a window is slid along the RNA, and putative pseudoknots are detected using the same approach employed by the ShapeKnots algorithm (Hajdin et al., 2013 (PMID: 23503844)). Our implementation of the ShapeKnots algorithm relies on the ViennaRNA package (instead of RNAstructure as in the original implementation), thus it is much faster:
Nonetheless, both algorithms work in single thread. Alternatively, the multi-thread implementation ShapeKnots-smp shipped with the latest RNAstructure version can be used.
If constraints from structure probing experiments are provided, these are incorporated in the form of soft-constraints. Predicted pseudoknotted base-pairs are retained if they apper in >50% of analyzed windows and if they do not clash with the nested base-pairs indentified in step II. In case structure probing constraints are provided, pseudoknots are retained only if the average reactivity of bases on both sides of the pseudoknotted helix is below a certain reactivity cutoff.
Note
At all stages, increased sampling is performed at the 5'/3'-ends to avoid end biases
Along with the predicted structure, the windowed method also produces a WIGGLE track file containing per-base Shannon entropies.
Regions with higher Shannon entropies are likely to form alternative structures, while those with low Shannon entropies correspond to regions with well-defined RNA structures, or persistent single-strandedness (Siegfried et al., 2014).
Shannon entropy is calculated as:
where pi,j is the probability of base i of being base-paired to base j, over all its potential J pairing partners.
Since version 2.9.1, RF Fold can use R to generate PDF graphical reports for each structure, reporting reactivity, MEA structure, Shannon entropy, and base-pairing probabilities:
Note
The calculation of Shannon entropy and base-pairing probabilities requires partition function to be computed. Since this is a very slow step, partition function folding is performed only in windowed mode, or if parameters -dp (or --dotplot) or -sh (or --shannon) are explicitly specified.
Additionally, since version 2.9.4, RF Fold can use the RNAplot tool of the ViennaRNA package (v2.7.0 or greater) to generate SVG secondary structure plots with overlaid reactivities:
Combining multiple experiments
Since version 2.9.0, RF Fold can combine multiple experiments into a single prediction. These can either be replicates prepared using the same chemical probe, or experiments performed by using different chemical probes.
The combination is achieved by iterating the folding process (as described in the "Structure modelling" paragraph above) over all experiments. Windows from all experiments are combined using a majority voting approach, so that only the pairs passing the above-detailed thresholds (i.e., pairs with >99% probability and pairs appearing in >50% windows) are retained. Therefore, if, for example, a certain base-pair appears in 100% of the windows for one replicate and in just 25% of the windows for another replicate, it will be included in the final model as, in total, it appears in 62.5% of the windows.
In the example below:
$ rf-fold -sl 2.4 -in -0.2 experiment_1/ experiment_2/
the structure of the transcripts common to both experiments will be modelled y combining both sets of reactivities, while transcripts unique to either experiments will be modelled using a single set of reactivities.
Single XML files and entire XML folders can also be combined:
$ rf-fold -sl 2.4 -in -0.2 experiment_1/ experiment_2/ transcript_1.xml
In the above example, transcript_1 will be modelled using data coming from both experiment #1 and #2 (if the file transcript_1.xml is present in the respective folders), as well as data from transcript_1.xml, while the all other transcripts potentially shared across experiments #1 and #2 will only be modelled using those two datasets.
Importantly, it is possible combining even experiments generated using different chemical probes. In the following example:
$ rf-fold -sl 2.4 -in -0.2 DMS_data/ CMCT_data/
two experiments, one generated using DMS and the other using CMCT, are combined into a single prediction that incorporates both sets of reactivities.
While in this case, the same slope and intercept values will be used for both experiments, this might not be ideal, as different reagents might require very different slope/intercept pairs. Therefore, RF Fold can also accept multiple slope/intercept pairs, as a comma-separated list, which must follow the same order as the provided experiments, for example:
$ rf-fold -sl 2.4,4.5 -in -0.2 DMS_data/ CMCT_data/
At this point, a slope of 2.4 will be used for the DMS dataset and a slope of 4.5 will be used for the CMCT datset, while the intercept will be -0.2 for both experiments.
Usage
$ rf-fold [options] XML_dir_1/ XML_dir_2/ .. XML_dir_N/
$ rf-fold [options] rna1.xml rna2.xml .. rnaN.xml
To list all available parameters, simply type:
$ rf-fold -h
| Parameter | Type | Description |
|---|---|---|
| -o or --output-dir | string | Output directory for writing inferred structures (Default: rf_fold/) |
| -ow or --overwrite | Overwrites the output directory if already exists | |
| -ct or --connectivity-table | Writes predicted structures in CT format (Default: Dot-bracket notation) | |
| -m or --folding-method | int | Folding method (1-2, Default: 1): 1. ViennaRNA 2. RNAstructure |
| -p or --processors | int | Number of processors (threads) to use (Default: 1) |
| -oc or --only-common | int | In case of multiple experiments, only transcripts covered across at least this number of experiments will be folded |
| -g or --img | Enables the generation of graphical reports (requires R and, optionally, RNAplot v2.7.0 or greater) | |
| -vrp or --vienna-rnaplot | string | Path to ViennaRNA RNAplot v2.7.0 (or greater) executable (Default: assumes RNAplot is in PATH) |
| -R or --R-path | string | Path to R executable (Default: assumes R is in PATH) Note: also check $RF_RPATH under Environment variables |
| -t or --temperature | float | Temperature in Celsius degrees (Default: 37.0) |
| -sl or --slope | float | Sets the slope used with structure probing data restraints (Default: 1.8 [kcal/mol]) |
| -in or --intercept | float | Sets the intercept used with structure probing data restraints (Default: -0.6 [kcal/mol]) |
| -md or --maximum-distance | int | Maximum pairing distance (in nt) between transcript's residues (Default: 0 [no limit]) |
| -nlp or --no-lonelypairs | Disallows lonely base-pairs (1 bp helices) inside predicted structures | |
| -i or --ignore-reactivity | Ignores XML reactivity data when performing folding (MFE unconstrained prediction) | |
| -is or --ignore-sequence | In case of multiple experiments, nucleotide differences (e.g. SNVs) between XML files are ignored | |
| -hc or --hard-constraint | Besides performing soft-constraint folding, allows specifying a reactivity cutoff (specified by -f) for hard-constraining a base to be single-stranded |
|
| -c or --constraints | string | Path to a directory containing constraint files (in dot-bracket notation), that will be used to enforce specific base-pairs in the structure models |
| -f or --cutoff | float | Reactivity cutoff for constraining a position as unpaired (>0, Default: 0.7) |
| -w or --windowed | Enables windowed folding | |
| -pt or --partition | string | Path to RNAstructure partition executable (Default: assumes partition is in PATH)Note: by default, partition-smp will be used (if available) |
| -pp or --probabilityplot | string | Path to RNAstructure ProbabilityPlot executable (Default: assumes ProbabilityPlot is in PATH) |
| -fw or --fold-window | int | Window size (in nt) for performing MFE folding (>=50, Default: 600) |
| -fo or --fold-offset | int | Offset (in nt) for MFE folding window sliding (Default: 200) |
| -pw or --partition-window | int | Window size (in nt) for performing partition function (>=50, Default: 600) |
| -po or --partition-offset | int | Offset (in nt) for partition function window sliding (Default: 200) |
| -wt or --window-trim | int | Number of bases to trim from both ends of the partition windows to avoid end biases (Default: 100) |
| -dp or --dotplot | Enables generation of dot-plots of base-pairing probabilities | |
| -sh or --shannon-entropy | Enables generation of a WIGGLE track file with per-base Shannon entropies | |
| -pk or --pseudoknots | Enables detection of pseudoknots (computationally intensive) | |
| -ksl or --pseudoknot-slope | float | Sets slope used for pseudoknots prediction (Default: same as -sl <slope>) |
| -kin or --pseudoknot-intercept | float | Sets intercept used for pseudoknots prediction (Default: same as -in <intercept>) |
| -kp1 or --pseudoknot-penality1 | float | Pseudoknot penality P1 (Default: 0.35) |
| -kp2 or --pseudoknot-penality2 | float | Pseudoknot penality P2 (Default: 0.65) |
| -kt or --pseudoknot-tollerance | float | Maximum tollerated deviation of suboptimal structures energy from MFE (>0-1, Default: 0.5 [50%]) |
| -kh or --pseudoknot-helices | int | Number of candidate pseudoknotted helices to evaluate (>0, Default: 100) |
| -kw or --pseudoknot-window | int | Window size (in nt) for performing pseudoknots detection (>=50, Default: 600) |
| -ko or --pseudoknot-offset | int | Offset (in nt) for pseudoknots detection window sliding (Default: 200) |
| -kc or --pseudoknot-cutoff | float | Reactivity cutoff for retaining a pseudoknotted helix (0-1, Default: 0.5) |
| -km or --pseudoknot-method | int | Algorithm for pseudoknots prediction (1-2, Default: 1): 1. RNA Framework 2. ShapeKnots Note: the chosen folding method (specified by -m) affects the algorithm used by RNA Framework (pseudoknot detection method #1) to define the initial MFE structure |
| RNA Framework pseudoknots detection algorithm options | ||
| -vrs or --vienna-rnasubopt | string | Path to ViennaRNA RNAsubopt executable (Default: assumes RNAsubopt is in PATH) |
| -ks or --pseudoknot-suboptimal | int | Number of suboptimal structures to evaluate for pseudoknots prediction (>0, Default: 1000) |
| -nz or --no-zuker | Disables the inclusion of Zuker suboptimal structures (reduces the sampled folding space) | |
| -zs or --zuker-suboptimal | Number of Zuker suboptimal structures to include (>0, Default: 1000) | |
| ShapeKnots pseudoknots detection algorithm options | ||
| -sk or --shapeknots | string | Path to ShapeKnots executable (Default: assumes ShapeKnots is in PATH)Note: by default, ShapeKnots-smp will be used (if available) |
| Folding method #1 options (ViennaRNA) | ||
| -vrf or --vienna-rnafold | string | Path to ViennaRNA RNAfold executable (Default: assumes RNAfold is in PATH) |
| -ngu or --no-closing-gu | Disallows G:U wobbles at the end of helices | |
| Folding method #2 options (RNAstructure) | ||
| -rs or --rnastructure | string | Path to RNAstructure Fold executable (Default: assumes Fold is in PATH)Note: by default, Fold-smp will be used (if available) |
| -d or --data-path | string | Path to RNAstructure data tables (Default: assumes DATAPATH environment variable is already set) |
Information
For additional details relatively to ViennaRNA soft-constrained prediction, please refer to the ViennaRNA documentation, or to Lorenz et al., 2016 (PMID: 26353838).
Information
For additional details relatively to ShapeKnots pseudoknots detection parameters, please refer to Hajdin et al., 2013 (PMID: 23503844).
Constraint files
Constraint files allow forcing base-pairing of certain positions in the RNA. These files are standard dot-bracket files and they must be named after the transcript ID used in the corresponding XML files (for instance, if the XML file is named XYZ.xml, the module will look for a XYZ.db file in the constraint folder):
>XYZ
UUUCGUACGUAGCGAGCGAGUAGCUGAUGCUGAUAGCGGCGAUGCUAGCUGAUCGUAGCGCGCGAUCGAUCGAUGC
..(((.............................................................))).......
In the above example, the constraint file instructs the module to force the base-pairing between positions 3-69, 4-68 and 5-67 of the XYZ transcript.
Information
At present, only nested base-pairs are allowed. Pseudoknotted helices will be automatically discarded.
Output dot-plot files
When option -dp is provided, RF Fold produces a dot-plot file for each transcript being analyzed, with the following structure:
1549 # RNA's length
i j -log10(Probability) # Header
8 254 0.459355416499312
9 253 0.446335563943221
10 252 0.456738523239413
11 251 0.454733421725068
12 250 0.46965667808714
13 249 0.47837140333524
21 35 0.268192200569539
22 34 0.0183400615262171
23 33 0.0166665677814708
24 32 0.0128927546134575
25 31 0.0148601207296645
26 30 0.0252017532628297
-- cut --
1497 1510 0.0147874890078331
1498 1509 0.0102803152157546
1499 1508 0.0137510190884233
1500 1507 0.0402352346970943
where i and j are the positions (1-based) of the bases involved in a given base-pair, followed by the -log10 of their base-pairing probability.
These files can be easily viewed using the Integrative Genomics Viewer (IGV) (for additional details, please refer to the official Broad Institute's IGV page).