The RF Norm module takes one (Rouskin and Zubradt methods), two (Ding and Siegfried methods), or three (Siegfried method) RC files generated by the RF Count module, and performs normalization to obtain transcriptome-wide per-base reactivities.
Reactivity scores can be computed using 4 methods:

Scoring of RT-stops/nuclease cuts-based methods

[1] Ding et al., 2014 (PMID: 24270811)

Per-base signal is calculated as the ratio of the natural log (ln) of the raw count of RT-stops/nuclease cuts at a given position of a transcript, to the average of the ln of RT-stops/nuclease cuts along the whole transcript:

$$U_i = \frac{\ln(n_{U_i} + p)}{\left( \sum_{j=0}^{l} \ln(n_{U_j} + p) \right) / l}$$

$$T_i = \frac{\ln(n_{T_i} + p)}{\left( \sum_{j=0}^{l} \ln(n_{T_j} + p) \right) / l}$$

where $n_{U_i}$ and $n_{T_i}$ are respectively the raw read counts in the untreated (or RNase V1) and treated (DMS, CMCT, SHAPE, or Nuclease S1) samples at position $i$ of the transcript, $l$ is the transcript's length, and $p$ is a pseudocount added to deal with non-covered regions. $U_i$ and $T_i$ are respectively the normalized number of RT-stops/nuclease cuts at position $i$ in the untreated and treated samples.
Score at position i is then calculated as:

$$S_i = \max(0,\ T_i - U_i)$$
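
As an illustration, the Ding scoring described above can be sketched in a few lines of Python. This is a minimal sketch and not RF Norm's actual implementation: the function name `ding_scores` and the error raised for a transcript whose average ln count is zero are assumptions made for this example.

```python
import math

def ding_scores(treated, untreated, pseudocount=1.0):
    """Per-base Ding-style scores from raw RT-stop counts (illustrative only)."""

    def normalize(counts):
        # ln of each raw count after adding the pseudocount p
        logs = [math.log(n + pseudocount) for n in counts]
        # Average of the ln counts along the whole transcript
        mean_log = sum(logs) / len(logs)
        if mean_log == 0:
            # Assumption: a profile without any signal cannot be scaled
            raise ValueError("average ln count is zero")
        return [x / mean_log for x in logs]

    t_norm = normalize(treated)    # T_i
    u_norm = normalize(untreated)  # U_i
    # S_i = max(0, T_i - U_i)
    return [max(0.0, t - u) for t, u in zip(t_norm, u_norm)]

# Hypothetical counts for a 4-nt transcript
scores = ding_scores(treated=[0, 3, 0, 0], untreated=[1, 1, 1, 1])
print(scores)  # approximately [0.0, 3.0, 0.0, 0.0]
```

Negative differences are clipped to 0, so positions where the untreated sample has the stronger normalized signal do not contribute to the score.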
[2] Rouskin et al., 2014 (PMID: 24336214)

The untreated sample is not considered. Per-base RT-stops/nuclease cuts are used as a direct measure of the raw signal.

Warning

Normalization of data processed by the Rouskin method can only be performed using the 90% Winsorizing approach.


Scoring of mutational profiling-based methods

[3] Siegfried et al., 2014 (PMID: 25028896)

This method takes into account an untreated sample and, optionally, a denatured control sample.
Per-base raw signal is calculated as:

$$S_i = \frac{\dfrac{n_{T_i}}{c_{T_i}} - \dfrac{n_{U_i}}{c_{U_i}}}{\dfrac{n_{D_i}}{c_{D_i}}}$$

where $n_{T_i}$, $n_{U_i}$, and $n_{D_i}$ are respectively the mutation counts in the treated, untreated, and denatured samples at position $i$ of the transcript, while $c_{T_i}$, $c_{U_i}$, and $c_{D_i}$ are respectively the numbers of reads covering position $i$ of the transcript in the treated, untreated, and denatured samples.
If no denatured control sample is provided, raw reactivities are simply calculated as:

$$S_i = \frac{n_{T_i}}{c_{T_i}} - \frac{n_{U_i}}{c_{U_i}}$$
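
The two Siegfried formulas above can be sketched for a single position as follows. This is only an illustration under stated assumptions: the function name `siegfried_score` is hypothetical, and returning NaN when a required coverage (or the denatured mutation rate) is zero is a choice made for this example, not documented RF Norm behavior.

```python
import math

def siegfried_score(n_t, c_t, n_u, c_u, n_d=None, c_d=None):
    """Raw Siegfried-style signal at one position (illustrative only)."""
    if c_t == 0 or c_u == 0:
        # Assumption: uncovered positions cannot be scored
        return math.nan
    # Treated mutation rate minus untreated mutation rate
    diff = n_t / c_t - n_u / c_u
    if n_d is None or c_d is None:
        # No denatured control: the difference is the raw reactivity
        return diff
    if c_d == 0 or n_d == 0:
        # Assumption: avoid dividing by a zero denatured mutation rate
        return math.nan
    # Scale by the denatured mutation rate
    return diff / (n_d / c_d)

# Hypothetical counts: 10/100 treated, 2/100 untreated, 20/100 denatured
without_denatured = siegfried_score(10, 100, 2, 100)
with_denatured = siegfried_score(10, 100, 2, 100, 20, 100)
```

With these hypothetical counts the score is about 0.08 without the denatured control and about 0.4 with it.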

[4] Zubradt et al., 2016 (PMID: 27819661)

The untreated sample is not considered. Per-base raw signal is calculated as:

$$S_i = \frac{n_{T_i}}{c_{T_i}}$$

where $n_{T_i}$ and $c_{T_i}$ are respectively the mutation count and the read coverage at position $i$ of the transcript.

Normalization of raw reactivities

Raw reactivity scores can be normalized using 3 different approaches:

| Method | Description |
| --- | --- |
| 2-8% Normalization | From the top 10% of values, the top 2% is ignored, then every reactivity value along the entire transcript is divided by the average of the remaining 8% |
| 90% Winsorizing | Each reactivity value above the 95th percentile is set to the 95th percentile and each reactivity value below the 5th percentile is set to the 5th percentile, then the reactivity at each position of the transcript is divided by the value of the 95th percentile |
| Box-plot Normalization | Values greater than 1.5x the interquartile range (numerical distance between the 25th and 75th percentiles) above the 75th percentile are removed. After excluding these outliers, the top 10% of the remaining reactivities are averaged, and all reactivities (including outliers) are divided by this value |
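
The three approaches in the table above can be sketched in Python as shown below. This is a minimal sketch rather than RF Norm's code: the function names are hypothetical, and the percentile convention (linear interpolation between ranks) and the integer rounding of the 2% and 10% cutoffs are assumptions that may differ from what RF Norm does.

```python
def percentile(values, q):
    """q-th percentile (0-100), linear interpolation between ranks (assumed)."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)

def norm_2_8(values):
    """Divide by the mean of the top 10% of values, after skipping the top 2%."""
    s = sorted(values, reverse=True)
    top2 = len(s) * 2 // 100
    top10 = len(s) * 10 // 100
    window = s[top2:top10]  # the "remaining 8%"
    if not window:
        raise ValueError("too few values for 2-8% normalization")
    factor = sum(window) / len(window)
    return [v / factor for v in values]

def winsorize_90(values):
    """Clip to the 5th-95th percentile range, then divide by the 95th percentile."""
    p5, p95 = percentile(values, 5), percentile(values, 95)
    if p95 == 0:
        raise ValueError("95th percentile is zero")
    return [min(max(v, p5), p95) / p95 for v in values]

def boxplot_norm(values):
    """Divide by the mean of the top 10% of non-outlier values."""
    q1, q3 = percentile(values, 25), percentile(values, 75)
    cutoff = q3 + 1.5 * (q3 - q1)  # values above this are outliers
    kept = sorted((v for v in values if v <= cutoff), reverse=True)
    top = kept[:max(1, len(kept) * 10 // 100)]
    factor = sum(top) / len(top)
    # Outliers are excluded from the factor but are still divided by it
    return [v / factor for v in values]

# Hypothetical raw reactivities: 1..100 plus one extreme outlier
raw = [float(v) for v in range(1, 101)] + [1000.0]
by_boxplot = boxplot_norm(raw)
by_2_8 = norm_2_8([float(v) for v in range(1, 101)])
by_winsor = winsorize_90([float(v) for v in range(0, 101)])
```

In the box-plot case the outlier (1000) does not influence the scaling factor but still receives a normalized value, which therefore ends up well above 1.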


Normalized reactivities can be further remapped to values ranging from 0 to 1 according to Zarringhalam et al., 2012 (PMID: 23091593). In this approach, values < 0.25 are linearly mapped to [0, 0.35), values ≥ 0.25 and < 0.3 are linearly mapped to [0.35, 0.55), values ≥ 0.3 and < 0.7 are linearly mapped to [0.55, 0.85), and values ≥ 0.7 are linearly mapped to [0.85, 1].
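
The piecewise-linear remapping can be sketched as follows. The intervals come from the description above. Two details are assumptions made for this example: negative inputs are clamped to 0, and the last interval is taken to span from 0.7 to a `max_value` parameter (hypothetical, default 1.0). The function name `remap_reactivity` is also hypothetical.

```python
def remap_reactivity(value, max_value=1.0):
    """Piecewise-linear remapping of a normalized reactivity to [0, 1]."""
    if max_value <= 0.7:
        raise ValueError("max_value must be greater than 0.7")
    # (input low, input high, output low, output high)
    segments = [
        (0.0, 0.25, 0.0, 0.35),
        (0.25, 0.3, 0.35, 0.55),
        (0.3, 0.7, 0.55, 0.85),
        (0.7, max_value, 0.85, 1.0),
    ]
    # Assumption: clamp the input to [0, max_value]
    v = min(max(value, 0.0), max_value)
    for in_lo, in_hi, out_lo, out_hi in segments:
        # The last segment also catches v == max_value
        if v < in_hi or in_hi == max_value:
            frac = (v - in_lo) / (in_hi - in_lo)
            return out_lo + frac * (out_hi - out_lo)

remapped = [remap_reactivity(v) for v in (0.0, 0.125, 0.25, 0.5, 1.0)]
```

With the default `max_value`, the five inputs map to roughly 0, 0.175, 0.35, 0.70, and 1.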

Sliding-window normalization

RF Norm supports data normalization in sliding windows. Windows can be either static (default) or dynamic. When a window size is chosen, normalization is performed by sliding a window of that size along the transcript by the chosen offset. While the choice of window type is irrelevant for SHAPE data, it becomes particularly important when dealing with base-specific reagents. Consider the example below, in which an RNA has been modified by DMS, which only reacts with A and C residues.

[Figure: Static vs. dynamic windows]

In the above example, the use of static windows of size 10 nt results in an erroneous overestimation of base reactivities for certain residues (marked in red). This happens because A/C residues are unevenly distributed along the transcript, so certain windows contain far fewer than 50% A/C bases (contrary to what would be expected by chance). The use of dynamic windows of size 10 nt avoids this overestimation, as the window's size is dynamically adjusted to always include 10 A/C residues.
The overestimation effect can also be minimized by increasing the size of static windows.
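
The difference between the two window types can be illustrated by computing window boundaries for a toy sequence. This is a sketch under assumptions: the function names are hypothetical, and the way the last window is truncated and the way the offset is applied to dynamic windows are guesses made for the example, not documented RF Norm behavior.

```python
def static_windows(length, size, offset):
    """Fixed-size windows as (start, end) pairs, end exclusive."""
    return [(start, min(start + size, length))
            for start in range(0, length, offset)]

def dynamic_windows(sequence, size, offset, reactive="AC"):
    """Windows extended until they contain `size` reactive bases."""
    windows = []
    for start in range(0, len(sequence), offset):
        end, count = start, 0
        # Grow the window until enough reactive bases are included
        # (or the end of the transcript is reached)
        while end < len(sequence) and count < size:
            if sequence[end] in reactive:
                count += 1
            end += 1
        windows.append((start, end))
    return windows

seq = "AGGCAGGGGC"  # hypothetical transcript with unevenly spread A/C bases
fixed = static_windows(len(seq), size=5, offset=5)
dynamic = dynamic_windows(seq, size=2, offset=5)
```

Here both static windows span 5 nt, although the second contains a single A/C base. The dynamic windows instead span 4 and 5 nt, with the second stopping at the transcript end after finding only one A/C base.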

Usage

To list the required parameters, simply type:

$ rf-norm -h
| Parameter | Type | Description |
| --- | --- | --- |
| -u or --untreated | string | Path to the RC file for the non-treated/denatured (or RNase V1) sample<br/>(required by Ding/Siegfried scoring methods) |
| -t or --treated | string | Path to the RC file for the treated (or Nuclease S1) sample |
| -d or --denatured | string | Path to the RC file for the denatured sample<br/>(optional for Siegfried scoring method) |
| -i or --index | string[,string] | A comma-separated (no spaces) list of RCI index files for the provided RC files<br/>Note #1: RCI files must be provided in the order 1. Untreated/Denatured, 2. Treated<br/>Note #2: If a single RCI file is specified, it will be used for all RC files<br/>Note #3: If no RCI index is provided, it will be generated at runtime, and stored in the same folder as the untreated/denatured/treated samples |
| -p or --processors | int | Number of processors (threads) to use (Default: 1) |
| -o or --output-dir | string | Output directory for writing normalized data in XML format (Default: `<treated>_vs_<untreated>_norm/` for Ding method or Siegfried method without denatured sample, `<treated>_norm/` for Rouskin/Zubradt methods, `<treated>_vs_<untreated>_<denatured>_norm/` for Siegfried method with denatured sample) |
| -ow or --overwrite | | Overwrites the output directory if it already exists |
| -c or --config-file | string | Path to a configuration file with normalization parameters (see Configuration files paragraph)<br/>Note #1: If the provided file exists, the loaded configuration will override any parameter specified on the command line<br/>Note #2: If the provided file doesn't exist, it will be generated using the specified command-line (or default) parameters |
| -sm or --scoring-method | int | Method for score calculation (1-4, Default: 1):<br/>1. Ding et al., 2014<br/>2. Rouskin et al., 2014<br/>3. Siegfried et al., 2014<br/>4. Zubradt et al., 2016 |
| -nm or --norm-method | int | Method for signal normalization (1-3, Default: 1):<br/>1. 2-8% Normalization<br/>2. 90% Winsorizing<br/>3. Box-plot Normalization |
| -r or --raw | | Reports raw reactivities (skips data normalization) |
| -rm or --remap-reactivities | | Remaps normalized reactivities to values ranging from 0 to 1 according to Zarringhalam et al., 2012 |
| -rb or --reactive-bases | string | Reactive bases to consider for signal normalization (Default: N [ACGT])<br/>Note: This parameter accepts any IUPAC code, or their combination (e.g. -rb M, or -rb AC). Any other base will be reported as NaN |
| -ni or --norm-independent | | Each one of the reactive bases will be normalized independently (e.g. -rb AC -ni will independently normalize A and C residues) |
| -dw or --dynamic-window | | When enabled, the normalization window is dynamically resized to include at least that number of reactive bases (e.g. -rb AC -nw 50 -dw instructs RF Norm to normalize reactivities in windows containing at least 50 A/C residues) |
| -nf or --norm-factor | float[,float] | When provided, this will be used as the normalization factor for all transcripts (default behavior is to calculate the normalization factor independently for each transcript)<br/>Note: 90% Winsorizing requires 2 normalization factors, provided as a comma-separated list, respectively corresponding to the 5th and 95th percentiles of the distribution of raw reactivities |
| -mc or --mean-coverage | float | Discards any transcript with mean coverage below this threshold (≥0, Default: 0) |
| -ec or --median-coverage | float | Discards any transcript with median coverage below this threshold (≥0, Default: 0) |
| -nw or --norm-window | int | Window size (in nt) for signal normalization (≥3, Default: whole transcript [Ding; Siegfried], 50 [Rouskin; Zubradt])<br/>Note: a maximum window size of 30,000 nt is allowed when -dw (or --dynamic-window) is enabled |
| -wo or --window-offset | int | Offset (in nt) for window sliding during normalization (Default: none [Ding; Siegfried], 50 [Rouskin; Zubradt]) |
| -D or --decimals | int | Number of decimals for reporting reactivities (1-10, Default: 3) |
| -n or --nan | int | Transcript positions with read coverage below this threshold will be reported as NaN in the reactivity profile (>0, Default: 10) |
| **Scoring method #1 options (Ding et al., 2014)** | | |
| -pc or --pseudocount | float | Pseudocount added to reactivities to avoid division by 0 (>0, Default: 1) |
| -s or --max-score | float | Score threshold for capping raw reactivities (>0, Default: 10) |
| **Scoring method #3 options (Siegfried et al., 2014)** | | |
| -mu or --max-untreated-mut | float | Maximum per-base mutation rate in untreated sample (≤1, Default: 0.05 [5%]) |
| **Scoring methods #1 and #3 options (Ding et al., 2014 & Siegfried et al., 2014)** | | |
| -il or --ignore-lower-than-untreated | | Bases having raw reactivity in the treated sample lower than the untreated control will be ignored (not used during reactivity normalization) and will be reported as NaNs |
| **Scoring methods #3 and #4 options (mutational profiling)** | | |
| -mm or --max-mutation-rate | float | Maximum per-base mutation rate (≤1, Default: 1 [100%]) |


Configuration files

RF Norm configuration files are used to provide normalization parameters for the analysis, without the need to manually specify them from the command-line.
Configuration files are composed of a list of key/value pairs, separated by the equal sign (=), or by the colon punctuation mark (:). Keys and values are case-insensitive.
Accepted key/value pairs are:

| Parameter | Accepted values | Default value |
| --- | --- | --- |
| scoreMethod | "Ding" (or 1); "Rouskin" (or 2); "Siegfried" (or 3); "Zubradt" (or 4) | Ding |
| normMethod | "2-8%" (or 1); "90% Winsorizing" (or 2); "Box-plot" (or 3) | 2-8% |
| reactiveBases | [ACGTURYSWKMBDHVN] (or "all") | all |
| normIndependent | TRUE/FALSE; Yes/No; 1/0 | FALSE |
| normWindow | Positive integer ≥ 3 | 1e9 [Ding; Siegfried]<br/>50 [Rouskin; Zubradt] |
| windowOffset | Positive integer > 0 | 1e9 [Ding; Siegfried]<br/>50 [Rouskin; Zubradt] |
| meanCoverage | Positive float ≥ 0 | 0 |
| medianCoverage | Positive float ≥ 0 | 0 |
| remapReactivities | TRUE/FALSE; Yes/No; 1/0 | FALSE |
| **Scoring method #1 options** | | |
| maxScore | Positive float > 0 | 10 |
| pseudoCount | Positive float > 0 | 1 |
| **Scoring method #3 options** | | |
| maxUntreatedMut | 0 ≤ r ≤ 1 | 0.05 |
| maxMutationRate | 0 ≤ r ≤ 1 | 0.2 |

# A sample configuration file

scoreMethod=Ding
normMethod=2-8%
maxScore=10
pseudoCount=1
reactiveBases=N
normIndependent=FALSE
normWindow=1e9
windowOffset=1e9
meanCoverage=1
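
To make the key/value format concrete, the following sketch parses such a file into a dictionary. It is not RF Norm's parser: the function name `parse_config` is hypothetical, and treating lines starting with `#` as comments is an assumption based on the sample file above. Keys are lower-cased because keys are described as case-insensitive.

```python
import re

# key, then "=" or ":", then value (surrounding whitespace ignored)
PAIR = re.compile(r"^\s*([^=:\s]+)\s*[=:]\s*(.*?)\s*$")

def parse_config(text):
    """Parse key=value / key:value lines into a dict (illustrative only)."""
    config = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Assumption: blank lines and '#' lines carry no settings
        if not stripped or stripped.startswith("#"):
            continue
        match = PAIR.match(stripped)
        if match is None:
            raise ValueError(f"malformed line: {line!r}")
        key, value = match.groups()
        config[key.lower()] = value
    return config

sample = """# A sample configuration file
scoreMethod=Ding
normMethod: 90% Winsorizing
normWindow=1e9
"""
settings = parse_config(sample)
```

Values are kept as strings here. Converting them (for example `1e9` to a number) is left out of the sketch.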


Output XML files

RF Norm produces an XML file for each transcript being analyzed, with the following structure:

<?xml version="1.0" encoding="UTF-8"?>
<data [attributes]>
    <transcript id="Transcript ID" length="Transcript length">
        <sequence>
            Transcript sequence
        </sequence>
        <reactivity>
            Comma-separated list of reactivity values
        </reactivity>
    </transcript>
</data>

The data tag’s attributes allow keeping track of the analysis performed:

| Attribute | Possible values | Description |
| --- | --- | --- |
| tool | rf-norm | The tool that generated this XML file |
| scoring | Ding, Rouskin, Siegfried, or Zubradt | Scoring method |
| norm | 2-8%, Winsorizing 90%, or Box-plot | Normalization method |
| reactive | [ACGT] | Reactive bases |
| win | Positive integer ≥ 3 | Normalization window's size (in nt) |
| offset | Positive integer ≥ 1 | Offset for normalization window sliding |
| remap | TRUE/FALSE | Whether normalized reactivities have been remapped according to Zarringhalam et al., 2012 |
| **Scoring method #1 (Ding et al., 2014)** | | |
| max | Positive float > 0 | Score threshold for capping raw reactivities |
| pseudo | Positive float > 0 | Pseudocount added to avoid division by 0 during reactivity calculation |
| **Scoring method #3 (Siegfried et al., 2014)** | | |
| maxumut | 0 ≤ r ≤ 1 | Maximum per-base mutation rate in untreated sample |
| maxmutrate | 0 ≤ r ≤ 1 | Maximum per-base mutation rate |
| **Scoring method #4 (Zubradt et al., 2016)** | | |
| maxmutrate | 0 ≤ r ≤ 1 | Maximum per-base mutation rate |
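
As a usage illustration, an output file with the structure shown above can be read with Python's standard library. This is a sketch under assumptions: the function name `read_rf_norm_xml` and the example document (transcript ID, sequence, and reactivity values) are made up for the illustration, and it assumes that non-reportable positions appear as the literal `NaN` in the reactivity list.

```python
import xml.etree.ElementTree as ET

def read_rf_norm_xml(xml_text):
    """Return (data attributes, transcript id, sequence, reactivities)."""
    root = ET.fromstring(xml_text)        # the <data> element
    transcript = root.find("transcript")
    # Drop the whitespace and line breaks surrounding the sequence
    sequence = "".join(transcript.findtext("sequence").split())
    # float() accepts "NaN" and ignores surrounding whitespace
    reactivity = [float(v)
                  for v in transcript.findtext("reactivity").split(",")]
    return dict(root.attrib), transcript.get("id"), sequence, reactivity

# Made-up example following the structure described above
example = """<data tool="rf-norm" scoring="Ding">
    <transcript id="tx1" length="4">
        <sequence>
            ACGT
        </sequence>
        <reactivity>
            0.120,NaN,0.000,1.000
        </reactivity>
    </transcript>
</data>"""

attributes, tx_id, seq, react = read_rf_norm_xml(example)
```

The attributes of the data tag come back as a plain dictionary, so the scoring and normalization settings used for a file can be checked before the reactivities are used.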