The RF Norm module takes one (Rouskin and Zubradt methods), two (Ding and Siegfried methods), or three (Siegfried method) RC files generated by the RF Count module, and performs normalization to obtain transcriptome-wide per-base reactivities.
Reactivity scores can be computed using 4 methods:
Scoring of RT-stops/nuclease cuts-based methods
[1] Ding et al., 2014 (PMID: 24270811)
Per-base signal is calculated as the ratio of the natural log (ln) of the raw count of RT-stops/nuclease cuts at a given position of a transcript to the average of the ln of RT-stops/nuclease cuts along the whole transcript:

$$U_i = \frac{\ln(n_{U_i} + p)}{\frac{1}{l}\sum_{j=1}^{l}\ln(n_{U_j} + p)} \qquad T_i = \frac{\ln(n_{T_i} + p)}{\frac{1}{l}\sum_{j=1}^{l}\ln(n_{T_j} + p)}$$

where $n_{U_i}$ and $n_{T_i}$ are respectively the raw read counts in the untreated (or RNase V1) and treated (DMS, CMCT, SHAPE, or Nuclease S1) samples at position $i$ of the transcript, $l$ is the transcript's length, and $p$ is a pseudocount added to deal with non-covered regions. $U_i$ and $T_i$ are respectively the normalized number of RT-stops/nuclease cuts at position $i$ in the untreated and treated samples.
Score at position $i$ is then calculated as:

$$S_i = \max(0,\; T_i - U_i)$$
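The scoring steps described above can be sketched in Python. This is a minimal illustration, not RF Norm's actual code: the function name `ding_scores` is hypothetical, the score is assumed to be the non-negative difference between the normalized treated and untreated signals, and the cap (mirroring the `-s`/`--max-score` option) is assumed to be a simple upper clamp.

```python
import math


def ding_scores(untreated, treated, pseudocount=1.0, max_score=10.0):
    """Illustrative Ding-style scoring from per-base RT-stop counts."""
    if len(untreated) != len(treated):
        raise ValueError("count vectors must have the same length")

    def normalize(counts):
        # ln(count + p) at each position, divided by the transcript-wide mean
        logs = [math.log(n + pseudocount) for n in counts]
        mean_log = sum(logs) / len(logs)
        if mean_log == 0:
            # Assumption: a transcript with no signal yields all-zero values
            return [0.0] * len(logs)
        return [v / mean_log for v in logs]

    u = normalize(untreated)
    t = normalize(treated)
    # Assumption: negative differences are set to 0, large values are capped
    return [min(max_score, max(0.0, ti - ui)) for ti, ui in zip(t, u)]


if __name__ == "__main__":
    print(ding_scores([1, 1, 1, 1], [1, 1, 1, 7]))
```

In this toy input only the last position carries more signal in the treated sample than in the untreated one, so it is the only position with a non-zero score.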
[2] Rouskin et al., 2014 (PMID: 24336214)
The untreated sample is not considered. Per-base RT-stops/nuclease cuts are used as a direct measure of the raw signal.
Warning
Data scored with the Rouskin method can only be normalized using the 90% Winsorizing approach.
Scoring of mutational profiling-based methods
[3] Siegfried et al., 2014 (PMID: 25028896)
This method takes into account both an untreated sample and (optionally) a denatured control sample.
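Per-base raw signal is calculated as:

$$R_i = \frac{\dfrac{n_{T_i}}{c_{T_i}} - \dfrac{n_{U_i}}{c_{U_i}}}{\dfrac{n_{D_i}}{c_{D_i}}}$$

where $R_i$ is the raw reactivity at position $i$ of the transcript, $n_{T_i}$, $n_{U_i}$, and $n_{D_i}$ are respectively the mutation counts in the treated, untreated, and denatured samples at position $i$, and $c_{T_i}$, $c_{U_i}$, and $c_{D_i}$ are respectively the number of reads covering position $i$ in the treated, untreated, and denatured samples.
If no denatured control sample is provided, raw reactivities are simply calculated as:

$$R_i = \frac{n_{T_i}}{c_{T_i}} - \frac{n_{U_i}}{c_{U_i}}$$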
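As a rough Python sketch of this calculation, assuming the raw signal is the background-subtracted mutation rate, optionally divided by the mutation rate of the denatured sample: the function name `siegfried_raw` is hypothetical, and reporting NaN for positions with zero coverage or a zero denatured rate is an assumption of this example. The mutation-rate filters (`-mu`, `-mm`) are not applied here.

```python
def siegfried_raw(n_t, c_t, n_u, c_u, n_d=None, c_d=None):
    """Illustrative per-base raw signal from mutation counts and coverage."""
    nan = float("nan")

    def rate(n, c):
        # Mutation rate; NaN when the position is not covered (assumption)
        return n / c if c > 0 else nan

    raw = []
    for i in range(len(n_t)):
        value = rate(n_t[i], c_t[i]) - rate(n_u[i], c_u[i])
        if n_d is not None:
            denatured = rate(n_d[i], c_d[i])
            # Assumption: a zero (or undefined) denatured rate yields NaN
            value = value / denatured if denatured > 0 else nan
        raw.append(value)
    return raw


if __name__ == "__main__":
    # Without denatured control: 10/100 - 1/100
    print(siegfried_raw([10], [100], [1], [100]))
    # With denatured control: (10/100 - 1/100) / (20/100)
    print(siegfried_raw([10], [100], [1], [100], [20], [100]))
```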
[4] Zubradt et al., 2016 (PMID: 27819661)
The untreated sample is not considered. Per-base raw signal is calculated as:

$$R_i = \frac{n_{T_i}}{c_{T_i}}$$

where $R_i$ is the raw signal, and $n_{T_i}$ and $c_{T_i}$ are respectively the mutation count and the read coverage at position $i$ of the transcript.
Normalization of raw reactivities
Raw reactivity scores can be normalized using 3 different approaches:
Method | Description |
---|---|
2-8% Normalization | From the top 10% of values, the top 2% is ignored, then any reactivity value along the entire transcript is divided by the average of the remaining 8% |
90% Winsorizing | Each reactivity value above the 95th percentile is set to the 95th percentile and each reactivity value below the 5th percentile is set to the 5th percentile, then the reactivity at each position of the transcript is divided by the value of the 95th percentile |
Box-plot Normalization | Values greater than 1.5x the interquartile range (numerical distance between the 25th and 75th percentiles) above the 75th percentile are removed. After excluding these outliers, the next 10% of remaining reactivities are averaged, and all reactivities (including outliers) are divided by this value. |
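The three approaches in the table can be sketched in Python as follows. This is an illustration under stated assumptions, not RF Norm's implementation: the function names are hypothetical, inputs are assumed to contain no NaN values, percentiles are computed with the standard library's `statistics.quantiles` (inclusive method), and "the next 10%" in box-plot normalization is read as the highest 10% of the values left after outlier removal.

```python
import statistics


def norm_2_8(values):
    """Divide by the mean of the values ranked between the top 2% and top 10%."""
    ranked = sorted(values, reverse=True)
    lo, hi = len(ranked) * 2 // 100, len(ranked) * 10 // 100
    if hi <= lo:
        raise ValueError("too few values for 2-8% normalization")
    chosen = ranked[lo:hi]
    factor = sum(chosen) / len(chosen)
    return [v / factor for v in values]


def winsorize_90(values):
    """Clip to the 5th-95th percentile range, then divide by the 95th percentile."""
    cuts = statistics.quantiles(values, n=20, method="inclusive")
    p5, p95 = cuts[0], cuts[-1]
    if p95 == 0:
        raise ValueError("95th percentile is zero")
    return [min(max(v, p5), p95) / p95 for v in values]


def norm_boxplot(values):
    """Drop outliers above Q3 + 1.5 * IQR, then divide all values by the mean
    of the highest 10% of the remaining ones."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    upper = q3 + 1.5 * (q3 - q1)
    kept = sorted((v for v in values if v <= upper), reverse=True)
    top = kept[:max(1, len(kept) // 10)]
    factor = sum(top) / len(top)
    # Outliers are excluded from the factor but still get normalized
    return [v / factor for v in values]


if __name__ == "__main__":
    data = [float(v) for v in range(1, 101)]
    print(max(norm_2_8(data)), max(winsorize_90(data)), max(norm_boxplot(data)))
```

Note that only Winsorizing bounds the output to a maximum of 1; the other two approaches can return values above 1 for the most reactive positions.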
Normalized reactivities can be further remapped to values ranging from 0 to 1 according to Zarringhalam et al., 2012 (PMID: 23091593). In this approach, values < 0.25 are linearly mapped to [0-0.35[, values ≥ 0.25 and < 0.3 are linearly mapped to [0.35-0.55[, values ≥ 0.3 and < 0.7 are linearly mapped to [0.55-0.85[, and values ≥ 0.7 are linearly mapped to [0.85-1].
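The piecewise-linear remapping can be sketched as below. The breakpoints are the ones listed above; the function name `remap` and the `max_value` parameter are hypothetical. The upper end of the last input interval is not stated in this section, so the sketch assumes it is `max_value` (1.0 by default) and clamps inputs to `[0, max_value]`.

```python
def remap(value, max_value=1.0):
    """Illustrative piecewise-linear remapping of a normalized reactivity."""
    if max_value <= 0.7:
        raise ValueError("max_value must be greater than 0.7")
    # (input low, input high, output low, output high)
    segments = [
        (0.0, 0.25, 0.0, 0.35),
        (0.25, 0.3, 0.35, 0.55),
        (0.3, 0.7, 0.55, 0.85),
        (0.7, max_value, 0.85, 1.0),  # assumption: inputs end at max_value
    ]
    v = min(max(value, 0.0), max_value)
    for index, (lo, hi, out_lo, out_hi) in enumerate(segments):
        is_last = index == len(segments) - 1
        if v < hi or is_last:
            return out_lo + (v - lo) / (hi - lo) * (out_hi - out_lo)


if __name__ == "__main__":
    for x in (0.0, 0.125, 0.25, 0.3, 0.7, 1.0):
        print(x, round(remap(x), 3))
```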
Sliding-window normalization
RF Norm supports data normalization in sliding windows. Windows can be either static (default) or dynamic. When a window size is chosen, normalization is performed by sliding a window of that size along the transcript by the chosen offset. While the choice of window type is irrelevant for SHAPE data, it becomes particularly important when dealing with base-specific reagents. Consider the example below, in which an RNA has been modified by DMS, which only reacts with A and C residues.
In the above example, the use of static windows of size 10 nt results in an erroneous overestimation of base reactivities for certain residues (marked in red). This happens because A/C residues are unevenly distributed along the transcript, so certain windows contain far fewer than 50% A/C bases (contrary to what would be expected by chance). The use of dynamic windows of size 10 nt instead avoids this overestimation, as the window's size is dynamically adjusted to always include 10 A/C residues. The overestimation effect can also be minimized by increasing the size of static windows.
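The difference between the two window types can be illustrated with a small Python sketch. The function name `window_bounds` is hypothetical, and the way the dynamic window grows (rightward from its start until it holds the requested number of reactive bases, or the transcript ends) is an assumption of this example rather than a description of RF Norm's internals.

```python
def window_bounds(sequence, start, size, reactive="AC", dynamic=False):
    """Return (start, end) of a normalization window, with end exclusive."""
    if not dynamic:
        # Static window: a fixed number of nucleotides
        return start, min(start + size, len(sequence))
    # Dynamic window: extend until `size` reactive bases are included
    count = 0
    end = start
    while end < len(sequence) and count < size:
        if sequence[end] in reactive:
            count += 1
        end += 1
    return start, end


if __name__ == "__main__":
    rna = "AAGGGGGGGGCACACACACA"
    for dynamic in (False, True):
        s, e = window_bounds(rna, 0, 4, dynamic=dynamic)
        n_reactive = sum(base in "AC" for base in rna[s:e])
        print("dynamic" if dynamic else "static", (s, e), n_reactive)
```

On this A/C-poor stretch the static window spans 4 nt but covers only 2 reactive bases, whereas the dynamic window grows to 12 nt so that it covers 4.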
Usage
To list the required parameters, simply type:
$ rf-norm -h
Parameter | Type | Description |
---|---|---|
-u or --untreated | string | Path to the RC file for the non-treated/denatured (or RNase V1) sample (required by Ding/Siegfried scoring methods) |
-t or --treated | string | Path to the RC file for the treated (or Nuclease S1) sample |
-d or --denatured | string | Path to the RC file for the denatured sample (optional for Siegfried scoring method) |
-i or --index | string[,string] | A comma-separated (no spaces) list of RCI index files for the provided RC files. Note #1: RCI files must be provided in the order 1. Untreated/Denatured, 2. Treated. Note #2: If a single RCI file is specified, it will be used for all RC files. Note #3: If no RCI index is provided, it will be generated at runtime, and stored in the same folder as the untreated/denatured/treated samples |
-p or --processors | int | Number of processors (threads) to use (Default: 1) |
-o or --output-dir | string | Output directory for writing normalized data in XML format (Default: <treated>_vs_<untreated>_norm/ for Ding method or Siegfried method without denatured sample, <treated>_norm/ for Rouskin/Zubradt methods, <treated>_vs_<untreated>_<denatured>_norm/ for Siegfried method with denatured sample) |
-ow or --overwrite | | Overwrites the output directory if it already exists |
-c or --config-file | string | Path to a configuration file with normalization parameters (see Configuration files paragraph) Note #1: If the provided file exists, the loaded configuration will override any command-line specified parameter Note #2: If the provided file doesn’t exist, it will be generated using the specified command-line (or default) parameters |
-sm or --scoring-method | int | Method for score calculation (1-4, Default: 1): 1. Ding et al., 2014 2. Rouskin et al., 2014 3. Siegfried et al., 2014 4. Zubradt et al., 2016 |
-nm or --norm-method | int | Method for signal normalization (1-3, Default: 1): 1. 2-8% Normalization 2. 90% Winsorizing 3. Box-plot Normalization |
-r or --raw | | Reports raw reactivities (skips data normalization) |
-rm or --remap-reactivities | | Remaps normalized reactivities to values ranging from 0 to 1 according to Zarringhalam et al., 2012 |
-rb or --reactive-bases | string | Reactive bases to consider for signal normalization (Default: N [ACGT]) Note: This parameter accepts any IUPAC code, or their combination (e.g. -rb M , or -rb AC ). Any other base will be reported as NaN |
-ni or --norm-independent | | Each one of the reactive bases will be normalized independently (e.g. -rb AC -ni will independently normalize A and C residues) |
-dw or --dynamic-window | | When enabled, the normalization window is dynamically resized to include at least the number of reactive bases set by -nw (e.g. -rb AC -nw 50 -dw instructs RF Norm to normalize reactivities in windows containing at least 50 A/C residues) |
-mc or --mean-coverage | float | Discards any transcript with mean coverage below this threshold (≥0, Default: 0) |
-ec or --median-coverage | float | Discards any transcript with median coverage below this threshold (≥0, Default: 0) |
-nw or --norm-window | int | Window size (in nt) for signal normalization (≥3, Default: whole transcript [Ding; Siegfried], 50 [Rouskin; Zubradt]) Note: a maximum window size of 30,000 nt is allowed when -dw (or --dynamic-window ) is enabled |
-wo or --window-offset | int | Offset (in nt) for window sliding during normalization (Default: none [Ding; Siegfried], 50 [Rouskin; Zubradt]) |
-D or --decimals | int | Number of decimals for reporting reactivities (1-10, Default: 3) |
-n or --nan | int | Transcript positions with read coverage below this threshold will be reported as NaN in the reactivity profile (>0, Default: 10) |
Scoring method #1 options (Ding et al., 2014) | ||
-pc or --pseudocount | float | Pseudocount added to reactivities to avoid division by 0 (>0, Default: 1) |
-s or --max-score | float | Score threshold for capping raw reactivities (>0, Default: 10) |
Scoring method #3 options (Siegfried et al., 2014) | ||
-mu or --max-untreated-mut | float | Maximum per-base mutation rate in untreated sample (≤1, Default: 0.05 [5%]) |
-mm or --max-mutation-rate | float | Maximum per-base mutation rate (≤1, Default: 0.2 [20%]) |
Scoring method #4 options (Zubradt et al., 2016) | ||
-mm or --max-mutation-rate | float | Maximum per-base mutation rate (≤1, Default: 0.2 [20%]) |
Configuration files
RF Norm configuration files are used to provide normalization parameters for the analysis, without the need to manually specify them from the command-line.
Configuration files are composed of a list of key/value pairs, separated by the equal sign (=), or by the colon punctuation mark (:). Keys and values are case-insensitive.
Accepted key/value pairs are:
Parameters | Accepted values | Default value |
---|---|---|
scoreMethod | "Ding" (or 1); "Rouskin" (or 2); "Siegfried" (or 3); "Zubradt" (or 4) | Ding |
normMethod | "2-8%" (or 1); "90% Winsorizing" (or 2); "Box-plot" (or 3) | 2-8% |
reactiveBases | [ACGTURYSWKMBDHVN] (or "all") | all |
normIndependent | TRUE/FALSE; Yes/No; 1/0 | FALSE |
normWindow | Positive integer ≥ 3 | 1e9 [Ding; Siegfried] 50 [Rouskin; Zubradt] |
windowOffset | Positive integer > 0 | 1e9 [Ding; Siegfried] 50 [Rouskin; Zubradt] |
meanCoverage | Positive float ≥ 0 | 0 |
medianCoverage | Positive float ≥ 0 | 0 |
remapReactivities | TRUE/FALSE; Yes/No; 1/0 | FALSE |
Scoring method #1 options | ||
maxScore | Positive float > 0 | 10 |
pseudoCount | Positive float > 0 | 1 |
Scoring method #3 options | ||
maxUntreatedMut | 0 ≤ r ≤ 1 | 0.05 |
maxMutationRate | 0 ≤ r ≤ 1 | 0.2 |
# A sample configuration file
scoreMethod=Ding
normMethod=2-8%
maxScore=10
pseudoCount=1
reactiveBases=N
normIndependent=FALSE
normWindow=1e9
windowOffset=1e9
meanCoverage=1
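For readers who want to generate or check such files programmatically, the format described above can be read with a few lines of Python. This is a sketch only, not the parser used by RF Norm: the function name `parse_config` is hypothetical, keys are lowercased to reflect their case-insensitivity, and treating lines starting with `#` as comments is an assumption based on the sample file.

```python
import re

# key, then "=" or ":", then value (surrounding whitespace ignored)
_PAIR = re.compile(r"^\s*([^=:]+?)\s*[=:]\s*(.*?)\s*$")


def parse_config(text):
    """Parse key/value pairs separated by '=' or ':' into a dict."""
    params = {}
    for line in text.splitlines():
        stripped = line.strip()
        # Assumption: blank lines and '#' lines carry no parameters
        if not stripped or stripped.startswith("#"):
            continue
        match = _PAIR.match(stripped)
        if match is None:
            raise ValueError(f"not a key/value pair: {line!r}")
        key, value = match.groups()
        params[key.lower()] = value
    return params


if __name__ == "__main__":
    sample = """# A sample configuration file
scoreMethod=Ding
normMethod: 2-8%
normWindow=1e9
"""
    print(parse_config(sample))
```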
Output XML files
RF Norm produces an XML file for each transcript being analyzed, with the following structure:
<?xml version="1.0" encoding="UTF-8"?>
<data [attributes]>
<transcript id="Transcript ID" length="Transcript length">
<sequence>
Transcript sequence
</sequence>
<reactivity>
Comma-separated list of reactivity values
</reactivity>
</transcript>
</data>
The data tag’s attributes allow keeping track of the analysis performed:
Attribute | Possible values | Description |
---|---|---|
tool | rf-norm | The tool that generated this XML file |
scoring | Ding, Rouskin, Siegfried, or Zubradt | Scoring method |
norm | 2-8%, Winsorizing 90%, or Box-plot | Normalization method |
reactive | [ACGT] | Reactive bases |
win | Positive integer ≥ 3 | Normalization window's size (in nt) |
offset | Positive integer ≥ 1 | Offset for normalization window sliding |
remap | TRUE/FALSE | Whether normalized reactivities have been remapped according to Zarringhalam et al., 2012 |
Scoring method #1 (Ding et al., 2014) | ||
max | Positive float > 0 | Score threshold for capping raw reactivities |
pseudo | Positive float > 0 | Pseudocount added to avoid division by 0 during reactivity calculation |
Scoring method #3 (Siegfried et al., 2014) | ||
maxumut | 0 ≤ r ≤ 1 | Maximum per-base mutation rate in untreated sample |
maxmutrate | 0 ≤ r ≤ 1 | Maximum per-base mutation rate |
Scoring method #4 (Zubradt et al., 2016) | ||
maxmutrate | 0 ≤ r ≤ 1 | Maximum per-base mutation rate |
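A file with the structure shown above can be loaded with Python's standard `xml.etree.ElementTree` module. The sketch below is illustrative: the function name `read_reactivities` is hypothetical, and it assumes a single `transcript` element per file and that undefined positions are written as `NaN`, which `float()` accepts.

```python
import xml.etree.ElementTree as ET


def read_reactivities(xml_text):
    """Return (data attributes, transcript id, sequence, reactivity list)."""
    root = ET.fromstring(xml_text)
    transcript = root.find("transcript")
    # Remove the whitespace and line breaks surrounding the sequence
    sequence = "".join(transcript.findtext("sequence").split())
    values = [
        float(v)  # float() also parses "NaN"
        for v in transcript.findtext("reactivity").split(",")
        if v.strip()
    ]
    return dict(root.attrib), transcript.get("id"), sequence, values


if __name__ == "__main__":
    example = (
        '<data tool="rf-norm" scoring="Ding">'
        '<transcript id="tx1" length="4">'
        "<sequence>\nACGU\n</sequence>"
        "<reactivity>\n0.1,NaN,0.5,1.0\n</reactivity>"
        "</transcript></data>"
    )
    attrs, tid, seq, react = read_reactivities(example)
    print(attrs["scoring"], tid, seq, react)
```

For files on disk, `ET.parse(path).getroot()` can be used in place of `ET.fromstring`.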