RF Eval evaluates the agreement between one or more secondary structures and the corresponding XML reactivity file(s).
Reference structures can be provided either in Vienna format (dot-bracket notation), or in CT format. A single file containing the structure for multiple transcripts can be provided:
# Vienna format
>Transcript#1
AAAAAAAAAAAAAAAAAAAAUUUUUUUUUUUUUUUUUUUUU
.((((((((((((((((((....))))))))))))))))))
>Transcript#2
CCCCCCCCCCCCCCCCCGGGGGGGGGGGGGGGGGGGG
(((((((((((((((((...)))))))))))))))))
>Transcript#3
GCUAGCUAGCUAGCUAGCUAGUCAAGACGAGUCGAUGCU
(((((((((....))))))))).................
Important
The IDs of the provided structures must match the file names of the reactivity XML files (e.g. "Transcript#1" expects an XML file named "Transcript#1.xml").
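The record layout and the ID-to-file-name rule above can be illustrated with a minimal Python sketch. This is not RF Eval's own parser: `parse_vienna` and `expected_xml_name` are hypothetical helper names, and skipping lines that start with `#` is an assumption made so that a comment line like the one in the example is tolerated.

```python
from typing import Dict, Tuple


def parse_vienna(text: str) -> Dict[str, Tuple[str, str]]:
    """Parse multi-record Vienna text into {transcript_id: (sequence, dot_bracket)}."""
    # Drop blank lines and (by assumption) comment lines starting with '#'.
    lines = [ln.strip() for ln in text.splitlines()]
    lines = [ln for ln in lines if ln and not ln.startswith("#")]

    records: Dict[str, Tuple[str, str]] = {}
    i = 0
    while i < len(lines):
        if not lines[i].startswith(">"):
            raise ValueError(f"Expected a '>' header, got: {lines[i]!r}")
        if i + 2 >= len(lines):
            raise ValueError(f"Incomplete record for {lines[i]!r}")
        transcript_id = lines[i][1:]
        sequence, structure = lines[i + 1], lines[i + 2]
        if len(sequence) != len(structure):
            raise ValueError(f"Sequence/structure length mismatch for {transcript_id}")
        records[transcript_id] = (sequence, structure)
        i += 3
    return records


def expected_xml_name(transcript_id: str) -> str:
    """File name the reactivity XML must have for a given structure ID."""
    return f"{transcript_id}.xml"
```

For instance, `expected_xml_name("Transcript#1")` gives `Transcript#1.xml`, which is the file name the note above requires.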
Metrics
RF Eval computes three metrics of agreement between reactivity data and structure. All three metrics yield values between 0 and 1, with 0 representing 0% agreement and 1 representing 100% agreement.
[1] Unpaired coefficient
This is the simplest metric. It measures the fraction of highly reactive bases (bases whose reactivity exceeds a user-defined threshold t) that are unpaired in the secondary structure:

$$\text{Unpaired coefficient} = \frac{|k \cap u|}{|k \cap u| + |k \cap p|}$$

where k is the set of bases with reactivity > t, and u and p are the sets of unpaired and paired bases in the structure, respectively.
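As an illustration, the unpaired coefficient can be sketched in a few lines of Python. This is a sketch under stated assumptions, not RF Eval's code: `unpaired_coefficient` is a hypothetical name, the structure is assumed to be a dot-bracket string in which `.` marks unpaired bases, and missing reactivities (`None` or NaN) are assumed to be skipped.

```python
def unpaired_coefficient(reactivities, structure, t=0.7):
    """Fraction of bases with reactivity > t that are unpaired ('.')."""
    hits_unpaired = 0
    hits_paired = 0
    for r, s in zip(reactivities, structure):
        # Assumption: missing values (None or NaN) do not count at all.
        if r is None or r != r:
            continue
        if r > t:
            if s == ".":
                hits_unpaired += 1
            else:
                hits_paired += 1
    total = hits_unpaired + hits_paired
    # Undefined when no base exceeds the cutoff.
    return hits_unpaired / total if total else float("nan")
```

The default `t=0.7` mirrors the default of the `-c` parameter listed under Usage.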
[2] Data-Structure Correlation Index (DSCI)
This metric was originally proposed by Lan et al., 2021 (doi: 10.1101/2020.06.29.178343) and is closely related to the Mann-Whitney U statistic. The DSCI is defined as the probability that a randomly chosen unpaired base has a greater reactivity than a randomly chosen paired base:

$$\mathrm{DSCI} = \frac{1}{mn} \sum_{i=1}^{n} \sum_{j=1}^{m} \mathbb{1}(u_i > p_j)$$

where p is the set of reactivities of all m paired bases, u is the set of reactivities of all n unpaired bases, and $\mathbb{1}(\cdot)$ equals 1 when the condition holds and 0 otherwise.
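The definition above translates directly into a pairwise comparison. The following Python sketch is illustrative only: `dsci` is a hypothetical name, the structure is assumed to be a dot-bracket string, missing reactivities (`None`) are assumed to be skipped, and ties are counted as "not greater", which follows the wording above but may differ from how RF Eval handles them.

```python
def dsci(reactivities, structure):
    """Probability that a random unpaired base is more reactive than a random paired base."""
    unpaired = [r for r, s in zip(reactivities, structure) if s == "." and r is not None]
    paired = [r for r, s in zip(reactivities, structure) if s != "." and r is not None]
    if not unpaired or not paired:
        return float("nan")  # undefined without both classes
    # Count (unpaired, paired) combinations where the unpaired base is strictly more reactive.
    greater = sum(1 for u in unpaired for p in paired if u > p)
    return greater / (len(unpaired) * len(paired))
```

The double loop is O(n·m), which is fine for a sketch; a rank-based computation would scale better on long transcripts.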
[3] Area Under the Receiver Operating Characteristic Curve (AUROC)
This metric is typically used to assess the performance of a binary classifier at varying threshold values. Here, unpaired bases are the positive class and paired bases are the negative class.
Briefly, the reactivity threshold t is increased from 0 to 1 in 0.005 increments. At each threshold, the True Positive Rate (TPR) is calculated as:

$$\mathrm{TPR} = \frac{TP}{P}$$

where TP is the number of unpaired bases whose reactivity ≥ t, and P is the total number of unpaired bases in the structure.
The False Positive Rate (FPR) is instead calculated as:

$$\mathrm{FPR} = \frac{FP}{N}$$

where FP is the number of paired bases whose reactivity ≥ t, and N is the total number of paired bases in the structure.
The AUROC is then defined as the area under the curve described by the FPR-TPR value pairs obtained across all values of t.
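The threshold sweep described above can be sketched in Python as follows. This is a sketch under assumptions rather than RF Eval's implementation: `auroc` is a hypothetical name, the structure is assumed to be a dot-bracket string, reactivities are assumed to be normalized to the 0-1 range, the area is assumed to be integrated with the trapezoidal rule, and the point (0, 0) is added explicitly so that the curve is closed even when some bases have a reactivity of exactly 1.

```python
def auroc(reactivities, structure, step=0.005):
    """Area under the FPR-TPR curve obtained by sweeping the reactivity threshold."""
    unpaired = [r for r, s in zip(reactivities, structure) if s == "." and r is not None]
    paired = [r for r, s in zip(reactivities, structure) if s != "." and r is not None]
    if not unpaired or not paired:
        return float("nan")  # undefined without both classes

    n_steps = int(round(1 / step))
    points = {(0.0, 0.0)}  # assumption: close the curve at the origin
    for i in range(n_steps + 1):
        t = i * step
        tpr = sum(r >= t for r in unpaired) / len(unpaired)
        fpr = sum(r >= t for r in paired) / len(paired)
        points.add((fpr, tpr))

    # Both rates decrease monotonically with t, so sorting by (FPR, TPR)
    # recovers the curve from left to right.
    curve = sorted(points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(curve, curve[1:]):
        area += (x1 - x0) * (y0 + y1) / 2  # trapezoidal rule
    return area
```

When unpaired and paired reactivities are perfectly separated, the sketch returns 1; when they are perfectly inverted, it returns 0.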
Usage
To list the required parameters, simply type:
$ rf-eval -h
Parameter | Type | Description |
---|---|---|
-s or --structures | string | Path to a (folder of) structure file(s) |
-r or --reactivities | string | Path to a (folder of) XML reactivity file(s) |
-o or --output | string | Output file with metrics per transcript (Default: rf_eval.txt) |
-ow or --overwrite | | Overwrites output file (if the specified file already exists) |
-p or --processors | int | Number of processors to use (≥1, Default: 1) |
-tu or --terminal-as-unpaired | | Treats terminal base-pairs as if they were unpaired. Note: this parameter and -it are mutually exclusive |
-it or --ignore_terminal | | Terminal base-pairs are excluded from calculations. Note: this parameter and -tu are mutually exclusive |
-kl or --keep-lonelypairs | | Lonely base-pairs (helices of 1 bp) are retained |
-kp or --keep-pseudoknots | | Pseudoknotted base-pairs are retained |
-c or --reactivity-cutoff | float | Cutoff for considering a base highly-reactive when computing the unpaired coefficient (>0, Default: 0.7) |
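The -kl description implies that lonely base-pairs (helices of 1 bp) are dropped unless the flag is given. The Python sketch below shows one way such a filter could work on a dot-bracket string. It is an assumption-laden illustration, not RF Eval's logic: `pair_map` and `remove_lonely_pairs` are hypothetical names, a pair is considered lonely when it is stacked on neither side, and only `(` and `)` are interpreted, so pseudoknot brackets are left untouched.

```python
def pair_map(structure):
    """Map each paired position to its partner for a '(' / ')' dot-bracket string."""
    stack, pairs = [], {}
    for i, c in enumerate(structure):
        if c == "(":
            stack.append(i)
        elif c == ")":
            if not stack:
                raise ValueError("Unbalanced structure: unexpected ')'")
            j = stack.pop()
            pairs[i], pairs[j] = j, i
    if stack:
        raise ValueError("Unbalanced structure: unclosed '('")
    return pairs


def remove_lonely_pairs(structure):
    """Replace base-pairs that form a 1-bp helix with unpaired positions."""
    pairs = pair_map(structure)
    out = list(structure)
    for i, j in pairs.items():
        if i > j:
            continue  # visit each pair once, from its 5' side
        stacked_outside = pairs.get(i - 1) == j + 1
        stacked_inside = i + 1 < j - 1 and pairs.get(i + 1) == j - 1
        if not (stacked_outside or stacked_inside):
            out[i] = out[j] = "."
    return "".join(out)
```

For example, the isolated inner pair in `((..(...)..))` is removed, while the two stacked outer pairs are kept.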