Important
It is strongly advised to use STAR for read mapping. Currently, mapping has to be performed by directly calling STAR. In future releases, support for STAR will be added to the rf-index
and rf-map
modules.
Usage
RF Duplex accepts as input a comma-separated list of two files: a BAM file and a Chimeric.out.junction file. If an aligner other than STAR is being used, the Chimeric.out.junction file can be omitted.
$ rf-duplex file1.bam,file1_Chimeric.out.junction ... filen.bam,filen_Chimeric.out.junction
To list the required parameters, simply type:
$ rf-duplex -h
Parameter | Type | Description |
---|---|---|
-p or --processors | int | Number of processors to use (Default: 1) |
-st or --single-transcript | Enables multithreading optimization for single trancript analysis Note: when active, a single transcript at a time will be analyzed, and multiple processors will be used to speed-up time-comsuming operations such as read clustering and base-pair analysis. This is useful, for example, for the analysis of COMRADES experiments, in which a single target transcript is analyzed at a high sequencing depth |
|
-o or --output-dir | string | Output directory (Default: rf_duplex/) |
-ow or --overwrite | Overwrites the output directory if already exists | |
-f or --fasta | string | Path to a FASTA file containing the reference transcripts Note: Transcripts in this file must match transcripts in SAM/BAM file headers |
-wl or --whitelist | string | A whitelist containing transcript IDs (one per each row) to restrict the analysis to (optional) |
-mg or --min-gap | int | Minimum gap length (in nt) to consider a chimeric read (>0, Default: 1) |
-mh or --min-half | int | Minimum length (in nt) of each half of a chimeric read to consider it (>0, Default: 20) |
-xr or --max-reads | int | Maximum number of chimeric reads to analyze for each transcript (>0, Default: 100000) Note: all chimeric reads are first imported, then a random sample is extracted |
-c or --constraint | Generates a dot-bracket constraint file containing the inferred base-pairs whose probability exceeds a given threshold (controlled by -ct or --constraint-theshold ), to be used with the rf-fold module |
|
-ct or --constraint-threshold | float | Sets the probability threshold to include a base-pair in the constraint (0.5-1, Default: 0.5) |
-es or --eval-structures | string | Path to a folder of structure files (in .db or .ct format), for which the number of chimeric reads supporting each base-pair will be calculated Note: structure files must be named after their respective reference transcript |
-eo or --eval-only | Only calculates the chimeric read support for a reference structure (requires -es ), and exits |
|
Base-pairs inference options | ||
-nv or --no-use-vienna | A modified Smith-Waterman local alignment algorithm is used instead of the ViennaRNA package to infer base-pairs from chimeric reads | |
-mr or --min-reads | int | Minimum number of chimeric reads to retain a cluster (>0, Default: 2) |
-mpf or --min-paired-frac | float | Minimum fraction of paired bases in a chimeric read to retain it (>0-1, Default: 0.25) |
-mp or --min-prob | float | Minimum probability to retain an inferred base-pair (>0-1, Default: 0.01) |
-rmc or --require-min-chimera | Instead of being inferred from the cluster's centroid, base-pairs are inferred from every single chimeric read belonging to the cluster, and only base-pairs common to a certain fraction of the reads in the cluster (controlled by -mcf or --min-chimera-frac ), are retainedNote: if no base-pair is supported by at least -mcf chimeric reads, then the cluster is discarded |
|
-mcf or --min-chimera-frac | float | Sets the minimum fraction of chimeric reads supporting a base-pair (requires -rmc; 0.5-1, Default: 0.5) |
Plotting options | ||
-g or --img | Enables the generation of graphical reports reporting the inferred base-pairing probabilities, and the Shannon Entropy | |
-cs or --color-scale | string | Allows specifying 4 cut-points for the color scale used to report base-pairing probabilities (Default: 0.05,0.1,0.4,0.7 for white |
Clustering options | ||
-cb or --make-cluster-bam | Generates a BAM file for each transcript being analyzed, with chimeric reads belonging to the same cluster grouped together via the XG tag | |
-mss or --min-sample-size | int | Minimum number of reads to be considered for each iteration of mini-batch clustering (>0, Default: 1000) |
-mo or --min-overlap-frac | float | Minimum fractional overlap between the two halves of two chimeric reads to trigger clustering (>0-1, Default: 0.05) |
-mi or --max-iterations | int | Maximum number of iterations for K-means (>0, Default 5) |
Understanding the algorithm
In its current implementation, the module can do the following:
- Cluster chimeric reads (and generate a clustered BAM file for visualization)
- Infer base-pairs and base-pairing probabilities from each cluster
- Generate a structure constraint corresponding to the major conformation detected (estimated from base-pairing probabilities)
Analysis performed RF Duplex can be quite computational intensive. To this end, different multithreading optimizations are adopted to ensure reasonable execution times. By default, the module is optimized for the analysis of PARIS/SPLASH datasets; these experiments are usually performed transcriptome-wide, and they are characterized by a relatively low number of reads mapping to several different transcripts. In these cases, multiple transcripts are simultaneously analyzed on multiple threads. However, in the case of COMRADES experiments, a single transcript is mapped with a very high coverage. To deal with these cases, enabling the --single-transcript
parameter will force a single transcript at a time to be analyzed, and multiple processors will be used to speed-up time-comsuming operations such as read clustering and base-pair analysis.
When analyzing large COMRADES datasets, clustering can become challenging. To this end, it is possible to limit the analysis to a randomly extracted sample of --max-reads
chimeric reads. Clustering is performed by using a modified version of the K-means algorithm, that allows automatically determining the optimal number of clusters. To speed up the process, clustering is performed in mini-batches of --min-sample-size
reads. The maximum number of K-means clustering iterations is controlled via the --max-iterations
parameter.
As shown above, for each cluster, a centroid is defined by identifying the point of maximum coverage on both sides of the chimeras making up the cluster, and by enlarging it upstream and downstream by L/2, where L is the median length of the each half of the chimeras in the cluster. A new chimeric read is assigned to a given cluster if each half of the chimera overlap by at least --min-overlap-frac
with the corresponding half of the centroid. Clusters supported by < --min-reads
chimeric reads are discarded.
The results of clustering can be visualized by enabling the generation of clustered BAM files via the --make-cluster-bam
parameter. These BAM files store the cluster each read has been assigned to in their XG tag, and they can be easily visualized with Integrative Genomics Viewer (IGV) (for additional details, please refer to the official Broad Institute's IGV page).
Once clusters have been defined, the algorithm proceeds to identifying the possible base-pairs, and it estimates the base-pairing probabilities. By default, base-pairs are identified by taking the cluster centroid, and by co-folding the two halves of the centroid using the ViennaRNA package. Alternatively, if the --no-use-vienna
parameter is provided, a modified Smith-Watherman algorithm is used to identify the candidate base-pairs. If < --min-paired-frac
of the centroid is base-paired, the cluster is discarded. For each cluster, a probability is calculated as the ratio of the number of reads belonging to the cluster, divided by the total number of reads in the cluster, plus the reads belonging to clusters that are incompatible with the cluster in analysis. This probability is then assigned to the base-pairs inferred from that cluster. It is theoretically possible, under certain extreme conditions, for two clusters to share one or more base-pairs; in such situations, the base-pair is assigned the sum of the probabilities of the two clusters.
As an alternative to estimating base-pairs from the cluster centroid, one might evaluate independently each of the chimeras making up a given cluster by enabling the --require-min-chimera
parameter; in this case, only the base-pairs common to at least --min-chimera-frac
× the chimeras of the cluster will be retained. Base-pairing probabilities are reported as dot-plot files, that can be readily imported and visualized with IGV.
Furthemore, base-pairing probabilities can be exported as arc-plots in SVG format by enabling the --img
parameter.
In addition to computing base-pairing probabilities, if the --constraint
parameter has been specified, the algorithm generates a constraint in dot-bracket (Vienna) format, containing only the inferred base-pairs whose probability exceedes --constraint-threshold
. This constraint, likely representing the base-pairs present in the predominant conformation for the RNA in analysis, can be further used for structure inference.
Information
Starting with the next release, it will be possible to directly import constraint files into rf-fold
, hence enabling the integration of structure probing and direct RNA-RNA interaction capture data.