Important

It is strongly advised to use STAR for read mapping. Currently, mapping has to be performed by directly calling STAR. In future releases, support for STAR will be added to the rf-index and rf-map modules.

Usage

RF Duplex accepts as input a comma-separated list of two files: a BAM file and a Chimeric.out.junction file. If an aligner other than STAR is being used, the Chimeric.out.junction file can be omitted.

$ rf-duplex file1.bam,file1_Chimeric.out.junction ... filen.bam,filen_Chimeric.out.junction

To list the required parameters, simply type:

$ rf-duplex -h

Parameter	Type	Description
-p or --processors	int	Number of processors to use (Default: 1)
-st or --single-transcript		Enables multithreading optimization for single trancript analysis Note: when active, a single transcript at a time will be analyzed, and multiple processors will be used to speed-up time-comsuming operations such as read clustering and base-pair analysis. This is useful, for example, for the analysis of COMRADES experiments, in which a single target transcript is analyzed at a high sequencing depth
-o or --output-dir	string	Output directory (Default: rf_duplex/)
-ow or --overwrite		Overwrites the output directory if already exists
-f or --fasta	string	Path to a FASTA file containing the reference transcripts Note: Transcripts in this file must match transcripts in SAM/BAM file headers
-wl or --whitelist	string	A whitelist containing transcript IDs (one per each row) to restrict the analysis to (optional)
-mg or --min-gap	int	Minimum gap length (in nt) to consider a chimeric read (>0, Default: 1)
-mh or --min-half	int	Minimum length (in nt) of each half of a chimeric read to consider it (>0, Default: 20)
-xr or --max-reads	int	Maximum number of chimeric reads to analyze for each transcript (>0, Default: 100000) Note: all chimeric reads are first imported, then a random sample is extracted
-c or --constraint		Generates a dot-bracket constraint file containing the inferred base-pairs whose probability exceeds a given threshold (controlled by `-ct` or `--constraint-theshold`), to be used with the `rf-fold` module
-ct or --constraint-threshold	float	Sets the probability threshold to include a base-pair in the constraint (0.5-1, Default: 0.5)
-es or --eval-structures	string	Path to a folder of structure files (in .db or .ct format), for which the number of chimeric reads supporting each base-pair will be calculated Note: structure files must be named after their respective reference transcript
-eo or --eval-only		Only calculates the chimeric read support for a reference structure (requires `-es`), and exits
-dc or --dump-chimaeras		Dumps extracted chimeric reads to file
do or --dump-chimaeras-only		Only dumps chimeric reads to file (requires `-dc`) and exits
		Base-pairs inference options
-nv or --no-use-vienna		A modified Smith-Waterman local alignment algorithm is used instead of the ViennaRNA package to infer base-pairs from chimeric reads
-mr or --min-reads	int	Minimum number of chimeric reads to retain a cluster (>0, Default: 2)
-mpf or --min-paired-frac	float	Minimum fraction of paired bases in a chimeric read to retain it (>0-1, Default: 0.25)
-mp or --min-prob	float	Minimum probability to retain an inferred base-pair (>0-1, Default: 0.01)
-rmc or --require-min-chimera		Instead of being inferred from the cluster's centroid, base-pairs are inferred from every single chimeric read belonging to the cluster, and only base-pairs common to a certain fraction of the reads in the cluster (controlled by `-mcf` or `--min-chimera-frac`), are retained Note: if no base-pair is supported by at least `-mcf` chimeric reads, then the cluster is discarded
-mcf or --min-chimera-frac	float	Sets the minimum fraction of chimeric reads supporting a base-pair (requires -rmc; 0.5-1, Default: 0.5)
		Plotting options
-g or --img		Enables the generation of graphical reports reporting the inferred base-pairing probabilities, and the Shannon Entropy
-cs or --color-scale	string	Allows specifying 4 cut-points for the color scale used to report base-pairing probabilities (Default: 0.05,0.1,0.4,0.7 for white
		Clustering options
-cb or --make-cluster-bam		Generates a BAM file for each transcript being analyzed, with chimeric reads belonging to the same cluster grouped together via the XG tag
-mss or --min-sample-size	int	Minimum number of reads to be considered for each iteration of mini-batch clustering (>0, Default: 1000)
-mo or --min-overlap-frac	float	Minimum fractional overlap between the two halves of two chimeric reads to trigger clustering (>0-1, Default: 0.05)
-mi or --max-iterations	int	Maximum number of iterations for K-means (>0, Default 5)

Understanding the algorithm

In its current implementation, the module can do the following:

Cluster chimeric reads (and generate a clustered BAM file for visualization)
Infer base-pairs and base-pairing probabilities from each cluster
Generate a structure constraint corresponding to the major conformation detected (estimated from base-pairing probabilities)

Analysis performed RF Duplex can be quite computational intensive. To this end, different multithreading optimizations are adopted to ensure reasonable execution times. By default, the module is optimized for the analysis of PARIS/SPLASH datasets; these experiments are usually performed transcriptome-wide, and they are characterized by a relatively low number of reads mapping to several different transcripts. In these cases, multiple transcripts are simultaneously analyzed on multiple threads. However, in the case of COMRADES experiments, a single transcript is mapped with a very high coverage. To deal with these cases, enabling the --single-transcript parameter will force a single transcript at a time to be analyzed, and multiple processors will be used to speed-up time-comsuming operations such as read clustering and base-pair analysis.

When analyzing large COMRADES datasets, clustering can become challenging. To this end, it is possible to limit the analysis to a randomly extracted sample of --max-reads chimeric reads. Clustering is performed by using a modified version of the K-means algorithm, that allows automatically determining the optimal number of clusters. To speed up the process, clustering is performed in mini-batches of --min-sample-size reads. The maximum number of K-means clustering iterations is controlled via the --max-iterations parameter.

Centroid definition

As shown above, for each cluster, a centroid is defined by identifying the point of maximum coverage on both sides of the chimeras making up the cluster, and by enlarging it upstream and downstream by L/2, where L is the median length of the each half of the chimeras in the cluster. A new chimeric read is assigned to a given cluster if each half of the chimera overlap by at least --min-overlap-frac with the corresponding half of the centroid. Clusters supported by < --min-reads chimeric reads are discarded.
The results of clustering can be visualized by enabling the generation of clustered BAM files via the --make-cluster-bam parameter. These BAM files store the cluster each read has been assigned to in their XG tag, and they can be easily visualized with Integrative Genomics Viewer (IGV) (for additional details, please refer to the official Broad Institute's IGV page).

IGV Clusters

Once clusters have been defined, the algorithm proceeds to identifying the possible base-pairs, and it estimates the base-pairing probabilities. By default, base-pairs are identified by taking the cluster centroid, and by co-folding the two halves of the centroid using the ViennaRNA package. Alternatively, if the --no-use-vienna parameter is provided, a modified Smith-Watherman algorithm is used to identify the candidate base-pairs. If < --min-paired-frac of the centroid is base-paired, the cluster is discarded. For each cluster, a probability is calculated as the ratio of the number of reads belonging to the cluster, divided by the total number of reads in the cluster, plus the reads belonging to clusters that are incompatible with the cluster in analysis. This probability is then assigned to the base-pairs inferred from that cluster. It is theoretically possible, under certain extreme conditions, for two clusters to share one or more base-pairs; in such situations, the base-pair is assigned the sum of the probabilities of the two clusters.
As an alternative to estimating base-pairs from the cluster centroid, one might evaluate independently each of the chimeras making up a given cluster by enabling the --require-min-chimera parameter; in this case, only the base-pairs common to at least --min-chimera-frac × the chimeras of the cluster will be retained. Base-pairing probabilities are reported as dot-plot files, that can be readily imported and visualized with IGV.
Furthemore, base-pairing probabilities can be exported as arc-plots in SVG format by enabling the --img parameter.

Graphics

In addition to computing base-pairing probabilities, if the --constraint parameter has been specified, the algorithm generates a constraint in dot-bracket (Vienna) format, containing only the inferred base-pairs whose probability exceedes --constraint-threshold. This constraint, likely representing the base-pairs present in the predominant conformation for the RNA in analysis, can be further used for structure inference using rf-fold.