Important

It is strongly advised to use STAR for read mapping. Currently, mapping has to be performed by directly calling STAR. In future releases, support for STAR will be added to the rf-index and rf-map modules.


Usage

RF Duplex accepts as input a comma-separated list of two files: a BAM file and a Chimeric.out.junction file. If an aligner other than STAR is being used, the Chimeric.out.junction file can be omitted.

$ rf-duplex file1.bam,file1_Chimeric.out.junction ... filen.bam,filen_Chimeric.out.junction

To list the required parameters, simply type:

$ rf-duplex -h
Parameter Type Description
-p or --processors int Number of processors to use (Default: 1)
-st or --single-transcript Enables multithreading optimization for single trancript analysis
Note: when active, a single transcript at a time will be analyzed, and multiple processors will be used to speed-up time-comsuming operations such as read clustering and base-pair analysis. This is useful, for example, for the analysis of COMRADES experiments, in which a single target transcript is analyzed at a high sequencing depth
-o or --output-dir string Output directory (Default: rf_duplex/)
-ow or --overwrite Overwrites the output directory if already exists
-f or --fasta string Path to a FASTA file containing the reference transcripts
Note: Transcripts in this file must match transcripts in SAM/BAM file headers
-wl or --whitelist string A whitelist containing transcript IDs (one per each row) to restrict the analysis to (optional)
-mg or --min-gap int Minimum gap length (in nt) to consider a chimeric read (>0, Default: 1)
-mh or --min-half int Minimum length (in nt) of each half of a chimeric read to consider it (>0, Default: 20)
-xr or --max-reads int Maximum number of chimeric reads to analyze for each transcript (>0, Default: 100000)
Note: all chimeric reads are first imported, then a random sample is extracted
-c or --constraint Generates a dot-bracket constraint file containing the inferred base-pairs whose probability exceeds a given threshold (controlled by -ct or --constraint-theshold), to be used with the rf-fold module
-ct or --constraint-threshold float Sets the probability threshold to include a base-pair in the constraint (0.5-1, Default: 0.5)
-es or --eval-structures string Path to a folder of structure files (in .db or .ct format), for which the number of chimeric reads supporting each base-pair will be calculated
Note: structure files must be named after their respective reference transcript
-eo or --eval-only Only calculates the chimeric read support for a reference structure (requires -es), and exits
-dc or --dump-chimaeras Dumps extracted chimeric reads to file
do or --dump-chimaeras-only Only dumps chimeric reads to file (requires -dc) and exits
Base-pairs inference options
-nv or --no-use-vienna A modified Smith-Waterman local alignment algorithm is used instead of the ViennaRNA package to infer base-pairs from chimeric reads
-mr or --min-reads int Minimum number of chimeric reads to retain a cluster (>0, Default: 2)
-mpf or --min-paired-frac float Minimum fraction of paired bases in a chimeric read to retain it (>0-1, Default: 0.25)
-mp or --min-prob float Minimum probability to retain an inferred base-pair (>0-1, Default: 0.01)
-rmc or --require-min-chimera Instead of being inferred from the cluster's centroid, base-pairs are inferred from every single chimeric read belonging to the cluster, and only base-pairs common to a certain fraction of the reads in the cluster (controlled by -mcf or --min-chimera-frac), are retained
Note: if no base-pair is supported by at least -mcf chimeric reads, then the cluster is discarded
-mcf or --min-chimera-frac float Sets the minimum fraction of chimeric reads supporting a base-pair (requires -rmc; 0.5-1, Default: 0.5)
Plotting options
-g or --img Enables the generation of graphical reports reporting the inferred base-pairing probabilities, and the Shannon Entropy
-cs or --color-scale string Allows specifying 4 cut-points for the color scale used to report base-pairing probabilities (Default: 0.05,0.1,0.4,0.7 for white
Clustering options
-cb or --make-cluster-bam Generates a BAM file for each transcript being analyzed, with chimeric reads belonging to the same cluster grouped together via the XG tag
-mss or --min-sample-size int Minimum number of reads to be considered for each iteration of mini-batch clustering (>0, Default: 1000)
-mo or --min-overlap-frac float Minimum fractional overlap between the two halves of two chimeric reads to trigger clustering (>0-1, Default: 0.05)
-mi or --max-iterations int Maximum number of iterations for K-means (>0, Default 5)


Understanding the algorithm

In its current implementation, the module can do the following:

  1. Cluster chimeric reads (and generate a clustered BAM file for visualization)
  2. Infer base-pairs and base-pairing probabilities from each cluster
  3. Generate a structure constraint corresponding to the major conformation detected (estimated from base-pairing probabilities)

Analysis performed RF Duplex can be quite computational intensive. To this end, different multithreading optimizations are adopted to ensure reasonable execution times. By default, the module is optimized for the analysis of PARIS/SPLASH datasets; these experiments are usually performed transcriptome-wide, and they are characterized by a relatively low number of reads mapping to several different transcripts. In these cases, multiple transcripts are simultaneously analyzed on multiple threads. However, in the case of COMRADES experiments, a single transcript is mapped with a very high coverage. To deal with these cases, enabling the --single-transcript parameter will force a single transcript at a time to be analyzed, and multiple processors will be used to speed-up time-comsuming operations such as read clustering and base-pair analysis.

When analyzing large COMRADES datasets, clustering can become challenging. To this end, it is possible to limit the analysis to a randomly extracted sample of --max-reads chimeric reads. Clustering is performed by using a modified version of the K-means algorithm, that allows automatically determining the optimal number of clusters. To speed up the process, clustering is performed in mini-batches of --min-sample-size reads. The maximum number of K-means clustering iterations is controlled via the --max-iterations parameter.

Centroid definition

As shown above, for each cluster, a centroid is defined by identifying the point of maximum coverage on both sides of the chimeras making up the cluster, and by enlarging it upstream and downstream by L/2, where L is the median length of the each half of the chimeras in the cluster. A new chimeric read is assigned to a given cluster if each half of the chimera overlap by at least --min-overlap-frac with the corresponding half of the centroid. Clusters supported by < --min-reads chimeric reads are discarded.
The results of clustering can be visualized by enabling the generation of clustered BAM files via the --make-cluster-bam parameter. These BAM files store the cluster each read has been assigned to in their XG tag, and they can be easily visualized with Integrative Genomics Viewer (IGV) (for additional details, please refer to the official Broad Institute's IGV page).

IGV Clusters

Once clusters have been defined, the algorithm proceeds to identifying the possible base-pairs, and it estimates the base-pairing probabilities. By default, base-pairs are identified by taking the cluster centroid, and by co-folding the two halves of the centroid using the ViennaRNA package. Alternatively, if the --no-use-vienna parameter is provided, a modified Smith-Watherman algorithm is used to identify the candidate base-pairs. If < --min-paired-frac of the centroid is base-paired, the cluster is discarded. For each cluster, a probability is calculated as the ratio of the number of reads belonging to the cluster, divided by the total number of reads in the cluster, plus the reads belonging to clusters that are incompatible with the cluster in analysis. This probability is then assigned to the base-pairs inferred from that cluster. It is theoretically possible, under certain extreme conditions, for two clusters to share one or more base-pairs; in such situations, the base-pair is assigned the sum of the probabilities of the two clusters.
As an alternative to estimating base-pairs from the cluster centroid, one might evaluate independently each of the chimeras making up a given cluster by enabling the --require-min-chimera parameter; in this case, only the base-pairs common to at least --min-chimera-frac × the chimeras of the cluster will be retained. Base-pairing probabilities are reported as dot-plot files, that can be readily imported and visualized with IGV.
Furthemore, base-pairing probabilities can be exported as arc-plots in SVG format by enabling the --img parameter.

Graphics

In addition to computing base-pairing probabilities, if the --constraint parameter has been specified, the algorithm generates a constraint in dot-bracket (Vienna) format, containing only the inferred base-pairs whose probability exceedes --constraint-threshold. This constraint, likely representing the base-pairs present in the predominant conformation for the RNA in analysis, can be further used for structure inference using rf-fold.