The RF Count module is the core component of the framework. It can process any number of SAM/BAM files to calculate per-base RT-stops/mutations and read coverage on each transcript.

Information

In future releases, the rf-count module will be superseded by the rf-count-genome module. The functionalities of the current rf-count module connected with the generation of MM files for ensemble deconvolution using DRACO will be provided by a new tool.

Usage

To list the required parameters, simply type:

$ rf-count -h

Parameter	Type	Description
-p or --processors	int	Number of processors (threads) to use (Default: 1)
-wt or --working-threads	int	Number of working threads to use for each instance of SAMTools/Bowtie (Default: 1). Note: RT Counter executes 1 instance of SAMTools for each processor specified by `-p`. At least `-p <processors>` * `-wt <threads>` processors are required.
-a or --fast		Reference sequences are kept in memory instead of being loaded on the fly Note: this can significantly decrease the runtime when processing large sets of transcripts, but increases memory usage
-o or --output-dir	string	Output directory for writing counts in RC (RNA Count) format (Default: rf_count/)
-ow or --overwrite		Overwrites the output directory if already exists
-t or --tmp-dir	string	Path to a directory for temporary files creation (Default: /tmp) Note: If the provided directory does not exist, it will be created
-s or --samtools	string	Path to `samtools` executable (Default: assumes `samtools` is in PATH)
-r or --sorted		In case SAM/BAM files are passed, assumes that they are already sorted lexicographically by transcript ID, and numerically by position
-t5 or --trim-5prime	int[,int]	Comma separated list (no spaces) of values indicating the number of bases trimmed from the 5'-end of reads in the respective sample SAM/BAM files (Default: 0) Note #1: Values must be provided in the same order as the input files (e.g. rf-count -t5 0,5 file1.bam file2.bam, will consider 0 bases trimmed from file1 reads, and 5 bases trimmed from file2 reads) Note #2: If a single value is specified along with multiple SAM/BAM files, it will be used for all files Note #3: This parameter has no effect when `-m` (or `--count-mutations`) is enabled
-fh or --from-header		Instead of providing the number of bases trimmed from 5'-end of reads through the `-t5` (or `--trim-5prime`) parameter, RF Count will try to guess it automatically from the header of the provided SAM/BAM files
-f or --fasta	string	Path to a FASTA file containing the reference transcripts Note: Transcripts in this file must match transcripts in SAM/BAM file headers
-mf or --mask-file	string	Path to a mask file
-ndd or --no-discard-duplicates		Reads marked as PCR/optical duplicates, discarded by default, will be also considered
-pn or --primary-only		Considers only primary alignments (SAM bitwise flag != 256)
-po or --paired-only		When processing SAM/BAM files from paired-end experiments, only those reads for which both mates are mapped will be considered
-pp or --properly-paired		When processing SAM/BAM files from paired-end experiments, only those reads mapped in a proper pair will be considered
-i or --include-clipped		Include reads that have been soft/hard-clipped at their 5'-end when calculating RT-stops Note: The default behavior is to exclude soft/hard-clipped reads. When this option is active, the RT-stop position is considered to be the position preceding the clipped bases. This option has no effect when `-m` (or `--count-mutations`) is enabled.
-mq or --map-quality	int	Minimum mapping quality to consider a read (Default: 10)
-co or --coverage-only		Only calculates per-base coverage (disables RT-stops/mutations count)
-m or --count-mutations		Enables mutations count instead of RT-stops count (for SHAPE-MaP/DMS-MaPseq)
		Mutation count mode options
-orc or --only-raw-counts		Generates a text file reporting raw (unfiltered) mutation counts, broken down by class (single nucleotide mutations, insertions, deletions) Note: the reported counts are affected by the `-nd`, `-na`, `-mq` and `-md` parameters, but not by deletion realignment options
-om or --only-mut	string	Only the specified mutations will be counted Note #1: mutations must be provided in the form [original]2[mutated]. For example, "A2T" (or "A>T", or "A:T") will only count mutation events in which a reference A base has been sequenced as a T. IUPAC codes are also accepted. Multiple mutations must be provided as a comma (or semi-colon) separated list (e.g. A2T;C:N,G>A) Note #2: when specified, this parameter automatically disables insertion and deletion count Note #3: when specified, an extra ouput folder `frequencies/` will be generated, with a text file for each sample, containing the overall base substitution frequencies
-ds or --discard-shorter	int	Discards reads shorter than this length (excluding clipped bases, Default: 1) Note: when set to MEDIAN (case-insensitive), the median read length will be used
-q or --min-quality	int	Minimum quality score value to consider a mutation (Phred+33, requires `-m`, Default: 20)
-es or --eval-surrounding		When considering a mutation/indel, also evaluates the quality of surrounding bases (±1 nt) Note: the quality score threshold set by `-q` (or `--min-quality`) also applies to these bases
-nd or --no-deletions		Ignores deletions
-ni or --no-insertions		Ignores insertions
-na or --no-ambiguous		Ignores ambiguously mapped deletions Note: the default behavior is to re-align them to their right-most valid position (or to their left-most valid position if `-la` has been specified)
-la or --left-align		Re-aligns ambiguously mapped deletions to their left-most valid position
-rd or --right-deletion		Only the right-most base in a deletion is marked as mutated
-ld or --left-deletion		Only the left-most base in a deletion is marked as mutated
-md or --max-deletion-len	int	Ignores deletions longer than this number of nucleotides (Default: 10)
-me or --max-edit-distance	float	Discards reads with editing distance frequency higher than this threshold (0<m≤1, Default: 0.15 [15%])
-eq or --median-quality	int	Median quality score threshold for discarding low-quality reads (Phred+33, Default: 20)
-dc or --discard-consecutive	int	Discards consecutive mutations within this distance from eachothers
-cc or --collapse-consecutive		Collapses consecutive mutations/indels toward the 3'-most one (recommended for SHAPE-MaP experiments)
-mc or --max-collapse-distance	int	Maximum distance between consecutive mutations/indels to allow collapsing (requires `-cc`, ≥0, Default_ 2)
-mv or --max-coverage	int	Downsamples reads to achieve this maximum mean per-base coverage (≥1000, Default: off)
-mm or --mutation-map		Generates a mutation map (MM) file for alternative structure deconvolution with DRACO
-wl or --whitelist	int	Generates a DRACO-compatible whitelist file, containing the IDs of transcripts with median coverage ≥ to the specified value

Coverage calculation

It is important to note that, by default (so when counting RT-stop events), RF Count will consider the contribution of the RT drop-off event to the coverage of the base on which the drop-off has occurred.
Take into account the following example:

Coverage calculation

In this case, 3 sets of reads have been aligned to the reference, resulting in 3 RT-stop sites, respectively corresponding to 2, 4 and 3 RT drop-off events. When executing RF count without specifying either parameters -co or -m, these additional counts will be added to the coverage of the base corresponding to the site on which the RT dropped off (-1 with respect to the read start mapping position).

Deletions re-alignment in mutational profiling-based methods

Mutational profiling (MaP) methods for RNA structure analysis are based on the ability of certain reverse transcriptase enzymes to read-through the sites of SHAPE/DMS modification under specific reaction conditions. Some of them (e.g. SuperScript II) can introduce deletions when encountering a SHAPE/DMS-modified residue. When performing reads mapping, the aligner often reports a single possible alignment of the deletion, although many equally-scoring alignments are possible.
To avoid counting of ambiguously aligned deletions, that can introduce noise in the measured structural signal, RF Count performs a deletion re-alignment step to detect and re-align/discard these ambiguously aligned deletions (see also "Handling of mutations/indels"):

Ambiguous deletions

For more information, please refer to Smola et al., 2015 (PMID: 26426499).

Handling of mutations/indels

By giving a rapid look to the numerous parameters provided by RF Count, it appears immediately clear that different parameter combinations produce very different outcomes. Here follows a brief scheme aimed at illustrating the different behaviors of RF Count with different parameter combinations (dots correspond to sites of assigned mutations):

RF Count MaP handling

Downsampling reads for analysis with DRACO

One of the possibilities offered by RF Count is the generation of mutation map (MM) files via the -mm (or --mutation-map) parameter. These files store, for each mapped read, the index of the mutations with respect to the reference transcripts, and they can be used as input for DRACO.

DRACO (Deconvolution of RNA Alternative COnformations) is a method that allows deconvoluting the individual structural profiles of coexisting RNA structural conformations, from MaP experiments.
When analyzing a transcript, DRACO keeps in memory all the reads mapping to that transcript. At extreme sequencing depths (>500,000X), the memory consumption can become prohibitive, so it might be beneficial to randomly downsample the reads along the transcript, to achieve a lower mean coverage. Furthermore, as the coverage along the transcript might be unevenly distributed because of library preparation and sequencing biases, downsampling will result in a more uniform coverage. Usually, a coverage of 20,000X is sufficient for DRACO to deconvolute the underlying structural heterogeneity.

DRACO downsampling

Downsampling is controlled via the -mv (or --max-coverage). In the above example, reads have been downsampled to reach a final maximum coverage per base of 1,000X. To do this, RF Count estimates the median read length, and calculates the theoretical number of reads of that length, starting at each position of the transcript, needed to achieve that mean coverage per base. However, as a consequence of adapter trimming and soft clipping, read lengths can significantly differ within the same experiment; this results in a less uniform coverage than it would be theoretically expected, as it is possible to observe in the second track (in green).
To partially compensate for this, it is advisable to set the -ds (or --discard-shorter) parameter to the value MEDIAN (case insensitive). In this way, reads shorter than the median read length will be discarded, resulting in a much more uniform coverage, as it is possible to observe in the third track (in blue).

RC (RNA Count) format

RF Count produces a RC (RNA Count) file for each analyzed sample. RC files are binary files storing the transcript’s sequence, per-base RT-stop/mutation counts, per-base read coverage, and total number of mapped reads. These files can be indexed for fast random access.
Each entry in a RC file is structured as follows:

Field	Description	Type
len_transcript_id	Length of the transcript ID (plus 1, including NULL)	uint32_t
transcript_id	Transcript ID (NULL terminated)	char[len_transcript_id]
len_seq	Length of sequence	uint32_t
seq	4-bit encoded sequence: 'ACGTN' -> [0,4] (High nybble first)	uint8_t[(len_seq+1)/2]
counts	Transcript's per base RT-stops (or mutations)	uint32_t[len_seq]
coverage	Transcript's per base coverage	uint32_t[len_seq]
n_t	Total reads mapping on transcript	unint64_t

RC files EOF stores the number of total mapped reads, and is structured as follows:

Field	Description	Type
n	Total experiment reads (only primary alignments are counted, SAM bitwise flag != 256)	uint64_t
version	RC file version	uint16_t
marker	EOF marker (\x5b\x65\x6f\x66\x72\x63\x5d)	char[7]

The current RC standard's version is 1.
RCI (RC Index) files enable random access to transcript data within RC files.
The RCI index is structured as follows:

Field	Description	Type
len_transcript_id	Length of the transcript ID (plus 1, including NULL)	uint32_t
transcript_id	Transcript ID (NULL terminated)	char[len_transcript_id]
offset	Offset of transcript in the RC file	uint64_t

Information

All values are forced to be in little-endian byte-order.

MM (Mutation Map) format

When invoked with the -mm parameter, RF Count produces a MM (Mutation Map) file for each analyzed sample. These files serve as input for the deconvolution of coexisting RNA conformations with DRACO. MM files are binary files storing, for each aligned read, the start and end mapping position, the position of the mutations (or indels) with respect to the reference, as well as the reference transcript’s ID and sequence. These files can be indexed for fast random access.
Each entry in a MM file is structured as follows:

Field	Description	Type
len_transcript_id	Length of the transcript ID (plus 1, including NULL)	uint32_t
transcript_id	Transcript ID (NULL terminated)	char[len_transcript_id]
len_seq	Length of sequence	uint32_t
seq	4-bit encoded sequence: 'ACGTN' -> [0,4] (High nybble first)	uint8_t[(len_seq+1)/2]
read_count	Number of reads mapping on the transcript	uint32_t

This is followed by n blocks (where n is equal to read_count), with the following structure:

Field	Description	Type
start	Read's start mapping position	uint32_t
end	Read's end mapping position	uint32_t
mut_count	Number of mutations (or indels)	uint32_t
indices	Indices of the mutations (or indels)	uint32_t[mut_count]

MM files always end with the EOF marker \x5b\x6d\x6d\x65\x6f\x66\x5d. MMI (MM Index) files enable random access to transcript data within MM files.
The MMI has the same structure of the RCI:

Field	Description	Type
len_transcript_id	Length of the transcript ID (plus 1, including NULL)	uint32_t
transcript_id	Transcript ID (NULL terminated)	char[len_transcript_id]
offset	Offset of transcript in the RC file	uint64_t

Information

All values are forced to be in little-endian byte-order.

Mask file

The mask file allows excluding specific transcript regions from being counted. This is particularly useful when performing targeted MaP analyses, in order to mask the primer pairing regions.
The mask file is composed of one or more lines, each one reporting the transcript ID and a comma (or semicolon) separated list of either base ranges (0-based, inclusive), or of nucleotide sequences of the regions that need to be masked:

Transcript_1;AGCGTATTAGCGATGCGATGCGA;25-38;504-551
Transcript_2,331-402,AUAUGGAUCGGACG,984-1008
Transcript_3;GUUACAUUCGA,98-123;47-68

Transcript regions specified in the mask file will have both 0 counts and coverage in the resulting RC file.