The RF Count module is the core component of the framework. It can process any number of SAM/BAM files to calculate per-base RT-stops/mutations and read coverage on each transcript.
Usage
$ rf-count [options] sample1.bam sample2.sam .. sampleN.bam
To list all available parameters, simply type:
$ rf-count -h
| Parameter | Type | Description |
|---|---|---|
| -p or --processors | int | Number of files to process in parallel (Default: 1) Note: please check "Multithread support" below for additional details |
| -wt or --working-threads | int | Number of working threads to use for each file (Default: 1). Note: please check "Multithread support" below for additional details |
| -P or --per-file-progress | The progress of each individual file is shown as a separate progress bar Note: this only works in interactive mode. If output is redirected to file, a single progress bar is shown reporting the overall status. Similarly, if the number of samples exceedes the number of lines in the terminal, a single progress bar is shown. |
|
| -a or --fast | Reference sequences are kept in memory instead of being loaded on the fly Note: this can significantly decrease the runtime when processing large sets of transcripts, but increases memory usage |
|
| -o or --output-dir | string | Output directory for writing counts in RC (RNA Count) format (Default: rf_count/) |
| -ow or --overwrite | Overwrites the output directory if already exists | |
| -s or --samtools | string | Path to samtools executable (Default: assumes samtools is in PATH) |
| -sm or --samtools-memory | string | Maximum memory (per thread) to be used by samtools sort (Default: 500M) Note: at least -p <files> × -wt <threads> × -sm <memory> RAM (+swap) is needed |
| -g or --img | Enables the generation of statistic plots of per-base % mutations/RT-stops and, for MaP experiments only, of % mutated reads (requires R) | |
| -R or --R-path | string | Path to R executable (Default: assumes R is in path) Note: also check $RF_RPATH under Environment variables |
| -t5 or --trim-5prime | int[,int] | Comma separated list (no spaces) of values indicating the number of bases trimmed from the 5'-end of reads in the respective sample SAM/BAM files (Default: 0) Note #1: Values must be provided in the same order as the input files (e.g. rf-count -t5 0,5 file1.bam file2.bam, will consider 0 bases trimmed from file1 reads, and 5 bases trimmed from file2 reads) Note #2: If a single value is specified along with multiple SAM/BAM files, it will be used for all files Note #3: This parameter has no effect when -m (or --count-mutations) is enabled |
| -f or --fasta | string | Path to a FASTA file containing the reference transcripts Note: Transcripts in this file must match transcripts in SAM/BAM file headers |
| -mf or --mask-file | string | Path to a mask file |
| -ndd or --no-discard-duplicates | Reads marked as PCR/optical duplicates, discarded by default, will be also considered | |
| -pn or --primary-only | Considers only primary alignments (SAM bitwise flag != 0x100) | |
| -po or --paired-only | When processing SAM/BAM files from paired-end experiments, only those reads for which both mates are mapped will be considered (SAM flag = 0x01) | |
| -pp or --properly-paired | When processing SAM/BAM files from paired-end experiments, only those reads mapped in a proper pair will be considered (SAM flag = 0x02) | |
| -ic or --include-clipped-bases | In RT-stop mode, the default behavior is to exclude soft/hard-clipped reads. Note #1: When this option is active, the RT-stop position is considered to be the position preceding the clipped bases. Note #2: In mutation count mode, these bases will be also included in the reads in MM files |
|
| -mp or --max-clipped-bases | int or string | Maximum allowed number of clipped bases from either end of the read (requires -ic) (≥0, Default: no limit)Note #1: Different thresholds can be applied to the 5' and 3' end of reads. If a single value is provided, this is treated as the maximum number of clipped bases per read (5′ + 3′). When the value is preceded by the prefix 5p or 3p, the filter will only be applied to the 5′ or 3′ end respectively. Two values for the two ends can be provided simultaneously (e.g., 5p4,3p3 will allow up to 4 and 3 clipped bases from the 5′ and 3′ end respectively). Note #2: If the number of clipped bases exceeds this value, clipped bases will not be counted in the final coverage (and will not be included in the reads in MM files). In RT-stop mode, reads having more than the specified number of 5′ clipped bases will be discarded. |
| -dp or --discard-clipped-reads | Reads (or fragments in case of paired-end experiments) having a number of bases clipped from their 5′ or 3′ end > --max-clipped-bases are discarded |
|
| -mq or --map-quality | int | Minimum mapping quality to consider a read (Default: 0) |
| -co or --coverage-only | Only calculates per-base coverage (disables RT-stops/mutations count) | |
| -m or --count-mutations | Enables mutations count instead of RT-stops count (for SHAPE-MaP/DMS-MaPseq) | |
| Mutation count mode options | ||
| -sbn or --sort-by-read-name | If processing large paired-end experiments, reads will be pre-sorted by read name, so that paired-end reads will be processed consecutively, significantly reducing the memory footprint (but to the cost of increased disk usage) | |
| -pam or --paired-end-all-mutations | When processing paired-end reads, if they overlap and mutations are only present in one of the two mates, these will be retained Note: the default behavior is to discard mutations in overlapping regions that are not supported by both mates |
|
| -fsr or --force-single-read | When processing paired-end reads, the two mates will be treated as separate reads Note: if the two mates overlap, this will cause both coverage and mutations within the overlapping portion to be counted twice. Furthermore, the two mates will be reported separately in the MM file, rather than as one long read |
|
| -orc or --out-raw-counts | Generates a text file reporting raw (unfiltered) mutation counts, broken down by class (single nucleotide mutations, insertions, deletions) Note: the reported counts are affected by the -nd, -na, -mq and -md parameters, but not by deletion realignment options |
|
| -om or --only-mut | string | Only the specified mutations will be counted Note #1: mutations must be provided in the form [original]2[mutated]. For example, "A2T" (or "A>T", or "A:T") will only count mutation events in which a reference A base has been sequenced as a T. IUPAC codes are also accepted. Multiple mutations must be provided as a comma (or semi-colon) separated list (e.g. A2T;C:N,G>A) Note #2: when specified, this parameter automatically disables insertion and deletion count Note #3: when specified, an extra ouput folder frequencies/ will be generated, with a text file for each sample, containing the overall base substitution frequencies |
| -ds or --discard-shorter | int | Discards reads spanning less than this number of bases, excluding clipped bases (unless -ic is specified) (Default: 1)Note: when set to MEDIAN (case-insensitive), the median read length will be used |
| -q or --min-quality | int | Minimum quality score value to consider a mutation (Phred+33, requires -m, Default: 20) |
| -ncl or --no-cov-low-qual | If a mutated base (or one of the surrounding bases, if -es is specified) does not exceed the -mq minimum quality threshold, that base will be considered as non covered |
|
| -es or --eval-surrounding | When considering a mutation/indel, also evaluates the quality of surrounding bases (±1 nt) Note: the quality score threshold set by -q (or --min-quality) also applies to these bases |
|
| -nd or --no-deletions | Ignores deletions | |
| -ni or --no-insertions | Ignores insertions | |
| -na or --no-ambiguous | Ignores ambiguously mapped deletions Note: the default behavior is to re-align them to their right-most valid position (or to their left-most valid position if -la has been specified) |
|
| -la or --left-align | Re-aligns ambiguously mapped deletions to their left-most valid position | |
| -rd or --right-deletion | Only the right-most base in a deletion is marked as mutated | |
| -ld or --left-deletion | Only the left-most base in a deletion is marked as mutated | |
| -md or --max-deletion-len | int | Ignores deletions longer than this number of nucleotides (Default: 10) |
| -me or --max-edit-distance | float | Discards reads with editing distance frequency higher than this threshold (0<m≤1, Default: 0.15 [15%]) |
| -eq or --median-quality | int | Median quality score threshold for discarding low-quality reads (Phred+33, Default: 20) |
| -dc or --discard-consecutive | int | Discards consecutive mutations within this distance from eachothers |
| -cc or --collapse-consecutive | Collapses consecutive mutations/indels toward the 3'-most one (recommended for SHAPE-MaP experiments) | |
| -mc or --max-collapse-distance | int | Maximum distance between consecutive mutations/indels to allow collapsing (requires -cc, ≥0, Default_ 2) |
| -mv or --max-coverage | int | Downsamples reads to achieve this maximum mean per-base coverage (≥1000, Default: off) |
| -mm or --mutation-map | Generates a mutation map (MM) file for alternative structure deconvolution with DRACO | |
| -msp or --mm-split-paired-end | int | In case of non-overlapping paired-end reads, the two mates and the respective mutatations will be reported as separate reads in the MM file if the distance between the end of R1 and the start of R2 exceeds this value (≥ 0, Default: 0 [no limit]) Note: The default behavior is to report them as a single long read spanning from the start of R1 to the end of R2 |
| -mdp or --mm-discard-pair | int | In case of non-overlapping paired-end reads, the two mates will not be reported in the MM file if the distance between the end of R1 and the start of R2 exceeds this value (≥ 1, Default: no limit) |
| --kp or --keep-pair | string | When -mdp is specified, and a pair would normally be discarded from the MM file, this parameter enables forcefully retaining one of the two mates (5p for the 5′-most mate, 3p for the 3′-most mate) |
| -wl or --whitelist | int | Generates a DRACO-compatible whitelist file, containing the IDs of transcripts with median coverage ≥ to the specified value |
Coverage calculation
It is important to note that, by default (so when counting RT-stop events), RF Count will consider the contribution of the RT drop-off event to the coverage of the base that triggered it.
Consider the following example:
In this case, 3 sets of reads have been aligned to the reference, resulting in 3 RT-stop sites, respectively corresponding to 2, 4 and 3 RT drop-off events. When executing RF count without specifying either parameters -co or -m, these additional counts will be added to the coverage of the base corresponding to the site on which the RT dropped off (-1 with respect to the read start mapping position).
Deletions re-alignment in mutational profiling-based methods
Mutational profiling (MaP) methods for RNA structure analysis are based on the ability of certain reverse transcriptase enzymes to read-through the sites of SHAPE/DMS modification under specific reaction conditions. Some of them (e.g. SuperScript II) can introduce deletions when encountering a SHAPE/DMS-modified residue. When performing reads mapping, the aligner often reports a single possible alignment of the deletion, although many equally-scoring alignments are possible.
To avoid counting of ambiguously aligned deletions, that can introduce noise in the measured structural signal, RF Count performs a deletion re-alignment step to detect and re-align/discard these ambiguously aligned deletions (see also "Handling of mutations/indels"):
For more information, please refer to Smola et al., 2015 (PMID: 26426499).
Handling of mutations/indels
By giving a rapid look to the numerous parameters provided by RF Count, it appears immediately clear that different parameter combinations produce very different outcomes.
Here follows a brief scheme aimed at illustrating the different behaviors of RF Count with different parameter combinations (dots correspond to sites of assigned mutations):
Handling of paired-end reads
Since version 2.9.5, RF Count can handle paired-end reads, without the need for merging pre-mapping. Below is a simple scheme showing the different behaviors of RF Count with different parameters in case of paired-end reads:
Particularly, different behaviors are possible when the two mates in a pair overlap. By default, within the overlap region, only mutaations common to both mates are retained. If the -pam parameter is specified, then also the mutations present in just one of the two mates are retained.
If the -fsr parameter is enabled, the two mates are treated as independent single reads, therefore mutations common to both of them will be counted twice. In such cases, also the coverage of the overlapping portion will be increased by 2×. This will also cause the two mates to be reported separately in the MM file.
The -msp parameter, instead, controls how paired-end reads should be reported in the MM file, when they do no share an overlapping portion. By default, non-overlapping paired-end reads are reported as a single long read, spanning from the start of R1 to the end of R2. When a value is specified via the -msp parameter, then the two mates are reported as single long read, if the distance between the end of R1 and start of R2 does not exceed the specified value, or separately otherwise. Please note that this will have no effect on overlapping paired-end reads.
Downsampling reads for analysis with DRACO
One of the possibilities offered by RF Count is the generation of mutation map (MM) files via the -mm (or --mutation-map) parameter. These files store, for each mapped read, the index of the mutations with respect to the reference transcripts, and they can be used as input for DRACO.
DRACO (Deconvolution of RNA Alternative COnformations) is a method that allows deconvoluting the individual structural profiles of coexisting RNA structural conformations, from MaP experiments.
When analyzing a transcript, DRACO keeps in memory all the reads mapping to that transcript. At extreme sequencing depths (>500,000X), the memory consumption can become prohibitive, so it might be beneficial to randomly downsample the reads along the transcript, to achieve a lower mean coverage. Furthermore, as the coverage along the transcript might be unevenly distributed because of library preparation (e.g., PCR jackpotting) and sequencing biases, downsampling will result in a more uniform coverage. Usually, a coverage of 20,000X is sufficient for DRACO to efficiently deconvove the underlying structural heterogeneity.
Downsampling is controlled via the -mv (or --max-coverage). In the above example, reads have been downsampled to reach a final maximum coverage per base of 1,000X. To do this, RF Count estimates the median read length, and calculates the theoretical number of reads of that length, starting at each position of the transcript, needed to achieve that mean coverage per base. However, as a consequence of adapter trimming and soft clipping, read lengths can significantly differ within the same experiment; this results in a less uniform coverage than it would be theoretically expected, as it is possible to observe in the second track (in green).
To partially compensate for this, it is advisable to set the -ds (or --discard-shorter) parameter to the value MEDIAN (case insensitive). In this way, reads shorter than the median read length will be discarded, resulting in a much more uniform coverage, as it is possible to observe in the third track (in blue).
Sample labeling
By default, RF Count uses the input file names (stripped of their extension) as labels in the output files.
It is however possible to specify custom labels by prepending them to the input files, in the form label::
$ rf-count [options] Sample_1:file1.bam Sample_2:file2.bam .. Sample_N:fileN.bam
RC (RNA Count) format
RF Count produces a RC (RNA Count) file for each analyzed sample. RC files are binary files storing the transcript’s sequence, per-base RT-stop/mutation counts, per-base read coverage, and total number of mapped reads. These files can be indexed for fast random access.
Each entry in a RC file is structured as follows:
| Field | Description | Type |
|---|---|---|
| len_transcript_id | Length of the transcript ID (plus 1, including NULL) | uint32_t |
| transcript_id | Transcript ID (NULL terminated) | char[len_transcript_id] |
| len_seq | Length of sequence | uint32_t |
| seq | 4-bit encoded sequence: 'ACGTN' -> [0,4] (High nybble first) | uint8_t[(len_seq+1)/2] |
| counts | Transcript's per base RT-stops (or mutations) | uint32_t[len_seq] |
| coverage | Transcript's per base coverage | uint32_t[len_seq] |
| nt | Total reads mapping on transcript | unint64_t |
RC files EOF stores the number of total mapped reads, and is structured as follows:
| Field | Description | Type |
|---|---|---|
| n | Total experiment reads (only primary alignments are counted, SAM bitwise flag != 256) |
uint64_t |
| version | RC file version | uint16_t |
| marker | EOF marker (\x5b\x65\x6f\x66\x72\x63\x5d) | char[7] |
The current RC standard's version is 1.
RCI (RC Index) files enable random access to transcript data within RC files.
The RCI index is structured as follows:
| Field | Description | Type |
|---|---|---|
| len_transcript_id | Length of the transcript ID (plus 1, including NULL) | uint32_t |
| transcript_id | Transcript ID (NULL terminated) | char[len_transcript_id] |
| offset | Offset of transcript in the RC file | uint64_t |
Information
All values are forced to be in little-endian byte-order.
MM (Mutation Map) format
When invoked with the -mm parameter, RF Count produces a MM (Mutation Map) file for each analyzed sample. These files serve as input for the deconvolution of coexisting RNA conformations with DRACO. MM files are binary files storing, for each aligned read, the start and end mapping position, the position of the mutations (or indels) with respect to the reference, as well as the reference transcript’s ID and sequence. These files can be indexed for fast random access.
Each entry in a MM file is structured as follows:
| Field | Description | Type |
|---|---|---|
| len_transcript_id | Length of the transcript ID (plus 1, including NULL) | uint32_t |
| transcript_id | Transcript ID (NULL terminated) | char[len_transcript_id] |
| len_seq | Length of sequence | uint32_t |
| seq | 4-bit encoded sequence: 'ACGTN' -> [0,4] (High nybble first) | uint8_t[(len_seq+1)/2] |
| read_count | Number of reads mapping on the transcript | uint32_t |
This is followed by n blocks (where n is equal to read_count), with the following structure:
| Field | Description | Type |
|---|---|---|
| start | Read's start mapping position | uint32_t |
| end | Read's end mapping position | uint32_t |
| mut_count | Number of mutations (or indels) | uint32_t |
| indices | Indices of the mutations (or indels) | uint32_t[mut_count] |
MM files always end with the EOF marker \x5b\x6d\x6d\x65\x6f\x66\x5d.
MMI (MM Index) files enable random access to transcript data within MM files.
The MMI has the same structure of the RCI:
| Field | Description | Type |
|---|---|---|
| len_transcript_id | Length of the transcript ID (plus 1, including NULL) | uint32_t |
| transcript_id | Transcript ID (NULL terminated) | char[len_transcript_id] |
| offset | Offset of transcript in the RC file | uint64_t |
Information
All values are forced to be in little-endian byte-order.
Mask file
The mask file allows excluding specific transcript regions from being counted. This is particularly useful when performing targeted MaP analyses, in order to mask the primer pairing regions.
The mask file is composed of one or more lines, each one reporting the transcript ID and a comma (or semicolon) separated list of either base ranges (0-based, inclusive), or of nucleotide sequences of the regions that need to be masked:
Transcript_1;AGCGTATTAGCGATGCGATGCGA;25-38;504-551
Transcript_2,331-402,AUAUGGAUCGGACG,984-1008
Transcript_3;rc:GUUACAUUCGA,98-123;47-68
When rc: is prepended to a sequence, it specifies that the sequence has to be reverse complemented. This is useful, for example, to mask the pairing region of a reverse primer.
Transcript regions specified in the mask file will have both 0 counts and coverage in the resulting RC file.
Multithread support
Since version 2.8.9, RF Count has multithread support for faster processing. The way this is implemented is on a per-transcript base, meaning that each process will analyze one transcript. Therefore, when processing large BAM files encompassing a single transcript, specifying multiple processors will not speed up the analysis as a single processor will still be used.
Parameters -p and -wt allow controlling the number of processors to be used for the analysis. Their meaning differs at different stages of the analysis:
- During tasks such as SAM to BAM conversion, BAM sorting and BAM indexing,
-pspecifies the number of files to be processed in parallel, and each SAMTools process will use-wtcores. - During the count phase, all files are processed in parallel and the number of concurrent processes will be min(# available cores,
-p×-wt)