The RF Count Genome module is an extension of the RF Count module, introduced in version 2.8.0 to process genome-level alignments. It can process any number of genome-level SAM/BAM files to calculate per-base RT-stops/mutations and read coverage. For any information on the RC file format, or on the RF Count algorithm, please refer to the manual page of rf-count.

Usage

$ rf-count-genome [options] sample1.bam sample2.sam .. sampleN.bam

To list all available parameters, simply type:

$ rf-count-genome -h
Parameter Type Description
-p or --processors int Number of processors (threads) to use (Default: 1)
Note: please check "Multithread support" below for additional details
-wt or --working-threads int Number of working threads to use for each file (Default: 1)
Note: please check "Multithread support" below for additional details
-P or --per-file-progress The progress of each individual file is shown as a separate progress bar
Note: this only works in interactive mode. If output is redirected to file, a single progress bar is shown reporting the overall status. Similarly, if the number of samples exceedes the number of lines in the terminal, a single progress bar is shown.
-bs or --block-size int Maximum size of the chromosome block to keep in memory (>1000, Default: 100000)
-o or --output-dir string Output directory for writing counts in RC (RNA Count) format (Default: rf_count_genome/)
-ow or --overwrite Overwrites the output directory if already exists
-s or --samtools string Path to samtools executable (Default: assumes samtools is in PATH)
-g or --img Enables the generation of statistic plots of per-base % mutations/RT-stops and, for MaP experiments only, of % mutated reads (requires R)
-R or --R-path string Path to R executable (Default: assumes R is in path)
Note: also check $RF_RPATH under Environment variables
-t5 or --trim-5prime int[,int] Comma separated list (no spaces) of values indicating the number of bases trimmed from the 5'-end of reads in the respective sample SAM/BAM files (Default: 0)
Note #1: Values must be provided in the same order as the input files (e.g. rf-count -t5 0,5 file1.bam file2.bam, will consider 0 bases trimmed from file1 reads, and 5 bases trimmed from file2 reads)
Note #2: If a single value is specified along with multiple SAM/BAM files, it will be used for all files
-f or --fasta string Path to a FASTA file containing the reference transcripts
Note: Transcripts in this file must match transcripts in SAM/BAM file headers
-ndd or --no-discard-duplicates Reads marked as PCR/optical duplicates, discarded by default, will be also considered
-pn or --primary-only Considers only primary alignments (SAM bitwise flag != 256)
-po or --paired-only When processing SAM/BAM files from paired-end experiments, only those reads for which both mates are mapped will be considered
-pp or --properly-paired When processing SAM/BAM files from paired-end experiments, only those reads mapped in a proper pair will be considered
-mq or --map-quality int Minimum mapping quality to consider a read (Default: 0)
-ls or --library-strandedness string Defines which genomic strand alignment-derived counts must be assigned to (check "Strandedness of genome-level alignments" below for possible values, Default: unstranded (with -m or -co), second-strand otherwise)
Note: strandedness specified via -ls can be overridden for individual samples by appending a colon followed by the library type to the sample name (check "Strandedness of genome-level alignments" below for additional details).
-co or --coverage-only Only calculates per-base coverage (disables RT-stops/mutations count)
-m or --count-mutations Enables mutations count instead of RT-stops count (for SHAPE-MaP/DMS-MaPseq)
Mutation count mode options
-om or --only-mut string Only the specified mutations will be counted
Note #1: mutations must be provided in the form [original]2[mutated]. For example, "A2T" (or "A>T", or "A:T") will only count mutation events in which a reference A base has been sequenced as a T. IUPAC codes are also accepted. Multiple mutations must be provided as a comma (or semi-colon) separated list (e.g. A2T;C:N,G>A)
Note #2: when specified, this parameter automatically disables insertion and deletion count
Note #3: when specified, an extra ouput folder frequencies/ will be generated, with a text file for each sample, containing the overall base substitution frequencies
-ds or --discard-shorter int Discards reads spanning less than this number of bases, excluding clipped bases (unless -ic is specified) (Default: 1)
-q or --min-quality int Minimum quality score value to consider a mutation (Phred+33, requires -m, Default: 20)
-es or --eval-surrounding When considering a mutation/indel, also evaluates the quality of surrounding bases (±1 nt)
Note: the quality score threshold set by -q (or --min-quality) also applies to these bases
-nd or --no-deletions Ignores deletions
-ni or --no-insertions Ignores insertions
-na or --no-ambiguous Ignores ambiguously mapped deletions
Note: the default behavior is to re-align them to their right-most valid position (or to their left-most valid position if -la has been specified)
-la or --left-align Re-aligns ambiguously mapped deletions to their left-most valid position
-rd or --right-deletion Only the right-most base in a deletion is marked as mutated
-ld or --left-deletion Only the left-most base in a deletion is marked as mutated
-md or --max-deletion-len int Ignores deletions longer than this number of nucleotides (Default: 10)
-me or --max-edit-distance float Discards reads with editing distance frequency higher than this threshold (0<m≤1, Default: 0.15 [15%])
-eq or --median-quality int Median quality score threshold for discarding low-quality reads (Phred+33, Default: 20)
-dc or --discard-consecutive int Discards consecutive mutations within this distance from eachothers
-cc or --collapse-consecutive Collapses consecutive mutations/indels toward the 3'-most one (recommended for SHAPE-MaP experiments)
-mc or --max-collapse-distance int Maximum distance between consecutive mutations/indels to allow collapsing (requires -cc, ≥0, Default_ 2)


Strandedness of genome-level alignments

An important difference with transcriptome-level analyses is that, at the level of the genome, the structure signal can be originated by either of the two DNA strands, dependening on where the gene resides. The strandedness depends on how the library has been generated. Specifying the proper strandedness of the library is crucial as it determines the way rf-count-genome will assign the RT-stop/mutation counts to the two genomic strands.
The strandedness of each sample can be specified in two ways:

  1. by specifying a single library type for all samples being analyzed via the -ls (or --library-strandedness)
  2. by appending a colon followed by the library type to each individual SAM/BAM file being passed to rf-count-genome (see below)
Value Strandedness
0 or u or unstranded The information on the genomic strand that originated the transcript is not preserved
1 or f or first or first-strand R1 aligns to the strand complementary to the one that originated the transcript
2 or s or second or second-strand R1 aligns to the same strand that originated the transcript


Second-strand is the default (and only accepted) mode for the analysis of RT-stop-based RNA structure mapping experiments. When mutation count (-m) or coverage-only (-co) modes are enabled, if the strandedness of the samples is not specified, samples are assumed to be unstranded.
For instance, in the examples below:

$ rf-count-genome -m -f reference.fasta -r sample1.bam:s sample2.bam:f sample3.bam
$ rf-count-genome -m -f reference.fasta -r -ls second-strand sample1.bam:s sample2.bam:f sample3.bam

sample1.bam is generated using a second-strand directional library prep strategy, whìle sample1.bam is generated using a first-strand directional library prep strategy. In the first example sample3.bam is assumed to have been generated using a non-directional library prep strategy (unstranded), while in the second example sample3.bam is assumed to have been generated using a second-strand directional library prep strategy (-ls second-strand).

When processing unstranded experiments, rf-count-genome will generate a single output RC file, named after the sample, having the .plus.rc suffix. When processing stranded experiments, two output RC files will be generated, named after the sample, respectively having the .plus.rc and .minus.rc suffixes and corresponding to the plus and minus strands of the genome.
Transcriptome-level RC files can be generated starting from genome-level RC files using the extract tool of the rf-rctools module. For additional information, please refer to the manual page.


Multithread support

Since version 2.8.9, RF Count Genome has multithread support for faster processing. The way this is implemented is on a per-chromosome base, meaning that each process will analyze one chromosome. Therefore, when processing large BAM files encompassing a single chromosome, specifying multiple processors will not speed up the analysis as a single processor will still be used.

Parameters -p and -wt allow controlling the number of processors to be used for the analysis. Their meaning differs at different stages of the analysis:

  • During tasks such as SAM to BAM conversion, BAM sorting and BAM indexing, -p specifies the number of files to be processed in parallel, and each SAMTools process will use -wt cores.
  • During the count phase, all files are processed in parallel and the number of concurrent processes will be min(# available cores, -p × -wt)