The RF Count Genome module is an extension of the RF Count module, introduced in version 2.8.0 to process genome-level alignments. It can process any number of genome-level SAM/BAM files to calculate per-base RT-stops/mutations and read coverage. For any information on the RC file format, or on the RF Count algorithm, please refer to the manual page of rf-count.

Usage

To list the required parameters, simply type:

$ rf-count-genome -h

Parameter	Type	Description
-p or --processors	int	Number of processors (threads) to use (Default: 1)
-wt or --working-threads	int	Number of working threads to use for each instance of SAMTools/Bowtie (Default: 1). Note: RT Counter executes 1 instance of SAMTools for each processor specified by `-p`. At least `-p <processors>` * `-wt <threads>` processors are required.
-o or --output-dir	string	Output directory for writing counts in RC (RNA Count) format (Default: rf_count_genome/)
-ow or --overwrite		Overwrites the output directory if already exists
-t or --tmp-dir	string	Path to a directory for temporary files creation (Default: /tmp) Note: If the provided directory does not exist, it will be created
-s or --samtools	string	Path to `samtools` executable (Default: assumes `samtools` is in PATH)
-r or --sorted		In case SAM/BAM files are passed, assumes that they are already sorted lexicographically by transcript ID, and numerically by position
-t5 or --trim-5prime	int[,int]	Comma separated list (no spaces) of values indicating the number of bases trimmed from the 5'-end of reads in the respective sample SAM/BAM files (Default: 0) Note #1: Values must be provided in the same order as the input files (e.g. rf-count -t5 0,5 file1.bam file2.bam, will consider 0 bases trimmed from file1 reads, and 5 bases trimmed from file2 reads) Note #2: If a single value is specified along with multiple SAM/BAM files, it will be used for all files
-fh or --from-header		Instead of providing the number of bases trimmed from 5'-end of reads through the `-t5` (or `--trim-5prime`) parameter, RF Count will try to guess it automatically from the header of the provided SAM/BAM files
-f or --fasta	string	Path to a FASTA file containing the reference transcripts Note: Transcripts in this file must match transcripts in SAM/BAM file headers
-mf or --mask-file	string	Path to a mask file
-ndd or --no-discard-duplicates		Reads marked as PCR/optical duplicates, discarded by default, will be also considered
-pn or --primary-only		Considers only primary alignments (SAM bitwise flag != 256)
-po or --paired-only		When processing SAM/BAM files from paired-end experiments, only those reads for which both mates are mapped will be considered
-pp or --properly-paired		When processing SAM/BAM files from paired-end experiments, only those reads mapped in a proper pair will be considered
-mq or --map-quality	int	Minimum mapping quality to consider a read (Default: 0)
-co or --coverage-only		Only calculates per-base coverage (disables RT-stops/mutations count)
-m or --count-mutations		Enables mutations count instead of RT-stops count (for SHAPE-MaP/DMS-MaPseq)
		Mutation count mode options
-om or --only-mut	string	Only the specified mutations will be counted Note #1: mutations must be provided in the form [original]2[mutated]. For example, "A2T" (or "A>T", or "A:T") will only count mutation events in which a reference A base has been sequenced as a T. IUPAC codes are also accepted. Multiple mutations must be provided as a comma (or semi-colon) separated list (e.g. A2T;C:N,G>A) Note #2: when specified, this parameter automatically disables insertion and deletion count Note #3: when specified, an extra ouput folder `frequencies/` will be generated, with a text file for each sample, containing the overall base substitution frequencies
-ds or --discard-shorter	int	Discards reads shorter than this length (excluding clipped bases, Default: 1)
-q or --min-quality	int	Minimum quality score value to consider a mutation (Phred+33, requires `-m`, Default: 20)
-es or --eval-surrounding		When considering a mutation/indel, also evaluates the quality of surrounding bases (±1 nt) Note: the quality score threshold set by `-q` (or `--min-quality`) also applies to these bases
-nd or --no-deletions		Ignores deletions
-ni or --no-insertions		Ignores insertions
-na or --no-ambiguous		Ignores ambiguously mapped deletions Note: the default behavior is to re-align them to their right-most valid position (or to their left-most valid position if `-la` has been specified)
-la or --left-align		Re-aligns ambiguously mapped deletions to their left-most valid position
-rd or --right-deletion		Only the right-most base in a deletion is marked as mutated
-ld or --left-deletion		Only the left-most base in a deletion is marked as mutated
-md or --max-deletion-len	int	Ignores deletions longer than this number of nucleotides (Default: 10)
-me or --max-edit-distance	float	Discards reads with editing distance frequency higher than this threshold (0<m≤1, Default: 0.15 [15%])
-eq or --median-quality	int	Median quality score threshold for discarding low-quality reads (Phred+33, Default: 20)
-dc or --discard-consecutive	int	Discards consecutive mutations within this distance from eachothers
-cc or --collapse-consecutive		Collapses consecutive mutations/indels toward the 3'-most one (recommended for SHAPE-MaP experiments)
-mc or --max-collapse-distance	int	Maximum distance between consecutive mutations/indels to allow collapsing (requires `-cc`, ≥0, Default_ 2)

Strandedness of genome-level alignments

An important difference with transcriptome-level analyses is that, at the level of the genome, the structure signal can be originated by either of the two DNA strands, dependening on where the gene resides. The strandedness depends on how the library has been generated. Specifying the proper strandedness of the library is crucial as it determines the way rf-count-genome will assign the RT-stop/mutation counts.
The strandedness of each sample can be specified by appending one of the following strings to the path to the SAM/BAM files to be processed:

String	Strandedness
:u or :unstranded	The information on the genomic strand that originated the transcript is not preserved
:f or :first or :first-strand	R1 aligns to the strand complementary to the one that originated the transcript
:s or :second or :second-strand	R1 aligns to the same strand that originated the transcript

Second-strand is the default (and only accepted) mode for the analysis of RT-stop-based RNA structure mapping experiments. When mutation count (-m) or coverage-only (-co) modes are enabled, if the strandedness of the samples is not specified, samples are assumed to be unstranded.
For instance, in the example below:

$ rf-count-genome -m -f reference.fasta -r sample1.bam:s sample2.bam:f sample3.bam

sample1.bam is generated using a second-strand directional library prep strategy, sample1.bam is generated using a first-strand directional library prep strategy and sample3.bam is assumed to have been generated using a non-directional library prep strategy.

When processing unstranded experiments, rf-count-genome will generate a single output RC file, named after the sample, having the .plus.rc suffix. When processing stranded experiments, two output RC files will be generated, named after the sample, respectively having the .plus.rc and .minus.rc suffixes and corresponding to the plus and minus strands of the genome.
Transcriptome-level RC files can be generated starting from genome-level RC files using the extract tool of the rf-rctools module. For additional information, please refer to the manual page.