The RF MMtools module enables easy visualization/manipulation of MM files.
This tool is particularly useful when the same sample is sequenced more than one time to increase its coverage. Now, instead of merging the BAM files and re-calling the rf-count on the whole dataset (that is very time-consuming), each sample can be processed independently and simply merged to the MM file from the previous analysis.

Usage

Available tools are: index, view, merge, extract and stats.

Tool	Description
view	Dumps to screen the content of the provided MM file
merge	Combines multiple MM files
extract	Generates a new MM file by extracting/filtering the reads according to a set of user-defined rules
split	Splits an MM file into multiple files, each containing a single transcript
alignIds	Rewrites a set of input MM files, keeping only the common transcripts
toRC	Converts MM files to RC format
correlate	Calculates the per-transcript correlation of co-mutation patters between 2 MM files
index	Indexes an MM file
stats	Prints length and mutation distributions

Note

MM files need to be indexed (via rf-mmtools index) before any of the above commands can be executed

To list the required parameters, simply type:

$ rf-mmtools [tool] -h

Parameter	Tool	Type	Description
-o or --output	merge, extract, split, alignIds, correlate or toRC	string	Output MM filename (Default: merge.mm for `merge`, <input>.extracted.rc for `extract`, <input>_split/ for `split`, alignedIds for `alignIds`, <MM#1>_vs_<MM#2>.tsv for `correlate`, or <input>.rc for `toRC`)
-ow or --overwrite	merge, extract, split, alignIds, correlate or toRC		Overwrites output file (if the specified file already exists)
-kb or --keepBases	extract	string	Only retains mutations on specified bases (Default: ACGT) Note: IUPAC codes are allowed
-mpr or --minMutPerRead	extract	int	Reads with < than this number of mutations are discarded (≥0, Default: 1)
-mrl or --minReadLen	extract	int	Reads shorter than this length are discarded (>0, Default: 1)
-rs or --randSubsample	extract	int	Randomly subsamples this fraction of reads (Default: keep all reads) Note: for example, if `-rs 2`, 1/2 of the reads will be subsampled
-a or --annotation	extract	string	Path to a list of regions (in BED format) to extract from the MM file Note: only the portion of the read falling within the boundaries of the provided BED intervals will be retained and subjected to the other filtering steps
-wl or --whitelist	extract	string	Path to a file containing a list (one per line) of transcripts to be extracted from the MM file
-dp or --discardPositions	extract	string	Path to a blacklist file containing a list of transcript positions to be filtered out of the output file (see Blacklist files below for details on the file format)
-nr or --minRate	extract	float	Positions with mutation rate < this value are discarded (0-1, Default: 0 [no cutoff])
-xr or --maxRate	extract	float	Positions with mutation rate > this value are discarded (0-1, Default: 1 [no cutoff])
-mr or --minReads	correlate	int	Transcripts having less than these number of reads are excluded (>0, Default: 1000)
-S or --spearman	correlate		Uses Spearman to calculate correlation (Default: Pearson)
-b3 or --by3End	split		Besides splitting by transcript, reads will be split by their 3′ end
-p or --processors	split	int	Number of processors to use to pre-sort reads by 3′ end coordinate (requires `-b3`) (≥ 1, Default: 1)

MMtools "view" output

The view command produces an output structured as follows:

Transcript#1
ATTTGCGAGCTAGCGATCGAGTCGATGC...GATGCGTACGTAGTCGTAGTC
0   119   23,35,73,99
0   88    80
0   119   58,69,74,96
0   102   35
0   95    32,64
0   103   75
0   93    38,71
0   107   32,59,62,71,101
0   119   23,63,83,102
0   119   70,102,105,113
0   113   53
0   92    21
0   119   73
0   108   21,48,72
0   59    5,51
0   107   52
0   88    35,38
0   92    35
0   120   35,51,53,72,102,115
0   119   22,29,93

in which, for each transcript, the first two rows correspond to the transcript's ID and sequence, followed by the reads, one per row. Each read row contains:

Start mapping position (0-based)
End mapping position (0-based)
Comma-separated list of indexes (0-based) of mutated bases (with respect to transcript's start)

Consecutive entries are separated by a newline.
If a comma (or semicolon) separated list of transcript IDs is provided, only those transcripts will be shown in the output (e.g. rf-mmtools view input.mm 'Transcript_2').

Merging MM files

The merge command allows merging multiple MM files. Files to be merged do not have to contain the same set of transcripts, therefore, MM files generated using different references can be combined. An important caveat here is that, if two transcripts with different sequences share the same ID, an exception will be thrown.

Subsetting and filtering reads

Starting from an input MM file, the extract command generates an output MM file by applying a number of user-defined filters.
If a BED annotation file is provided, the portion of reads falling within the user-defined regions will be extracted.

Notes

BED files will be interpreted as BED3, therefore only the start (0-based) and end (1-based) coordinates (2nd and 3rd field) will be considered).
A single BED entry per transcript is allowed

Reads can be filtered by mutated base (e.g., -kb AC or -kb M will only retain mutations on A/C bases), length (e.g., -mrl 100 will only retain reads ≥ 100 bp), or minimum number of mutations per base (e.g., -mpr 2 will only retain reads with ≥ 2 mutations).

Filters are applied in the following order:

Whitelisted transcripts
Subsetting of regions in BED file
Filtering by read length
Filtering by base
Filtering by number of mutations per read

In the following example:

0         10        20        30        40        50
|         |         |         |         |         |
GCATCGTGAGCGTATCGATGCGATGCTAGTCGAGCATCGAGCGACTGATGACTACG  Transcript
          CGTAgCGcTGgGATaCTA                              read#1
              TCGcTGCGgTGCTAGTtGAGCATCG                   read#2

read#1 starts at position 10, ends at position 27 (length: 18), and carries mutations at positions 14-17-20-24, while read#2 starts at position 14, ends at position 38 (length: 25), and carries mutations at positions 17-22-30.

Let's suppose the following BED file has been provided:

Transcript    15     35

This will result in transcript and reads being subsetted as it follows:

# Before subsetting:

0         10        20        30        40        50
|         |         |         |         |         |
GCATCGTGAGCGTATCGATGCGATGCTAGTCGAGCATCGAGCGACTGATGACTACG  Transcript
          CGTAgCGcTGgGATaCTA                              read#1
              TCGcTGCGgTGCTAGTtGAGCATCG                   read#2
               ====================                       BED region

# After subsetting:

15                 34
|                  |
CGATGCGATGCTAGTCGAGC                                      Transcript
CGcTGgGATaCTA                                             read#1
CGcTGCGgTGCTAGTtGAGC                                      read#2

Let's now suppose that the -kb AC -mpr 2 -mrl 20 filters have been specified, which would cause only reads at least 20 bp-long, carrying at least 2 mutations on A/C bases to be retained. #read1, which was originally 18 bp-long, has become 13 bp-long after being subsetted due to the region specified in the BED file, therefore it won't pass the -mrl 20 filter. On the other hand, #read2 is 20-bp long after subsetting, and it carries 3 mutations on A/C bases, therefore it will pass all filters and it will be retained.

Blacklist files

Blacklist files allow excluding mutations occuring at specific positions of a transcript. Blacklist files contain on each row a transcript ID followed by a comma (or semicolon) separated list of positions (0-based) to be discarded.
In the following example:

Transcript_1,10,44,90
Transcript_2;22;55-60

positions 10, 44 and 90 will be removed from reads mapping to Transcript_1, if mutated, while position 22, as well as any position between 55 and 60 (inclusive) will be removed from reads mapping to Transcript_2.