The RF MMtools module enables easy visualization/manipulation of MM files.
This tool is particularly useful when the same sample is sequenced more than one time to increase its coverage. Now, instead of merging the BAM files and re-calling the rf-count on the whole dataset (that is very time-consuming), each sample can be processed independently and simply merged to the MM file from the previous analysis.

Usage

Available tools are: index, view, merge, extract and stats.

Tool Description
view Dumps to screen the content of the provided MM file
merge Combines multiple MM files
extract Generates a new MM file by extracting/filtering the reads according to a set of user-defined rules
split Splits an MM file into multiple files, each containing a single transcript
alignIds Rewrites a set of input MM files, keeping only the common transcripts
toRC Converts MM files to RC format
correlate Calculates the per-transcript correlation of co-mutation patters between 2 MM files
index Indexes an MM file
stats Prints length and mutation distributions

Note

MM files need to be indexed (via rf-mmtools index) before any of the above commands can be executed

To list the required parameters, simply type:

$ rf-mmtools [tool] -h
Parameter Tool Type Description
-o or --output merge, extract, split, alignIds, correlate or toRC string Output MM filename (Default: merge.mm for merge, <input>.extracted.rc for extract, <input>_split/ for split, alignedIds for alignIds, <MM#1>_vs_<MM#2>.tsv for correlate, or <input>.rc for toRC)
-ow or --overwrite merge, extract, split, alignIds, correlate or toRC Overwrites output file (if the specified file already exists)
-kb or --keepBases extract string Only retains mutations on specified bases (Default: ACGT)
Note: IUPAC codes are allowed
-mpr or --minMutPerRead extract int Reads with < than this number of mutations are discarded (≥0, Default: 1)
-mrl or --minReadLen extract int Reads shorter than this length are discarded (>0, Default: 1)
-rs or --randSubsample extract int Randomly subsamples this fraction of reads (Default: keep all reads)
Note: for example, if -rs 2, 1/2 of the reads will be subsampled
-a or --annotation extract string Path to a list of regions (in BED format) to extract from the MM file
Note: only the portion of the read falling within the boundaries of the provided BED intervals will be retained and subjected to the other filtering steps
-wl or --whitelist extract string Path to a file containing a list (one per line) of transcripts to be extracted from the MM file
-dp or --discardPositions extract string Path to a blacklist file containing a list of transcript positions to be filtered out of the output file (see Blacklist files below for details on the file format)
-nr or --minRate extract float Positions with mutation rate < this value are discarded (0-1, Default: 0 [no cutoff])
-xr or --maxRate extract float Positions with mutation rate > this value are discarded (0-1, Default: 1 [no cutoff])
-mr or --minReads correlate int Transcripts having less than these number of reads are excluded (>0, Default: 1000)
-S or --spearman correlate Uses Spearman to calculate correlation (Default: Pearson)
-b3 or --by3End split Besides splitting by transcript, reads will be split by their 3′ end
-p or --processors split int Number of processors to use to pre-sort reads by 3′ end coordinate (requires -b3) (≥ 1, Default: 1)


MMtools "view" output

The view command produces an output structured as follows:

Transcript#1
ATTTGCGAGCTAGCGATCGAGTCGATGC...GATGCGTACGTAGTCGTAGTC
0   119   23,35,73,99
0   88    80
0   119   58,69,74,96
0   102   35
0   95    32,64
0   103   75
0   93    38,71
0   107   32,59,62,71,101
0   119   23,63,83,102
0   119   70,102,105,113
0   113   53
0   92    21
0   119   73
0   108   21,48,72
0   59    5,51
0   107   52
0   88    35,38
0   92    35
0   120   35,51,53,72,102,115
0   119   22,29,93

in which, for each transcript, the first two rows correspond to the transcript's ID and sequence, followed by the reads, one per row. Each read row contains:

  • Start mapping position (0-based)
  • End mapping position (0-based)
  • Comma-separated list of indexes (0-based) of mutated bases (with respect to transcript's start)

Consecutive entries are separated by a newline.
If a comma (or semicolon) separated list of transcript IDs is provided, only those transcripts will be shown in the output (e.g. rf-mmtools view input.mm 'Transcript_2').

Merging MM files

The merge command allows merging multiple MM files. Files to be merged do not have to contain the same set of transcripts, therefore, MM files generated using different references can be combined. An important caveat here is that, if two transcripts with different sequences share the same ID, an exception will be thrown.

Subsetting and filtering reads

Starting from an input MM file, the extract command generates an output MM file by applying a number of user-defined filters.
If a BED annotation file is provided, the portion of reads falling within the user-defined regions will be extracted.

Notes

  1. BED files will be interpreted as BED3, therefore only the start (0-based) and end (1-based) coordinates (2nd and 3rd field) will be considered).
  2. A single BED entry per transcript is allowed

Reads can be filtered by mutated base (e.g., -kb AC or -kb M will only retain mutations on A/C bases), length (e.g., -mrl 100 will only retain reads ≥ 100 bp), or minimum number of mutations per base (e.g., -mpr 2 will only retain reads with ≥ 2 mutations).

Filters are applied in the following order:

  1. Whitelisted transcripts
  2. Subsetting of regions in BED file
  3. Filtering by read length
  4. Filtering by base
  5. Filtering by number of mutations per read

In the following example:

0         10        20        30        40        50
|         |         |         |         |         |
GCATCGTGAGCGTATCGATGCGATGCTAGTCGAGCATCGAGCGACTGATGACTACG  Transcript
          CGTAgCGcTGgGATaCTA                              read#1
              TCGcTGCGgTGCTAGTtGAGCATCG                   read#2

read#1 starts at position 10, ends at position 27 (length: 18), and carries mutations at positions 14-17-20-24, while read#2 starts at position 14, ends at position 38 (length: 25), and carries mutations at positions 17-22-30.

Let's suppose the following BED file has been provided:

Transcript    15     35

This will result in transcript and reads being subsetted as it follows:

# Before subsetting:

0         10        20        30        40        50
|         |         |         |         |         |
GCATCGTGAGCGTATCGATGCGATGCTAGTCGAGCATCGAGCGACTGATGACTACG  Transcript
          CGTAgCGcTGgGATaCTA                              read#1
              TCGcTGCGgTGCTAGTtGAGCATCG                   read#2
               ====================                       BED region

# After subsetting:

15                 34
|                  |
CGATGCGATGCTAGTCGAGC                                      Transcript
CGcTGgGATaCTA                                             read#1
CGcTGCGgTGCTAGTtGAGC                                      read#2

Let's now suppose that the -kb AC -mpr 2 -mrl 20 filters have been specified, which would cause only reads at least 20 bp-long, carrying at least 2 mutations on A/C bases to be retained. #read1, which was originally 18 bp-long, has become 13 bp-long after being subsetted due to the region specified in the BED file, therefore it won't pass the -mrl 20 filter. On the other hand, #read2 is 20-bp long after subsetting, and it carries 3 mutations on A/C bases, therefore it will pass all filters and it will be retained.

Blacklist files

Blacklist files allow excluding mutations occuring at specific positions of a transcript. Blacklist files contain on each row a transcript ID followed by a comma (or semicolon) separated list of positions (0-based) to be discarded.
In the following example:

Transcript_1,10,44,90
Transcript_2;22;55-60

positions 10, 44 and 90 will be removed from reads mapping to Transcript_1, if mutated, while position 22, as well as any position between 55 and 60 (inclusive) will be removed from reads mapping to Transcript_2.