The RF MMtools module enables easy visualization/manipulation of MM files.
This tool is particularly useful when the same sample is sequenced more than one time to increase its coverage. Now, instead of merging the BAM files and re-calling the rf-count on the whole dataset (that is very time-consuming), each sample can be processed independently and simply merged to the MM file from the previous analysis.
Usage
Available tools are: index, view, merge, extract and stats.
| Tool | Description |
|---|---|
| view | Dumps to screen the content of the provided MM file |
| merge | Combines multiple MM files |
| extract | Generates a new MM file by extracting/filtering the reads according to a set of user-defined rules |
| split | Splits an MM file into multiple files, each containing a single transcript |
| alignIds | Rewrites a set of input MM files, keeping only the common transcripts |
| toRC | Converts MM files to RC format |
| correlate | Calculates the per-transcript correlation of co-mutation patters between 2 MM files |
| index | Indexes an MM file |
| stats | Prints length and mutation distributions |
Note
MM files need to be indexed (via rf-mmtools index) before any of the above commands can be executed
To list the required parameters, simply type:
$ rf-mmtools [tool] -h
| Parameter | Tool | Type | Description |
|---|---|---|---|
| -o or --output | merge, extract, split, alignIds, correlate or toRC | string | Output MM filename (Default: merge.mm for merge, <input>.extracted.rc for extract, <input>_split/ for split, alignedIds for alignIds, <MM#1>_vs_<MM#2>.tsv for correlate, or <input>.rc for toRC) |
| -ow or --overwrite | merge, extract, split, alignIds, correlate or toRC | Overwrites output file (if the specified file already exists) | |
| -kb or --keepBases | extract | string | Only retains mutations on specified bases (Default: ACGT) Note: IUPAC codes are allowed |
| -mpr or --minMutPerRead | extract | int | Reads with < than this number of mutations are discarded (≥0, Default: 1) |
| -mrl or --minReadLen | extract | int | Reads shorter than this length are discarded (>0, Default: 1) |
| -rs or --randSubsample | extract | int | Randomly subsamples this fraction of reads (Default: keep all reads) Note: for example, if -rs 2, 1/2 of the reads will be subsampled |
| -a or --annotation | extract | string | Path to a list of regions (in BED format) to extract from the MM file Note: only the portion of the read falling within the boundaries of the provided BED intervals will be retained and subjected to the other filtering steps |
| -wl or --whitelist | extract | string | Path to a file containing a list (one per line) of transcripts to be extracted from the MM file |
| -dp or --discardPositions | extract | string | Path to a blacklist file containing a list of transcript positions to be filtered out of the output file (see Blacklist files below for details on the file format) |
| -nr or --minRate | extract | float | Positions with mutation rate < this value are discarded (0-1, Default: 0 [no cutoff]) |
| -xr or --maxRate | extract | float | Positions with mutation rate > this value are discarded (0-1, Default: 1 [no cutoff]) |
| -mr or --minReads | correlate | int | Transcripts having less than these number of reads are excluded (>0, Default: 1000) |
| -S or --spearman | correlate | Uses Spearman to calculate correlation (Default: Pearson) | |
| -b3 or --by3End | split | Besides splitting by transcript, reads will be split by their 3′ end | |
| -p or --processors | split | int | Number of processors to use to pre-sort reads by 3′ end coordinate (requires -b3) (≥ 1, Default: 1) |
MMtools "view" output
The view command produces an output structured as follows:
Transcript#1
ATTTGCGAGCTAGCGATCGAGTCGATGC...GATGCGTACGTAGTCGTAGTC
0 119 23,35,73,99
0 88 80
0 119 58,69,74,96
0 102 35
0 95 32,64
0 103 75
0 93 38,71
0 107 32,59,62,71,101
0 119 23,63,83,102
0 119 70,102,105,113
0 113 53
0 92 21
0 119 73
0 108 21,48,72
0 59 5,51
0 107 52
0 88 35,38
0 92 35
0 120 35,51,53,72,102,115
0 119 22,29,93
in which, for each transcript, the first two rows correspond to the transcript's ID and sequence, followed by the reads, one per row. Each read row contains:
- Start mapping position (0-based)
- End mapping position (0-based)
- Comma-separated list of indexes (0-based) of mutated bases (with respect to transcript's start)
Consecutive entries are separated by a newline.
If a comma (or semicolon) separated list of transcript IDs is provided, only those transcripts will be shown in the output (e.g. rf-mmtools view input.mm 'Transcript_2').
Merging MM files
The merge command allows merging multiple MM files. Files to be merged do not have to contain the same set of transcripts, therefore, MM files generated using different references can be combined. An important caveat here is that, if two transcripts with different sequences share the same ID, an exception will be thrown.
Subsetting and filtering reads
Starting from an input MM file, the extract command generates an output MM file by applying a number of user-defined filters.
If a BED annotation file is provided, the portion of reads falling within the user-defined regions will be extracted.
Notes
- BED files will be interpreted as BED3, therefore only the start (0-based) and end (1-based) coordinates (2nd and 3rd field) will be considered).
- A single BED entry per transcript is allowed
Reads can be filtered by mutated base (e.g., -kb AC or -kb M will only retain mutations on A/C bases), length (e.g., -mrl 100 will only retain reads ≥ 100 bp), or minimum number of mutations per base (e.g., -mpr 2 will only retain reads with ≥ 2 mutations).
Filters are applied in the following order:
- Whitelisted transcripts
- Subsetting of regions in BED file
- Filtering by read length
- Filtering by base
- Filtering by number of mutations per read
In the following example:
0 10 20 30 40 50
| | | | | |
GCATCGTGAGCGTATCGATGCGATGCTAGTCGAGCATCGAGCGACTGATGACTACG Transcript
CGTAgCGcTGgGATaCTA read#1
TCGcTGCGgTGCTAGTtGAGCATCG read#2
read#1 starts at position 10, ends at position 27 (length: 18), and carries mutations at positions 14-17-20-24, while read#2 starts at position 14, ends at position 38 (length: 25), and carries mutations at positions 17-22-30.
Let's suppose the following BED file has been provided:
Transcript 15 35
This will result in transcript and reads being subsetted as it follows:
# Before subsetting:
0 10 20 30 40 50
| | | | | |
GCATCGTGAGCGTATCGATGCGATGCTAGTCGAGCATCGAGCGACTGATGACTACG Transcript
CGTAgCGcTGgGATaCTA read#1
TCGcTGCGgTGCTAGTtGAGCATCG read#2
==================== BED region
# After subsetting:
15 34
| |
CGATGCGATGCTAGTCGAGC Transcript
CGcTGgGATaCTA read#1
CGcTGCGgTGCTAGTtGAGC read#2
Let's now suppose that the -kb AC -mpr 2 -mrl 20 filters have been specified, which would cause only reads at least 20 bp-long, carrying at least 2 mutations on A/C bases to be retained. #read1, which was originally 18 bp-long, has become 13 bp-long after being subsetted due to the region specified in the BED file, therefore it won't pass the -mrl 20 filter. On the other hand, #read2 is 20-bp long after subsetting, and it carries 3 mutations on A/C bases, therefore it will pass all filters and it will be retained.
Blacklist files
Blacklist files allow excluding mutations occuring at specific positions of a transcript. Blacklist files contain on each row a transcript ID followed by a comma (or semicolon) separated list of positions (0-based) to be discarded.
In the following example:
Transcript_1,10,44,90
Transcript_2;22;55-60
positions 10, 44 and 90 will be removed from reads mapping to Transcript_1, if mutated, while position 22, as well as any position between 55 and 60 (inclusive) will be removed from reads mapping to Transcript_2.