The RF PeakCall module takes two RC files generated by the RF Count module, and performs transcriptome-wide peak calling from RNA immunoprecipitation (IP) experiments.
Analysis is performed by sliding a window of length w along the transcript, and calculating the signal enrichment in the IP sample versus control sample as:

E(i..i+w)=j=ii+wlog2μIP(j)+pMdIP+pμCtrl(j)+pMdCtrl+pw


where i and i+w are the start and end position of the window, μIP(i) and μCtrl(i) are respectively the coverage at position i in the IP and control samples, MdIP and MdCtrl are respectively the median coverage on the whole transcript in the IP and control samples, and p is a pseudocount added to deal with non-covered regions/transcripts.
When a control sample is not provided, the signal enrichment is simply calculated as:

E(i..i+w)=j=ii+wlog2μIP(j)+pMdIP+pw


A p-value is then calculated for each window with detected enrichment above a defined cutoff, using a Fisher test. Thus, the following 2x2 contingency matrix is defined for each cutoff-passing window:

  n11 n12
n21 μIP(i..i+w) MdIP
n22 μCtrl(i..i+w) MdCtrl

If no control sample is provided, the contingency matrix is instead defined as:

  n11 n12
n21 μIP(i..i+w) MdIP
n22 μIP windows MdIP

where μIP windows is the average of the mean values for each possible window in the IP sample.
P-values are then subjected to Benjamini-Hochberg correction. Consecutive significantly enriched windows are then merged together, and p-values are combined by Stouffer's method.

Peak refinement

Peaks are called by sliding a window of a fixed size, with a user-defined offset. This means that, in some cases, the called interval might not include the full peak (or it might include additional low-enrichment bases). Consider the following example:

Peak refinement

In this example, peaks are called using a window size of 150 nt (default offset: 75 nt) and an enrichment threshold of 3 fold. With default parameters, the called interval does not entangle the full peak (as the coverage slightly drops around half of the peak in the IP sample). When the -r (or --refine) parameter is specified, RF PeakCall will progressively enlarge (or shrink) the interval to only include bases exceeding the enrichment threshold. In this case, refinement trims all the non-enriched bases from the 5' end of the interval. However, because of the slightly lower coverage of the second half of the peak, the enrichment (2.87) is just below the acceptance threshold. To account for such situations, the -x (or --relaxed) parameter can be specified to enable a more loose evaluation of the coverage in the IP sample. Specifically, the coverage will be rounded to nearest multiple of 0.5.

Note

Peak refinement also takes into account the median transcript coverage of the IP sample. While extending peak boundaries, only bases exceeding the median transcript coverage in the IP sample will be included, independently of their enrichment.


Usage

To list the required parameters, simply type:

$ rf-peakcall -h
Parameter Type Description
-c or --control string Path to the RC file for the control sample
-I or --IP string Path to the RC file for the immunoprecipitated (IP) sample
-i or --index string[,string] A comma separated (no spaces) list of RCI index files for the provided RC files
Note #1: RCI files must be provided in the order 1. Control, 2. IP
Note #2: If a single RTI file is specified, it will be used for all RC files
Note #3: If no RCI index is provided, it will be generated at runtime, and stored in the same folder of the control/IP samples
l or --whitelist string A whitelist containing transcript IDs (one per each row) to restrict the analysis to
-p or --processors int Number of processors (threads) to use (Default: 1)
-o or --output string Output folder (Default: <IP>_vs_<Control>/ if a control RC file is provided, or <IP>/ if only the IP RC file is provided)
-ow or --overwrite Overwrites the output folder (if the specified folder already exists)
Peak calling options
-w or --window int Window size (in nt) for peak calling (≥10, Default: 150)
-f or --offset int Offset (in nt) for window sliding (≥1, Default: window / 2)
-md or --merge-distance int Maximum distance (in nt) for merging non-overlapping windows (≥0, Default: 50)
-e or --enrichment float Minimum log2 enrichment in IP vs. Control for reporting a peak (≥1, Default: 3)
-v or --p-value float P-value cutoff for reporting a peak (0 ≤ p ≤ 1, Default: 0.05)
-s or --summit Generates an additional BED file containing the coordinates of peak summits (peak regions with the highest coverage)
-r or --refine Refines peak boundaries
-x or --relaxed Uses more relaxed criteria to refine peak boundaries (requires -r)
-pc or --pseudocount float Pseudocount added to read counts to avoid division by 0 (>0, Default: 1)
-mc or --mean-coverage float Discards any transcript with mean coverage in control sample below this threshold (≥0, Default: 0)
-ec or --median-coverage float Discards any transcript with median coverage in control sample below this threshold (≥0, Default: 0)
-D or --decimals int Number of decimals for reporting enrichment/p-value (1-10, Default: 3)
Plotting options
-g or --img Enables the generation of coverage plots
-mp or --meta-plot Enables the generation of meta-gene plots (both coverage and peaks' distribution)
-mcp or --meta-coding-plot Enables the generation of a protein-coding-only meta-gene plot, by aligning the TSS, start codon, stop codon, and TES
Note: the longest ORF will be automatically identified
-mo or --min-orf-length int Minimum length (in aa) to select the longest ORF (requires -mcp, Default: 50)
-als or --alt-start The longest ORF is allowed to start with alternative start codons (requires -mcp)
-ans or --any-start The longest ORF is allowed to start with any codon (requires -mcp)
-gc or --genetic-code int Genetic code table for the reference organism (requires -mcp, 1-33, Default: 1)
Note: for a detailed list of the available genetic code tables, please refer to the RF Mutate docs, or to https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi
-eo or --plot-enriched-only Meta-gene plots will be calculated only on transcripts with a detected enrichment (requires -mp or -mcp)


Output BED files

The peaks BED file contains 5 tab-delimited fields:

Field Description
Transcript ID ID of the transcript
Start Start coordinate of the peak (0-based)
End End coordinate of the peak (0-based)
Enrichment Fold enrichment of the IP signal versus Control signal
p-value Benjamini-Hochberg adjusted p-value


The summits BED file only contains 3 tab-delimited fields:

Field Description
Transcript ID ID of the transcript
Start Start coordinate of the summit (0-based)
End End coordinate of the summit (0-based)
Enrichment Fold enrichment of the IP signal versus Control signal
p-value Benjamini-Hochberg adjusted p-value