Algorithms behind MOOSE2

Background.

In working with many different RNA-Seq data sets, we found that accurate estimation of expression levels is still a difficult task, mainly due to these reasons: there is variability in the sequencing experiments, which can distort read counts depending on the abundance of the underlying transcripts. As a result, it is unclear how to interpret the results, and in particular what distribution to use when quantifying the statistical significance at which genes might be differentially expressed. Using any ad hoc distribution, such as the negative binomial distribution, for the only reason that "it fits better and models biological variation" is rather unsatisfying, and leaves the question as to what causes this "biological variation" open. With MOOSE2, we sought to address these issues, including exploring a more biologically (or perhaps even thermodynamically) relevant model to determine significant differences.


Algorithms.

The program performs multiple steps that are chained via a pipeline:

1. Identify "reference genes" for normalization.

MOOSE2 performs a dynamic programming (DP) step to find the cheapest path from the most lowly expressed gene to the most highly expressed one. Known reference genes can here serve as "way points" to guide the process and make it more accurate, even though the algorithm seems to perform quite well without. The purpose of this step is to identify genes and transcripts that do not vary in expression, even though their expression values might appear quite different in the beginning. The assumption behind using DP for this task is that genes that are regulated differently depending on condition, time point etc., are regulated to varying degrees, i.e. some go up or down dramatically, whereas others are less regulated. Unregulated genes, on the other hand, should appear similar to each other, as long as their overall expression levels are similar, so that DP can find a string of these genes by minimizing the distance between them, but allowing for gradual change with increasing expression.


2. Polynomial fit to the data.

MOOSE2 fits a quadratic function to the data, thus correcting different ranges of expression differently. While a quadratic function is not theoretically justified, it is a pragmatic way to apply a correction that depends on the abundance of transcription.

3. Estimate significance for differential expression.

To compute significance (p-values) for cross-sample comparisons, we recommend limma. A program for processing moose2-normalized data is here: https://github.com/thokall/ManfredData.

Alternatively - and highly experimentally!! - MOOSE2 performs a least-square fit minimization of the log differences of expression values to a convolution of a Laplace and a Gauss distribution, based on comparing biological replicates. This works, however, only if each condition has at least two replicates, since MOOSE2 estimates the significance based on the distribution described above, also allowing for comparisons of multiple replicates against multiple replicates. MOOSE2 tests for up- and down-regulation and computes two p-values (or e-values) respectively, reflecting the combined probability of a gene expressed significantly differentially in two conditions. Note that these p-values are typically lower than the values computed by other methods, so that we do not recommend to use this method for any publication at this point. A follow-up manuscript will discuss this in more detail.