Algorithms behind MOOSE2
Background.
In working with many different RNA-Seq data sets, we found that
accurate estimation of expression levels is still a difficult task,
mainly due to these reasons: there is variability in the sequencing
experiments, which can distort read counts depending on the
abundance of the underlying transcripts. As a result, it is unclear
how to interpret the results, and in particular what distribution to
use when quantifying the statistical significance at which genes
might be differentially expressed. Using any ad hoc
distribution, such as the negative binomial distribution, for the
only reason that "it fits better and models biological variation" is
rather unsatisfying, and leaves the question as to what causes this
"biological variation" open. With MOOSE2, we sought to
address these issues, including exploring a more biologically (or
perhaps even thermodynamically) relevant model to determine
significant differences.
Algorithms.
The program performs multiple steps that are chained via a pipeline:
1. Identify "reference genes" for normalization.
MOOSE2 performs a dynamic programming (DP) step to find
the cheapest path from the most lowly expressed gene to the most
highly expressed one. Known reference genes can here serve as "way
points" to guide the process and make it more accurate, even though
the algorithm seems to perform quite well without. The purpose of
this step is to identify genes and transcripts that do not vary in
expression, even though their expression values might appear quite
different in the beginning. The assumption behind using DP for this
task is that genes that are regulated differently depending on
condition, time point etc., are regulated to varying degrees, i.e.
some go up or down dramatically, whereas others are less regulated.
Unregulated genes, on the other hand, should appear similar to each
other, as long as their overall expression levels are similar, so
that DP can find a string of these genes by minimizing the distance
between them, but allowing for gradual change with increasing
expression.
2. Polynomial fit to the data.
MOOSE2 fits a quadratic function to the data, thus
correcting different ranges of expression differently. While a
quadratic function is not theoretically justified, it is a pragmatic
way to apply a correction that depends on the abundance of
transcription.
3. Estimate significance for differential expression.
To compute significance (p-values) for cross-sample comparisons, we
recommend limma.
A program for processing moose2-normalized data is here: https://github.com/thokall/ManfredData.
Alternatively - and highly experimentally!! - MOOSE2
performs a least-square fit minimization of the log differences of
expression values to a convolution of a Laplace and a Gauss
distribution, based on comparing biological replicates. This works,
however, only if each condition has at least two replicates, since
MOOSE2 estimates the significance based on the
distribution described above, also allowing for comparisons of
multiple replicates against multiple replicates. MOOSE2
tests for up- and down-regulation and computes two p-values (or
e-values) respectively, reflecting the combined probability of a
gene expressed significantly differentially in two conditions. Note
that these p-values are typically lower than the values computed by
other methods, so that we do not recommend to use this method for
any publication at this point. A follow-up manuscript will discuss
this in more detail.