The multiseq R package implements multiscale Poisson process approaches for differential or association analysis of high-throughput sequencing data. Key features that distinguish multiseq from typical differential or association analysis are to 1) better exploit high-resolution information in the high-throughput sequencing data, and 2) directly model the count nature of the data (see Shim et al., 2021 for details of the motivations, approaches, and comparison results).
Assume that we have high-throughput sequencing data from \(N\) samples across a genomic region with a length \(B\). Let \(x^{i}_b\) denote the read count for sample \(i\) at a position \(b\). We also assume that a covariate \(g^i\) is measured on each sample \(i\). The covariate is either a binary group membership (for differential analysis) or continuous. We model the read counts as arising from an inhomogeneous Poisson process: \[ x^i_b \sim \text{Pois}(\lambda^i_b) \] We use the multiseq methods to 1) estimate the effect of \(g\) on \(\log\lambda_b\) and 2) test for association (or difference for a binary \(g\)) between \(g\) and \(\log\lambda\). Here we present a brief demonstration of the methods.
We begin by loading the multiseq and ashr packages.
In this section, we first simulate data set using the procedure described in Shim et al., 2021, and then apply the multiseq method to the simulated data. Shim et al., 2021 simulates data by subsampling reads from DNase-seq data (see Shim et al., 2021 for the detailed description). We first load the DNase-seq data on chr17:10160989-10162012.
Using the DNase-seq data in this region, we will simulate a data set with \(N\) = 10 and expected library read depth = 39 M \(\times\) 2. The data set has two groups of samples (5 vs 5), and non-zero difference over 10161438-10161658.
total.count = as.numeric(apply(chr17.10160989.10162012.DNase.seq$x, 2, sum))
# two groups of samples (5 vs 5)
g = c(rep(0,5), rep(1,5))
# expected library read depth = 39 M X 2
read.depth.ratio = 2
if(!is.null(read.depth.ratio)){
total.count = floor(total.count*read.depth.ratio)
}
numSam = length(g)
numBPs = length(total.count)
# non-zero difference over 10161438-10161658
eff = rep(1, 1024)
eff[450:670] = 2
mu0.sig = 2/70/(1+eff)
mu1.sig = 2*eff/70/(1+eff)
The simulation procedure described in Shim et al., 2021 has been implemented in sample.from.Binomial.with.Overdispersion
function.
# simulated data will be saved here
x = matrix(data=NA, nr= numSam, nc = numBPs)
# set seed
set.seed(1111)
# group 0
wh0 = which(g == 0)
if(length(wh0) > 0){
x[wh0,] = sample.from.Binomial.with.Overdispersion(num.sam = length(wh0), total.count = total.count, mu.sig = mu0.sig)
}
# group 1
wh1 = which(g == 1)
if(length(wh1) > 0){
x[wh1,] = sample.from.Binomial.with.Overdispersion(num.sam = length(wh1), total.count = total.count, mu.sig = mu1.sig)
}
Let’s apply the multiseq method to x
and g
.
res.sim <- multiseq(x=x, g=g)
# test statistic for association analysis
res.sim$logLR$value
# [1] 10.97019
We can visualize the effect size using the function plot
. We will highlight areas with strong effects (i.e., zero is outside of the interval constructed by the posterior mean \(\pm\) threshold
\(\times\) posterior standard deviation). We also visualize the average read counts for each group. We obtain the smoothed average read counts using multiseq
function.
# smoothed average read counts for group 0
res.sim.0 <- multiseq(x=x[1:5,])
# smoothed average read counts for group 1
res.sim.1 <- multiseq(x=x[6:10,])
par(mfrow=c(2,1), oma = c(0,0,0,0) + 0.1, mar = c(2,2,1,2), mgp=c(0,1,0))
# visualize the smoothed average read counts for each group
plot(1:1024, exp(res.sim.1$baseline.mean),type="l", col = "navy", ylab = "", xlab="")
points(1:1024, exp(res.sim.0$baseline.mean), type="l", col ="orange")
# visualize area with true non-zero difference using dark green color
points(450:670,rep(0.01,670-450+1), col="dark green")
# visualize the effect size from multiseq; highlight areas with strong effects (i.e., zero is outside of the interval constructed by the posterior mean +/-2xposterior standard deviation).
plot(res.sim, threshold=2, is.xaxis=TRUE)
Also, we can obtain the information on the area with strong effect using the get.areas.strong.effects
function.
get.areas.strong.effects(res.sim, threshold=2)
# $start
# [1] 382
#
# $end
# [1] 696
#
# $sign
# [1] "+"
#
# $threshold
# [1] 2
multiseq identified the area with true signal well.
In this section, we apply multiseq to ATAC-seq data on chr1:568739-569762 from Shim et al. 2021 and reproduce the effect size, test statistic (logLR), and p-value in Figure 3 of Shim et al. 2021. We first load the ATAC-seq data on chr17:10160989-10162012 and relevant information:
data(chr1.568739.569762.ATACseq, package="multiseq")
?chr1.568739.569762.ATACseq
str(chr1.568739.569762.ATACseq)
# List of 7
# $ x : num [1:6, 1:1024] 0 0 1 0 1 0 0 0 0 0 ...
# $ region : chr "chr1:568739-569762"
# $ g : chr [1:6] "copper" "copper" "copper" "control" ...
# $ read.depth : num [1:6] 51781312 67287342 41962596 63826433 65498157 ...
# $ overall.logLR : num 0.381
# $ overall.effect : num -0.098
# $ overall.effect.var: num 0.0126
chr1:568739-569762$x
is a 6 by 1024 matrix of ATAC-seq read counts. multiseq_Shim_et_al github repository provides a script to obtain this matrix from the hdf5 files - extracting ATAC-seq read counts on chr1:568739-569762 (for three copper treated samples and three media control samples) from the hdf5 files and doing data-preprocessing as described in Shim et al. 2021. chr1:568739.569762
provides the log likelihood ratio (overall.logLR
), effect size estimate (effect size
), and its variance (overall.effect.var
) for overall expression which we computed using DESeq2 output. We will incorporate them into the multiseq to model the effect of \(g\) on the overall expression. logLR
in Figure 3 of Shim et al. 2021 has been reproduced.
# 1 for copper samples and 0 for control samples
g = c(1,1,1,0,0,0)
# Incorporate overall.logLR, effect size, overall.effect.var into the multiseq to model the effect of g on the overall expression.
res.ATACseq <- multiseq(x=chr1.568739.569762.ATACseq$x, g=g, overall.loglr = chr1.568739.569762.ATACseq$overall.logLR, overall.effect = c(chr1.568739.569762.ATACseq$overall.effect, chr1.568739.569762.ATACseq$overall.effect.var))
# test statistic for differential analysis
obs.statistic = res.ATACseq$logLR$value
obs.statistic
# [1] 98.91417
We visualize the effect size using the function plot
. The effect size in Figure 3 of Shim et al. 2021 has been reproduced.
res.ATACseq$region <- chr1.568739.569762.ATACseq$region
plot(res.ATACseq, threshold=2.6, is.xaxis=TRUE)
We applied multiseq to two controls (media vs. ethanol) for the 242,714 regions in Shim et al. 2021 and the resulting 242,714 test statistics have been saved in ATACseq.multiseq.stat.null
. To reproduce the p-value in Figure 3 of Shim et al 2021, we load ATACseq.multiseq.stat.null
.
The two controls are expected to have no differences in chromatin accessibility. Thus, we can construct the empirical null distribution of the multiseq test statistic using the 242,714 test statistics in ATACseq.multiseq.stat.null
. We can compute a p-value for each observed test statistic using the function get.pvalue.from.empirical.null.dist
. This function uses a randomization technique to produce continuous p-values that are uniformly distributed under the null, leading to the accurate estimation of FDR in the ‘qvalue’ package. See ?get.pvalue.from.empirical.null.dist
for the details of the randomization technique.
pval = get.pvalue.from.empirical.null.dist(ATACseq.multiseq.stat.null, obs.statistic, seed=149)
pval
# [1] 3.226061e-06
This p-value is slightly different from the p-value in Figure 3 of Shim et al 2021 due to the randomization technique. The function get.pvalue.from.empirical.null.dist
can compute p-values for multiple observed test statistics.
In this section, we apply multiseq to RNA-seq data on the gene OAS1 (chr12:113354417-113358512) to test for association with genotypes at SNP rs10774671
. The SNP rs10774671
was found to be associated with splicing in the gene OAS1 (see Figure 3 in Pickrell et al. 2010). We first load the RAN-seq data and relevant information:
Let’s apply the multiseq method to the RNA-seq data (x
) and genotype data (g
). We provide library read depth (read.depth
) in the function.
res.RNAseq <- multiseq(x=OAS1$x, g=OAS1$g, minobs=50, read.depth=OAS1$read.depth, lm.approx = FALSE)
# test statistic for association analysis
res.RNAseq$logLR$value
# [1] 265.8064
Let’s visualize the effect size using the function plot
.
# smoothed average read counts for each group
res.RNAseq.0 <- multiseq(x=OAS1$x[OAS1$g==0,], read.depth=OAS1$read.depth[OAS1$g==0])
res.RNAseq.1 <- multiseq(x=OAS1$x[OAS1$g==1,], read.depth=OAS1$read.depth[OAS1$g==1])
res.RNAseq.2 <- multiseq(x=OAS1$x[OAS1$g==2,], read.depth=OAS1$read.depth[OAS1$g==2])
# visualize smoothed average read counts for each group, and the effect size using `plot` function.
par(mfrow=c(4,1), oma = c(0,0,0,0) + 0.1, mar = c(2,2,1,2), mgp=c(0,1,0))
res.RNAseq$region <- OAS1$region
res.RNAseq.0$region <- OAS1$region
res.RNAseq.1$region <- OAS1$region
res.RNAseq.2$region <- OAS1$region
M <- max(res.RNAseq.0$baseline.mean, res.RNAseq.1$baseline.mean, res.RNAseq.2$baseline.mean)
m <- min(res.RNAseq.0$baseline.mean, res.RNAseq.1$baseline.mean, res.RNAseq.2$baseline.mean)
ylim <- exp(c(m,M))
plot(res.RNAseq.0, what="baseline", main="(genotype AA)", ylim=ylim)
plot(res.RNAseq.1, what="baseline", main="(genotype AG)", ylim=ylim)
plot(res.RNAseq.2, what="baseline", main="(genotype GG)", ylim=ylim)
plot(res.RNAseq, threshold=2.2)
This is the version of R and the packages that were used to generate the results shown above.
sessionInfo()
# R version 4.0.2 (2020-06-22)
# Platform: x86_64-apple-darwin17.0 (64-bit)
# Running under: macOS Catalina 10.15.4
#
# Matrix products: default
# BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
#
# locale:
# [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
#
# attached base packages:
# [1] tools stats graphics grDevices utils datasets methods
# [8] base
#
# other attached packages:
# [1] multiseq_0.1.3 data.table_1.12.8 ashr_1.0.12 Rcpp_1.0.5
#
# loaded via a namespace (and not attached):
# [1] codetools_0.2-16 digest_0.6.25 foreach_1.5.1 truncnorm_1.0-8
# [5] MASS_7.3-51.6 magrittr_1.5 evaluate_0.14 highr_0.8
# [9] rlang_0.4.11 stringi_1.4.6 pscl_1.5.5 doParallel_1.0.16
# [13] rmarkdown_2.3 iterators_1.0.13 stringr_1.4.0 parallel_4.0.2
# [17] xfun_0.15 yaml_2.2.1 compiler_4.0.2 SQUAREM_2021.1
# [21] htmltools_0.5.0 knitr_1.29