multiseq

2021-06-07

The multiseq R package implements multiscale Poisson process approaches for differential or association analysis of high-throughput sequencing data. Key features that distinguish multiseq from typical differential or association analysis are to 1) better exploit high-resolution information in the high-throughput sequencing data, and 2) directly model the count nature of the data (see Shim et al., 2021 for details of the motivations, approaches, and comparison results).

Assume that we have high-throughput sequencing data from \(N\) samples across a genomic region with a length \(B\). Let \(x^{i}_b\) denote the read count for sample \(i\) at a position \(b\). We also assume that a covariate \(g^i\) is measured on each sample \(i\). The covariate is either a binary group membership (for differential analysis) or continuous. We model the read counts as arising from an inhomogeneous Poisson process: \[ x^i_b \sim \text{Pois}(\lambda^i_b) \] We use the multiseq methods to 1) estimate the effect of \(g\) on \(\log\lambda_b\) and 2) test for association (or difference for a binary \(g\)) between \(g\) and \(\log\lambda\). Here we present a brief demonstration of the methods.

Set up environment

We begin by loading the multiseq and ashr packages.

library(multiseq)
library(ashr)

Simulated data

In this section, we first simulate data set using the procedure described in Shim et al., 2021, and then apply the multiseq method to the simulated data. Shim et al., 2021 simulates data by subsampling reads from DNase-seq data (see Shim et al., 2021 for the detailed description). We first load the DNase-seq data on chr17:10160989-10162012.

data(chr17.10160989.10162012.DNase.seq, package="multiseq")

Using the DNase-seq data in this region, we will simulate a data set with \(N\) = 10 and expected library read depth = 39 M \(\times\) 2. The data set has two groups of samples (5 vs 5), and non-zero difference over 10161438-10161658.

total.count = as.numeric(apply(chr17.10160989.10162012.DNase.seq$x, 2, sum))

# two groups of samples (5 vs 5)
g = c(rep(0,5), rep(1,5))

# expected library read depth = 39 M X 2
read.depth.ratio = 2
if(!is.null(read.depth.ratio)){
  total.count = floor(total.count*read.depth.ratio)
}

numSam = length(g)
numBPs = length(total.count)

# non-zero difference over 10161438-10161658
eff = rep(1, 1024)
eff[450:670] = 2
mu0.sig = 2/70/(1+eff)
mu1.sig = 2*eff/70/(1+eff)

The simulation procedure described in Shim et al., 2021 has been implemented in sample.from.Binomial.with.Overdispersion function.

# simulated data will be saved here
x = matrix(data=NA, nr= numSam, nc = numBPs)

# set seed
set.seed(1111)

# group 0
wh0 = which(g == 0)
if(length(wh0) > 0){
    x[wh0,] = sample.from.Binomial.with.Overdispersion(num.sam = length(wh0), total.count = total.count, mu.sig = mu0.sig)
}

# group 1
wh1 = which(g == 1)
if(length(wh1) > 0){
    x[wh1,] = sample.from.Binomial.with.Overdispersion(num.sam = length(wh1), total.count = total.count, mu.sig = mu1.sig)
}

Let’s apply the multiseq method to x and g.

res.sim <- multiseq(x=x, g=g)

# test statistic for association analysis
res.sim$logLR$value
# [1] 10.97019

We can visualize the effect size using the function plot. We will highlight areas with strong effects (i.e., zero is outside of the interval constructed by the posterior mean \(\pm\) threshold \(\times\) posterior standard deviation). We also visualize the average read counts for each group. We obtain the smoothed average read counts using multiseq function.

# smoothed average read counts for group 0
res.sim.0 <- multiseq(x=x[1:5,])
# smoothed average read counts for group 1
res.sim.1 <- multiseq(x=x[6:10,])

par(mfrow=c(2,1), oma = c(0,0,0,0) + 0.1, mar = c(2,2,1,2), mgp=c(0,1,0))
# visualize the smoothed average read counts for each group
plot(1:1024, exp(res.sim.1$baseline.mean),type="l", col = "navy", ylab = "", xlab="")
points(1:1024, exp(res.sim.0$baseline.mean), type="l", col ="orange")
# visualize area with true non-zero difference using dark green color
points(450:670,rep(0.01,670-450+1), col="dark green")
# visualize the effect size from multiseq; highlight areas with strong effects (i.e., zero is outside of the interval constructed by the posterior mean +/-2xposterior standard deviation).
plot(res.sim, threshold=2, is.xaxis=TRUE)
&nbsp;

 

Also, we can obtain the information on the area with strong effect using the get.areas.strong.effects function.

get.areas.strong.effects(res.sim, threshold=2)
# $start
# [1] 382
# 
# $end
# [1] 696
# 
# $sign
# [1] "+"
# 
# $threshold
# [1] 2

multiseq identified the area with true signal well.

ATAC-seq data on chr1:568739-569762 from Shim et al., 2021

In this section, we apply multiseq to ATAC-seq data on chr1:568739-569762 from Shim et al. 2021 and reproduce the effect size, test statistic (logLR), and p-value in Figure 3 of Shim et al. 2021. We first load the ATAC-seq data on chr17:10160989-10162012 and relevant information:

data(chr1.568739.569762.ATACseq, package="multiseq")
?chr1.568739.569762.ATACseq
str(chr1.568739.569762.ATACseq)
# List of 7
#  $ x                 : num [1:6, 1:1024] 0 0 1 0 1 0 0 0 0 0 ...
#  $ region            : chr "chr1:568739-569762"
#  $ g                 : chr [1:6] "copper" "copper" "copper" "control" ...
#  $ read.depth        : num [1:6] 51781312 67287342 41962596 63826433 65498157 ...
#  $ overall.logLR     : num 0.381
#  $ overall.effect    : num -0.098
#  $ overall.effect.var: num 0.0126

chr1:568739-569762$x is a 6 by 1024 matrix of ATAC-seq read counts. multiseq_Shim_et_al github repository provides a script to obtain this matrix from the hdf5 files - extracting ATAC-seq read counts on chr1:568739-569762 (for three copper treated samples and three media control samples) from the hdf5 files and doing data-preprocessing as described in Shim et al. 2021. chr1:568739.569762 provides the log likelihood ratio (overall.logLR), effect size estimate (effect size), and its variance (overall.effect.var) for overall expression which we computed using DESeq2 output. We will incorporate them into the multiseq to model the effect of \(g\) on the overall expression. logLR in Figure 3 of Shim et al. 2021 has been reproduced.

# 1 for copper samples and 0 for control samples
g = c(1,1,1,0,0,0)

# Incorporate overall.logLR, effect size, overall.effect.var into the multiseq to model the effect of g on the overall expression.
res.ATACseq          <- multiseq(x=chr1.568739.569762.ATACseq$x, g=g, overall.loglr = chr1.568739.569762.ATACseq$overall.logLR, overall.effect = c(chr1.568739.569762.ATACseq$overall.effect, chr1.568739.569762.ATACseq$overall.effect.var))

# test statistic for differential analysis
obs.statistic = res.ATACseq$logLR$value
obs.statistic
# [1] 98.91417

We visualize the effect size using the function plot. The effect size in Figure 3 of Shim et al. 2021 has been reproduced.

res.ATACseq$region   <- chr1.568739.569762.ATACseq$region
plot(res.ATACseq, threshold=2.6, is.xaxis=TRUE)
&nbsp;

 

We applied multiseq to two controls (media vs. ethanol) for the 242,714 regions in Shim et al. 2021 and the resulting 242,714 test statistics have been saved in ATACseq.multiseq.stat.null. To reproduce the p-value in Figure 3 of Shim et al 2021, we load ATACseq.multiseq.stat.null.

data(ATACseq.multiseq.stat.null, package="multiseq")

The two controls are expected to have no differences in chromatin accessibility. Thus, we can construct the empirical null distribution of the multiseq test statistic using the 242,714 test statistics in ATACseq.multiseq.stat.null. We can compute a p-value for each observed test statistic using the function get.pvalue.from.empirical.null.dist. This function uses a randomization technique to produce continuous p-values that are uniformly distributed under the null, leading to the accurate estimation of FDR in the ‘qvalue’ package. See ?get.pvalue.from.empirical.null.dist for the details of the randomization technique.

pval = get.pvalue.from.empirical.null.dist(ATACseq.multiseq.stat.null, obs.statistic, seed=149)
pval
# [1] 3.226061e-06

This p-value is slightly different from the p-value in Figure 3 of Shim et al 2021 due to the randomization technique. The function get.pvalue.from.empirical.null.dist can compute p-values for multiple observed test statistics.

pval2 = get.pvalue.from.empirical.null.dist(ATACseq.multiseq.stat.null,c(obs.statistic, 0.1), seed=149)
pval2
# [1] 3.226061e-06 7.322094e-01

RAN-seq data on the gene OAS1 for splicingQTL analysis from Pickrell et al. 2010.

In this section, we apply multiseq to RNA-seq data on the gene OAS1 (chr12:113354417-113358512) to test for association with genotypes at SNP rs10774671. The SNP rs10774671 was found to be associated with splicing in the gene OAS1 (see Figure 3 in Pickrell et al. 2010). We first load the RAN-seq data and relevant information:

data(OAS1, package="multiseq")

Let’s apply the multiseq method to the RNA-seq data (x) and genotype data (g). We provide library read depth (read.depth) in the function.

res.RNAseq <- multiseq(x=OAS1$x, g=OAS1$g, minobs=50, read.depth=OAS1$read.depth, lm.approx = FALSE)

# test statistic for association analysis
res.RNAseq$logLR$value
# [1] 265.8064

Let’s visualize the effect size using the function plot.

# smoothed average read counts for each group
res.RNAseq.0  <- multiseq(x=OAS1$x[OAS1$g==0,], read.depth=OAS1$read.depth[OAS1$g==0])
res.RNAseq.1  <- multiseq(x=OAS1$x[OAS1$g==1,], read.depth=OAS1$read.depth[OAS1$g==1])
res.RNAseq.2  <- multiseq(x=OAS1$x[OAS1$g==2,], read.depth=OAS1$read.depth[OAS1$g==2])

# visualize smoothed average read counts for each group, and the effect size using `plot` function.
par(mfrow=c(4,1), oma = c(0,0,0,0) + 0.1, mar = c(2,2,1,2), mgp=c(0,1,0))
res.RNAseq$region   <- OAS1$region
res.RNAseq.0$region   <- OAS1$region
res.RNAseq.1$region   <- OAS1$region
res.RNAseq.2$region   <- OAS1$region
M <- max(res.RNAseq.0$baseline.mean, res.RNAseq.1$baseline.mean, res.RNAseq.2$baseline.mean)
m <- min(res.RNAseq.0$baseline.mean, res.RNAseq.1$baseline.mean, res.RNAseq.2$baseline.mean)
ylim         <- exp(c(m,M))
plot(res.RNAseq.0, what="baseline", main="(genotype AA)", ylim=ylim)
plot(res.RNAseq.1, what="baseline", main="(genotype AG)", ylim=ylim)
plot(res.RNAseq.2, what="baseline", main="(genotype GG)", ylim=ylim)
plot(res.RNAseq, threshold=2.2)
&nbsp;

 

Session info

This is the version of R and the packages that were used to generate the results shown above.

sessionInfo()
# R version 4.0.2 (2020-06-22)
# Platform: x86_64-apple-darwin17.0 (64-bit)
# Running under: macOS Catalina 10.15.4
# 
# Matrix products: default
# BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
# LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
# 
# locale:
# [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
# 
# attached base packages:
# [1] tools     stats     graphics  grDevices utils     datasets  methods  
# [8] base     
# 
# other attached packages:
# [1] multiseq_0.1.3    data.table_1.12.8 ashr_1.0.12       Rcpp_1.0.5       
# 
# loaded via a namespace (and not attached):
#  [1] codetools_0.2-16  digest_0.6.25     foreach_1.5.1     truncnorm_1.0-8  
#  [5] MASS_7.3-51.6     magrittr_1.5      evaluate_0.14     highr_0.8        
#  [9] rlang_0.4.11      stringi_1.4.6     pscl_1.5.5        doParallel_1.0.16
# [13] rmarkdown_2.3     iterators_1.0.13  stringr_1.4.0     parallel_4.0.2   
# [17] xfun_0.15         yaml_2.2.1        compiler_4.0.2    SQUAREM_2021.1   
# [21] htmltools_0.5.0   knitr_1.29