Batch effect despite disabled parameters? #158

Closed
LeonHafner opened this issue Nov 30, 2022 · 10 comments

@LeonHafner

Hi all,
thanks for developing splatter/splatPop; it is one of the best tools for scRNA-seq simulation.
We are currently testing different DE methods for their performance on multiple simulation scenarios.
One scenario consists of 10 samples in a single batch, while a more difficult one distributes those 10 samples equally across two batches. Our problem is that the simulation with batches is always too tough for the methods, regardless of the batch parameters, and thereby produces many false positives.
For debugging I created two equal simulations with 10 samples (500 cells each) and 1,000 genes. The second simulation assigns the samples to two batches, while the first one consists of only one batch. As my batch.facLoc and batch.facScale parameters are both zero (LogNorm(0, 0) = 1 and rlnorm(1, 0, 0) = 1), those batches should not introduce any effect but should lead to the same dataset as the first simulation. However, this is not the case. After pseudobulking the data and testing for DE with DESeq2, the first scenario (one batch) results in much more confident p-values. This in turn results in many false positives and makes the second simulation much tougher despite the same parameters.
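As a quick sanity check of that assumption (just the degenerate log-normal, not my actual pipeline):

# A log-normal with location and scale both 0 always returns exactly 1,
# so the batch factors should all be 1 and have no effect
rlnorm(5, meanlog = 0, sdlog = 0)
#> [1] 1 1 1 1 1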
Did I misunderstand something, or where does the additional batch effect come from despite the disabled parameters?

Thanks a lot for your help!

Here is the code to simulate and test the two scenarios. You should be able to run it without additional information:

suppressPackageStartupMessages({
  library("splatter")
  library("scater")
  library("VariantAnnotation")
  library("ggplot2")
  library("data.table")
  library("DESeq2")
})

set.seed(0)

vcf <- mockVCF(n.samples = 10)

params.1 <- newSplatPopParams(nGenes = 1000,
                              eqtl.ES.shape = 0,
                              eqtl.ES.rate = 0,
                              out.prob = 0,
                              eqtl.n = 0,
                              eqtl.group.specific = 0,
                              eqtl.condition.specific = 0,
                              batch.size = 10,
                              batchCells = 500,
                              batch.facLoc = 0,
                              batch.facScale = 0,
                              condition.prob = c(0.5, 0.5),
                              cde.prob = 0.025,
                              cde.facLoc = 2.5,
                              cde.facScale = 0.4
)

params.2 <- newSplatPopParams(nGenes = 1000,
                              eqtl.ES.shape = 0,
                              eqtl.ES.rate = 0,
                              out.prob = 0,
                              eqtl.n = 0,
                              eqtl.group.specific = 0,
                              eqtl.condition.specific = 0,
                              batch.size = 5,
                              batchCells = c(500, 500),
                              batch.facLoc = 0,
                              batch.facScale = 0,
                              condition.prob = c(0.5, 0.5),
                              cde.prob = 0.025,
                              cde.facLoc = 2.5,
                              cde.facScale = 0.4
)

sim.1 <- splatPopSimulate(vcf = vcf, params = params.1)
sim.2 <- splatPopSimulate(vcf = vcf, params = params.2)

# Pseudobulk on sample level
pb.1 <- aggregateAcrossCells(x = sim.1, ids = colData(sim.1)[, c("Batch", "Sample")])
pb.2 <- aggregateAcrossCells(x = sim.2, ids = colData(sim.2)[, c("Batch", "Sample")])

# Prepare counts and colData for DESeq2
counts.1 <- counts(pb.1)
colnames(counts.1) <- colData(pb.1)[, "Sample"]
colData.1 <- colData(pb.1)
rownames(colData.1) <- colData(pb.1)[, "Sample"]

counts.2 <- counts(pb.2)
colnames(counts.2) <- colData(pb.2)[, "Sample"]
colData.2 <- colData(pb.2)
rownames(colData.2) <- colData(pb.2)[, "Sample"]

# DE testing with DESeq2
dds.1 <- DESeqDataSetFromMatrix(countData = counts.1,
                                colData = colData.1,
                                design = ~ Condition)

dds.1 <- estimateSizeFactors(dds.1)
dds.1 <- DESeq(dds.1)
res.1 <- as.data.table(results(dds.1))


dds.2 <- DESeqDataSetFromMatrix(countData = counts.2,
                                colData = colData.2,
                                design = ~ Condition)

dds.2 <- estimateSizeFactors(dds.2)
dds.2 <- DESeq(dds.2)
res.2 <- as.data.table(results(dds.2))

This is the output of DESeq2 for both scenarios, sorted by p-value. Scenario 1 results in much more confident p-values:

> res.1[order(pvalue)]
        baseMean log2FoldChange      lfcSE          stat        pvalue          padj
   1:  39271.348   3.7530782222 0.09474242  39.613492571  0.000000e+00  0.000000e+00
   2:  39519.200  -4.4579613134 0.11356787 -39.253718144  0.000000e+00  0.000000e+00
   3:  14849.803  -3.9289920768 0.12054728 -32.592956021 5.160889e-233 1.715135e-230
   4: 387526.760   3.7592597035 0.13433183  27.984877552 2.482390e-172 6.187357e-170
   5:  12441.673   3.9235447150 0.14047984  27.929591721 1.166780e-171 2.326560e-169
  ---                                                                               
 996:  60840.819   0.0047395526 1.27712223   0.003711119  9.970390e-01  9.980400e-01
 997:  76783.583  -0.0002658594 0.11221507  -0.002369195  9.981097e-01  9.981097e-01
 998:    852.962   2.8026617725 1.13404163   2.471392325            NA            NA
 999: 113210.342  -2.1709415982 2.26108870  -0.960131110            NA            NA
1000:  29380.410  -1.1309774702 1.25779486  -0.899174822            NA            NA

> res.2[order(pvalue)]
        baseMean log2FoldChange      lfcSE          stat       pvalue        padj
   1:  21310.844  -1.181275e+01 2.47649864 -4.7699395379 1.842812e-06 0.001837284
   2:  32270.440  -4.253230e-01 0.11013273 -3.8619131284 1.125026e-04 0.056082533
   3:   2270.457   1.608649e+00 0.43072538  3.7347444984 1.879059e-04 0.062447381
   4:  42064.238  -2.932651e-01 0.08956397 -3.2743646938 1.058998e-03 0.263955354
   5:  46109.017   1.296668e-01 0.04750404  2.7295945781 6.341226e-03 0.997434679
  ---                                                                            
 996:  56936.018  -3.372419e-03 1.21392230 -0.0027781173 9.977834e-01 0.998785176
 997:  27274.740   2.891972e-04 0.77664735  0.0003723662 9.997029e-01 0.999702895
 998:  55969.223  -9.208808e-01 1.12961802 -0.8152143743           NA          NA
 999: 425851.590   9.142142e-01 2.64224044  0.3459996223           NA          NA
1000:  22809.505   2.695978e+00 0.93114677  2.8953304364           NA          NA
SessionInfo

> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] tools stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] DESeq2_1.36.0 VariantAnnotation_1.42.1 Rsamtools_2.12.0 Biostrings_2.64.1
[5] XVector_0.36.0 splatter_1.20.0 tidyr_1.2.1 dplyr_1.0.10
[9] pheatmap_1.0.12 RColorBrewer_1.1-3 wesanderson_0.3.6 filesstrings_3.2.3
[13] stringr_1.4.1 readxl_1.4.1 data.table_1.14.6 zellkonverter_1.6.5
[17] scater_1.24.0 ggplot2_3.4.0 scuttle_1.6.3 MAST_1.22.0
[21] SingleCellExperiment_1.18.1 SummarizedExperiment_1.26.1 Biobase_2.56.0 GenomicRanges_1.48.0
[25] GenomeInfoDb_1.32.4 IRanges_2.30.1 S4Vectors_0.34.0 BiocGenerics_0.42.0
[29] MatrixGenerics_1.8.1 matrixStats_0.62.0

loaded via a namespace (and not attached):
[1] backports_1.4.1 BiocFileCache_2.4.0 plyr_1.8.8 splines_4.2.1 BiocParallel_1.30.4
[6] digest_0.6.30 viridis_0.6.2 fansi_1.0.3 magrittr_2.0.3 checkmate_2.1.0
[11] memoise_2.0.1 strex_1.4.4 BSgenome_1.64.0 ScaledMatrix_1.4.1 annotate_1.74.0
[16] prettyunits_1.1.1 colorspace_2.0-3 blob_1.2.3 rappdirs_0.3.3 ggrepel_0.9.2
[21] crayon_1.5.2 RCurl_1.98-1.9 jsonlite_1.8.3 genefilter_1.78.0 survival_3.4-0
[26] glue_1.6.2 gtable_0.3.1 zlibbioc_1.42.0 DelayedArray_0.22.0 BiocSingular_1.12.0
[31] abind_1.4-5 scales_1.2.1 DBI_1.1.3 Rcpp_1.0.9 viridisLite_0.4.1
[36] xtable_1.8-4 progress_1.2.2 reticulate_1.26 bit_4.0.5 rsvd_1.0.5
[41] preprocessCore_1.58.0 httr_1.4.4 dir.expiry_1.4.0 ellipsis_0.3.2 pkgconfig_2.0.3
[46] XML_3.99-0.12 farver_2.1.1 dbplyr_2.2.1 locfit_1.5-9.6 utf8_1.2.2
[51] here_1.0.1 tidyselect_1.2.0 labeling_0.4.2 rlang_1.0.6 reshape2_1.4.4
[56] AnnotationDbi_1.58.0 munsell_0.5.0 cellranger_1.1.0 cachem_1.0.6 cli_3.4.1
[61] generics_0.1.3 RSQLite_2.2.18 fastmap_1.1.0 yaml_2.3.6 bit64_4.0.5
[66] purrr_0.3.5 KEGGREST_1.36.3 sparseMatrixStats_1.8.0 xml2_1.3.3 biomaRt_2.52.0
[71] compiler_4.2.1 rstudioapi_0.14 beeswarm_0.4.0 filelock_1.0.2 curl_4.3.3
[76] png_0.1-7 tibble_3.1.8 geneplotter_1.74.0 stringi_1.7.8 basilisk.utils_1.8.0
[81] GenomicFeatures_1.48.4 lattice_0.20-45 Matrix_1.5-3 vctrs_0.5.1 pillar_1.8.1
[86] lifecycle_1.0.3 BiocNeighbors_1.14.0 bitops_1.0-7 irlba_2.3.5.1 rtracklayer_1.56.1
[91] R6_2.5.1 BiocIO_1.6.0 gridExtra_2.3 vipor_0.4.5 codetools_0.2-18
[96] assertthat_0.2.1 rprojroot_2.0.3 rjson_0.2.21 withr_2.5.0 GenomicAlignments_1.32.1
[101] GenomeInfoDbData_1.2.8 parallel_4.2.1 hms_1.1.2 grid_4.2.1 beachmat_2.12.0
[106] basilisk_1.8.1 DelayedMatrixStats_1.18.2 ggbeeswarm_0.6.0 restfulr_0.0.15

@lazappi
Collaborator

lazappi commented Dec 1, 2022

@azodichr Can you please help out here? I think you are right that those parameters should turn off the batch effects that come from the base Splat model, but I'm not sure whether there is another part of the splatPop model we need to think about.

@LeonHafner
Author

LeonHafner commented Dec 12, 2022

I would like to add these two sanity plots that check the count distribution for the pseudobulked datasets used above (pb.1 and pb.2).
Shown are the counts per condition, shaped by their associated batch, for the top 10 differentially expressed genes (by fold change) as indicated by the splatter/splatPop ConditionDE columns (rowData).

[Plot for simulation 1 (based on pseudobulk): without_batches]

[Plot for simulation 2 (based on pseudobulk): with_batches]

I think that makes it clear why DESeq2 returns quite high p-values in the second scenario. The specified FCs are those initially sampled by splatPop from a log-normal distribution (max(ConditionDE.Condition1 / ConditionDE.Condition2, ConditionDE.Condition2 / ConditionDE.Condition1)) and are not calculated from the final simulated counts.
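In case it helps, here is a sketch of how such a plot can be produced for pb.1 (it assumes the ConditionDE.Condition1/ConditionDE.Condition2 columns in rowData(sim.1) and the Condition and Batch columns retained by aggregateAcrossCells; the same idea applies to pb.2):

# Specified fold change per gene, taken from the simulation metadata
# (not recalculated from the simulated counts)
rd <- as.data.frame(rowData(sim.1))
spec.fc <- pmax(rd$ConditionDE.Condition1 / rd$ConditionDE.Condition2,
                rd$ConditionDE.Condition2 / rd$ConditionDE.Condition1)
top.genes <- rownames(sim.1)[order(spec.fc, decreasing = TRUE)][1:10]

# Long format: pseudobulk counts of the top genes with condition and batch labels
pb.counts <- counts(pb.1)[top.genes, , drop = FALSE]
plot.df <- data.frame(Gene = rep(top.genes, times = ncol(pb.counts)),
                      Count = as.vector(pb.counts),
                      Condition = rep(colData(pb.1)$Condition, each = length(top.genes)),
                      Batch = rep(colData(pb.1)$Batch, each = length(top.genes)))

ggplot(plot.df, aes(x = Condition, y = Count, shape = Batch)) +
  geom_jitter(width = 0.1) +
  facet_wrap(~ Gene, scales = "free_y")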

@lazappi
Copy link
Collaborator

lazappi commented Dec 15, 2022

Ok, it took me a while to understand, but I think I get what the issue is now. Just to confirm (and in case I need to remind myself later): the issue is that you should get the same difference between conditions in simulation 2 as in simulation 1, but for some reason you don't (and this seems to have something to do with the batch effect parameters)?

  • Are these simulations the same as the code in your original post (so I know what to check)?
  • Have you checked the same thing happens with splatSimulate() (is the base model the issue or something to do with splatPop)?
  • Have you checked that the batch effect factors are all equal to 1 in the final simulation (for example, with the sketch below)?
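Something like this should show whether the simulated batch factors are really all 1 (a sketch; the "BatchFac" column-name pattern is what the base Splat model uses, so the exact names in the splatPop output may differ):

# Find all batch factor columns in the gene metadata and check they are all 1
batch.cols <- grep("BatchFac", colnames(rowData(sim.2)), value = TRUE)
sapply(batch.cols, function(col) all(rowData(sim.2)[[col]] == 1))
# expected: TRUE for every batch column if the effect is really disabled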

@LeonHafner
Author

Exactly. Since the batch effects are turned off in sim.2, I should get approximately the same p-values as with sim.1, which has no batches. This happens in every run, independent of the seed.

  • Yes, the plots in my second post are based on the pseudobulked datasets (pb.1 and pb.2).
  • I did not check the splatter base model regarding that problem. How would you suggest doing that, since there are no samples or conditions in splatter (if I remember correctly) for getting a comparable pseudobulk?
  • Yes, the corresponding columns in the rowData of the simulations are all 1.

@lazappi
Collaborator

lazappi commented Dec 16, 2022

That sounds more like there is an issue somewhere. For the base model, I forgot that you were looking at pseudobulks. I think you can probably make similar boxplots of cell-level expression; they won't be quite the same, but we should be able to see whether the same pattern appears. A sketch of what I mean is below.
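Something along these lines might work (an untested sketch; splatSimulate() uses groups rather than samples/conditions, and the DEFacGroup columns are the base-model equivalent of the ConditionDE columns):

library("splatter")
library("scater")

# Base model: two groups, two batches, batch effect disabled
params <- newSplatParams(nGenes = 1000,
                         batchCells = c(500, 500),
                         batch.facLoc = 0,
                         batch.facScale = 0,
                         group.prob = c(0.5, 0.5),
                         de.prob = 0.025,
                         de.facLoc = 2.5,
                         de.facScale = 0.4)
sim <- splatSimulate(params, method = "groups")
sim <- logNormCounts(sim)

# Top DE genes by the specified (not observed) fold change
fc <- pmax(rowData(sim)$DEFacGroup1 / rowData(sim)$DEFacGroup2,
           rowData(sim)$DEFacGroup2 / rowData(sim)$DEFacGroup1)
top.genes <- rownames(sim)[order(fc, decreasing = TRUE)][1:10]

# Cell-level expression per group, coloured by batch
plotExpression(sim, features = top.genes, x = "Group", colour_by = "Batch")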

@LeonHafner
Author

LeonHafner commented Dec 16, 2022

Can you replicate the problem I described in my first post when you run the code, or does it only occur on my end? I have tested it on two different machines; however, they have very similar packages installed.

@azodichr
Collaborator

Hi @LeonHafner, thanks for bringing this to our attention, and apologies for the slow response; I unexpectedly lost access to my post-doc accounts after starting a new position in the private sector.

Looking into your examples more closely, there seems to be a bug in the code that causes the reported Condition IDs to not match the condition group assignments made within the code. This mix-up is only triggered if batches are requested, and I believe it was introduced just a few months ago with a separate bug fix relating to batch effect simulation. I will work on a fix and let you know when I have a patch ready.

Note that if you plot the cells from your example with a PCA, colouring/shaping by batch/condition/sample, you can see that setting batch.facLoc = 0 and batch.facScale = 0 has the desired consequence of no batch effects.
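For example (a quick sketch with scater, using sim.2 from your code above):

# PCA of the cell-level data; there should be no separation by batch
sim.2 <- logNormCounts(sim.2)
sim.2 <- runPCA(sim.2)
plotPCA(sim.2, colour_by = "Condition", shape_by = "Batch")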

@LeonHafner
Author

Hi @azodichr,
thanks a lot for finding the cause of this bug. I really appreciate it, since I couldn't find it myself after two whole days. Your explanation makes total sense: with samples wrongly assigned to groups, there should only be a random differential expression signal.
Is there currently any other way I could use to simulate samples with conditions and batches?
Sure, there is no batch effect in the PCA, but how would that help me with the differential expression testing if the condition assignment is mixed up?
Thank you very much for your time, I really appreciate your effort!

@lazappi added the bug label and removed the question label on Jan 2, 2023
@azodichr
Collaborator

Hi @LeonHafner, thanks again for bringing this issue to our attention and for your patience. I have fixed the problem with condition group assignments getting mislabeled and pushed it to @lazappi (see pull request "Issue_158") to be merged into the dev version.

@lazappi
Collaborator

lazappi commented Jan 30, 2023

I have merged the PR and the fix should be available in the release and devel versions soon.
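If you need the fix before it reaches Bioconductor, installing the development version from GitHub should work (a sketch; this assumes the usual Oshlack/splatter repository):

# Install the GitHub development version containing the fix
if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
}
remotes::install_github("Oshlack/splatter")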
