Batch effect despite disabled parameters? #158

Closed
LeonHafner opened this issue Nov 30, 2022 · 10 comments

@LeonHafner

Hi all,
thanks for developing splatter/splatPop; it is one of the best tools for scRNA-seq simulation.
We are currently testing different DE methods for their performance on multiple simulation scenarios.
One scenario consists of 10 samples in a single batch, while a more difficult one distributes those 10 samples equally across two batches. Our problem is that the simulation with batches is always too tough for the methods, regardless of the batch parameters, and thereby produces many false positives.
For debugging I created two equal simulations with 10 samples (500 cells each) and 1,000 genes. The second simulation assigns the samples to two batches, while the first one consists of only one batch. As my batch.facLoc and batch.facScale parameters are both zero (LogNorm(0, 0) = 1 and rlnorm(1, 0, 0) = 1), those batches should not introduce any effect but should lead to the same dataset as the first simulation. However, this is not the case. After pseudobulking the data and testing for DE with DESeq2, the first scenario (one batch) results in much more confident p-values. This in turn results in many false positives and makes the second simulation much tougher despite the same parameters.
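As a quick sanity check of that assumption (just the degenerate log-normal, not my actual pipeline):

# A log-normal with location and scale both 0 always returns exactly 1,
# so the batch factors should all be 1 and have no effect
rlnorm(5, meanlog = 0, sdlog = 0)
#> [1] 1 1 1 1 1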
Did I misunderstand something, or where does the additional batch effect come from despite the disabled parameters?

Thanks a lot for your help!

Here is the code to simulate and test the two scenarios. You should be able to run it without additional information:

suppressPackageStartupMessages({
  library("splatter")
  library("scater")
  library("VariantAnnotation")
  library("ggplot2")
  library("data.table")
  library("DESeq2")
})

set.seed(0)

vcf <- mockVCF(n.samples = 10)

params.1 <- newSplatPopParams(nGenes = 1000,
                              eqtl.ES.shape = 0,
                              eqtl.ES.rate = 0,
                              out.prob = 0,
                              eqtl.n = 0,
                              eqtl.group.specific = 0,
                              eqtl.condition.specific = 0,
                              batch.size = 10,
                              batchCells = 500,
                              batch.facLoc = 0,
                              batch.facScale = 0,
                              condition.prob = c(0.5, 0.5),
                              cde.prob = 0.025,
                              cde.facLoc = 2.5,
                              cde.facScale = 0.4
)

params.2 <- newSplatPopParams(nGenes = 1000,
                              eqtl.ES.shape = 0,
                              eqtl.ES.rate = 0,
                              out.prob = 0,
                              eqtl.n = 0,
                              eqtl.group.specific = 0,
                              eqtl.condition.specific = 0,
                              batch.size = 5,
                              batchCells = c(500, 500),
                              batch.facLoc = 0,
                              batch.facScale = 0,
                              condition.prob = c(0.5, 0.5),
                              cde.prob = 0.025,
                              cde.facLoc = 2.5,
                              cde.facScale = 0.4
)

sim.1 <- splatPopSimulate(vcf = vcf, params = params.1)
sim.2 <- splatPopSimulate(vcf = vcf, params = params.2)

# Pseudobulk on sample level
pb.1 <- aggregateAcrossCells(x = sim.1, ids = colData(sim.1)[, c("Batch", "Sample")])
pb.2 <- aggregateAcrossCells(x = sim.2, ids = colData(sim.2)[, c("Batch", "Sample")])

# Prepare counts and colData for DESeq2
counts.1 <- counts(pb.1)
colnames(counts.1) <- colData(pb.1)[, "Sample"]
colData.1 <- colData(pb.1)
rownames(colData.1) <- colData(pb.1)[, "Sample"]

counts.2 <- counts(pb.2)
colnames(counts.2) <- colData(pb.2)[, "Sample"]
colData.2 <- colData(pb.2)
rownames(colData.2) <- colData(pb.2)[, "Sample"]

# DE testing with DESeq2
dds.1 <- DESeqDataSetFromMatrix(countData = counts.1,
                                colData = colData.1,
                                design = ~ Condition)

dds.1 <- estimateSizeFactors(dds.1)
dds.1 <- DESeq(dds.1)
res.1 <- as.data.table(results(dds.1))


dds.2 <- DESeqDataSetFromMatrix(countData = counts.2,
                                colData = colData.2,
                                design = ~ Condition)

dds.2 <- estimateSizeFactors(dds.2)
dds.2 <- DESeq(dds.2)
res.2 <- as.data.table(results(dds.2))

This is the output of DESeq2 for both scenarios, sorted by p-value. Scenario 1 results in much more confident p-values:

> res.1[order(pvalue)]
        baseMean log2FoldChange      lfcSE          stat        pvalue          padj
   1:  39271.348   3.7530782222 0.09474242  39.613492571  0.000000e+00  0.000000e+00
   2:  39519.200  -4.4579613134 0.11356787 -39.253718144  0.000000e+00  0.000000e+00
   3:  14849.803  -3.9289920768 0.12054728 -32.592956021 5.160889e-233 1.715135e-230
   4: 387526.760   3.7592597035 0.13433183  27.984877552 2.482390e-172 6.187357e-170
   5:  12441.673   3.9235447150 0.14047984  27.929591721 1.166780e-171 2.326560e-169
  ---                                                                               
 996:  60840.819   0.0047395526 1.27712223   0.003711119  9.970390e-01  9.980400e-01
 997:  76783.583  -0.0002658594 0.11221507  -0.002369195  9.981097e-01  9.981097e-01
 998:    852.962   2.8026617725 1.13404163   2.471392325            NA            NA
 999: 113210.342  -2.1709415982 2.26108870  -0.960131110            NA            NA
1000:  29380.410  -1.1309774702 1.25779486  -0.899174822            NA            NA

> res.2[order(pvalue)]
        baseMean log2FoldChange      lfcSE          stat       pvalue        padj
   1:  21310.844  -1.181275e+01 2.47649864 -4.7699395379 1.842812e-06 0.001837284
   2:  32270.440  -4.253230e-01 0.11013273 -3.8619131284 1.125026e-04 0.056082533
   3:   2270.457   1.608649e+00 0.43072538  3.7347444984 1.879059e-04 0.062447381
   4:  42064.238  -2.932651e-01 0.08956397 -3.2743646938 1.058998e-03 0.263955354
   5:  46109.017   1.296668e-01 0.04750404  2.7295945781 6.341226e-03 0.997434679
  ---                                                                            
 996:  56936.018  -3.372419e-03 1.21392230 -0.0027781173 9.977834e-01 0.998785176
 997:  27274.740   2.891972e-04 0.77664735  0.0003723662 9.997029e-01 0.999702895
 998:  55969.223  -9.208808e-01 1.12961802 -0.8152143743           NA          NA
 999: 425851.590   9.142142e-01 2.64224044  0.3459996223           NA          NA
1000:  22809.505   2.695978e+00 0.93114677  2.8953304364           NA          NA
SessionInfo

> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] tools stats4 stats graphics grDevices utils datasets methods base

other attached packages:
[1] DESeq2_1.36.0 VariantAnnotation_1.42.1 Rsamtools_2.12.0 Biostrings_2.64.1
[5] XVector_0.36.0 splatter_1.20.0 tidyr_1.2.1 dplyr_1.0.10
[9] pheatmap_1.0.12 RColorBrewer_1.1-3 wesanderson_0.3.6 filesstrings_3.2.3
[13] stringr_1.4.1 readxl_1.4.1 data.table_1.14.6 zellkonverter_1.6.5
[17] scater_1.24.0 ggplot2_3.4.0 scuttle_1.6.3 MAST_1.22.0
[21] SingleCellExperiment_1.18.1 SummarizedExperiment_1.26.1 Biobase_2.56.0 GenomicRanges_1.48.0
[25] GenomeInfoDb_1.32.4 IRanges_2.30.1 S4Vectors_0.34.0 BiocGenerics_0.42.0
[29] MatrixGenerics_1.8.1 matrixStats_0.62.0

loaded via a namespace (and not attached):
[1] backports_1.4.1 BiocFileCache_2.4.0 plyr_1.8.8 splines_4.2.1 BiocParallel_1.30.4
[6] digest_0.6.30 viridis_0.6.2 fansi_1.0.3 magrittr_2.0.3 checkmate_2.1.0
[11] memoise_2.0.1 strex_1.4.4 BSgenome_1.64.0 ScaledMatrix_1.4.1 annotate_1.74.0
[16] prettyunits_1.1.1 colorspace_2.0-3 blob_1.2.3 rappdirs_0.3.3 ggrepel_0.9.2
[21] crayon_1.5.2 RCurl_1.98-1.9 jsonlite_1.8.3 genefilter_1.78.0 survival_3.4-0
[26] glue_1.6.2 gtable_0.3.1 zlibbioc_1.42.0 DelayedArray_0.22.0 BiocSingular_1.12.0
[31] abind_1.4-5 scales_1.2.1 DBI_1.1.3 Rcpp_1.0.9 viridisLite_0.4.1
[36] xtable_1.8-4 progress_1.2.2 reticulate_1.26 bit_4.0.5 rsvd_1.0.5
[41] preprocessCore_1.58.0 httr_1.4.4 dir.expiry_1.4.0 ellipsis_0.3.2 pkgconfig_2.0.3
[46] XML_3.99-0.12 farver_2.1.1 dbplyr_2.2.1 locfit_1.5-9.6 utf8_1.2.2
[51] here_1.0.1 tidyselect_1.2.0 labeling_0.4.2 rlang_1.0.6 reshape2_1.4.4
[56] AnnotationDbi_1.58.0 munsell_0.5.0 cellranger_1.1.0 cachem_1.0.6 cli_3.4.1
[61] generics_0.1.3 RSQLite_2.2.18 fastmap_1.1.0 yaml_2.3.6 bit64_4.0.5
[66] purrr_0.3.5 KEGGREST_1.36.3 sparseMatrixStats_1.8.0 xml2_1.3.3 biomaRt_2.52.0
[71] compiler_4.2.1 rstudioapi_0.14 beeswarm_0.4.0 filelock_1.0.2 curl_4.3.3
[76] png_0.1-7 tibble_3.1.8 geneplotter_1.74.0 stringi_1.7.8 basilisk.utils_1.8.0
[81] GenomicFeatures_1.48.4 lattice_0.20-45 Matrix_1.5-3 vctrs_0.5.1 pillar_1.8.1
[86] lifecycle_1.0.3 BiocNeighbors_1.14.0 bitops_1.0-7 irlba_2.3.5.1 rtracklayer_1.56.1
[91] R6_2.5.1 BiocIO_1.6.0 gridExtra_2.3 vipor_0.4.5 codetools_0.2-18
[96] assertthat_0.2.1 rprojroot_2.0.3 rjson_0.2.21 withr_2.5.0 GenomicAlignments_1.32.1
[101] GenomeInfoDbData_1.2.8 parallel_4.2.1 hms_1.1.2 grid_4.2.1 beachmat_2.12.0
[106] basilisk_1.8.1 DelayedMatrixStats_1.18.2 ggbeeswarm_0.6.0 restfulr_0.0.15

@lazappi
Collaborator

lazappi commented Dec 1, 2022

@azodichr Can you please help out here? I think you are right that those parameters should turn off the batch effects that come from the base Splat model, but I'm not sure whether there is another part of the splatPop model we need to think about.

@LeonHafner
Author

LeonHafner commented Dec 12, 2022

I would like to add these two sanity plots that check the count distribution for the pseudobulked datasets used above (pb.1 and pb.2).
Shown are the counts per condition, shaped by their associated batch, for the top 10 differentially expressed genes (by fold change) as indicated by the splatter/splatPop ConditionDE columns (rowData).

[Plot for simulation 1 (based on pseudobulk): without_batches]

[Plot for simulation 2 (based on pseudobulk): with_batches]

I think that makes it clear why DESeq2 returns quite high p-values in the second scenario. The specified FCs are those initially sampled by splatPop from a log-normal distribution (max(ConditionDE.Condition1 / ConditionDE.Condition2, ConditionDE.Condition2 / ConditionDE.Condition1)) and are not calculated from the final simulated counts.
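In case it helps, here is a sketch of how such a plot can be produced for pb.1 (it assumes the ConditionDE.Condition1/ConditionDE.Condition2 columns in rowData(sim.1) and the Condition and Batch columns retained by aggregateAcrossCells; the same idea applies to pb.2):

# Specified fold change per gene, taken from the simulation metadata
# (not recalculated from the simulated counts)
rd <- as.data.frame(rowData(sim.1))
spec.fc <- pmax(rd$ConditionDE.Condition1 / rd$ConditionDE.Condition2,
                rd$ConditionDE.Condition2 / rd$ConditionDE.Condition1)
top.genes <- rownames(sim.1)[order(spec.fc, decreasing = TRUE)][1:10]

# Long format: pseudobulk counts of the top genes with condition and batch labels
pb.counts <- counts(pb.1)[top.genes, , drop = FALSE]
plot.df <- data.frame(Gene = rep(top.genes, times = ncol(pb.counts)),
                      Count = as.vector(pb.counts),
                      Condition = rep(colData(pb.1)$Condition, each = length(top.genes)),
                      Batch = rep(colData(pb.1)$Batch, each = length(top.genes)))

ggplot(plot.df, aes(x = Condition, y = Count, shape = Batch)) +
  geom_jitter(width = 0.1) +
  facet_wrap(~ Gene, scales = "free_y")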

@lazappi
Copy link
Collaborator

lazappi commented Dec 15, 2022

Ok, it took me a while to understand, but I think I get what the issue is now. Just to confirm (and in case I need to remind myself later): the issue is that you should get the same difference between conditions in simulation 2 as in simulation 1, but for some reason you don't (and this seems to have something to do with the batch effect parameters)?

  • Are these simulations the same as the code in your original post (so I know what to check)?
  • Have you checked the same thing happens with splatSimulate() (is the base model the issue or something to do with splatPop)?
  • Have you checked that the batch effect factors are all equal to 1 in the final simulation (for example, with the sketch below)?
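Something like this should show whether the simulated batch factors are really all 1 (a sketch; the "BatchFac" column-name pattern is what the base Splat model uses, so the exact names in the splatPop output may differ):

# Find all batch factor columns in the gene metadata and check they are all 1
batch.cols <- grep("BatchFac", colnames(rowData(sim.2)), value = TRUE)
sapply(batch.cols, function(col) all(rowData(sim.2)[[col]] == 1))
# expected: TRUE for every batch column if the effect is really disabled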

@LeonHafner
Author

Exactly. Since the batch effects are turned off in sim.2, I should get approximately the same p-values as with sim.1, which has no batches. This happens in every run, independent of the seed.

  • Yes, the plots in my second post are based on the pseudobulked datasets (pb.1 and pb.2).
  • I did not check the splatter base model regarding that problem. How would you suggest doing that, since there are no samples or conditions in splatter (if I remember correctly) for getting a comparable pseudobulk?
  • Yes, the corresponding columns in the rowData of the simulations are all 1.

@lazappi
Collaborator

lazappi commented Dec 16, 2022

That sounds more like there is an issue somewhere. For the base model, I forgot that you were looking at pseudobulks. I think you can probably make similar boxplots of cell-level expression; they won't be quite the same, but we should be able to see whether the same pattern appears. A sketch of what I mean is below.
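Something along these lines might work (an untested sketch; splatSimulate() uses groups rather than samples/conditions, and the DEFacGroup columns are the base-model equivalent of the ConditionDE columns):

library("splatter")
library("scater")

# Base model: two groups, two batches, batch effect disabled
params <- newSplatParams(nGenes = 1000,
                         batchCells = c(500, 500),
                         batch.facLoc = 0,
                         batch.facScale = 0,
                         group.prob = c(0.5, 0.5),
                         de.prob = 0.025,
                         de.facLoc = 2.5,
                         de.facScale = 0.4)
sim <- splatSimulate(params, method = "groups")
sim <- logNormCounts(sim)

# Top DE genes by the specified (not observed) fold change
fc <- pmax(rowData(sim)$DEFacGroup1 / rowData(sim)$DEFacGroup2,
           rowData(sim)$DEFacGroup2 / rowData(sim)$DEFacGroup1)
top.genes <- rownames(sim)[order(fc, decreasing = TRUE)][1:10]

# Cell-level expression per group, coloured by batch
plotExpression(sim, features = top.genes, x = "Group", colour_by = "Batch")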

@LeonHafner
Author

LeonHafner commented Dec 16, 2022

Can you replicate the problem I described in my first post when you run the code, or does it only occur on my end? I have tested it on two different machines; however, they have very similar packages installed.

@azodichr
Collaborator

Hi @LeonHafner, thanks for bringing this to our attention, and apologies for the slow response; I unexpectedly lost access to my post-doc accounts after starting a new position in the private sector.

Looking into your examples more closely, there seems to be a bug in the code that causes the reported Condition IDs to not match the condition group assignments made within the code. This mix-up is only triggered if batches are requested, and I believe it was introduced just a few months ago with a separate bug fix relating to batch effect simulation. I will work on a fix and let you know when I have a patch ready.

Note that if you plot the cells from your example with a PCA, colouring/shaping by batch/condition/sample, you can see that setting batch.facLoc = 0 and batch.facScale = 0 has the desired consequence of no batch effects.
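For example (a quick sketch with scater, using sim.2 from your code above):

# PCA of the cell-level data; there should be no separation by batch
sim.2 <- logNormCounts(sim.2)
sim.2 <- runPCA(sim.2)
plotPCA(sim.2, colour_by = "Condition", shape_by = "Batch")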

@LeonHafner
Author

Hi @azodichr,
thanks a lot for finding the cause of this bug. I really appreciate it, since I couldn't find it myself after two whole days. Your explanation makes total sense: with samples wrongly assigned to groups, there should only be a random differential expression signal.
Is there currently any other way I could use to simulate samples with conditions and batches?
Sure, there is no batch effect in the PCA, but how would that help me with the differential expression testing if the condition assignment is mixed up?
Thank you very much for your time, I really appreciate your effort!

@lazappi added the bug label and removed the question label on Jan 2, 2023
@azodichr
Collaborator

Hi @LeonHafner, thanks again for bringing this issue to our attention and for your patience. I have fixed the problem with condition group assignments getting mislabeled and pushed it to @lazappi (see pull request "Issue_158") to be merged into the dev version.

@lazappi
Collaborator

lazappi commented Jan 30, 2023

I have merged the PR and the fix should be available in the release and devel versions soon.
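If you need the fix before it reaches Bioconductor, installing the development version from GitHub should work (a sketch; this assumes the usual Oshlack/splatter repository):

# Install the GitHub development version containing the fix
if (!requireNamespace("remotes", quietly = TRUE)) {
    install.packages("remotes")
}
remotes::install_github("Oshlack/splatter")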
