Batch effect despite disabled parameters? #158
Comments
@azodichr Can you please help out here? I think you are right that those parameters should turn off the batch effects, which come from the base Splat model, but I'm not sure whether there is another part of the splatPop model we need to think about.
I would like to add two sanity plots that check the count distribution for the pseudobulked datasets used above.
[Plot for simulation 1 (based on pseudobulk)]
[Plot for simulation 2 (based on pseudobulk)]
I think that makes it clear why DESeq2 returns quite high p-values in the second scenario. The specified FCs are those initially sampled by splatPop from a log-normal distribution.
Ok, it took me a while to understand but I think I get what the issue is now. Just to confirm (and in case I need to remind myself later) the issue is that you should get the same difference between conditions in simulation 2 as simulation 1 but for some reason you don't (and this effect seems to have something to do with the batch effect parameters)?
Exactly, since the batch effects are turned off in
That's sounding more like there is an issue somewhere. For the base model I forgot that you were looking at pseudobulks. I think you can probably make similar boxplots of cell level expression. They won't be quite the same but we should be able to see if there is the same pattern.
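For anyone following along, the difference between pseudobulk and cell-level values can be sketched in base R. This is only an illustration on a toy count matrix, not the code from this issue: the matrix, the gene and cell names, and the `sample_id` labels are all made up.

```r
# Toy count matrix: 4 genes x 6 cells (hypothetical data, not from the issue).
set.seed(1)
counts <- matrix(
  rpois(24, lambda = 5),
  nrow = 4,
  dimnames = list(paste0("gene", 1:4), paste0("cell", 1:6))
)

# Hypothetical sample label for each cell (3 cells per sample).
sample_id <- c("s1", "s1", "s1", "s2", "s2", "s2")

# rowsum() sums rows by group, so transpose to put cells in rows,
# sum per sample, then transpose back to genes x samples.
pseudobulk <- t(rowsum(t(counts), group = sample_id))

# Cell-level view of one gene, split by sample, for comparison.
cell_level_gene1 <- split(counts["gene1", ], sample_id)

print(pseudobulk)
print(cell_level_gene1)
```

Boxplots of `cell_level_gene1` show the per-cell spread within each sample, while `pseudobulk` keeps only one summed value per gene and sample, which is why the two views will not look identical.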
Can you replicate the problem I described in my first post when you run the code, or does the problem only occur with me? I have tested it on two different machines; however, they have very similar packages installed.
Hi @LeonHafner, thanks for bringing this to our attention, and apologies for the slow response: I unexpectedly lost access to my post-doc accounts after starting a new position in the private sector. Looking into your examples more closely, there seems to be a bug in the code that causes the reported Condition IDs to not match the conditional group assignments made within the code. This mix-up is only triggered if batches are requested, and I believe the bug was introduced just a few months ago with a separate bug fix relating to batch effect simulations. I will work on a fix and let you know when I have a patch ready. Note that if you plot the cells from your example using PCA, coloring/shaping by batch/condition/sample, you can see that setting
Hi @azodichr,
Hi @LeonHafner, thanks again for bringing this issue to our attention and for your patience. I have fixed the problem with conditional group assignments getting mislabeled and pushed it to @lazappi (see pull request "Issue_158") to be merged into the dev version.
I have merged the PR and the fix should be available on release and devel soon.
Hi all,
thanks for developing splatter/splatPop, it is one of the best tools for scRNA-Seq simulation.
We are currently testing different DE methods for their performance on multiple simulation scenarios.
One scenario consists of 10 samples in a single batch, while a more difficult one distributes those 10 samples equally across two batches. Our problem is that the simulation with batches is always too tough for the methods, regardless of the batch parameters, and thereby produces many false positives.
For debugging I created two equal simulations with 10 samples (500 cells each) and 1,000 genes. The second simulation assigns the samples to two batches, while the first one consists of only one batch. As my batch.facLoc and batch.facScale parameters are both zero (LogNorm(0, 0) = 1 and rlnorm(1, 0, 0) = 1), those batches should not introduce any effect but should lead to the same dataset as the first simulation. This is, however, not the case. After pseudobulking the data and testing for DE with DESeq2, the first scenario (one batch) results in much more confident p-values. This in turn results in many false positives and makes the second simulation much tougher despite the same parameters.
Did I misunderstand something, or where does the additional batch effect come from despite disabled parameters?
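The reasoning about the log-normal can be checked in base R. This is a minimal sketch that only verifies the distribution itself; it makes no claim about how splatPop applies batch.facLoc and batch.facScale internally, and `batch_factors` and `gene_means` are illustrative names rather than splatter objects.

```r
# Draw many factors from a log-normal with meanlog = 0 and sdlog = 0.
# With zero spread, every draw is exp(0) = 1.
set.seed(42)
batch_factors <- rlnorm(1000, meanlog = 0, sdlog = 0)
all_ones <- all(batch_factors == 1)

# Multiplying toy gene means (hypothetical values) by such factors
# leaves them unchanged, i.e. no batch effect is introduced.
gene_means <- c(0.5, 2, 10)
unchanged <- identical(gene_means * batch_factors[1:3], gene_means)

print(all_ones)
print(unchanged)
```

If both checks hold, any remaining difference between the two scenarios has to come from somewhere other than the sampled batch factors.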
Thanks a lot for your help!
Here is the code to simulate and test the two scenarios. You should be able to run it without additional information:
This is the output of DESeq2 for both scenarios sorted by p-value. Scenario 1 results in much more confident p-values:
SessionInfo
> sessionInfo()
R version 4.2.1 (2022-06-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.4 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] tools stats4 stats graphics grDevices utils datasets methods base
other attached packages:
[1] DESeq2_1.36.0 VariantAnnotation_1.42.1 Rsamtools_2.12.0 Biostrings_2.64.1
[5] XVector_0.36.0 splatter_1.20.0 tidyr_1.2.1 dplyr_1.0.10
[9] pheatmap_1.0.12 RColorBrewer_1.1-3 wesanderson_0.3.6 filesstrings_3.2.3
[13] stringr_1.4.1 readxl_1.4.1 data.table_1.14.6 zellkonverter_1.6.5
[17] scater_1.24.0 ggplot2_3.4.0 scuttle_1.6.3 MAST_1.22.0
[21] SingleCellExperiment_1.18.1 SummarizedExperiment_1.26.1 Biobase_2.56.0 GenomicRanges_1.48.0
[25] GenomeInfoDb_1.32.4 IRanges_2.30.1 S4Vectors_0.34.0 BiocGenerics_0.42.0
[29] MatrixGenerics_1.8.1 matrixStats_0.62.0
loaded via a namespace (and not attached):
[1] backports_1.4.1 BiocFileCache_2.4.0 plyr_1.8.8 splines_4.2.1 BiocParallel_1.30.4
[6] digest_0.6.30 viridis_0.6.2 fansi_1.0.3 magrittr_2.0.3 checkmate_2.1.0
[11] memoise_2.0.1 strex_1.4.4 BSgenome_1.64.0 ScaledMatrix_1.4.1 annotate_1.74.0
[16] prettyunits_1.1.1 colorspace_2.0-3 blob_1.2.3 rappdirs_0.3.3 ggrepel_0.9.2
[21] crayon_1.5.2 RCurl_1.98-1.9 jsonlite_1.8.3 genefilter_1.78.0 survival_3.4-0
[26] glue_1.6.2 gtable_0.3.1 zlibbioc_1.42.0 DelayedArray_0.22.0 BiocSingular_1.12.0
[31] abind_1.4-5 scales_1.2.1 DBI_1.1.3 Rcpp_1.0.9 viridisLite_0.4.1
[36] xtable_1.8-4 progress_1.2.2 reticulate_1.26 bit_4.0.5 rsvd_1.0.5
[41] preprocessCore_1.58.0 httr_1.4.4 dir.expiry_1.4.0 ellipsis_0.3.2 pkgconfig_2.0.3
[46] XML_3.99-0.12 farver_2.1.1 dbplyr_2.2.1 locfit_1.5-9.6 utf8_1.2.2
[51] here_1.0.1 tidyselect_1.2.0 labeling_0.4.2 rlang_1.0.6 reshape2_1.4.4
[56] AnnotationDbi_1.58.0 munsell_0.5.0 cellranger_1.1.0 cachem_1.0.6 cli_3.4.1
[61] generics_0.1.3 RSQLite_2.2.18 fastmap_1.1.0 yaml_2.3.6 bit64_4.0.5
[66] purrr_0.3.5 KEGGREST_1.36.3 sparseMatrixStats_1.8.0 xml2_1.3.3 biomaRt_2.52.0
[71] compiler_4.2.1 rstudioapi_0.14 beeswarm_0.4.0 filelock_1.0.2 curl_4.3.3
[76] png_0.1-7 tibble_3.1.8 geneplotter_1.74.0 stringi_1.7.8 basilisk.utils_1.8.0
[81] GenomicFeatures_1.48.4 lattice_0.20-45 Matrix_1.5-3 vctrs_0.5.1 pillar_1.8.1
[86] lifecycle_1.0.3 BiocNeighbors_1.14.0 bitops_1.0-7 irlba_2.3.5.1 rtracklayer_1.56.1
[91] R6_2.5.1 BiocIO_1.6.0 gridExtra_2.3 vipor_0.4.5 codetools_0.2-18
[96] assertthat_0.2.1 rprojroot_2.0.3 rjson_0.2.21 withr_2.5.0 GenomicAlignments_1.32.1
[101] GenomeInfoDbData_1.2.8 parallel_4.2.1 hms_1.1.2 grid_4.2.1 beachmat_2.12.0
[106] basilisk_1.8.1 DelayedMatrixStats_1.18.2 ggbeeswarm_0.6.0 restfulr_0.0.15