-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds 'batch.rmEffect' parameter. #103
Conversation
When the 'batch.rmEffect' is TRUE, the simulation randomly generates the batch effects to preserve the random seed, but then removes the batch effect by over-writing the generated batch effects by 1. This allows creating simulation pairs with and without batch effects.
Note, lines 348-349 in splat-simulate.R. For some reason, I commented out:
I do not remember why I did that. I am guessing it should probably be uncommented? |
Hi @daynefiler This is an interesting idea. Thanks for sharing it. I guess the idea is to use the simulation with the "removed" batch effects as the ground truth when evaluating a batch correction method? I can see how that might be useful for some things. Do you have any examples you can show of what this looks like? |
I'm guessing because the factors are all 1 now so there's no point multiplying them? |
Looking, again, I think I commented it out because the To answer your question, you are completely correct. The idea is to build truth sets for testing batch correction methodologies. Drawing the random batch effects, then overwriting them (I think) ensures the rest of the random effects stay consistent across the two datasets (with & without batch). |
What would you like in terms of an example? I can try to put it together. |
Maybe just some pairs of t-SNEs/UMAPs coloured by batch/group with and
without batch effects to start with? I guess I'm mostly curious what this
looks like in practice.
…On Thu, 16 Jul 2020 at 16:00, Dayne Filer ***@***.***> wrote:
What would you like in terms of an example? I can try to put it together.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#103 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/ABUNCSFKD72RJ3E5MMD5D7LR34BYTANCNFSM4O2NBQNA>
.
|
Sorry I have forgotten about this a bit. Just checking if you are still interested in going ahead with this? |
Hi, sorry this fell off my to-do list. I will send some examples tomorrow afternoon. |
Hi, this illustrates the approach. Importance is using the same parameter object for each simulation, removing the batch multiplier in one.
|
Thanks for that, that's exactly what I wanted. I've had some time to think about this a bit more. You have probably been through this process already but I just wanted to double check. For a while I thought what you were suggesting was equivalent to setting p1 <- newSplatParams(group.prob = c(0.2, 0.3, 0.5), batchCells = c(1e3, 1e3))
s1 <- splatSimulate(p1, method = "groups")
s1 <- logNormCounts(s1)
s1 <- runPCA(s1)
plotPCA(s1, colour_by = "Batch", shape_by = "Group") s2 <- splatSimulate(p1, method = "groups", batch.facLoc = log(1), batch.facScale = 0)
s2 <- logNormCounts(s2)
s2 <- runPCA(s2)
plotPCA(s2, colour_by = "Batch", shape_by = "Group") That looks like it works but if you look for closely at the actual simulated values it's not quite true: DataFrame with 10000 rows and 9 columns
Gene BaseGeneMean OutlierFactor GeneMean BatchFacBatch1
<character> <numeric> <numeric> <numeric> <numeric>
Gene1 Gene1 2.2696734 1.0000 2.2696734 0.987669
Gene2 Gene2 5.3948593 1.0000 5.3948593 1.088644
Gene3 Gene3 0.3470967 61.6692 63.2300864 0.994561
Gene4 Gene4 0.0364211 1.0000 0.0364211 1.179244
Gene5 Gene5 3.4357531 1.0000 3.4357531 0.840552
... ... ... ... ... ...
Gene9996 Gene9996 0.506195 1 0.506195 1.073213
Gene9997 Gene9997 2.010122 1 2.010122 0.890210
Gene9998 Gene9998 2.581748 1 2.581748 1.220922
Gene9999 Gene9999 0.317081 1 0.317081 0.981184
Gene10000 Gene10000 1.217171 1 1.217171 1.306551
BatchFacBatch2 DEFacGroup1 DEFacGroup2 DEFacGroup3
<numeric> <numeric> <numeric> <numeric>
Gene1 1.038262 1.00000 1 1.00000
Gene2 1.006354 1.00000 1 1.00000
Gene3 1.169444 1.00000 1 1.53158
Gene4 0.942942 1.00000 1 1.06636
Gene5 1.327734 1.11651 1 1.00000
... ... ... ... ...
Gene9996 0.908150 1 1 1.000000
Gene9997 1.120104 1 1 1.287344
Gene9998 0.889335 1 1 0.513365
Gene9999 0.972102 1 1 1.000000
Gene10000 1.058190 1 1 1.000000
DataFrame with 10000 rows and 9 columns
Gene BaseGeneMean OutlierFactor GeneMean BatchFacBatch1
<character> <numeric> <numeric> <numeric> <numeric>
Gene1 Gene1 2.2696734 1.0000 2.2696734 1
Gene2 Gene2 5.3948593 1.0000 5.3948593 1
Gene3 Gene3 0.3470967 61.6692 63.2300864 1
Gene4 Gene4 0.0364211 1.0000 0.0364211 1
Gene5 Gene5 3.4357531 1.0000 3.4357531 1
... ... ... ... ... ...
Gene9996 Gene9996 0.506195 1 0.506195 1
Gene9997 Gene9997 2.010122 1 2.010122 1
Gene9998 Gene9998 2.581748 1 2.581748 1
Gene9999 Gene9999 0.317081 1 0.317081 1
Gene10000 Gene10000 1.217171 1 1.217171 1
BatchFacBatch2 DEFacGroup1 DEFacGroup2 DEFacGroup3
<numeric> <numeric> <numeric> <numeric>
Gene1 1 1 1 1
Gene2 1 1 1 1
Gene3 1 1 1 1
Gene4 1 1 1 1
Gene5 1 1 1 1
... ... ... ... ...
Gene9996 1 1.000000 1 1
Gene9997 1 1.000000 1 1
Gene9998 1 1.000000 1 1
Gene9999 1 0.913961 1 1
Gene10000 1 1.000000 1 1 You can see that everything before the # First set of numbers
> set.seed(1)
> rnorm(10)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078 -0.8204684 0.4874291
[8] 0.7383247 0.5757814 -0.3053884
> rlnorm(10)
[1] 4.5348008 1.4767493 0.5372775 0.1091863 3.0800041 0.9560610 0.9839401 2.5698209
[9] 2.2732743 1.8110401
# Changing the parameters for the first random function changes the values from the second
> set.seed(1)
> rnorm(10, mean = 1, sd = 0)
[1] 1 1 1 1 1 1 1 1 1 1
> rlnorm(10)
[1] 0.5344838 1.2015872 0.4336018 4.9297132 1.3902836 0.4402254 1.6281250 2.0924271
[9] 1.7785196 0.7368371
# But if we use the same parameters we can get the first set back
> set.seed(1)
> rnorm(10)
[1] -0.6264538 0.1836433 -0.8356286 1.5952808 0.3295078 -0.8204684 0.4874291
[8] 0.7383247 0.5757814 -0.3053884
> rlnorm(10)
[1] 4.5348008 1.4767493 0.5372775 0.1091863 3.0800041 0.9560610 0.9839401 2.5698209
[9] 2.2732743 1.8110401 I then thought we might be able to come up some kind of The short story is I'm convinced this is the only way to do this 🎉 ! I have some very minor comments on the code but if you can fix those I'll be happy to merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super minor things. If you don't have time let me know and I can probably merge and fix them myself.
If you are feeling keen it would be great to have some tests for things, but only if you are feeling keen 😸.
R/splat-simulate.R
Outdated
means.gene <- rowData(sim)$GeneMean | ||
|
||
for (idx in seq_len(nBatches)) { | ||
batch.facs <- getLNormFactors(nGenes, 1, 0.5, batch.facLoc[idx], | ||
batch.facScale[idx]) | ||
batch.means.gene <- means.gene * batch.facs | ||
if (batch.rmEffect) batch.facs <- rep(1, length(batch.facs)) | ||
# batch.means.gene <- means.gene * batch.facs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please just delete this line? I think this is just a hangover from some old code that isn't used any more.
R/splat-simulate.R
Outdated
means.gene <- rowData(sim)$GeneMean | ||
|
||
for (idx in seq_len(nBatches)) { | ||
batch.facs <- getLNormFactors(nGenes, 1, 0.5, batch.facLoc[idx], | ||
batch.facScale[idx]) | ||
batch.means.gene <- means.gene * batch.facs | ||
if (batch.rmEffect) batch.facs <- rep(1, length(batch.facs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add brackets to this if statement? i.e.
if (...) {
do things
}
Hi, thanks for your great work! I am curious about one case, that is, do I still need to consider batch effect information like batch.loc or batch.scale if I intend to have a dataset with no batch effect using rm.Effect? Thanks. |
Please open an issue if you have question and explain more about what you use case is. |
When the 'batch.rmEffect' is TRUE, the simulation randomly generates the batch effects to preserve the random seed, but then removes the batch effect by over-writing the generated batch effects by 1.
This allows creating simulation pairs with and without batch effects.