Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds 'batch.rmEffect' parameter. #103

Merged
merged 3 commits into from
Oct 25, 2020
Merged

Adds 'batch.rmEffect' parameter. #103

merged 3 commits into from
Oct 25, 2020

Conversation

daynefiler
Copy link
Contributor

When the 'batch.rmEffect' is TRUE, the simulation randomly generates the batch effects to preserve the random seed, but then removes the batch effect by over-writing the generated batch effects by 1.

This allows creating simulation pairs with and without batch effects.

When the 'batch.rmEffect' is TRUE, the simulation
randomly generates the batch effects to preserve
the random seed, but then removes the batch effect
by over-writing the generated batch effects by 1.

This allows creating simulation pairs with and
without batch effects.
@daynefiler
Copy link
Contributor Author

Note, lines 348-349 in splat-simulate.R. For some reason, I commented out:

# batch.means.gene <- means.gene * batch.facs

I do not remember why I did that. I am guessing it should probably be uncommented?

@lazappi
Copy link
Collaborator

lazappi commented Jul 16, 2020

Hi @daynefiler

This is an interesting idea. Thanks for sharing it. I guess the idea is to use the simulation with the "removed" batch effects as the ground truth when evaluating a batch correction method? I can see how that might be useful for some things. Do you have any examples you can show of what this looks like?

@lazappi
Copy link
Collaborator

lazappi commented Jul 16, 2020

Note, lines 348-349 in splat-simulate.R. For some reason, I commented out:

# batch.means.gene <- means.gene * batch.facs

I do not remember why I did that. I am guessing it should probably be uncommented?

I'm guessing because the factors are all 1 now so there's no point multiplying them?

@lazappi lazappi mentioned this pull request Jul 16, 2020
@daynefiler
Copy link
Contributor Author

Looking, again, I think I commented it out because the batch.means.gene parameter doesn't show up anywhere else in the file. Maybe it was a holdover from previous work?

To answer your question, you are completely correct. The idea is to build truth sets for testing batch correction methodologies. Drawing the random batch effects, then overwriting them (I think) ensures the rest of the random effects stay consistent across the two datasets (with & without batch).

@daynefiler
Copy link
Contributor Author

What would you like in terms of an example? I can try to put it together.

@lazappi
Copy link
Collaborator

lazappi commented Jul 17, 2020 via email

@lazappi
Copy link
Collaborator

lazappi commented Sep 28, 2020

Sorry I have forgotten about this a bit. Just checking if you are still interested in going ahead with this?

@daynefiler
Copy link
Contributor Author

Hi, sorry this fell off my to-do list. I will send some examples tomorrow afternoon.

@daynefiler
Copy link
Contributor Author

daynefiler commented Oct 1, 2020

Hi, this illustrates the approach. Importance is using the same parameter object for each simulation, removing the batch multiplier in one.

library(splatter)
library(scater)

## Setup initial parameters
p1 <- newSplatParams()
p1 <- setParam(p1, "group.prob", c(0.2, 0.3, 0.5))
p1 <- setParam(p1, "batchCells", c(1e3, 1e3))

## Perform simulation w/ batch effect
s1 <- splatSimulate(p1, method = "groups")
s1 <- logNormCounts(s1)
s1 <- runPCA(s1)
plotPCA(s1, colour_by = "Batch", shape_by = "Group")

splatterExample-withBatch

## Perform simulation w/o batch effect
p2 <- setParam(p1, "batch.rmEffect", TRUE)
s2 <- splatSimulate(p2, method = "groups")
s2 <- logNormCounts(s2)
s2 <- runPCA(s2)
plotPCA(s2, colour_by = "Batch", shape_by = "Group")

splatterExample-withoutBatch

@lazappi
Copy link
Collaborator

lazappi commented Oct 5, 2020

Thanks for that, that's exactly what I wanted. I've had some time to think about this a bit more. You have probably been through this process already but I just wanted to double check.

For a while I thought what you were suggesting was equivalent to setting batch.facLoc and batch.facScale so that all the batch factors are 1:

p1 <- newSplatParams(group.prob = c(0.2, 0.3, 0.5), batchCells = c(1e3, 1e3))
s1 <- splatSimulate(p1, method = "groups")
s1 <- logNormCounts(s1)
s1 <- runPCA(s1)
plotPCA(s1, colour_by = "Batch", shape_by = "Group")

image

s2 <- splatSimulate(p1, method = "groups", batch.facLoc = log(1), batch.facScale = 0)
s2 <- logNormCounts(s2)
s2 <- runPCA(s2)
plotPCA(s2, colour_by = "Batch", shape_by = "Group")

image

That looks like it works but if you look for closely at the actual simulated values it's not quite true:

DataFrame with 10000 rows and 9 columns
                 Gene BaseGeneMean OutlierFactor   GeneMean BatchFacBatch1
          <character>    <numeric>     <numeric>  <numeric>      <numeric>
Gene1           Gene1    2.2696734        1.0000  2.2696734       0.987669
Gene2           Gene2    5.3948593        1.0000  5.3948593       1.088644
Gene3           Gene3    0.3470967       61.6692 63.2300864       0.994561
Gene4           Gene4    0.0364211        1.0000  0.0364211       1.179244
Gene5           Gene5    3.4357531        1.0000  3.4357531       0.840552
...               ...          ...           ...        ...            ...
Gene9996     Gene9996     0.506195             1   0.506195       1.073213
Gene9997     Gene9997     2.010122             1   2.010122       0.890210
Gene9998     Gene9998     2.581748             1   2.581748       1.220922
Gene9999     Gene9999     0.317081             1   0.317081       0.981184
Gene10000   Gene10000     1.217171             1   1.217171       1.306551
          BatchFacBatch2 DEFacGroup1 DEFacGroup2 DEFacGroup3
               <numeric>   <numeric>   <numeric>   <numeric>
Gene1           1.038262     1.00000           1     1.00000
Gene2           1.006354     1.00000           1     1.00000
Gene3           1.169444     1.00000           1     1.53158
Gene4           0.942942     1.00000           1     1.06636
Gene5           1.327734     1.11651           1     1.00000
...                  ...         ...         ...         ...
Gene9996        0.908150           1           1    1.000000
Gene9997        1.120104           1           1    1.287344
Gene9998        0.889335           1           1    0.513365
Gene9999        0.972102           1           1    1.000000
Gene10000       1.058190           1           1    1.000000

DataFrame with 10000 rows and 9 columns
                 Gene BaseGeneMean OutlierFactor   GeneMean BatchFacBatch1
          <character>    <numeric>     <numeric>  <numeric>      <numeric>
Gene1           Gene1    2.2696734        1.0000  2.2696734              1
Gene2           Gene2    5.3948593        1.0000  5.3948593              1
Gene3           Gene3    0.3470967       61.6692 63.2300864              1
Gene4           Gene4    0.0364211        1.0000  0.0364211              1
Gene5           Gene5    3.4357531        1.0000  3.4357531              1
...               ...          ...           ...        ...            ...
Gene9996     Gene9996     0.506195             1   0.506195              1
Gene9997     Gene9997     2.010122             1   2.010122              1
Gene9998     Gene9998     2.581748             1   2.581748              1
Gene9999     Gene9999     0.317081             1   0.317081              1
Gene10000   Gene10000     1.217171             1   1.217171              1
          BatchFacBatch2 DEFacGroup1 DEFacGroup2 DEFacGroup3
               <numeric>   <numeric>   <numeric>   <numeric>
Gene1                  1           1           1           1
Gene2                  1           1           1           1
Gene3                  1           1           1           1
Gene4                  1           1           1           1
Gene5                  1           1           1           1
...                  ...         ...         ...         ...
Gene9996               1    1.000000           1           1
Gene9997               1    1.000000           1           1
Gene9998               1    1.000000           1           1
Gene9999               1    0.913961           1           1
Gene10000              1    1.000000           1           1

You can see that everything before the BatchFac columns is the same but after that you get differences. I'm not exactly sure why that is but it seems that just changing a parameter is enough to mess with the random seed.

# First set of numbers
> set.seed(1)
> rnorm(10)
 [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078 -0.8204684  0.4874291
 [8]  0.7383247  0.5757814 -0.3053884
> rlnorm(10)
 [1] 4.5348008 1.4767493 0.5372775 0.1091863 3.0800041 0.9560610 0.9839401 2.5698209
 [9] 2.2732743 1.8110401
# Changing the parameters for the first random function changes the values from the second
> set.seed(1)
> rnorm(10, mean = 1, sd = 0)
 [1] 1 1 1 1 1 1 1 1 1 1
> rlnorm(10)
 [1] 0.5344838 1.2015872 0.4336018 4.9297132 1.3902836 0.4402254 1.6281250 2.0924271
 [9] 1.7785196 0.7368371
# But if we use the same parameters we can get the first set back
> set.seed(1)
> rnorm(10)
 [1] -0.6264538  0.1836433 -0.8356286  1.5952808  0.3295078 -0.8204684  0.4874291
 [8]  0.7383247  0.5757814 -0.3053884
> rlnorm(10)
 [1] 4.5348008 1.4767493 0.5372775 0.1091863 3.0800041 0.9560610 0.9839401 2.5698209
 [9] 2.2732743 1.8110401

I then thought we might be able to come up some kind of splatRemoveBatch() function but I think we would run into the same seed issues.

The short story is I'm convinced this is the only way to do this 🎉 ! I have some very minor comments on the code but if you can fix those I'll be happy to merge.

Copy link
Collaborator

@lazappi lazappi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super minor things. If you don't have time let me know and I can probably merge and fix them myself.

If you are feeling keen it would be great to have some tests for things, but only if you are feeling keen 😸.

means.gene <- rowData(sim)$GeneMean

for (idx in seq_len(nBatches)) {
batch.facs <- getLNormFactors(nGenes, 1, 0.5, batch.facLoc[idx],
batch.facScale[idx])
batch.means.gene <- means.gene * batch.facs
if (batch.rmEffect) batch.facs <- rep(1, length(batch.facs))
# batch.means.gene <- means.gene * batch.facs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please just delete this line? I think this is just a hangover from some old code that isn't used any more.

means.gene <- rowData(sim)$GeneMean

for (idx in seq_len(nBatches)) {
batch.facs <- getLNormFactors(nGenes, 1, 0.5, batch.facLoc[idx],
batch.facScale[idx])
batch.means.gene <- means.gene * batch.facs
if (batch.rmEffect) batch.facs <- rep(1, length(batch.facs))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add brackets to this if statement? i.e.

if (...) {
    do things
}

@lazappi lazappi merged commit 09ebd62 into Oshlack:master Oct 25, 2020
@HelloWorldLTY
Copy link

Hi, thanks for your great work!

I am curious about one case, that is, do I still need to consider batch effect information like batch.loc or batch.scale if I intend to have a dataset with no batch effect using rm.Effect? Thanks.

@lazappi
Copy link
Collaborator

lazappi commented Dec 4, 2023

Please open an issue if you have question and explain more about what you use case is.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants