Update ComBat.R #58
base: master
Conversation
ComBat function modification: it now allows the user to adjust a training set and a test set for batch effects separately. This update is useful for machine learning applications.
Thanks for the update!
I’m trying to understand how this differs from our “reference=” option, which uses one batch as the reference to adjust the other batches. I think that option was designed to solve exactly your problem, and it is presented in this paper:
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-018-2263-6
Alternative empirical Bayes models for adjusting for batch effects in genomic studies - BMC Bioinformatics
Thank you for your quick answer! I will try to explain the differences between the reference-batch approach and my update, as the two do not serve exactly the same purposes.
As you say in the paper, the reference batch is useful in scenarios where one batch is of better quality or is considered the “baseline”, so the rest of the data/batches are adjusted to it. In that case, the reference batch should be used as the training data and the rest as the test data. With that approach the independence of the test set is guaranteed.
However, sometimes it is not possible to choose a reference batch. For instance, one may observe a striking batch effect (clearly traced to a systematic origin) that affects a whole dataset, but there is no superior batch, as they are all equally “wrong”. Furthermore, if the final goal of the study is to estimate the performance of a machine learning model on some data, or to compare the performance of several ML methods, it is very important that the training data has a reasonable size, so as not to underestimate the performance. In ML, usually 75-80% of the data is reserved to train the model. Unless the total amount of data is very large, selecting one of the batches at random as the reference would lead to a very small training set. I modified the ComBat code after seeing these two issues arise twice in my research group, so I thought the update could be useful to more people.
In the end, I consider that they are two different ways to manage the available data. If you have 50 samples in 5 batches of equal size, you may use one batch as the training set (for example batch 1, i.e. samples 1 to 10) to fit ComBat. In my approach, the training set will consist of 40 samples (8 samples coming from each batch). Then the ComBat model generated with this data will also be used to adjust the test data (the remaining 2 samples per batch).
Hope that this helps,
Elies
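The per-batch split described above (50 samples, 5 batches, 8 training and 2 test samples per batch) can be sketched in base R. This is only an illustration of the sampling scheme; `batch`, `train_idx`, and `test_idx` are hypothetical names, not part of the PR:

```r
# Toy numbers from the example: 50 samples in 5 batches of 10.
# Reserve 8 samples per batch for training and 2 per batch for testing,
# so that every batch is represented in the training set.
set.seed(1)
batch <- rep(1:5, each = 10)            # batch label for each of the 50 samples
train_idx <- unlist(lapply(1:5, function(b) {
  sample(which(batch == b), 8)          # draw 8 training samples from batch b
}))
test_idx <- setdiff(seq_along(batch), train_idx)

length(train_idx)                       # 40 training samples
length(test_idx)                        # 10 test samples
table(batch[train_idx])                 # 8 samples from each batch
```

Because every batch appears in the training set, the training/test data share all their batches in common, as the modified function requires.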
Yes, this makes sense. Thanks for the update. Reference ComBat assumes that the training set is one batch and the test set is another. I think you are trying to draw a training set from multiple batches, but keep the integrity of your test set (also drawn from multiple batches). It would be very interesting to see how your approach performs compared to just adjusting for batch before splitting off a training/validation set.
It might be that your method actually performs “worse” on the validation set, not because the method is worse, but because adjusting all the data first might lead to overfitting: the prediction numbers look higher when correcting all the data, but your numbers might be closer to what the truth should be.
@jtleek @zhangyuqing
I am currently working with machine learning models for metabolomic data. In that context, the model is first trained using a training set, and then its prediction performance is assessed on an independent test set. Thus, it is key to prevent information leaks from the test set to the training set, including through data-driven pre-processing or normalization parameters. For this reason, I modified the ComBat function so that it now allows the user to adjust a training set and a test set for batch effects separately. First, ComBat is applied to dat as usual (dat is always assumed to be the training set). Then, dat_test (the test set) is adjusted using the coefficients computed from the training data. The training and test data should have at least some batches in common. The output of the ComBat function is now a list with 2 elements: first the adjusted training data and second the adjusted test data. If dat_test is NULL, the output is identical to that of the current version of ComBat.
If you find this modified ComBat function useful, feel free to merge it into the devel version of SVA or to further modify it.
Thank you!
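To illustrate the principle behind the PR (coefficients estimated on the training data are reused to adjust the test data) without depending on the sva package, here is a deliberately simplified base-R sketch. It uses a plain per-gene, per-batch mean shift, which is NOT ComBat's empirical Bayes model, and all object names are hypothetical; it only shows why reusing training-derived coefficients avoids leaking test-set information:

```r
# Toy data: 3 features, 40 training samples (8 per batch) and
# 10 test samples (2 per batch), with known additive batch offsets.
set.seed(2)
genes <- 3; n_train <- 40; n_test <- 10
train_batch <- rep(1:5, each = 8)
test_batch  <- rep(1:5, each = 2)
shift <- c(0, 2, -1, 3, 1)              # true batch offsets (unknown in practice)
train <- matrix(rnorm(genes * n_train), genes) +
         matrix(shift[train_batch], genes, n_train, byrow = TRUE)
test  <- matrix(rnorm(genes * n_test), genes) +
         matrix(shift[test_batch], genes, n_test, byrow = TRUE)

# "Coefficients" (per-gene, per-batch means) come from training data ONLY:
batch_means <- sapply(1:5, function(b) rowMeans(train[, train_batch == b]))
grand_mean  <- rowMeans(train)

# Adjust a sample from batch b by removing its batch mean,
# then restoring the training grand mean.
adjust <- function(x, b) x - batch_means[, b] + grand_mean
train_adj <- train; test_adj <- test
for (j in seq_len(n_train)) train_adj[, j] <- adjust(train[, j], train_batch[j])
for (j in seq_len(n_test))  test_adj[, j]  <- adjust(test[, j],  test_batch[j])
```

After this adjustment the per-batch means of the training data coincide exactly with the training grand mean, while the test data were corrected using only quantities computed from the training set, mirroring the dat / dat_test behavior described in the PR.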