Summary of the code published in "reComBat: Batch effect removal in large-scale, multi-source omics data integration".
All required packages are listed in the provided requirements.txt file. Simply use this file to install all packages via "pip install -r requirements.txt".
We provide all data and code to reproduce Figures 1, 2 and S1-S10 of our recent publication. Simply execute the main script by running
harmonizedDataCreation.py
Here, parameter options referring to the specific batch correction methods, evaluation metrics and output folders are defined. This script comprises three main parts:
- Data loading and metadata preprocessing
- Batch correction
- Evaluation of the batch correction methods
The relevant data associated with this code is provided as a .zip file and needs to be extracted into the 'data' folder. It comprises >1000 microarray gene expression samples extracted from the GEO database in October 2020, as indicated by the relevant GSE and GSM identifiers. All data was preprocessed using RMA normalization.
The data annotation (referred to as "metadata") is categorized to reflect the specific PA strain and culture conditions (temperature, growth medium, culture geometry, antibiotic treatment, growth phase), and each sample is assigned to one of 39 unique metadata subsets (ZeroHops). Only ZeroHops comprising at least two batches (GSEs) of at least two samples (GSMs) are kept.
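The ZeroHop filtering rule can be illustrated with a minimal pandas sketch. It is not the code used in harmonizedDataCreation.py: the function name `filter_zerohops`, the column names "GSM", "GSE" and "ZeroHop", and the toy table are all hypothetical. The sketch also assumes one particular reading of the rule, namely that batches with fewer than two samples are dropped first and a ZeroHop is kept only if at least two batches remain.

```python
import pandas as pd


def filter_zerohops(meta, zerohop_col="ZeroHop", batch_col="GSE",
                    sample_col="GSM", min_batches=2, min_samples=2):
    """Keep ZeroHops with >= min_batches batches of >= min_samples samples.

    Assumed reading of the rule: undersized batches are removed first,
    then ZeroHops with too few remaining batches are removed.
    """
    # Number of samples in each (ZeroHop, batch) pair, aligned to the rows.
    batch_size = meta.groupby([zerohop_col, batch_col])[sample_col].transform("count")
    kept = meta[batch_size >= min_samples]
    # Number of distinct remaining batches in each ZeroHop.
    n_batches = kept.groupby(zerohop_col)[batch_col].transform("nunique")
    return kept[n_batches >= min_batches]


# Hypothetical toy metadata: only ZeroHop "A" has two batches of two samples.
meta = pd.DataFrame({
    "GSM": [f"GSM{i}" for i in range(1, 11)],
    "GSE": ["GSE1", "GSE1", "GSE2", "GSE2", "GSE3",
            "GSE3", "GSE3", "GSE4", "GSE4", "GSE5"],
    "ZeroHop": ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C"],
})
filtered = filter_zerohops(meta)
print(filtered)
```

In this toy table, ZeroHop "B" is dropped because it contains a single batch, and ZeroHop "C" is dropped because its second batch has only one sample.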
We provide code for the following (optional) batch correction methods:
- Uncorrected data
- Standardized data (Z-scoring to mean zero and unit variance was applied)
- Marker gene elimination for each of the ZeroHop Clusters (default top 8 marker genes)
- Principal component elimination for each of the ZeroHops
- reComBat
For each of the relevant methods, overview figures showing t-SNE embeddings of the corrected data colored by all metadata categories are created to provide a visual inspection of the batch correction success.
We provide a range of custom evaluation metrics probing different aspects of a successfully batch-corrected dataframe. These include:
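Two of the simpler baselines listed above, standardization and principal component elimination, can be sketched in a few lines of NumPy. This is an illustrative sketch, not the repository's implementation: the function names `zscore` and `remove_top_pcs` are hypothetical, the data is random, and the sketch operates on a single samples-by-genes matrix rather than per ZeroHop.

```python
import numpy as np


def zscore(X):
    """Scale each column (gene) to mean zero and unit variance."""
    mu = X.mean(axis=0)
    sd = X.std(axis=0)
    sd[sd == 0] = 1.0  # constant genes become zero instead of dividing by zero
    return (X - mu) / sd


def remove_top_pcs(X, n_components=1):
    """Project out the leading principal components of X (samples x genes)."""
    mu = X.mean(axis=0)
    Xc = X - mu
    # Rows of Vt are the principal directions, ordered by explained variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    V = Vt[:n_components]
    # Subtract the part of the centered data lying along those directions.
    return Xc - Xc @ V.T @ V + mu


# Hypothetical toy data: 20 samples, 5 genes, with a crude additive shift
# on the first 10 samples standing in for a batch effect.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
X[:10] += 3.0

Z = zscore(X)
X_pc = remove_top_pcs(X, n_components=1)
```

Removing a principal component discards whatever variation lies along it, batch-related or not, which is one reason the evaluation metrics also check how well the ZeroHop structure is preserved.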
- LDA score
- DRS score
- Cluster purity and Gini impurity
- Minimum Cluster Separation number
- Cluster Cross-distance
- Logistic Regression (or other classifier) classification performance of batch and ZeroHop.
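As an illustration of two of the listed metrics, the sketch below computes Gini impurity and cluster purity using their standard textbook definitions. The exact definitions used in this repository may differ, and the function names `gini_impurity` and `cluster_purity` as well as the toy labels are hypothetical.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum_k p_k^2, where p_k is the fraction of samples with label k."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def cluster_purity(cluster_ids, labels):
    """Fraction of samples that carry the majority label of their cluster."""
    by_cluster = {}
    for cluster, label in zip(cluster_ids, labels):
        by_cluster.setdefault(cluster, []).append(label)
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in by_cluster.values()
    )
    return majority_total / len(labels)


# Hypothetical example: two clusters, labels could be ZeroHops or batches.
cluster_ids = [0, 0, 0, 1, 1, 1]
labels = ["A", "A", "B", "B", "B", "B"]

purity = cluster_purity(cluster_ids, labels)
impurity_cluster0 = gini_impurity(labels[:3])
```

After a successful correction, one would expect clusters to be pure with respect to ZeroHop and mixed with respect to batch, so the same function can be evaluated once with each label set.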
We also provide code to generate and evaluate synthetic data in syntheticDataGeneration.py. Here, the user can define their choice of synthetic data properties and the properties of the imposed batch effects, and then correct these with the set of possible methods outlined above. The obtained results can be compared to the relevant ground truth.
This code is developed and maintained by members of the Machine Learning and Computational Biology Lab of Prof. Dr. Karsten Borgwardt. Michael F. Adamer, Sarah C. Brüningk
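The idea of imposing known batch effects on synthetic data can be sketched as follows. This is not the generator in syntheticDataGeneration.py: the function name `make_synthetic`, all parameter names and the chosen distributions are assumptions made for illustration. The sketch assumes a location-scale model in which each batch adds a gene-wise shift and rescales the noise.

```python
import numpy as np


def make_synthetic(n_per_batch=30, n_genes=50, n_batches=3,
                   n_conditions=2, seed=0):
    """Simulate expression data with additive and multiplicative batch effects.

    Returns the batch-affected data, the batch-free ground truth, and the
    batch and condition label of every sample.
    """
    rng = np.random.default_rng(seed)
    n_samples = n_per_batch * n_batches
    batch = np.repeat(np.arange(n_batches), n_per_batch)
    condition = rng.integers(0, n_conditions, size=n_samples)

    # Biological signal: one mean expression profile per condition.
    cond_means = rng.normal(0.0, 2.0, size=(n_conditions, n_genes))
    noise = rng.normal(0.0, 1.0, size=(n_samples, n_genes))
    truth = cond_means[condition] + noise

    # Batch effects: gene-wise shift (location) and noise scaling (scale).
    shift = rng.normal(0.0, 1.5, size=(n_batches, n_genes))
    scale = rng.uniform(0.5, 2.0, size=(n_batches, n_genes))
    observed = cond_means[condition] + shift[batch] + scale[batch] * noise
    return observed, truth, batch, condition


observed, truth, batch, condition = make_synthetic()

# A corrected matrix can be scored against the ground truth in the same way.
mse_uncorrected = float(np.mean((observed - truth) ** 2))
```

Because the ground truth is returned alongside the batch-affected data, any correction method can be scored by how far its output moves the error below `mse_uncorrected`.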