TL;DR: This work introduces a novel approach that trains Conditional Variational Autoencoders (CVAEs) on feature vectors extracted from large pre-trained vision foundation models. Because foundation models detect and represent complex patterns across diverse domains, the CVAE can faithfully capture the embedding space of a given data distribution and generate (sample) a diverse, privacy-respecting, and potentially unbounded set of synthetic feature vectors.
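The generation step described above can be illustrated with a small NumPy sketch: draw latent codes from a Gaussian prior, append a one-hot class label, and pass the result through a decoder. This is only a sketch under stated assumptions. The function names, the one-hot conditioning, and the toy linear decoder are hypothetical stand-ins; a real decoder would be a trained neural network, and its architecture is not specified here.

```python
import numpy as np


def sample_synthetic_features(decoder, labels, n_classes, latent_dim,
                              variance=1.0, seed=0):
    """Draw one synthetic feature vector per label from a conditional decoder."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    # Latent codes from the prior N(0, variance * I).
    z = rng.normal(scale=np.sqrt(variance), size=(len(labels), latent_dim))
    # Condition on the class by concatenating a one-hot label (an assumption).
    one_hot = np.eye(n_classes)[labels]
    return decoder(np.concatenate([z, one_hot], axis=1))


def make_toy_decoder(in_dim, feature_dim, seed=0):
    """Stand-in for a trained decoder: a fixed random linear map (illustration only)."""
    w = np.random.default_rng(seed).normal(size=(in_dim, feature_dim))
    return lambda h: h @ w


# Usage: 8 latent dimensions + 2 classes feed a decoder that emits 16-d features.
decoder = make_toy_decoder(in_dim=8 + 2, feature_dim=16)
features = sample_synthetic_features(decoder, labels=[0, 1, 1],
                                     n_classes=2, latent_dim=8)
print(features.shape)  # (3, 16)
```

Because sampling only needs the decoder and a label, any number of synthetic vectors can be drawn per class, which is what makes the generated set potentially unbounded.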
Figure 1: Given an image dataset
conda env create -f environment.yaml
conda activate cvae
More info about the medical datasets
MedMNIST & MedIMeta:
mkdir -p assets/data/medmnist
mkdir -p assets/data/medimeta
cd assets/data/medmnist
wget https://zenodo.org/records/10519652/files/breastmnist_224.npz
cd ../medimeta
wget https://zenodo.org/records/7884735/files/organs_axial.zip
wget https://zenodo.org/records/7884735/files/skinl_derm.zip
unzip organs_axial.zip -d .
unzip skinl_derm.zip -d .
cd ../../../
OCTDL: preprocess the dataset with assets/data/octdl/octdl_preprocessing.py.
python create_db.py --dataset [dataset] --backbone [backbone]
The script stores the database under assets/database/[train|val|test].npz
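To check what a generated database contains, the archive can be opened with NumPy. The sketch below lists the arrays stored in an `.npz` file with their shapes and dtypes. It makes no assumption about the array names the script writes; the `summarize_npz` helper is hypothetical and not part of this repository.

```python
import numpy as np


def summarize_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz archive."""
    summary = {}
    # np.load on an .npz returns a lazy NpzFile; the context manager closes it.
    with np.load(path) as archive:
        for name in archive.files:
            array = archive[name]
            summary[name] = (array.shape, str(array.dtype))
    return summary
```

For example, `summarize_npz("assets/database/train.npz")` would return one entry per stored array, assuming the file exists at that path.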
python anonymize.py --dataset [dataset] \
--anonymizer [kSAME|cvae] \
--k [k, set if anonymizer == kSAME] \
--seed [random seed, set if anonymizer == cvae]
The script stores the anonymized database under assets/database/train_[anonymizer_id].npz
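For intuition on the kSAME option, the sketch below shows a generic k-Same-style procedure on embeddings: points are grouped with their nearest neighbours, and every member of a group is replaced by the group mean, so each output vector is shared by at least k records. This is an illustrative assumption about the general technique, not the implementation in `anonymize.py`; the `k_same` function and its grouping details (Euclidean distance, merging the leftover points into the last group) are hypothetical.

```python
import numpy as np


def k_same(embeddings, k):
    """Replace each embedding with the mean of a group of at least k embeddings."""
    x = np.asarray(embeddings, dtype=float)
    if not 1 <= k <= len(x):
        raise ValueError("k must be between 1 and the number of embeddings")
    out = np.empty_like(x)
    remaining = np.arange(len(x))
    while len(remaining) > 0:
        if len(remaining) < 2 * k:
            # Too few points for two full groups: keep them together
            # so that no group ends up smaller than k.
            group = remaining
        else:
            # Group the first remaining point with its k - 1 nearest neighbours.
            dist = np.linalg.norm(x[remaining] - x[remaining[0]], axis=1)
            group = remaining[np.argsort(dist, kind="stable")[:k]]
        out[group] = x[group].mean(axis=0)
        remaining = np.setdiff1d(remaining, group)
    return out
```

Larger values of k average over more records, which trades fidelity to individual embeddings for stronger anonymity.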
python probing.py --dataset [dataset] \
--anonymizer [identity|kSAME|cvae] \
--k [k, set if anonymizer == kSAME] \
--seed [random seed] \
--output_root [where to store output logs]
To train and evaluate on noisy test embeddings, use the following instead:
# for kSAME
python probing_noise.py --dataset [dataset] \
--anonymizer [kSAME] \
--k [k] \
--seed [random seed] \
--sigma [standard deviation of the injected noise] \
--output_root [where to store output logs]
# for CVAE - online data generation
python probing_noise_cvae.py --dataset [dataset] \
--anonymizer [cvae-online] \
--variance [sampling variance of CVAE] \
--seed [random seed] \
--sigma [standard deviation of the injected noise] \
--output_root [where to store output logs]
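The `--sigma` option above is described as the standard deviation of the injected noise. A minimal sketch of such a perturbation, assuming additive, zero-mean, i.i.d. Gaussian noise on every embedding dimension, is shown below. The exact noise model used by the probing scripts is not stated here, and `add_gaussian_noise` is a hypothetical helper.

```python
import numpy as np


def add_gaussian_noise(embeddings, sigma, seed=0):
    """Return a copy of the embeddings with i.i.d. N(0, sigma^2) noise per entry."""
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    rng = np.random.default_rng(seed)
    x = np.asarray(embeddings, dtype=float)
    # Additive isotropic noise (an assumption about the noise model).
    return x + rng.normal(loc=0.0, scale=sigma, size=x.shape)
```

Fixing the seed makes the perturbation reproducible, which matters when comparing anonymizers on the same noisy embeddings.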
If you find this work useful, please consider citing us:
@inproceedings{disalvo2024privacy,
author = {Francesco Di Salvo and David Tafler and Sebastian Doerrich and Christian Ledig},
title = {Privacy-preserving datasets by capturing feature distributions with Conditional VAEs},
booktitle = {35th British Machine Vision Conference 2024, {BMVC} 2024, Glasgow, UK, November 25-28, 2024},
publisher = {BMVA},
year = {2024},
url = {https://papers.bmvc2024.org/0145.pdf}
}