This official repository contains the implementation and dataset setup for the research paper "Navigating Data Heterogeneity in Federated Learning: A Semi-Supervised Federated Object Detection", presented at NeurIPS 2023 [Link], by Taehyeon Kim, Eric Lin, Junu Lee, Christian Lau, and Vaikkunth Mugunthan.
🎤 TL;DR: We introduce a groundbreaking Semi-Supervised Federated Object Detection (SSFOD) framework where the server has labeled data and all clients have only unlabeled data. We believe that our generic framework could be a game-changer for autonomous driving systems and CCTV stuff!
- This codebase is written for
python3
(usedpython 3.7.6
while implementing). - To install necessary python packages, run
pip install -r requirements.txt
.
BDD100K [Link]
- Description: The BDD100K dataset features 100,000 driving videos from various U.S. locations, covering diverse weather conditions.
- Utilization: We selected 20,000 data points from this dataset, focusing on cloudy, rainy, overcast, and snowy conditions.
- Purpose: This setup helps in investigating the impact of data heterogeneity on our framework and assessing its robustness in realistic conditions.
- Download:
bash ./data/download/download_bdd.sh
SODA10M [Link]
- Description: The SODA10M dataset offers a diverse range of geographies, weather conditions, and object categories.
- Utilization: In an IID setup, 20,000 labeled data points are distributed among one server and three clients. For a more realistic setup, these labeled data are kept on the server, while 100,000 unlabeled data points are distributed across the clients.
- Purpose: This configuration allows for performance evaluation under varying weather conditions like clear, overcast, and rainy, demonstrating the resilience and robustness of our approach.
- Download:
bash ./data/download/download_soda10m.sh
Our setup scripts are designed to prepare the datasets and training environment for various configurations of federated learning and semi-supervised learning. Here's a breakdown of what each script does:
setup_bdd_centralized.sh
&setup_soda_centralized.sh
- Description: These scripts configure the environment for centralized training where all data is stored in a single data source.
- Usage: Employ these scripts when you intend to train your model in a fully supervised manner with all available labeled data.
setup_{dataset}_1ds_ssl_{iid/non-iid}.sh
- Description: For semi-supervised learning scenarios, these scripts set up a single data source on the server with both labeled and unlabeled data.
- Variants:
- IID:
setup_bdd_1ds_ssl_iid.sh
&setup_soda_1ds_ssl_iid.sh
for setups where the data distribution is independent and identically distributed (IID) across labeled and unlabeled data. - Non-IID:
setup_bdd_1ds_ssl_noniid.sh
&setup_soda_1ds_ssl_noniid.sh
for setups where there is heterogeneity in weather conditions between labeled and unlabeled data.
- IID:
setup_bdd_{4ds/100ds}_ssfl_noniid.sh
- Description: These scripts are tailored for federated learning with a semi-supervised approach, dealing with non-IID data (Weather Condition Heterogeneity) across multiple data sources.
- Usage: Use these for simulating a federated learning environment with data heterogeneity.
setup_bdd_ssfl_iid_{4/100}datasources.sh
- Description: Scripts specific to the BDD100K dataset that configure SSFL training across multiple IID data sources.
- Usage: Choose based on the number of data sources (4 or 100) you wish to simulate.
setup_bdd_ssl_{iid/non-iid}_20perc.sh
- Description: 20perc means the percentages for the use of labeled datasets. You can freely control this percentage in the bash file.
- Usage: Select IID for homogeneous and Non-IID for heterogeneous weather conditions across labeled data.
Our scripts are named following the convention: setup_{dataset name}_{# of data sources}_{algorithm}_{iid or non-iid}.sh
for clarity and ease of use. For more detailed instructions on how to use each script, please refer to the individual script headers.
- Still in progress. To be uploaded!! 🏃🏻🏃🏻🏃🏻