Fair Entity Matching (Availability and Reproducibility for VLDB 2024)

A fairness suite for auditing Entity Matching approaches

Companion repository for reproducing the results of the paper "Through the Fairness Lens: Experimental Analysis and Evaluation of Entity Matching".

Publication(s) to cite:

[1] Nima Shahbazi, Nikola Danevski, Fatemeh Nargesian, Abolfazl Asudeh, and Divesh Srivastava. "Through the Fairness Lens: Experimental Analysis and Evaluation of Entity Matching." Proceedings of the VLDB Endowment 16, no. 11 (2023): 3279-3292.

[VLDB Publication] https://dl.acm.org/doi/abs/10.14778/3611479.3611525

Requirements:

Python 3.8
300 GB of storage (for the training process)
CUDA-capable machine with NVIDIA GPUs
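
The requirements above can be checked before starting with a short script. This is a minimal sketch, not part of the repository: the function name `check_requirements` is hypothetical, and finding `nvidia-smi` on the PATH is only a heuristic for a usable NVIDIA GPU.

```python
import shutil
import sys

REQUIRED_PYTHON = (3, 8)   # "Python 3.8" from the requirements list
REQUIRED_FREE_GB = 300     # "300 GB of storage" from the requirements list


def check_requirements(path="."):
    """Return a dict mapping each requirement to True or False.

    Hypothetical helper for illustration; the repository does not ship it.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    return {
        "python_3_8": sys.version_info[:2] == REQUIRED_PYTHON,
        "storage_300gb": free_gb >= REQUIRED_FREE_GB,
        # nvidia-smi is installed with the NVIDIA driver. Its presence on
        # PATH suggests, but does not guarantee, that a GPU is usable.
        "nvidia_smi_found": shutil.which("nvidia-smi") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {'ok' if ok else 'MISSING'}")
```

Run it from the directory where the training output will be written, so that the free-space check applies to the right disk.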

Instructions:

We have tried to make the reproduction process as simple as possible. Running the experiments requires a CUDA-capable machine with NVIDIA GPUs. Follow the three steps below:

Step 1: Installation

Clone the repo: git clone https://github.com/UIC-InDeXLab/FairEMRepro.git
Enter the project's main directory: cd FairEMRepro/
Create a virtual environment: python -m venv venv
Activate the virtual environment: source venv/bin/activate
Install required packages: pip install -r requirements.txt
To run Jupyter Notebook on a local machine: jupyter notebook
To run Jupyter Notebook on a server without a browser: jupyter notebook --no-browser

⚠️ Notice 1:

Because the matchers take a long time to run, the repository includes the prediction results from one run. To use these existing predictions and go directly to the analysis, skip Step 2. Otherwise, run bash remove_script.sh and proceed to Step 2.

Step 2: Generating Matching Results

Make sure that Docker is properly installed with non-root user permissions.
Make sure that an NVIDIA GPU is available and that Docker has access to it. See here for more information.
Run the Jupyter notebook train.ipynb to train all the matching models and create the predictions for all datasets.
Note that when the run is over, you must enter your root password in the notebook so that permissions can be changed to the current user.
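
The first prerequisite above (Docker usable by a non-root user) can be tested before launching train.ipynb. The sketch below is an illustration only: the helper name `docker_usable_without_root` is hypothetical, and it assumes that `docker info` exits with a non-zero status when the current user cannot reach the Docker daemon. It does not verify that containers can access the GPU.

```python
import shutil
import subprocess


def docker_usable_without_root(timeout=30):
    """Return True if `docker info` succeeds for the current user.

    Hypothetical helper for illustration; run it as the non-root user
    that will execute the notebook.
    """
    if shutil.which("docker") is None:
        return False  # the docker CLI is not installed or not on PATH
    try:
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


if __name__ == "__main__":
    print("docker usable:", docker_usable_without_root())
```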

⚠️ Notice 2:

Depending on the matcher, the dataset, and the number of epochs, each training task can take anywhere from a few minutes to a few days. Running all tasks for every matcher with the default parameters (epoch=10) took us about a week, with GNEM being the slowest because of its high CUDA memory requirements. For this reason, the repository includes the results of a full run (with 10 epochs) for anyone who wants to skip the training step.

Step 3: Analysis and Visualization of The Results

After Step 2 is over, run the Jupyter notebook experiments.ipynb. The notebook presents the results for Single and Pairwise Fairness, as well as the heatmaps and tables for sensitivity to the matching threshold, using the PPVP and TPRP measures.
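
To make the two measures concrete, here is a minimal sketch of how a group's disparity could be computed at a given matching threshold. It is not the code in measures.py. It assumes that TPRP and PPVP stand for true positive rate parity and positive predictive value parity, and that disparity is the overall rate minus the group's rate. The function names and the toy data are hypothetical.

```python
def rates(labels, scores, threshold):
    """Return (TPR, PPV) after thresholding the match scores."""
    tp = fp = fn = 0
    for y, s in zip(labels, scores):
        predicted_match = s >= threshold
        if predicted_match and y:
            tp += 1
        elif predicted_match and not y:
            fp += 1
        elif not predicted_match and y:
            fn += 1
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    ppv = tp / (tp + fp) if tp + fp else float("nan")
    return tpr, ppv


def disparities(labels, scores, in_group, threshold=0.5):
    """Overall rate minus group rate (assumed definition of disparity)."""
    overall_tpr, overall_ppv = rates(labels, scores, threshold)
    g_labels = [y for y, g in zip(labels, in_group) if g]
    g_scores = [s for s, g in zip(scores, in_group) if g]
    group_tpr, group_ppv = rates(g_labels, g_scores, threshold)
    return {"TPRP": overall_tpr - group_tpr, "PPVP": overall_ppv - group_ppv}


if __name__ == "__main__":
    # Toy data: 1 = true match, 0 = non-match; scores are matcher outputs.
    labels = [1, 1, 0, 0, 1, 0]
    scores = [0.9, 0.4, 0.7, 0.2, 0.8, 0.1]
    in_group = [True, True, True, False, False, False]
    # Sweeping the threshold shows how sensitive the disparities are to it.
    for t in (0.3, 0.5, 0.7):
        print(t, disparities(labels, scores, in_group, threshold=t))
```

A disparity near zero means the group is treated similarly to the overall population under that measure; repeating the computation over several thresholds gives the kind of values a sensitivity heatmap would display.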

Contact

Feel free to contact the authors or open an issue if you run into any problems. We will try to respond as soon as possible.

License

This project is licensed under the MIT License; see the LICENSE.md file for details.

