Dependencies are managed with poetry and nix. Installation steps are given below:

```bash
poetry lock
poetry install
```
A `default.nix` is provided at the root of the project. Build the environment using:

```bash
nix-build
```
Drop into an interactive shell using:

```bash
nix-shell
```
From this shell, you can start your IDE, which should then have access to the packages.
Use case scripts are available to run all steps in one go and see how the whole pipeline executes:

```bash
poetry run python run_use_case_pra.py
```

or, if you're using nix:

```bash
nix-shell --run 'python run_use_case_pra.py'
```
Tests can be run with pytest:

```bash
poetry run pytest <test_feature.py>
```
Anonymous synthetic data can be generated with the avatar solution, provided a license for it is in place:

```bash
poetry run python anonymize_pra.py
```
The `run_many_linkage.py` script performs linkage between two datasets (anonymized or original). It enables analysis of a specific use case under different linkage settings (different distances and algorithms) and generates csv files of the linked data.

```bash
poetry run python run_many_linkage.py
nix-shell --run 'python run_many_linkage.py'
```
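As an illustration, the generated files can be inspected with pandas. The file name below is hypothetical; actual names depend on the dataset, distance, and algorithm selected in `run_many_linkage.py`.

```python
import pandas as pd

# Hypothetical output file name; adjust to the files actually
# produced by run_many_linkage.py for your settings.
linked = pd.read_csv("linked_data_gower_lsa.csv")
print(linked.shape)
print(linked.head())
```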
The `analyze_many_linkage.py` script can be used to analyse the results obtained in the previous step (`run_many_linkage.py`). If the `random` and `row_order` distances have been included in the runs, any other method can be compared to a close-to-ideal linkage (`row_order`) and to a bad linkage (`random`). Make sure the selected settings match those of `run_many_linkage.py`.
```bash
poetry run python analyze_many_linkage.py
nix-shell --run 'python analyze_many_linkage.py'
```
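For instance, a method's results can be positioned between the two baselines. This is only a sketch: the file name, metric, and column names below are assumptions, not the actual output schema of `analyze_many_linkage.py`.

```python
import pandas as pd

# Hypothetical metrics table with one row per run.
metrics = pd.read_csv("linkage_metrics.csv")

baseline_bad = metrics.loc[metrics["distance"] == "random", "accuracy"].mean()
baseline_ideal = metrics.loc[metrics["distance"] == "row_order", "accuracy"].mean()
method = metrics.loc[metrics["distance"] == "gower", "accuracy"].mean()  # hypothetical method name

# Where does the method fall between the random (bad) and
# row_order (close-to-ideal) baselines?
relative = (method - baseline_bad) / (baseline_ideal - baseline_bad)
print(f"Relative performance: {relative:.2%}")
```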
The `run_many_pipelines.py` script can be used to perform linkage between two datasets (anonymized or original) for many scenarios. Scenarios are defined by a dataset (several open source datasets are available), by a set of common variables (sets of different sizes), by the use of original split data or their avatars, and by different linkage settings. The script generates a csv file containing linkage metrics.

The `analyze_many_pipelines.py` script can then be used to generate plots of this metric data.
```bash
# Perform many linkages and compute metrics
poetry run python run_many_pipelines.py
nix-shell --run 'python run_many_pipelines.py'

# Analyze
poetry run python analyze_many_pipelines.py
nix-shell --run 'python analyze_many_pipelines.py'
```
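As a sketch of what the analysis could look like, the metrics csv might be plotted as follows; the file and column names are assumptions, not the actual schema produced by `run_many_pipelines.py`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names; adjust to the actual
# output of run_many_pipelines.py.
metrics = pd.read_csv("pipeline_metrics.csv")

# One line per linkage setting: metric vs. number of shared variables.
for setting, group in metrics.groupby("setting"):
    plt.plot(group["n_common_columns"], group["accuracy"], label=setting)

plt.xlabel("Number of shared variables")
plt.ylabel("Linkage accuracy")
plt.legend()
plt.show()
```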
The experimental scripts can be used to assess linkage on additional datasets. For each dataset, a data loading function needs to be added to `data_loader.py`. This function should return a dictionary with the data (`df`) and the minimum and maximum number of columns that could be shared (`min_number_of_random_column_in_combinations` and `max_number_of_random_column_in_combinations`). Pre-treatment steps can be added to this function, as sketched below.
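A minimal loader sketch, assuming the function is added to `data_loader.py`; the file path, pre-treatment step, and bounds are illustrative, not prescribed by the project:

```python
import pandas as pd

def load_my_new_dataset(number_of_records=None):
    """Load MY_NEW_DATASET and return it in the expected dictionary format."""
    df = pd.read_csv("data/my_new_dataset.csv")  # hypothetical path

    # Optional pre-treatment steps, e.g. dropping direct identifiers.
    df = df.drop(columns=["id"], errors="ignore")

    if number_of_records is not None:
        df = df.head(number_of_records)

    return {
        "df": df,
        "min_number_of_random_column_in_combinations": 2,  # illustrative bound
        "max_number_of_random_column_in_combinations": 5,  # illustrative bound
    }
```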
Once a loader is created, data can be loaded to be avatarized and linked:

```python
from data_loader import Dataset, load_dataset  # assuming both are exposed by data_loader.py

# Number of records to load (useful on large datasets; keep None to load all the data).
number_of_records = 1000

data = load_dataset(Dataset.MY_NEW_DATASET, number_of_records)
df = data['df']
min_number_of_random_column_in_combinations = data['min_number_of_random_column_in_combinations']
max_number_of_random_column_in_combinations = data['max_number_of_random_column_in_combinations']
```
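The two bounds presumably control how many common variables are drawn when building combinations of shared columns; a hedged illustration of that interpretation, continuing from the snippet above:

```python
import random

# Draw one hypothetical set of shared variables whose size falls
# within the bounds returned by the loader.
n_shared = random.randint(
    min_number_of_random_column_in_combinations,
    max_number_of_random_column_in_combinations,
)
shared_columns = random.sample(list(df.columns), n_shared)
print(f"Sharing {n_shared} columns: {shared_columns}")
```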
The documentation is available in `docs/`.