Dependencies are managed with poetry and nix. Installation steps are given below:

```bash
poetry lock
poetry install
```
A `default.nix` is provided at the root of the project. Build the environment using:

```bash
nix-build
```
Drop into an interactive shell using:

```bash
nix-shell
```
From this shell, you can start your IDE, which should then have access to the packages.
Use case scripts are available to run all steps in one go and see how the whole pipeline executes:

```bash
poetry run python run_use_case_pra.py
```

or, if you're using nix:

```bash
nix-shell --run 'python run_use_case_pra.py'
```
Tests can be run with pytest:

```bash
poetry run pytest <test_feature.py>
```
Anonymous synthetic data can be generated with the avatar solution, provided a license for it is in place:

```bash
poetry run python anonymize_pra.py
```
The `run_many_linkage.py` script performs linkage between two datasets (anonymized or original). It enables analysis of a specific use case under different linkage settings (different distances and algorithms) and generates csv files of the linked data.

```bash
poetry run python run_many_linkage.py
nix-shell --run 'python run_many_linkage.py'
```
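As an illustration, the generated files can be inspected with pandas. The file name below is hypothetical; actual names depend on the dataset, distance, and algorithm selected in `run_many_linkage.py`.

```python
import pandas as pd

# Hypothetical output file name; adjust to the files actually
# produced by run_many_linkage.py for your settings.
linked = pd.read_csv("linked_data_gower_lsa.csv")
print(linked.shape)
print(linked.head())
```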
The `analyze_many_linkage.py` script can be used to analyse the results obtained in the previous step (`run_many_linkage.py`). If the `random` and `row_order` distances have been included in the runs, any other method can be compared to a close-to-ideal linkage (`row_order`) and to a bad linkage (`random`). Make sure the selected settings match those of `run_many_linkage.py`.
```bash
poetry run python analyze_many_linkage.py
nix-shell --run 'python analyze_many_linkage.py'
```
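For instance, a method's results can be positioned between the two baselines. This is only a sketch: the file name, metric, and column names below are assumptions, not the actual output schema of `analyze_many_linkage.py`.

```python
import pandas as pd

# Hypothetical metrics table with one row per run.
metrics = pd.read_csv("linkage_metrics.csv")

baseline_bad = metrics.loc[metrics["distance"] == "random", "accuracy"].mean()
baseline_ideal = metrics.loc[metrics["distance"] == "row_order", "accuracy"].mean()
method = metrics.loc[metrics["distance"] == "gower", "accuracy"].mean()  # hypothetical method name

# Where does the method fall between the random (bad) and
# row_order (close-to-ideal) baselines?
relative = (method - baseline_bad) / (baseline_ideal - baseline_bad)
print(f"Relative performance: {relative:.2%}")
```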
The `run_many_pipelines.py` script can be used to perform linkage between two datasets (anonymized or original) for many scenarios. Scenarios are defined by a dataset (several open source datasets are available), by a set of common variables (sets of different sizes), by the use of original split data or their avatars, and by different linkage settings. The script generates a csv file containing linkage metrics.

The `analyze_many_pipelines.py` script can then be used to generate plots of this metric data.
```bash
# Perform many linkages and compute metrics
poetry run python run_many_pipelines.py
nix-shell --run 'python run_many_pipelines.py'

# Analyze
poetry run python analyze_many_pipelines.py
nix-shell --run 'python analyze_many_pipelines.py'
```
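As a sketch of what the analysis could look like, the metrics csv might be plotted as follows; the file and column names are assumptions, not the actual schema produced by `run_many_pipelines.py`.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names; adjust to the actual
# output of run_many_pipelines.py.
metrics = pd.read_csv("pipeline_metrics.csv")

# One line per linkage setting: metric vs. number of shared variables.
for setting, group in metrics.groupby("setting"):
    plt.plot(group["n_common_columns"], group["accuracy"], label=setting)

plt.xlabel("Number of shared variables")
plt.ylabel("Linkage accuracy")
plt.legend()
plt.show()
```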
The experimental scripts can be used to assess linkage on additional datasets. For each dataset, a data loading function needs to be added to `data_loader.py`. This function should return a dictionary with the data (`df`) and the minimum and maximum number of columns that could be shared (`min_number_of_random_column_in_combinations` and `max_number_of_random_column_in_combinations`). Pre-treatment steps can be added to this function, as sketched below.
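A minimal loader sketch, assuming the function is added to `data_loader.py`; the file path, pre-treatment step, and bounds are illustrative, not prescribed by the project:

```python
import pandas as pd

def load_my_new_dataset(number_of_records=None):
    """Load MY_NEW_DATASET and return it in the expected dictionary format."""
    df = pd.read_csv("data/my_new_dataset.csv")  # hypothetical path

    # Optional pre-treatment steps, e.g. dropping direct identifiers.
    df = df.drop(columns=["id"], errors="ignore")

    if number_of_records is not None:
        df = df.head(number_of_records)

    return {
        "df": df,
        "min_number_of_random_column_in_combinations": 2,  # illustrative bound
        "max_number_of_random_column_in_combinations": 5,  # illustrative bound
    }
```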
Once a loader is created, data can be loaded to be avatarized and linked:

```python
from data_loader import Dataset, load_dataset  # assuming both are exposed by data_loader.py

# Number of records to load (useful on large datasets; keep None to load all the data).
number_of_records = 1000

data = load_dataset(Dataset.MY_NEW_DATASET, number_of_records)
df = data['df']
min_number_of_random_column_in_combinations = data['min_number_of_random_column_in_combinations']
max_number_of_random_column_in_combinations = data['max_number_of_random_column_in_combinations']
```
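The two bounds presumably control how many common variables are drawn when building combinations of shared columns; a hedged illustration of that interpretation, continuing from the snippet above:

```python
import random

# Draw one hypothetical set of shared variables whose size falls
# within the bounds returned by the loader.
n_shared = random.randint(
    min_number_of_random_column_in_combinations,
    max_number_of_random_column_in_combinations,
)
shared_columns = random.sample(list(df.columns), n_shared)
print(f"Sharing {n_shared} columns: {shared_columns}")
```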
The documentation is available in `docs/`.