The MDM module helps you deduplicate data in your Aidbox. This repository contains two Python modules:
aidbox: module for communication with Aidbox
splink: fork of splink with support for Aidbox
You need Python 3, Poetry, and Jupyter installed.
Then run:
poetry install
poetry shell
python -m ipykernel install --user --name aidbox --display-name Aidbox
Create an Aidbox connection:
import aidbox
box = aidbox.Aidbox('https://base-url', 'client-id', 'client-secret')
Check the connection:
box.check()
Create an empty MDM model for the resource type you want to deduplicate:
import aidbox.mdm as mdm
model = mdm.Model('ResourceType')
Set up the fields to extract into the MDM table:
model['first_name'] = ['name', 0, 'given', 0]
model['last_name'] = ['name', 0, 'family']
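Each value above reads like a path into the resource JSON: strings select object keys and integers select array positions. The sketch below illustrates that reading on a plain Python dict. The helper `get_in` and the sample patient are hypothetical and are not part of the `aidbox` module, which may resolve paths differently (for example, inside the database).

```python
from typing import Any, Sequence, Union

PathElement = Union[str, int]


def get_in(resource: Any, path: Sequence[PathElement]) -> Any:
    """Walk a nested dict/list structure; return None if any step is missing."""
    current = resource
    for step in path:
        if isinstance(step, int) and isinstance(current, list):
            # Integer steps index into arrays.
            if not 0 <= step < len(current):
                return None
            current = current[step]
        elif isinstance(step, str) and isinstance(current, dict):
            # String steps look up object keys.
            if step not in current:
                return None
            current = current[step]
        else:
            return None
    return current


# Hypothetical FHIR-style Patient resource, used only for illustration.
patient = {
    "resourceType": "Patient",
    "name": [{"given": ["Ada", "Mary"], "family": "Lovelace"}],
}

first_name = get_in(patient, ["name", 0, "given", 0])
last_name = get_in(patient, ["name", 0, "family"])
missing = get_in(patient, ["telecom", 0, "value"])
```

Under this reading, a resource without the requested element simply yields an empty value for that field.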
Enable term frequencies for the fields that need them:
model.enable_frequencies('first_name')
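Term frequencies matter because agreement on a rare value is stronger evidence of a match than agreement on a common one. The sketch below computes relative frequencies for a made-up list of first names using only the standard library. It illustrates the idea and is not how `enable_frequencies` or splink compute them internally.

```python
from collections import Counter

# Made-up sample values for a first_name column.
first_names = ["John", "John", "John", "Zbigniew"]

counts = Counter(first_names)
total = sum(counts.values())

# Relative frequency of each distinct value in the column.
term_frequencies = {name: count / total for name, count in counts.items()}
```

Here two records that both say "Zbigniew" agree on a value shared by a quarter of the sample, while two records that both say "John" agree on a value shared by three quarters of it, so the first agreement is more informative.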
Apply the model to create the MDM table in Aidbox:
model.apply(box)
See the splink documentation to learn how to use splink. This guide covers only the differences needed for data linkage with Aidbox.
Change the id column from unique_id to id:
settings = {
# ...
'unique_id_column_name': 'id',
# ...
}
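For orientation, a slightly fuller settings dictionary might look like the sketch below. The `link_type` and `blocking_rules_to_generate_predictions` keys come from upstream splink 3 settings. That this fork accepts them unchanged is an assumption, and the blocking rule on `last_name` is only an example based on the fields defined earlier. Comparisons are omitted and still need to be added as described in the splink documentation.

```python
settings = {
    # Deduplicate records within a single table (upstream splink 3 key).
    "link_type": "dedupe_only",
    # Aidbox resources are identified by `id` rather than `unique_id`.
    "unique_id_column_name": "id",
    # Example rule: only compare records that share a last name.
    "blocking_rules_to_generate_predictions": [
        "l.last_name = r.last_name",
    ],
}
```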
Create the linker:
linker = PostgresLinker(model, box, settings)
Splink caches intermediate results. To start from scratch (for example, when your data has changed), drop the cached tables:
linker.drop_splink_tables()
Train the model as usual.
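As a rough reminder of what the usual training looks like, the sketch below wraps a typical upstream splink 3 sequence in a helper. The function `train_and_predict` is hypothetical, and it assumes the fork's linker keeps the upstream method names `estimate_u_using_random_sampling`, `estimate_parameters_using_expectation_maximisation`, and `predict`. Check the splink documentation for the version this fork is based on.

```python
def train_and_predict(linker, blocking_rule, threshold=0.9):
    """Run a typical upstream splink 3 training sequence on an existing linker."""
    # Estimate u probabilities from randomly sampled record pairs.
    linker.estimate_u_using_random_sampling(max_pairs=1e6)
    # Estimate m probabilities with EM on pairs that satisfy the blocking rule.
    linker.estimate_parameters_using_expectation_maximisation(blocking_rule)
    # Score candidate pairs and keep those at or above the threshold.
    return linker.predict(threshold_match_probability=threshold)
```

With the linker created above, a call could look like `predictions = train_and_predict(linker, "l.last_name = r.last_name")`, where the blocking rule is again only an example.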
Export the model as a zen-lang EDN file for an Aidbox configuration project:
linker.save_zen_model_edn('filename.edn')