Welcome to the OneFlorida "De-Duper" tool.
This tool generates "Unique Identifiers" (UIDs) used for patient de-duplication (also known as "Entity Resolution" or "Record Linkage").
The current implementation uses two CSV files as input for two separate scripts, as shown in the diagram below.
Note: The hashing process ensures that the "OneFlorida Domain" WILL NOT RECEIVE any data containing PHI.

    +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -+
    | Partner Domain                                                                  |
    |   (CSV file with PHI)                                (CSV file with no PHI)     |
    | +--------------------------+                       +--------------------------+ |
    | | PHI_DATA.csv             | ----> hasher.py ----> | HASHES.csv               | |
    | | patid, first, last,      |                       | patid, F_L_D_S, F_L_D_R  | |
    | | dob, sex, race           |                       |                          | |
    | +--------------------------+                       +--------------------------+ |
    |                                                                 ||              |
    +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -  || - - - - - - -+
    | OneFlorida Domain                                               \/              |
    |                                                    +--------------------------+ |
    |                                                    | OneFlorida SFTP Server   | |
    |                                                    +--------------------------+ |
    |                                                                 ||              |
    |                                                                 ||              |
    |                                                                 \/              |
    |                                                    +--------------------------+ |
    |                                                    | HASHES.csv               | |
    |                                                    | patid, F_L_D_S, F_L_D_R  | |
    |                                                    +--------------------------+ |
    |                                                                 |               |
    |    _____________                                                |               |
    |   /             \                                               |               |
    |  |\_____________/|                                              |               |
    |  |               | <------------- linker.py <-------------------+               |
    |  |  UF Database  |                                                              |
    |  |               |                                                              |
    |   \_____________/                                                               |
    |  (Links between hashes -> UUID's)                                               |
    |                                                                                 |
    |                                                                   O             |
    |  patid, partner_code, linkage_uuid, linkage_hash                 -+-            |
    |  123, UFH, abc..., def...  <--                                    |             |
    |  456, FLM, abc..., def...  <--                                   / \            |
    |  789, FLM, 987..., 012...                                                       |
    |                                                    (generate UID's from hashes) |
    +- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -+

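
The linking step at the bottom of the diagram can be sketched in a few lines of Python. This is only an illustration, not the code in linker.py: the function name `link`, the in-memory dictionary standing in for the UF Database, and the use of a single hash per row are all assumptions made for the example. The output columns follow the table in the diagram (patid, partner_code, linkage_uuid, linkage_hash).

```python
import uuid


def link(rows, known=None):
    """Assign a linkage UUID to each (partner_code, patid, linkage_hash) row.

    `known` maps hash -> UUID hex string. It is a hypothetical stand-in
    for the database table that stores the links.
    """
    known = {} if known is None else known
    linked = []
    for partner_code, patid, linkage_hash in rows:
        if linkage_hash not in known:
            # First time this hash is seen: mint a new random UUID.
            known[linkage_hash] = uuid.uuid4().hex
        # Rows that share a hash share a UUID, even across partners.
        linked.append((patid, partner_code, known[linkage_hash], linkage_hash))
    return linked
```

With this sketch, two rows that carry the same hash (like patients 123 and 456 in the diagram) receive the same `linkage_uuid`, while a row with a different hash receives a new one.
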
Note on PHI: The hasher.py script uses the Python implementation of the SHA-256 algorithm to scramble the PHI so that patients cannot be re-identified from the hashes. SHA-256 is specified by the National Institute of Standards and Technology (NIST) in the Secure Hash Standard (FIPS 180-4).
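
As a rough illustration of the hashing step, the sketch below uses `hashlib.sha256` from the Python standard library. It is not the code in hasher.py: the normalization rules, the `|` separator, the helper names (`normalize`, `sha256_hex`, `hash_row`), and the reading of `F_L_D_S` as first + last + dob + sex and `F_L_D_R` as first + last + dob + race are assumptions based on the column names in the diagram.

```python
import hashlib
import unicodedata


def normalize(value: str) -> str:
    """Lowercase, strip accents, and drop non-alphanumeric characters.

    These rules are an assumption for the example, not the tool's rules.
    """
    decomposed = unicodedata.normalize("NFKD", value.strip().lower())
    return "".join(ch for ch in decomposed if ch.isalnum())


def sha256_hex(*fields: str) -> str:
    """Return the SHA-256 hex digest of the normalized, '|'-joined fields."""
    joined = "|".join(normalize(f) for f in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def hash_row(row: dict) -> dict:
    """Turn one PHI row into a row that holds only the patid and two hashes."""
    return {
        "patid": row["patid"],
        "F_L_D_S": sha256_hex(row["first"], row["last"], row["dob"], row["sex"]),
        "F_L_D_R": sha256_hex(row["first"], row["last"], row["dob"], row["race"]),
    }
```

Normalizing before hashing matters because SHA-256 output changes completely on any difference in input, so `" JOSE "` and `"José"` would otherwise produce unrelated hashes and the two records could never be linked.
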
The two components of the application (hasher and linker) must be configured properly in order to function. For details, refer to docs/installation.md and docs/installation-linker.md.
The format of the input file for the hasher component is described in the input-specs.md document.
- NIST Secure Hash Standard (FIPS 180-4) - https://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.180-4.pdf
- CAPriCORN: Chicago Area Patient-Centered Outcomes Research Network - https://www.ncbi.nlm.nih.gov/pubmed/24821736
- Stanford Entity Resolution Framework (SERF) - http://infolab.stanford.edu/serf/
- "Swoosh: A Generic Approach to Entity Resolution" - http://link.springer.com/article/10.1007%2Fs00778-008-0098-x