Column names and data are incorrectly identified as PII #216

Open
denisbnet opened this issue May 9, 2023 · 1 comment
Comments

@denisbnet

denisbnet commented May 9, 2023

DatumSpacyDetector:

  1. Code (salt) like '0c065d65-883a-4286-8284-9c2668ee7608' identified as Address
  2. Education code like 'HIGHER' or 'MASTER' identified as Address
  3. Employment code like 'FULL_TIME' identified as Person
  4. Source code like 'MANUAL' or 'CAREER SECTION' identified as Person
  5. Salary like '30000' identified as Birth Date
  6. Skills description identified as Address

And so on. Is it possible to fix this?

spaCy version 3.5.2
Platform Linux-5.15.0-70-generic-x86_64-with-glibc2.35
Python version 3.10.6
Pipelines en_core_web_md (3.5.0), en_core_web_sm (3.5.0)

ColumnNameRegexDetector:

  1. Passpord identified as Password

@nicolepng
Contributor

Hi @denisbnet! :)

For the DatumSpacyDetector, we currently use the commonregex-improved Python library to run regex matching and detect the PII types. To fix the problems you raised, we could write different regular expressions or switch to a different method to improve its accuracy. Feel free to open PRs or share suggestions :)
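
One possible direction for the false positives listed above is to screen out values that look like machine-generated codes before any detector sees them. The sketch below is only an illustration and is not part of this project's API: `looks_like_code` and the three patterns are hypothetical names, and the heuristics would need tuning (the all-caps rule would also skip real values such as country codes, and the digits-only rule would skip dates stored as plain numbers).

```python
import re

# Hypothetical pre-filter patterns, not taken from the project.
UUID_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
# Upper-case words joined by '_' or a single space, e.g. FULL_TIME, CAREER SECTION.
ENUM_RE = re.compile(r"[A-Z][A-Z0-9]*(?:[_ ][A-Z0-9]+)*")
# Plain integers or decimals, e.g. 30000 or 1250.50.
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")


def looks_like_code(value: str) -> bool:
    """Return True if the whole value looks like an ID, enum code, or number."""
    text = value.strip()
    return any(
        pattern.fullmatch(text) for pattern in (UUID_RE, ENUM_RE, NUMBER_RE)
    )


samples = [
    "0c065d65-883a-4286-8284-9c2668ee7608",
    "FULL_TIME",
    "30000",
    "John Smith",
]
# Only values that do not look like codes would be passed on to a detector.
to_scan = [s for s in samples if not looks_like_code(s)]
```

With this filter in front, only free-text values such as 'John Smith' would reach the detector, which may reduce the Address and Person misclassifications on code-like columns.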

For the ColumnNameRegexDetector, you can change how columns are matched by updating the regex in scanner.py. If you would like to create a new detector, refer to the documentation in detectors.py.
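
As a rough idea of what a stricter column-name match could look like, the sketch below splits a column name into tokens and only flags it when a whole token is a password keyword, so a name like 'Passpord' is not flagged. This is an assumption about one way to do it, not the regex currently in scanner.py; `PASSWORD_TOKENS`, `column_tokens`, and `is_password_column` are hypothetical names.

```python
import re

# Hypothetical keyword list; extend as needed.
PASSWORD_TOKENS = {"password", "passwd", "pwd"}


def column_tokens(name: str) -> list[str]:
    """Split a column name on camelCase boundaries and non-alphanumerics."""
    # Insert a space between a lower-case letter or digit and an upper-case letter.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", spaced) if tok]


def is_password_column(name: str) -> bool:
    """Flag the column only if one whole token is a password keyword."""
    return any(tok in PASSWORD_TOKENS for tok in column_tokens(name))


flagged = [
    col
    for col in ["user_password", "userPassword", "Passpord", "passport_no"]
    if is_password_column(col)
]
```

Matching whole tokens instead of substrings or prefixes trades some recall (misspelled password columns are missed) for fewer false positives on similar-looking names.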
