
Performance optimizations: bulk anonymisation #44

Open
ghost opened this issue Sep 16, 2021 · 4 comments

@ghost

ghost commented Sep 16, 2021

During testing of bulk anonymisation, there seem to be a few areas where performance can be optimized (although there may be correctness / auditing tradeoffs for some of these).

I'll try to provide some supporting statistics on each of these soon - but as a rough preface, I've been aiming to bring a ~12-hour estimated bulk anonymisation down to less than 3 hours (and ideally reduce it further than that).

Modifications applied so far towards this goal have included:

  • Providing for_bulk=True as an argument to the anonymise method (nb: reduces audit logging)
  • Setting the force=True argument to the anonymise method and flipping the order of the self.is_anonymised() and not force conditionals, so that no DB exists() query is made when force mode is enabled (see the sketch after this list; nb: does this risk introducing incorrect/circular anonymisation?)
  • Optimizing the anonymiser __getattr__ implementation by using dictionary lookups rather than list iteration to retrieve anonymisers (nb: no evidence of improvement here yet)
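
A minimal sketch of the second and third items above (illustrative only; the class structure and names like Anonymiser and AnonymisableMixin are assumptions based on the descriptions, not django-gdpr-assist's actual code):

```python
# Illustrative sketch; class and method names are assumptions based on the
# issue text, not django-gdpr-assist's actual implementation.

class Anonymiser:
    """Registry of per-field anonymisers (third item above)."""

    def __init__(self, field_anonymisers):
        # A dict keyed by field name makes each lookup O(1), replacing
        # the previous O(n) iteration over a list of anonymisers.
        self._anonymisers = dict(field_anonymisers)

    def __getattr__(self, name):
        try:
            return self._anonymisers[name]
        except KeyError:
            raise AttributeError(name)


class AnonymisableMixin:
    """Reordered conditional (second item above)."""

    def anonymise(self, force=False, for_bulk=False):
        # Checking `not force` first means force=True short-circuits the
        # expression, so is_anonymised() -- and its DB exists() query --
        # is never evaluated.
        if not force and self.is_anonymised():
            return
        # ... anonymise fields; for_bulk=True reduces audit logging ...
```
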
@jamesoutterside
Contributor

Hi @jayaddison-collabora. Thanks for the PRs relating to this; I've merged down #43 and #45 and added a note to #46.

Going to leave this issue open for more investigation. I'd be interested to know what kind of numbers you were looking at so we could do some benchmarking. There are probably some more improvements we could make to the management command, depending on the situation.

Thanks
James

@jamesoutterside
Contributor

In relation to this, we've added some small performance improvements in the latest release.

Firstly, records are now added to the log table in bulk; previously only the PrivacyAnonymised objects were bulk-created when the bulk argument was used. The anonymised objects themselves are still saved individually, so any signals outside of gdpr are respected. However, we could look at allowing users to control this via a setting so that gdpr as a whole acts in bulk.
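
As a rough illustration of the difference (a sketch: PrivacyAnonymised is the project's model, but the import path, field name and surrounding loop here are assumptions):

```python
from gdpr_assist.models import PrivacyAnonymised  # import path assumed

# Per-object creation: one INSERT (and one round of signal dispatch)
# for every anonymised record.
for obj in anonymised_objects:
    PrivacyAnonymised.objects.create(anonymised_object=obj)

# Bulked: one INSERT per batch. Django's bulk_create() skips model
# save() and pre_save/post_save signals, which is why the anonymised
# objects themselves are still saved individually, keeping third-party
# signals intact.
PrivacyAnonymised.objects.bulk_create(
    [PrivacyAnonymised(anonymised_object=obj) for obj in anonymised_objects],
    batch_size=500,
)
```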

Secondly, for the purpose of bulk anonymisation we've also added the option to defer/disable the creation of log-table records via GDPR_LOG_ON_ANONYMISE (https://django-gdpr-assist.readthedocs.io/en/latest/installation.html#gdpr-log-on-anonymise-true), giving the user control over when this happens; e.g. the post_anonymise signal could be used to defer the work to Celery, or it could be batched by processing the values in PrivacyAnonymised later.
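
For example, with GDPR_LOG_ON_ANONYMISE = False, something along these lines could push the log writes onto a queue (a sketch: the post_anonymise signal and the setting are documented, but the receiver arguments and the write_anonymisation_log Celery task are assumptions for illustration):

```python
from django.dispatch import receiver
from gdpr_assist.signals import post_anonymise  # import path assumed

from myproject.tasks import write_anonymisation_log  # hypothetical Celery task

# With GDPR_LOG_ON_ANONYMISE = False, no log record is written inline;
# this receiver defers the write to a Celery worker instead.
@receiver(post_anonymise)
def defer_anonymisation_log(sender, instance, **kwargs):
    # Signal payload assumed: sender is the model class, instance the
    # freshly anonymised object.
    write_anonymisation_log.delay(
        model=f"{sender._meta.app_label}.{sender._meta.model_name}",
        pk=instance.pk,
    )
```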

@ghost
Author

ghost commented Feb 7, 2022

Thanks @jamesoutterside - just (belatedly) acknowledging your comments here. I hope to have a bit more of a look at this soon. I did keep a note of some of the anonymisation throughput/benchmark figures when working on the pull requests initially, if I remember correctly, so there may be some data near-ready to provide.

@ghost
Author

ghost commented Mar 4, 2022

Some references here:

  • Analysis / Benchmarking
  • Deployment
