
Performance optimizations: bulk anonymisation #44

Open
ghost opened this issue Sep 16, 2021 · 4 comments

@ghost

ghost commented Sep 16, 2021

During testing of bulk anonymisation, there seem to be a few areas where performance can be optimized (although there may be correctness / auditing tradeoffs for some of these).

I'll try to provide some supporting statistics on each of these soon - but as a rough preface, I've been aiming to bring a ~12-hour estimated bulk anonymisation down to less than 3 hours (and ideally reduce it further than that).

Modifications applied so far towards this goal have included:

  • Providing for_bulk=True as an argument to the anonymise method (nb: reduces audit logging)
  • Setting the force=True argument to the anonymise method and flipping the order of the self.is_anonymised() and not force conditionals, so that no DB exists() query is made when force mode is enabled (see the sketch after this list; nb: does this risk introducing incorrect/circular anonymisation?)
  • Optimizing the anonymiser __getattr__ implementation by using dictionary lookups rather than list iteration to retrieve anonymisers (nb: no evidence of improvement here yet)
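
A minimal sketch of the second and third items above (illustrative only; the class structure and names like Anonymiser and AnonymisableMixin are assumptions based on the descriptions, not django-gdpr-assist's actual code):

```python
# Illustrative sketch; class and method names are assumptions based on the
# issue text, not django-gdpr-assist's actual implementation.

class Anonymiser:
    """Registry of per-field anonymisers (third item above)."""

    def __init__(self, field_anonymisers):
        # A dict keyed by field name makes each lookup O(1), replacing
        # the previous O(n) iteration over a list of anonymisers.
        self._anonymisers = dict(field_anonymisers)

    def __getattr__(self, name):
        try:
            return self._anonymisers[name]
        except KeyError:
            raise AttributeError(name)


class AnonymisableMixin:
    """Reordered conditional (second item above)."""

    def anonymise(self, force=False, for_bulk=False):
        # Checking `not force` first means force=True short-circuits the
        # expression, so is_anonymised() -- and its DB exists() query --
        # is never evaluated.
        if not force and self.is_anonymised():
            return
        # ... anonymise fields; for_bulk=True reduces audit logging ...
```
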
@jamesoutterside
Contributor

Hi @jayaddison-collabora. Thanks for the PRs relating to this; I've merged down #43 and #45 and added a note to #46.

Going to leave this issue open for more investigation. I'd be interested to know what kind of numbers you were looking at so we could do some benchmarking. There are probably some more improvements we could make to the management command, depending on the situation.

Thanks
James

@jamesoutterside
Contributor

In relation to this, we've added some small performance improvements in the latest release.

Firstly, records are now added to the log table in bulk; previously only the PrivacyAnonymised objects were bulk-created when the bulk argument was used. The anonymised objects themselves are still saved individually, so any signals outside of gdpr are respected. However, we could look at allowing users to control this via a setting so that gdpr as a whole acts in bulk.
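
As a rough illustration of the difference (a sketch: PrivacyAnonymised is the project's model, but the import path, field name and surrounding loop here are assumptions):

```python
from gdpr_assist.models import PrivacyAnonymised  # import path assumed

# Per-object creation: one INSERT (and one round of signal dispatch)
# for every anonymised record.
for obj in anonymised_objects:
    PrivacyAnonymised.objects.create(anonymised_object=obj)

# Bulked: one INSERT per batch. Django's bulk_create() skips model
# save() and pre_save/post_save signals, which is why the anonymised
# objects themselves are still saved individually, keeping third-party
# signals intact.
PrivacyAnonymised.objects.bulk_create(
    [PrivacyAnonymised(anonymised_object=obj) for obj in anonymised_objects],
    batch_size=500,
)
```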

Secondly, for the purpose of bulk anonymisation we've also added the option to defer/disable the creation of log-table records via GDPR_LOG_ON_ANONYMISE (https://django-gdpr-assist.readthedocs.io/en/latest/installation.html#gdpr-log-on-anonymise-true), giving the user control over when this happens; e.g. the post_anonymise signal could be used to defer the work to Celery, or it could be batched by processing the values in PrivacyAnonymised later.
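
For example, with GDPR_LOG_ON_ANONYMISE = False, something along these lines could push the log writes onto a queue (a sketch: the post_anonymise signal and the setting are documented, but the receiver arguments and the write_anonymisation_log Celery task are assumptions for illustration):

```python
from django.dispatch import receiver
from gdpr_assist.signals import post_anonymise  # import path assumed

from myproject.tasks import write_anonymisation_log  # hypothetical Celery task

# With GDPR_LOG_ON_ANONYMISE = False, no log record is written inline;
# this receiver defers the write to a Celery worker instead.
@receiver(post_anonymise)
def defer_anonymisation_log(sender, instance, **kwargs):
    # Signal payload assumed: sender is the model class, instance the
    # freshly anonymised object.
    write_anonymisation_log.delay(
        model=f"{sender._meta.app_label}.{sender._meta.model_name}",
        pk=instance.pk,
    )
```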

@ghost
Author

ghost commented Feb 7, 2022

Thanks @jamesoutterside - just (belatedly) acknowledging your comments here. I hope to have a bit more of a look at this soon. I did keep a note of some of the anonymisation throughput/benchmark figures when working on the pull requests initially, if I remember correctly, so there may be some data near-ready to provide.

@ghost
Author

ghost commented Mar 4, 2022

Some references here:

  • Analysis / Benchmarking
  • Deployment
