Performance optimizations: bulk anonymisation #44
Comments
Hi @jayaddison-collabora. Thanks for the PRs relating to this; I've merged down #43 and #45 and added a note to #46. I'm going to leave this issue open for more investigation. I'd be interested to know what kind of numbers you were looking at so we could do some benchmarking. There are probably some more improvements we could make to the management command, depending on the situation. Thanks
In relation to this, we've added some small performance improvements to the latest release. Firstly, the adding of records to the log table is now bulked, previously only the Secondly, for the purpose of bulk anonymisation we've also added the option to defer/disable the records created to the log table via
Thanks @jamesoutterside - just (belatedly) acknowledging your comments here. I hope to have a bit more of a look at this soon. If I remember correctly, I kept a note of some of the anonymisation throughput/benchmark figures when initially working on the pull requests, so there may be some data near-ready to provide.
Some references here: Analysis / Benchmarking
Deployment
During testing of bulk anonymisation, there seem to be a few areas where performance can be optimized (although there may be correctness / auditing tradeoffs for some of these).
I'll try to provide some supporting statistics on each of these soon - but as a rough preface, I've been aiming to bring a ~12-hour estimated bulk anonymisation down to less than 3 hours (and ideally reduce it further than that).
Modifications applied so far towards this goal have included:
- `for_bulk=True` as an argument to the `anonymise` method (nb: reduces audit logging)
- `force=True` argument to the `anonymise` method and flipping the order of the `self.is_anonymised() and not force` conditionals, so that no DB `exists()` query is made when force mode is enabled (nb: does this risk introducing incorrect/circular anonymisation?)
- `__getattr__` implementation by using dictionary lookups rather than list iterations to retrieve anonymisers (nb: no evidence of improvements here, yet)
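To illustrate the last two modifications, here is a minimal, self-contained sketch. It is not the project's actual code: `Record`, `ListRegistry`, `DictRegistry`, and the `exists_queries` counter are hypothetical stand-ins, and `is_anonymised()` only simulates a DB `exists()` query. It shows why flipping the conditional order avoids the query when `force=True`, and what a dictionary-backed `__getattr__` might look like next to a list-scanning one.

```python
class Record:
    """Hypothetical stand-in for a model instance (not the real class)."""

    def __init__(self):
        self.exists_queries = 0  # counts simulated DB exists() calls

    def is_anonymised(self):
        # Stand-in for a DB exists() query; here it just bumps a counter.
        self.exists_queries += 1
        return True

    def should_skip_original(self, force=False):
        # is_anonymised() is evaluated first, so the query always runs.
        return self.is_anonymised() and not force

    def should_skip_flipped(self, force=False):
        # `not force` is evaluated first; with force=True the `and`
        # short-circuits and is_anonymised() is never called.
        return not force and self.is_anonymised()


class ListRegistry:
    """Retrieves anonymisers by scanning a list of (name, fn) pairs: O(n)."""

    def __init__(self, anonymisers):
        self._anonymisers = list(anonymisers)

    def __getattr__(self, name):
        for key, fn in self._anonymisers:
            if key == name:
                return fn
        raise AttributeError(name)


class DictRegistry:
    """Retrieves anonymisers via a dictionary lookup: O(1) on average."""

    def __init__(self, anonymisers):
        self._anonymisers = dict(anonymisers)

    def __getattr__(self, name):
        try:
            return self._anonymisers[name]
        except KeyError:
            raise AttributeError(name) from None


original, flipped = Record(), Record()
original.should_skip_original(force=True)
flipped.should_skip_flipped(force=True)
print(original.exists_queries, flipped.exists_queries)

pairs = [("email", str.lower), ("name", str.upper)]
print(ListRegistry(pairs).name("abc"), DictRegistry(pairs).name("abc"))
```

Both orderings return the same boolean for every combination of inputs; only the number of simulated queries differs. Whether skipping the check is safe (the circular-anonymisation question above) is a separate matter that this sketch does not address.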