fix: improve delete queries for janitor command #2540
Conversation
I think this is looking good! :)
Great job!
cmd/janitor.go
Outdated
@@ -52,6 +52,8 @@ Janitor can be used in several ways.
	RunE: cli.NewHandler().Janitor.RunE,
	Args: cli.NewHandler().Janitor.Args,
}
cmd.Flags().Int(cli.Limit, 10000, "Limits the number of records retrieved from database for deletion.")
So this limit here is 10K instead of 100k as suggested in the comment?
Yes, my mistake. Should I fix the comment or change the default value in the config? What do you think about this?
I think you can adjust the defaults to what has worked for you
Actually, my limit was 1 million records, but the select alone takes an hour and the delete is a long process (though it affects neither database resources nor the service). I thought that a lower default would be more reasonable for common use cases.
Then we stick with the lower limits :)
Well then, we just have to decide between 10k and 100k.
In my database setup, a single 100-record batch delete for consent (the slowest table, I guess because of the many cascading deletes) takes about 1 second. A default limit of 10k would take a total of 100 seconds, while a limit of 100k would take 1000 seconds (i.e. almost 17 minutes). We could stick with one of these (I would take the 100k limit, if I had to choose) or pick a value in the middle.
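For reference, a minimal sketch of the arithmetic behind these figures; the one-second-per-batch cost is just my observation above, not a guarantee:

```go
package main

import (
	"fmt"
	"time"
)

// estimate is a back-of-the-envelope model of the runtimes above, assuming a
// fixed per-batch cost. Real timings vary with schema, indexes, and load.
func estimate(limit, batchSize int, perBatch time.Duration) time.Duration {
	batches := (limit + batchSize - 1) / batchSize // ceil(limit / batchSize)
	return time.Duration(batches) * perBatch
}

func main() {
	fmt.Println(estimate(10000, 100, time.Second))  // 1m40s  (100 batches)
	fmt.Println(estimate(100000, 100, time.Second)) // 16m40s (1000 batches)
}
```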
@aeneasr what is your opinion on this?
Improve delete queries by separating the data extraction from the actual delete. Extraction runs with a configurable limit, using the --limit flag; deletes are then made in batch mode with a configurable batch size, using the --batch-size flag. The default limit is 100000 records and the default batch size is 100 records.
This uses a LEFT JOIN to also select login and consent requests which did not result in a completed authentication, i.e. the user requested login but timed out, or logged in and then timed out at consent.
This splits the extraction of login and consent requests eligible for deletion into two independent SELECTs. This fixes a bug in the single SELECT that caused deletion of consent requests whose matching login requests were eligible for deletion, and vice versa. With independent SELECTs we keep consent requests even if the matching login request gets deleted, and vice versa. A sketch of the resulting query shape follows below.
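A hedged sketch of the split, assuming one SELECT per request type, each LEFT JOINing its own "handled" table so that requests that never completed (right-hand challenge IS NULL) are also selected. Table names and the elided WHERE conditions are illustrative, not the PR's exact query text:

```go
package sql

const (
	// Login requests eligible for deletion, independent of consent state.
	listDoomedLoginChallenges = `
SELECT lr.challenge
FROM hydra_oauth2_authentication_request AS lr
LEFT JOIN hydra_oauth2_authentication_request_handled AS hlr
  ON lr.challenge = hlr.challenge
WHERE hlr.challenge IS NULL -- login requested, never completed
   -- OR ... (rejection/expiry conditions elided here)
LIMIT 100000`

	// Consent requests eligible for deletion, independent of login state.
	listDoomedConsentChallenges = `
SELECT cr.challenge
FROM hydra_oauth2_consent_request AS cr
LEFT JOIN hydra_oauth2_consent_request_handled AS hcr
  ON cr.challenge = hcr.challenge
WHERE hcr.challenge IS NULL -- logged in, timed out at consent
   -- OR ... (rejection/expiry conditions elided here)
LIMIT 100000`
)
```

Because each SELECT only joins against its own handled table, deleting a stale login request no longer drags a still-valid consent request with it, and vice versa.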
Force-pushed 58a8177 to 1c2f4fe
Force-pushed 1c2f4fe to 1b27346
This adds a check in the janitor command handler to ensure that the user is not passing invalid values via the limit and batch-size flags. This also adds tests for these command line arguments. A sketch of such a check is below.
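A hedged sketch of the kind of argument check described here; the actual handler method and error wording in the PR may differ:

```go
package cli

import "fmt"

// validateJanitorFlags rejects non-positive values and a batch size larger
// than the limit. Illustrative only; not the PR's exact implementation.
func validateJanitorFlags(limit, batchSize int) error {
	if limit <= 0 {
		return fmt.Errorf("flag --limit must be a positive integer, got %d", limit)
	}
	if batchSize <= 0 {
		return fmt.Errorf("flag --batch-size must be a positive integer, got %d", batchSize)
	}
	if batchSize > limit {
		return fmt.Errorf("flag --batch-size (%d) must not exceed --limit (%d)", batchSize, limit)
	}
	return nil
}
```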
Force-pushed 2ab5910 to 86bda51
Force-pushed 86bda51 to b5b0791
Great job @flavioleggio ! I have some ideas here :)
persistence/sql/persister_consent.go
Outdated
FROM %[1]s
LEFT JOIN %[2]s ON %[1]s.challenge = %[2]s.challenge
WHERE (
    (%[2]s.challenge IS NULL)
Can `challenge` be null here?
Yes. A LEFT JOIN selects records matching in both tables as well as records only present in the left table; in the latter case, the columns from the right table will be NULL, so that check should mean "give me unhandled requests".
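To make the semantics concrete, a tiny runnable simulation in plain Go; the table contents are invented purely for illustration:

```go
package main

import "fmt"

// Simulates LEFT JOIN semantics for the query above: every row of the left
// table appears in the result; where no right-hand row matches, the
// right-hand column is NULL (modelled here as the literal string "NULL").
func main() {
	requests := []string{"c1", "c2", "c3"} // left table: all login requests
	handled := map[string]bool{"c1": true} // right table: completed requests

	for _, challenge := range requests {
		if handled[challenge] {
			fmt.Printf("(%s, %s)\n", challenge, challenge) // matched row
		} else {
			fmt.Printf("(%s, NULL)\n", challenge) // unmatched: unhandled request
		}
	}
	// Output:
	// (c1, c1)
	// (c2, NULL)
	// (c3, NULL)
	// A WHERE ... IS NULL filter keeps exactly c2 and c3.
}
```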
I see, thank you for the explanation. I'll keep this comment open so that I can re-check the SQL query when I'm back in the office to ensure nothing gets deleted that shouldn't be deleted.
Did you have a chance to check the query?
This fix avoids unexpected behavior if the janitor CLI is executed in parallel. In that case, if we do not order the records to delete, we might incur duplicate deletes.
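A sketch of the idea, reusing the illustrative identifiers from the earlier query sketch rather than the PR's exact SQL:

```go
package sql

// Illustrative only: a deterministic ORDER BY makes concurrent janitor runs
// walk the candidate rows in the same order, instead of each receiving an
// arbitrary, overlapping subset of rows to delete.
const listDoomedLoginChallengesOrdered = `
SELECT lr.challenge
FROM hydra_oauth2_authentication_request AS lr
LEFT JOIN hydra_oauth2_authentication_request_handled AS hlr
  ON lr.challenge = hlr.challenge
WHERE hlr.challenge IS NULL
ORDER BY lr.challenge -- stable ordering for parallel executions
LIMIT 100000`
```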
I think this is looking quite good! Nice job @flavioleggio :)
I just have one request on the tests: maybe there should be a test that sets the limit and the batch size to confirm they actually work, meaning if we have 5 rows in the database and you limit to 2, then 3 rows should still be in the database, assuming the rows were all invalid/expired. Since testing the batch size on increments isn't really possible (e.g. on batch size 1 we get X, on batch size 2 we get XY), I think we should just have one test setting it to 0 and another setting it to an arbitrary number, to verify that 0 does not delete anything and that a number above 0 deletes rows according to the limit (see the sketch below).
Let me know when I should take a final look :)
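A hedged sketch of the suggested tests; `seedExpiredRecords`, `runJanitor`, and `countRecords` are hypothetical stand-ins for the suite's real fixtures:

```go
package cli_test

import "testing"

// TestJanitorHonorsLimit checks the two suggested cases: a zero limit deletes
// nothing, and a positive limit deletes exactly that many expired rows.
func TestJanitorHonorsLimit(t *testing.T) {
	for _, tc := range []struct {
		limit     int
		remaining int
	}{
		{limit: 0, remaining: 5}, // zero must not delete anything
		{limit: 2, remaining: 3}, // 5 expired rows minus 2 deleted
	} {
		seedExpiredRecords(t, 5) // hypothetical fixture: 5 expired/invalid rows
		runJanitor(t, tc.limit)  // hypothetical runner passing --limit
		if got := countRecords(t); got != tc.remaining {
			t.Errorf("limit %d: want %d rows remaining, got %d", tc.limit, tc.remaining, got)
		}
	}
}
```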
While the PR is being worked on I will mark it as a draft. That declutters our review backlog :) Once you're done with your changes and would like someone to review them, mark the PR as ready and request a review from one of the maintainers. Thank you!
Force-pushed c7f79c5 to 9103f42
Looks like there are problems with the dependency check for some transitive dependencies. The command states the following:
I tried with a simple UPDATE
Codecov Report
@@ Coverage Diff @@
## master #2540 +/- ##
==========================================
+ Coverage 52.47% 52.73% +0.26%
==========================================
Files 234 234
Lines 13905 13995 +90
==========================================
+ Hits 7296 7380 +84
- Misses 5986 5989 +3
- Partials 623 626 +3
Continue to review full report at Codecov.
nice job! @flavioleggio
The changes make sense to me and from my side I don't have any major suggestions or change requests :)
As for the dependency issues, I'm not too sure. Maybe @aeneasr knows something about it?
Nice, thank you! I will review it next week together with @Benehiko :)
LGTM! :)
Just have to verify one thing, otherwise this looks good! :)
Congratulations, thank you for the hard work @flavioleggio !!!
@flavioleggio would you be open to investigate the test flake? It's unfortunately plaguing almost 50% of runs with a failure.
I'll take a look this weekend, hopefully. If I can't find an answer on my own I will maybe ask for some help on the Slack chat, but I will be in touch.
Thank you!!
Hello @flavioleggio
Related issue
#2513
@aeneasr @Benehiko
Proposed changes
Improve delete queries by separating the data extraction from the actual delete.
Extraction runs with a configurable limit, using the --limit flag; deletes are then made in batch mode with a configurable batch size, using the --batch-size flag.
I chose default values of 100000 records for the limit and 100 records for the batch size.
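A hedged sketch of wiring up these two flags with the chosen defaults; the flag-name constants are assumptions mirroring the cli.Limit reference in the diff earlier, and the real janitor.go may register them differently:

```go
package cli

import "github.com/spf13/cobra"

// Assumed flag-name constants, for illustration only.
const (
	Limit     = "limit"
	BatchSize = "batch-size"
)

// registerJanitorFlags attaches the two new flags with their defaults.
func registerJanitorFlags(cmd *cobra.Command) {
	cmd.Flags().Int(Limit, 100000, "Limits the number of records retrieved from database for deletion.")
	cmd.Flags().Int(BatchSize, 100, "Sets the size of the batch used in each delete iteration.")
}
```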
Checklist
- I confirm that this pull request does not address a security vulnerability. If this pull request addresses a security vulnerability, I confirm that I got green light (please contact security@ory.sh) from the maintainers to push the changes.
- I have added tests that prove my fix is effective or that my feature works.
Further comments
I chose to remove the transaction in the cleanup for login and consent requests, which I think could be part of the problem with table locking.
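To make that concrete, a hypothetical sketch of batched cleanup where each batch is its own short-lived statement rather than part of one long transaction; the table name and the "?" placeholder style (which is dialect-dependent) are assumptions, and the PR's actual persister code differs:

```go
package cleanup

import (
	"context"
	"database/sql"
	"fmt"
	"strings"
)

// deleteInBatches issues one DELETE per batch, so locks on the login/consent
// tables are held only briefly instead of for the whole cleanup run.
// batchSize is assumed positive (enforced by the flag validation).
func deleteInBatches(ctx context.Context, db *sql.DB, challenges []string, batchSize int) error {
	for start := 0; start < len(challenges); start += batchSize {
		end := start + batchSize
		if end > len(challenges) {
			end = len(challenges)
		}
		batch := challenges[start:end]

		// Build the IN (...) placeholder list for this batch.
		placeholders := make([]string, len(batch))
		args := make([]any, len(batch))
		for i, c := range batch {
			placeholders[i] = "?"
			args[i] = c
		}

		query := fmt.Sprintf(
			"DELETE FROM hydra_oauth2_authentication_request WHERE challenge IN (%s)",
			strings.Join(placeholders, ", "))
		if _, err := db.ExecContext(ctx, query, args...); err != nil {
			return err
		}
	}
	return nil
}
```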