-
Notifications
You must be signed in to change notification settings - Fork 25.7k
[ML] optimize delete expired snapshots #63134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] optimize delete expired snapshots #63134
Conversation
When deleting expired snapshots, we do an individual delete action per snapshot per job. We should instead gather the expired snapshots and delete them in a single call. This commit achieves this and a side-effect is there is less audit log spam on nightly cleanup closes elastic#62875
|
Pinging @elastic/ml-core (:ml) |
droberts195
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I just had one thought about the potential length of the audit message.
| public static final String JOB_AUDIT_SNAPSHOT_STORED = "Job model snapshot with id [{0}] stored"; | ||
| public static final String JOB_AUDIT_REVERTED = "Job model snapshot reverted to ''{0}''"; | ||
| public static final String JOB_AUDIT_SNAPSHOT_DELETED = "Model snapshot [{0}] with description ''{1}'' deleted"; | ||
| public static final String JOB_AUDIT_SNAPSHOTS_DELETED = "Model snapshots {0} with descriptions {1} deleted"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we could just say {0} expired model snapshots deleted (where {0} is how many). We could be deleting 24 per day if they've been changed to be every hour, and that would make the message very verbose. The fact that these are the ones that have expired should be OK for the audit log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@droberts195 diggin' it
| } | ||
| }); | ||
| auditor.info(jobId, msg); | ||
| LOGGER.debug(() -> new ParameterizedMessage("[{}] {}", jobId, msg)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though the audit message can be made less verbose, the debug should continue to list every ID and description, just in case something ever goes wrong with which snapshots are getting deleted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no doubt!
| searchResponses.add( | ||
| AbstractExpiredJobDataRemoverTests.createSearchResponseFromHits(Collections.singletonList(snapshotToBeDeleted))); | ||
|
|
||
| SearchHit snapshot2_1 = createModelSnapshotQueryHit("job-1", "snapshots-1_1", eightDaysAndOneMsAgo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure how this test ever passed?
I think it is meant to test the mixed retention between two jobs, but it appears to give the same job two of the snapshots?
I adjusted it it for the new format so that this other old snapshot is part of job-2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's one of those tests where due to the complexity of the mocking it's very unclear what production code is being tested, and what's really being tested is the test itself.
The idea before was that job two had a 17 day retention period, so the search we were doing directly against the results index (without calling the model snapshots endpoint) would return nothing. This was represented by searchResponses.add(AbstractExpiredJobDataRemoverTests.createSearchResponseFromHits(Collections.emptyList())); in the old code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@droberts195 I will update my mocks so that for job-2 nothing is returned
| client.execute(SearchAction.INSTANCE, searchRequest, new ThreadedActionListener<>(LOGGER, threadPool, | ||
| MachineLearning.UTILITY_THREAD_POOL_NAME, expiredSnapshotsListener(job.getId(), deleteAllBeforeMs, listener), false)); | ||
| ActionListener<QueryPage<ModelSnapshot>> snapshotsListener = expiredSnapshotsListener(job, deleteAllBeforeMs, listener); | ||
| jobResultsProvider.modelSnapshots(job.getId(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was running into async callback hell. But, if we just use the underlying modelSnapshots request from here and then filter out the assigned snapshot later, we reduce the number of calls and it flows more cleanly.
| LOGGER.warn("Model snapshot document [{}] has a null timestamp field", hit.getId()); | ||
| List<ModelSnapshot> snapshots = new ArrayList<>(); | ||
| for (ModelSnapshot snapshot: searchResponse.results()) { | ||
| if (snapshot.getSnapshotId().equals(job.getModelSnapshotId())) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we have to worry about the currently used model snapshot.
dimitris-athanasiou
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM2 Much nicer!
|
run elasticsearch-ci/bwc |
1 similar comment
|
run elasticsearch-ci/bwc |
droberts195
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember from when I put in the thinning out functionality what a nightmare those unit tests are because of the amount of mocking that is used.
Now that the production code has diverged even further from what it was when they were first written I am not convinced that they'd actually find a production problem any more.
I left a couple of comments about why one of them will be mega confusing for future maintainers if left as-is. The other two are also setting up weird responses that don't make sense given the settings, but they were doing this before your changes and they're testing error conditions. Maybe just add a comment to the mock response setup in those two saying the responses are wrong but OK given what's really being tested is the handling of unrelated errors.
The real coverage of this change comes from ModelSnapshotRetentionIT and DeleteExpiredDataIT. One of the tests in the latter is muted, although it was only failing very intermittently: DeleteExpiredDataIT.testDeleteExpiredDataWithStandardThrottle. Please can you try unmuting that and running it on your laptop and confirm it doesn't fail every time with these changes. (If it still passes almost every time then there's no need to fix the intermittent failure as part of this PR - just leave it muted in the committed code.)
| searchResponses.add( | ||
| AbstractExpiredJobDataRemoverTests.createSearchResponseFromHits(Collections.singletonList(snapshotToBeDeleted))); | ||
|
|
||
| SearchHit snapshot2_1 = createModelSnapshotQueryHit("job-1", "snapshots-1_1", eightDaysAndOneMsAgo); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's one of those tests where due to the complexity of the mocking it's very unclear what production code is being tested, and what's really being tested is the test itself.
The idea before was that job two had a 17 day retention period, so the search we were doing directly against the results index (without calling the model snapshots endpoint) would return nothing. This was represented by searchResponses.add(AbstractExpiredJobDataRemoverTests.createSearchResponseFromHits(Collections.emptyList())); in the old code.
| createModelSnapshot("job-1", "active", oneDayAgo), | ||
| createModelSnapshot("job-1", "fresh-snapshot", oneDayAgo), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The model snapshots function shouldn't be returning snapshots from one day ago because the retention period is 7 days and a cutoff was supplied in https://github.com/elastic/elasticsearch/pull/63134/files#diff-496e9b129f2db0ef429636501b6974baR178
I think it's working because the thinning out functionality is ignoring the extra ones. But I don't think the code path being tested here is the code path that would get executed in production with the configured settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
. But I don't think the code path being tested here is the code path that would get executed in production with the configured settings.
I understand. Including it is just a mocking mistake and the actual code path of "return only things before this timestamp" is not exercised. I could capture the query to check.
But I think the proof of this thing working is in the true integration tests (I will unmute the one you mentioned)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will update the mocks
But I do think it is nice to test that active is not removed (since we are not filtering it in the query anylonger). I will make it 8days old for job-1. An admittedly rare case, but a good one to cover.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's good to test active. Good idea to make that old enough that it would have been deleted.
| createModelSnapshot("job-2", "active", oneDayAgo), | ||
| createModelSnapshot("job-2", "snapshots-1_1", eightDaysAndOneMsAgo), | ||
| createModelSnapshot("job-2", "fresh-snapshot", oneDayAgo) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The model snapshots function shouldn't be returning snapshots from one or eight days ago because the retention period is 17 days and a cutoff was supplied in https://github.com/elastic/elasticsearch/pull/63134/files#diff-496e9b129f2db0ef429636501b6974baR178
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, this is a mocking mistake. The change will not change the results of test.
It is nice to know though that even if the query is "wrong" that returns the snapshots, they are still not deleted :)
droberts195
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
When deleting expired snapshots, we do an individual delete action per snapshot per job. We should instead gather the expired snapshots and delete them in a single call. This commit achieves this and a side-effect is there is less audit log spam on nightly cleanup closes elastic#62875 # Conflicts: # x-pack/plugin/ml/qa/native-multi-node-tests/src/javaRestTest/java/org/elasticsearch/xpack/ml/integration/DeleteExpiredDataIT.java # x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/retention/ExpiredModelSnapshotsRemover.java
* [ML] optimize delete expired snapshots (#63134) When deleting expired snapshots, we do an individual delete action per snapshot per job. We should instead gather the expired snapshots and delete them in a single call. This commit achieves this and a side-effect is there is less audit log spam on nightly cleanup closes #62875
When deleting expired snapshots, we do an individual delete action per snapshot per job.
We should instead gather the expired snapshots and delete them in a single call.
This commit achieves this and a side-effect is there is less audit log spam on nightly cleanup
closes #62875