Skip to content

Conversation

@benwtrent
Copy link
Member

When deleting expired snapshots, we do an individual delete action per snapshot per job.

We should instead gather the expired snapshots and delete them in a single call.

This commit achieves this and a side-effect is there is less audit log spam on nightly cleanup

closes #62875

When deleting expired snapshots, we do an individual delete action per snapshot per job.

We should instead gather the expired snapshots and delete them in a single call.

This commit achieves this and a side-effect is there is less audit log spam on nightly cleanup

closes elastic#62875
@elasticmachine
Copy link
Collaborator

Pinging @elastic/ml-core (:ml)

Copy link

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I just had one thought about the potential length of the audit message.

public static final String JOB_AUDIT_SNAPSHOT_STORED = "Job model snapshot with id [{0}] stored";
public static final String JOB_AUDIT_REVERTED = "Job model snapshot reverted to ''{0}''";
public static final String JOB_AUDIT_SNAPSHOT_DELETED = "Model snapshot [{0}] with description ''{1}'' deleted";
public static final String JOB_AUDIT_SNAPSHOTS_DELETED = "Model snapshots {0} with descriptions {1} deleted";

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could just say {0} expired model snapshots deleted (where {0} is how many). We could be deleting 24 per day if they've been changed to be every hour, and that would make the message very verbose. The fact that these are the ones that have expired should be OK for the audit log.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droberts195 diggin' it

}
});
auditor.info(jobId, msg);
LOGGER.debug(() -> new ParameterizedMessage("[{}] {}", jobId, msg));

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even though the audit message can be made less verbose, the debug should continue to list every ID and description, just in case something ever goes wrong with which snapshots are getting deleted.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no doubt!

searchResponses.add(
AbstractExpiredJobDataRemoverTests.createSearchResponseFromHits(Collections.singletonList(snapshotToBeDeleted)));

SearchHit snapshot2_1 = createModelSnapshotQueryHit("job-1", "snapshots-1_1", eightDaysAndOneMsAgo);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure how this test ever passed?

I think it is meant to test the mixed retention between two jobs, but it appears to give the same job two of the snapshots?

I adjusted it it for the new format so that this other old snapshot is part of job-2

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's one of those tests where due to the complexity of the mocking it's very unclear what production code is being tested, and what's really being tested is the test itself.

The idea before was that job two had a 17 day retention period, so the search we were doing directly against the results index (without calling the model snapshots endpoint) would return nothing. This was represented by searchResponses.add(AbstractExpiredJobDataRemoverTests.createSearchResponseFromHits(Collections.emptyList())); in the old code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@droberts195 I will update my mocks so that for job-2 nothing is returned

client.execute(SearchAction.INSTANCE, searchRequest, new ThreadedActionListener<>(LOGGER, threadPool,
MachineLearning.UTILITY_THREAD_POOL_NAME, expiredSnapshotsListener(job.getId(), deleteAllBeforeMs, listener), false));
ActionListener<QueryPage<ModelSnapshot>> snapshotsListener = expiredSnapshotsListener(job, deleteAllBeforeMs, listener);
jobResultsProvider.modelSnapshots(job.getId(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was running into async callback hell. But, if we just use the underlying modelSnapshots request from here and then filter out the assigned snapshot later, we reduce the number of calls and it flows more cleanly.

LOGGER.warn("Model snapshot document [{}] has a null timestamp field", hit.getId());
List<ModelSnapshot> snapshots = new ArrayList<>();
for (ModelSnapshot snapshot: searchResponse.results()) {
if (snapshot.getSnapshotId().equals(job.getModelSnapshotId())) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we have to worry about the currently used model snapshot.

Copy link
Contributor

@dimitris-athanasiou dimitris-athanasiou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM2 Much nicer!

@benwtrent benwtrent requested a review from droberts195 October 2, 2020 12:03
@benwtrent
Copy link
Member Author

run elasticsearch-ci/bwc

1 similar comment
@benwtrent
Copy link
Member Author

run elasticsearch-ci/bwc

Copy link

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember from when I put in the thinning out functionality what a nightmare those unit tests are because of the amount of mocking that is used.

Now that the production code has diverged even further from what it was when they were first written I am not convinced that they'd actually find a production problem any more.

I left a couple of comments about why one of them will be mega confusing for future maintainers if left as-is. The other two are also setting up weird responses that don't make sense given the settings, but they were doing this before your changes and they're testing error conditions. Maybe just add a comment to the mock response setup in those two saying the responses are wrong but OK given what's really being tested is the handling of unrelated errors.

The real coverage of this change comes from ModelSnapshotRetentionIT and DeleteExpiredDataIT. One of the tests in the latter is muted, although it was only failing very intermittently: DeleteExpiredDataIT.testDeleteExpiredDataWithStandardThrottle. Please can you try unmuting that and running it on your laptop and confirm it doesn't fail every time with these changes. (If it still passes almost every time then there's no need to fix the intermittent failure as part of this PR - just leave it muted in the committed code.)

searchResponses.add(
AbstractExpiredJobDataRemoverTests.createSearchResponseFromHits(Collections.singletonList(snapshotToBeDeleted)));

SearchHit snapshot2_1 = createModelSnapshotQueryHit("job-1", "snapshots-1_1", eightDaysAndOneMsAgo);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's one of those tests where due to the complexity of the mocking it's very unclear what production code is being tested, and what's really being tested is the test itself.

The idea before was that job two had a 17 day retention period, so the search we were doing directly against the results index (without calling the model snapshots endpoint) would return nothing. This was represented by searchResponses.add(AbstractExpiredJobDataRemoverTests.createSearchResponseFromHits(Collections.emptyList())); in the old code.

Comment on lines 123 to 124
createModelSnapshot("job-1", "active", oneDayAgo),
createModelSnapshot("job-1", "fresh-snapshot", oneDayAgo),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The model snapshots function shouldn't be returning snapshots from one day ago because the retention period is 7 days and a cutoff was supplied in https://github.com/elastic/elasticsearch/pull/63134/files#diff-496e9b129f2db0ef429636501b6974baR178

I think it's working because the thinning out functionality is ignoring the extra ones. But I don't think the code path being tested here is the code path that would get executed in production with the configured settings.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

. But I don't think the code path being tested here is the code path that would get executed in production with the configured settings.

I understand. Including it is just a mocking mistake and the actual code path of "return only things before this timestamp" is not exercised. I could capture the query to check.

But I think the proof of this thing working is in the true integration tests (I will unmute the one you mentioned)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will update the mocks

But I do think it is nice to test that active is not removed (since we are not filtering it in the query anylonger). I will make it 8days old for job-1. An admittedly rare case, but a good one to cover.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's good to test active. Good idea to make that old enough that it would have been deleted.

Comment on lines 129 to 131
createModelSnapshot("job-2", "active", oneDayAgo),
createModelSnapshot("job-2", "snapshots-1_1", eightDaysAndOneMsAgo),
createModelSnapshot("job-2", "fresh-snapshot", oneDayAgo)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The model snapshots function shouldn't be returning snapshots from one or eight days ago because the retention period is 17 days and a cutoff was supplied in https://github.com/elastic/elasticsearch/pull/63134/files#diff-496e9b129f2db0ef429636501b6974baR178

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, this is a mocking mistake. The change will not change the results of test.

It is nice to know though that even if the query is "wrong" that returns the snapshots, they are still not deleted :)

Copy link

@droberts195 droberts195 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@benwtrent benwtrent merged commit 4cfb22e into elastic:master Oct 2, 2020
@benwtrent benwtrent deleted the feature/ml-fix-delete-log-spam branch October 2, 2020 15:23
benwtrent added a commit to benwtrent/elasticsearch that referenced this pull request Oct 2, 2020
When deleting expired snapshots, we do an individual delete action per snapshot per job.

We should instead gather the expired snapshots and delete them in a single call.

This commit achieves this and a side-effect is there is less audit log spam on nightly cleanup

closes elastic#62875

# Conflicts:
#	x-pack/plugin/ml/qa/native-multi-node-tests/src/javaRestTest/java/org/elasticsearch/xpack/ml/integration/DeleteExpiredDataIT.java
#	x-pack/plugin/ml/src/main/java/org/elasticsearch/xpack/ml/job/retention/ExpiredModelSnapshotsRemover.java
benwtrent added a commit that referenced this pull request Oct 2, 2020
* [ML] optimize delete expired snapshots (#63134)

When deleting expired snapshots, we do an individual delete action per snapshot per job.

We should instead gather the expired snapshots and delete them in a single call.

This commit achieves this and a side-effect is there is less audit log spam on nightly cleanup

closes #62875
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[ML] "Model snapshot [X] deleted" message spams the job messages

5 participants