-
Notifications
You must be signed in to change notification settings - Fork 25.6k
[ML] Fix race condition when creating multiple jobs #40049
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ML] Fix race condition when creating multiple jobs #40049
Conversation
If multiple jobs are created together and the anomaly results index does not exist then some of the jobs could fail to update the mappings of the results index. This lead them to fail to write their results correctly later. This change fixes the problem by updating the mappings of the results index if it is found to exist during a creation attempt.
|
Pinging @elastic/ml-core |
| ImmutableOpenMap<String, MappingMetaData> indexMappings = | ||
| response.getMappings().iterator().next().value; | ||
| MappingMetaData typeMappings = indexMappings.iterator().next().value; | ||
| addTermsAndAliases(typeMappings, indexName, termFields, createAliasListener); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems to me that a race condition around mappings still exists, though it is an equally rare one.
- 3 calls to create the results index
- It gets created with by one of the calls (with those term mappings)
- The other two fail and fall into this clause, pull that single job's mapping, update it with their own fields, and put the whole mapping again
- The first of the two "late" jobs would lose its terms due to the second of the "late" jobs replacing the mapping on with
PutMappingRequest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The first of the two "late" jobs would lose its terms due to the second of the "late" jobs replacing the mapping on with
PutMappingRequest
No, because if you put mappings they always add to the current mappings, they don't replace them. This is how Elasticsearch mappings have always worked and is why we get away with only putting mappings for the new job terms. We don't put our core result field mappings except when the index is initially created and they don't get wiped out.
You are correct there is still a race condition, but it's on the number of fields in the mappings. If you create 3 jobs simultaneously each with 400 new term fields then the one that creates the index will add the first 400. The other two will see that their 400 plus the 400 already in the index is 800, which is OK, so will both try to put mappings. And whichever of these requests arrives second will fail because then the index would have too many field mappings: 1200 plus however many core fields we have. However, this has always been a bug as it could also happen in the code path where the index already exists. The end result of this is that you'd get an unfriendly error message about mapping limits instead of a friendly one telling you you need to use a custom results index for the job, so it's not that bad considering the tiny chance of it happening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@droberts195 I read
// Put the whole mapping, not just the term fields, otherwise we'll wipe the _meta section of the mapping
try (XContentBuilder termFieldsMapping = ElasticsearchMappings.resultsMapping(mappingType, termFields)) {
incorrectly. I keep forgetting that an UPDATE to a mapping is done via a PUT and not a POST...
The number of fields is still a race condition, but only when it comes to failing early, which is acceptable.
|
Jenkins run elasticsearch-ci/packaging-sample |
If multiple jobs are created together and the anomaly results index does not exist then some of the jobs could fail to update the mappings of the results index. This lead them to fail to write their results correctly later. Although this scenario sounds rare, it is exactly what happens if the user creates their first jobs using the Nginx module in the ML UI. This change fixes the problem by updating the mappings of the results index if it is found to exist during a creation attempt. Fixes #38785
If multiple jobs are created together and the anomaly results index does not exist then some of the jobs could fail to update the mappings of the results index. This lead them to fail to write their results correctly later. Although this scenario sounds rare, it is exactly what happens if the user creates their first jobs using the Nginx module in the ML UI. This change fixes the problem by updating the mappings of the results index if it is found to exist during a creation attempt. Fixes #38785
If multiple jobs are created together and the anomaly results index does not exist then some of the jobs could fail to update the mappings of the results index. This lead them to fail to write their results correctly later. Although this scenario sounds rare, it is exactly what happens if the user creates their first jobs using the Nginx module in the ML UI. This change fixes the problem by updating the mappings of the results index if it is found to exist during a creation attempt. Fixes #38785
If multiple jobs are created together and the anomaly
results index does not exist then some of the jobs could
fail to update the mappings of the results index. This
lead them to fail to write their results correctly later.
Although this scenario sounds rare, it is exactly what
happens if the user creates their first jobs using the
Nginx module in the ML UI.
This change fixes the problem by updating the mappings
of the results index if it is found to exist during a
creation attempt.
Fixes #38785