Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't skip shards in coord rewrite if timestamp is an alias #117271

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

javanna
Copy link
Member

@javanna javanna commented Nov 21, 2024

The coordinator rewrite has logic to skip indices if the provided date range filter is not within the min and max range of all of its shards. This mechanism is enabled for event.ingested and @timestamp fields, against searchable snapshots.

We have basic checks that such fields need to be of date field type, yet if they are defined as alias of a date field, their range will be empty, which indicates that the shards are empty, and the coord rewrite logic resolves the alias and ends up skipping shards that may have matching docs.

This commit adds an explicit check that declares the range UNKNOWN instead of EMPTY in these circumstances. The same check is also performed in the coord rewrite logic, so that shards are no longer skipped by mistake.

The coordinator rewrite has logic to skip indices if the provided date range
filter is not within the min and max range of all of its shards. This mechanism
is enabled for event.ingested and @timestamp fields, against searchable snapshots.

We have basic checks that such fields need to be of date field type, yet if they
are defined as alias of a date field, their range will be empty, which indicates
that the shards are empty, and the coord rewrite logic resolves the alias and
ends up skipping shards that may have matching docs.

This commit adds an explicit check that declares the range UNKNOWN instead of EMPTY
in these circumstances. The same check is also performed in the coord rewrite logic,
so that shards are no longer skipped by mistake.
@elasticsearchmachine elasticsearchmachine added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Nov 21, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-search-foundations (Team:Search Foundations)

@elasticsearchmachine
Copy link
Collaborator

Hi @javanna, I've created a changelog YAML for you.

@@ -2274,8 +2274,8 @@ private ShardLongFieldRange determineShardLongFieldRange(String fieldName) {
return ShardLongFieldRange.UNKNOWN; // no mapper service, no idea if the field even exists
}
final MappedFieldType mappedFieldType = mapperService().fieldType(fieldName);
if (mappedFieldType instanceof DateFieldMapper.DateFieldType == false) {
return ShardLongFieldRange.UNKNOWN; // field missing or not a date
if (mappedFieldType instanceof DateFieldMapper.DateFieldType == false || mappedFieldType.name().equals(fieldName) == false) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fix is technically not necessary: the change below on the search side is enough to fix the issue, because we no longer look for the range if timestamp is an alias. Yet it seems wrong to mark the range EMPTY in that case, hence I think it makes sense to correct this side too.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DaveCTurner do you think that this introduces possible incompatibility with previous versions, given that for the same situation, newer indices will have UNKNOWN while older indices will have EMPTY in their range info? I believe this is ok as long as the search side is also fixed to do the same check. But I don't know deeply the update mechanism here: will the range info even get updated for older indices?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once the index metadata contains a range with org.elasticsearch.index.shard.IndexLongFieldRange#isComplete() being true, it shouldn't change again, and it should reach that state after starting all the shards in the index the first time. But I know this area has moved around quite a bit since I last touched it and there was some confusion about EMPTY vs UNKNOWN in the past (e.g. around #106252) whose resolution I didn't follow, so I can't be sure that it does exactly what I think it should do :)

@@ -438,6 +438,7 @@ public String getWriteableName() {
protected MappedFieldType.Relation getRelation(final CoordinatorRewriteContext coordinatorRewriteContext) {
final MappedFieldType fieldType = coordinatorRewriteContext.getFieldType(fieldName);
if (fieldType instanceof final DateFieldMapper.DateFieldType dateFieldType) {
assert fieldName.equals(fieldType.name());
IndexLongFieldRange fieldRange = coordinatorRewriteContext.getFieldRange(fieldName);
if (fieldRange.isComplete() == false || fieldRange == IndexLongFieldRange.EMPTY) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The diff does not include it, but the core of the issue is that we return DISJOINT for an EMPTY range, hence we skip. I wonder if we should rather change this: if we have incomplete data should we rather let the query run instead of skipping and taking the risk of not returning matching data like in this case? On the other hand if EMPTY reliably means empty shard, it makes sense to skip. Not too sure about the isComplete part here.

Also, I cannot think of other situations where timestamp comes back as the right field type and checks all the boxes for coord rewrite, yet perhaps we were not able to store the range in the cluster state, because min and max were not available. In that case we'd end up skipping. It may be safer to not skip instead. It would have saved us from this bug at least.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this the relevant discussion about that issue from a PR I had that was closed not merged:
#110873 (review)
#110873 (comment)
#110873 (comment)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And Armin had the same question here: #111419 (comment)
But I never answered him on the PR. I think I referred him to the above discussion outside the PR. I didn't want to change functionality more than necessary since the changes around event.ingested were so complex and I was worried of subtle bugs.

But now might be a time to raise a GH issue around this and get another discussion going.

Copy link
Contributor

@quux00 quux00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Production code changes look good to me.
I think there is another test to be written though?

if (mappedFieldType instanceof DateFieldMapper.DateFieldType == false) {
return ShardLongFieldRange.UNKNOWN; // field missing or not a date
if (mappedFieldType instanceof DateFieldMapper.DateFieldType == false || mappedFieldType.name().equals(fieldName) == false) {
return ShardLongFieldRange.UNKNOWN; // field missing or not a date or an alias
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code comment is slightly ambiguous "not a date or an alias" could be read to mean "not an alias", so would be clearer to say:

// field is missing, an alias or not a date field

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT but ++ to Michael's wording :)

assertThat(searchResponse.getTotalShards(), equalTo(2));
assertThat(searchResponse.getHits().getTotalHits().value(), equalTo(20L));
});
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm misunderstanding this test, but I would have thought the test you want to write is one where with @timestamp the shards are skipped where as with the alias the shards are not skipped.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe that test can be done in server/src/test/java/org/elasticsearch/action/search/CanMatchPreFilterSearchPhaseTests.java ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think that the tests for the skipping of @timestamp and event.ingested are already written, as that isn't the functionality I am adding. What my change does is fixing that when one of the two is an alias, shards are not incorrectly skipped.

@@ -438,6 +438,7 @@ public String getWriteableName() {
protected MappedFieldType.Relation getRelation(final CoordinatorRewriteContext coordinatorRewriteContext) {
final MappedFieldType fieldType = coordinatorRewriteContext.getFieldType(fieldName);
if (fieldType instanceof final DateFieldMapper.DateFieldType dateFieldType) {
assert fieldName.equals(fieldType.name());
IndexLongFieldRange fieldRange = coordinatorRewriteContext.getFieldRange(fieldName);
if (fieldRange.isComplete() == false || fieldRange == IndexLongFieldRange.EMPTY) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this the relevant discussion about that issue from a PR I had that was closed not merged:
#110873 (review)
#110873 (comment)
#110873 (comment)

@@ -438,6 +438,7 @@ public String getWriteableName() {
protected MappedFieldType.Relation getRelation(final CoordinatorRewriteContext coordinatorRewriteContext) {
final MappedFieldType fieldType = coordinatorRewriteContext.getFieldType(fieldName);
if (fieldType instanceof final DateFieldMapper.DateFieldType dateFieldType) {
assert fieldName.equals(fieldType.name());
IndexLongFieldRange fieldRange = coordinatorRewriteContext.getFieldRange(fieldName);
if (fieldRange.isComplete() == false || fieldRange == IndexLongFieldRange.EMPTY) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And Armin had the same question here: #111419 (comment)
But I never answered him on the PR. I think I referred him to the above discussion outside the PR. I didn't want to change functionality more than necessary since the changes around event.ingested were so complex and I was worried of subtle bugs.

But now might be a time to raise a GH issue around this and get another discussion going.

Copy link
Member

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks Luca :)

I do agree though, we should have a discussion on the rewrite-to-none-if-not-yet-available situation in some form. The current behavior feels incredibly confusing to me. If a searchable snapshot shard is ready and covered by the shard resolution logic in the coordinator we shouldn't have some secondary layer that hacks in skipping, the currently implementation has already cost multiple people a lot more time than reasonable to wrap their heads around.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Search Foundations/Search Catch all for Search Foundations Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch v8.16.2 v8.17.1 v9.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants