[test] Fix IndexShardTests#testScheduledRefresh #110312

Merged
merged 13 commits into from
Jul 16, 2024

Conversation

arteam
Contributor

@arteam arteam commented Jun 30, 2024

After flushing the shard, we only make sure that the refresh call is propagated to the shard engine, but we can't be sure that the call actually results in a shard refresh. The call in InternalEngine#refresh can return false if it couldn't acquire the lock on ElasticsearchDirectoryReader because the reader is already being refreshed.

We can wrap the call in assertBusy to retry it until the shard eventually gets refreshed.
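
To illustrate the idea, here is a minimal, self-contained sketch rather than the code changed in this PR. `FakeEngine`, `refreshPermit`, `maybeRefresh` and `retryUntilPasses` are hypothetical names invented for the example: a semaphore stands in for the reader lock, and `retryUntilPasses` is a simplified stand-in for the test framework's `assertBusy`.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class RefreshRetrySketch {

    /** Hypothetical stand-in for an engine that skips a refresh when one is already running. */
    static class FakeEngine {
        // A single permit models the lock that only one refresher can hold at a time.
        final Semaphore refreshPermit = new Semaphore(1);
        private final AtomicInteger refreshCount = new AtomicInteger();

        /** Returns false without refreshing if the permit is currently taken. */
        boolean maybeRefresh() {
            if (refreshPermit.tryAcquire() == false) {
                return false;
            }
            try {
                refreshCount.incrementAndGet();
                return true;
            } finally {
                refreshPermit.release();
            }
        }

        int refreshCount() {
            return refreshCount.get();
        }
    }

    /** Re-runs the check with a growing sleep until it passes or the timeout expires. */
    static void retryUntilPasses(Runnable check, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        long sleepMillis = 1;
        while (true) {
            try {
                check.run();
                return;
            } catch (AssertionError e) {
                if (System.nanoTime() - deadline >= 0) {
                    throw e; // out of time: surface the last failure
                }
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException ie) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while retrying", ie);
            }
            sleepMillis = Math.min(sleepMillis * 2, 50);
        }
    }

    public static void main(String[] args) {
        FakeEngine engine = new FakeEngine();
        // Simulate a refresh that is already in progress on another thread.
        engine.refreshPermit.acquireUninterruptibly();
        new Thread(() -> {
            try {
                Thread.sleep(20);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            engine.refreshPermit.release();
        }).start();

        // A single call could be skipped here; retrying makes the outcome deterministic.
        retryUntilPasses(() -> {
            if (engine.maybeRefresh() == false) {
                throw new AssertionError("refresh was skipped");
            }
        }, 5_000);
        System.out.println("refreshes performed: " + engine.refreshCount());
    }
}
```

The retry only hides the race if the competing refresh eventually finishes, which is the assumption behind using assertBusy here.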

Resolves #101008

@arteam arteam added :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. >test Issues or PRs that are addressing/adding tests labels Jun 30, 2024
@arteam arteam marked this pull request as ready for review July 1, 2024 06:19
@arteam arteam requested a review from fcofdez July 1, 2024 06:20
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jul 1, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@arteam arteam requested review from kingherc and tlrx July 1, 2024 11:02
@fcofdez
Contributor

fcofdez commented Jul 2, 2024

> The call in InternalEngine#refresh can return false if we couldn't acquire the lock on ElasticsearchDirectoryReader, because it's already being refreshed.

But this test concerns the IndexShard alone, meaning that there shouldn't be any background threads trying to refresh the shard (i.e. scheduled refreshes). Could you elaborate on which thread is calling refresh concurrently?

@arteam arteam requested a review from kingherc July 3, 2024 10:25
@arteam arteam requested review from kingherc and removed request for kingherc July 3, 2024 12:34
logger.info("--> scheduledRefresh(future5)");
ensureNoPendingScheduledRefresh();
Contributor

If there's some spurious refresh being run, we cannot be sure it is at this exact spot.

Maybe a better approach would be to get the number of external refreshes before future5, and to assert that after the scheduled refresh it has been incremented by 1.
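
As a rough sketch of that suggestion (not the code that ended up in the test), the check could be shaped like this. `FakeShard`, `externalRefreshCount`, `scheduledRefresh` and `refreshesCausedBy` are hypothetical names used only for illustration; the real test would read the counter from the shard's refresh stats.

```java
import java.util.concurrent.atomic.AtomicLong;

final class RefreshCountSketch {

    /** Hypothetical shard that counts the external refreshes it performs. */
    static class FakeShard {
        private final AtomicLong externalRefreshes = new AtomicLong();

        long externalRefreshCount() {
            return externalRefreshes.get();
        }

        /** Pretends to run a scheduled refresh and records it in the counter. */
        boolean scheduledRefresh() {
            externalRefreshes.incrementAndGet();
            return true;
        }
    }

    /** Returns how many external refreshes happened while the action ran. */
    static long refreshesCausedBy(FakeShard shard, Runnable action) {
        long before = shard.externalRefreshCount();
        action.run();
        return shard.externalRefreshCount() - before;
    }

    public static void main(String[] args) {
        FakeShard shard = new FakeShard();
        long delta = refreshesCausedBy(shard, shard::scheduledRefresh);
        if (delta != 1) {
            throw new AssertionError("expected exactly one external refresh but saw " + delta);
        }
        System.out.println("external refreshes during the action: " + delta);
    }
}
```

Comparing the counter before and after ties the assertion to the refresh triggered at this step, instead of relying on where a spurious refresh happens to land.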

Contributor Author

@kingherc That's a great idea. I've replaced the hack of blocking the refresh thread pool with a check of the refresh stats.

Contributor

I would like to understand why the test was failing. In the logs of the test failures you can see that the flush is executed after the assertion trips, so I'm not convinced that the flush is the issue here.

@arteam
Contributor Author

arteam commented Jul 9, 2024

@elasticmachine update branch

@arteam arteam requested review from fcofdez and kingherc and removed request for fcofdez July 16, 2024 07:30
@arteam
Contributor Author

arteam commented Jul 16, 2024

@elasticmachine update branch

Contributor

@fcofdez fcofdez left a comment

LGTM

@arteam arteam added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Jul 16, 2024
Contributor

@kingherc kingherc left a comment

LGTM. Please handle the two comments I mentioned before merging.

primary.scheduledRefresh(future5);
assertTrue(future5.actionGet()); // make sure we refresh once the shard is inactive
primary.scheduledRefresh(ActionListener.noop());
// We can't check whether scheduledRefresh returns true because it races with a potential
Contributor

Since @fcofdez approved, I am also fine with the current state, and we can see whether any issues come up in the future with other concurrent refreshes.

I believe this comment may not be up to date now. Since above we assertBusy that the flush has happened, probably the scheduled refresh here will also be true. I'd just remove the comment to avoid confusion.

@@ -3925,11 +3924,16 @@ public void testScheduledRefresh() throws Exception {
logger.info("--> ensure search idle");
assertTrue(primary.isSearchIdle());
assertTrue(primary.searchIdleTime() >= TimeValue.ZERO.millis());
long periodicFlushesBefore = primary.flushStats().getPeriodic();
Contributor

Also, can you change the comment above,
while shard is search active and ensure scheduleRefresh(...) makes documen visible:
? The shard was search idle at that point, which is why the scheduled refresh returns false there.

@elasticsearchmachine elasticsearchmachine merged commit 199910c into main Jul 16, 2024
16 checks passed
@elasticsearchmachine elasticsearchmachine deleted the fix-test-scheduled-refresh branch July 16, 2024 08:25
@arteam
Contributor Author

arteam commented Jul 16, 2024

@kingherc Sorry about the auto merge, I will address your comments in a follow up PR!

arteam added a commit to arteam/elasticsearch that referenced this pull request Jul 16, 2024
* First scheduledRefresh returns false because search is idle
* Remove the comment about the inability to control the result of scheduleRefresh

Follow-up for elastic#110312
arteam added a commit that referenced this pull request Jul 17, 2024
* First scheduledRefresh returns false because search is idle
* Remove the comment about the inability to control the result of scheduleRefresh

Follow-up for #110312
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test Issues or PRs that are addressing/adding tests v8.16.0
Development

Successfully merging this pull request may close these issues.

[CI] IndexShardTests testScheduledRefresh failing
5 participants