Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix SearchableSnapshotsPersistentCacheIntegTests.testCacheSurviveRestart #66578

Closed
wants to merge 1 commit into from

Conversation

tlrx
Copy link
Member

@tlrx tlrx commented Dec 18, 2020

Previous attempt in #66354 wasn't enough to fix this test. The eviction of cache files is executed using a loop in the generic thread pool so it's possible that

assertTrue(internalCluster().getInstance(CacheService.class, dataNode).getPersistentCache().hasDeletions());

succeed but not all cache files are evicted yet, and the last assertion of the test failed.

Closes #66278

@tlrx tlrx added >test Issues or PRs that are addressing/adding tests :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs v8.0.0 v7.12.0 v7.11.1 labels Dec 18, 2020
@elasticmachine elasticmachine added the Team:Distributed Meta label for distributed team (obsolete) label Dec 18, 2020
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

Copy link
Member

@original-brownbear original-brownbear left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks like it will likely fix/improve the issue but I wonder if we shouldn't fix this in production code instead. Maybe my understanding of this is wrong, but when trying to reproduce the issue it turned out that:

    private void runIfShardMarkedAsEvictedInCache(ShardEviction shardEviction, Runnable runnable) {
        try (Releasable ignored = shardsEvictionLock.acquire(shardEviction)) {
            boolean success = false;
            try {
                if (evictedShards.remove(shardEviction)) {
                    runnable.run();
                }
                success = true;
            } finally {
                assert success : "shard eviction should be successful: " + shardEviction;
                if (success == false) {
                    final boolean added = evictedShards.add(shardEviction);
                    assert added : shardEviction;
                }
            }
        }
    }

trips the assertion for success, leading to indefinitely waiting on a listener (it trips because the CacheService is already shut down). If we ensure that the evictions go through before we kill the CacheService we fix the test, but the assertion that's causing all of this still doesn't hold.
Shouldn't we either drop that assertion (or maybe better but more involved yet fix the code to not trip the assertion by fixing the shutdown of CacheService) ?

@tlrx
Copy link
Member Author

tlrx commented Dec 18, 2020

@original-brownbear Agreed. Henning already suggested to improve runIfShardMarkedAsEvictedInCache and I have ideas for that. I'll open a PR.

@tlrx
Copy link
Member Author

tlrx commented Jan 7, 2021

I've opened #67160 which should fix the production code.

@tlrx tlrx closed this Jan 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs Team:Distributed Meta label for distributed team (obsolete) >test Issues or PRs that are addressing/adding tests v7.11.1 v7.12.0 v8.0.0-alpha1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SearchableSnapshotsPersistentCacheIntegTests. testCacheSurviveRestart failure
4 participants