Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SOLR-16701: Race condition on PRS enabled collection deletion #1460

Merged
merged 18 commits into from
Sep 23, 2023

Conversation

patsonluk
Copy link
Contributor

@patsonluk patsonluk commented Mar 14, 2023

https://issues.apache.org/jira/browse/SOLR-16701

Description

This fixes a race condition on PRS enabled collection deletion, which triggers the exception:

org.apache.solr.common.SolrException: Error fetching per-replica states
    at __randomizedtesting.SeedInfo.seed([C2BFFBF8FE49C1E1:F1C8D9E308D2745]:0)
    at app//org.apache.solr.common.cloud.PerReplicaStatesFetcher.fetch(PerReplicaStatesFetcher.java:49)
    at app//org.apache.solr.common.cloud.PerReplicaStatesFetcher$LazyPrsSupplier.lambda$new$0(PerReplicaStatesFetcher.java:62)
    at app//org.apache.solr.common.cloud.DocCollection$PrsSupplier.get(DocCollection.java:515)
    at app//org.apache.solr.common.cloud.Replica.isLeader(Replica.java:314)
    at app//org.apache.solr.common.cloud.Slice.findLeader(Slice.java:242)
    at app//org.apache.solr.common.cloud.Slice.setPrsSupplier(Slice.java:56)
    at app//org.apache.solr.common.cloud.DocCollection.<init>(DocCollection.java:123)
    at app//org.apache.solr.common.cloud.ClusterState.collectionFromObjects(ClusterState.java:305)
    at app//org.apache.solr.common.cloud.ClusterState.createFromCollectionMap(ClusterState.java:254)
    at app//org.apache.solr.client.solrj.impl.ZkClientClusterStateProvider.createFromJsonSupportingLegacyConfigName(ZkClientClusterStateProvider.java:117)
    at app//org.apache.solr.common.cloud.ZkStateReader.fetchCollectionState(ZkStateReader.java:1695)

This could be triggered by:

  1. fetchCollectionState is called, and the state.json is fetched
  2. But before the fetchCollectionState fetches the PRS entries, the collection state.json/PRS are deleted by someone else
  3. fetchCollectionState would throw below exception when it reaches the PRS fetching logic as the Zk node state.json is no longer around

Solution

Create a specific exception PrsZkNodeNotFoundException (that extends SolrException) when the PRS entries cannot be fetched. Then in ZkStateReader#fetchCollectionState, we have added a new catch clause for this exception, we will use the similar check as the handling for NoNodeException, but in this case if exists is null, then it's the expected race condition (prs entries deleted after state.json is fetched), and fetchCollectionState should just return null (collection gone), otherwise it will rethrow the PrsZkNodeNotFoundException case (unexpected, PRS entries gone but state.json is still around)

Tests

Added ZkStateReaderTest#testDeletePrsCollection which reproduce such race condition, and verify that:

  1. The ZkStateReader#fetchCollectionState should not throw exception, instead, it should eventually return null which indicates the collection is deleted
  2. The PrsZkNodeNotFoundException was indeed triggered

Please take note that the test case was built on the Breakpoint introduced by another PR #1457

Checklist

Please review the following and check all that apply:

  • I have reviewed the guidelines for How to Contribute and my code conforms to the standards described there to the best of my ability.
  • I have created a Jira issue and added the issue ID to my pull request title.
  • I have given Solr maintainers access to contribute to my PR branch. (optional but recommended)
  • I have developed this patch against the main branch.
  • I have run ./gradlew check.
  • I have added tests for my changes.
  • I have added documentation for the Reference Guide

Comment on lines 114 to 117
} else {
log.info(
"Breakpoint with key {} is triggered but there's no implementation set. Skipping...",
key);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should log here (and fwiw there is no corresponding logging for absent injectDelay(). The issue is that if you were to enable assertions on a running system (not in a test context), you might get a ton of spam from this. If you want to leave it, maybe set log level from log.info() to log.debug()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes let's change that to debug, i was a bit concerned if someone accidentally used different key names meant for the same Breakpoint for injectBreakpoint and setImplementation, so with this log message they can see whether the breakpoint is triggered.

Though I guess this could be really noisy as you pointed out, and I assume the dev can search whether there's "Breakpoint with key ... is triggered" message instead and absence of that means either the execution path does not reach the breakpoint or the breakpoint has no implementation set

public void setImplementation(String key, Breakpoint implementation) {
if (breakpoints.containsKey(key)) {
throw new IllegalArgumentException(
"Cannot refine Breakpoint implementation with key " + key);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "modify" or "replace" rather than "refine"? We'd be replacing, not refining.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh was a typo, supposed to be "redefine"

@patsonluk patsonluk force-pushed the patsonluk/SOLR-16701-delete-prs-race branch from 3b805af to cc5d9c3 Compare September 22, 2023 20:16
@patsonluk patsonluk force-pushed the patsonluk/SOLR-16701-delete-prs-race branch from cc5d9c3 to e5b21a9 Compare September 22, 2023 21:35
@magibney magibney merged commit f22a51c into apache:main Sep 23, 2023
2 of 3 checks passed
magibney pushed a commit that referenced this pull request Sep 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants