This repository has been archived by the owner on Mar 29, 2024. It is now read-only.

Noble/prs coll creation bug #158

Open

hiteshk25 wants to merge 12 commits into release/8.8 from noble/prsCollCreationBug

Conversation

hiteshk25

No description provided.

hiteshk25 and others added 2 commits June 17, 2022 16:11
…e in the end so that other nodes get a single refresh message
} else {
return new ZkWriteCommand(collection, coll.copyWithSlices(newSlices));
}
return new ZkWriteCommand(collection, coll.copyWithSlices(newSlices));
Author

In this case, will the data node remove the PRS state for it?

Collaborator

yes

} else{
return new ZkWriteCommand(collectionName, newCollection);
}
return new ZkWriteCommand(collectionName, newCollection);
Author

that looks good!

Author

@patsonluk the race condition is here. The SolrCore sends a message to update the replica status to DOWN, but it doesn't wait for the status update. Then the SolrCore recovers the core and updates the PRS state to UP.

But sometimes the overseer node updates the PRS state to DOWN after that, and then we see the core as DOWN.

With this fix, the overseer node will never update the PRS state.
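
A minimal, self-contained sketch of the interleaving described above (this is not Solr code; the map, thread names, and delay are illustrative stand-ins for the PRS entry and the two writers):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PrsRaceSketch {
  static final Map<String, String> prsState = new ConcurrentHashMap<>();

  public static void main(String[] args) throws InterruptedException {
    // The overseer processes the earlier "mark DOWN" message late.
    Thread overseer = new Thread(() -> {
      sleep(50);
      prsState.put("core_node1", "DOWN");
    });

    // Meanwhile the data node finishes recovery and marks the replica ACTIVE.
    Thread dataNode = new Thread(() -> prsState.put("core_node1", "ACTIVE"));

    overseer.start();
    dataNode.start();
    overseer.join();
    dataNode.join();

    // The late overseer write overwrote ACTIVE, so the core looks DOWN.
    System.out.println("final state: " + prsState.get("core_node1"));
  }

  static void sleep(long ms) {
    try {
      Thread.sleep(ms);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}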

@noblepaul noblepaul requested a review from chatman June 22, 2022 05:39
@@ -321,6 +319,8 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
}
}
if(isPRS) {
Author

@noblepaul do we need the PRS check here?

Collaborator

Yes. The idea is, we write the entire modified state.json in one command if it is PRS.

Author

  1. Can the same be done for non-PRS?
  2. Line 324, "ocmh.overseer.submit(new RefreshCollectionMessage(collectionName));": should it be the same for PRS and non-PRS?

@@ -321,6 +319,8 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
}
}
if(isPRS) {
byte[] data = Utils.toJSON(Collections.singletonMap(collectionName, clusterState.getCollection(collectionName)));
ocmh.zkStateReader.getZkClient().setData(collectionPath, data, true);
ocmh.overseer.submit(new RefreshCollectionMessage(collectionName));
}

Author
@hiteshk25 hiteshk25 Jun 22, 2022

Also, do we need the check below at line 330?

if (isPRS) {
  replicas = new ConcurrentHashMap<>();
  newColl.getSlices().stream().flatMap(slice -> slice.getReplicas().stream())
      .filter(r -> coresToCreate.containsKey(r.getCoreName()))   // Only the elements that were asked for...
      .forEach(r -> replicas.putIfAbsent(r.getCoreName(), r));   // ...get added to the map
}

Collaborator
@noblepaul noblepaul Jun 22, 2022

Yes. How else do we build that Map in the case of PRS?

Author

It seems we don't need a special case here, as the else part does this for non-PRS?

Author

@noblepaul @chatman it would be good to avoid special PRS cases if not required. Let me know what you think about it.

Yes, I would like to avoid as many special cases as possible. But PRS requires special handling

Yeah. Can we look at the cases which I mentioned above?


@hiteshk25
Author

I removed all the PRS checks here and tried it on the playpen. So far it looks good: 83d7372

@noblepaul
Collaborator

noblepaul commented Jun 26, 2022

I don't recommend removing those. Eventually, non-prs should go away

@hiteshk25
Author

I don't recommend removing those. Eventually, non-prs should go away

I think the point here is: do we need those new changes? It would be good to understand the whole PRS flow.

@hiteshk25
Author

We have observed that FSPRSTest is failing right now

@hiteshk25
Author

@noblepaul did you get a chance to look at the FSPRSTest failure mentioned above? We want to push this ASAP. Please prioritize it.

@noblepaul
Collaborator

noblepaul commented Jul 5, 2022

It's a test framework issue. We can just comment out the object tracker for a while and then investigate it as a different ticket?

This Object tracker issue has been there for a while

@hiteshk25
Author

It's a test framework issue. We can just comment out the object tracker for a while and then investigate it as a different ticket?

This Object tracker issue has been there for a while

I think we are not removing the PRS state of the deleted replica, as the test deletes the replica before the failure. We need to clean up that state.

@hiteshk25
Author

I was debugging that test today. See the PRS state and collection state:

[screenshot: PRS state and collection state]

@noblepaul
Collaborator

Sure @hiteshk25. I shall fix and test this ASAP.

if (forcePublish || sendToOverseer(coll, coreNodeName)) {
if (sendToOverseer(coll, coreNodeName)) {


Can you comment on this change (added with 35a511d)?

@patsonluk patsonluk Jul 12, 2022

Just adding my observation here. forcePublish is an old flag that forces a publish to the overseer even if the core is closed.

I don't fully understand why we added that along with the PRS changes; my guess is that we wanted to preserve the behavior of "publish to overseer no matter what" (while the exact behavior when it's both true and PRS is enabled is not well defined).

The problem with such a flag, though, is that during preRegister from ZkController on new cores (which is invoked once per core), it calls publish(cd, Replica.State.DOWN, false, true); forcePublish is true here, and since true means always send to the overseer, even for a PRS collection it will publish such state.json updates to the overseer n times, where n is the number of cores. So from the overseer's point of view, it will receive n messages, where n is the number of shards.

This is largely okay for non-PRS collections due to the "batching" behavior of ZkStateWriter, which will only write the "latest" state.json update (bounded by a certain refresh interval), hence the actual write ops to ZK are minimal. However, this is NOT okay for PRS collections, as batching in ZkStateWriter is essentially disabled for PRS collections, hence n writes will be sent to ZK.

Now, removing forcePublish works around this problem, as it will no longer submit messages for each core registration for a PRS collection (hence fewer ZK writes); however, there are some items to consider:

  1. We probably want to double-confirm the logic in sendToOverseer, as some messages that used to rely on the forcePublish flag to get pushed to the overseer will now have to run the logic in sendToOverseer to determine where the message should go.
  2. We probably want to better define the behavior of PRS collections here. This is publishing the core info. If such a core does not exist in ZK, we might want to do both (send to ZK to add an entry in state.json AND set the PRS entries). As for a core that does already exist in the ZK state.json, it's probably okay to just set the PRS entries. The existing logic seems to want to handle the case where it is PRS and the replica/core is not yet in ZK (judging by the logic in sendToOverseer); however, it's probably not working well, as it does not update the PRS entries in that case. It might not be a big deal now, as the state.json entry for such a replica is being inserted in places such as CreateCollectionCmd, though we still probably want to better define the behavior to cover all cases.

And in the longer run, if we still want state.json to keep info on the "existence" of a core/replica, then we really should fix the batching for PRS collections in ZkStateWriter and probably eliminate individual direct writes to state.json for PRS collections.

The latter could still be a performance issue (which I assume caused the collection creation issue in the first place?). From my brief debugging, it appears that this https://github.com/fullstorydev/lucene-solr/blob/release/8.8/solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java#L282 might try to write state.json n times to ZK, where n = number of shards.
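
A rough sketch of the per-core publish problem described above (not ZkController code; publish, sendToOverseer, and the queue here are simplified, hypothetical stand-ins): with the old guard, forcePublish bypasses the PRS check, so n core registrations queue n overseer messages.

import java.util.ArrayDeque;
import java.util.Deque;

public class ForcePublishSketch {

  static final Deque<String> overseerJobQueue = new ArrayDeque<>();

  // Hypothetical stand-in: for a PRS collection the normal check says
  // "do not send to the overseer".
  static boolean sendToOverseer(boolean isPrsCollection) {
    return !isPrsCollection;
  }

  static void publish(String coreName, boolean isPrsCollection, boolean forcePublish) {
    // Mirrors the old guard: forcePublish bypasses the PRS check entirely.
    if (forcePublish || sendToOverseer(isPrsCollection)) {
      overseerJobQueue.offer("DOWN message for " + coreName);
    }
  }

  public static void main(String[] args) {
    int n = 8; // number of cores being registered on the node
    for (int i = 0; i < n; i++) {
      // preRegister publishes DOWN once per core with forcePublish = true.
      publish("core_node" + i, true, true);
    }
    // Every core produced an overseer message, even though the collection is PRS.
    System.out.println("overseer messages queued: " + overseerJobQueue.size()); // prints 8
  }
}

With the forcePublish term removed from the guard, the same loop would queue zero messages for a PRS collection.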

Author

@noblepaul @chatman I think we should keep this PR for just the collection creation issue. We can revisit this separately.

Collaborator

@patsonluk

I don't fully understand why we added that along with the PRS changes; my guess is that we wanted to preserve the behavior of "publish to overseer no matter what" (while the exact behavior when it's both true and PRS is enabled is not well defined).

The original motivation during the PRS implementation was not to eliminate overseer messages 100%. Even if we eliminated 90% (the state messages), it was good enough. In other words, we were trying to be a little conservative (by preserving the old behavior to a small extent) to ensure correctness.

Now we have reached a point where they can be eliminated 100% for efficiency and safety (from race conditions).

This is largely okay for non-PRS collections due to the "batching" behavior of ZkStateWriter, which will only write the "latest" state.json update (bounded by a certain refresh interval), hence the actual write ops to ZK are minimal. However, this is NOT okay for PRS collections, as batching in ZkStateWriter is essentially disabled for PRS collections, hence n writes will be sent to ZK.

This is true. We got rid of batching in the case of PRS collections, since messages that come to the overseer for PRS collections are few and far between.

We probably want to double-confirm the logic in sendToOverseer, as some messages that used to rely on the forcePublish flag to get pushed to the overseer will now have to run the logic in sendToOverseer to determine where the message should go.

We should audit sendToOverseer() again and ensure that no messages go to the overseer for PRS collections.

We probably want to better define the behavior of PRS collections here. This is publishing the core info. If such a core does not exist in ZK, we might want to do both?

This actually was legacy behavior, where a core could just start up and register itself as a replica by sending a message to the overseer. Today, if the core info does not exist in state.json, the replica should not even start.

And in the longer run, if we still want state.json to keep info on the "existence" of a core/replica, then we really should fix the batching for PRS collections in ZkStateWriter and probably eliminate individual direct writes to state.json for PRS collections.

Yes:

  • No messages should go to overseer for PRS collections for state/leader
  • All modification operations on PRS collections (e.g. CREATE, ADDREPLICA, DELETEREPLICA) should be done as single atomic operations and avoid multiple updates.
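
As a sketch of what such a single atomic operation could look like at the ZooKeeper level (this is not Solr's implementation; the paths, znode names, and payload are hypothetical, and only the ZooKeeper multi()/Op API is real):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class AtomicPrsUpdateSketch {
  public static void main(String[] args) throws Exception {
    // Connect to a local ZooKeeper; the address is illustrative.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});
    try {
      String statePath = "/collections/mycoll/state.json";                      // hypothetical path
      byte[] newStateJson = "{\"mycoll\":{}}".getBytes(StandardCharsets.UTF_8); // placeholder payload

      // All three changes are applied in one multi(): either they all succeed
      // or none do, so readers never see state.json disagree with the
      // per-replica state entries.
      List<Op> ops = Arrays.asList(
          Op.setData(statePath, newStateJson, -1),                   // rewrite state.json once
          Op.delete(statePath + "/core_node1:1:D", -1),              // drop the stale DOWN entry
          Op.create(statePath + "/core_node1:2:A", new byte[0],      // add the new ACTIVE entry
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));

      zk.multi(ops);
    } finally {
      zk.close();
    }
  }
}

If any op fails (for example a version check), none of the changes are applied, which is the property a "single atomic operation" relies on.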

The latter could still be a performance issue ....

The perf issue arose from a race condition which caused the CREATE command to wait indefinitely for all replicas to be ACTIVE. So it was actually not a perf bug; it was a “deadlock”.

@noblepaul
Collaborator

@hiteshk25 @patsonluk

Thanks for your comments

The collection creation slowness that we observe is not due to some bug in CreateCollectionCmd. It is a manifestation of a race condition between the state updates that happen from the data node AND the overseer node (a data node changes the states and the overseer overwrites them), so the state becomes wrong. We are moving to a model where we will:

  • ONLY update PRS states from data nodes
  • only update state.json from overseer

This ensures that there will be no race condition at all in updating PRS.
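
A toy sketch of that ownership rule (not Solr code; the names are hypothetical), just to make the division of responsibility explicit:

public class PrsOwnershipSketch {

  enum Actor { DATA_NODE, OVERSEER }
  enum Target { PER_REPLICA_STATE, STATE_JSON }

  // Returns true if this actor owns writes to this piece of state.
  static boolean mayWrite(Actor actor, Target target) {
    switch (target) {
      case PER_REPLICA_STATE: return actor == Actor.DATA_NODE;
      case STATE_JSON:        return actor == Actor.OVERSEER;
      default:                return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(mayWrite(Actor.DATA_NODE, Target.PER_REPLICA_STATE)); // true
    System.out.println(mayWrite(Actor.OVERSEER, Target.PER_REPLICA_STATE));  // false: no overwrite race
  }
}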

@hiteshk25 hiteshk25 force-pushed the noble/prsCollCreationBug branch from 7b18df5 to 314a97c on July 13, 2022 21:11
@hiteshk25
Author

@noblepaul @chatman Can you please look at the following test failures? Likely just a test issue, as I have merged the aggregated metrics changes.

Tests with failures [seed: AB00425489678D50]:
   [junit4]   - org.apache.solr.core.TestCoreContainer.testNoCores
   [junit4]   - org.apache.solr.core.TestCoreContainer.testDeleteBadCores
   [junit4]   - org.apache.solr.cloud.DeleteInactiveReplicaTest.deleteInactiveReplicaTest
   [junit4]   - org.apache.solr.cloud.LegacyCloudClusterPropTest.testCreateCollectionSwitchLegacyCloud
   [junit4]   - org.apache.solr.pkg.TestPackages.testCoreReloadingPlugin
   [junit4]   - org.apache.solr.core.FSPRSTest.testShardSplit

@patsonluk

patsonluk commented Jul 13, 2022

Thank you for the explanation @noblepaul

So is the goal of this PR to achieve the state of:

  • ONLY update PRS states from data nodes
  • only update state.json from overseer

Or is this PR simply just fixing the race condition for collection creation?

@@ -1701,7 +1701,7 @@ public void publish(final CoreDescriptor cd, final Replica.State state, boolean
cd.getCloudDescriptor().setLastPublished(state);
}
DocCollection coll = zkStateReader.getCollection(collection);
if (forcePublish || sendToOverseer(coll, coreNodeName)) {
Author

@noblepaul @chatman this change is causing test failures. The following tests are failing. Please do run the Solr tests, as I feel there are some more failures.

Tests with failures [seed: AB00425489678D50]:
   [junit4]   - org.apache.solr.core.TestCoreContainer.testNoCores
   [junit4]   - org.apache.solr.core.TestCoreContainer.testDeleteBadCores
   [junit4]   - org.apache.solr.cloud.DeleteInactiveReplicaTest.deleteInactiveReplicaTest
   [junit4]   - org.apache.solr.cloud.LegacyCloudClusterPropTest.testCreateCollectionSwitchLegacyCloud
   [junit4]   - org.apache.solr.pkg.TestPackages.testCoreReloadingPlugin
   [junit4]   - org.apache.solr.core.FSPRSTest.testShardSplit

Collaborator

Sure @hiteshk25

…rom overseer. There is a follow up PR to avoid updates to state.json for all updates
@noblepaul
Collaborator

@hiteshk25 this fixes the collection creation bug

@hiteshk25
Author

@noblepaul @chatman here are the failing tests ...

 [junit4] Tests with failures [seed: 697A3852C6E2A11D]:
   [junit4]   - org.apache.solr.cloud.LegacyCloudClusterPropTest.testCreateCollectionSwitchLegacyCloud
   [junit4]   - org.apache.solr.cloud.LeaderTragicEventTest.testLeaderFailsOver
   [junit4]   - org.apache.solr.cloud.api.collections.ShardSplitTest.testSplitMixedReplicaTypes
   [junit4]   - org.apache.solr.cloud.api.collections.ShardSplitTest.testSplitStaticIndexReplicationLink
   [junit4]   - org.apache.solr.cloud.api.collections.ShardSplitTest.testSplitStaticIndexReplication
   [junit4]   - org.apache.solr.cloud.api.collections.ShardSplitTest.testSplitMixedReplicaTypesLink
   [junit4]   - org.apache.solr.core.TestCoreContainer.testNoCores
   [junit4]   - org.apache.solr.core.TestCoreContainer.testDeleteBadCores
   [junit4]   - org.apache.solr.cloud.DeleteInactiveReplicaTest.deleteInactiveReplicaTest
   [junit4]   - org.apache.solr.pkg.TestPackages.testCoreReloadingPlugin

@@ -1703,7 +1703,8 @@ public void publish(final CoreDescriptor cd, final Replica.State state, boolean
DocCollection coll = zkStateReader.getCollection(collection);
if (forcePublish || sendToOverseer(coll, coreNodeName)) {
overseerJobQueue.offer(Utils.toJSON(m));
} else {
}
Author

@noblepaul what is the motivation to remove the else part here? To publish the PRS state of the replica?

Q: will the SolrCore update the replica status only? If so, then for a PRS collection we don't need to update the state.json file.


@noblepaul noblepaul closed this Jul 20, 2022
@noblepaul
Collaborator

what is the motivation to remove the else part here? To publish the PRS state of the replica?

If the collection is PRS:

  • then the data node MUST update the state, and
  • the overseer MUST NOT update the per-replica state. This is done to avoid the race condition/deadlock.

However, we are sending the message to the overseer (when forcePublish == true) because shard split relies on this message. We need to avoid that in a subsequent PR.

@hiteshk25
Author

I think if the collection is PRS then there is no need to update the state.json file. The data node just needs to update the replica status. Do you think it will update any other attribute?

Also, did you close this PR by mistake?

@noblepaul noblepaul reopened this Jul 20, 2022
@noblepaul
Collaborator

I think if the collection is PRS then there is no need to update the state.json file. The data node just needs to update the replica status. Do you think it will update any other attribute?

Yes, but that seems to cause some failures in the shard split test and we are hunting down that problem.

Also, did you close this PR by mistake?

Yep. I closed it accidentally.

@hiteshk25
Author

I can see it tries to update the shard's state along with the replica state. For non-PRS that is an atomic operation.
I think we then have to make this atomic for PRS as well. As a special case, I think we should wait there for both operations to finish.

Will look at the code more closely!
