This repository has been archived by the owner on Mar 29, 2024. It is now read-only.

Noble/prs coll creation bug #158

Open

hiteshk25 wants to merge 12 commits into release/8.8 from noble/prsCollCreationBug

Conversation

hiteshk25

No description provided.

hiteshk25 and others added 2 commits June 17, 2022 16:11
…e in the end so that other nodes get a single refresh message
} else {
return new ZkWriteCommand(collection, coll.copyWithSlices(newSlices));
}
return new ZkWriteCommand(collection, coll.copyWithSlices(newSlices));
Author

In this case, will the data node remove the PRS state for it?

Collaborator

yes

} else{
return new ZkWriteCommand(collectionName, newCollection);
}
return new ZkWriteCommand(collectionName, newCollection);
Author

that looks good!

Author

@patsonluk the race condition is here. The SolrCore sends a message to update the replica status to DOWN, but it doesn't wait for the status update. Then the SolrCore recovers the core and updates the PRS state to UP.

But sometimes the overseer node updates the PRS state to DOWN after that, and then we see the core as DOWN.

With this fix, the overseer node will never update the PRS state.
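
A minimal, self-contained sketch of the interleaving described above (this is not Solr code; the map, thread names, and delay are illustrative stand-ins for the PRS entry and the two writers):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PrsRaceSketch {
  static final Map<String, String> prsState = new ConcurrentHashMap<>();

  public static void main(String[] args) throws InterruptedException {
    // The overseer processes the earlier "mark DOWN" message late.
    Thread overseer = new Thread(() -> {
      sleep(50);
      prsState.put("core_node1", "DOWN");
    });

    // Meanwhile the data node finishes recovery and marks the replica ACTIVE.
    Thread dataNode = new Thread(() -> prsState.put("core_node1", "ACTIVE"));

    overseer.start();
    dataNode.start();
    overseer.join();
    dataNode.join();

    // The late overseer write overwrote ACTIVE, so the core looks DOWN.
    System.out.println("final state: " + prsState.get("core_node1"));
  }

  static void sleep(long ms) {
    try {
      Thread.sleep(ms);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}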

@noblepaul noblepaul requested a review from chatman June 22, 2022 05:39
@@ -321,6 +319,8 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
}
}
if(isPRS) {
Author

@noblepaul do we need the PRS check here?

Collaborator

Yes. The idea is, we write the entire modified state.json in one command if it is PRS.

Author

  1. Can the same be done for non-PRS?
  2. Line 324, "ocmh.overseer.submit(new RefreshCollectionMessage(collectionName));": should it be the same for PRS and non-PRS?

@@ -321,6 +319,8 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
}
}
if(isPRS) {
byte[] data = Utils.toJSON(Collections.singletonMap(collectionName, clusterState.getCollection(collectionName)));
ocmh.zkStateReader.getZkClient().setData(collectionPath, data, true);
ocmh.overseer.submit(new RefreshCollectionMessage(collectionName));
}

Author
@hiteshk25 hiteshk25 Jun 22, 2022

Also, do we need the check below at line 330?

if (isPRS) {
  replicas = new ConcurrentHashMap<>();
  newColl.getSlices().stream().flatMap(slice -> slice.getReplicas().stream())
      .filter(r -> coresToCreate.containsKey(r.getCoreName()))   // Only the elements that were asked for...
      .forEach(r -> replicas.putIfAbsent(r.getCoreName(), r));   // ...get added to the map
}

Collaborator
@noblepaul noblepaul Jun 22, 2022

Yes. How else do we build that Map in the case of PRS?

Author

It seems we don't need a special case here, as the else part does this for non-PRS?

Author

@noblepaul @chatman it would be good to avoid special PRS cases if not required. Let me know what you think about it.

Yes, I would like to avoid as many special cases as possible. But PRS requires special handling

Yeah. Can we look at the cases which I mentioned above?


@hiteshk25
Author

I removed all the PRS checks here and tried it on the playpen. So far it looks good: 83d7372

@noblepaul
Collaborator

noblepaul commented Jun 26, 2022

I don't recommend removing those. Eventually, non-prs should go away

@hiteshk25
Author

I don't recommend removing those. Eventually, non-prs should go away

I think the point here is: do we need those new changes? It would be good to understand the whole PRS flow.

@hiteshk25
Author

We have observed that FSPRSTest is failing right now

@hiteshk25
Author

@noblepaul did you get a chance to look at the FSPRSTest failure mentioned above? We want to push this ASAP. Please prioritize it.

@noblepaul
Collaborator

noblepaul commented Jul 5, 2022

It's a test framework issue. We can just comment out the object tracker for a while and then investigate it as a different ticket?

This Object tracker issue has been there for a while

@hiteshk25
Author

It's a test framework issue. We can just comment out the object tracker for a while and then investigate it as a different ticket?

This Object tracker issue has been there for a while

I think we are not removing the PRS state of the deleted replica, as the test deletes the replica before the failure. We need to clean up that state.

@hiteshk25
Author

I was debugging that test today. See the PRS state and collection state:

[screenshot: PRS state and collection state]

@noblepaul
Collaborator

Sure @hiteshk25. I shall fix and test this ASAP.

if (forcePublish || sendToOverseer(coll, coreNodeName)) {
if (sendToOverseer(coll, coreNodeName)) {


Can you comment on this change (added with 35a511d)?

@patsonluk patsonluk Jul 12, 2022

Just adding my observation here. forcePublish is an old flag that forces a publish to the overseer even if the core is closed.

I don't fully understand why we added that along with the PRS changes; my guess is that we wanted to preserve the behavior of "publish to overseer no matter what" (while the exact behavior when it's both true and PRS is enabled is not well defined).

The problem with such a flag, though, is that during preRegister from ZkController on new cores (which is invoked once per core), it calls publish(cd, Replica.State.DOWN, false, true); forcePublish is true here, and since true means always send to the overseer, even for a PRS collection it will publish such state.json updates to the overseer n times, where n is the number of cores. So from the overseer's point of view, it will receive n messages, where n is the number of shards.

This is largely okay for non-PRS collections due to the "batching" behavior of ZkStateWriter, which will only write the "latest" state.json update (bounded by a certain refresh interval), hence the actual write ops to ZK are minimal. However, this is NOT okay for PRS collections, as batching in ZkStateWriter is essentially disabled for PRS collections, hence n writes will be sent to ZK.

Now, removing forcePublish works around this problem, as it will no longer submit messages for each core registration for a PRS collection (hence fewer ZK writes); however, there are some items to consider:

  1. We probably want to double-confirm the logic in sendToOverseer, as some messages that used to rely on the forcePublish flag to get pushed to the overseer will now have to run the logic in sendToOverseer to determine where the message should go.
  2. We probably want to better define the behavior of PRS collections here. This is publishing the core info. If such a core does not exist in ZK, we might want to do both (send to ZK to add an entry in state.json AND set the PRS entries). As for a core that does already exist in the ZK state.json, it's probably okay to just set the PRS entries. The existing logic seems to want to handle the case where it is PRS and the replica/core is not yet in ZK (judging by the logic in sendToOverseer); however, it's probably not working well, as it does not update the PRS entries in that case. It might not be a big deal now, as the state.json entry for such a replica is being inserted in places such as CreateCollectionCmd, though we still probably want to better define the behavior to cover all cases.

And in the longer run, if we still want state.json to keep info on the "existence" of a core/replica, then we really should fix the batching for PRS collections in ZkStateWriter and probably eliminate individual direct writes to state.json for PRS collections.

The latter could still be a performance issue (which I assume caused the collection creation issue in the first place?). From my brief debugging, it appears that this https://github.com/fullstorydev/lucene-solr/blob/release/8.8/solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java#L282 might try to write state.json n times to ZK, where n = number of shards.
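
A rough sketch of the per-core publish problem described above (not ZkController code; publish, sendToOverseer, and the queue here are simplified, hypothetical stand-ins): with the old guard, forcePublish bypasses the PRS check, so n core registrations queue n overseer messages.

import java.util.ArrayDeque;
import java.util.Deque;

public class ForcePublishSketch {

  static final Deque<String> overseerJobQueue = new ArrayDeque<>();

  // Hypothetical stand-in: for a PRS collection the normal check says
  // "do not send to the overseer".
  static boolean sendToOverseer(boolean isPrsCollection) {
    return !isPrsCollection;
  }

  static void publish(String coreName, boolean isPrsCollection, boolean forcePublish) {
    // Mirrors the old guard: forcePublish bypasses the PRS check entirely.
    if (forcePublish || sendToOverseer(isPrsCollection)) {
      overseerJobQueue.offer("DOWN message for " + coreName);
    }
  }

  public static void main(String[] args) {
    int n = 8; // number of cores being registered on the node
    for (int i = 0; i < n; i++) {
      // preRegister publishes DOWN once per core with forcePublish = true.
      publish("core_node" + i, true, true);
    }
    // Every core produced an overseer message, even though the collection is PRS.
    System.out.println("overseer messages queued: " + overseerJobQueue.size()); // prints 8
  }
}

With the forcePublish term removed from the guard, the same loop would queue zero messages for a PRS collection.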

Author

@noblepaul @chatman I think we should keep this PR for just the collection creation issue. We can revisit this separately.

Collaborator

@patsonluk

I don't fully understand why we added that along with the PRS changes; my guess is that we wanted to preserve the behavior of "publish to overseer no matter what" (while the exact behavior when it's both true and PRS is enabled is not well defined).

The original motivation during the PRS implementation was not to eliminate overseer messages 100%. Even if we eliminated 90% (the state messages), it was good enough. In other words, we were trying to be a little conservative (by preserving the old behavior to a small extent) to ensure correctness.

Now we have reached a point where they can be eliminated 100% for efficiency and safety (from race conditions).

This is largely okay for non-PRS collections due to the "batching" behavior of ZkStateWriter, which will only write the "latest" state.json update (bounded by a certain refresh interval), hence the actual write ops to ZK are minimal. However, this is NOT okay for PRS collections, as batching in ZkStateWriter is essentially disabled for PRS collections, hence n writes will be sent to ZK.

This is true. We got rid of batching in the case of PRS collections, since messages that come to the overseer for PRS collections are few and far between.

We probably want to double-confirm the logic in sendToOverseer, as some messages that used to rely on the forcePublish flag to get pushed to the overseer will now have to run the logic in sendToOverseer to determine where the message should go.

We should audit sendToOverseer() again and ensure that no messages go to the overseer for PRS collections.

We probably want to better define the behavior of PRS collections here. This is publishing the core info. If such a core does not exist in ZK, we might want to do both?

This actually was legacy behavior, where a core could just start up and register itself as a replica by sending a message to the overseer. Today, if the core info does not exist in state.json, the replica should not even start.

And in the longer run, if we still want state.json to keep info on the "existence" of a core/replica, then we really should fix the batching for PRS collections in ZkStateWriter and probably eliminate individual direct writes to state.json for PRS collections.

Yes:

  • No messages should go to overseer for PRS collections for state/leader
  • All modification operations on PRS collections (e.g. CREATE, ADDREPLICA, DELETEREPLICA) should be done as single atomic operations and avoid multiple updates.
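
As a sketch of what such a single atomic operation could look like at the ZooKeeper level (this is not Solr's implementation; the paths, znode names, and payload are hypothetical, and only the ZooKeeper multi()/Op API is real):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class AtomicPrsUpdateSketch {
  public static void main(String[] args) throws Exception {
    // Connect to a local ZooKeeper; the address is illustrative.
    ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});
    try {
      String statePath = "/collections/mycoll/state.json";                      // hypothetical path
      byte[] newStateJson = "{\"mycoll\":{}}".getBytes(StandardCharsets.UTF_8); // placeholder payload

      // All three changes are applied in one multi(): either they all succeed
      // or none do, so readers never see state.json disagree with the
      // per-replica state entries.
      List<Op> ops = Arrays.asList(
          Op.setData(statePath, newStateJson, -1),                   // rewrite state.json once
          Op.delete(statePath + "/core_node1:1:D", -1),              // drop the stale DOWN entry
          Op.create(statePath + "/core_node1:2:A", new byte[0],      // add the new ACTIVE entry
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT));

      zk.multi(ops);
    } finally {
      zk.close();
    }
  }
}

If any op fails (for example a version check), none of the changes are applied, which is the property a "single atomic operation" relies on.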

The latter could still be a performance issue ....

The perf issue arose from a race condition which caused the CREATE command to wait indefinitely for all replicas to be ACTIVE. So it was actually not a perf bug; it was a “deadlock”.

@noblepaul
Collaborator

@hiteshk25 @patsonluk

Thanks for your comments

The collection creation slowness that we observe is not due to some bug in CreateCollectionCmd. It is a manifestation of a race condition between the state updates that happen from the data node AND the overseer node (a data node changes the states and the overseer overwrites them), so the state becomes wrong. We are moving to a model where we will:

  • ONLY update PRS states from data nodes
  • only update state.json from overseer

This ensures that there will be no race condition at all in updating PRS.
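
A toy sketch of that ownership rule (not Solr code; the names are hypothetical), just to make the division of responsibility explicit:

public class PrsOwnershipSketch {

  enum Actor { DATA_NODE, OVERSEER }
  enum Target { PER_REPLICA_STATE, STATE_JSON }

  // Returns true if this actor owns writes to this piece of state.
  static boolean mayWrite(Actor actor, Target target) {
    switch (target) {
      case PER_REPLICA_STATE: return actor == Actor.DATA_NODE;
      case STATE_JSON:        return actor == Actor.OVERSEER;
      default:                return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(mayWrite(Actor.DATA_NODE, Target.PER_REPLICA_STATE)); // true
    System.out.println(mayWrite(Actor.OVERSEER, Target.PER_REPLICA_STATE));  // false: no overwrite race
  }
}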

@hiteshk25 hiteshk25 force-pushed the noble/prsCollCreationBug branch from 7b18df5 to 314a97c on July 13, 2022 21:11
@hiteshk25
Author

@noblepaul @chatman Can you please look at the following test failures? Likely just a test issue, as I have merged the aggregated metrics changes.

Tests with failures [seed: AB00425489678D50]:
   [junit4]   - org.apache.solr.core.TestCoreContainer.testNoCores
   [junit4]   - org.apache.solr.core.TestCoreContainer.testDeleteBadCores
   [junit4]   - org.apache.solr.cloud.DeleteInactiveReplicaTest.deleteInactiveReplicaTest
   [junit4]   - org.apache.solr.cloud.LegacyCloudClusterPropTest.testCreateCollectionSwitchLegacyCloud
   [junit4]   - org.apache.solr.pkg.TestPackages.testCoreReloadingPlugin
   [junit4]   - org.apache.solr.core.FSPRSTest.testShardSplit

@patsonluk

patsonluk commented Jul 13, 2022

Thank you for the explanation @noblepaul

So is the goal of this PR to achieve the state of:

  • ONLY update PRS states from data nodes
  • only update state.json from overseer

Or is this PR simply just fixing the race condition for collection creation?

@@ -1701,7 +1701,7 @@ public void publish(final CoreDescriptor cd, final Replica.State state, boolean
cd.getCloudDescriptor().setLastPublished(state);
}
DocCollection coll = zkStateReader.getCollection(collection);
if (forcePublish || sendToOverseer(coll, coreNodeName)) {
Author

@noblepaul @chatman this change is causing test failures. The following tests are failing. Please do run the Solr tests, as I feel there are some more failures.

Tests with failures [seed: AB00425489678D50]:
   [junit4]   - org.apache.solr.core.TestCoreContainer.testNoCores
   [junit4]   - org.apache.solr.core.TestCoreContainer.testDeleteBadCores
   [junit4]   - org.apache.solr.cloud.DeleteInactiveReplicaTest.deleteInactiveReplicaTest
   [junit4]   - org.apache.solr.cloud.LegacyCloudClusterPropTest.testCreateCollectionSwitchLegacyCloud
   [junit4]   - org.apache.solr.pkg.TestPackages.testCoreReloadingPlugin
   [junit4]   - org.apache.solr.core.FSPRSTest.testShardSplit

Collaborator

Sure @hiteshk25

…rom overseer. There is a follow up PR to avoid updates to state.json for all updates
@noblepaul
Collaborator

@hiteshk25 this fixes the collection creation bug

@hiteshk25
Author

@noblepaul @chatman here are the failing tests ...

 [junit4] Tests with failures [seed: 697A3852C6E2A11D]:
   [junit4]   - org.apache.solr.cloud.LegacyCloudClusterPropTest.testCreateCollectionSwitchLegacyCloud
   [junit4]   - org.apache.solr.cloud.LeaderTragicEventTest.testLeaderFailsOver
   [junit4]   - org.apache.solr.cloud.api.collections.ShardSplitTest.testSplitMixedReplicaTypes
   [junit4]   - org.apache.solr.cloud.api.collections.ShardSplitTest.testSplitStaticIndexReplicationLink
   [junit4]   - org.apache.solr.cloud.api.collections.ShardSplitTest.testSplitStaticIndexReplication
   [junit4]   - org.apache.solr.cloud.api.collections.ShardSplitTest.testSplitMixedReplicaTypesLink
   [junit4]   - org.apache.solr.core.TestCoreContainer.testNoCores
   [junit4]   - org.apache.solr.core.TestCoreContainer.testDeleteBadCores
   [junit4]   - org.apache.solr.cloud.DeleteInactiveReplicaTest.deleteInactiveReplicaTest
   [junit4]   - org.apache.solr.pkg.TestPackages.testCoreReloadingPlugin

@@ -1703,7 +1703,8 @@ public void publish(final CoreDescriptor cd, final Replica.State state, boolean
DocCollection coll = zkStateReader.getCollection(collection);
if (forcePublish || sendToOverseer(coll, coreNodeName)) {
overseerJobQueue.offer(Utils.toJSON(m));
} else {
}
Author

@noblepaul what is the motivation to remove the else part here? To publish the PRS state of the replica?

Q: will the SolrCore update the replica status only? If so, then for a PRS collection we don't need to update the state.json file.


@noblepaul noblepaul closed this Jul 20, 2022
@noblepaul
Collaborator

what is the motivation to remove the else part here? To publish the PRS state of the replica?

If the collection is PRS:

  • then the data node MUST update the state, and
  • the overseer MUST NOT update the per-replica state. This is done to avoid the race condition/deadlock.

However, we are sending the message to the overseer (when forcePublish == true) because shard split relies on this message. We need to avoid that in a subsequent PR.

@hiteshk25
Author

I think if the collection is PRS then there is no need to update the state.json file. The data node just needs to update the replica status. Do you think it will update any other attribute?

Also, did you close this PR by mistake?

@noblepaul noblepaul reopened this Jul 20, 2022
@noblepaul
Collaborator

I think if the collection is PRS then there is no need to update the state.json file. The data node just needs to update the replica status. Do you think it will update any other attribute?

Yes, but that seems to cause some failures in the shard split test and we are hunting down that problem.

Also, did you close this PR by mistake?

Yep. I closed it accidentally.

@hiteshk25
Author

I can see it tries to update the shard's state along with the replica state. For non-PRS that is an atomic operation.
I think we then have to make this atomic for PRS as well. As a special case, I think we should wait there for both operations to finish.

Will look at the code more closely!
