Noble/prs coll creation bug #158
base: release/8.8
Conversation
…e in the end so that other nodes get a single refresh message
-        } else {
-          return new ZkWriteCommand(collection, coll.copyWithSlices(newSlices));
-        }
+        return new ZkWriteCommand(collection, coll.copyWithSlices(newSlices));
In this case, will data node remove the prs state of it?
yes
-    } else {
-      return new ZkWriteCommand(collectionName, newCollection);
-    }
+    return new ZkWriteCommand(collectionName, newCollection);
that looks good!
@patsonluk the race condition is here. solrcore sends a message to update the replica status to down, but it doesn't wait for the status update. Then solrcore recovers the core and updates the prs state to up. But sometimes the overseer node updates the prs state to down after that, and then we see the core as down. With this fix, the overseer node will never update the prs state.
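A minimal, self-contained sketch of that interleaving (illustrative only; these are not the actual Solr classes, and the AtomicReference stands in for the replica's PRS entry in ZooKeeper):

import java.util.concurrent.atomic.AtomicReference;

public class PrsRaceSketch {
    // Stands in for the replica's PRS entry in ZooKeeper.
    static final AtomicReference<String> prsState = new AtomicReference<>("ACTIVE");

    public static void main(String[] args) throws InterruptedException {
        // The core published DOWN to the overseer but did not wait; the
        // overseer applies it late, after the core has already recovered.
        Thread overseer = new Thread(() -> {
            sleepQuietly(50);
            prsState.set("DOWN");   // stale write clobbers the newer state
        });
        // Meanwhile the core recovers and flips its PRS entry back to ACTIVE.
        Thread core = new Thread(() -> {
            sleepQuietly(10);
            prsState.set("ACTIVE");
        });
        overseer.start();
        core.start();
        overseer.join();
        core.join();
        // Last writer wins: the replica is actually up but reads as DOWN.
        System.out.println("final PRS state: " + prsState.get());
    }

    static void sleepQuietly(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}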
@@ -321,6 +319,8 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
       }
     }
     if(isPRS) {
@noblepaul do we need the prs check here?
Yes. The idea is, we write the entire modified state.json in one command if it is PRS.
- Can the same be done for non-prs?
- Line 324, "ocmh.overseer.submit(new RefreshCollectionMessage(collectionName));": should it be the same for prs and non-prs?
@@ -321,6 +319,8 @@ public void call(ClusterState clusterState, ZkNodeProps message, @SuppressWarnin
       }
     }
     if(isPRS) {
       byte[] data = Utils.toJSON(Collections.singletonMap(collectionName, clusterState.getCollection(collectionName)));
       ocmh.zkStateReader.getZkClient().setData(collectionPath, data, true);
       ocmh.overseer.submit(new RefreshCollectionMessage(collectionName));
     }
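To restate what the quoted hunk does (identifiers are taken from the diff above; the rest of the method is elided in this view):

if (isPRS) {
  // Serialize the entire, already-modified collection state once.
  byte[] data = Utils.toJSON(Collections.singletonMap(collectionName, clusterState.getCollection(collectionName)));
  // Write it to the collection's state.json with a single setData call...
  ocmh.zkStateReader.getZkClient().setData(collectionPath, data, true);
  // ...then hand the overseer one refresh message instead of per-replica updates.
  ocmh.overseer.submit(new RefreshCollectionMessage(collectionName));
}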
Also, do we need the line 330 check below?
if (isPRS) {
replicas = new ConcurrentHashMap<>();
newColl.getSlices().stream().flatMap(slice -> slice.getReplicas().stream())
.filter(r -> coresToCreate.containsKey(r.getCoreName())) // Only the elements that were asked for...
.forEach(r -> replicas.putIfAbsent(r.getCoreName(), r)); // ...get added to the map
}
Yes. How else do we build that Map in the case of PRS?
It seems we don't need a special case here, as the else part does this for non-prs?
@noblepaul @chatman it would be good to avoid special prs cases, if not required. Let me know what you think about it?

Yes, I would like to avoid as many special cases as possible. But PRS requires special handling.

Yeah. Can we look at the cases which I mentioned above?
I removed all the prs checks here and tried on playpen. So far looks good: 83d7372

I don't recommend removing those. Eventually, non-prs should go away.

I think the point here is: do we need those new changes? It would be good to understand the whole prs flow.

We have observed that FSPRSTest is failing right now.

@noblepaul did you get a chance to look?

It's a test framework issue. We can just comment out the object tracker for a while and then investigate it as a different ticket? This object tracker issue has been there for a while.

I think we are not removing the prs-state of the deleted replica, as the test deletes the replica before the failure. We need to clean up that state.

Sure @hiteshk25. I shall fix and test this ASAP.
-    if (forcePublish || sendToOverseer(coll, coreNodeName)) {
+    if (sendToOverseer(coll, coreNodeName)) {
Can you comment on this change (added with 35a511d)?
Just adding my observation here. forcePublish is an old flag that forces a publish to the overseer even if the core is closed. I don't fully understand why we added it along with the PRS changes; my guess is that we wanted to preserve the behavior of "publish to overseer no matter what" (while the exact behavior when it's both true and PRS-enabled is not well defined).

The problem with such a flag, though, is that during preRegister from ZkController on new cores (which is invoked once per core), it calls publish(cd, Replica.State.DOWN, false, true);, where forcePublish is true, and since true means always send to the overseer, even for a PRS collection it will publish such a state.json update to the overseer n times, where n is the number of cores. So from the overseer's point of view, it will receive n messages.

This is largely okay for non-PRS collections due to the "batching" behavior of ZkStateWriter, which will only write the "latest" state.json update (bound by a certain refresh interval), hence the actual write ops to ZK are minimal. However, this is NOT okay for PRS collections, as batching in ZkStateWriter is essentially disabled for PRS collections, hence n writes will be sent to ZK.

Now the removal of forcePublish works around this problem, as it will no longer submit messages for each core registration of a PRS collection (hence fewer ZK writes); however, there are some items to consider:

- We probably want to double-confirm the logic in sendToOverseer, as some messages that used to rely on the forcePublish flag to get pushed to the overseer will now have to run the logic in sendToOverseer to determine where the message should go.
- We probably want to better define the behavior of PRS collections here. This is publishing the core info; if such a core does not exist in ZK, we might want to do both (send to ZK to add an entry in state.json AND set PRS entries). For a core that does already exist in the ZK state.json, it's probably okay to just set PRS entries. The existing logic seems to want to handle the case where the collection is PRS and the replica/core is not yet in ZK (judging by the logic in sendToOverseer); however, it's probably not working well, as it does not update the PRS entries in that case. It might not be a big deal now, as the state.json for such a replica is inserted in place, for example in CreateCollectionCmd. Though we still probably want to better define the behavior to cover all cases.

And in the longer run, if we still want state.json to keep info on the "existence" of a core/replica, then we really should fix the batching for PRS collections in ZkStateWriter and probably eliminate individual direct writes on state.json for PRS collections. The latter could still be a performance issue (which I assume caused the collection creation issue in the first place?). From my brief debugging, it appears that https://github.com/fullstorydev/lucene-solr/blob/release/8.8/solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java#L282 might try to write state.json n times to ZK, where n = number of shards. A toy sketch of the gate change follows.
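A self-contained toy of the gate being discussed (the real ZkController.publish(...) is far more involved; the class and helper names here are made up for illustration):

public class PublishGateSketch {

    // Mirrors the spirit of sendToOverseer(coll, coreNodeName): state
    // messages for PRS collections should not go through the overseer queue.
    static boolean sendToOverseer(boolean isPrsCollection) {
        return !isPrsCollection;
    }

    // oldGate == true models "forcePublish || sendToOverseer(...)";
    // oldGate == false models the change in this PR.
    static String publish(boolean isPrsCollection, boolean forcePublish, boolean oldGate) {
        boolean toOverseer = oldGate
                ? (forcePublish || sendToOverseer(isPrsCollection))
                : sendToOverseer(isPrsCollection);
        return toOverseer ? "overseer queue (a state.json write per message)"
                          : "direct per-replica state update";
    }

    public static void main(String[] args) {
        // preRegister publishes DOWN with forcePublish=true once per core:
        System.out.println(publish(true, true, true));  // old gate: PRS goes to the overseer queue
        System.out.println(publish(true, true, false)); // new gate: PRS bypasses it
    }
}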
@noblepaul @chatman I think we should keep this PR for just the collection creation issue. We can revisit this separately.
> I don't fully understand why we have added that along with the PRS changes, my guess is that we wanted to preserve the behavior of "publish to os no matter what" (while the exact behavior when it's both true and PRS enabled is not well defined)

The original motivation during the PRS implementation was not to eliminate overseer messages 100%. Even if we eliminated 90% (the state messages), it was good enough. In other words, we were trying to be a little conservative (by preserving the old behavior to a little extent) to ensure correctness. Now we have reached a point where it can be eliminated 100% for efficiency and safety (from race conditions).

> This is largely okay for non PRS collection due to the "batching" behavior of ZkStateWriter which it will only write the "latest" state.json update (bound by certain refresh interval), hence the actual write ops to ZK is minimal. However, this is NOT okay for PRS collections, as batching in ZkStateWriter is essentially disabled for PRS collections, hence n number of writes will be sent to ZK.

This is true. We got rid of batching in the case of PRS collections, since messages that come to the overseer for PRS collections are few and far between (see the toy sketch below).
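A toy, self-contained illustration of that batching behavior (not the real ZkStateWriter; the class and method names are made up): repeated non-PRS updates within a flush interval collapse into one ZK write, while PRS updates are written through one by one.

import java.util.HashMap;
import java.util.Map;

public class BatchingSketch {
    private final Map<String, String> pending = new HashMap<>(); // collection -> latest queued state.json
    int zkWrites = 0;

    void enqueue(String coll, String stateJson, boolean isPrs) {
        if (isPrs) {
            zkWrites++;                    // PRS: batching disabled, write through immediately
        } else {
            pending.put(coll, stateJson);  // non-PRS: later updates overwrite earlier ones
        }
    }

    void flush() {                         // runs once per refresh interval
        zkWrites += pending.size();
        pending.clear();
    }

    public static void main(String[] args) {
        BatchingSketch nonPrs = new BatchingSketch();
        for (int i = 0; i < 10; i++) nonPrs.enqueue("c1", "v" + i, false);
        nonPrs.flush();
        System.out.println("non-PRS ZK writes: " + nonPrs.zkWrites); // 1

        BatchingSketch prs = new BatchingSketch();
        for (int i = 0; i < 10; i++) prs.enqueue("c2", "v" + i, true);
        System.out.println("PRS ZK writes: " + prs.zkWrites);        // 10
    }
}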
> Probably want to double confirm the logic in sendToOverseer as some messages that used to rely on forcePublish flag to get pushed to overseer, will now have to run the logic in sendToOverseer to determine where the message should go

We should audit sendToOverseer() again and ensure that no messages go to the overseer for PRS collections.

> We probably want to better define the behavior of PRS collection here. This is publishing the Core info. If such core does not exist in zk. we might want to do both?

This actually was legacy behavior, where a core could just start up and register itself as a replica by sending a message to the overseer. Today, if the core info does not exist in state.json, the replica should not even start.

> And in longer run, if we still want state.json to keep info of the "existence" of a core/replica, then we really should fix the batching for PRS collection in ZkStateWriter and probably eliminate individual direct writes on state.json for PRS collections.

Yes:
- No messages should go to the overseer for PRS collections for state/leader.
- All modification operations on PRS collections (e.g. CREATE, ADDREPLICA, DELETEREPLICA) should be done as single atomic operations, avoiding multiple updates (see the sketch below).
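One way to read "single atomic operation" is ZooKeeper's multi() transaction API, shown in this hedged sketch; the connect string, paths, payload, and the per-replica entry name are illustrative assumptions, not Solr's actual layout or code.

import java.util.Arrays;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Op;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class AtomicUpdateSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});
        String statePath = "/collections/c1/state.json"; // hypothetical collection

        // All-or-nothing: rewrite state.json and add a per-replica entry in a
        // single transaction, instead of several independent writes that can
        // interleave with other writers.
        zk.multi(Arrays.asList(
            Op.setData(statePath, "{\"c1\":{}}".getBytes(), -1),
            Op.create(statePath + "/core_node9:0:D",     // hypothetical PRS entry name
                null, ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)));
        zk.close();
    }
}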
> Which the latter could still be a performance issue ....

The perf issue arose from a race condition which caused the CREATE command to wait indefinitely for all replicas to be ACTIVE. So it was actually not a perf bug, but a "deadlock".
Thanks for your comments. The collection creation slowness that we observe is not due to some bug in …

This ensures that there will be no race condition at all in updating PRS.
…e in the end so that other nodes get a single refresh message (7b18df5 to 314a97c)
@noblepaul @chatman Can you please look at the following test failure; likely just a test issue, as I have merged the aggregated metrics changes.
Thank you for the explanation @noblepaul. So is the goal of this PR to achieve the state of: …

Or is this PR simply fixing the race condition for collection creation?
solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java
@@ -1701,7 +1701,7 @@ public void publish(final CoreDescriptor cd, final Replica.State state, boolean
       cd.getCloudDescriptor().setLastPublished(state);
     }
     DocCollection coll = zkStateReader.getCollection(collection);
-    if (forcePublish || sendToOverseer(coll, coreNodeName)) {
@noblepaul @chatman this change is causing test failures. The following tests are failing; please do run the Solr tests, as I feel there are some more failures:
Tests with failures [seed: AB00425489678D50]:
[junit4] - org.apache.solr.core.TestCoreContainer.testNoCores
[junit4] - org.apache.solr.core.TestCoreContainer.testDeleteBadCores
[junit4] - org.apache.solr.cloud.DeleteInactiveReplicaTest.deleteInactiveReplicaTest
[junit4] - org.apache.solr.cloud.LegacyCloudClusterPropTest.testCreateCollectionSwitchLegacyCloud
[junit4] - org.apache.solr.pkg.TestPackages.testCoreReloadingPlugin
[junit4] - org.apache.solr.core.FSPRSTest.testShardSplit
Sure @hiteshk25
…cene-solr into noble/prsCollCreationBug
…rom overseer. There is a follow up PR to avoid updates to state.json for all updates
@hiteshk25 this fixes the collection creation bug
@noblepaul @chatman here are the failing tests ...
@@ -1703,7 +1703,8 @@ public void publish(final CoreDescriptor cd, final Replica.State state, boolean
     DocCollection coll = zkStateReader.getCollection(collection);
     if (forcePublish || sendToOverseer(coll, coreNodeName)) {
       overseerJobQueue.offer(Utils.toJSON(m));
-    } else {
+    }
@noblepaul what is the motivation to remove the else part here? To publish the prs state of the replica?

Q: will solrcore update only the replica status? If so, then for a prs collection we don't need to update the state.json file.
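Context for this question (a hedged aside based on my reading of the PerReplicaStates design from SOLR-15052, not on this PR): per-replica states live as child znodes of the collection's state.json node, named roughly like core_node1:2:A (A=active, D=down, R=recovering, with a trailing :L for the leader), so flipping a replica's state does not have to rewrite state.json itself. A minimal sketch that lists those entries (connect string and path are illustrative):

import java.util.List;
import org.apache.zookeeper.ZooKeeper;

public class PrsEntriesSketch {
    public static void main(String[] args) throws Exception {
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30_000, event -> {});
        // PRS entries are children of the state.json znode.
        List<String> entries = zk.getChildren("/collections/c1/state.json", false);
        entries.forEach(System.out::println); // e.g. core_node1:2:A:L
        zk.close();
    }
}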
If the collection is PRS: …

However, we are sending the message to the overseer (when …)
I think if the collection is … Also, did you close this PR by mistake?
Yes, but that seems to cause some failures in the shard split test, and we are hunting down that problem.

Yep. I closed it accidentally.
I can see … It tries to update the … Will look at the code more closely!
No description provided.