
Overseer node no longer updates PRS state #157

Open · wants to merge 2 commits into release/8.8

Conversation

@hiteshk25 commented Jun 17, 2022

In general, PRS state should be updated by the data node only. I tested this fix 10 times on playpen and was unable to reproduce the collection creation issue with PRS. Without the fix, I was able to reproduce it 2 out of 3 times.

We need to look more thoroughly through the PRS-related code.

@noblepaul (Collaborator)

Hi @hiteshk25. The fix looks correct.

However, I tested it and I still see the timeout. Maybe there are other places that we need to fix. I'll do further testing.

…e in the end so that other nodes get a single refresh message
@patsonluk commented Jun 27, 2022

Cool to see that we are calling the update once per collection instead of per replica!

Some minor thoughts (not directly related to this PR) for @noblepaul :

  1. I think you might have mentioned during the meeting that this one update for all replicas can be applied to non-PRS collections too? Or perhaps I got it wrong 😓 Sorry, I read this PR before Noble/prs coll creation bug #158, and @hiteshk25 has already raised a similar question there!
  2. If I understand correctly, before PRS all state.json updates were handled by the overseer state updater queue (except our uberseer, but within Java Solr it's all done by that single-threaded state updater), i.e. ocmh.overseer.offerStateUpdate. For PRS, we started doing direct updates via ocmh.zkStateReader.getZkClient().create. Would it be safer to stick with the single-threaded overseer queue to avoid any race conditions? I assume there are reasons for using direct ZK state.json modifications for PRS instead?
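To make the two paths concrete, here is a minimal sketch (not code from this PR; the class layout and exact signatures are simplified and assumed from the identifiers mentioned above):

```java
// Sketch of the two state.json update paths discussed in this thread.
// Identifiers come from the conversation; signatures are simplified/assumed.
import org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler;
import org.apache.solr.common.cloud.ZkNodeProps;
import org.apache.solr.common.cloud.ZkStateReader;
import org.apache.solr.common.util.Utils;
import org.apache.zookeeper.CreateMode;

class StateUpdatePathsSketch {

  // Path 1 (pre-PRS): enqueue the change; the single-threaded overseer state
  // updater (Overseer$ClusterStateUpdater) later applies it to state.json.
  static void viaOverseerQueue(OverseerCollectionMessageHandler ocmh, String collection) throws Exception {
    ZkNodeProps message = new ZkNodeProps(
        "operation", "state",                       // queue operation name, simplified
        ZkStateReader.COLLECTION_PROP, collection);
    ocmh.overseer.offerStateUpdate(Utils.toJSON(message));
  }

  // Path 2 (PRS): create the per-replica state znode directly, bypassing the
  // overseer queue; the overseer's in-memory ClusterState only catches up
  // once a refresh message is processed.
  static void viaDirectZkWrite(OverseerCollectionMessageHandler ocmh, String prsZnodePath) throws Exception {
    ocmh.zkStateReader.getZkClient().create(prsZnodePath, null, CreateMode.PERSISTENT, true);
  }
}
```

The difference that matters here is only that path 1 funnels every write through one queue consumer, while path 2 writes to ZK from whichever thread happens to call it.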

@noblepaul (Collaborator)

For PRS, we started doing direct updates via ocmh.zkStateReader.getZkClient().create. Would it be safer to stick with the single-threaded overseer queue to avoid any race conditions? I assume there are reasons for using direct ZK state.json modifications for PRS instead?

Even the updates to ZK done by ZkStateWriter happen in the same thread. However, none of the operations update ZK directly, because the in-memory copy of ClusterState in the overseer can have stale data. In the create-collection case we do not have to worry about that, because there is no "stale" state. In the PRS impl, we already force the overseer to refresh the in-memory version of the cluster state.

I think you might have mentioned during the meeting that this one update for all replicas can be applied to non PRS collections too?

This is actually true. We can use the same mechanism for non-PRS collections too. As PRS was newly introduced, we didn't want to affect the critical path.

@noblepaul closed this Jun 28, 2022
@noblepaul reopened this Jun 28, 2022
@patsonluk commented Jun 28, 2022

Thank you @noblepaul! I want to make sure I understand your answer correctly: 😊

Even the updates to ZK done by ZkStateWriter happen in the same thread.

I assume you are referring to the underlying SolrZkClient (shared by Overseer, ZkStateReader, ZkStateWriter, etc.), which goes SolrZkClient -> SolrZooKeeper -> ZooKeeper, where ZooKeeper is thread-safe?

However, none of the operations update ZK directly, because the in-memory copy of ClusterState in the overseer can have stale data.

I can see that when state updates are guarded by Overseer$ClusterStateUpdater, it can keep the ClusterState copy held in the run-loop variable of Overseer$ClusterStateUpdater and the one in ZkStateWriter in sync. However, there is also PRS code that makes calls on SolrZkClient directly, which bypasses Overseer$ClusterStateUpdater and even ZkStateWriter. I can see that we do already attempt to refresh the cluster state later on in Overseer$ClusterStateUpdater using RefreshCollectionMessage, as mentioned in your next comment. This should work as intended, but it does demonstrate that using SolrZkClient directly can induce inconsistent states, even momentarily, and that's why we need to force a refresh. 😓
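As a purely hypothetical illustration of that window (assumed names and znode path, not code from this PR):

```java
// Hypothetical illustration of the momentary inconsistency described above
// (assumed names and znode path; not code from this PR).
import org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.zookeeper.CreateMode;

class StaleOverseerViewSketch {

  // A direct write lands in ZK, bypassing ClusterStateUpdater and ZkStateWriter...
  static void directWrite(OverseerCollectionMessageHandler ocmh, byte[] stateJson) throws Exception {
    ocmh.zkStateReader.getZkClient()
        .create("/collections/coll-A/state.json", stateJson, CreateMode.PERSISTENT, true);
  }

  // ...but the overseer's cached ClusterState does not reflect it until the
  // forced RefreshCollectionMessage has been processed.
  static boolean visibleToOverseer(ClusterState overseerInMemoryState) {
    DocCollection coll = overseerInMemoryState.getCollectionOrNull("coll-A");
    return coll != null; // can still be false even though the znode already exists
  }
}
```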

In the create-collection case we do not have to worry about that, because there is no "stale" state. In the PRS impl, we already force the overseer to refresh the in-memory version of the cluster state.

There's no bad state for a collection that does not exist (if I read it correctly), but would the cluster state mistakenly regard a collection as non-existent while it has been created elsewhere using SolrZkClient#create? Let's not drill down into every single case though! My questions were to bring up some discussion of the current approach, and your replies definitely help my thought process!

If I understand correctly, using direct SolrZkClient access for state.json-related updates might create certain inconsistencies (between the actual state of the ZK node and what ClusterStateUpdater/ZkStateWriter has), which are then addressed by a forced refresh. This probably works fine as is, though the code logic might become more susceptible to potential race conditions such as a delayed forced refresh (momentarily inconsistent), or even a missed forced refresh due to unexpected conditions (exceptions, developer mistakes, etc.).

Are there any strong reasons why we cannot use Overseer$ClusterStateUpdater for PRS state updates? Is it because we want to make sure the state.json node exists (created by the overseer) before any of the PRS entries are added (possibly from a data node) to that state.json node? Following the same pattern as non-PRS (which uses Overseer$ClusterStateUpdater) seems like the safer option if possible? 🤔

@noblepaul (Collaborator)

but would the cluster state mistakenly regard a collection as non-existent, while it has been created elsewhere using SolrZkClient#create?

Look at the following set of ops:

  1. create coll (coll-A) command sent to overseer with perReplicaState=true
    1.a) overseer writes the state.json to ZK
  2. client sends a command to overseer to perform some operation on (coll-A); it errors out with "no collection"
    1.c) create collection command sends a refresh collection message for coll-A and returns
    1.d) overseer processes the refresh collection message; coll-A is now visible to the overseer and all subsequent commands
  3. client sends a command to overseer to perform some operation on coll-A. This command is queued up after the refresh-coll command and it succeeds

In the above example, until 1.d is done the client should not assume that coll-A exists or that the collection creation succeeded. Ideally, operation #2 should fail and #3 should succeed.
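A hypothetical sketch of that ordering (only RefreshCollectionMessage and the direct ZK create come from this thread; the Overseer#submit-style call and the state.json path are assumed for illustration):

```java
// Hypothetical sketch of the ordering above; step labels in the comments
// refer to the list in this comment. Names/paths are assumed for illustration.
import org.apache.solr.cloud.RefreshCollectionMessage;
import org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler;
import org.apache.zookeeper.CreateMode;

class CreateCollectionOrderingSketch {

  static void createPrsCollection(OverseerCollectionMessageHandler ocmh, byte[] stateJson) throws Exception {
    // 1.a) the create-collection command writes state.json for coll-A directly to ZK
    ocmh.zkStateReader.getZkClient()
        .create("/collections/coll-A/state.json", stateJson, CreateMode.PERSISTENT, true);

    // Any client command on coll-A queued before the refresh below (op #2) can
    // still fail with "no collection", because the overseer's in-memory
    // ClusterState has not been refreshed yet.

    // 1.c) enqueue a single refresh for coll-A as the last step, then return
    ocmh.overseer.submit(new RefreshCollectionMessage("coll-A"));

    // 1.d) once the overseer processes the refresh, coll-A becomes visible to
    //      the overseer and to all subsequently queued commands (op #3 succeeds)
  }
}
```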

For non-PRS collections, it sends as many add-replica ops as there are replicas. That is sub-optimal.

However, there is a new way to update the internal cluster state without relying on a message, as is done in MODIFY.

@patsonluk commented Jun 29, 2022

Thank you for the explanations!

The flow appears to be simpler for non-PRS collections, where all state.json updates are taken care of by a single thread and a single updater (Overseer$ClusterStateUpdater). Once we have other code logic that can directly modify state.json outside of that updater, we need more handling/workarounds (for example, forcing a sync), and there could be unforeseen edge cases (race conditions, etc.).

@noblepaul it would be great for me to understand a bit more about the higher-level design regarding:

Are there any strong reasons why we cannot use Overseer$ClusterStateUpdater for PRS state updates? Is it because we want to make sure the state.json node exists (created by the overseer) before any of the PRS entries are added (possibly from a data node) to that state.json node? Following the same pattern as non-PRS (which uses Overseer$ClusterStateUpdater) seems like the safer option if possible? 🤔
