
Overseer node no longer updates PRS state #157

Open · wants to merge 2 commits into release/8.8

Conversation

@hiteshk25 commented Jun 17, 2022

In general, PRS state should be updated by the data node only. I tested this fix 10 times on playpen and was unable to reproduce the collection creation issue with PRS. Without the fix, I was able to reproduce it 2 out of 3 times.

We need to look more thoroughly through the PRS-related code.

@noblepaul (Collaborator)

Hi @hiteshk25. The fix looks correct.

However, I tested it and I still see the timeout. Maybe there are other places that we need to fix. I'll do further testing.

…e in the end so that other nodes get a single refresh message
@patsonluk commented Jun 27, 2022

Cool to see that we are calling the update once per collection instead of per replica!

Some minor thoughts (not directly related to this PR) for @noblepaul :

  1. I think you might have mentioned during the meeting that this one update for all replicas can be applied to non-PRS collections too? Or perhaps I got it wrong 😓 Sorry, I read this PR before Noble/prs coll creation bug #158, and @hiteshk25 has already raised a similar question there!
  2. If I understand correctly, before PRS all state.json updates were handled by the overseer state updater queue (except our uberseer, but within Java Solr it's all done by that single-threaded state updater), i.e. ocmh.overseer.offerStateUpdate. For PRS, we started doing direct updates via ocmh.zkStateReader.getZkClient().create. Would it be safer to stick with the single-threaded overseer queue to avoid any race conditions? I assume there are reasons for using direct ZK state.json modifications for PRS instead?
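To make the two paths concrete, here is a minimal sketch (not code from this PR; the class layout and exact signatures are simplified and assumed from the identifiers mentioned above):

```java
// Sketch of the two state.json update paths discussed in this thread.
// Identifiers come from the conversation; signatures are simplified/assumed.
import org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler;
import org.apache.solr.common.cloud.ZkNodeProps;
import org.apache.solr.common.cloud.ZkStateReader;
import org.apache.solr.common.util.Utils;
import org.apache.zookeeper.CreateMode;

class StateUpdatePathsSketch {

  // Path 1 (pre-PRS): enqueue the change; the single-threaded overseer state
  // updater (Overseer$ClusterStateUpdater) later applies it to state.json.
  static void viaOverseerQueue(OverseerCollectionMessageHandler ocmh, String collection) throws Exception {
    ZkNodeProps message = new ZkNodeProps(
        "operation", "state",                       // queue operation name, simplified
        ZkStateReader.COLLECTION_PROP, collection);
    ocmh.overseer.offerStateUpdate(Utils.toJSON(message));
  }

  // Path 2 (PRS): create the per-replica state znode directly, bypassing the
  // overseer queue; the overseer's in-memory ClusterState only catches up
  // once a refresh message is processed.
  static void viaDirectZkWrite(OverseerCollectionMessageHandler ocmh, String prsZnodePath) throws Exception {
    ocmh.zkStateReader.getZkClient().create(prsZnodePath, null, CreateMode.PERSISTENT, true);
  }
}
```

The difference that matters here is only that path 1 funnels every write through one queue consumer, while path 2 writes to ZK from whichever thread happens to call it.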

@noblepaul (Collaborator)

For PRS, we started doing direct updates via ocmh.zkStateReader.getZkClient().create. Would it be safer to stick with the single-threaded overseer queue to avoid any race conditions? I assume there are reasons for using direct ZK state.json modifications for PRS instead?

Even the updates to ZK done by ZkStateWriter happen in the same thread. However, none of the operations update ZK directly, because the in-memory copy of ClusterState in the overseer can have stale data. In the create-collection case we do not have to worry about that, because there is no "stale" state. In the PRS impl, we already force the overseer to refresh the in-memory version of the cluster state.

I think you might have mentioned during the meeting that this one update for all replicas can be applied to non PRS collections too?

This is actually true. We can use the same mechanism for non-PRS collections too. As PRS was newly introduced, we didn't want to affect the critical path.

@noblepaul closed this Jun 28, 2022
@noblepaul reopened this Jun 28, 2022
@patsonluk commented Jun 28, 2022

Thank you @noblepaul! I want to make sure I understand your answer correctly: 😊

Even the updates to ZK done by ZkStateWriter happen in the same thread.

I assume you are referring to the underlying SolrZkClient (shared by Overseer, ZkStateReader, ZkStateWriter, etc.), which goes SolrZkClient -> SolrZooKeeper -> ZooKeeper, where ZooKeeper is thread-safe?

However, none of the operations update ZK directly, because the in-memory copy of ClusterState in the overseer can have stale data.

I can see that when state updates are guarded by Overseer$ClusterStateUpdater, it can keep the ClusterState copy held in the run-loop variable of Overseer$ClusterStateUpdater and the one in ZkStateWriter in sync. However, there is also PRS code that makes calls on SolrZkClient directly, which bypasses Overseer$ClusterStateUpdater and even ZkStateWriter. I can see that we do already attempt to refresh the cluster state later on in Overseer$ClusterStateUpdater using RefreshCollectionMessage, as mentioned in your next comment. This should work as intended, but it does demonstrate that using SolrZkClient directly can induce inconsistent states, even momentarily, and that's why we need to force a refresh. 😓
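As a purely hypothetical illustration of that window (assumed names and znode path, not code from this PR):

```java
// Hypothetical illustration of the momentary inconsistency described above
// (assumed names and znode path; not code from this PR).
import org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler;
import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.zookeeper.CreateMode;

class StaleOverseerViewSketch {

  // A direct write lands in ZK, bypassing ClusterStateUpdater and ZkStateWriter...
  static void directWrite(OverseerCollectionMessageHandler ocmh, byte[] stateJson) throws Exception {
    ocmh.zkStateReader.getZkClient()
        .create("/collections/coll-A/state.json", stateJson, CreateMode.PERSISTENT, true);
  }

  // ...but the overseer's cached ClusterState does not reflect it until the
  // forced RefreshCollectionMessage has been processed.
  static boolean visibleToOverseer(ClusterState overseerInMemoryState) {
    DocCollection coll = overseerInMemoryState.getCollectionOrNull("coll-A");
    return coll != null; // can still be false even though the znode already exists
  }
}
```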

In the create-collection case we do not have to worry about that, because there is no "stale" state. In the PRS impl, we already force the overseer to refresh the in-memory version of the cluster state.

There's no bad state for a collection that does not exist (if I read it correctly), but would the cluster state mistakenly regard a collection as non-existent while it has been created elsewhere using SolrZkClient#create? Let's not drill down into every single case though! My questions were to bring up some discussion of the current approach, and your replies definitely help my thought process!

If I understand correctly, using direct SolrZkClient access for state.json-related updates might create certain inconsistencies (between the actual state of the ZK node and what ClusterStateUpdater/ZkStateWriter has), which are then addressed by a forced refresh. This probably works fine as is, though the code logic might become more susceptible to potential race conditions such as a delayed forced refresh (momentarily inconsistent), or even a missed forced refresh due to unexpected conditions (exceptions, developer mistakes, etc.).

Are there any strong reasons why we cannot use Overseer$ClusterStateUpdater for PRS state updates? Is it because we want to make sure the state.json node exists (created by the overseer) before any of the PRS entries are added (possibly from a data node) to that state.json node? Following the same pattern as non-PRS (which uses Overseer$ClusterStateUpdater) seems like the safer option if possible? 🤔

@noblepaul (Collaborator)

but would the cluster state mistakenly regard a collection as non-existent, while it has been created elsewhere using SolrZkClient#create?

Look at the following set of ops:

  1. create coll (coll-A) command sent to overseer with perReplicaState=true
    1.a) overseer writes the state.json to ZK
  2. client sends a command to overseer to perform some operation on (coll-A); it errors out with "no collection"
    1.c) create collection command sends a refresh collection message for coll-A and returns
    1.d) overseer processes the refresh collection message; coll-A is now visible to the overseer and all subsequent commands
  3. client sends a command to overseer to perform some operation on coll-A. This command is queued up after the refresh-coll command and it succeeds

In the above example, until 1.d is done the client should not assume that coll-A exists or that the collection creation succeeded. Ideally, operation #2 should fail and #3 should succeed.
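A hypothetical sketch of that ordering (only RefreshCollectionMessage and the direct ZK create come from this thread; the Overseer#submit-style call and the state.json path are assumed for illustration):

```java
// Hypothetical sketch of the ordering above; step labels in the comments
// refer to the list in this comment. Names/paths are assumed for illustration.
import org.apache.solr.cloud.RefreshCollectionMessage;
import org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler;
import org.apache.zookeeper.CreateMode;

class CreateCollectionOrderingSketch {

  static void createPrsCollection(OverseerCollectionMessageHandler ocmh, byte[] stateJson) throws Exception {
    // 1.a) the create-collection command writes state.json for coll-A directly to ZK
    ocmh.zkStateReader.getZkClient()
        .create("/collections/coll-A/state.json", stateJson, CreateMode.PERSISTENT, true);

    // Any client command on coll-A queued before the refresh below (op #2) can
    // still fail with "no collection", because the overseer's in-memory
    // ClusterState has not been refreshed yet.

    // 1.c) enqueue a single refresh for coll-A as the last step, then return
    ocmh.overseer.submit(new RefreshCollectionMessage("coll-A"));

    // 1.d) once the overseer processes the refresh, coll-A becomes visible to
    //      the overseer and to all subsequently queued commands (op #3 succeeds)
  }
}
```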

For non-PRS collections, it sends as many add-replica ops as there are replicas. That is sub-optimal.

However, there is a new way to update the internal cluster state without relying on a message, as is done in MODIFY.

@patsonluk commented Jun 29, 2022

Thank you for the explanations!

The flow appears to be simpler for non-PRS collections, where all state.json updates are taken care of by a single thread and a single updater (Overseer$ClusterStateUpdater). Once we have other code logic that can directly modify state.json outside of that updater, we need more handling/workarounds (for example, forcing a sync), and there could be unforeseen edge cases (race conditions, etc.).

@noblepaul it would be great for me to understand a bit more about the higher-level design regarding:

Are there any strong reasons why we cannot use Overseer$ClusterStateUpdater for PRS state updates? Is it because we want to make sure the state.json node exists (created by the overseer) before any of the PRS entries are added (possibly from a data node) to that state.json node? Following the same pattern as non-PRS (which uses Overseer$ClusterStateUpdater) seems like the safer option if possible? 🤔
