Skip to content
This repository has been archived by the owner on Mar 29, 2024. It is now read-only.

Test listeners for state.json and prs state update #165

Open
wants to merge 10 commits into
base: release/8.8
Choose a base branch
from

Conversation

hiteshk25
Copy link

@hiteshk25 hiteshk25 commented Jul 13, 2022

  1. OverseerMessageListener for overseer update of state.json
  2. PrsUpdateListener for prs state update
  3. Added test for prsCollection creation

Eventually need to write test for each prs workflow https://docs.google.com/document/d/17fCMTMV3P1XoB8HatbVsuwavuOorAVLdOinCR6k_HYs/edit

@@ -104,6 +104,8 @@ public class Overseer implements SolrCloseable {

private static final Logger log = LoggerFactory.getLogger(MethodHandles.lookup().lookupClass());

public volatile OverseerMessageListener testOverseerMessageListener = null;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏼 nice to validate overseer activities to catch unexpected calls. I assume this is only added temporarily for verification purpose?

Also we might want to consider to go even lower level (sub in a testing Zk client), as some of the existing zk ops on state.json is done directly to ZK bypassing this overseer processing. Having such test zk client can even cast a wider net to ensure we don't miss any sneaky 🐟

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Do we see any issue if that remains there (or any other way we can capture this)? We want to write tests for each prs-workflow (captured in other doc)
  2. As of now, we want all state.json update from overseer!

Copy link

@patsonluk patsonluk Jul 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I don't think it's a huge issue, personally I would rather not see any testing code in the execution flow if possible. If such listener is useful, perhaps we can consider introducing it as an actual listener (non-test) with proper setter/getter method/init in constructor as final field etc
  2. Yes, adding such listener might be sufficient to catch most state.json updates. However, there are also direct writes of state.json, such as this that bypasses both Overseer and ZkStateReader. Therefore we might want to go lower level at ZkClient so we do not miss any of those writes. It might be critical to ensure:
    1. No direct writes to state.json outside of overseer
    2. No writes of state.json in any form - whether via overseer the class or not, from data node

So we might need ZkClient level interceptor as well to verify the above

@hiteshk25 hiteshk25 changed the title WIP: Test listeners for state.json and prs state update Test listeners for state.json and prs state update Aug 3, 2022
@hiteshk25 hiteshk25 force-pushed the hitesh/prsStateInstrumentation branch from 8af9e51 to bcd4e5c Compare August 4, 2022 00:02
@hiteshk25 hiteshk25 force-pushed the hitesh/prsStateInstrumentation branch from bcd4e5c to f7b4f2c Compare August 29, 2022 22:20
@hiteshk25 hiteshk25 changed the base branch from noble/prsCollCreationBug to release/8.8 August 29, 2022 22:40
@@ -80,6 +80,7 @@

public class ZkStateReader implements SolrCloseable {
public static final int STATE_UPDATE_DELAY = Integer.getInteger("solr.OverseerStateUpdateDelay", 2000); // delay between cloud state updates
public volatile PerReplicaStateUpdateListener testPerReplicaStateUpdateListener = null;
Copy link

@patsonluk patsonluk Aug 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we can move this to PerReplicaStatesOps as a private static volatile PerReplicaStateUpdateListener listener; add a public static setter method there, then we don't need to modify the method signature of PerReplicaStatesOps#persist (and all the callers 😊 ). In PerReplicaStatesOps#persist we can simply check listener != null as you have done similarly in ClusterStateUpdater with testOverseerMessageListener ?

We can then set it PerReplicaStatesOps.setListener(...) from FSPRSTest#testPRSCollectionCreation instead, which makes it default as null for normal execution flow? 😄

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ZkStateReader is singleton class at coreContainer level. that's why I kept variable at that level. I didn't want static variable as we create multiple coreContainer in test.

Copy link

@patsonluk patsonluk Aug 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static variable is also guaranteed a singleton (within the same classloader - well same as ZkStateReader singleton anyway)and it's closer to where it's used (PerReplicaStateOps), which method persist is defined.

The major benefit of that is it avoids any method signature change, hence we do not have to change any of the 5 invocations 😊 . And it's also consistent with your Overseer change with the user of such listener is defined at the user level and no need for method signature changes for the actual invocation.

Copy link

@patsonluk patsonluk Aug 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah i see what u mean now! 💡 since in ur test cases u started 2 JettySolrRunner so u r setting it per runner/coreContainer, as it is necessary to make such differentiation even within the same classloader/JVM!

Can we embed such testPerReplicaStateUpdateListener into the ZkClient instead? My concern is that the new method signature public void persist(String znode, SolrZkClient zkClient, PerReplicaStateUpdateListener stateUpdateListener) is not entirely clear on how to obtain such stateUpdateListener - dev probably wouldn't know there's an instance embedded in ZkStateReader that easily. If we can hide that as much as we can (put it in zkClient), then dev probably wouldn't need to deal with that? 😊

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Let's keep ZkClient for separate PR if require (Or we can use zkversion of node in some cases ).
  2. I'm not sure it is using classloader. I think it just creating different instances
  3. ok, persist api issue. This is api argument to listen the events. Unfortunately, I'm not finding any better way to solve this issue as PerReplicaStatesOps instance is created for each update.

Copy link

@patsonluk patsonluk Aug 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting such listener as a part of ZkClient should solve 3. here? TBH, ZkClient has already provided whole bunch of other "info" so adding a PRS op listener shouldn't be too bad? 😅
image

And we can even keep the current setter in ZkStateReader but instead of setting it to the current field, simply do zkClient.setPrsListener(...) ?

.process(cluster.getSolrClient());
cluster.waitForActiveCollection(COLL, 1, 1);

assertTrue("overseer node should not update prs state " + overseerNodePrsListener, overseerNodePrsListener.events.size() == 1);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment here to indicate that it is still being updated by overseer node for now? otherwise reader might get confused with the error message and the actual check condition (which does allow 1 update)

cluster.waitForActiveCollection(COLL, 1, 1);

assertTrue("overseer node should not update prs state " + overseerNodePrsListener, overseerNodePrsListener.events.size() == 1);
assertTrue("data node should have 3 prs state update " + dataNodePrsListener , dataNodePrsListener.events.size() == 3);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice that we can compare the data node update vs os node update now! 💪🏼

This has major advantage over my other test testZkNodeVersions which checks the znode version and child version but does not know who updates it.

We can consider removing testZkNodeVersions in the future. However, there's one thing that testZkNodeVersions catches while this test case does not - there are direct updates on state.json performed by overseer node, which NOT caught by dataNodeOverseerListener as it does not go through the overseer q 😞 (so it's actually 2 updates to state.json from overseer instead of 1). This is the update : https://github.com/fullstorydev/lucene-solr/blob/release/8.8/solr/core/src/java/org/apache/solr/cloud/api/collections/CreateCollectionCmd.java#L323 . in order to catch it we might need to instrument even lower level (zkclient) as discussed earlier

Copy link

@patsonluk patsonluk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

As discussed, changing method signature persist is hard to get around unless we do refactoring, for example having a PerReplicaStatesOps factory.

We can probably improve those in the future!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants