Skip to content
This repository has been archived by the owner on Mar 29, 2024. It is now read-only.

Implement metrics for Zookeeper client #164

Open
wants to merge 16 commits into
base: release/8.8
Choose a base branch
from

Conversation

noblepaul
Copy link
Collaborator

@noblepaul noblepaul commented Jul 7, 2022

Do not commit, WIP

please let me know if we wish to add more variables

How to use?

  • start your servers
  • hit the end point http://localhost:8983/solr/admin/metrics?key=solr.node:CONTAINER.zkClient

sample output

{
  "metrics":{
    "solr.node:CONTAINER.zkClient":{
      "watchesFired":29,
      "reads":452,
      "writes":171,
      "bytesRead":1260447,
      "bytesWritten":483353,
      "multiOps":26,
      "cumulativeMultiOps":34,
      "childFetches":139,
      "cumulativeChildrenFetched":388,
      "existsChecks":427,
      "deletes":3}}}

Copy link

@magibney magibney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aside from the handful of minor comments/questions/suggestions, I wonder if a bit more consideration would be warranted wrt at a high level what we're seeking to capture with these metrics? One concern I have is that based on where the stats are actually incremented, we're currently not capturing any info about connection errors/retries (some stats are incremented before operations, some after -- a relevant distinction in the event of an exception being thrown by the actual operation; and metrics don't currently capture anything wrt retryOnConnLoss, which might be useful).

@noblepaul
Copy link
Collaborator Author

I wonder if a bit more consideration would be warranted wrt at a high level what we're seeking to capture with these metrics?

This is why I have marked this as a WIP. We need to refine the scope of this ticket and add/remove more metrics . Maybe, @hiteshk25 will be able to add more here

Copy link

@magibney magibney left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @noblepaul, this looks good. I think the remaining questions are more about the significance/utility of individual metrics (will discuss with @hiteshk25 et al).

}
metrics.reads.increment();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here!

@noblepaul
Copy link
Collaborator Author

@hiteshk25

some of these metrics are collected after the operation is done because we also need to collect data from output

for instance, when we do data reads , we need to collect both the number of reads and no:of bytes. the no:of bytes are only available after the operation. OTOH , if you are writing data to ZK , we know the payload size before hand.

@hiteshk25
Copy link

@hiteshk25

some of these metrics are collected after the operation is done because we also need to collect data from output

for instance, when we do data reads , we need to collect both the number of reads and no:of bytes. the no:of bytes are only available after the operation. OTOH , if you are writing data to ZK , we know the payload size before hand.

right. But look their main ops counter, which incremented inconsistently.

@hiteshk25
Copy link

In general, better to increment counter before their ops. Here is solr request example

@noblepaul
Copy link
Collaborator Author

noblepaul commented Jul 11, 2022

In general, better to increment counter before their ops. Here is solr request example

Ideally, I would do the increments before the operation. if we also wish to increment the result data , what choice do we have? In places where we don't need to collect output data , we just increment it in the beginning

 public byte[] getData(final String path, final Watcher watcher, final Stat stat, boolean retryOnConnLoss)
      throws KeeperException, InterruptedException {
    byte[] result = null;
    if (retryOnConnLoss) {
      result = zkCmdExecutor.retryOperation(() -> keeper.getData(path, wrapWatcher(watcher), stat));
    } else {
      result = keeper.getData(path, wrapWatcher(watcher), stat);
    }
    metrics.reads.increment();
    if (result != null) {
      metrics.bytesRead.add(result.length);
    }
    return result;
  }

look at the above method. We want to keep track of the bytesRead as well which is only available after the call

@hiteshk25
Copy link

In general, better to increment counter before their ops. Here is solr request example

Ideally, I would do the increments before the operation. if we also wish to increment the result data , what choice do we have? In places where we don't need to collect output data , we just increment it in the beginning

 public byte[] getData(final String path, final Watcher watcher, final Stat stat, boolean retryOnConnLoss)
      throws KeeperException, InterruptedException {
    byte[] result = null;
    if (retryOnConnLoss) {
      result = zkCmdExecutor.retryOperation(() -> keeper.getData(path, wrapWatcher(watcher), stat));
    } else {
      result = keeper.getData(path, wrapWatcher(watcher), stat);
    }
    metrics.reads.increment();
    if (result != null) {
      metrics.bytesRead.add(result.length);
    }
    return result;
  }

look at the above method. We want to keep track of the bytesRead as well which is only available after the call

I think those are two different metrics,

  1. how many calls
  2. How much data we fetched.
  3. (errors, which requires different metrics(zk-errors).)

@noblepaul
Copy link
Collaborator Author

@hiteshk25 The JUnit test is added . lemme know if there is anything more required

@hiteshk25
Copy link

These are the test failing in CI
https://app.circleci.com/pipelines/github/cowpaths/fs-solr/1684/workflows/0151f952-1d5c-4231-b64e-e23869575b2e/jobs/4036/parallel-runs/0/steps/0-104

   [junit4] 
   [junit4] Tests with failures [seed: AA7A22597286676]:
   [junit4]   - org.apache.solr.core.FSPRSTest (suite)
   [junit4]   - org.apache.solr.cloud.LegacyCloudClusterPropTest.testCreateCollectionSwitchLegacyCloud
   [junit4]   - org.apache.solr.handler.admin.AdminHandlersProxyTest.proxyMetricsHandlerAllNodes
   [junit4]   - org.apache.solr.metrics.reporters.solr.SolrCloudReportersTest.testDefaultP

@hiteshk25
Copy link

rebased from release/8.8, Now these tests are failing
https://app.circleci.com/pipelines/github/cowpaths/fs-solr/1692/workflows/2bdd2469-c700-4b70-ac5a-ba0d6aff24d9/jobs/4075

 Tests with failures [seed: 52AD923C32304CBE]:
   [junit4]   - org.apache.solr.pkg.TestPackages.testCoreReloadingPlugin
   [junit4]   - org.apache.solr.handler.admin.AdminHandlersProxyTest.proxyMetricsHandlerAllNodes
   [junit4]   - org.apache.solr.cloud.LegacyCloudClusterPropTest.testCreateCollectionSwitchLegacyCloud
   [junit4]   - org.apache.solr.cloud.LeaderTragicEventTest.testLeaderFailsOver
   [junit4]   - org.apache.solr.cloud.TestWithCollection.testNodeAdded
   [junit4]   - org.apache.solr.handler.RequestHandlerMetricsTest.testAggregateNodeLevelMetrics

@hiteshk25
Copy link

@noblepaul @chatman we need to look above test failures

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants