Raise alert for replicas in ERROR state when Replica Group is configured in table Config #14581

Open
wants to merge 26 commits into
base: master


deepthi912
Collaborator

@deepthi912 deepthi912 commented Dec 2, 2024

Context:
We already alert on the percentage of healthy replicas. With replica-group-based assignment, however, and in particular when the strictReplicaGroup config is enabled for upsert/dedup tables, the PERCENT_OF_REPLICAS metric alone makes it hard to detect that a table has gone into a bad state: different segments going down in different replica groups can leave the table with low availability.

This change adds a map from server to replica group ID and then checks the state of each server in the external view of the segments. A second map tracks replica group ID -> alive/dead, which determines the final status of each replica group. The result is reported as the percentage of replica groups that are live (both COMPLETED and CONSUMING are considered).

Example:
RG1 -> 1, 2, 3
RG2 -> 1, 2, 3

If segment 1 in RG1 and segment 2 in RG2 are both in ERROR state, neither replica group has all of its segments healthy. This can cause issues when fetching data from the servers under the strictReplicaGroup setting in upsert/dedup tables.
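
The example above can be sketched in a few lines of Java. This is only an illustration of the idea described in this PR, not the code in SegmentStatusChecker: the class name, method name, and the plain-map representation of the external view are all assumptions made for the sketch. It also simplifies by treating only ONLINE and CONSUMING as healthy states and by ignoring segments that are missing from the external view altogether.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Pinot.
final class ReplicaGroupHealthSketch {
  private ReplicaGroupHealthSketch() {
  }

  /**
   * Returns the percentage of replica groups in which no segment replica is unhealthy.
   *
   * @param serverToReplicaGroup server instance -> replica group id (assumed input shape)
   * @param externalView segment -> (server instance -> state), e.g. "ONLINE", "CONSUMING", "ERROR"
   */
  static int percentOfHealthyReplicaGroups(Map<String, Integer> serverToReplicaGroup,
      Map<String, Map<String, String>> externalView) {
    Set<Integer> allGroups = new HashSet<>(serverToReplicaGroup.values());
    if (allGroups.isEmpty()) {
      // No replica groups configured: nothing can be unhealthy (assumption for this sketch).
      return 100;
    }
    Set<Integer> unhealthyGroups = new HashSet<>();
    for (Map<String, String> serverStates : externalView.values()) {
      for (Map.Entry<String, String> entry : serverStates.entrySet()) {
        Integer group = serverToReplicaGroup.get(entry.getKey());
        if (group == null) {
          continue; // server is not part of any known replica group
        }
        String state = entry.getValue();
        // One bad segment replica marks the whole replica group as unhealthy.
        if (!"ONLINE".equals(state) && !"CONSUMING".equals(state)) {
          unhealthyGroups.add(group);
        }
      }
    }
    return (allGroups.size() - unhealthyGroups.size()) * 100 / allGroups.size();
  }
}
```

With one server per replica group (s1 in RG1, s2 in RG2), segment 1 in ERROR on s1, and segment 2 in ERROR on s2, the sketch returns 0, even though every segment still has one healthy replica somewhere. That gap between per-segment replica health and per-group health is what the new metric is meant to surface.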

Solution:
Added a PERCENT_OF_REPLICA_GROUPS metric to track these cases when instance partitions are configured through replicaGroupPartitionConfig.
If the config below is not specified, the metric is currently not emitted:

  "instanceAssignmentConfigMap": {
    "CONSUMING": {
      ...
      "replicaGroupPartitionConfig": {
        "replicaGroupBased": true,
        "numReplicaGroups": 3,
        "numInstancesPerReplicaGroup": 4
      }
    }
  }

TODO: Add unit tests.

@codecov-commenter

codecov-commenter commented Dec 2, 2024

Codecov Report

Attention: Patch coverage is 71.73913% with 13 lines in your changes missing coverage. Please review.

Project coverage is 63.80%. Comparing base (59551e4) to head (7592e72).
Report is 1528 commits behind head on master.

Files with missing lines Patch % Lines
...e/pinot/controller/helix/SegmentStatusChecker.java 64.86% 7 Missing and 6 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master   #14581      +/-   ##
============================================
+ Coverage     61.75%   63.80%   +2.05%     
- Complexity      207     1607    +1400     
============================================
  Files          2436     2703     +267     
  Lines        133233   150778   +17545     
  Branches      20636    23307    +2671     
============================================
+ Hits          82274    96208   +13934     
- Misses        44911    47357    +2446     
- Partials       6048     7213    +1165     
Flag Coverage Δ
custom-integration1 100.00% <ø> (+99.99%) ⬆️
integration 100.00% <ø> (+99.99%) ⬆️
integration1 100.00% <ø> (+99.99%) ⬆️
integration2 0.00% <ø> (ø)
java-11 63.78% <71.73%> (+2.07%) ⬆️
java-21 63.70% <71.73%> (+2.07%) ⬆️
skip-bytebuffers-false 63.80% <71.73%> (+2.06%) ⬆️
skip-bytebuffers-true 34.16% <71.73%> (+6.43%) ⬆️
temurin 63.80% <71.73%> (+2.05%) ⬆️
unittests 63.80% <71.73%> (+2.05%) ⬆️
unittests1 56.22% <11.11%> (+9.32%) ⬆️
unittests2 34.17% <71.73%> (+6.44%) ⬆️


InstancePartitionsType.CONSUMING.toString()));
String completedPath = ZKMetadataProvider.constructPropertyStorePathForInstancePartitions(
InstancePartitionsUtils.getInstancePartitionsName(tableNameWithType,
InstancePartitionsType.COMPLETED.toString()));
Collaborator Author

ONLINE COMPLETED segments are also considered for real-time tables.

@Jackie-Jiang
Contributor

Can you take a look at #14536 and see if you are trying to solve the same problem?

@deepthi912
Collaborator Author

deepthi912 commented Jan 2, 2025

#14536 addresses the issue of having a different number of replicas for OFFLINE and CONSUMING segments, whereas I am adding a metric to alert on the percentage of replica groups that are in a good state. My change would just depend on that PR.

4 participants