Raise alert for replicas in ERROR state when Replica Group is configured in table Config #14581

deepthi912 · 2024-12-02T23:15:32Z

Context:
We already have an alert on the percentage of healthy replicas. For tables that use replica groups, however, and in particular for upsert/dedup tables with the strictReplicaGroup config enabled, the PERCENT_OF_REPLICAS metric alone makes it hard to detect that a table has gone into a bad state: different segments going down in different replica groups can leave the table with low availability.

Added a map from server to replica group id, then checked the status of each server from the external view of the segments. Added a second map of replica group id -> alive/dead, which determines the final status of each replica group, so that the percent of replica groups that are live can be reported (both COMPLETED and CONSUMING are considered).
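
The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the class `ReplicaGroupHealth`, the method `percentOfLiveReplicaGroups`, and the plain `Map` inputs are hypothetical stand-ins for the instance partitions and the external view, and treating only the ERROR state as "dead" is an assumption taken from the PR title.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.OptionalInt;

/** Hypothetical helper for illustration only; not part of SegmentStatusChecker. */
final class ReplicaGroupHealth {
  private ReplicaGroupHealth() {
  }

  /**
   * @param serverToReplicaGroup server instance -> replica group id
   * @param externalView segment name -> (server instance -> segment state)
   * @return percent (0-100) of replica groups with no segment in ERROR state,
   *         or empty when no replica groups are known (nothing to emit)
   */
  static OptionalInt percentOfLiveReplicaGroups(Map<String, Integer> serverToReplicaGroup,
      Map<String, Map<String, String>> externalView) {
    // Start with every known replica group marked alive.
    Map<Integer, Boolean> replicaGroupAlive = new HashMap<>();
    for (Integer replicaGroupId : serverToReplicaGroup.values()) {
      replicaGroupAlive.put(replicaGroupId, true);
    }
    if (replicaGroupAlive.isEmpty()) {
      return OptionalInt.empty();
    }
    // Mark a replica group dead as soon as one of its servers reports ERROR for any segment.
    for (Map<String, String> serverStates : externalView.values()) {
      for (Map.Entry<String, String> entry : serverStates.entrySet()) {
        Integer replicaGroupId = serverToReplicaGroup.get(entry.getKey());
        if (replicaGroupId == null) {
          continue; // server is not part of the instance partitions
        }
        if ("ERROR".equals(entry.getValue())) { // assumption: only ERROR counts as unhealthy
          replicaGroupAlive.put(replicaGroupId, false);
        }
      }
    }
    long alive = replicaGroupAlive.values().stream().filter(Boolean::booleanValue).count();
    return OptionalInt.of((int) (alive * 100 / replicaGroupAlive.size()));
  }

  public static void main(String[] args) {
    Map<String, Integer> serverToReplicaGroup = Map.of("server1", 0, "server2", 1);
    Map<String, Map<String, String>> externalView = Map.of(
        "seg1", Map.of("server1", "ERROR", "server2", "ONLINE"),
        "seg2", Map.of("server1", "ONLINE", "server2", "ONLINE"));
    // Replica group 0 has an ERROR segment, replica group 1 does not.
    System.out.println(percentOfLiveReplicaGroups(serverToReplicaGroup, externalView).getAsInt());
  }
}
```

Returning an empty `OptionalInt` when no replica groups are known mirrors the behaviour described in this PR, where the metric is simply not emitted if instance partitions are not configured.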

Example:
RG1 -> 1, 2, 3
RG2 -> 1, 2, 3

If segment 1 in RG1 and segment 2 in RG2 are in ERROR state, neither replica group has all of its segments available, which can cause issues when fetching data from the servers with the strictReplicaGroup setting in upsert/dedup tables.
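
To make the example concrete, the sketch below computes both numbers for the two replica groups above. It is illustrative only: the class and method names are hypothetical, and the per-segment formula (minimum number of ONLINE replicas across segments divided by the replica count) is an assumption about how a replica-percentage metric is typically derived, not a quote of the existing implementation.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical worked example; names and formulas are assumptions for illustration. */
final class ReplicaGroupExample {
  static final int NUM_REPLICA_GROUPS = 2;

  // segment -> state per replica group (index 0 = RG1, index 1 = RG2)
  static final Map<String, List<String>> STATES = Map.of(
      "1", List.of("ERROR", "ONLINE"),
      "2", List.of("ONLINE", "ERROR"),
      "3", List.of("ONLINE", "ONLINE"));

  /** Per-segment view: the worst segment still has one of its two replicas ONLINE. */
  static int percentOfReplicas() {
    int minOnline = NUM_REPLICA_GROUPS;
    for (List<String> states : STATES.values()) {
      int online = 0;
      for (String state : states) {
        if ("ONLINE".equals(state)) {
          online++;
        }
      }
      minOnline = Math.min(minOnline, online);
    }
    return minOnline * 100 / NUM_REPLICA_GROUPS;
  }

  /** Per-replica-group view: a group is healthy only if all of its segments are ONLINE. */
  static int percentOfReplicaGroups() {
    int healthy = 0;
    for (int rg = 0; rg < NUM_REPLICA_GROUPS; rg++) {
      boolean allOnline = true;
      for (List<String> states : STATES.values()) {
        if (!"ONLINE".equals(states.get(rg))) {
          allOnline = false;
        }
      }
      if (allOnline) {
        healthy++;
      }
    }
    return healthy * 100 / NUM_REPLICA_GROUPS;
  }

  public static void main(String[] args) {
    System.out.println("percent of replicas:       " + percentOfReplicas());
    System.out.println("percent of replica groups: " + percentOfReplicaGroups());
  }
}
```

Under these assumptions the per-segment number is 50 while the replica-group number is 0, which is the gap the new metric is meant to expose.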

Solution:
Added a PERCENT_OF_REPLICA_GROUPS metric to track these cases when instancePartitions are specified in ReplicaGroupPartitionConfig.
If the config below is not specified, the metric is currently not emitted:

  "instanceAssignmentConfigMap": {
    "CONSUMING": {
      ...
      "replicaGroupPartitionConfig": {
        "replicaGroupBased": true,
        "numReplicaGroups": 3,
        "numInstancesPerReplicaGroup": 4
      }
    }
  }

TODO: Unit test to be added


codecov-commenter · 2024-12-02T23:55:26Z

Codecov Report

Attention: Patch coverage is 71.73913% with 13 lines in your changes missing coverage. Please review.

Project coverage is 63.80%. Comparing base (59551e4) to head (7592e72).
Report is 1528 commits behind head on master.

Files with missing lines	Patch %	Lines
...e/pinot/controller/helix/SegmentStatusChecker.java	64.86%	7 Missing and 6 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##             master   #14581      +/-   ##
============================================
+ Coverage     61.75%   63.80%   +2.05%     
- Complexity      207     1607    +1400     
============================================
  Files          2436     2703     +267     
  Lines        133233   150778   +17545     
  Branches      20636    23307    +2671     
============================================
+ Hits          82274    96208   +13934     
- Misses        44911    47357    +2446     
- Partials       6048     7213    +1165

Flag	Coverage Δ
custom-integration1	`100.00% <ø> (+99.99%)`	⬆️
integration	`100.00% <ø> (+99.99%)`	⬆️
integration1	`100.00% <ø> (+99.99%)`	⬆️
integration2	`0.00% <ø> (ø)`
java-11	`63.78% <71.73%> (+2.07%)`	⬆️
java-21	`63.70% <71.73%> (+2.07%)`	⬆️
skip-bytebuffers-false	`63.80% <71.73%> (+2.06%)`	⬆️
skip-bytebuffers-true	`34.16% <71.73%> (+6.43%)`	⬆️
temurin	`63.80% <71.73%> (+2.05%)`	⬆️
unittests	`63.80% <71.73%> (+2.05%)`	⬆️
unittests1	`56.22% <11.11%> (+9.32%)`	⬆️
unittests2	`34.17% <71.73%> (+6.44%)`	⬆️

Flags with carried forward coverage won't be shown.


deepthi912 · 2024-12-28T02:54:12Z

pinot-controller/src/main/java/org/apache/pinot/controller/helix/SegmentStatusChecker.java

+            InstancePartitionsType.CONSUMING.toString()));
+    String completedPath = ZKMetadataProvider.constructPropertyStorePathForInstancePartitions(
+        InstancePartitionsUtils.getInstancePartitionsName(tableNameWithType,
+            InstancePartitionsType.COMPLETED.toString()));


ONLINE COMPLETED segments are now considered as well for real-time tables.

Jackie-Jiang · 2024-12-30T00:56:18Z

Can you take a look at #14536 and see if you are trying to solve the same problem?

deepthi912 · 2025-01-02T18:33:25Z

#14536 is trying to solve the issue of OFFLINE and CONSUMING segments having different numbers of replicas, whereas I am adding a metric to alert on what percent of replica groups are in a good state. This PR would only depend on that one.


deepthi912 and others added 14 commits March 25, 2024 22:02

Merge pull request #1 from apache/master

7c0a954

reverse merge

Merge pull request #3 from apache/master

361b729

reverse merge

Merge pull request #4 from apache/master

b764dfb

reverse merge

Merge pull request #8 from apache/master

5406079

reverse merge

Update AsyncInstanceTable.tsx

87e62d7

Merge pull request #12 from apache/master

7d86ef9

reverse merge

Merge pull request #13 from apache/master

9cb2bc2

reverse merge

Merge branch 'apache:master' into master

77699ad

Merge pull request #15 from apache/master

dd9a1b7

reverse merge

Merge pull request #28 from apache/master

794df35

merge master

Merge pull request #31 from apache/master

f4e4904

merge master

Merge pull request #35 from apache/master

5fb8a8b

merge master

Replica groups Metric to detect when a replica group is down

2132426

Raise alert only when instancePartitions are configured

408c9f4

null checks

d114108

yashmayya added metrics enhancement labels Dec 3, 2024

deepthi912 and others added 9 commits December 9, 2024 17:40

Merge pull request #47 from apache/master

7079c56

merge master

Merge pull request #48 from apache/master

f953f2b

merge master

Merge pull request #52 from apache/master

e9024fd

merge master

Merge branch 'master' into BadSegments_RG

6eb4128

Add the instancePartitionType in the Replica Group status checker

35da586

Add the description for dedup expired primary key ms

5ca0878

description fix

957c0f4

Online and Completed states for InstancePartitions

df57faa

remove the added constants

600f420


Jackie-Jiang added the observability label Dec 30, 2024

deepthi912 and others added 2 commits January 2, 2025 11:31

Merge pull request #53 from apache/master

f2eb62b

merge master

Merge branch 'master' into BadSegments_RG

7592e72
