Raise alert for replicas in ERROR state when Replica Group is configured in table Config #14581
base: master
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@             Coverage Diff              @@
##             master    #14581      +/-   ##
============================================
+ Coverage     61.75%    63.80%     +2.05%
- Complexity      207      1607      +1400
============================================
  Files          2436      2703       +267
  Lines        133233    150778     +17545
  Branches      20636     23307      +2671
============================================
+ Hits          82274     96208     +13934
- Misses        44911     47357      +2446
- Partials       6048      7213      +1165

Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
// Property-store paths for both the CONSUMING and COMPLETED instance partitions of the real-time table
// (the CONSUMING statement is reconstructed here to mirror the COMPLETED one; its start was cut off in the excerpt)
String consumingPath = ZKMetadataProvider.constructPropertyStorePathForInstancePartitions(
    InstancePartitionsUtils.getInstancePartitionsName(tableNameWithType,
        InstancePartitionsType.CONSUMING.toString()));
String completedPath = ZKMetadataProvider.constructPropertyStorePathForInstancePartitions(
    InstancePartitionsUtils.getInstancePartitionsName(tableNameWithType,
        InstancePartitionsType.COMPLETED.toString()));
Considered ONLINE (COMPLETED) segments as well for the real-time table.
Can you take a look at #14536 and see if you are trying to solve the same problem?
#14536 is trying to solve the issue of having a different number of replicas for OFFLINE and CONSUMING segments, whereas I am adding a metric to alert on what percent of replica groups are in a good state. This PR would just be dependent on that one.
Context:
We have an alert that shows the percent of replicas that are healthy, but for the replica group setting, when the strictReplicaGroup config is enabled for upsert/dedup tables, it is hard to detect the table going into a bad state from the PERCENT_OF_REPLICAS metric alone, because different segments going down in different replica groups can cause low availability for the table.
Added a map from server to replica group id, then checked the statuses of those servers from the external view of the segments. Also added a map from replica group id to alive/dead, which determines the status of each replica group and, in the end, the percent of replica groups that are live (both COMPLETED and CONSUMING instance partitions are considered). A sketch of the server-to-replica-group step follows.
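The snippet below is a minimal sketch of how such a server-to-replica-group map could be built from a table's InstancePartitions; it assumes the standard InstancePartitions accessors (getNumReplicaGroups, getNumPartitions, getInstances), and the class and method names are hypothetical illustrations, not the code in this PR.

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.pinot.common.assignment.InstancePartitions;

public class ReplicaGroupMappingSketch {
  // Maps every server instance to the replica group it belongs to, across all partitions.
  public static Map<String, Integer> serverToReplicaGroupId(InstancePartitions instancePartitions) {
    Map<String, Integer> serverToReplicaGroupId = new HashMap<>();
    for (int replicaGroupId = 0; replicaGroupId < instancePartitions.getNumReplicaGroups(); replicaGroupId++) {
      for (int partitionId = 0; partitionId < instancePartitions.getNumPartitions(); partitionId++) {
        List<String> instances = instancePartitions.getInstances(partitionId, replicaGroupId);
        for (String instance : instances) {
          serverToReplicaGroupId.put(instance, replicaGroupId);
        }
      }
    }
    return serverToReplicaGroupId;
  }
}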
Example:
RG1 -> segments 1, 2, 3
RG2 -> segments 1, 2, 3
If segment 1 in RG1 and segment 2 in RG2 are in ERROR state, neither replica group can serve all segments, which causes issues when fetching data from servers under the strictReplicaGroup setting in upsert/dedup tables.
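To make the health computation concrete, here is a minimal, self-contained sketch of the logic described above, using plain maps instead of Pinot's internal types; the class, method, server, and segment names are hypothetical and only encode the example scenario, not the actual PR code.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class ReplicaGroupHealthSketch {
  // serverToReplicaGroupId: server instance -> replica group id (e.g. built from instance partitions)
  // segmentToServerStates: segment -> (server -> state from the external view, e.g. ONLINE/CONSUMING/ERROR)
  public static double computePercentLiveReplicaGroups(Map<String, Integer> serverToReplicaGroupId,
      Map<String, Map<String, String>> segmentToServerStates) {
    Set<Integer> allReplicaGroups = new HashSet<>(serverToReplicaGroupId.values());
    Set<Integer> deadReplicaGroups = new HashSet<>();
    for (Map<String, String> serverStates : segmentToServerStates.values()) {
      for (Map.Entry<String, String> entry : serverStates.entrySet()) {
        Integer replicaGroupId = serverToReplicaGroupId.get(entry.getKey());
        // A replica group is counted as dead once any of its replicas for any segment is in ERROR state.
        if (replicaGroupId != null && "ERROR".equals(entry.getValue())) {
          deadReplicaGroups.add(replicaGroupId);
        }
      }
    }
    int total = allReplicaGroups.size();
    return total == 0 ? 100.0 : 100.0 * (total - deadReplicaGroups.size()) / total;
  }

  public static void main(String[] args) {
    // Scenario from the example: one server per replica group, segment 1 in ERROR on RG1, segment 2 in ERROR on RG2.
    Map<String, Integer> serverToReplicaGroupId = new HashMap<>();
    serverToReplicaGroupId.put("server-rg1", 0);
    serverToReplicaGroupId.put("server-rg2", 1);
    Map<String, Map<String, String>> segmentToServerStates = new HashMap<>();
    segmentToServerStates.put("segment1", Map.of("server-rg1", "ERROR", "server-rg2", "ONLINE"));
    segmentToServerStates.put("segment2", Map.of("server-rg1", "ONLINE", "server-rg2", "ERROR"));
    segmentToServerStates.put("segment3", Map.of("server-rg1", "ONLINE", "server-rg2", "ONLINE"));
    System.out.println(computePercentLiveReplicaGroups(serverToReplicaGroupId, segmentToServerStates));  // prints 0.0
  }
}

With both replica groups degraded, the percent of live replica groups drops to 0, even though most individual replicas would still look healthy to a plain per-replica metric.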
Solution:
Added a PERCENT_OF_REPLICA_GROUPS metric to track these cases when instancePartitions are specified via ReplicaGroupPartitionConfig.
If such a config is not specified, the metric is currently not emitted.
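As a rough illustration only of the kind of config meant here (not taken from this PR; the tenant tag and group/instance counts are made-up values), a table config with replica-group based instance assignment and strict replica group routing might look like:

"instanceAssignmentConfigMap": {
  "CONSUMING": {
    "tagPoolConfig": {
      "tag": "DefaultTenant_REALTIME"
    },
    "replicaGroupPartitionConfig": {
      "replicaGroupBased": true,
      "numReplicaGroups": 2,
      "numInstancesPerReplicaGroup": 3
    }
  }
},
"routing": {
  "instanceSelectorType": "strictReplicaGroup"
}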
TODO: Unit test to be added