-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enhance SegmentStatusChecker to honor no-queryable servers and instance assignment config #14536
base: master
Are you sure you want to change the base?
Conversation
664d301
to
5697b33
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #14536 +/- ##
============================================
+ Coverage 61.75% 63.94% +2.18%
- Complexity 207 1573 +1366
============================================
Files 2436 2688 +252
Lines 133233 147716 +14483
Branches 20636 22635 +1999
============================================
+ Hits 82274 94452 +12178
- Misses 44911 46326 +1415
- Partials 6048 6938 +890
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
if (numEVConsumingReplicas == 0) { // it's a immutable segment | ||
minEVImmutableReplicas = Math.min(minEVImmutableReplicas, numEVOnlineReplicas); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is problematic because there might be a scenario in which all replicas for segment with IS in CONSUMING state are OFFLINE or ERROR. For this scenario, numEVConsumingReplicas is indeed zero and we should report it as zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about changing L365 as
if (numISConsumingReplicas == 0 && numEVConsumingReplicas == 0) { // it's a immutable segment
This should also cover the scenario that a mutable segment just transit to a immutable segment:
numISConsumingReplicas == 0 && numEVConsumingReplicas == 0
: it's must be a immutable segmentnumISConsumingReplicas != 0 || numEVConsumingReplicas != 0
: it's possible that this segment just transit to a immutable sgement, but we treat it as a mutable segment
5697b33
to
1f10cbf
Compare
if (minEVReplicas < maxISReplicas) { | ||
LOGGER.warn("Table {} has at least one segment running with only {} replicas, below replication threshold :{}", | ||
tableNameWithType, minEVReplicas, maxISReplicas); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sajjad-moradi This log need to be deleted, as
- This log will always (and falsely) being printed out when online and consuming segments have different replica group.
- L388-L392 already log the partial online segments. This log is redundant
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add some new logging instead? Meanwhile, it would be great if we have a list of segments having min replicas and print them out as well? I feel sometimes the debuggability of alerts on this metric is not so great. nvm [L388-L392] should be good
4c5224a
to
bf098d3
Compare
Discussed internally that we should have an instance config cache and fetch on demand. Meanwhile, do we want to have 2 versions of the metric that maybe config-controlled or both available? In case anyone relies on the old. |
86898be
to
a39185a
Compare
a39185a
to
7ef229c
Compare
2a90e0e
to
f13578b
Compare
f13578b
to
57e8bdc
Compare
_serverStarters.add(startOneServer(i)); | ||
BaseServerStarter serverStarter = startOneServer(i); | ||
_serverStarters.add(serverStarter); | ||
_helixAdmin.enableInstance(getHelixClusterName(), serverStarter.getInstanceId(), true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ControllerPeriodicTasksIntegrationTest.testSegmentStatusChecker()
will fail without this.
all other testing succeeded. Only
full log
|
* This is a helper class that fetch server information from Helix/ZK. It caches the server information to avoid | ||
* repeated ZK access. This class is NOT thread-safe. | ||
*/ | ||
public class ServerInfoFetcher { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name is too generic. I suggest to rename it to something like ServerQueryStateFetcher
.
Also this is only used in SegmentStatusChecker class. Let's make it private to that class (move it to that class), and later if there's any use elsewhere, it can be made public.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It not only contains state (helix_enabled, query_disabled, shutdown_in_progress), but also contains some other information (tables on that server, instance tags, and potentially @jasperjiaguo will add more). So I rename it as ServerQueryInfoFetcher
.
In addition, @jasperjiaguo will use this ServerQueryInfoFetcher
very soon in his other PRs, and it's under util
package, it's OK to be public. There are some other util class such as SegmentIntervalUtils
, AutoAddInvertedIndex
, they are used by only one class, but also be public.
@@ -80,6 +82,8 @@ public class SegmentStatusChecker extends ControllerPeriodicTask<SegmentStatusCh | |||
|
|||
private long _lastDisabledTableLogTimestamp = 0; | |||
|
|||
private ServerInfoFetcher _serverInfoFetcher; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not have this as a class member. The reason is the cache you have in the class is not cleared. If you create an ServerInfoFetcher
object in each run of the segment status checker, there's no need to worry about the cache eviction.
d4d5962
to
24660fb
Compare
@sajjad-moradi @jasperjiaguo comments addresses. PTAL. |
24660fb
to
bbba882
Compare
numEVReplicas++; | ||
String serverInstanceId = entry.getKey(); | ||
String segmentState = entry.getValue(); | ||
if (isServerQueryable(serverQueryInfoFetcher.getServerQueryInfo(serverInstanceId)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(totally optional) maybe encapsulate this in the serverQueryInfoFetcher as well
String serverInstanceId = entry.getKey(); | ||
String segmentState = entry.getValue(); | ||
if (isServerQueryable(serverQueryInfoFetcher.getServerQueryInfo(serverInstanceId)) | ||
&& (segmentState.equals(SegmentStateModel.ONLINE) || segmentState.equals(SegmentStateModel.CONSUMING))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(totally optional) We can probably make this a set called acceptableStates and do acceptableStates.contains(segmentState)
@Jackie-Jiang could you please review this? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this PR still WIP? I don't see the change related to checking instance assignment config
private PinotHelixResourceManager _pinotHelixResourceManager; | ||
private Map<String, ServerQueryInfo> _cache; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) They can be final
return new ServerQueryInfo(instanceId, tags, null, helixEnabled, queriesDisabled, shutdownInProgress); | ||
} | ||
|
||
public static class ServerQueryInfo { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We shouldn't need setters for this class, and we can make all fields final
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems most of the fields are not used. Are you planning to use them?
private boolean _helixEnabled; | ||
private boolean _queriesDisabled; | ||
private boolean _shutdownInProgress; | ||
private ServerQueryInfo(String instanceName, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(format) Add an empty line, and reformat the constructor per Pinot Style
boolean helixEnabled = record.getBooleanField( | ||
InstanceConfig.InstanceConfigProperty.HELIX_ENABLED.name(), false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use instanceConfig.getInstanceEnabled()
.
We can simplify the ServerQueryInfo
to combine all 3 booleans, and call it queryEnabled
_cache = new HashMap<>(); | ||
} | ||
|
||
public ServerQueryInfo getServerQueryInfo(String instanceId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Annotate the return as @Nullable
return _cache.computeIfAbsent(instanceId, this::getServerQueryInfoOndemand); | ||
} | ||
|
||
private ServerQueryInfo getServerQueryInfoOndemand(String instanceId) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private ServerQueryInfo getServerQueryInfoOndemand(String instanceId) { | |
@Nullable | |
private ServerQueryInfo fetchServerQueryInfo(String instanceId) { |
numEVReplicas++; | ||
String serverInstanceId = entry.getKey(); | ||
String segmentState = entry.getValue(); | ||
if (isServerQueryable(serverQueryInfoFetcher.getServerQueryInfo(serverInstanceId)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check segment state first because the overhead for that is much smaller
ref: #14538
This PR fix enhance SegmentStatusChecker in two scenarios:
If user has instance assignment config, a user may make offline segments and consuming segments have different replica groups (e.g. use more replicas for consuming segments, but less replicas for online segments).
In this situation, when calculate
PERCENT_OF_REPLICAS
, it's problematic to use globalmaxISReplicas
as denominator and globalminEVReplicas
as nominator. For example, if user config replica groups for immutable segments as 3, and 5 for mutable segments,PERCENT_OF_REPLICAS
will always be 60% even if all replicas are up, which will cause some false negative alerts.This PR change the logic to calculate EVReplicasUpPercent for each segment then emit the minimal percentage.
When a sever is configured as
queriesDisabled
orshutdownInProgress
, broker does not send queries to the server. We need technically treat those replicas that on the no-queryable server as "OFFLINE" even if it shows "ONLINE" in helix