-
Notifications
You must be signed in to change notification settings - Fork 8.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Chan.Part.API: channel participation relation and status metrics #2025
Conversation
2eea787
to
55d8c9d
Compare
Generally, I am a bit confused by the design of the metrics here. A channel can have cluster-relation: In addition, the status can be Labeling with Can't we have a gauge that directly spans the enum values of status? for example: |
Adjusted the metrics and updated the places I was already setting the metrics in this PR. Need to add reporting of the metrics to at least the config-tracker codepath (and consider whether we care to report cluster relation "none" for non-raft orderers). |
case types.StatusActive: | ||
s = 2 | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding a default here that would catch if we changed/added to the enums?
This way if we change the enum it translates to 0=inactive...
say -1? that would get the admins attention.
func (m *Metrics) reportClusterRelation(channel string, relation types.ClusterRelation) { | ||
var r int | ||
switch relation { | ||
case types.ClusterRelationNone: | ||
r = 0 | ||
case types.ClusterRelationConfigTracker: | ||
r = 1 | ||
case types.ClusterRelationFollower: | ||
r = 2 | ||
case types.ClusterRelationMember: | ||
r = 3 | ||
} | ||
m.ClusterRelation.With("channel", channel).Set(float64(r)) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding a default here that would catch if we changed/added to the enums?
This way if we change the enum it translates to 0=None...
say -1? that would get the admins attention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or maybe a more robust approach would be to add to the types
package a couple of methods that convert the types.ClusterRelation*
and type.Status*
into integers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think it's likely we end up with additional cluster relations/statuses? I do think we should have a default, but maybe panic instead stating it's a programming error and a new metric conversion is needed in the switch statement?
I ordered these in ascending order to go along with the highest number being the most active in terms of participating in the cluster/channel, but if we anticipate more, maybe it'd be best to have 0 remain none/inactive, 1 = member/active, and then the rest. That way we keep it stable and there isn't a presumed progression of the values. Thoughts?
So maybe for cluster relation:
0 = none
1 = member
2 = follower
3 = config-tracker
And status:
0 = inactive
1 = active
2 = onboarding
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The ordering looks good. I like the fact that <1,1> is the "normal". I agree that panic() is a good solution for default. This way we would hopefully catch inconsistencies in IT.
I think one additional status that is already in the cards is types.StatusFailure
- see FAB-18106, and also my comment here.
Maybe adding it (types.StatusFailure
) in the context of this task is not a bad idea?
e608eab
to
910ca0b
Compare
Namespace: "cluster", | ||
Subsystem: "", | ||
Name: "relation", | ||
Help: "The cluster relation of the node: 0 if none, 1 if member, 2 if follower, 3 if config-tracker.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
member -> consenter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
switch relation { | ||
case types.ClusterRelationNone: | ||
r = 0 | ||
case types.ClusterRelationMember: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wasn't this supposed to be types.ClusterRelationConsenter
? probably needs a rebase as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
Looks good, except that the transition of cluster-relation member->consenter needs to be applied here as well. |
Thanks for catching that! Forgot that this PR was also affected by the rename somehow. :) |
Namespace: "cluster", | ||
Subsystem: "", | ||
Name: "status", | ||
Help: "The status of the cluster node: 0 if inactive, 1 if active, 2 if onboarding.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not totally convinced here, but wouldn't it be more natural to go 0 inactive, 1 onboarding, 2 active? (ie, typically you transition from 0 to 1 to 2?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's how I originally implemented this but decided against it due to the fact that we may end up adding different options in the future. For the sake of keeping a stable base set of statuses and allowing new ones in the future, I thought it might be best to move away from the concept of a progression in the enum.
|
||
var ( | ||
clusterStatusOpts = metrics.GaugeOpts{ | ||
Namespace: "cluster", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, we already have a "cluster" namespace, used by orderer/common/cluster
, but it does specify a subsystem of comm
. Perhaps we should add a subsystem of participation
here? Or pick a different namespace?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I definitely like the idea of adding "participation" as the subsystem. And I wasn't in love with using "cluster" as the namespace. Maybe it should really be "channel"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went with "cluster" and "participation" for now (and updated the help text) since the channel name is included as a label.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think on this, the more convinced I am that 'cluster' isn't right, and I'm a little confused why we call it cluster status at all. It has nothing to do with the cluster? Being a follower does not make you a member of the cluster... why not a vanilla channel_participation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you suggesting we change the clusterRelation in general reported by the channel participation API in addition to the metric? We could rename that to Relation if it's not cluster-related and then the metric could be namepsace "channel" subsystem "participation" and name "relation" for channel_participation_relation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to what I mentioned in my previous comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated to "participation" for the namespace and no subsystem.
clusterStatusOpts = metrics.GaugeOpts{ | ||
Namespace: "cluster", | ||
Subsystem: "", | ||
Name: "status", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name status
is pretty hopelessly vague, unless we add more context with a subsystem or different namespace. A cluster status of 'active' for instance does not imply that the other nodes are reachable, etc. It is the participation_status
or similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. It's participation_status now with the namespace change.
clusterRelationOpts = metrics.GaugeOpts{ | ||
Namespace: "cluster", | ||
Subsystem: "", | ||
Name: "relation", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also wonder if relation
is the best we can do here? I understand that's what it's referred to in the code, but it's not exactly obvious to someone looking at a metrics dashboard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created follow-up task https://jira.hyperledger.org/browse/FAB-18327 for renaming this (and the underlying types.ClusterRelation) to consensus relation.
c.InactiveChainRegistry.TrackChain(support.ChannelID(), support.Block(0), func() { | ||
c.ChainManager.CreateChain(support.ChannelID()) | ||
}) | ||
c.ChainManager.ReportClusterRelationAndStatusMetrics(support.ChannelID(), types.ClusterRelationConfigTracker, types.StatusInactive) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really do not like this. And this is one of the strong reasons why I asked that we please ditch the notion of the raft consenter switching the chain out, and delegate this back up to the multichannel registrar. Consensus plugins have no business setting cluster relation status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I did not like doing this but it's the only way I found with the current code flow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have an outstanding JIRA to fix this flow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I do not see a JIRA for this. I can try to create one but I'm not sure if I have enough context in this area to get it completely right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe there is a Jira for it, or a variation on this theme: https://jira.hyperledger.org/browse/FAB-18152
However, the big landmine we have to dismantle is the pattern with the InactiveChainRegistry
etc., and we decided back then to finish the channel participation functionality first, and only then start refactoring. Moreover, the dependencies around the InactiveChainRegistry
and BlockPuller
are pretty complex, so this should be done very carefully... (if at all - "if it ain't broke don't fix it" comes to mind). Hopefully when we eventually deprecate the system channel this task will be a lot easier.
0e87fde
to
a51e57e
Compare
dcd5cb3
to
c73e26f
Compare
FAB-18087 Signed-off-by: Will Lahti <wtlahti@us.ibm.com>
Type of change
Description
Add metrics to track the channel participation relation and status per channel for the orderer.
Related issues
FAB-18087