Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Chan.Part.API: channel participation relation and status metrics #2025

Merged
merged 1 commit into from
Nov 9, 2020

Conversation

wlahti
Copy link
Contributor

@wlahti wlahti commented Oct 19, 2020

Type of change

  • New feature

Description

Add metrics to track the channel participation relation and status per channel for the orderer.

Related issues

FAB-18087

@wlahti wlahti requested a review from a team as a code owner October 19, 2020 20:38
@wlahti wlahti force-pushed the fab-18087 branch 2 times, most recently from 2eea787 to 55d8c9d Compare October 19, 2020 21:33
@wlahti wlahti requested a review from a team as a code owner October 19, 2020 21:33
@tock-ibm
Copy link
Contributor

tock-ibm commented Oct 21, 2020

Generally, I am a bit confused by the design of the metrics here.

A channel can have cluster-relation: {member, follower, config-tracker, none}
so why only the boolean is_member is_follower ?

In addition, the status can be {onboarding, active, inactive} (and in the future also error or failed).

Labeling with channel is fine, but labeling with status will create |status| x |clusterRelation| = 12 boolean gauges.
Is this what we want?

Can't we have a gauge that directly spans the enum values of status? for example:
status.%channel = 0 (=onboarding) / 1 (=active) / 2 (=inactive) / 3 (=error)

@wlahti wlahti changed the title Chan.Part.API: Add IsMember and IsFollower metrics Chan.Part.API: Add cluster relation and status metrics Oct 29, 2020
@wlahti wlahti marked this pull request as draft October 29, 2020 21:43
@wlahti
Copy link
Contributor Author

wlahti commented Oct 29, 2020

Adjusted the metrics and updated the places I was already setting the metrics in this PR. Need to add reporting of the metrics to at least the config-tracker codepath (and consider whether we care to report cluster relation "none" for non-raft orderers).

Comment on lines 68 to 72
case types.StatusActive:
s = 2
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding a default here that would catch if we changed/added to the enums?
This way if we change the enum it translates to 0=inactive...
say -1? that would get the admins attention.

Comment on lines 74 to 92
func (m *Metrics) reportClusterRelation(channel string, relation types.ClusterRelation) {
var r int
switch relation {
case types.ClusterRelationNone:
r = 0
case types.ClusterRelationConfigTracker:
r = 1
case types.ClusterRelationFollower:
r = 2
case types.ClusterRelationMember:
r = 3
}
m.ClusterRelation.With("channel", channel).Set(float64(r))
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about adding a default here that would catch if we changed/added to the enums?
This way if we change the enum it translates to 0=None...
say -1? that would get the admins attention.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or maybe a more robust approach would be to add to the types package a couple of methods that convert the types.ClusterRelation* and type.Status* into integers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's likely we end up with additional cluster relations/statuses? I do think we should have a default, but maybe panic instead stating it's a programming error and a new metric conversion is needed in the switch statement?

I ordered these in ascending order to go along with the highest number being the most active in terms of participating in the cluster/channel, but if we anticipate more, maybe it'd be best to have 0 remain none/inactive, 1 = member/active, and then the rest. That way we keep it stable and there isn't a presumed progression of the values. Thoughts?

So maybe for cluster relation:
0 = none
1 = member
2 = follower
3 = config-tracker

And status:
0 = inactive
1 = active
2 = onboarding

Copy link
Contributor

@tock-ibm tock-ibm Nov 3, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ordering looks good. I like the fact that <1,1> is the "normal". I agree that panic() is a good solution for default. This way we would hopefully catch inconsistencies in IT.

I think one additional status that is already in the cards is types.StatusFailure - see FAB-18106, and also my comment here.

Maybe adding it (types.StatusFailure) in the context of this task is not a bad idea?

@wlahti wlahti force-pushed the fab-18087 branch 5 times, most recently from e608eab to 910ca0b Compare November 5, 2020 12:53
@wlahti wlahti marked this pull request as ready for review November 5, 2020 12:55
@wlahti wlahti marked this pull request as draft November 5, 2020 13:57
@wlahti wlahti marked this pull request as ready for review November 5, 2020 18:08
docs/source/metrics_reference.rst Outdated Show resolved Hide resolved
docs/source/metrics_reference.rst Outdated Show resolved Hide resolved
Namespace: "cluster",
Subsystem: "",
Name: "relation",
Help: "The cluster relation of the node: 0 if none, 1 if member, 2 if follower, 3 if config-tracker.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

member -> consenter

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

switch relation {
case types.ClusterRelationNone:
r = 0
case types.ClusterRelationMember:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wasn't this supposed to be types.ClusterRelationConsenter? probably needs a rebase as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@tock-ibm
Copy link
Contributor

tock-ibm commented Nov 6, 2020

Looks good, except that the transition of cluster-relation member->consenter needs to be applied here as well.

@wlahti
Copy link
Contributor Author

wlahti commented Nov 6, 2020

Looks good, except that the transition of cluster-relation member->consenter needs to be applied here as well.

Thanks for catching that! Forgot that this PR was also affected by the rename somehow. :)

orderer/common/follower/follower_chain.go Outdated Show resolved Hide resolved
Namespace: "cluster",
Subsystem: "",
Name: "status",
Help: "The status of the cluster node: 0 if inactive, 1 if active, 2 if onboarding.",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not totally convinced here, but wouldn't it be more natural to go 0 inactive, 1 onboarding, 2 active? (ie, typically you transition from 0 to 1 to 2?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's how I originally implemented this but decided against it due to the fact that we may end up adding different options in the future. For the sake of keeping a stable base set of statuses and allowing new ones in the future, I thought it might be best to move away from the concept of a progression in the enum.


var (
clusterStatusOpts = metrics.GaugeOpts{
Namespace: "cluster",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we already have a "cluster" namespace, used by orderer/common/cluster, but it does specify a subsystem of comm. Perhaps we should add a subsystem of participation here? Or pick a different namespace?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I definitely like the idea of adding "participation" as the subsystem. And I wasn't in love with using "cluster" as the namespace. Maybe it should really be "channel"?

Copy link
Contributor Author

@wlahti wlahti Nov 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Went with "cluster" and "participation" for now (and updated the help text) since the channel name is included as a label.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more I think on this, the more convinced I am that 'cluster' isn't right, and I'm a little confused why we call it cluster status at all. It has nothing to do with the cluster? Being a follower does not make you a member of the cluster... why not a vanilla channel_participation?

Copy link
Contributor Author

@wlahti wlahti Nov 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting we change the clusterRelation in general reported by the channel participation API in addition to the metric? We could rename that to Relation if it's not cluster-related and then the metric could be namepsace "channel" subsystem "participation" and name "relation" for channel_participation_relation.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to what I mentioned in my previous comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to "participation" for the namespace and no subsystem.

clusterStatusOpts = metrics.GaugeOpts{
Namespace: "cluster",
Subsystem: "",
Name: "status",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name status is pretty hopelessly vague, unless we add more context with a subsystem or different namespace. A cluster status of 'active' for instance does not imply that the other nodes are reachable, etc. It is the participation_status or similar.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. It's participation_status now with the namespace change.

clusterRelationOpts = metrics.GaugeOpts{
Namespace: "cluster",
Subsystem: "",
Name: "relation",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also wonder if relation is the best we can do here? I understand that's what it's referred to in the code, but it's not exactly obvious to someone looking at a metrics dashboard.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Created follow-up task https://jira.hyperledger.org/browse/FAB-18327 for renaming this (and the underlying types.ClusterRelation) to consensus relation.

c.InactiveChainRegistry.TrackChain(support.ChannelID(), support.Block(0), func() {
c.ChainManager.CreateChain(support.ChannelID())
})
c.ChainManager.ReportClusterRelationAndStatusMetrics(support.ChannelID(), types.ClusterRelationConfigTracker, types.StatusInactive)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really do not like this. And this is one of the strong reasons why I asked that we please ditch the notion of the raft consenter switching the chain out, and delegate this back up to the multichannel registrar. Consensus plugins have no business setting cluster relation status.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I did not like doing this but it's the only way I found with the current code flow.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we have an outstanding JIRA to fix this flow?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I do not see a JIRA for this. I can try to create one but I'm not sure if I have enough context in this area to get it completely right.

Copy link
Contributor

@tock-ibm tock-ibm Nov 8, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there is a Jira for it, or a variation on this theme: https://jira.hyperledger.org/browse/FAB-18152

However, the big landmine we have to dismantle is the pattern with the InactiveChainRegistry etc., and we decided back then to finish the channel participation functionality first, and only then start refactoring. Moreover, the dependencies around the InactiveChainRegistry and BlockPuller are pretty complex, so this should be done very carefully... (if at all - "if it ain't broke don't fix it" comes to mind). Hopefully when we eventually deprecate the system channel this task will be a lot easier.

@wlahti wlahti force-pushed the fab-18087 branch 2 times, most recently from 0e87fde to a51e57e Compare November 7, 2020 13:45
@wlahti wlahti changed the title Chan.Part.API: Add cluster relation and status metrics Chan.Part.API: channel participation relation and status metrics Nov 7, 2020
@wlahti wlahti force-pushed the fab-18087 branch 2 times, most recently from dcd5cb3 to c73e26f Compare November 9, 2020 13:06
FAB-18087

Signed-off-by: Will Lahti <wtlahti@us.ibm.com>
@jyellick jyellick merged commit d54ed92 into hyperledger:master Nov 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants