Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] - KES metrics are not being reported on tag 1.19.0 #1742

Closed
MarcelKlammer opened this issue Aug 20, 2020 · 16 comments
Closed

[BUG] - KES metrics are not being reported on tag 1.19.0 #1742

MarcelKlammer opened this issue Aug 20, 2020 · 16 comments
Assignees
Labels
bug Something isn't working

Comments

@MarcelKlammer
Copy link

Exernal

Summary
KES reporting not available on v1.19.0

Steps to reproduce
Update your block producer to v1.19.0. See metrics. No KES

Expected behavior
KES should be reported.

System info (please complete the following information):

  • OS: Ubuntu
  • Version 20.04

cardano-node 1.19.0 - linux-x86_64 - ghc-8.6
git rev 4814003
cardano-cli 1.19.0 - linux-x86_64 - ghc-8.6
git rev 4814003

@MarcelKlammer MarcelKlammer added the bug Something isn't working label Aug 20, 2020
@MarcelKlammer
Copy link
Author

Alongside my 1.19.0 test I rotated the KES keys using a local Docker Ubuntu container running the same v1.19.0.

Those newly created KES keys seem to be wrong, as they result in KES not reporting.

Uploading the old KES keys to a v1.19.0 block producer results in KES reporting normally.

So something is fishy with KES creation in v1.19.0

@MarcelKlammer
Copy link
Author

Old KES keys produce this trace:

Aug 20 20:22:59 ip-172-31-33-176 cardano-node[88751]: [ip-172-3:cardano.node.Forge:Debug:52] [2020-08-20 20:22:59.71 UTC] OptSkip (OptCons (WrapForgeStateInfo {unwrapForgeStateInfo = KESInfo {kesStartPeriod = KESPeriod 40, kesEndPeriod = KESPeriod 102, kesEvolution = 9}}) OptNil)

New KES keys don't produce a cardano.node.Forge:Debug trace.

@disassembler
Copy link
Contributor

That debug log was there for testing a specific problem with KES rotation. Debug logs are not expected to be stable and are expected to be added/removed at any point. I can confirm the KES metrics that we monitor show correctly after redeploying with a new KES key:

cardano_node_Forge_metrics_operationalCertificateStartKESPeriod_int 0 0
cardano_node_Forge_metrics_operationalCertificateExpiryKESPeriod_int 62
cardano_node_Forge_metrics_currentKESPeriod_int 2
cardano_node_Forge_metrics_remainingKESPeriods_int 60

@disassembler
Copy link
Contributor

I also regenerated KES keys and deployed them before checking this (they were initially deployed at KES period 0)

@kevinhammond
Copy link
Contributor

I will close the issue. There appears to be no problem.

@mrBliss
Copy link
Contributor

mrBliss commented Aug 21, 2020

Old KES keys produce this trace:

Aug 20 20:22:59 ip-172-31-33-176 cardano-node[88751]: [ip-172-3:cardano.node.Forge:Debug:52] [2020-08-20 20:22:59.71 UTC] OptSkip (OptCons (WrapForgeStateInfo {unwrapForgeStateInfo = KESInfo {kesStartPeriod = KESPeriod 40, kesEndPeriod = KESPeriod 102, kesEvolution = 9}}) OptNil)

New KES keys don't produce a cardano.node.Forge:Debug trace.

The key evolution code was refactored in 1.19.0. Now, such a debug message is only traced when the key actually evolved instead of once at the start of each slot (if the node is configured to produce blocks). As @disassembler said, debug logs are not expected to be stable.

@MarcelKlammer
Copy link
Author

I don't care about the debug logs. It was an attempt to give you some insight.

The metrics are not present. Plain and simple. I uploaded new KES and opcert with the start KES period being the present KES period. And the KES metrics don't show up.

@mrBliss
Copy link
Contributor

mrBliss commented Aug 21, 2020

I don't care about the debug logs. It was an attempt to give you some insight.

The metrics are not present. Plain and simple. I uploaded new KES and opcert with the start KES period being the present KES period. And the KES metrics don't show up.

Ah, reading your messages, I had the impression that you were talking about the missing trace messages. Is #1748 what you mean?

@MarcelKlammer
Copy link
Author

No,

EKG > Prometheus > Grafana.

cardano_node_Forge_metrics_remainingKESPeriods_int is not present.

Could it be, that this trace is actually necessary to get a KES metric at all.
Could it be, that since I regenerated the KES keys with the current KES period, that it didn't evolve yet, and so no trace is printed and thus nothing shows up?

@MarcelKlammer
Copy link
Author

I'm still testing and I'm getting closer. I will keep you updated.

@mrBliss
Copy link
Contributor

mrBliss commented Aug 21, 2020

No,

EKG > Prometheus > Grafana.

cardano_node_Forge_metrics_remainingKESPeriods_int is not present.

Ah

Could it be, that this trace is actually necessary to get a KES metric at all.
Could it be, that since I regenerated the KES keys with the current KES period, that it didn't evolve yet, and so no trace is printed and thus nothing shows up?

Yeah, I'm now thinking that that's indeed the reason. That's also the cause of #1748. Thanks for the info! I'll see what I can do about it (e.g., trace it once on startup, even when the key didn't have to evolve).

@MarcelKlammer
Copy link
Author

Yeah, I'm now thinking that that's indeed the reason. That's also the cause of #1748. Thanks for the info! I'll see what I can do about it (e.g., trace it once on startup, even when the key didn't have to evolve).

@kevinhammond Closing a bug report without proper investigation shouldn't be the norm, just saying.

@matiwinnetou
Copy link
Contributor

I agree with @MarcelKlammer , I was also a bit shocked how quickly it was closed without looking into details.

@MarcelKlammer
Copy link
Author

@mrBliss I can confirm the behavior now.

Just regenerated KES with --kes-period 48 instead of 49 (current) and node v1.19.0 now has KES metrics EKG/Prometheus/Grafana showing up.

@mrBliss
Copy link
Contributor

mrBliss commented Aug 21, 2020

A fix for this: IntersectMBO/ouroboros-network#2529.

@mrBliss mrBliss reopened this Aug 21, 2020
iohk-bors bot added a commit to IntersectMBO/ouroboros-network that referenced this issue Aug 21, 2020
2529: Always trace ForgeStateInfo, even when it didn't change r=mrBliss a=mrBliss

In `cardano-node`, the KES information (grafana metrics and LiveView) will only
be shown when `ForgeStateInfo` has been traced. When (re)starting a node, that
won't happen until the KES key has evolved, which, for mainnet, can take up to
36 hours. Related issues: IntersectMBO/cardano-node#1742,
IntersectMBO/cardano-node#1748.

Fix it by tracing the `ForgeStateInfo`, whether it was updated or unchanged.
This means it will be traced once every slot, i.e., second, if the node is
configured to produce blocks. It is up to the tracers in `cardano-node` to
suppress duplicate messages.

Co-authored-by: Thomas Winant <thomas@well-typed.com>
@kevinhammond
Copy link
Contributor

Hi Marcel. We did do the investigation (we were concerned that there might be a KES key acceptance problem, which would have needed urgent attention). I was told it was expected behaviour, but of course the issue was with logging rather than generation, as you have noted. We've seen quite a few things coming up with LiveView, unfortunately.

Straightpool added a commit to Straightpool/guild-operators that referenced this issue Aug 21, 2020
On KES rotation the starting period is now "current period -1" to accomomate for the following 1.19.0 issue, so the KES period has already evolved to keep the KES metrics: IntersectMBO/cardano-node#1742
rdlrt added a commit to cardano-community/guild-operators that referenced this issue Aug 21, 2020
* Update cntools.library

On KES rotation the starting period is now "current period -1" to accomomate for the following 1.19.0 issue, so the KES period has already evolved to keep the KES metrics: IntersectMBO/cardano-node#1742

* Update patch version number

* Update cntools-changelog.md

Co-authored-by: Priyank <3169068+rdlrt@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

7 participants