KEP-4346: Add metrics for informer #129160

Open
wants to merge 1 commit into
base: master

xigang
Contributor

@xigang xigang commented Dec 11, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

  1. Adds reflector metrics
  2. Adds informer metrics
  3. Exposes informer reflector/queue/eventHandler metrics

KEP-4346
https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4346-informer-metrics

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

This PR is backward-compatible and introduces no breaking changes. 
Users will automatically gain visibility into informer metrics without additional configuration.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

[KEP]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/4346-informer-metrics

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 11, 2024
@k8s-ci-robot
Contributor

Please note that we're already in Test Freeze for the release-1.32 branch. This means every merged PR will be automatically fast-forwarded via the periodic ci-fast-forward job to the release branch of the upcoming v1.32.0 release.

Fast forwards are scheduled to happen every 6 hours, whereas the most recent run was: Wed Dec 11 12:08:11 UTC 2024.

@k8s-ci-robot k8s-ci-robot added do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Dec 11, 2024
@k8s-ci-robot
Contributor

Hi @xigang. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Dec 11, 2024
p.metrics.processDuration.Observe(time.Since(startTime).Seconds())
//TODO: This requires implementing Len() and Capacity() for ring growing
// p.metrics.numberOfPendingNotifications.Set(float64(p.pendingNotifications.Len()))
// p.metrics.sizeOfRingGrowing.Set(float64(p.pendingNotifications.Capacity()))
Contributor Author

We need to wait for the Len() and Capacity() methods in the ring growing package to be merged.
PR: kubernetes/utils#321

Member

is this single-threaded? (is calling Len and Capacity independently and not under lock safe here, given the pendingNotifications is not thread-safe?)

Contributor Author

@xigang xigang Dec 18, 2024

Yes, pendingNotifications is not thread-safe. The pop() and run() goroutines will concurrently read and write. We can add a pendingNotificationsLock sync.RWMutex here.

It can be fixed as follows:

p.pendingNotificationsLock.RLock()
length := float64(p.pendingNotifications.Len())
capacity := float64(p.pendingNotifications.Capacity())
p.pendingNotificationsLock.RUnlock()

p.metrics.numberOfPendingNotifications.Set(length)
p.metrics.sizeOfRingGrowing.Set(capacity)
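
The locking pattern proposed above can be illustrated with a minimal, self-contained sketch. It assumes a stand-in `ring` type with `Len()` and `Capacity()` methods (the real methods depend on kubernetes/utils#321) and a hypothetical `listener` struct; neither is the actual client-go code. Writers take the write lock, and the metrics path takes the read lock once so that both values come from the same consistent state.

```go
package main

import (
	"fmt"
	"sync"
)

// ring is a hypothetical stand-in for the growing ring buffer. It is not
// safe for concurrent use on its own, like pendingNotifications.
type ring struct{ items []string }

func (r *ring) Push(s string) { r.items = append(r.items, s) }
func (r *ring) Len() int      { return len(r.items) }
func (r *ring) Capacity() int { return cap(r.items) }

// listener is a simplified, assumed shape of a processor listener.
type listener struct {
	mu      sync.RWMutex
	pending ring
}

// add mutates the buffer under the write lock.
func (l *listener) add(s string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.pending.Push(s)
}

// snapshot reads Len and Capacity under a single read lock, so the two
// values describe the same moment in time.
func (l *listener) snapshot() (length, capacity float64) {
	l.mu.RLock()
	defer l.mu.RUnlock()
	return float64(l.pending.Len()), float64(l.pending.Capacity())
}

func main() {
	l := &listener{}
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.add("event")
		}()
	}
	wg.Wait()

	// The gauges would be set from these values after the lock is released.
	length, capacity := l.snapshot()
	fmt.Println(length, capacity >= length)
}
```

Setting the gauges outside the critical section keeps the lock hold time short, which matters if the metrics backend is slow.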

Contributor Author

@xigang xigang Dec 19, 2024

Done, but it requires the ring buffer PR to be merged first.

@xigang xigang changed the title [WIP] clent-go: Add metrics for informer clent-go: Add metrics for informer Dec 12, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 12, 2024
@xigang xigang changed the title clent-go: Add metrics for informer client-go: Add metrics for informer Dec 12, 2024
@xigang
Contributor Author

xigang commented Dec 12, 2024

/sig api-machinery
/sig scalability

@k8s-ci-robot k8s-ci-robot added the sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. label Dec 12, 2024
@xigang xigang changed the title client-go: Add metrics for informer KEP-4346: Add metrics for informer Dec 12, 2024
@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Dec 12, 2024
@dgrisonnet
Member

for sig-instrumentation review

/assign

@Jefftree
Member

/cc @richabanker
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Dec 12, 2024
@@ -268,6 +273,11 @@ func NewReflectorWithOptions(lw ListerWatcher, expectedType interface{}, store S
return r
}

func makeValidPromethusMetricName(in string) string {
// this isn't perfect, but it removes our common characters
return strings.NewReplacer("/", "_", ".", "_", "-", "_").Replace(in)
Member

instead of dropping specific bad characters, shouldn't this be inverted and replace any character not in the set of allowed characters for valid prometheus names?

Contributor Author

@xigang xigang Dec 18, 2024

Yes. The current approach of replacing specific characters is less robust than enforcing the Prometheus metric naming rules directly. According to the Prometheus documentation, metric names must match the regex [a-zA-Z_:][a-zA-Z0-9_:]*. I'll adjust this code.
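
One way to invert the check, as suggested in the review, is sketched below. This is not the code that landed in the PR: `sanitizeMetricName` is a hypothetical helper, and replacing every disallowed character with an underscore is an assumption about the desired behavior. It enforces the regex quoted above by rewriting any character outside `[a-zA-Z0-9_:]` and prefixing names that would otherwise start with a digit or be empty.

```go
package main

import (
	"fmt"
	"regexp"
)

// invalidMetricChars matches every character that is not allowed anywhere
// in a Prometheus metric name.
var invalidMetricChars = regexp.MustCompile(`[^a-zA-Z0-9_:]`)

// sanitizeMetricName is a hypothetical helper for illustration only.
// It maps arbitrary input onto the pattern [a-zA-Z_:][a-zA-Z0-9_:]*.
func sanitizeMetricName(in string) string {
	out := invalidMetricChars.ReplaceAllString(in, "_")
	// The first character may not be a digit, and the name may not be empty.
	if out == "" || (out[0] >= '0' && out[0] <= '9') {
		out = "_" + out
	}
	return out
}

func main() {
	fmt.Println(sanitizeMetricName("reflector_k8s.io/client-go/informers/factory.go:150"))
	fmt.Println(sanitizeMetricName("9lives"))
}
```

Note that distinct inputs can collapse to the same output (for example `a.b` and `a/b`), so callers that need uniqueness would have to handle collisions separately.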

Contributor Author

@liggitt done

name: options.Name,
name: options.Name,
// we need this to be unique per process (some names are still the same)but obvious who it belongs to
metrics: newReflectorMetrics(makeValidPromethusMetricName(fmt.Sprintf("reflector_"+options.Name+"_expectedType_"+reflect.TypeOf(expectedType).String()+"_%07d", rand.Intn(1000000)))),
Member

won't the randomized suffix explode metrics cardinality?

Contributor Author

@xigang xigang Dec 18, 2024

Yes, the randomized suffix can cause a metric cardinality explosion, which degrades Prometheus storage and query performance. It can be removed here.

When the Name is not specified, Reflector will automatically generate a name by calling the naming.GetNameFromCallsite() function.
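
The cardinality concern can be made concrete with a small simulation. This is only an illustration: `seriesName` and `distinctNames` are hypothetical helpers that mimic the `_%07d` suffix from the diff, and each loop iteration stands in for one process start. A stable name yields a single series no matter how often the process restarts, while a randomized suffix yields a new series almost every time.

```go
package main

import (
	"fmt"
	"math/rand"
)

// seriesName is a simplified stand-in for the naming scheme under review.
func seriesName(base string, randomize bool) string {
	if randomize {
		return fmt.Sprintf("%s_%07d", base, rand.Intn(1000000))
	}
	return base
}

// distinctNames counts how many different series names appear across n
// simulated process starts.
func distinctNames(base string, randomize bool, n int) int {
	seen := make(map[string]struct{})
	for i := 0; i < n; i++ {
		seen[seriesName(base, randomize)] = struct{}{}
	}
	return len(seen)
}

func main() {
	fmt.Println("stable:", distinctNames("reflector_pods", false, 1000))
	fmt.Println("randomized:", distinctNames("reflector_pods", true, 1000))
}
```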

Contributor Author

@liggitt done

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 18, 2024
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 19, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: xigang
Once this PR has been reviewed and has the lgtm label, please ask for approval from dgrisonnet and additionally assign deads2k for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@xigang xigang force-pushed the informer_metrics branch 4 times, most recently from f352336 to daa45ba Compare December 19, 2024 12:39
@xigang
Contributor Author

xigang commented Dec 19, 2024

@liggitt The issues have been fixed. Thanks. 😄

Signed-off-by: xigang <wangxigang2014@gmail.com>
@xigang
Contributor Author

xigang commented Jan 1, 2025

/assign @deads2k

@xigang
Contributor Author

xigang commented Jan 10, 2025

/priority important-soon

@k8s-ci-robot k8s-ci-robot added priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. and removed needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Jan 10, 2025
@xigang
Contributor Author

xigang commented Jan 15, 2025

/cc @deads2k @dgrisonnet @wojtek-t @richabanker PTAL. thanks.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. priority/important-soon Must be staffed and worked on either currently, or very soon, ideally in time for the next release. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. sig/instrumentation Categorizes an issue or PR as relevant to SIG Instrumentation. sig/scalability Categorizes an issue or PR as relevant to SIG Scalability. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.
6 participants