additional metrics #579
Conversation
/assign
Mostly minor suggestions
}
	return nil
}
// TestCapacityController checks that the controller handles the initial state and
nit: Update function name
Done.
metricsManager := metrics.NewCSIMetricsManager("" /* driverName */)
metricsManager := metrics.NewCSIMetricsManagerWithOptions("", /* driverName */
	// Will be provided via default gatherer.
	metrics.WithProcessStartTime(false),
How does this affect existing CSI cumulative metrics that rely on process start time, like csi_sidecar_operations_seconds for example?
The process start time metric is registered in the legacy registry. By gathering from there, we still get the same value. If we enabled it here, we would have the same metric from two different places.
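To illustrate the point, here is a minimal sketch (assuming k8s.io/component-base/metrics/legacyregistry is vendored, as in this PR) showing that process_start_time_seconds is still available by gathering from the default legacy registry even when the metrics manager disables its own copy:

```go
package main

import (
	"fmt"

	"k8s.io/component-base/metrics/legacyregistry"
)

func main() {
	// The legacy registry registers the process and Go collectors on init,
	// so process_start_time_seconds shows up here without the metrics
	// manager registering it a second time.
	families, err := legacyregistry.DefaultGatherer.Gather()
	if err != nil {
		fmt.Println("gather failed:", err)
		return
	}
	for _, mf := range families {
		if mf.GetName() == "process_start_time_seconds" {
			fmt.Println("found", mf.GetName(), "via the legacy registry")
		}
	}
}
```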
metricsManager := metrics.NewCSIMetricsManagerWithOptions("", /* driverName */
	// Will be provided via default gatherer.
	metrics.WithProcessStartTime(false),
	metrics.WithSubsystem(metrics.SubsystemSidecar),
Is this just to make the default explicit?
Yes.
@@ -227,18 +237,18 @@ func main() {
	}
}

// Prepare http endpoint for metrics + leader election healthz
// Prepare http endpoint for metrics + leader election healthz + other metrics
nit: "other metrics" is redundant since "metrics" is mentioned in the beginning
Removed.
mux.Handle(*metricsPath,
	promhttp.InstrumentMetricHandler(
		reg,
		promhttp.HandlerFor(gatherers, promhttp.HandlerOpts{})))
RegisterToServer() includes a HandlerOpt of ErrorHandling: metrics.ContinueOnError. Is that needed here?
I don't think our gatherers ever return an error, so this shouldn't make a difference.
Even if they did, I'd prefer to get a clear indication that something was broken instead of continuing with incomplete data. But I don't feel strongly about this and could add it.
Ok I also don't have a strong preference :)
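For reference, a sketch of what opting in to ContinueOnError would look like in the hunk above; ErrorHandling and ContinueOnError are from github.com/prometheus/client_golang/prometheus/promhttp, and this is only an illustration, not the code merged here:

```go
// Hedged sketch: same handler as in the diff above, but with
// ContinueOnError so a failing gatherer yields partial metrics output
// instead of an HTTP error.
mux.Handle(*metricsPath,
	promhttp.InstrumentMetricHandler(
		reg,
		promhttp.HandlerFor(gatherers, promhttp.HandlerOpts{
			ErrorHandling: promhttp.ContinueOnError,
		})))
```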
pkg/capacity/capacity_test.go
Outdated
@@ -96,6 +123,9 @@ func TestController(t *testing.T) {
	modify func(ctx context.Context, clientSet *fakeclientset.Clientset, expected []testCapacity) (modifiedExpected []testCapacity, err error)
	capacityChange func(ctx context.Context, storage *mockCapacity, expected []testCapacity) (modifiedExpected []testCapacity)
	topologyChange func(ctx context.Context, topology *topology.Mock, expected []testCapacity) (modifiedExpected []testCapacity)

	expectedObjectsPrepared objects
	expectedObjectsProcessed objects
Is it the case that in expectedObjectsProcessed, goal always == current and obsolete always == 0? If so, it might be useful to either add this here as a comment, or just use a single int as the expectation representing both the expected goal and current.
Indeed, using objects as the type here is unnecessarily complex. I've switched to int.
pkg/capacity/capacity.go
Outdated
)
objectsObsoleteDesc = metrics.NewDesc(
	"csistoragecapacities_obsolete",
	"Number of CSIStorageCapacity objects that exist and will be deleted automatically.",
Maybe add to the description a mention that modified objects are not considered obsolete. I wasn't sure from the current descriptions alone what the behavior of these metrics is with updated objects.
There's no such thing as a "modified object" 😁
There are objects that are already scheduled for a refresh, which then may or may not lead to them being updated. The length of the work queue is an indicator for that.
I added "Objects that exist and may need an update are not considered obsolete and therefore not included in this value." as clarification. But I don't think it actually says something that wasn't already in that existing description.
@@ -115,6 +140,8 @@ type CSICapacityClient interface {
}

// NewController creates a new controller for CSIStorageCapacity objects.
// It implements metrics.StableCollector and thus can be registered in
nit: Consider having var _ metrics.StableCollector = &Controller{} instead of or in addition to this comment
I added the type assertion. That's always useful and not adding it was an oversight.
But since it wouldn't show up in generated documentation, I kept the comment.
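As an aside, the compile-time assertion idiom mentioned above looks like this in isolation; the interface and type below are stand-ins purely for illustration, while the real code asserts var _ metrics.StableCollector = &Controller{} in pkg/capacity:

```go
package main

// Stand-in interface and type for illustration only.
type collector interface {
	describe()
}

type controller struct{}

func (c *controller) describe() {}

// Compile-time check: this line stops compiling as soon as *controller no
// longer implements collector, and it costs nothing at runtime.
var _ collector = &controller{}

func main() {}
```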
@@ -358,8 +385,11 @@ func (c *Controller) addWorkItem(segment *topology.Segment, sc *storagev1.Storag
	storageClassName: sc.Name,
}
// Ensure that we have an entry for it...
capacity := c.capacities[item]
c.capacities[item] = capacity
_, found := c.capacities[item]
Just curious: do the two versions do the same thing and the newer version just makes it more explicit, or are there actual functional differences?
This change here alone is not functionally identical: the new line only checks for the capacity without adding it.
But directly underneath it we set the entry to nil if not found, and together with that it is functionally equivalent, because capacity := c.capacities[item] will also be nil when not found.
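A small, self-contained illustration of why the combination is equivalent (Go map indexing returns the zero value, nil for pointers, when the key is missing); the types here are simplified stand-ins:

```go
package main

import "fmt"

type capacity struct{ bytes int64 }

func main() {
	capacities := map[string]*capacity{}

	// Old variant: indexing a missing key yields nil, which is then stored.
	old := capacities["segment-a"]
	capacities["segment-a"] = old

	// New variant: explicit lookup, store nil only when the key was missing.
	if _, found := capacities["segment-b"]; !found {
		capacities["segment-b"] = nil
	}

	// Both keys now exist and map to nil.
	fmt.Println(capacities["segment-a"] == nil, capacities["segment-b"] == nil)
}
```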
pkg/capacity/capacity.go
Outdated
func (c *Controller) isObsolete(capacity *storagev1alpha1.CSIStorageCapacity) bool {
	for item, _ := range c.capacities {
		if item.storageClassName == capacity.StorageClassName &&
nit: define a workItem.equals() method, which can be easily identified and reused by future devs
Yes, that is useful. I already had this comparison in two places.
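For readers skimming the thread, the helper referred to here could look roughly like the following; this is an illustrative, self-contained guess at its shape with simplified stand-in types, not the PR's actual code:

```go
package main

import "fmt"

// Simplified stand-ins for the real types in pkg/capacity.
type csiStorageCapacity struct {
	storageClassName string
	segment          string
}

type workItem struct {
	storageClassName string
	segment          string
}

// equals centralizes the comparison that was previously duplicated in
// isObsolete and the object lookup.
func (w workItem) equals(capacity csiStorageCapacity) bool {
	return w.storageClassName == capacity.storageClassName &&
		w.segment == capacity.segment
}

func main() {
	item := workItem{storageClassName: "fast", segment: "zone-a"}
	fmt.Println(item.equals(csiStorageCapacity{storageClassName: "fast", segment: "zone-a"}))
}
```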
The external-provisioner has several named work queues which all can provide metrics data. We just need to:
- initialize gathering through the default legacy registry by importing the helper package for that
- configure the mux such that the default legacy registry is also included

Leader election can be enabled the same way. The default legacy registry also includes information about processes and the Go runtime.

Here is some example data without leader election (wasn't enabled):

# HELP csi_sidecar_operations_seconds [ALPHA] Container Storage Interface operation duration with gRPC error code status total
# TYPE csi_sidecar_operations_seconds histogram
...
csi_sidecar_operations_seconds_sum{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Controller/ControllerGetCapabilities"} 0.001358078
csi_sidecar_operations_seconds_count{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Controller/ControllerGetCapabilities"} 1
...
csi_sidecar_operations_seconds_sum{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Controller/GetCapacity"} 0.025071011
csi_sidecar_operations_seconds_count{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Controller/GetCapacity"} 25
...
csi_sidecar_operations_seconds_sum{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Identity/GetPluginCapabilities"} 0.001320474
csi_sidecar_operations_seconds_count{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Identity/GetPluginCapabilities"} 1
...
csi_sidecar_operations_seconds_sum{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeGetInfo"} 0.000994998
csi_sidecar_operations_seconds_count{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeGetInfo"} 1
...
csi_sidecar_operations_seconds_sum{driver_name="unknown-driver",grpc_status_code="OK",method_name="/csi.v1.Identity/GetPluginInfo"} 0.000752024
csi_sidecar_operations_seconds_count{driver_name="unknown-driver",grpc_status_code="OK",method_name="/csi.v1.Identity/GetPluginInfo"} 1
...
csi_sidecar_operations_seconds_sum{driver_name="unknown-driver",grpc_status_code="OK",method_name="/csi.v1.Identity/Probe"} 0.002666977
csi_sidecar_operations_seconds_count{driver_name="unknown-driver",grpc_status_code="OK",method_name="/csi.v1.Identity/Probe"} 1
# HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 7.5671e-05
go_gc_duration_seconds{quantile="0.25"} 9.2559e-05
go_gc_duration_seconds{quantile="0.5"} 9.9967e-05
go_gc_duration_seconds{quantile="0.75"} 0.000135593
go_gc_duration_seconds{quantile="1"} 0.000194901
go_gc_duration_seconds_sum 0.0012424
go_gc_duration_seconds_count 11
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 113
# HELP go_info Information about the Go environment.
# TYPE go_info gauge
go_info{version="go1.16"} 1
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 4.669544e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 1.5654752e+07
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.45326e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 108160
# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
go_memstats_gc_cpu_fraction 6.767118038487869e-06
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 5.536336e+06
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 4.669544e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 5.7565184e+07
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 7.512064e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 25904
# HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes gauge
go_memstats_heap_released_bytes 5.6336384e+07
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 6.5077248e+07
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.6152977257743602e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 0
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 134064
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 43200
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 49152
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 298928
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 311296
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 9.140384e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 3.742956e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 2.031616e+06
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 2.031616e+06
# HELP go_memstats_sys_bytes Number of bytes obtained from system.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 7.8201864e+07
# HELP go_threads Number of OS threads created.
# TYPE go_threads gauge
go_threads 31
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 1.7
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 11
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 4.8300032e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.6152970493e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 7.61479168e+08
# HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
# TYPE process_virtual_memory_max_bytes gauge
process_virtual_memory_max_bytes -1
# HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
# TYPE promhttp_metric_handler_requests_in_flight gauge
promhttp_metric_handler_requests_in_flight 1
# HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
# TYPE promhttp_metric_handler_requests_total counter
promhttp_metric_handler_requests_total{code="200"} 11
promhttp_metric_handler_requests_total{code="500"} 0
promhttp_metric_handler_requests_total{code="503"} 0
# HELP workqueue_adds_total [ALPHA] Total number of adds handled by workqueue
# TYPE workqueue_adds_total counter
workqueue_adds_total{name="claims"} 0
workqueue_adds_total{name="csistoragecapacity"} 25
workqueue_adds_total{name="unsavedpvs"} 0
workqueue_adds_total{name="volumes"} 0
# HELP workqueue_depth [ALPHA] Current depth of workqueue
# TYPE workqueue_depth gauge
workqueue_depth{name="claims"} 0
workqueue_depth{name="csistoragecapacity"} 0
workqueue_depth{name="unsavedpvs"} 0
workqueue_depth{name="volumes"} 0
# HELP workqueue_longest_running_processor_seconds [ALPHA] How many seconds has the longest running processor for workqueue been running.
# TYPE workqueue_longest_running_processor_seconds gauge
workqueue_longest_running_processor_seconds{name="claims"} 0
workqueue_longest_running_processor_seconds{name="csistoragecapacity"} 0
workqueue_longest_running_processor_seconds{name="unsavedpvs"} 0
workqueue_longest_running_processor_seconds{name="volumes"} 0
# HELP workqueue_queue_duration_seconds [ALPHA] How long in seconds an item stays in workqueue before being requested.
# TYPE workqueue_queue_duration_seconds histogram
...
workqueue_queue_duration_seconds_sum{name="claims"} 0
workqueue_queue_duration_seconds_count{name="claims"} 0
...
workqueue_queue_duration_seconds_sum{name="csistoragecapacity"} 0.229351696
workqueue_queue_duration_seconds_count{name="csistoragecapacity"} 25
...
workqueue_queue_duration_seconds_sum{name="unsavedpvs"} 0
workqueue_queue_duration_seconds_count{name="unsavedpvs"} 0
...
workqueue_queue_duration_seconds_sum{name="volumes"} 0
workqueue_queue_duration_seconds_count{name="volumes"} 0
# HELP workqueue_retries_total [ALPHA] Total number of retries handled by workqueue
# TYPE workqueue_retries_total counter
workqueue_retries_total{name="claims"} 0
workqueue_retries_total{name="csistoragecapacity"} 0
workqueue_retries_total{name="unsavedpvs"} 0
workqueue_retries_total{name="volumes"} 0
# HELP workqueue_unfinished_work_seconds [ALPHA] How many seconds of work has done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.
# TYPE workqueue_unfinished_work_seconds gauge
workqueue_unfinished_work_seconds{name="claims"} 0
workqueue_unfinished_work_seconds{name="csistoragecapacity"} 0
workqueue_unfinished_work_seconds{name="unsavedpvs"} 0
workqueue_unfinished_work_seconds{name="volumes"} 0
# HELP workqueue_work_duration_seconds [ALPHA] How long in seconds processing an item from workqueue takes.
# TYPE workqueue_work_duration_seconds histogram
...
workqueue_work_duration_seconds_sum{name="claims"} 0
workqueue_work_duration_seconds_count{name="claims"} 0
...
workqueue_work_duration_seconds_sum{name="csistoragecapacity"} 0.04584993
workqueue_work_duration_seconds_count{name="csistoragecapacity"} 25
...
workqueue_work_duration_seconds_sum{name="unsavedpvs"} 0
workqueue_work_duration_seconds_count{name="unsavedpvs"} 0
...
workqueue_work_duration_seconds_sum{name="volumes"} 0
workqueue_work_duration_seconds_count{name="volumes"} 0
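The wiring described above, condensed into a self-contained sketch. The helper package import paths are taken from the links discussed later in this review and may differ slightly per release; the real main.go uses its own mux, *metricsPath flag, and listen address rather than the hardcoded values here:

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"

	"k8s.io/component-base/metrics/legacyregistry"
	// Blank imports register workqueue and leader election metrics providers
	// with the default legacy registry (paths assumed from the review links).
	_ "k8s.io/component-base/metrics/prometheus/clientgo/leaderelection"
	_ "k8s.io/component-base/metrics/prometheus/workqueue"
)

func main() {
	// Registry for the sidecar's own metrics (e.g. csi_sidecar_operations_seconds).
	reg := prometheus.NewRegistry()

	// Serve both the sidecar registry and the legacy registry, which also
	// contains process and Go runtime metrics, from a single endpoint.
	gatherers := prometheus.Gatherers{reg, legacyregistry.DefaultGatherer}

	mux := http.NewServeMux()
	mux.Handle("/metrics",
		promhttp.InstrumentMetricHandler(
			reg,
			promhttp.HandlerFor(gatherers, promhttp.HandlerOpts{})))
	_ = http.ListenAndServe(":8080", mux)
}
```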
This adds gauges that describe how many objects the external-provisioner manages, how many of those already exist, and how many are to be deleted.

Example:

# HELP csistoragecapacities_desired_current [ALPHA] Number of CSIStorageCapacity objects that exist and are supposed to be managed automatically.
# TYPE csistoragecapacities_desired_current gauge
csistoragecapacities_desired_current 2
# HELP csistoragecapacities_desired_goal [ALPHA] Number of CSIStorageCapacity objects that are supposed to be managed automatically.
# TYPE csistoragecapacities_desired_goal gauge
csistoragecapacities_desired_goal 2
# HELP csistoragecapacities_obsolete [ALPHA] Number of CSIStorageCapacity objects that exist and will be deleted automatically.
# TYPE csistoragecapacities_obsolete gauge
csistoragecapacities_obsolete 0
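A rough sketch of how one such gauge can be exposed through a stable collector registered with the legacy registry; the API usage is per k8s.io/component-base/metrics as I understand it, and the real controller implements this interface directly and reports all three gauges:

```go
package main

import (
	"k8s.io/component-base/metrics"
	"k8s.io/component-base/metrics/legacyregistry"
)

var objectsGoalDesc = metrics.NewDesc(
	"csistoragecapacities_desired_goal",
	"Number of CSIStorageCapacity objects that are supposed to be managed automatically.",
	nil, nil,
	metrics.ALPHA, "",
)

// capacityCollector is a stand-in for the controller: it reports the current
// goal whenever the registry gathers metrics.
type capacityCollector struct {
	metrics.BaseStableCollector
	goal func() int64
}

func (c *capacityCollector) DescribeWithStability(ch chan<- *metrics.Desc) {
	ch <- objectsGoalDesc
}

func (c *capacityCollector) CollectWithStability(ch chan<- metrics.Metric) {
	ch <- metrics.NewLazyConstMetric(objectsGoalDesc, metrics.GaugeValue, float64(c.goal()))
}

func main() {
	c := &capacityCollector{goal: func() int64 { return 2 }}
	legacyregistry.CustomMustRegister(c)
}
```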
The workItem against capacity comparison was done in two places. A helper function makes the code easier to read.
@verult : updated and rebased - please take another look.
/lgtm
/assign @msau42
for approval
@@ -96,6 +123,9 @@ func TestController(t *testing.T) {
	modify func(ctx context.Context, clientSet *fakeclientset.Clientset, expected []testCapacity) (modifiedExpected []testCapacity, err error)
	capacityChange func(ctx context.Context, storage *mockCapacity, expected []testCapacity) (modifiedExpected []testCapacity)
	topologyChange func(ctx context.Context, topology *topology.Mock, expected []testCapacity) (modifiedExpected []testCapacity)

	expectedObjectsPrepared objects
	expectedTotalProcessed int64
nit: name it something like expectedTotalFinal. "Total Processed" could also imply the number of objects processed.
"Processed" refers to the step when the check runs (= after processing the work queue). If the tests ever get extended to do more work, the "final" might become misleading. I therefore prefer "Processed".
)
objectsObsoleteDesc = metrics.NewDesc(
	"csistoragecapacities_obsolete",
	"Number of CSIStorageCapacity objects that exist and will be deleted automatically. Objects that exist and may need an update are not considered obsolete and therefore not included in this value.",
Your comment makes sense to me and I don't have a strong preference either way whether to add that part of the description
@msau42 I'll leave it as-is unless you have some other preference.
The comment was the one in #579 (comment)
/assign @Jiawei0227
/assign @Jiawei0227 Somehow this previously had no effect?
/lgtm
I am not familiar with the storage capacity feature, so I think @verult has more context. From the metrics perspective, honestly the k8s metrics are already a mess... I think until that is figured out, this change is okay from my point of view.
But honestly I think we need to do something with csi-lib-utils to make the whole business of adding metrics to sidecars more elegant. The purpose of csi-lib-utils is to provide the central metrics registry for these sidecars. But I don't know if that would be too much of a change, or I guess that can be a separate issue...
// https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-base/metrics/prometheus/clientgo/leaderelection/metrics.go
//
// Also happens to include Go runtime and process metrics:
// https://github.com/kubernetes/kubernetes/blob/9780d88cb6a4b5b067256ecb4abf56892093ee87/staging/src/k8s.io/component-base/metrics/legacyregistry/registry.go#L46-L49
I once asked for component-base to include the whole set of runtime and process metrics here: kubernetes/kubernetes#95152
But they decided not to include them because of stability concerns.
I guess that only leaves us the option to either manually add those metrics or use legacyregistry...? Don't really know how this will end up... Overall the metrics situation in k8s just feels messy to me.
I once asked for component-base to include the whole set of runtime and process metrics here: kubernetes/kubernetes#95152
But they decided not to include them because of stability concerns.
Hmm, isn't https://github.com/kubernetes/kubernetes/blob/9780d88cb6a4b5b067256ecb4abf56892093ee87/staging/src/k8s.io/component-base/metrics/legacyregistry/registry.go#L46-L49 doing what you asked for?
Or do you want them to be added elsewhere?
I have not found documentation about the role this metrics/legacyregistry package plays, or is meant to play. I just figured out that in order to get access to workqueue metrics, I have to gather from that legacyregistry.
Yes, the code you reference does what I asked for. But it lives under legacyregistry, which implies it is not something we want to use moving forward. So I was asking them to have something in the new registry.
I guess my main concern is that we are relying heavily on legacyregistry, and I am afraid that if a later k8s release deprecates and removes it, we will have more work to do here. But this seems like a bigger issue across k8s as a whole.
@@ -458,6 +468,32 @@ func main() {
	*capacityPollInterval,
	*capacityImmediateBinding,
)
legacyregistry.CustomMustRegister(capacityController)
We can change the csi-lib-utils metricsManager constructor function to accept a KubeRegistry instead of creating our own, I guess?
@@ -107,6 +115,28 @@ var (
	Factor: 1.1,
	Steps: 10,
}

objectsGoalDesc = metrics.NewDesc(
	"csistoragecapacities_desired_goal",
nit: csistoragecapacities seems very long; do you think it makes sense to break it up a little bit, like csi_storage_capacities_...? Totally up to you.
I prefer csistoragecapacities for consistency with kubectl get csistoragecapacities.
I think your opinion about the metrics changes was what Michelle was after, so thanks for taking a look. FWIW, I tend to agree with your assessment of the overall situation in k8s metrics.
Separate issue, please.
@msau42 : both @verult and @Jiawei0227 are okay with the current state, please approve.
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: msau42, pohly, verult
The full list of commands accepted by this bot can be found here. The pull request process is described here.
What type of PR is this?
/kind feature
What this PR does / why we need it:
As pointed out in the KEP review for storage capacity tracking, admins need more insight into the operation of the sidecar than just the CSI calls. This PR adds new metrics specifically for storage capacity tracking, but also enables the gathering of metrics that already existed and just weren't reported (process and Go runtime, work queues, leader election).
Note to reviewer
The commit messages have examples of what the new data looks like. I'm not sure how we want to document this. IMHO we should first decide about central documentation (= kubernetes-csi/docs#339), then add something to each repo as needed.
Does this PR introduce a user-facing change?: