additional metrics #579

Merged · 3 commits · Mar 29, 2021

Conversation

@pohly (Contributor) commented Mar 10, 2021

What type of PR is this?
/kind feature

What this PR does / why we need it:

As pointed out in the KEP review for storage capacity tracking, admins need more insight into the operation of the sidecar than just the CSI calls. This PR adds new metrics specifically for storage capacity tracking, and also enables gathering of metrics that already existed but just weren't reported (process and Go runtime, work queues, leader election).

Note to reviewer

The commit messages have examples of what the new data looks like. I'm not sure how we want to document this. IMHO we should first decide on central documentation (= kubernetes-csi/docs#339), then add something to each repo as needed.

Does this PR introduce a user-facing change?:

new metrics data (storage capacity tracking, process and Go runtime, work queues, leaderelection)

@k8s-ci-robot added the release-note, kind/feature, and cncf-cla: yes labels on Mar 10, 2021
@k8s-ci-robot requested review from lpabon and saad-ali on March 10, 2021 14:18
@k8s-ci-robot added the size/XL label on Mar 10, 2021
@pohly force-pushed the storage-capacity-metrics branch 2 times, most recently from 8b96197 to cf1f5af on March 10, 2021 18:23
@k8s-ci-robot added the size/XXL label and removed the size/XL label on Mar 10, 2021
@pohly force-pushed the storage-capacity-metrics branch 2 times, most recently from 24e5a70 to 306d40d on March 10, 2021 18:24
@verult (Contributor) commented Mar 16, 2021

/assign

@k8s-ci-robot added the needs-rebase label on Mar 18, 2021
@verult (Contributor) left a comment

Mostly minor suggestions

}
return nil
}

// TestCapacityController checks that the controller handles the initial state and
Contributor

nit: Update function name

Contributor Author

Done.

metricsManager := metrics.NewCSIMetricsManager("" /* driverName */)
metricsManager := metrics.NewCSIMetricsManagerWithOptions("", /* driverName */
// Will be provided via default gatherer.
metrics.WithProcessStartTime(false),
Contributor

How does this affect existing CSI cumulative metrics that rely on process start time, like csi_sidecar_operations_seconds for example?

Contributor Author

The process start time metric is registered in the legacy registry. By gathering from there, we still get the same value. If we enabled it here, we would have the same metric from two different places.

metricsManager := metrics.NewCSIMetricsManagerWithOptions("", /* driverName */
// Will be provided via default gatherer.
metrics.WithProcessStartTime(false),
metrics.WithSubsystem(metrics.SubsystemSidecar),
Contributor

Is this just to make the default explicit?

Contributor Author

Yes.

@@ -227,18 +237,18 @@ func main() {
}
}

// Prepare http endpoint for metrics + leader election healthz
// Prepare http endpoint for metrics + leader election healthz + other metrics
Contributor

nit: "other metrics" is redundant since "metrics" is mentioned in the beginning

Contributor Author

Removed.

mux.Handle(*metricsPath,
promhttp.InstrumentMetricHandler(
reg,
promhttp.HandlerFor(gatherers, promhttp.HandlerOpts{})))
Contributor

RegisterToServer() includes a HandlerOpt of ErrorHandling: metrics.ContinueOnError. Is that needed here?

Contributor Author

I don't think our gatherers ever return an error, so this shouldn't make a difference.

Even if they did, I'd prefer to get a clear indication that something was broken instead of continuing with incomplete data. But I don't feel strongly about this and could add it.

Contributor

Ok I also don't have a strong preference :)
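For readers following the thread, here is a minimal sketch of the wiring under discussion: one HTTP handler that serves both the sidecar's own registry and the default legacy registry, with the optional ContinueOnError behavior. Variable names, the metrics path, and the port are illustrative, not the exact PR code.

    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
        "k8s.io/component-base/metrics/legacyregistry"
    )

    func main() {
        // Stand-in for the registry that holds the sidecar's own metrics
        // (the csi-lib-utils metrics manager registry in the PR).
        reg := prometheus.NewRegistry()

        // Merge the sidecar registry with the default legacy registry, which
        // holds process, Go runtime, workqueue and leader election metrics.
        gatherers := prometheus.Gatherers{reg, legacyregistry.DefaultGatherer}

        mux := http.NewServeMux()
        mux.Handle("/metrics",
            promhttp.InstrumentMetricHandler(
                reg,
                promhttp.HandlerFor(gatherers, promhttp.HandlerOpts{
                    // Optional: serve whatever could be gathered instead of
                    // failing the whole scrape when one gatherer errors out.
                    ErrorHandling: promhttp.ContinueOnError,
                })))

        _ = http.ListenAndServe(":8080", mux)
    }

With the default ErrorHandling (HTTPErrorOnError), a gatherer error turns into an HTTP 500, which matches the "clear indication that something was broken" preference above.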

@@ -96,6 +123,9 @@ func TestController(t *testing.T) {
modify func(ctx context.Context, clientSet *fakeclientset.Clientset, expected []testCapacity) (modifiedExpected []testCapacity, err error)
capacityChange func(ctx context.Context, storage *mockCapacity, expected []testCapacity) (modifiedExpected []testCapacity)
topologyChange func(ctx context.Context, topology *topology.Mock, expected []testCapacity) (modifiedExpected []testCapacity)

expectedObjectsPrepared objects
expectedObjectsProcessed objects
Contributor

Is it the case that in expectedObjectsProcessed, goal always == current and obsolete always == 0? If so, it might be useful to either add this here as a comment, or just use a single int as the expectation representing both the expected goal and current.

Contributor Author

Indeed, using objects as the type here is unnecessarily complex. I've switched to int.

)
objectsObsoleteDesc = metrics.NewDesc(
"csistoragecapacities_obsolete",
"Number of CSIStorageCapacity objects that exist and will be deleted automatically.",
Contributor

Maybe mention in the description that modified objects are not considered obsolete. I wasn't sure from the current descriptions alone what the behavior of these metrics is with updated objects.

Contributor Author

There's no such thing as a "modified object" 😁

There are objects that are already scheduled for a refresh, which then may or may not lead to them being updated. The length of the work queue is an indicator for that.

I added "Objects that exist and may need an update are not considered obsolete and therefore not included in this value." as clarification. But I don't think it actually says something that wasn't already in that existing description.

@@ -115,6 +140,8 @@ type CSICapacityClient interface {
}

// NewController creates a new controller for CSIStorageCapacity objects.
// It implements metrics.StableCollector and thus can be registered in
Contributor

nit: Consider having var _ metrics.StableCollector = &Controller{} instead of or in addition to this comment

Contributor Author

I added the type assertion. That's always useful and not adding it was an oversight.

But since the assertion wouldn't show up in generated documentation, I kept the comment.
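For readers unfamiliar with the idiom, this is the compile-time guard being discussed; a sketch with a stand-in Controller type (the real controller lives in the external-provisioner's capacity package and implements the Describe/Collect methods itself):

    package capacity

    import (
        "k8s.io/component-base/metrics"
    )

    // Controller is a stand-in for the PR's capacity controller type. The
    // embedded BaseStableCollector supplies the StableCollector plumbing.
    type Controller struct {
        metrics.BaseStableCollector
        // ... controller state ...
    }

    // Compile-time check: the build fails here if *Controller ever stops
    // satisfying metrics.StableCollector; there is no runtime cost.
    var _ metrics.StableCollector = &Controller{}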

@@ -358,8 +385,11 @@ func (c *Controller) addWorkItem(segment *topology.Segment, sc *storagev1.Storag
storageClassName: sc.Name,
}
// Ensure that we have an entry for it...
capacity := c.capacities[item]
c.capacities[item] = capacity
_, found := c.capacities[item]
Contributor

Just curious: do the two versions do the same thing and the newer version just makes it more explicit, or are there actual functional differences?

Contributor Author

This change here alone is not functionally identical: the new line only checks for the capacity without adding it.

But directly underneath it we set the entry to nil if not found, and together with that it is functionally equivalent because capacity := c.capacities[item] will also be nil when not found.
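A tiny, self-contained illustration of the Go map behavior this relies on (hypothetical types, not the PR's code): indexing a missing key yields the zero value, so both variants leave the map in the same state.

    package main

    import "fmt"

    // Hypothetical stand-ins for the controller's types, just to show the
    // map semantics.
    type workItem struct{ name string }
    type capacityEntry struct{}

    func main() {
        item := workItem{name: "example"}

        // Old version: read (zero value, i.e. nil, if missing) and write back.
        a := map[workItem]*capacityEntry{}
        capacity := a[item] // nil, because the key does not exist yet
        a[item] = capacity  // entry now exists with a nil value

        // New version: explicit existence check, then create with nil.
        b := map[workItem]*capacityEntry{}
        if _, found := b[item]; !found {
            b[item] = nil
        }

        // Both maps end up with one entry whose value is nil.
        fmt.Println(a[item] == nil, b[item] == nil, len(a) == len(b)) // true true true
    }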


func (c *Controller) isObsolete(capacity *storagev1alpha1.CSIStorageCapacity) bool {
for item, _ := range c.capacities {
if item.storageClassName == capacity.StorageClassName &&
Contributor

nit: define a workItem.equals() method, which can be easily identified and reused by future devs

Contributor Author

Yes, that is useful. I already had this comparison in two places.
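A sketch of what such a helper might look like; the field names and types here are illustrative stand-ins, not the PR's actual workItem definition.

    package main

    import "fmt"

    // Illustrative only: a work item identified by storage class and a
    // topology segment string, plus the object fields it is compared against.
    type workItem struct {
        storageClassName string
        segment          string
    }

    type capacityObject struct {
        StorageClassName string
        Segment          string
    }

    // equals centralizes the comparison so callers such as isObsolete and the
    // update path do not duplicate the field-by-field check.
    func (w workItem) equals(capacity *capacityObject) bool {
        return w.storageClassName == capacity.StorageClassName &&
            w.segment == capacity.Segment
    }

    func main() {
        item := workItem{storageClassName: "fast", segment: "zone-a"}
        obj := &capacityObject{StorageClassName: "fast", Segment: "zone-a"}
        fmt.Println(item.equals(obj)) // true
    }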

pohly added 3 commits March 23, 2021 15:34
The external-provisioner has several named work queues, all of which can
provide metrics data. We just need to
- initialize gathering through the default legacy registry by importing
  the helper package for that
- configure the mux such that the default legacy registry is also included

Leader election can be enabled the same way.

The default legacy registry also includes information about processes
and the Go runtime.

Here is some example data without leader election (wasn't enabled):

   # HELP csi_sidecar_operations_seconds [ALPHA] Container Storage Interface operation duration with gRPC error code status total
   # TYPE csi_sidecar_operations_seconds histogram
   ...
   csi_sidecar_operations_seconds_sum{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Controller/ControllerGetCapabilities"} 0.001358078
   csi_sidecar_operations_seconds_count{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Controller/ControllerGetCapabilities"} 1
   ...
   csi_sidecar_operations_seconds_sum{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Controller/GetCapacity"} 0.025071011
   csi_sidecar_operations_seconds_count{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Controller/GetCapacity"} 25
   ...
   csi_sidecar_operations_seconds_sum{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Identity/GetPluginCapabilities"} 0.001320474
   csi_sidecar_operations_seconds_count{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Identity/GetPluginCapabilities"} 1
   ...
   csi_sidecar_operations_seconds_sum{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeGetInfo"} 0.000994998
   csi_sidecar_operations_seconds_count{driver_name="hostpath.csi.k8s.io",grpc_status_code="OK",method_name="/csi.v1.Node/NodeGetInfo"} 1
   ...
   csi_sidecar_operations_seconds_sum{driver_name="unknown-driver",grpc_status_code="OK",method_name="/csi.v1.Identity/GetPluginInfo"} 0.000752024
   csi_sidecar_operations_seconds_count{driver_name="unknown-driver",grpc_status_code="OK",method_name="/csi.v1.Identity/GetPluginInfo"} 1
   ...
   csi_sidecar_operations_seconds_sum{driver_name="unknown-driver",grpc_status_code="OK",method_name="/csi.v1.Identity/Probe"} 0.002666977
   csi_sidecar_operations_seconds_count{driver_name="unknown-driver",grpc_status_code="OK",method_name="/csi.v1.Identity/Probe"} 1
   # HELP go_gc_duration_seconds A summary of the pause duration of garbage collection cycles.
   # TYPE go_gc_duration_seconds summary
   go_gc_duration_seconds{quantile="0"} 7.5671e-05
   go_gc_duration_seconds{quantile="0.25"} 9.2559e-05
   go_gc_duration_seconds{quantile="0.5"} 9.9967e-05
   go_gc_duration_seconds{quantile="0.75"} 0.000135593
   go_gc_duration_seconds{quantile="1"} 0.000194901
   go_gc_duration_seconds_sum 0.0012424
   go_gc_duration_seconds_count 11
   # HELP go_goroutines Number of goroutines that currently exist.
   # TYPE go_goroutines gauge
   go_goroutines 113
   # HELP go_info Information about the Go environment.
   # TYPE go_info gauge
   go_info{version="go1.16"} 1
   # HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
   # TYPE go_memstats_alloc_bytes gauge
   go_memstats_alloc_bytes 4.669544e+06
   # HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
   # TYPE go_memstats_alloc_bytes_total counter
   go_memstats_alloc_bytes_total 1.5654752e+07
   # HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
   # TYPE go_memstats_buck_hash_sys_bytes gauge
   go_memstats_buck_hash_sys_bytes 1.45326e+06
   # HELP go_memstats_frees_total Total number of frees.
   # TYPE go_memstats_frees_total counter
   go_memstats_frees_total 108160
   # HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
   # TYPE go_memstats_gc_cpu_fraction gauge
   go_memstats_gc_cpu_fraction 6.767118038487869e-06
   # HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
   # TYPE go_memstats_gc_sys_bytes gauge
   go_memstats_gc_sys_bytes 5.536336e+06
   # HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
   # TYPE go_memstats_heap_alloc_bytes gauge
   go_memstats_heap_alloc_bytes 4.669544e+06
   # HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
   # TYPE go_memstats_heap_idle_bytes gauge
   go_memstats_heap_idle_bytes 5.7565184e+07
   # HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
   # TYPE go_memstats_heap_inuse_bytes gauge
   go_memstats_heap_inuse_bytes 7.512064e+06
   # HELP go_memstats_heap_objects Number of allocated objects.
   # TYPE go_memstats_heap_objects gauge
   go_memstats_heap_objects 25904
   # HELP go_memstats_heap_released_bytes Number of heap bytes released to OS.
   # TYPE go_memstats_heap_released_bytes gauge
   go_memstats_heap_released_bytes 5.6336384e+07
   # HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
   # TYPE go_memstats_heap_sys_bytes gauge
   go_memstats_heap_sys_bytes 6.5077248e+07
   # HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
   # TYPE go_memstats_last_gc_time_seconds gauge
   go_memstats_last_gc_time_seconds 1.6152977257743602e+09
   # HELP go_memstats_lookups_total Total number of pointer lookups.
   # TYPE go_memstats_lookups_total counter
   go_memstats_lookups_total 0
   # HELP go_memstats_mallocs_total Total number of mallocs.
   # TYPE go_memstats_mallocs_total counter
   go_memstats_mallocs_total 134064
   # HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
   # TYPE go_memstats_mcache_inuse_bytes gauge
   go_memstats_mcache_inuse_bytes 43200
   # HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
   # TYPE go_memstats_mcache_sys_bytes gauge
   go_memstats_mcache_sys_bytes 49152
   # HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
   # TYPE go_memstats_mspan_inuse_bytes gauge
   go_memstats_mspan_inuse_bytes 298928
   # HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
   # TYPE go_memstats_mspan_sys_bytes gauge
   go_memstats_mspan_sys_bytes 311296
   # HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
   # TYPE go_memstats_next_gc_bytes gauge
   go_memstats_next_gc_bytes 9.140384e+06
   # HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
   # TYPE go_memstats_other_sys_bytes gauge
   go_memstats_other_sys_bytes 3.742956e+06
   # HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
   # TYPE go_memstats_stack_inuse_bytes gauge
   go_memstats_stack_inuse_bytes 2.031616e+06
   # HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
   # TYPE go_memstats_stack_sys_bytes gauge
   go_memstats_stack_sys_bytes 2.031616e+06
   # HELP go_memstats_sys_bytes Number of bytes obtained from system.
   # TYPE go_memstats_sys_bytes gauge
   go_memstats_sys_bytes 7.8201864e+07
   # HELP go_threads Number of OS threads created.
   # TYPE go_threads gauge
   go_threads 31
   # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
   # TYPE process_cpu_seconds_total counter
   process_cpu_seconds_total 1.7
   # HELP process_max_fds Maximum number of open file descriptors.
   # TYPE process_max_fds gauge
   process_max_fds 1.048576e+06
   # HELP process_open_fds Number of open file descriptors.
   # TYPE process_open_fds gauge
   process_open_fds 11
   # HELP process_resident_memory_bytes Resident memory size in bytes.
   # TYPE process_resident_memory_bytes gauge
   process_resident_memory_bytes 4.8300032e+07
   # HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
   # TYPE process_start_time_seconds gauge
   process_start_time_seconds 1.6152970493e+09
   # HELP process_virtual_memory_bytes Virtual memory size in bytes.
   # TYPE process_virtual_memory_bytes gauge
   process_virtual_memory_bytes 7.61479168e+08
   # HELP process_virtual_memory_max_bytes Maximum amount of virtual memory available in bytes.
   # TYPE process_virtual_memory_max_bytes gauge
   process_virtual_memory_max_bytes -1
   # HELP promhttp_metric_handler_requests_in_flight Current number of scrapes being served.
   # TYPE promhttp_metric_handler_requests_in_flight gauge
   promhttp_metric_handler_requests_in_flight 1
   # HELP promhttp_metric_handler_requests_total Total number of scrapes by HTTP status code.
   # TYPE promhttp_metric_handler_requests_total counter
   promhttp_metric_handler_requests_total{code="200"} 11
   promhttp_metric_handler_requests_total{code="500"} 0
   promhttp_metric_handler_requests_total{code="503"} 0
   # HELP workqueue_adds_total [ALPHA] Total number of adds handled by workqueue
   # TYPE workqueue_adds_total counter
   workqueue_adds_total{name="claims"} 0
   workqueue_adds_total{name="csistoragecapacity"} 25
   workqueue_adds_total{name="unsavedpvs"} 0
   workqueue_adds_total{name="volumes"} 0
   # HELP workqueue_depth [ALPHA] Current depth of workqueue
   # TYPE workqueue_depth gauge
   workqueue_depth{name="claims"} 0
   workqueue_depth{name="csistoragecapacity"} 0
   workqueue_depth{name="unsavedpvs"} 0
   workqueue_depth{name="volumes"} 0
   # HELP workqueue_longest_running_processor_seconds [ALPHA] How many seconds has the longest running processor for workqueue been running.
   # TYPE workqueue_longest_running_processor_seconds gauge
   workqueue_longest_running_processor_seconds{name="claims"} 0
   workqueue_longest_running_processor_seconds{name="csistoragecapacity"} 0
   workqueue_longest_running_processor_seconds{name="unsavedpvs"} 0
   workqueue_longest_running_processor_seconds{name="volumes"} 0
   # HELP workqueue_queue_duration_seconds [ALPHA] How long in seconds an item stays in workqueue before being requested.
   # TYPE workqueue_queue_duration_seconds histogram
   ...
   workqueue_queue_duration_seconds_sum{name="claims"} 0
   workqueue_queue_duration_seconds_count{name="claims"} 0
   ...
   workqueue_queue_duration_seconds_sum{name="csistoragecapacity"} 0.229351696
   workqueue_queue_duration_seconds_count{name="csistoragecapacity"} 25
   ...
   workqueue_queue_duration_seconds_sum{name="unsavedpvs"} 0
   workqueue_queue_duration_seconds_count{name="unsavedpvs"} 0
   ...
   workqueue_queue_duration_seconds_sum{name="volumes"} 0
   workqueue_queue_duration_seconds_count{name="volumes"} 0
   # HELP workqueue_retries_total [ALPHA] Total number of retries handled by workqueue
   # TYPE workqueue_retries_total counter
   workqueue_retries_total{name="claims"} 0
   workqueue_retries_total{name="csistoragecapacity"} 0
   workqueue_retries_total{name="unsavedpvs"} 0
   workqueue_retries_total{name="volumes"} 0
   # HELP workqueue_unfinished_work_seconds [ALPHA] How many seconds of work has done that is in progress and hasn't been observed by work_duration. Large values indicate stuck threads. One can deduce the number of stuck threads by observing the rate at which this increases.
   # TYPE workqueue_unfinished_work_seconds gauge
   workqueue_unfinished_work_seconds{name="claims"} 0
   workqueue_unfinished_work_seconds{name="csistoragecapacity"} 0
   workqueue_unfinished_work_seconds{name="unsavedpvs"} 0
   workqueue_unfinished_work_seconds{name="volumes"} 0
   # HELP workqueue_work_duration_seconds [ALPHA] How long in seconds processing an item from workqueue takes.
   # TYPE workqueue_work_duration_seconds histogram
   ...
   workqueue_work_duration_seconds_sum{name="claims"} 0
   workqueue_work_duration_seconds_count{name="claims"} 0
   ...
   workqueue_work_duration_seconds_sum{name="csistoragecapacity"} 0.04584993
   workqueue_work_duration_seconds_count{name="csistoragecapacity"} 25
   ...
   workqueue_work_duration_seconds_sum{name="unsavedpvs"} 0
   workqueue_work_duration_seconds_count{name="unsavedpvs"} 0
   ...
   workqueue_work_duration_seconds_sum{name="volumes"} 0
   workqueue_work_duration_seconds_count{name="volumes"} 0
This adds gauges that describe how many objects the
external-provisioner manages, how many of those already exist, and how
many are to be deleted. Example:

  # HELP csistoragecapacities_desired_current [ALPHA] Number of CSIStorageCapacity objects that exist and are supposed to be managed automatically.
  # TYPE csistoragecapacities_desired_current gauge
  csistoragecapacities_desired_current 2
  # HELP csistoragecapacities_desired_goal [ALPHA] Number of CSIStorageCapacity objects that are supposed to be managed automatically.
  # TYPE csistoragecapacities_desired_goal gauge
  csistoragecapacities_desired_goal 2
  # HELP csistoragecapacities_obsolete [ALPHA] Number of CSIStorageCapacity objects that exist and will be deleted automatically.
  # TYPE csistoragecapacities_obsolete gauge
  csistoragecapacities_obsolete 0
The comparison of a workItem against a CSIStorageCapacity object was done
in two places. A helper function makes the code easier to read.
@pohly force-pushed the storage-capacity-metrics branch from 306d40d to 2b79a45 on March 23, 2021 14:35
@k8s-ci-robot removed the needs-rebase label on Mar 23, 2021
@pohly (Contributor Author) commented Mar 23, 2021

@verult: updated and rebased - please take another look.

@verult (Contributor) left a comment

/lgtm

/assign @msau42
for approval

@@ -96,6 +123,9 @@ func TestController(t *testing.T) {
modify func(ctx context.Context, clientSet *fakeclientset.Clientset, expected []testCapacity) (modifiedExpected []testCapacity, err error)
capacityChange func(ctx context.Context, storage *mockCapacity, expected []testCapacity) (modifiedExpected []testCapacity)
topologyChange func(ctx context.Context, topology *topology.Mock, expected []testCapacity) (modifiedExpected []testCapacity)

expectedObjectsPrepared objects
expectedTotalProcessed int64
Contributor

nit: name it something like expectedTotalFinal. "Total Processed" could also imply the number of objects processed

Contributor Author

"Processed" refers to the step when the check runs (= after processing the work queue). If the tests ever get extended to do more work, the "final" might become misleading. I therefore prefer "Processed".

)
objectsObsoleteDesc = metrics.NewDesc(
"csistoragecapacities_obsolete",
"Number of CSIStorageCapacity objects that exist and will be deleted automatically. Objects that exist and may need an update are not considered obsolete and therefore not included in this value.",
Contributor

Your comment makes sense to me and I don't have a strong preference either way on whether to add that part of the description.

Contributor Author

@msau42 I'll leave it as-is unless you have some other preference.

Contributor Author

The comment was the one in #579 (comment)

@k8s-ci-robot added the lgtm label on Mar 23, 2021
@msau42 (Collaborator) commented Mar 25, 2021

/assign @Jiawei0227
Actually, I think this is the PR for metrics.

@pohly (Contributor Author) commented Mar 25, 2021

/assign @Jiawei0227

Somehow this previously had no effect?

@Jiawei0227 (Contributor) left a comment

/lgtm

I am not familiar with the storage capacity feature, so I think @verult had more context. From the metrics perspective, honestly, k8s metrics are already a mess... I think until that is figured out, this change is okay from my point of view.

But honestly I think we need to do something with csi-lib-utils to make the whole thing of adding metrics to sidecars more elegant. The purpose of csi-lib-utils is to provide the central metrics registry for these sidecars. But I don't know if that would be too much of a change, or I guess that can be a separate issue...

// https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-base/metrics/prometheus/clientgo/leaderelection/metrics.go
//
// It also happens to include Go runtime and process metrics:
// https://github.com/kubernetes/kubernetes/blob/9780d88cb6a4b5b067256ecb4abf56892093ee87/staging/src/k8s.io/component-base/metrics/legacyregistry/registry.go#L46-L49
Contributor

I once asked for component-base to include the whole set of runtime and process metrics here: kubernetes/kubernetes#95152

But they decided not to include them because of stability concerns.

I guess that only leaves us the option to either manually add those metrics or use the legacyregistry...? Don't really know how this will end up... Overall the metrics situation in k8s just feels messy to me.

@pohly (Contributor Author) Mar 29, 2021

> I once asked for component-base to include the whole set of runtime and process metrics here: kubernetes/kubernetes#95152
>
> But they decided not to include them because of stability concerns.

Hmm, isn't https://github.com/kubernetes/kubernetes/blob/9780d88cb6a4b5b067256ecb4abf56892093ee87/staging/src/k8s.io/component-base/metrics/legacyregistry/registry.go#L46-L49 doing what you asked for?

Or do you want them to be added elsewhere?

I have not found documentation about the role this metrics/legacyregistry package plays, or is meant to play. I just figured out that in order to get access to workqueue metrics, I have to gather from that legacyregistry.
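A condensed sketch of that wiring, assuming the component-base package layout referenced above: the blank imports have init() side effects that install metric providers which register into the legacy registry, and the /metrics handler then only needs to gather from legacyregistry.DefaultGatherer (as in the earlier mux sketch).

    package main

    import (
        "fmt"

        "k8s.io/component-base/metrics/legacyregistry"
        _ "k8s.io/component-base/metrics/prometheus/clientgo/leaderelection" // leader election metrics provider
        _ "k8s.io/component-base/metrics/prometheus/workqueue"               // workqueue metrics provider
    )

    func main() {
        // Process and Go runtime metrics are registered by the legacy
        // registry itself; workqueue and leader election metrics show up
        // once those components create their queues and locks.
        families, err := legacyregistry.DefaultGatherer.Gather()
        if err != nil {
            panic(err)
        }
        for _, mf := range families {
            fmt.Println(mf.GetName())
        }
    }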

@Jiawei0227 (Contributor) Mar 29, 2021

Yes, the code you reference does what I asked for. But it lives under this legacyregistry, which implies it is not something we want to use moving forward. So I was asking them to have something in the new registry.

I guess my main concern is that we are relying on legacyregistry a lot, and I am afraid that if it gets deprecated and removed in some later k8s release, we will have more work to do here. But this seems like a bigger issue across the whole of k8s.

@@ -458,6 +468,32 @@ func main() {
*capacityPollInterval,
*capacityImmediateBinding,
)
legacyregistry.CustomMustRegister(capacityController)
Contributor

We can change the csi-lib-utils metricsManager constructor function to accept a KubeRegistry instead of creating our own, I guess?

@@ -107,6 +115,28 @@ var (
Factor: 1.1,
Steps: 10,
}

objectsGoalDesc = metrics.NewDesc(
"csistoragecapacities_desired_goal",
Contributor

nit: csistoragecapacities seems very long; do you think it makes sense to break it up a little, like csi_storage_capacities_...? Totally up to you.

Contributor Author

I prefer csistoragecapacities for consistency with kubectl get csistoragecapacities.
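To round out the thread, here is a condensed sketch of a StableCollector exposing gauges with these names. The metric names and help strings mirror the commit message; the collector structure and the getter functions are illustrative stand-ins, not the PR's actual controller code.

    package main

    import (
        "k8s.io/component-base/metrics"
        "k8s.io/component-base/metrics/legacyregistry"
    )

    // Descriptions mirroring the gauges from the commit message (ALPHA stability).
    var (
        objectsGoalDesc = metrics.NewDesc(
            "csistoragecapacities_desired_goal",
            "Number of CSIStorageCapacity objects that are supposed to be managed automatically.",
            nil, nil, metrics.ALPHA, "")
        objectsCurrentDesc = metrics.NewDesc(
            "csistoragecapacities_desired_current",
            "Number of CSIStorageCapacity objects that exist and are supposed to be managed automatically.",
            nil, nil, metrics.ALPHA, "")
        objectsObsoleteDesc = metrics.NewDesc(
            "csistoragecapacities_obsolete",
            "Number of CSIStorageCapacity objects that exist and will be deleted automatically.",
            nil, nil, metrics.ALPHA, "")
    )

    // capacityCollector is a stand-in for the real controller: it implements
    // metrics.StableCollector and reports three gauges on each scrape.
    type capacityCollector struct {
        metrics.BaseStableCollector
        goal, current, obsolete func() int
    }

    var _ metrics.StableCollector = &capacityCollector{}

    func (c *capacityCollector) DescribeWithStability(ch chan<- *metrics.Desc) {
        ch <- objectsGoalDesc
        ch <- objectsCurrentDesc
        ch <- objectsObsoleteDesc
    }

    func (c *capacityCollector) CollectWithStability(ch chan<- metrics.Metric) {
        ch <- metrics.NewLazyConstMetric(objectsGoalDesc, metrics.GaugeValue, float64(c.goal()))
        ch <- metrics.NewLazyConstMetric(objectsCurrentDesc, metrics.GaugeValue, float64(c.current()))
        ch <- metrics.NewLazyConstMetric(objectsObsoleteDesc, metrics.GaugeValue, float64(c.obsolete()))
    }

    func main() {
        c := &capacityCollector{
            goal:     func() int { return 2 },
            current:  func() int { return 2 },
            obsolete: func() int { return 0 },
        }
        // Same registration call as the PR uses for the capacity controller.
        legacyregistry.CustomMustRegister(c)
    }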

@pohly (Contributor Author) commented Mar 29, 2021

> I am not familiar with the storage capacity feature, so I think @verult had more context. From the metrics perspective, honestly, k8s metrics are already a mess...

I think your opinion about the metrics changes was what Michelle was after, so thanks for taking a look. FWIW, I tend to agree with your assessment of the overall situation in k8s metrics.

> But honestly I think we need to do something with csi-lib-utils to make the whole thing of adding metrics to sidecars more elegant. The purpose of csi-lib-utils is to provide the central metrics registry for these sidecars. But I don't know if that would be too much of a change, or I guess that can be a separate issue...

Separate issue, please.

@pohly (Contributor Author) commented Mar 29, 2021

@msau42: both @verult and @Jiawei0227 are okay with the current state, please approve.

@msau42 (Collaborator) commented Mar 29, 2021

/approve

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: msau42, pohly, verult

@k8s-ci-robot added the approved label on Mar 29, 2021
@k8s-ci-robot merged commit e11ccd6 into kubernetes-csi:master on Mar 29, 2021