Conversation
In testing this I've found a few issues:
In addition:
Additional detail: I believe
Did an initial review and added some comments. Have not tested manually yet. I wonder if there are unit tests we could build.
I am also wondering if it makes sense to keep some (or most) of these out of the sync loop, and instead start a separate goroutine that is responsible for publishing the majority of these metrics on a specific cadence (update metrics every X seconds).
It seems like they don't need to be tied to discovery events (in k8s case) or the sync loop (in openstack case).
Although now that I have written it out, I think changing to this approach could result in metrics lagging behind?
@@ -154,7 +154,22 @@ func (r *Reconciler) reconcile() {
	r.reconcileSvcs(desiredSvcs, currentServices.Items)

	desiredEndpoints := kubeEndpoints(r.BackendName, projectName, loadbalancers, pools)
	for _, ep := range desiredEndpoints {
		desiredEndpoints = append(desiredEndpoints, ep)
Is this duplicating the list?
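For reference, ranging over a slice while appending each element back onto it doubles the list, because `range` evaluates the slice (and its length) once up front. A minimal sketch of the flagged pattern:

```go
package main

// selfAppend mimics the loop in the review comment above: each element of
// the original slice is appended back onto it, duplicating the contents.
func selfAppend(eps []string) []string {
	for _, ep := range eps {
		eps = append(eps, ep)
	}
	return eps
}
```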
if err != nil {
	r.Logger.Error(err)
}
r.Metrics.DiscovererReplicatedEndpointsMetric(r.BackendName, projectName, r.sumEndpoints(endpoints.Items))
Is this competing with the metric being set in sync.Action? It's unclear to me why we need them in both places. Also seems like we are missing the equivalent for DiscovererReplicatedServicesMetric?
// Log upstream services prometheus
r.Metrics.DiscovererUpstreamServicesMetric(r.BackendName, projectName, totalUpstreamServices)
r.Metrics.DiscovererInvalidServicesMetric(r.BackendName, projectName, totalUpstreamServices-len(loadbalancers))
nit: took me some time to figure out how this subtraction results in the number of invalid services. Can we create a variable invalidServicesCount or something and move it up closer to totalUpstreamServices and loadbalancers?
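Naming the subtraction would make the intent self-documenting. A sketch with simplified stand-in types (the real code uses the discoverer's load balancer type, not strings):

```go
package main

// invalidServicesCount names the subtraction: upstream services that did
// not produce a valid load balancer are considered invalid.
func invalidServicesCount(totalUpstreamServices int, loadbalancers []string) int {
	return totalUpstreamServices - len(loadbalancers)
}
```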
endpoints, err := r.GimbalKubeClient.CoreV1().Endpoints(projectName).List(metav1.ListOptions{})
if err != nil {
	r.Logger.Error(err)
}
Do we need an else here to avoid using endpoints when err != nil?
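One way to guard this (an early return instead of an else) is sketched below; `listEndpoints`, `logErr`, and `record` are hypothetical stand-ins for the client, logger, and metric calls in the PR:

```go
package main

import "errors"

// updateEndpointsMetric only consumes the listed endpoints when the List
// call succeeded; on error it logs and skips the metric update entirely.
func updateEndpointsMetric(listEndpoints func() ([]int, error), logErr func(error), record func(total int)) {
	items, err := listEndpoints()
	if err != nil {
		logErr(err)
		return // skip the metric rather than use a bad list
	}
	total := 0
	for _, n := range items {
		total += n
	}
	record(total)
}

// errBoom is a sample error for exercising the failure path.
var errBoom = errors.New("list failed")
```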
discovery/pkg/sync/service.go
Outdated
}

// // Log Total Services Metric
// totalUpstreamServices, err := getTotalServicesCount(kubeClient, action.ObjectMeta().GetNamespace(), backendName)
remove?
@alexbrand I 100% believe we could re-do how metrics are implemented; this was my comment last time we were on the phone. I have them tied to events since that's the trigger for when something changes. Looking into the other comments/issues now.
@rosskukulinski for the openstack
I think modified value @stevesloka
@stevesloka I lied, the
@@ -71,9 +71,12 @@ func main() {
	log.Info("Gimbal Kubernetes Discoverer Starting up...")

	// Init prometheus metrics
-	discovererMetrics = localmetrics.NewMetrics()
+	discovererMetrics = localmetrics.NewMetrics("kubernetes", backendName)
❤️
discovery/pkg/k8s/controller.go
Outdated
func (c *Controller) writeServiceMetrics(svc *v1.Service) error {
	upstreamServices, err := c.serviceLister.Services(svc.GetNamespace()).List(labels.Everything())
	if err != nil {
		return err
Should we log instead of return here? We are not handling it at the call site
discovery/pkg/metrics/metrics.go
Outdated
DiscovererUpstreamServicesGauge: prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: DiscovererUpstreamServicesGauge,
		Help: "Total number of services in the upstream backend cluster",
"upstream backend" sounds redundant to me. Thoughts?
discovery/pkg/metrics/metrics.go
Outdated
}

// DiscovererInvalidEndpointsMetric records the total replicated endpoints
func (d *DiscovererMetrics) DiscovererInvalidEndpointsMetric(namespace, serviceName string, totalEp int) {
Should we get rid of this one?
discovery/pkg/metrics/metrics.go
Outdated
DiscovererInvalidEndpointsGauge: prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: DiscovererInvalidEndpointsGauge,
		Help: "Total number of endpoints invalid endpoints that could not be replicaed from the backend cluster",
typo: "endpoints invalid endpoints"
but not sure if we need this metric anyways.
@@ -210,3 +219,19 @@ func (r *Reconciler) reconcileEndpoints(desired, current []v1.Endpoints) {
		r.syncqueue.Enqueue(sync.DeleteEndpointsAction(&ep))
	}
}

func (r *Reconciler) sumEndpoints(eps v1.Endpoints) int {
nit: can be func instead of method
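Since the helper doesn't touch the receiver, a plain function works. In this sketch, `v1.Endpoints.Subsets` is simplified to slices of addresses:

```go
package main

// sumEndpoints totals addresses across all subsets. No Reconciler state
// is needed, so a free function suffices; subsets is a simplified
// stand-in for the real v1.Endpoints.Subsets structure.
func sumEndpoints(subsets [][]string) int {
	total := 0
	for _, addrs := range subsets {
		total += len(addrs)
	}
	return total
}
```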
	return total
}

func (r *Reconciler) getListFromMap(mp map[string]v1.Endpoints) []v1.Endpoints {
nit: can be func instead of method
discovery/pkg/sync/endpoints.go
Outdated
metrics.EndpointsEventTimestampMetric(action.endpoints.GetNamespace(), action.endpoints.GetName(), time.Now().Unix())

// TODO: Move to lister()
totalEps, err := metrics.GetTotalEndpointsCount(kubeClient, action.endpoints.GetNamespace(), action.endpoints.GetName())
I am not following why we need to GET the endpoints object again, if we already have it in action.endpoints?
Good point, before I had the metric sum all endpoints for a namespace which was different.
discovery/pkg/metrics/metrics.go
Outdated
}

// GetTotalServicesCount returns the number of services in a namespace for the particular backend
func (d *DiscovererMetrics) GetTotalServicesCount(kubeclient kubernetes.Interface, namespace string) (int, error) {
These funcs seem somewhat out of place in that they are unrelated to the metrics package. Wondering if it makes sense to move these calls to the k8s API into the sync pkg, which is where this is called and it already has a dependency on the k8s packages.
I needed a good home for this but couldn't find one. Also, I need access to backendName to filter on labels. If it's moved from metrics, we'll just need to make those params public.
Aaaaah I see. I think another option might be to add a backendName field to endpointsAction and serviceAction, and do AddEndpointsAction(backendName, endpoints). The downside is that we have to plumb backendName again through a couple of layers. Not sure if it's worth it at this time.
I moved stuff around, just picked the backend name out of metrics, PTAL
LGTM! Is it worth it for me to test it out as well?
The only other thought that just came to me is that we are adding an extra API call for every service add/update/delete we do, right?
discovery/pkg/metrics/metrics.go
Outdated
DiscovererInvalidEndpointsGauge: prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: DiscovererInvalidEndpointsGauge,
		Help: "Total number of endpoints invalid endpoints that could not be replicaed from the backend",
typo: endpoints invalid endpoints
👍 ...also fixed replicated
Yes, right now there's a new API call for each service CRUD; let me see if I can pipe through the lister and save us the call.
ok I've got the lister piped in, so no additional api requests are required for metrics.
Nice! Should I test this out on my end, or can we merge?
something still isn't quite right, but now it's with the service metrics. To test, I stopped both discoverers and deleted all services in the
The following has been discovered:
yet here are the metrics gathered:
And
I'm having trouble reproducing the service error where the counts are off. Could you port-forward to the discoverer on
logs:
@stevesloka it's possible my local container build is screwed up. If you push a build from your environment I can test that.
Try this image:
I thought this was only for upstream services, not all. I can work to update. I thought once you sync'd then you wanted to see the gimbal service name, not the upstream name.
@alexbrand @rosskukulinski I think this is ready to review again. The only thing that I think could be improved is writing more tests around the Prom metrics. Additionally, there may be cases where we need to write out "0" for some error conditions. For example, should we always write a zero for gimbal_discoverer_invalid_services_total? I have a prebuilt image here:
Some notable changes:
@stevesloka This is now working for me! I don't think we need to write out a zero for metrics that don't have a value.
@@ -46,7 +46,7 @@ var (

func init() {
	flag.BoolVar(&printVersion, "version", false, "Show version and quit")
-	flag.IntVar(&numProcessThreads, "num-threads", 2, "Specify number of threads to use when processing queue items.")
+	flag.IntVar(&numProcessThreads, "num-threads", 1, "Specify number of threads to use when processing queue items.")
Any reason for changing this?
nah, just local debugging, I'll restore
@@ -50,6 +50,7 @@ var (
	prometheusListenPort int
	discovererMetrics localmetrics.DiscovererMetrics
	log *logrus.Logger
+	resyncInterval time.Duration
Doesn't look like this is used
discovery/pkg/k8s/controller_test.go
Outdated
gathering, err := gatherers.Gather()
if err != nil {
	fmt.Println(err)
t.Fatal here?
discovery/pkg/k8s/controller_test.go
Outdated
gathering, err := gatherers.Gather()
if err != nil {
	fmt.Println(err)
t.Fatal here?
// Convert the k8s list to type []Endpoints so make comparison easier
currentEndpoints := []Endpoints{}
for _, v := range currentk8sEndpoints.Items {
	currentEndpoints = append(currentEndpoints, Endpoints{endpoints: v, upstreamName: ""})
Do we need to set the upstreamName here?
Those are the current endpoints which come from the k8s query. The upstream name is not used, but having two lists of the same type makes the comparison logic work as-is.
for _, tc := range tests {
	t.Run(tc.name, func(t *testing.T) {
		nowFunc = func() time.Time {
How/where is this used?
The Timestamp metric used time.Now() to get the current time. In order to unit test that metric, we need a way to set it in the test. If it's not set in the test, then it defaults to now(). Both the service & endpoint in package sync use this.
discovery/pkg/sync/endpoints_test.go
Outdated
expectErr bool
expectedCount float64
expectedTimestamp float64
expectedLabels map[string]string
Doesn't look like there are assertions on the labels
No, I didn't write tests for those. They are simple pass-through values. But initially, I had thought to implement those.
Got it! Should we remove them from this struct?
discovery/pkg/sync/queue.go
Outdated
err := action.Sync(sq.KubeClient, sq.Metrics, sq.BackendName)
err := action.Sync(sq.KubeClient, sq.Logger)
if err != nil {
	sq.Metrics.ServiceMetricError(action.ObjectMeta().GetNamespace(), action.ObjectMeta().GetName(), action.GetActionType())
What if we get an error while handling an Endpoints resource?
yup, meant to refactor that, good catch.
discovery/pkg/sync/endpoints.go
Outdated
	return nil
}

func (action endpointsAction) String() string {
	return fmt.Sprintf(`%s endpoints '%s/%s'`, action.kind, action.endpoints.Namespace, action.endpoints.Name)
}

func addEndpoints(kubeClient kubernetes.Interface, endpoints *v1.Endpoints, lm localmetrics.DiscovererMetrics, backendName string) error {
func (action endpointsAction) LogMetrics(gimbalKubeClient kubernetes.Interface, metrics localmetrics.DiscovererMetrics,
nit: Can we call this SetMetrics or UpdateMetrics? Log makes me think of log messages.
How about WriteMetrics? Set & Update sound like they are values that are getting set.
eh we can do SetMetrics I think
var nowFunc nowFuncT

func init() {
Can we add some docs on why we need this and what it is doing?
Just inline comments? We don't have any dev docs at the moment.
Yep, sorry! That is what I meant :)
I cleaned it up, removed unused methods, added the missing header, and added some comments.
// a way to override the default values.
type nowFuncT func() time.Time

var nowFunc nowFuncT
I think this will result in a nil-pointer deref outside of tests
yeah my bad, I got overzealous, I restored the old impl.
@@ -114,6 +114,15 @@ func serviceName(lb loadbalancers.LoadBalancer) string {
	return strings.ToLower(lbName)
}

+// get the lb Name or ID if name is empty
+func serviceNameOriginal(lb loadbalancers.LoadBalancer) string {
Will this be a problem if there are two load balancers with the same name? Seems like it'll affect the metrics, but not 100% sure.
Yes, it will. @rosskukulinski do you want to allow duplicates here? It only should affect openstack at the moment.
The alternative is to use the name + ID, but not sure if that is what we want.
TBH I'm really not sure. I don't know how often folks name their load balancers the same thing in the same cluster. Let's leave it as-is for now (at risk of duplicates).
If the LB doesn't have a name, do we put in the id for the label?
yes name is tried first, then id is used.
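The name-then-ID fallback described above can be sketched like this, with a simplified stand-in for `loadbalancers.LoadBalancer` (the real function also lowercases and sanitizes the result):

```go
package main

// loadBalancer is a simplified stand-in for the OpenStack LoadBalancer type.
type loadBalancer struct{ Name, ID string }

// lbLabel mirrors the fallback discussed above: the LB name is tried
// first, and the ID is used when the name is empty. Note that two LBs
// with the same name would produce the same label (the duplicate risk
// raised in the review).
func lbLabel(lb loadBalancer) string {
	if lb.Name != "" {
		return lb.Name
	}
	return lb.ID
}
```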
Testing tonight, I lost the golang metrics, looking to add those back in.
Ok latest commit adds those metrics back in.
…ervices, upstream-endpoints, replicated-endpoints, invalid-endpoints) Signed-off-by: Steve Sloka <steves@heptio.com>
There's a couple of comments that might require some follow up work, but I still want to merge this to start testing. I will open up issues for us to follow up.
@@ -71,6 +72,7 @@ func init() {
	flag.IntVar(&prometheusListenPort, "prometheus-listen-address", 8080, "The address to listen on for Prometheus HTTP requests")
	flag.Float64Var(&gimbalKubeClientQPS, "gimbal-client-qps", 5, "The maximum queries per second (QPS) that can be performed on the Gimbal Kubernetes API server")
	flag.IntVar(&gimbalKubeClientBurst, "gimbal-client-burst", 10, "The maximum number of queries that can be performed on the Gimbal Kubernetes API server during a burst")
+	flag.DurationVar(&resyncInterval, "resync-interval", time.Minute*30, "Default resync period for watcher to refresh")
Looks like this isn't used. I want to merge this PR to start testing, so I will open a new issue to discuss this flag. It seems like we can remove it, but wanted to confirm with Steve.
if err != nil {
	action.SetMetricError(sq.Metrics)
}
action.SetMetrics(sq.KubeClient, sq.Metrics, sq.Logger)
I am wondering how these metrics will behave if we need to retry. Is it OK to set these metrics multiple times for the same service because of retries? Will open a new issue to discuss/look into this.
Add additional metrics to the Discoverers:
Fixes #108
Signed-off-by: Steve Sloka steves@heptio.com