This repository has been archived by the owner on May 6, 2022. It is now read-only.

Share OSB client for ServiceBroker #2337

Merged

Conversation

piotrmiskiewicz
Contributor

@piotrmiskiewicz piotrmiskiewicz commented Sep 12, 2018

This PR is a

  • Feature Implementation
  • Bug Fix
  • Documentation

What this PR does / why we need it:

This PR introduces BrokerClientManager, which stores OSB clients - one client per broker. It lets all calls to a broker share a single OSB client instance, preventing the controller from creating a new OSB client for every operation, in line with the Go http.Client documentation: "The Client's Transport typically has internal state (cached TCP connections), so Clients should be reused instead of created as needed. Clients are safe for concurrent use by multiple goroutines."
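
A minimal sketch of the caching idea described above. The names BrokerKey, BrokerClientManager, UpdateBrokerClient, BrokerClient, and RemoveBrokerClient follow the discussion in this PR, but the osb.* types are mocked here with illustrative stand-ins, so this is not the PR's exact code:

```go
package main

import (
	"fmt"
	"sync"
)

// BrokerKey identifies a broker (Namespace is empty for cluster-scoped brokers).
type BrokerKey struct {
	Name      string
	Namespace string
}

// ClientConfiguration is a stand-in for osb.ClientConfiguration.
type ClientConfiguration struct {
	URL string
}

// Client is a stand-in for osb.Client; the real client wraps an http.Client
// whose Transport caches TCP connections, which is why instances must be reused.
type Client struct {
	cfg ClientConfiguration
}

// BrokerClientManager stores one OSB client per broker so that every
// reconcile loop reuses the same underlying connections.
type BrokerClientManager struct {
	mu      sync.RWMutex
	clients map[BrokerKey]*Client
}

func NewBrokerClientManager() *BrokerClientManager {
	return &BrokerClientManager{clients: map[BrokerKey]*Client{}}
}

// UpdateBrokerClient creates a new client only when the configuration changed;
// otherwise it returns the cached instance.
func (m *BrokerClientManager) UpdateBrokerClient(key BrokerKey, cfg ClientConfiguration) *Client {
	m.mu.Lock()
	defer m.mu.Unlock()
	if existing, ok := m.clients[key]; ok && existing.cfg == cfg {
		return existing // reuse: config unchanged
	}
	c := &Client{cfg: cfg}
	m.clients[key] = c
	return c
}

// BrokerClient returns the cached client for a broker, if any.
func (m *BrokerClientManager) BrokerClient(key BrokerKey) (*Client, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	c, ok := m.clients[key]
	return c, ok
}

// RemoveBrokerClient drops the client when its broker is deleted.
func (m *BrokerClientManager) RemoveBrokerClient(key BrokerKey) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.clients, key)
}

func main() {
	mgr := NewBrokerClientManager()
	key := BrokerKey{Name: "ups-broker"}
	c1 := mgr.UpdateBrokerClient(key, ClientConfiguration{URL: "http://ups-broker"})
	c2 := mgr.UpdateBrokerClient(key, ClientConfiguration{URL: "http://ups-broker"})
	fmt.Println(c1 == c2) // same instance is reused while the config is unchanged
}
```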

Which issue(s) this PR fixes

Fixes #2276

Please leave this checklist in the PR comment so that maintainers can ensure a good PR.

Merge Checklist:

  • New feature
    • Tests
    • Documentation
  • SVCat CLI flag
  • Server Flag for config
    • Chart changes
    • removing a flag by marking deprecated and hiding to avoid
      breaking the chart release and existing clients who provide a
      flag that will get an error when they try to update

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 12, 2018
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 12, 2018
@piotrmiskiewicz piotrmiskiewicz force-pushed the shared-osb-clients branch 2 times, most recently from 487afab to 3f33290 on September 12, 2018 10:33
@piotrmiskiewicz piotrmiskiewicz changed the title [WIP] Share OSB client for ServiceBroker Share OSB client for ServiceBroker Sep 12, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 12, 2018
@luksa
Contributor

luksa commented Sep 12, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 12, 2018
@piotrmiskiewicz
Contributor Author

/test pull-service-catalog-integration

@piotrmiskiewicz piotrmiskiewicz changed the title Share OSB client for ServiceBroker [WIP] Share OSB client for ServiceBroker Sep 13, 2018
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 13, 2018
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 13, 2018
@@ -119,13 +124,45 @@ func (c *controller) reconcileClusterServiceBrokerKey(key string) error {
return c.reconcileClusterServiceBroker(broker)
}

func (c *controller) updateClusterServiceBrokerClient(broker *v1beta1.ClusterServiceBroker) (osb.Client, error) {
Contributor Author

I'm thinking about a change:

  • return only one value here - the error
  • have brokerClientManager's UpdateBrokerClient also return only an error

This seems better, but I'd like to wait for tests and comments on the main concept - caching OSB clients.

return client, nil
}

func (m *BrokerClientManager) configHasChanged(cfg1 *osb.ClientConfiguration, cfg2 *osb.ClientConfiguration) bool {
Contributor

Would this be better if it was a function instead of a method?

Contributor Author

I wrote the helper method to be used inside BrokerClientManager and I don't want to expose it. It was not designed to be used by other components.

Contributor Author

I've changed it to a function.
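
For illustration, the method-to-function change amounts to dropping the receiver: the helper touches no manager state, and keeping it lowercase still hides it from other packages. A sketch with a mocked ClientConfiguration (the real type is osb.ClientConfiguration, and the real comparison may differ):

```go
package main

import (
	"fmt"
	"reflect"
)

// ClientConfiguration is a stand-in for osb.ClientConfiguration.
type ClientConfiguration struct {
	URL      string
	Username string
}

// configHasChanged is a package-level function rather than a method on
// BrokerClientManager: it needs no receiver state, and remaining
// unexported keeps it invisible to other components.
func configHasChanged(cfg1, cfg2 *ClientConfiguration) bool {
	return !reflect.DeepEqual(cfg1, cfg2)
}

func main() {
	a := &ClientConfiguration{URL: "http://broker", Username: "user"}
	b := &ClientConfiguration{URL: "http://broker", Username: "user"}
	fmt.Println(configHasChanged(a, b)) // identical configs: false
}
```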

@piotrmiskiewicz piotrmiskiewicz changed the title [WIP] Share OSB client for ServiceBroker Share OSB client for ServiceBroker Sep 13, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 13, 2018
@piotrmiskiewicz piotrmiskiewicz changed the title Share OSB client for ServiceBroker [WIP] Share OSB client for ServiceBroker Sep 13, 2018
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 13, 2018
@@ -0,0 +1,132 @@
/*
Copyright 2017 The Kubernetes Authors.
Contributor

nit: 2018

Contributor

@jboyd01 jboyd01 left a comment

I've only got a couple of minor nits. This looks solid to me. I had concerns about this concept, but on review it looks good. I'll take a closer look at the tests you removed; I would think they are still valid tests but perhaps need to be reworked?

I've mentioned it to @nilebox - it would be beneficial to have additional review from others who have been deep in this code. Also @kibbles-n-bytes, if you have cycles.

delete(m.clients, brokerKey)
}

// BrokerClient returns broker client fro a broker specified by the brokerKey
Contributor

nit: spelling s/fro/for/

Contributor Author

fixed

@@ -0,0 +1,139 @@
/*
Copyright 2017 The Kubernetes Authors.
Contributor

nit: 2018

Contributor Author

fixed

@piotrmiskiewicz
Contributor Author

I think the test I removed, and the additional integration test I skipped, should be reworked. The old test checked how the controller behaves while processing serviceinstances when the service broker auth is wrong - the scenario where the "controller fails to locate the broker authentication secret." In the current solution the controller does not need to locate the secret there - that happens when the clusterservicebroker is processed. I'll try to create such a test for broker resource processing.

@piotrmiskiewicz
Contributor Author

I'm thinking about checking the size of the cache in a unit (or integration) test, just to be sure it does not grow without reason (so we don't introduce a new memory leak).

@piotrmiskiewicz
Contributor Author

I've added a test for a non-existing broker with service instance reconciliation. I realized one behavior change - how the controller handles a missing secret with auth credentials. In my solution, when a user creates a ClusterServiceBroker with a reference to a secret which does not exist, the broker client won't be created. If the user then creates the secret, nothing changes until the next reconciliation. Maybe that is an issue.

@piotrmiskiewicz piotrmiskiewicz changed the title [WIP] Share OSB client for ServiceBroker Share OSB client for ServiceBroker Sep 14, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 14, 2018
@piotrmiskiewicz
Contributor Author

piotrmiskiewicz commented Sep 17, 2018

I performed the tests described in issue #2276: 80 brokers which always respond with HTTP 500.

controller-manager 0.1.31: we can see restarts ("current memory" drops very fast - the pod is restarted).
(screenshot: controller-manager-0.1.31, 12 hours)

controller-manager 0.1.32:
(screenshots: controller-manager-0.1.32 memory, controller-manager-0.1.32 restarts)

I've applied the PR on top of the new version 0.1.32 and the result is: no restarts.
(screenshot: controller-manager-0.1.32 with shared OSB clients)

I've also tested the fix with version 0.1.31 and saw the controller manager pod running for more than one day without a restart.

Contributor

@jboyd01 jboyd01 left a comment

I've added a test for a non-existing broker with service instance reconciliation. I realized one behavior change - how the controller handles a missing secret with auth credentials. In my solution, when a user creates a ClusterServiceBroker with a reference to a secret which does not exist, the broker client won't be created. If the user then creates the secret, nothing changes until the next reconciliation. Maybe that is an issue.

Can you elaborate on this - "nothing changes until the next reconciliation" - the user creates the missing secret, the broker client will be created when the exponential backoff expires and it does the retry, right? If that is the case, it seems pretty correct to me, I'm good with that.

Thanks for the additional analysis and long runs @piotrmiskiewicz. This is looking good, I'd like to move this forward. @luksa reviewed last week and only had one minor comment, I discussed briefly with @nilebox and he was on board with the idea. Let's get one more review.

// TestReconcileServiceInstanceWithAuthError tests reconcileInstance when Kube Client
// fails to locate the broker authentication secret.
func TestReconcileServiceInstanceWithAuthError(t *testing.T) {
// TestReconcileServiceInstanceWithNotExistingBroker tests reconcileInstance when the BrokerClientManager instance does not contain client for the broker.
Contributor

nit: as a rule we wrap all function comments at column 80

Contributor Author

fixed

@jboyd01
Contributor

jboyd01 commented Sep 19, 2018

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jboyd01

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 19, 2018
@piotrmiskiewicz
Contributor Author

piotrmiskiewicz commented Sep 19, 2018

Can you elaborate on this - "nothing changes until the next reconciliation" - the user creates the missing secret, the broker client will be created when the exponential backoff expires and it does the retry, right? If that is the case, it seems pretty correct to me, I'm good with that.

I'll give an example. It works like an image pull secret, where you need credentials for a Docker repository. If you specify an imagePullSecret in a Deployment definition but the secret does not exist, the Pod will fail (it won't pull the image). After you create the secret, nothing changes; you need to delete the Pod or change the Deployment.
The same applies here. If the broker expects credentials (the Broker definition contains a reference to a secret) but the secret does not exist, the OSB client cannot work - exactly as before my change. When the user then creates the expected secret, Service Catalog still cannot make a call, just as in Kubernetes the Pod does not move to the Running state after you create the secret with Docker repository credentials.
When the controller performs a resync (or the next backoff retry is processed), the OSB client is updated. After that, everything works fine.

  1. The user creates a ClusterServiceBroker resource with a reference to a secret (with auth credentials).
  2. Service Catalog is triggered by the ClusterServiceBroker addition.
  3. Service Catalog tries to create an OSB client, but the secret is missing; the reconcileClusterServiceBroker method returns an error and the controller retries.
  4. All retries (defined by the exponential backoff policy) fail because the secret is still missing.
  5. The user creates the secret.
  6. Operations such as provisioning and deprovisioning cannot be performed because the OSB client has not been created.
  7. We need to wait until the next resync (every 5 minutes by default, set by defaultResyncInterval) - then the OSB client is created with authorization.

In my opinion it is not a problem, but I wanted to describe what has changed.
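
The sequence above can be sketched as a toy simulation - all names below are illustrative stand-ins, not the controller's real code; the point is only that a non-nil reconcile error drives the workqueue's retries, and a later resync finally succeeds:

```go
package main

import (
	"errors"
	"fmt"
)

// secrets simulates the cluster's secret store.
var secrets = map[string][]byte{}

// clientCreated records whether the OSB client was built.
var clientCreated bool

// reconcileClusterServiceBroker fails while the referenced auth secret is
// missing; returning a non-nil error makes the controller requeue the
// broker with exponential backoff.
func reconcileClusterServiceBroker(secretName string) error {
	if _, ok := secrets[secretName]; !ok {
		return errors.New("auth secret not found") // steps 3-4: retried with backoff
	}
	clientCreated = true // in the real code, UpdateBrokerClient would run here
	return nil
}

func main() {
	// Steps 2-4: every retry fails while the secret is missing.
	for i := 0; i < 3; i++ {
		fmt.Println(reconcileClusterServiceBroker("broker-auth"))
	}
	// Step 5: the user creates the secret. Nothing happens yet (step 6):
	// instance operations still have no client until the broker is reprocessed.
	secrets["broker-auth"] = []byte("credentials")
	// Step 7: the next resync (default every 5 minutes) finally succeeds.
	err := reconcileClusterServiceBroker("broker-auth")
	fmt.Println(err, clientCreated)
}
```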

@jboyd01
Contributor

jboyd01 commented Sep 19, 2018

re #2337 (comment)
Great, working as expected, thanks for verifying.

@piotrmiskiewicz
Contributor Author

/retest

@piotrmiskiewicz
Contributor Author

/test pull-service-catalog-integration

Contributor

@luksa luksa left a comment

Looks good. I'm just not completely sure about the fact that we now create the client only when reconciling the broker. When reconciling other resources, we now no longer create the client, but simply log an error.

The new way seems more correct, but I need to think about the implications.

The new way also ensures we only retrieve the broker auth secret once instead of every time.

FYI: I tested this manually, and have confirmed that the broker only has one open connection for each ServiceBroker/ClusterServiceBroker instance.

// BrokerKey defines a key which points to a broker (cluster wide or namespaced)
type BrokerKey struct {
name string
namespace string
Contributor

Idea for a future improvement: consider a case where a large multi-user cluster has a large number of ServiceBroker instances all pointing to the same broker (with the same osb.ClientConfiguration). We may want to ensure the connections are shared between all those ServiceBrokers, so we don't hold too many open connections to the same broker.

Contributor Author

@piotrmiskiewicz piotrmiskiewicz Sep 26, 2018

Yes, the improvement would be to key clients not by the namespace/name pair but by the configuration (the TLS config). The authentication part (username/password) is not set on the Go http.Client. The best improvement would be a change in the OSB client library to share the http.Client even when the username/password differ, but that change is much bigger.

Anyway, with the current solution (without my client-sharing implementation), when the resync is set to 5 minutes with 1000 registered brokers (about 3 get-catalog requests per second), the controller manager restarts every few minutes (because of "out of memory").
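
A sketch of the suggested future improvement - keying clients by configuration instead of namespace/name - under the assumption that only the URL/TLS part of the configuration determines the shared http.Client (username/password never reach it). All types here are mocked stand-ins, not the service-catalog code:

```go
package main

import (
	"fmt"
	"sync"
)

// TLSConfigKey stands in for the part of osb.ClientConfiguration that
// determines the underlying http.Client: the URL and TLS settings, but
// not the username/password, which are sent per request.
type TLSConfigKey struct {
	URL        string
	CABundle   string
	SkipVerify bool
}

// fakeClient stands in for a shared OSB client.
type fakeClient struct{ key TLSConfigKey }

// sharedClients keys clients by configuration, so many ServiceBroker or
// ClusterServiceBroker resources pointing at the same broker share one
// set of TCP connections.
type sharedClients struct {
	mu      sync.Mutex
	clients map[TLSConfigKey]*fakeClient
}

func (s *sharedClients) get(key TLSConfigKey) *fakeClient {
	s.mu.Lock()
	defer s.mu.Unlock()
	if c, ok := s.clients[key]; ok {
		return c // reuse the connection pool for identical TLS config
	}
	c := &fakeClient{key: key}
	s.clients[key] = c
	return c
}

func main() {
	s := &sharedClients{clients: map[TLSConfigKey]*fakeClient{}}
	key := TLSConfigKey{URL: "https://shared-broker"}
	a := s.get(key) // e.g. a ServiceBroker in namespace "team-a"
	b := s.get(key) // e.g. a ServiceBroker in namespace "team-b"
	fmt.Println(a == b, len(s.clients)) // one client serves both brokers
}
```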

Contributor

I agree. The improvement should go in later, in a separate PR. We need to get this PR in fast, since it will solve a lot of problems for us.

"Error getting broker auth credentials for broker %q: %s",
broker.Name, err,
"The instance references a broker %q which has no OSB client created",
serviceClass.Spec.ClusterServiceBrokerName,
Contributor

When does this happen? Previously, we would create the client here, but now we expect it to always exist at this point. Would panicking be more appropriate, since we're not expecting the client to not exist here?

Contributor Author

A clusterservicebroker resource could be created at the same time as a provisioning request for one of its serviceclasses, and the provisioning request could be processed before the clusterservicebroker addition. From the controller's perspective (or this method's implementation) that can happen. Another scenario: deletion of a clusterservicebroker reaches the controller at the same time as a deprovisioning request. This problem could also occur before my PR - the client can be removed before processing. I'm not aware of every detail; if such an error occurs, the call will be retried with the proper backoff policy.

Contributor

Yes, I realized later that we definitely shouldn't panic.

@@ -107,8 +108,12 @@ func shouldReconcileClusterServiceBroker(broker *v1beta1.ClusterServiceBroker, n
func (c *controller) reconcileClusterServiceBrokerKey(key string) error {
broker, err := c.clusterServiceBrokerLister.Get(key)
pcb := pretty.NewContextBuilder(pretty.ClusterServiceBroker, "", key, "")

glog.V(4).Info(pcb.Message(fmt.Sprintf("Processing service broker %s", key)))
Contributor

Redundant info. This is how the log line looks:

...icebroker.go:112] ClusterServiceBroker "ups-broker": Processing service broker ups-broker

Contributor Author

fixed

@piotrmiskiewicz
Contributor Author

I'm just not completely sure about the fact that we now create the client only when reconciling the broker. When reconciling other resources, we now no longer create the client, but simply log an error.

It is what I described before: it is like an imagePullSecret for a Docker registry. Even if you update the secret, the cluster won't try to pull the image again.
You need to decide whether that is good enough.

On the other hand, storing credentials in the OSB client is maybe not the best solution. Providing credentials on every call would fix the problem and would allow better caching - several brokers with different credentials but one TLS config could share one OSB client.

@luksa
Contributor

luksa commented Sep 26, 2018

/lgtm

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Memory leak in Controller Manager when registered malfunctioning broker
4 participants