This repository has been archived by the owner on May 6, 2022. It is now read-only.

Share OSB client for ServiceBroker #2337

Merged

Conversation

piotrmiskiewicz
Contributor

@piotrmiskiewicz piotrmiskiewicz commented Sep 12, 2018

This PR is a

  • Feature Implementation
  • Bug Fix
  • Documentation

What this PR does / why we need it:

This PR introduces BrokerClientManager, which stores OSB clients - one client per broker. It lets all calls to a broker share a single OSB client instance, preventing the controller from creating a new OSB client for every operation, in line with the Go http.Client documentation: "The Client's Transport typically has internal state (cached TCP connections), so Clients should be reused instead of created as needed. Clients are safe for concurrent use by multiple goroutines."
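
A minimal sketch of the caching idea described above. The names BrokerKey, BrokerClientManager, UpdateBrokerClient, BrokerClient, and RemoveBrokerClient follow the discussion in this PR, but the osb.* types are mocked here with illustrative stand-ins, so this is not the PR's exact code:

```go
package main

import (
	"fmt"
	"sync"
)

// BrokerKey identifies a broker (Namespace is empty for cluster-scoped brokers).
type BrokerKey struct {
	Name      string
	Namespace string
}

// ClientConfiguration is a stand-in for osb.ClientConfiguration.
type ClientConfiguration struct {
	URL string
}

// Client is a stand-in for osb.Client; the real client wraps an http.Client
// whose Transport caches TCP connections, which is why instances must be reused.
type Client struct {
	cfg ClientConfiguration
}

// BrokerClientManager stores one OSB client per broker so that every
// reconcile loop reuses the same underlying connections.
type BrokerClientManager struct {
	mu      sync.RWMutex
	clients map[BrokerKey]*Client
}

func NewBrokerClientManager() *BrokerClientManager {
	return &BrokerClientManager{clients: map[BrokerKey]*Client{}}
}

// UpdateBrokerClient creates a new client only when the configuration changed;
// otherwise it returns the cached instance.
func (m *BrokerClientManager) UpdateBrokerClient(key BrokerKey, cfg ClientConfiguration) *Client {
	m.mu.Lock()
	defer m.mu.Unlock()
	if existing, ok := m.clients[key]; ok && existing.cfg == cfg {
		return existing // reuse: config unchanged
	}
	c := &Client{cfg: cfg}
	m.clients[key] = c
	return c
}

// BrokerClient returns the cached client for a broker, if any.
func (m *BrokerClientManager) BrokerClient(key BrokerKey) (*Client, bool) {
	m.mu.RLock()
	defer m.mu.RUnlock()
	c, ok := m.clients[key]
	return c, ok
}

// RemoveBrokerClient drops the client when its broker is deleted.
func (m *BrokerClientManager) RemoveBrokerClient(key BrokerKey) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.clients, key)
}

func main() {
	mgr := NewBrokerClientManager()
	key := BrokerKey{Name: "ups-broker"}
	c1 := mgr.UpdateBrokerClient(key, ClientConfiguration{URL: "http://ups-broker"})
	c2 := mgr.UpdateBrokerClient(key, ClientConfiguration{URL: "http://ups-broker"})
	fmt.Println(c1 == c2) // same instance is reused while the config is unchanged
}
```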

Which issue(s) this PR fixes

Fixes #2276

Please leave this checklist in the PR comment so that maintainers can ensure a good PR.

Merge Checklist:

  • New feature
    • Tests
    • Documentation
  • SVCat CLI flag
  • Server Flag for config
    • Chart changes
    • removing a flag by marking deprecated and hiding to avoid
      breaking the chart release and existing clients who provide a
      flag that will get an error when they try to update

@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 12, 2018
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 12, 2018
@piotrmiskiewicz piotrmiskiewicz force-pushed the shared-osb-clients branch 2 times, most recently from 487afab to 3f33290 on September 12, 2018 10:33
@piotrmiskiewicz piotrmiskiewicz changed the title [WIP] Share OSB client for ServiceBroker Share OSB client for ServiceBroker Sep 12, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 12, 2018
@luksa
Contributor

luksa commented Sep 12, 2018

/ok-to-test

@k8s-ci-robot k8s-ci-robot removed the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Sep 12, 2018
@piotrmiskiewicz
Contributor Author

/test pull-service-catalog-integration

@piotrmiskiewicz piotrmiskiewicz changed the title Share OSB client for ServiceBroker [WIP] Share OSB client for ServiceBroker Sep 13, 2018
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 13, 2018
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 13, 2018
@@ -119,13 +124,45 @@ func (c *controller) reconcileClusterServiceBrokerKey(key string) error {
return c.reconcileClusterServiceBroker(broker)
}

func (c *controller) updateClusterServiceBrokerClient(broker *v1beta1.ClusterServiceBroker) (osb.Client, error) {
Contributor Author

I'm thinking about a change:

  • return only one value here - the error
  • have brokerClientManager's UpdateBrokerClient also return only an error

This seems better, but I'd like to wait for tests and comments on the main concept - caching OSB clients.

return client, nil
}

func (m *BrokerClientManager) configHasChanged(cfg1 *osb.ClientConfiguration, cfg2 *osb.ClientConfiguration) bool {
Contributor

Would this be better if it was a function instead of a method?

Contributor Author

I wrote the helper method to be used inside BrokerClientManager and I don't want to expose it. It was not designed to be used by other components.

Contributor Author

I've changed it to a function.
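
For illustration, the method-to-function change amounts to dropping the receiver: the helper touches no manager state, and keeping it lowercase still hides it from other packages. A sketch with a mocked ClientConfiguration (the real type is osb.ClientConfiguration, and the real comparison may differ):

```go
package main

import (
	"fmt"
	"reflect"
)

// ClientConfiguration is a stand-in for osb.ClientConfiguration.
type ClientConfiguration struct {
	URL      string
	Username string
}

// configHasChanged is a package-level function rather than a method on
// BrokerClientManager: it needs no receiver state, and remaining
// unexported keeps it invisible to other components.
func configHasChanged(cfg1, cfg2 *ClientConfiguration) bool {
	return !reflect.DeepEqual(cfg1, cfg2)
}

func main() {
	a := &ClientConfiguration{URL: "http://broker", Username: "user"}
	b := &ClientConfiguration{URL: "http://broker", Username: "user"}
	fmt.Println(configHasChanged(a, b)) // identical configs: false
}
```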

@piotrmiskiewicz piotrmiskiewicz changed the title [WIP] Share OSB client for ServiceBroker Share OSB client for ServiceBroker Sep 13, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 13, 2018
@piotrmiskiewicz piotrmiskiewicz changed the title Share OSB client for ServiceBroker [WIP] Share OSB client for ServiceBroker Sep 13, 2018
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 13, 2018
@@ -0,0 +1,132 @@
/*
Copyright 2017 The Kubernetes Authors.
Contributor

nit: 2018

Contributor

@jboyd01 jboyd01 left a comment

I've only got a couple of minor nits. This looks solid to me. I had concerns about this concept, but on review it looks good. I'll take a closer look at the tests you removed; I would think they are still valid tests but perhaps need to be reworked?

I've mentioned it to @nilebox - it would be beneficial to have additional review from others who have been deep in this code. Also @kibbles-n-bytes, if you have cycles.

delete(m.clients, brokerKey)
}

// BrokerClient returns broker client fro a broker specified by the brokerKey
Contributor

nit: spelling s/fro/for/

Contributor Author

fixed

@@ -0,0 +1,139 @@
/*
Copyright 2017 The Kubernetes Authors.
Contributor

nit: 2018

Contributor Author

fixed

@piotrmiskiewicz
Contributor Author

I think the test I removed, and the additional integration test I skipped, should be reworked. The old test checked how the controller behaves while processing serviceinstances when the service broker auth is wrong - the scenario where the "controller fails to locate the broker authentication secret." In the current solution the controller does not need to locate the secret there - that happens when the clusterservicebroker is processed. I'll try to create such a test for broker resource processing.

@piotrmiskiewicz
Contributor Author

I'm thinking about checking the size of the cache in a unit (or integration) test, just to be sure it does not grow without reason (so we don't introduce a new memory leak).

@piotrmiskiewicz
Contributor Author

I've added a test for a non-existing broker with service instance reconciliation. I realized one behavior change - how the controller handles a missing secret with auth credentials. In my solution, when a user creates a ClusterServiceBroker with a reference to a secret which does not exist, the broker client won't be created. If the user then creates the secret, nothing changes until the next reconciliation. Maybe that is an issue.

@piotrmiskiewicz piotrmiskiewicz changed the title [WIP] Share OSB client for ServiceBroker Share OSB client for ServiceBroker Sep 14, 2018
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 14, 2018
@piotrmiskiewicz
Contributor Author

piotrmiskiewicz commented Sep 17, 2018

I performed the tests described in issue #2276: 80 brokers which always respond with HTTP 500.

controller-manager 0.1.31: we can see restarts ("current memory" drops very fast - the pod is restarted).
(screenshot: controller-manager-0.1.31, 12 hours)

controller-manager 0.1.32:
(screenshots: controller-manager-0.1.32 memory, controller-manager-0.1.32 restarts)

I've applied the PR on top of the new version 0.1.32 and the result is: no restarts.
(screenshot: controller-manager-0.1.32 with shared OSB clients)

I've also tested the fix with version 0.1.31 and saw the controller manager pod running for more than one day without a restart.

Contributor

@jboyd01 jboyd01 left a comment

I've added a test for a non-existing broker with service instance reconciliation. I realized one behavior change - how the controller handles a missing secret with auth credentials. In my solution, when a user creates a ClusterServiceBroker with a reference to a secret which does not exist, the broker client won't be created. If the user then creates the secret, nothing changes until the next reconciliation. Maybe that is an issue.

Can you elaborate on this - "nothing changes until the next reconciliation" - the user creates the missing secret, the broker client will be created when the exponential backoff expires and it does the retry, right? If that is the case, it seems pretty correct to me, I'm good with that.

Thanks for the additional analysis and long runs @piotrmiskiewicz. This is looking good, I'd like to move this forward. @luksa reviewed last week and only had one minor comment, I discussed briefly with @nilebox and he was on board with the idea. Let's get one more review.

// TestReconcileServiceInstanceWithAuthError tests reconcileInstance when Kube Client
// fails to locate the broker authentication secret.
func TestReconcileServiceInstanceWithAuthError(t *testing.T) {
// TestReconcileServiceInstanceWithNotExistingBroker tests reconcileInstance when the BrokerClientManager instance does not contain client for the broker.
Contributor

nit: as a rule we wrap all function comments at column 80

Contributor Author

fixed

@jboyd01
Contributor

jboyd01 commented Sep 19, 2018

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jboyd01

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 19, 2018
@piotrmiskiewicz
Contributor Author

piotrmiskiewicz commented Sep 19, 2018

Can you elaborate on this - "nothing changes until the next reconciliation" - the user creates the missing secret, the broker client will be created when the exponential backoff expires and it does the retry, right? If that is the case, it seems pretty correct to me, I'm good with that.

I'll give an example. It works like an image pull secret, where you need credentials for a Docker repository. If you specify an imagePullSecret in a Deployment definition but the secret does not exist, the Pod will fail (it won't pull the image). After you create the secret, nothing changes; you need to delete the Pod or change the Deployment.
The same applies here. If the broker expects credentials (the Broker definition contains a reference to a secret) but the secret does not exist, the OSB client cannot work - exactly as before my change. When the user then creates the expected secret, Service Catalog still cannot make a call, just as in Kubernetes the Pod does not move to the Running state after you create the secret with Docker repository credentials.
When the controller performs a resync (or the next backoff retry is processed), the OSB client is updated. After that, everything works fine.

  1. The user creates a ClusterServiceBroker resource with a reference to a secret (with auth credentials).
  2. Service Catalog is triggered by the ClusterServiceBroker addition.
  3. Service Catalog tries to create an OSB client, but the secret is missing; the reconcileClusterServiceBroker method returns an error and the controller retries.
  4. All retries (defined by the exponential backoff policy) fail because the secret is still missing.
  5. The user creates the secret.
  6. Operations such as provisioning and deprovisioning cannot be performed because the OSB client has not been created.
  7. We need to wait until the next resync (every 5 minutes by default, set by defaultResyncInterval) - then the OSB client is created with authorization.

In my opinion it is not a problem, but I wanted to describe what has changed.
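
The sequence above can be sketched as a toy simulation - all names below are illustrative stand-ins, not the controller's real code; the point is only that a non-nil reconcile error drives the workqueue's retries, and a later resync finally succeeds:

```go
package main

import (
	"errors"
	"fmt"
)

// secrets simulates the cluster's secret store.
var secrets = map[string][]byte{}

// clientCreated records whether the OSB client was built.
var clientCreated bool

// reconcileClusterServiceBroker fails while the referenced auth secret is
// missing; returning a non-nil error makes the controller requeue the
// broker with exponential backoff.
func reconcileClusterServiceBroker(secretName string) error {
	if _, ok := secrets[secretName]; !ok {
		return errors.New("auth secret not found") // steps 3-4: retried with backoff
	}
	clientCreated = true // in the real code, UpdateBrokerClient would run here
	return nil
}

func main() {
	// Steps 2-4: every retry fails while the secret is missing.
	for i := 0; i < 3; i++ {
		fmt.Println(reconcileClusterServiceBroker("broker-auth"))
	}
	// Step 5: the user creates the secret. Nothing happens yet (step 6):
	// instance operations still have no client until the broker is reprocessed.
	secrets["broker-auth"] = []byte("credentials")
	// Step 7: the next resync (default every 5 minutes) finally succeeds.
	err := reconcileClusterServiceBroker("broker-auth")
	fmt.Println(err, clientCreated)
}
```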

@jboyd01
Contributor

jboyd01 commented Sep 19, 2018

re #2337 (comment)
Great, working as expected, thanks for verifying.

@piotrmiskiewicz
Contributor Author

/retest

@piotrmiskiewicz
Contributor Author

/test pull-service-catalog-integration

Contributor

@luksa luksa left a comment

Looks good. I'm just not completely sure about the fact that we now create the client only when reconciling the broker. When reconciling other resources, we now no longer create the client, but simply log an error.

The new way seems more correct, but I need to think about the implications.

The new way also ensures we only retrieve the broker auth secret once instead of every time.

FYI: I tested this manually, and have confirmed that the broker only has one open connection for each ServiceBroker/ClusterServiceBroker instance.

// BrokerKey defines a key which points to a broker (cluster wide or namespaced)
type BrokerKey struct {
name string
namespace string
Contributor

Idea for a future improvement: consider a case where a large multi-user cluster has a large number of ServiceBroker instances all pointing to the same broker (with the same osb.ClientConfiguration). We may want to ensure the connections are shared between all those ServiceBrokers, so we don't hold too many open connections to the same broker.

Contributor Author

@piotrmiskiewicz piotrmiskiewicz Sep 26, 2018

Yes, the improvement would be to key clients not by the namespace/name pair but by the configuration (the TLS config). The authentication part (username/password) is not set on the Go http.Client. The best improvement would be a change in the OSB client library to share the http.Client even when the username/password differ, but that change is much bigger.

Anyway, with the current solution (without my client-sharing implementation), when the resync is set to 5 minutes with 1000 registered brokers (about 3 get-catalog requests per second), the controller manager restarts every few minutes (because of "out of memory").
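
A sketch of the suggested future improvement - keying clients by configuration instead of namespace/name - under the assumption that only the URL/TLS part of the configuration determines the shared http.Client (username/password never reach it). All types here are mocked stand-ins, not the service-catalog code:

```go
package main

import (
	"fmt"
	"sync"
)

// TLSConfigKey stands in for the part of osb.ClientConfiguration that
// determines the underlying http.Client: the URL and TLS settings, but
// not the username/password, which are sent per request.
type TLSConfigKey struct {
	URL        string
	CABundle   string
	SkipVerify bool
}

// fakeClient stands in for a shared OSB client.
type fakeClient struct{ key TLSConfigKey }

// sharedClients keys clients by configuration, so many ServiceBroker or
// ClusterServiceBroker resources pointing at the same broker share one
// set of TCP connections.
type sharedClients struct {
	mu      sync.Mutex
	clients map[TLSConfigKey]*fakeClient
}

func (s *sharedClients) get(key TLSConfigKey) *fakeClient {
	s.mu.Lock()
	defer s.mu.Unlock()
	if c, ok := s.clients[key]; ok {
		return c // reuse the connection pool for identical TLS config
	}
	c := &fakeClient{key: key}
	s.clients[key] = c
	return c
}

func main() {
	s := &sharedClients{clients: map[TLSConfigKey]*fakeClient{}}
	key := TLSConfigKey{URL: "https://shared-broker"}
	a := s.get(key) // e.g. a ServiceBroker in namespace "team-a"
	b := s.get(key) // e.g. a ServiceBroker in namespace "team-b"
	fmt.Println(a == b, len(s.clients)) // one client serves both brokers
}
```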

Contributor

I agree. The improvement should go in later, in a separate PR. We need to get this PR in fast, since it will solve a lot of problems for us.

"Error getting broker auth credentials for broker %q: %s",
broker.Name, err,
"The instance references a broker %q which has no OSB client created",
serviceClass.Spec.ClusterServiceBrokerName,
Contributor

When does this happen? Previously, we would create the client here, but now we expect it to always exist at this point. Would panicking be more appropriate, since we're not expecting the client to not exist here?

Contributor Author

A clusterservicebroker resource could be created at the same time as a provisioning request for one of its serviceclasses, and the provisioning request could be processed before the clusterservicebroker addition. From the controller's perspective (or this method's implementation) that can happen. Another scenario: deletion of a clusterservicebroker reaches the controller at the same time as a deprovisioning request. This problem could also occur before my PR - the client can be removed before processing. I'm not aware of every detail; if such an error occurs, the call will be retried with the proper backoff policy.

Contributor

Yes, I realized later that we definitely shouldn't panic.

@@ -107,8 +108,12 @@ func shouldReconcileClusterServiceBroker(broker *v1beta1.ClusterServiceBroker, n
func (c *controller) reconcileClusterServiceBrokerKey(key string) error {
broker, err := c.clusterServiceBrokerLister.Get(key)
pcb := pretty.NewContextBuilder(pretty.ClusterServiceBroker, "", key, "")

glog.V(4).Info(pcb.Message(fmt.Sprintf("Processing service broker %s", key)))
Contributor

Redundant info. This is how the log line looks:

...icebroker.go:112] ClusterServiceBroker "ups-broker": Processing service broker ups-broker

Contributor Author

fixed

@piotrmiskiewicz
Contributor Author

I'm just not completely sure about the fact that we now create the client only when reconciling the broker. When reconciling other resources, we now no longer create the client, but simply log an error.

It is what I described before: it is like an imagePullSecret for a Docker registry. Even if you update the secret, the cluster won't try to pull the image again.
You need to decide whether that is good enough.

On the other hand, storing credentials in the OSB client is maybe not the best solution. Providing credentials on every call would fix the problem and would allow better caching - several brokers with different credentials but one TLS config could share one OSB client.

@luksa
Contributor

luksa commented Sep 26, 2018

/lgtm

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm Indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Memory leak in Controller Manager when registered malfunctioning broker
4 participants