
Potential memory leak in the operator #1984

Closed · pebrc opened this issue Oct 11, 2019 · 17 comments · Fixed by #1989

Labels: >bug Something isn't working

@pebrc
Collaborator

pebrc commented Oct 11, 2019

After running a recent 1.0.0-beta1-b10 build for two days with metrics reporting turned on, and fetching those metrics via Metricbeat into Elasticsearch, it seems that memory consumption has roughly doubled even though the number of managed clusters has been more or less constant (I have created and destroyed a few test clusters in that time).

Worth noting regarding the severity of this issue: the process's memory after two days of uptime is still at 25Mi, which is half of what we have as requests and far below the limit of 150Mi.

Operator manifest used:
https://github.com/pebrc/cloud-on-k8s/blob/dd77e135df28a80c10d603fd272009928eee0618/config/samples/beats/all-in-one.yaml#L3015-L3054

Go runtime memory statistics (mstats.go, with comments): https://golang.org/src/runtime/mstats.go

Visualisation: Heap Alloc bytes, averaged over 10m intervals (screenshot)

Visualisation: Heap Objects, averaged over 10m intervals (screenshot)

Goroutines, averaged over 10m intervals, are constant 👍 (screenshot)

Number of reconciliations per controller: max in 10m intervals (screenshot; note this is a counter, so an increase over time is expected)

@pebrc pebrc added the >bug Something isn't working label Oct 11, 2019
@pebrc
Collaborator Author

pebrc commented Oct 13, 2019

I restarted the operator with the latest bc12 and pprof profiling turned on. Memory still seems to be increasing over time.

(screenshot: memory usage still increasing)

I took a heap profile right after restarting the operator and then again multiple times afterwards.
Comparing the first and the latest profile with

pprof -normalize -base=$HOME/pprof/pprof.elastic-operator.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz ~/pprof/pprof.elastic-operator.alloc_objects.alloc_space.inuse_objects.inuse_space.007.pb.gz

the one thing that sticks out is
(screenshot of the pprof diff output)
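
(For anyone wanting to reproduce this: one standard way to expose such heap profiles from a Go binary is the net/http/pprof package; the sketch below is plain standard-library usage, not necessarily the exact setup used in the operator.)

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Heap profiles can then be pulled with, for example:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	// (in Kubernetes, typically via `kubectl port-forward` to the operator pod).
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}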

@barkbay
Contributor

barkbay commented Oct 13, 2019

I did the same test and I also saw a steady increase of the RSS memory (I have not been able to configure Metricbeat to collect the Go metrics):

(screenshot: RSS memory increasing)

A memory dump gave me the same suspect:

(screenshot: heap profile)

I might be wrong, but there is something suspicious in the source code of the Go client.

Before I dive into the theory, here are some facts about the operator and the "work queue":

  • The “delaying queue” is never empty in ECK because we schedule reconciliations to ensure that the CA certificates are rotated. This means that there is always at least 1 element in the waiting queue of each of the controllers (apm, kibana, elasticsearch), with a delay set to at least 365 days. The requeue is set during every reconciliation loop, the first time to 365 days, and later according to this function:

func ShouldRotateIn(now time.Time, certExpiration time.Time, caCertRotateBefore time.Duration) time.Duration {
	// make sure we are past the safety margin when rotating, by making it a little bit shorter
	safetyMargin := caCertRotateBefore - 1*time.Second
	requeueTime := certExpiration.Add(-safetyMargin)
	requeueIn := requeueTime.Sub(now)
	if requeueIn < 0 {
		// requeue asap
		requeueIn = 0
	}
	return requeueIn
}

  • Another important thing is that the implementation of the “delaying queue” uses two timers:
  1. A ticker to wake up every 10 seconds:

https://github.com/kubernetes/client-go/blob/172b42569ccacc5c33a885060fc0b7eab863d81e/util/workqueue/delaying_queue.go#L169-L172

  2. time.After(d Duration) (used indirectly through After(..)) to schedule a delayed event at the right time.
    It is important to note the following comment in the godoc: “The underlying Timer is not recovered by the garbage collector until the timer fires.”

Knowing that here is the part that seems suspicious to me:

// waitingLoop runs until the workqueue is shutdown and keeps a check on the list of items to be added.
func (q *delayingType) waitingLoop() {
	/* .... */
	for {
		/* .... */
		if waitingForQueue.Len() > 0 {
			entry := waitingForQueue.Peek().(*waitFor)
			nextReadyAt = q.clock.After(entry.readyAt.Sub(now)) // <=== Here
		}

		select {
		case <-q.stopCh:
			return

		case <-q.heartbeat.C():
			// continue the loop, which will add ready items

		case <-nextReadyAt:
			// continue the loop, which will add ready items
		/* ... */

A loop iteration runs at least every 10 seconds because of the heartbeat (case <-q.heartbeat.C()); this also means that a new nextReadyAt (and an underlying timer) is created at least every 10 seconds.

Unless I'm missing something, this means that at least every 10 seconds we are creating timers that will only expire (and be removed from memory) in 365 days...
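
To make the failure mode concrete, here is a minimal, self-contained sketch (illustration only, not ECK or client-go code) showing that timers created via time.After stay reachable until they fire, even if their channel is never read:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Simulate many delayed requeues scheduled far in the future, similar to
	// what the delaying queue does with q.clock.After(...). None of these
	// channels is ever read, but each underlying timer stays alive until it fires.
	for i := 0; i < 100000; i++ {
		_ = time.After(365 * 24 * time.Hour)
	}

	var m runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&m)
	// The timers (and their channels) survive the GC above and will only be
	// collected once they fire, a year from now.
	fmt.Printf("live heap objects after GC: %d\n", m.HeapObjects)
}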

@pebrc
Collaborator Author

pebrc commented Oct 13, 2019

Excellent analysis @barkbay! I think this is spot on. One addition: we are also scheduling long-running requeues in the license controller, to re-issue new licenses after the expiry of the current ones:

func nextReconcileRelativeTo(now, expiry time.Time, safety time.Duration) reconcile.Result {
	// short-circuit to default if no expiry given
	if expiry.IsZero() {
		return reconcile.Result{
			RequeueAfter: minimumRetryInterval,
		}
	}
	requeueAfter := expiry.Add(-1 * (safety / 2)).Sub(now)
	if requeueAfter <= 0 {
		return reconcile.Result{Requeue: true}
	}
	return reconcile.Result{
		// requeue at expiry minus safetyMargin/2 to ensure we actually reissue a license on the next attempt
		RequeueAfter: requeueAfter,
	}
}

I am thinking we can work around this issue by managing long-running requeues in a dedicated scheduler structure (managed by ECK, not client-go's workqueue) that keeps a single timer per key (typically the namespaced name plus the controller that initiated the timer), stopping any existing timer before creating a new one. We could then inject an event into the correct workqueue just in time once a timer fires. See the rough sketch below.
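
A rough sketch of what such a structure could look like; all names here are hypothetical and this is not the actual implementation:

package scheduler

import (
	"sync"
	"time"
)

// Scheduler keeps at most one pending timer per key.
type Scheduler struct {
	mu     sync.Mutex
	timers map[string]*time.Timer
}

func New() *Scheduler {
	return &Scheduler{timers: map[string]*time.Timer{}}
}

// Schedule (re)arms the timer for the given key, stopping any previous one,
// so long-running requeues never accumulate. When the timer fires, fire()
// would typically enqueue an event into the relevant controller's workqueue.
func (s *Scheduler) Schedule(key string, in time.Duration, fire func()) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if prev, ok := s.timers[key]; ok {
		prev.Stop() // drop the previously scheduled requeue for this key
	}
	var t *time.Timer
	t = time.AfterFunc(in, func() {
		s.mu.Lock()
		if s.timers[key] == t {
			delete(s.timers, key)
		}
		s.mu.Unlock()
		fire() // e.g. enqueue an event into the controller's workqueue
	})
	s.timers[key] = t
}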

@pebrc
Collaborator Author

pebrc commented Oct 13, 2019

I have drafted the idea in #1985

@pebrc pebrc removed their assignment Oct 14, 2019
@pebrc
Collaborator Author

pebrc commented Oct 14, 2019

Unassigning myself as I am out for the next couple of days, and we are probably going with a simpler solution of not scheduling long-running requeues.

@sebgl
Contributor

sebgl commented Oct 14, 2019

Let's implement an easy workaround we discussed: schedule a requeue in 10 hours instead of 365 days. This should mitigate the memory leak problem (timers still accumulate but are garbage-collected after 10 hours).
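
For illustration, the cap could look roughly like this (names are hypothetical; the actual change in #1989 is applied at the level of the aggregated results, as described in the commit referenced below):

package reconciler

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// maxRequeueAfter caps any requeue delay so that the underlying client-go
// timer expires (and can be garbage-collected) within 10 hours.
const maxRequeueAfter = 10 * time.Hour

func capRequeueAfter(res reconcile.Result) reconcile.Result {
	if res.RequeueAfter > maxRequeueAfter {
		res.RequeueAfter = maxRequeueAfter
	}
	return res
}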

@sebgl sebgl self-assigned this Oct 14, 2019
@pebrc
Collaborator Author

pebrc commented Oct 14, 2019

Let's not forget the license controller, where something similar happens (license_expiry - safetyMargin), which can be years in the future.

sebgl added a commit to sebgl/cloud-on-k8s that referenced this issue Oct 14, 2019
sebgl added a commit to sebgl/cloud-on-k8s that referenced this issue Oct 14, 2019
sebgl added a commit that referenced this issue Oct 14, 2019
* Ensure we don't RequeueAfter for more than 10 hours

We have an issue where the underlying timer used by the client-go work
queue implementation stays in memory until it expires.
Since one gets created at every reconciliation attempt, we end up with a
big bunch of timers in memory that will expire in 365 days by default.

To mitigate the memory leak, let's wait for no more than 10 hours to
reconcile.

This is done at the level of the aggregated results, to decouple this
workaround from any business logic like certs expiration.

For more details, see
#1984.

* Use aggregated results in the license controller
@anyasabo
Contributor

I'm not sure either is necessary since this works for now, and when we land a fix upstream (or someone else does) we can revisit this. But there were two other options we discussed out of band that I thought might be worth noting here:

  1. The default resync period is already 10h, so all items in the informer cache get flushed to the event handler. We could even just not requeue at all (at least assuming longer-lived certificates) and instead rely on the resync. This doesn't feel great, but it is an option. Ref: code and issue

  2. I was curious how cert-manager handles this, since it seemed like they would have very similar problems. If I'm reading it right, cert-manager wraps the workqueue and forgets items if they're already present before re-adding them (which should clear out the old timer?). This is a fancier version of our old approach and might be worth considering. References:
    https://github.com/jetstack/cert-manager/blob/master/pkg/scheduler/scheduler.go#L65
    https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/controller.go#L122

@sebgl
Contributor

sebgl commented Oct 15, 2019

Let's keep this open until we correctly observe that the memory leak is gone (or at least significantly contained).

I've kept the operator running for a few hours and it looks like it is still leaking memory:

(screenshot: memory still increasing)

@sebgl sebgl reopened this Oct 15, 2019
@sebgl
Contributor

sebgl commented Oct 16, 2019

Memory usage seems to have stabilised over the last 24 hours:

(screenshot)

I'm not sure yet that the memory leak is fixed, though. Also, I realized I should probably generate some reconciliation events (e.g. patch the ES annotations with some randomness every X seconds), since the memory leak we observed was directly related to reconciliations happening.

@sebgl
Contributor

sebgl commented Oct 16, 2019

I set up a k8s Job that patches the Elasticsearch resource annotation every second, to trigger reconciliations:

apiVersion: batch/v1
kind: Job
metadata:
  name: generate-reconciliations
  namespace: elastic-system
spec:
  template:
    metadata:
      name: generate-reconciliations
    spec:
      serviceAccount: elastic-operator
      containers:
        - name: generate-reconciliations
          image: bitnami/kubectl:latest
          command:
          - bash
          args:
          - -c
          - while true; do kubectl patch elasticsearch monitor -n beats -p "{\"metadata\":{\"annotations\":{\"date\":\"`date +'%s'`\"}}}" --type merge; sleep 1; done
      restartPolicy: Never

And restarted the operator to use a recent nightly build that includes the memory leak fix.

I expect to see memory growing for the first 10 hours, then being approximately constant for the remaining timeframe.

@sebgl
Contributor

sebgl commented Oct 17, 2019

Finally observed what I wanted to! 🎉

(screenshot)

This is when triggering around 1 reconciliation per second with the above Job.
The memory usage of the container slowly grows for the first 10 hours, since we're creating one new timer per second. Then, after the first 10 hours, old timers start expiring while new ones keep being created, so memory usage remains stable (we always have approximately the same number of timers in memory, since we keep doing 1 reconciliation per second).
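
(As a rough sanity check, my arithmetic rather than a measurement: at about one reconciliation per second with requeues capped at 10 hours, on the order of 10 × 3600 ≈ 36,000 timers are pending at any given time once steady state is reached, which matches the curve flattening out after the first 10 hours.)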

In this scenario the additional stable memory usage seems to be in the 20-30MB range.
The workaround seems to properly contain the leak until we can depend on k8s v1.17, which includes kubernetes/kubernetes#83945.

@sebgl sebgl closed this as completed Oct 17, 2019
@ljdelight

Hey, I'm curious what tools you used to collect the performance data and view the graphs? I haven't done much performance monitoring in K8s but want to learn.

@sebgl
Contributor

sebgl commented Oct 18, 2019

Hey @ljdelight, we are running Metricbeat on Kubernetes, which pushes metrics into the configured Elasticsearch cluster, which we then visualize with Kibana.

@ljdelight

What's your process when you "collect a heap profile"? Like, do you have profiling code in the binary that writes the heap data on some interval, or do you pprof the binary (which needs specific compile flags)? I'm thinking of approaches, and I imagine it's tricky in K8s if there's no pprof in the container.

@anyasabo
Contributor

@ljdelight

Thank you so much for the replies @sebgl and @anyasabo, I really appreciate it.
