
Potential memory leak in the operator #1984

Closed · pebrc opened this issue Oct 11, 2019 · 17 comments · Fixed by #1989

Labels: >bug Something isn't working

@pebrc
Collaborator

pebrc commented Oct 11, 2019

After running a recent 1.0.0-beta1-b10 build for two days with metrics reporting turned on, and fetching those metrics via Metricbeat into Elasticsearch, it seems that memory consumption has roughly doubled even though the number of managed clusters has been more or less constant (I have created and destroyed a few test clusters in that time).

Worth noting regarding the severity of this issue: the process's memory after two days of uptime is still at 25Mi, which is half of what we have as requests and far below the limit of 150Mi.

Operator manifest used:
https://github.com/pebrc/cloud-on-k8s/blob/dd77e135df28a80c10d603fd272009928eee0618/config/samples/beats/all-in-one.yaml#L3015-L3054

Go runtime memory statistics (mstats.go, with comments): https://golang.org/src/runtime/mstats.go

Visualisation: Heap Alloc bytes, averaged over 10m intervals (screenshot)

Visualisation: Heap Objects, averaged over 10m intervals (screenshot)

Goroutines, averaged over 10m intervals, are constant 👍 (screenshot)

Number of reconciliations per controller: max in 10m intervals (screenshot; note this is a counter, so an increase over time is expected)

@pebrc pebrc added the >bug Something isn't working label Oct 11, 2019
@pebrc
Collaborator Author

pebrc commented Oct 13, 2019

I restarted the operator with the latest bc12 and pprof profiling turned on. Memory still seems to be increasing over time.

(screenshot: memory usage still increasing)

I took a heap profile right after restarting the operator and then again multiple times afterwards.
Comparing the first and the latest profile with

pprof -normalize -base=$HOME/pprof/pprof.elastic-operator.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz ~/pprof/pprof.elastic-operator.alloc_objects.alloc_space.inuse_objects.inuse_space.007.pb.gz

the one thing that sticks out is
(screenshot of the pprof diff output)
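
(For anyone wanting to reproduce this: one standard way to expose such heap profiles from a Go binary is the net/http/pprof package; the sketch below is plain standard-library usage, not necessarily the exact setup used in the operator.)

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	// Heap profiles can then be pulled with, for example:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	// (in Kubernetes, typically via `kubectl port-forward` to the operator pod).
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}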

@barkbay
Contributor

barkbay commented Oct 13, 2019

I did the same test and I also saw a steady increase of the RSS memory (I have not been able to configure Metricbeat to collect the Go metrics):

(screenshot: RSS memory increasing)

A memory dump gave me the same suspect:

(screenshot: heap profile)

I might be wrong, but there is something suspicious in the source code of the Go client.

Before I dive into the theory, here are some facts about the operator and the "work queue":

  • The “delaying queue” is never empty in ECK because we schedule reconciliations to ensure that the CA certificates are rotated. This means that there is always at least 1 element in the waiting queue of each of the controllers (apm, kibana, elasticsearch), with a delay set to at least 365 days. The requeue is set during every reconciliation loop, the first time to 365 days, and later according to this function:

func ShouldRotateIn(now time.Time, certExpiration time.Time, caCertRotateBefore time.Duration) time.Duration {
	// make sure we are past the safety margin when rotating, by making it a little bit shorter
	safetyMargin := caCertRotateBefore - 1*time.Second
	requeueTime := certExpiration.Add(-safetyMargin)
	requeueIn := requeueTime.Sub(now)
	if requeueIn < 0 {
		// requeue asap
		requeueIn = 0
	}
	return requeueIn
}

  • Another important thing is that the implementation of the “delaying queue” uses two timers:
  1. A ticker to wake up every 10 seconds:

https://github.com/kubernetes/client-go/blob/172b42569ccacc5c33a885060fc0b7eab863d81e/util/workqueue/delaying_queue.go#L169-L172

  2. time.After(d Duration) (used indirectly through After(..)) to schedule a delayed event at the right time.
    It is important to note the following comment in the godoc: “The underlying Timer is not recovered by the garbage collector until the timer fires.”

Knowing that here is the part that seems suspicious to me:

// waitingLoop runs until the workqueue is shutdown and keeps a check on the list of items to be added.
func (q *delayingType) waitingLoop() {
	/* .... */
	for {
		/* .... */
		if waitingForQueue.Len() > 0 {
			entry := waitingForQueue.Peek().(*waitFor)
			nextReadyAt = q.clock.After(entry.readyAt.Sub(now)) // <=== Here
		}

		select {
		case <-q.stopCh:
			return

		case <-q.heartbeat.C():
			// continue the loop, which will add ready items

		case <-nextReadyAt:
			// continue the loop, which will add ready items
		/* ... */

A loop iteration runs at least every 10 seconds because of the heartbeat (case <-q.heartbeat.C()); this also means that a new nextReadyAt (and an underlying timer) is created at least every 10 seconds.

Unless I'm missing something, this means that at least every 10 seconds we are creating timers that will only expire (and be removed from memory) in 365 days...
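
To make the failure mode concrete, here is a minimal, self-contained sketch (illustration only, not ECK or client-go code) showing that timers created via time.After stay reachable until they fire, even if their channel is never read:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Simulate many delayed requeues scheduled far in the future, similar to
	// what the delaying queue does with q.clock.After(...). None of these
	// channels is ever read, but each underlying timer stays alive until it fires.
	for i := 0; i < 100000; i++ {
		_ = time.After(365 * 24 * time.Hour)
	}

	var m runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&m)
	// The timers (and their channels) survive the GC above and will only be
	// collected once they fire, a year from now.
	fmt.Printf("live heap objects after GC: %d\n", m.HeapObjects)
}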

@pebrc
Collaborator Author

pebrc commented Oct 13, 2019

Excellent analysis @barkbay! I think this is spot on. One addition: we are also scheduling long-running requeues in the license controller, to re-issue new licenses after the expiry of the current ones:

func nextReconcileRelativeTo(now, expiry time.Time, safety time.Duration) reconcile.Result {
	// short-circuit to default if no expiry given
	if expiry.IsZero() {
		return reconcile.Result{
			RequeueAfter: minimumRetryInterval,
		}
	}
	requeueAfter := expiry.Add(-1 * (safety / 2)).Sub(now)
	if requeueAfter <= 0 {
		return reconcile.Result{Requeue: true}
	}
	return reconcile.Result{
		// requeue at expiry minus safetyMargin/2 to ensure we actually reissue a license on the next attempt
		RequeueAfter: requeueAfter,
	}
}

I am thinking we can work around this issue by managing long-running requeues in a dedicated scheduler structure (managed by ECK, not client-go's workqueue) that keeps a single timer per key (typically the namespaced name plus the controller that initiated the timer), stopping any existing timer before creating a new one. We could then inject an event into the correct workqueue just in time once a timer fires. See the rough sketch below.
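
A rough sketch of what such a structure could look like; all names here are hypothetical and this is not the actual implementation:

package scheduler

import (
	"sync"
	"time"
)

// Scheduler keeps at most one pending timer per key.
type Scheduler struct {
	mu     sync.Mutex
	timers map[string]*time.Timer
}

func New() *Scheduler {
	return &Scheduler{timers: map[string]*time.Timer{}}
}

// Schedule (re)arms the timer for the given key, stopping any previous one,
// so long-running requeues never accumulate. When the timer fires, fire()
// would typically enqueue an event into the relevant controller's workqueue.
func (s *Scheduler) Schedule(key string, in time.Duration, fire func()) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if prev, ok := s.timers[key]; ok {
		prev.Stop() // drop the previously scheduled requeue for this key
	}
	var t *time.Timer
	t = time.AfterFunc(in, func() {
		s.mu.Lock()
		if s.timers[key] == t {
			delete(s.timers, key)
		}
		s.mu.Unlock()
		fire() // e.g. enqueue an event into the controller's workqueue
	})
	s.timers[key] = t
}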

@pebrc
Collaborator Author

pebrc commented Oct 13, 2019

I have drafted the idea in #1985

@pebrc pebrc removed their assignment Oct 14, 2019
@pebrc
Collaborator Author

pebrc commented Oct 14, 2019

Unassigning myself as I am out for the next couple of days, and we are probably going with a simpler solution of not scheduling long-running requeues.

@sebgl
Contributor

sebgl commented Oct 14, 2019

Let's implement an easy workaround we discussed: schedule a requeue in 10 hours instead of 365 days. This should mitigate the memory leak problem (timers still accumulate but are garbage-collected after 10 hours).
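
For illustration, the cap could look roughly like this (names are hypothetical; the actual change in #1989 is applied at the level of the aggregated results, as described in the commit referenced below):

package reconciler

import (
	"time"

	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// maxRequeueAfter caps any requeue delay so that the underlying client-go
// timer expires (and can be garbage-collected) within 10 hours.
const maxRequeueAfter = 10 * time.Hour

func capRequeueAfter(res reconcile.Result) reconcile.Result {
	if res.RequeueAfter > maxRequeueAfter {
		res.RequeueAfter = maxRequeueAfter
	}
	return res
}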

@sebgl sebgl self-assigned this Oct 14, 2019
@pebrc
Collaborator Author

pebrc commented Oct 14, 2019

Let's not forget the license controller, where something similar happens (license_expiry - safetyMargin), which can be years in the future.

sebgl added a commit to sebgl/cloud-on-k8s that referenced this issue Oct 14, 2019
sebgl added a commit to sebgl/cloud-on-k8s that referenced this issue Oct 14, 2019
sebgl added a commit that referenced this issue Oct 14, 2019
* Ensure we don't RequeueAfter for more than 10 hours

We have an issue where the underlying timer used by the client-go work
queue implementation stays in memory until it expires.
Since one gets created at every reconciliation attempt, we end up with a
big bunch of timers in memory that will expire in 365 days by default.

To mitigate the memory leak, let's wait for no more than 10 hours to
reconcile.

This is done at the level of the aggregated results, to decouple this
workaround from any business logic like certs expiration.

For more details, see
#1984.

* Use aggregated results in the license controller
@anyasabo
Contributor

I'm not sure either is necessary since this works for now, and when we land a fix upstream (or someone else does) we can revisit this. But there were two other options we discussed out of band that I thought might be worth noting here:

  1. The default resync period is already 10h, so all items in the informer cache get flushed to the event handler. We could even just not requeue at all (at least assuming longer-lived certificates) and instead rely on the resync. This doesn't feel great, but it is an option. Ref: code and issue

  2. I was curious how cert-manager handles this, since it seemed like they would have very similar problems. If I'm reading it right, cert-manager wraps the workqueue and forgets items if they're already present before re-adding them (which should clear out the old timer?). This is a fancier version of our old approach and might be worth considering. References:
    https://github.com/jetstack/cert-manager/blob/master/pkg/scheduler/scheduler.go#L65
    https://github.com/jetstack/cert-manager/blob/master/pkg/controller/certificates/controller.go#L122

@sebgl
Contributor

sebgl commented Oct 15, 2019

Let's keep this open until we correctly observe that the memory leak is gone (or at least significantly contained).

I've kept the operator running for a few hours and it looks like it is still leaking memory:

(screenshot: memory still increasing)

@sebgl sebgl reopened this Oct 15, 2019
@sebgl
Contributor

sebgl commented Oct 16, 2019

Memory usage seems to have stabilised over the last 24 hours:

(screenshot)

I'm not sure yet that the memory leak is fixed, though. Also, I realized I should probably generate some reconciliation events (e.g. patch the ES annotations with some randomness every X seconds), since the memory leak we observed was directly related to reconciliations happening.

@sebgl
Contributor

sebgl commented Oct 16, 2019

I set up a k8s Job that patches the Elasticsearch resource annotation every second, to trigger reconciliations:

apiVersion: batch/v1
kind: Job
metadata:
  name: generate-reconciliations
  namespace: elastic-system
spec:
  template:
    metadata:
      name: generate-reconciliations
    spec:
      serviceAccount: elastic-operator
      containers:
        - name: generate-reconciliations
          image: bitnami/kubectl:latest
          command:
          - bash
          args:
          - -c
          - while true; do kubectl patch elasticsearch monitor -n beats -p "{\"metadata\":{\"annotations\":{\"date\":\"`date +'%s'`\"}}}" --type merge; sleep 1; done
      restartPolicy: Never

And restarted the operator to use a recent nightly build that includes the memory leak fix.

I expect to see memory growing for the first 10 hours, then being approximately constant for the remaining timeframe.

@sebgl
Contributor

sebgl commented Oct 17, 2019

Finally observed what I wanted to! 🎉

(screenshot)

This is when triggering around 1 reconciliation per second with the above Job.
The memory usage of the container slowly grows for the first 10 hours, since we're creating one new timer per second. Then, after the first 10 hours, old timers start expiring while new ones keep being created, so memory usage remains stable (we always have approximately the same number of timers in memory, since we keep doing 1 reconciliation per second).
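
(As a rough sanity check, my arithmetic rather than a measurement: at about one reconciliation per second with requeues capped at 10 hours, on the order of 10 × 3600 ≈ 36,000 timers are pending at any given time once steady state is reached, which matches the curve flattening out after the first 10 hours.)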

In this scenario the additional stable memory usage seems to be in the 20-30MB range.
The workaround seems to properly contain the leak until we can depend on k8s v1.17, which includes kubernetes/kubernetes#83945.

@sebgl sebgl closed this as completed Oct 17, 2019
@ljdelight

Hey, I'm curious what tools you used to collect the performance data and view the graphs? I haven't done much performance monitoring in K8s but want to learn.

@sebgl
Contributor

sebgl commented Oct 18, 2019

Hey @ljdelight, we are running Metricbeat on Kubernetes, which pushes metrics into the configured Elasticsearch cluster, which we then visualize with Kibana.

@ljdelight

What's your process when you "collect a heap profile"? Like, do you have profiling code in the binary that writes the heap data on some interval, or do you pprof the binary (which needs specific compile flags)? I'm thinking of approaches, and I imagine it's tricky in K8s if there's no pprof in the container.

@anyasabo
Contributor

@ljdelight

Thank you so much for the replies @sebgl and @anyasabo, I really appreciate it.
