Deadlock on operator update using helm #920

Closed
baluchicken opened this issue Jan 15, 2019 · 3 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@baluchicken

Bug Report

What did you do?

  • Create a Helm chart from the operator-generated YAML files.
  • Deploy the operator with Helm.
  • Change the values and upgrade the running operator with helm upgrade.

What did you expect to see?
The old pod releases the lock and the updated pod acquires it.

What did you see instead? Under which circumstances?
It results in a deadlock: the old pod keeps running while the new pod starts and cannot acquire the lock, which is still held by the old pod.

{"level":"info","ts":1547553963.9364433,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1547553964.150086,"logger":"leader","msg":"Found existing lock","LockOwner":"operator-777d97c68d-8j5xg"}
{"level":"info","ts":1547553964.2477176,"logger":"leader","msg":"Not the leader. Waiting."}

Environment

  • operator-sdk version: 0.3.0

  • Kubernetes version information: 1.11

  • Kubernetes cluster kind: Amazon EKS

  • Are you writing your operator in ansible, helm, or go?
    Go

@joelanford joelanford added the kind/bug Categorizes issue or PR as related to a bug. label Jan 15, 2019
@joelanford
Member

I was able to recreate this without using Helm just by modifying and re-applying deploy/operator.yaml (In my case, I just added an extra environment variable).

It looks like the issue is that the pod doesn't go to ready until it has become the leader. When the deployment update is rolling out, it waits until the new pods are ready before terminating the old ones. Since the old pod maintains the lock, the new pod can't become the leader and therefore never becomes ready -- hence the deadlock.

@mhrivnak Is there a reason the call to leader.Become() needs to be before the call to ready.NewFileReady()?
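
For context, here is a minimal sketch of the relevant part of the scaffolded cmd/manager/main.go (abbreviated; the lock name is a placeholder and error logging is elided), showing that the ready file is only created after leadership is acquired:

    package main

    import (
        "context"
        "os"

        "github.com/operator-framework/operator-sdk/pkg/leader"
        "github.com/operator-framework/operator-sdk/pkg/ready"
    )

    func main() {
        // ... flag parsing, logging, and manager setup elided ...

        // Blocks until this pod owns the leader lock. During a rolling update
        // the old pod still holds the lock, so the new pod never gets past here.
        if err := leader.Become(context.TODO(), "operator-lock"); err != nil {
            os.Exit(1)
        }

        // The ready file is created only after winning leadership, so the
        // readiness probe keeps failing and the Deployment never terminates
        // the old pod, which is the deadlock described above.
        r := ready.NewFileReady()
        if err := r.Set(); err != nil {
            os.Exit(1)
        }
        defer r.Unset()

        // ... create and start the manager / controllers ...
    }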

@joelanford joelanford self-assigned this Jan 15, 2019
@mhrivnak
Member

We only added the readiness check as a way to communicate that the pod has become the leader, primarily because it was convenient for automatically introspecting who is the leader in automated tests. But it sounds like that's causing this problem, so I would just remove the readiness probe entirely from operator.yaml and stop calling ready.NewFileReady().
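
As a sketch of that workaround (assuming the default scaffold; the lock name is a placeholder), main.go would keep only the leader election call, and the exec readinessProbe that checks for the ready file would be deleted from deploy/operator.yaml:

    package main

    import (
        "context"
        "os"

        "github.com/operator-framework/operator-sdk/pkg/leader"
    )

    func main() {
        // Leader election stays; the ready-file handling shown above and the
        // matching readinessProbe in deploy/operator.yaml are removed.
        if err := leader.Become(context.TODO(), "operator-lock"); err != nil {
            os.Exit(1)
        }

        // ... create and start the manager / controllers as before ...
    }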

@yharish991

Does the deadlock happen with a liveness probe as well? I tested it and it appears to work fine with a liveness probe, but I just wanted to confirm.
