Deadlock on operator update using helm #920

Closed
baluchicken opened this issue Jan 15, 2019 · 3 comments
Assignees
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

@baluchicken

Bug Report

What did you do?

  • Create a Helm chart from the operator-generated YAML files.
  • Deploy the operator with Helm.
  • Change the values and upgrade the running operator with helm upgrade.

What did you expect to see?
The old pod releases the lock and the updated pod acquires it.

What did you see instead? Under which circumstances?
It results in a deadlock: the old pod keeps running while the new pod starts and cannot acquire the lock, which is still held by the old pod.

{"level":"info","ts":1547553963.9364433,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1547553964.150086,"logger":"leader","msg":"Found existing lock","LockOwner":"operator-777d97c68d-8j5xg"}
{"level":"info","ts":1547553964.2477176,"logger":"leader","msg":"Not the leader. Waiting."}

Environment

  • operator-sdk version: 0.3.0

  • Kubernetes version information: 1.11

  • Kubernetes cluster kind: Amazon EKS

  • Are you writing your operator in ansible, helm, or go?
    Go

@joelanford joelanford added the kind/bug Categorizes issue or PR as related to a bug. label Jan 15, 2019
@joelanford
Member

I was able to recreate this without using Helm just by modifying and re-applying deploy/operator.yaml (In my case, I just added an extra environment variable).

It looks like the issue is that the pod doesn't go to ready until it has become the leader. When the deployment update is rolling out, it waits until the new pods are ready before terminating the old ones. Since the old pod maintains the lock, the new pod can't become the leader and therefore never becomes ready -- hence the deadlock.

@mhrivnak Is there a reason the call to leader.Become() needs to be before the call to ready.NewFileReady()?
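
For context, here is a minimal sketch of the relevant part of the scaffolded cmd/manager/main.go (abbreviated; the lock name is a placeholder and error logging is elided), showing that the ready file is only created after leadership is acquired:

    package main

    import (
        "context"
        "os"

        "github.com/operator-framework/operator-sdk/pkg/leader"
        "github.com/operator-framework/operator-sdk/pkg/ready"
    )

    func main() {
        // ... flag parsing, logging, and manager setup elided ...

        // Blocks until this pod owns the leader lock. During a rolling update
        // the old pod still holds the lock, so the new pod never gets past here.
        if err := leader.Become(context.TODO(), "operator-lock"); err != nil {
            os.Exit(1)
        }

        // The ready file is created only after winning leadership, so the
        // readiness probe keeps failing and the Deployment never terminates
        // the old pod, which is the deadlock described above.
        r := ready.NewFileReady()
        if err := r.Set(); err != nil {
            os.Exit(1)
        }
        defer r.Unset()

        // ... create and start the manager / controllers ...
    }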

@joelanford joelanford self-assigned this Jan 15, 2019
@mhrivnak
Member

We only added the readiness check as a way to communicate that the pod has become the leader, primarily because it was convenient for automatically introspecting who is the leader in automated tests. But it sounds like that's causing this problem, so I would just remove the readiness probe entirely from operator.yaml and stop calling ready.NewFileReady().
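
As a sketch of that workaround (assuming the default scaffold; the lock name is a placeholder), main.go would keep only the leader election call, and the exec readinessProbe that checks for the ready file would be deleted from deploy/operator.yaml:

    package main

    import (
        "context"
        "os"

        "github.com/operator-framework/operator-sdk/pkg/leader"
    )

    func main() {
        // Leader election stays; the ready-file handling shown above and the
        // matching readinessProbe in deploy/operator.yaml are removed.
        if err := leader.Become(context.TODO(), "operator-lock"); err != nil {
            os.Exit(1)
        }

        // ... create and start the manager / controllers as before ...
    }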

@yharish991

Does the deadlock happen with a liveness probe as well? I tested it and it appears to work fine with a liveness probe, but I just wanted to confirm.
