
Workflow controller v3 fails to start on OpenShift 3.11 #5638

Closed
tobisinghania opened this issue Apr 9, 2021 · 20 comments · Fixed by #5648
Comments

@tobisinghania
Contributor

Summary

When trying to upgrade to Argo Workflows 3, the workflow controller fails to start on OpenShift 3.11 because the underlying API used for leader election is not available.

A configuration option for disabling leader election might mitigate the situation.

Diagnostics

What Kubernetes provider are you using?

OpenShift 3.11, which runs on Kubernetes 1.11

What version of Argo Workflows are you running?

v3.0.1

The issue can be reproduced by running the default namespace install manifest:

E0409 12:01:04.539406       1 leaderelection.go:329] error initially creating leader election record: the server could not find the requested resource (post leases.coordination.k8s.io)
time="2021-04-09T12:01:09.671Z" level=info msg="Get leases 404"
time="2021-04-09T12:01:09.672Z" level=info msg="Create leases 404"

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@alexec
Contributor

alexec commented Apr 9, 2021

Can you please run kubectl api-resources|grep leases and attach the result?

@alexec
Contributor

alexec commented Apr 9, 2021

Can you also Google “OpenShift coordination API” or speak to your OpenShift contacts and ask for their advice or documentation of this feature?

Once you've confirmed that they do not support this API, then you'll have confirmed that leader election/HA is not supported on OpenShift with v3.

If that is the case, can you please submit a PR to support a LEADER_ELECTION_IDENTITY=off configuration option? Code changes will need to be made here:

https://github.com/argoproj/argo-workflows/blob/master/workflow/controller/controller.go#L225
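
For context, the leader election at that point is built on client-go's leaderelection package, which stores its lock as a Lease in the coordination.k8s.io/v1 API group; that is the API an OpenShift 3.11 / Kubernetes 1.11 cluster is missing. Below is a minimal, illustrative sketch of that mechanism, not the actual Argo controller code; the function name, lock name, and timings are assumptions:

```go
package example

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLeaderElection sketches how client-go leader election works: it
// creates and renews a Lease via the coordination.k8s.io/v1 API and only
// calls onStartedLeading on the replica that holds the lock. On clusters
// older than Kubernetes 1.14 the POST to leases.coordination.k8s.io returns
// 404, which is the error shown in the logs above.
func runWithLeaderElection(ctx context.Context, client kubernetes.Interface, namespace, identity string, onStartedLeading func(context.Context)) {
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "workflow-controller", Namespace: namespace},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: identity},
	}
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   5 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: onStartedLeading,
			OnStoppedLeading: func() {},
		},
	})
}
```

A LEADER_ELECTION_IDENTITY=off style option would essentially skip RunOrDie and call the start-leading callback directly.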

@tobisinghania
Contributor Author

The grep command returns nothing, and indeed it seems that the required Kubernetes coordination.k8s.io API was introduced in 1.14.
In the 1.14 changelog I found this: “The Lease API type in the coordination.k8s.io API group is promoted to v1.”

OpenShift 3.11 is using Kubernetes 1.11, so it makes sense this is not working.

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.14.md

Thanks for pointing out where this can be fixed. I'll try over the weekend, if I'm able to get things set up.

@alexec
Contributor

alexec commented Apr 9, 2021

Thank you for helping!

@alexec alexec added this to the v3.0 milestone Apr 9, 2021
tobisinghania pushed a commit to tobisinghania/argo-workflows that referenced this issue Apr 12, 2021
Leader election uses the `coordination.k8s.io` Kubernetes API, which is
only available in Kubernetes >= 1.14.
By setting the environment variable LEADER_ELECTION_IDENTITY=off,
leader election can be disabled for the workflow controller.

Signed-off-by: tobi <mail@singhania.at>
tobisinghania pushed a commit to tobisinghania/argo-workflows that referenced this issue Apr 12, 2021
…_ELECTION_DISABLE (argoproj#5638)

Signed-off-by: tobi <mail@singhania.at>
alexec pushed a commit that referenced this issue Apr 16, 2021
@alexec alexec linked a pull request Apr 16, 2021 that will close this issue
@simster7 simster7 mentioned this issue Apr 19, 2021
@vbarbaresi

vbarbaresi commented Apr 21, 2021

I'm trying this fix to run Argo Workflows v3 on a cluster running a Kubernetes version < 1.14.
I set the environment variable LEADER_ELECTION_DISABLE to true.

I found two issues:

  • It still tries to start leading and fails to get the leases. It seems this causes the controller to never reach a processing state:
time="2021-04-21T12:35:38.543Z" level=info msg="Manager initialized successfully"
I0421 12:35:38.543556       1 leaderelection.go:243] attempting to acquire leader lease  mynamespace/workflow-controller...
time="2021-04-21T12:35:38.545Z" level=info msg="Get leases 404"
time="2021-04-21T12:35:38.547Z" level=info msg="Create leases 404"
E0421 12:35:38.547373       1 leaderelection.go:329] error initially creating leader election record: the server could not find the requested resource (post leases.coordination.k8s.io)
time="2021-04-21T12:35:49.334Z" level=info msg="Get leases 404"
time="2021-04-21T12:35:49.336Z" level=info msg="Create leases 404"
E0421 12:35:49.336784       1 leaderelection.go:329] error initially creating leader election record: the server could not find the requested resource (post leases.coordination.k8s.io)

I think that when leader election is disabled, we shouldn't try to start leading at all (see the sketch after this list):

go wfc.startLeading(ctx, logCtx, podCleanupWorkers, workflowTTLWorkers, wfWorkers, podWorkers)

  • The liveness probe fails because the metrics server is not started,
    so the workflow-controller is killed every minute and ends up in a CrashLoopBackOff state.
    I think this comes from issue Workflow-controller non-leader replicas are unhealthy #5525.
    I saw that the fix wasn't cherry-picked into v3.0.2, which may explain it.
    It's probably a consequence of the first issue anyway.
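
A sketch of the guard suggested in the first bullet, assuming LEADER_ELECTION_DISABLE is the switch. It mirrors the quoted controller line, but it is illustrative rather than the change that was actually merged, and runLeaderElection is a hypothetical name for the existing election path:

```go
// Illustrative only: skip the election loop entirely when it is disabled.
// Starting processing directly also brings up the metrics server, so the
// liveness probe can pass (the second bullet above).
if os.Getenv("LEADER_ELECTION_DISABLE") == "true" {
	go wfc.startLeading(ctx, logCtx, podCleanupWorkers, workflowTTLWorkers, wfWorkers, podWorkers)
} else {
	go wfc.runLeaderElection(ctx, logCtx, podCleanupWorkers, workflowTTLWorkers, wfWorkers, podWorkers)
}
```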

@joyciep
Contributor

joyciep commented Apr 22, 2021

I also encountered the same issue as @vbarbaresi on OpenShift 3.11 (Argo v3.0.2).

time="2021-04-22T08:23:06.630Z" level=info msg="Get leases 404"
time="2021-04-22T08:23:06.631Z" level=info msg="Create leases 404"
E0422 08:23:06.631986       1 leaderelection.go:329] error initially creating leader election record: the server could not find the requested resource (post leases.coordination.k8s.io)

@alexec
Contributor

alexec commented Apr 22, 2021

@sarabala1979 didn't this get fixed?

@joyciep
Contributor

joyciep commented Apr 28, 2021

Any updates on this?

@alexec
Contributor

alexec commented Apr 28, 2021

I think you need to set LEADER_ELECTION_IDENTITY=off.

@joyciep
Contributor

joyciep commented Apr 28, 2021

> I think you need to set LEADER_ELECTION_IDENTITY=off.

I already set LEADER_ELECTION_IDENTITY to off and LEADER_ELECTION_DISABLE to true and I still encountered the same error.

@vbarbaresi

vbarbaresi commented Apr 28, 2021

I tested with the latest image, and setting LEADER_ELECTION_DISABLE=true worked for me this time.
I previously tested on v3.0.2, but LEADER_ELECTION_DISABLE wasn't released in that version.

@alexec alexec reopened this Apr 28, 2021
@alexec
Contributor

alexec commented Apr 28, 2021

We should backport the fix.

@alexec alexec assigned alexec and unassigned alexec Apr 28, 2021
@alexec
Contributor

alexec commented May 5, 2021

@sarabala1979 this did not appear to get back-ported to v3.0.2. Can you please make sure it makes it into v3.0.3? The commits are:

e3d1d1e
4c3b0ac

sarabala1979 pushed a commit that referenced this issue May 5, 2021
@lucastheisen
Contributor

I'm on OpenShift Container Platform 4.7, and the leases are there:

ltheisen@MM233009-PC:~/git/caasd-portal-config$ oc api-resources|grep leases
helmreleases                                                   apps.open-cluster-management.io              true         HelmRelease
leases                                                         coordination.k8s.io                          true         Lease
machinepoolnameleases                                          hive.openshift.io                            true         MachinePoolNameLease
clustersyncleases                     csl                      hiveinternal.openshift.io                    true         ClusterSyncLease
releases                              rls                      redis.databases.cloud.ibm.com                true         Release

However, we are using the namespaced install, and leases are not aggregated into the admin permission, which effectively makes it impossible to use namespaced installs without additional (undocumented) cluster-admin-level installation steps (a cluster admin also has to install the CRDs). I'm not sure if this is the best place to discuss this, but I did mention once before that there are issues with namespaced installs that do not appear to be documented.

Anyway, if leases are required, perhaps the documentation could be updated to describe what is actually required to get namespaced installs to work?

@alexec
Contributor

alexec commented May 11, 2021

@lucastheisen would you like to supply the updated manifests needed to run it in a PR?

@lucastheisen
Contributor

@alexec I would, as soon as I figure it out. We are stuck on the lease issue. Any chance you could explain a little about what leases add, why they are needed, and whether you understand the implications of allowing the namespace admin role to have the permissions required for creating the role/binding? Our research seems to indicate that granting the permissions shouldn't be harmful, but we might be missing some detail here...

@alexec
Contributor

alexec commented May 11, 2021

Leases are used for controller HA. In v3.0.3 you can disable them by setting LEADER_ELECTION_DISABLE=true.

@lucastheisen
Contributor

I was just about to say:

3.0.3 is not yet released, right?

Looks like you released it 29 minutes ago... I'll give that a go.

@lucastheisen
Contributor

Looks like events results in a similar permissions conundrum. I'm investigating, but is events also something that can be turned off? If I understand correctly it's a new feature of 3.0, but is it integral (required for the old functionality to continue to work) or standalone (only required for workfloweventbinding)? Can we leave off workfloweventbinding?

@alexec
Contributor

alexec commented May 11, 2021

events is a v2.x feature, and the provided manifests should already have the requisite RBAC.

@alexec alexec closed this as completed May 13, 2021