
bootstrap: More details for flux-system context deadline exceeded #4411

Closed
1 task done
gecube opened this issue Nov 17, 2023 · 9 comments · Fixed by #4422
Assignees
Labels
area/bootstrap Bootstrap related issues and pull requests enhancement New feature or request

Comments


gecube commented Nov 17, 2023

Describe the bug

Good day colleagues!

Please check the log below:

flux bootstrap gitlab --owner=gecube --repository=****** --path=mail-cluster --components-extra="image-reflector-controller,image-automation-controller" 
► connecting to https://gitlab.com
► cloning branch "main" from Git repository "https://gitlab.com/gecube/******.git"
✔ cloned repository
► generating component manifests
✔ generated component manifests
✔ committed sync manifests to "main" ("0bb95c27ed1f4b4c538d865877e647222d8c8af9")
► pushing component manifests to "https://gitlab.com/gecube/******.git"
► installing components in "flux-system" namespace
✔ installed components
✔ reconciled components
► determining if source secret "flux-system/flux-system" exists
► generating source secret
✔ public key: ecdsa-sha2-nistp384 AAAAE2VjZHNhLXNoYTItbmlzdHAzODQAAAAIbmlzdHAzODQAAABhBH0A5BfvqdYEujW3R2WfChwEtr9qJ2bsSHmUGSgpvELWmz3/rqFISfJCNQNXhR+SwfsshFjyBY6bLddgt9aWaukqhr/yuQMwE81ytcPRhkE9Vw014qRdmYzNEAZ0ymzlGw==
✔ configured deploy key "flux-system-main-flux-system-./mail-cluster" for "https://gitlab.com/gecube/*****"
► applying source secret "flux-system/flux-system"
✔ reconciled source secret
► generating sync manifests
✔ generated sync manifests
✔ committed sync manifests to "main" ("e315002da1f3a64793f9c3d84e4cf952b93734a5")
► pushing sync manifests to "https://gitlab.com/gecube/******.git"
► applying sync manifests
✔ reconciled sync configuration
◎ waiting for Kustomization "flux-system/flux-system" to be reconciled
✗ client rate limiter Wait returned an error: context deadline exceeded
► confirming components are healthy
✔ helm-controller: deployment ready
✔ image-automation-controller: deployment ready
✔ image-reflector-controller: deployment ready
✔ kustomize-controller: deployment ready
✔ notification-controller: deployment ready
✔ source-controller: deployment ready
✔ all components are healthy
✗ bootstrap failed with 1 health check failure(s)

It looks like I was hit by a rate limit.

Steps to reproduce

Prepare the empty cluster on single node with kubeadm + cilium.

Before installation of FluxCD:

kubectl get pods  -A
NAMESPACE     NAME                                               READY   STATUS    RESTARTS   AGE
kube-system   cilium-9kchm                                       1/1     Running   0          31m
kube-system   cilium-operator-5f75d7d7ff-g7blj                   1/1     Running   0          31m
kube-system   coredns-6b5b8ddb57-9p287                           1/1     Running   0          29m
kube-system   coredns-6b5b8ddb57-pgbxd                           1/1     Running   0          29m
kube-system   etcd-ubuntu-standard-4-8-60gb                      1/1     Running   2          32m
kube-system   kube-apiserver-ubuntu-standard-4-8-60gb            1/1     Running   1          32m
kube-system   kube-controller-manager-ubuntu-standard-4-8-60gb   1/1     Running   1          32m
kube-system   kube-scheduler-ubuntu-standard-4-8-60gb            1/1     Running   2          32m

After installation of FluxCD:

kubectl get pods -A
NAMESPACE     NAME                                               READY   STATUS    RESTARTS   AGE
flux-system   helm-controller-5f964c6579-pvkg2                   1/1     Running   0          16m
flux-system   image-automation-controller-7764f8957c-l8pbh       1/1     Running   0          16m
flux-system   image-reflector-controller-84c449dc57-xxn67        1/1     Running   0          16m
flux-system   kustomize-controller-9c588946c-729r5               1/1     Running   0          16m
flux-system   notification-controller-76dc5d768-vdmmn            1/1     Running   0          16m
flux-system   source-controller-6c49485888-cthlh                 1/1     Running   0          16m
kube-system   cilium-9kchm                                       1/1     Running   0          84m
kube-system   cilium-operator-5f75d7d7ff-g7blj                   1/1     Running   0          84m
kube-system   coredns-6b5b8ddb57-9p287                           1/1     Running   0          82m
kube-system   coredns-6b5b8ddb57-pgbxd                           1/1     Running   0          82m
kube-system   etcd-ubuntu-standard-4-8-60gb                      1/1     Running   2          85m
kube-system   kube-apiserver-ubuntu-standard-4-8-60gb            1/1     Running   1          85m
kube-system   kube-controller-manager-ubuntu-standard-4-8-60gb   1/1     Running   1          85m
kube-system   kube-scheduler-ubuntu-standard-4-8-60gb            1/1     Running   2          85m

Expected behavior

Add more timeouts and checks so the installation finishes on the first try and returns the proper successful exit code.

Screenshots and recordings

No response

OS / Distro

Ubuntu 22.04

Flux version

v2.1.2

Flux check

► checking prerequisites
✔ Kubernetes 1.28.2 >=1.25.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.36.2
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.36.1
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.30.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.1.1
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.1.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.1.2
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta2
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ imagepolicies.image.toolkit.fluxcd.io/v1beta2
✔ imagerepositories.image.toolkit.fluxcd.io/v1beta2
✔ imageupdateautomations.image.toolkit.fluxcd.io/v1beta1
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta2
✔ receivers.notification.toolkit.fluxcd.io/v1
✔ all checks passed

Git provider

GitLab SaaS

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

gecube commented Nov 17, 2023

I believe this could be a very difficult issue to reproduce.


makkes commented Nov 17, 2023

Flux waits 5 minutes by default for the bootstrapping to succeed, which is enough time in most cases. If you need more time, use the --timeout parameter.
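For illustration, the original bootstrap invocation with an extended wait might look like this (the 15m value is an arbitrary example; the repository name is omitted as in the report above):

```shell
# Same bootstrap command as in the report, with --timeout raised from the
# default of 5m0s. The timeout also covers the final health check that
# failed with "context deadline exceeded".
flux bootstrap gitlab \
  --owner=gecube \
  --repository=<repo> \
  --path=mail-cluster \
  --components-extra="image-reflector-controller,image-automation-controller" \
  --timeout=15m
```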


gecube commented Nov 17, 2023

Oh... silly me... it was a cluster misconfiguration.

flux get all -A
NAMESPACE  	NAME                     	REVISION	SUSPENDED	READY	MESSAGE                                                                                                                                                               
flux-system	gitrepository/flux-system	        	False    	False	failed to checkout and determine revision: unable to clone 'ssh://git@gitlab.com/gecube/****': dial tcp: lookup gitlab.com on 10.128.0.10:53: server misbehaving	

NAMESPACE  	NAME                     	REVISION	SUSPENDED	READY	MESSAGE                                    
flux-system	kustomization/flux-system	        	False    	False	Source artifact not found, retrying in 30s

I used an improper configuration for CoreDNS.

Was:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        cache 30
        loop
        reload
        loadbalance
    }

The proper one:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        cache 30
        loop
        reload
        loadbalance
        forward . 8.8.8.8   # <-- add this line
    }

Now everything is working. It is very interesting that the error message from flux bootstrap gives no hint about DNS issues.
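For reference, a quick way to catch this kind of misconfiguration before bootstrapping is to test DNS resolution from inside the cluster. A sketch, assuming the standard Kubernetes DNS-debugging image (the pod name is arbitrary):

```shell
# Run a throwaway pod and resolve gitlab.com through the cluster's CoreDNS.
# With the broken Corefile above (no `forward` plugin), external lookups fail.
kubectl run dns-test --rm -it --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  -- nslookup gitlab.com
```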

Finally everything is working:

flux get all -A
NAMESPACE  	NAME                     	REVISION          	SUSPENDED	READY	MESSAGE                                           
flux-system	gitrepository/flux-system	main@sha1:e315002d	False    	True 	stored artifact for revision 'main@sha1:e315002d'	

NAMESPACE  	NAME                     	REVISION          	SUSPENDED	READY	MESSAGE                              
flux-system	kustomization/flux-system	main@sha1:e315002d	False    	True 	Applied revision: main@sha1:e315002d	


gecube commented Nov 17, 2023

The second finding is that flux check also does not give any information about misbehaving CoreDNS in the cluster.

stefanprodan commented

flux check assumes your cluster networking is functional; we only check whether the controller pods are healthy. It's not up to Flux to diagnose CoreDNS or anything else outside the Flux pods.


gecube commented Nov 17, 2023

@stefanprodan Hi! I completely agree, but it would be nice to get some additional info from the FluxCD side to debug the issue :-) I understand that it was my fault to forget some config settings, but thinking logically: if there is no access to the main repo defining the flux-system installation itself, Flux is not working, no matter whether the CRDs are installed and all the pods are running.


stefanprodan commented Nov 17, 2023

Flux can be configured to sync from hundreds of Git repos, Helm repos, OCI registries, S3 buckets, etc. Why would we consider that Flux is not working just because some sync fails? It could be a transient network error or an outage of some external service. The Flux CLI gives you all the tools you need to diagnose such issues: flux get, flux events, flux logs. The check command is scoped to the pods being healthy.
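In this case the underlying DNS error was in fact already surfaced by those commands (see the flux get all output earlier in the thread). A sketch of the kind of diagnostic session meant here, assuming a recent Flux CLI:

```shell
# Show the status of all Flux resources; the GitRepository message carried
# the real error: "lookup gitlab.com ... server misbehaving".
flux get all -A

# Drill into events and error-level controller logs for more detail.
flux events --for GitRepository/flux-system -n flux-system
flux logs --level=error -n flux-system
```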

@gecube gecube changed the title flux bootstrap failed flux bootstrap failed (caused by DNS config issue) Nov 17, 2023

makkes commented Nov 17, 2023

Flux can't diagnose your Kubernetes cluster for misconfiguration; it assumes a functioning cluster. There are a lot of guides and tools out there for diagnosing cluster misconfigurations, e.g. the official Kubernetes documentation.

@makkes makkes closed this as completed Nov 17, 2023

gecube commented Nov 17, 2023

@makkes I completely agree and understand that there is no need to check for transient issues (I thought about them), and I share the opinion that FluxCD is not a diagnostic tool for the cluster. The issue is that, as a Flux user, I want meaningful error messages. Getting any hint from the bootstrap subcommand that the cluster config is wrong is currently impossible:

✔ reconciled sync configuration
◎ waiting for Kustomization "flux-system/flux-system" to be reconciled
✗ client rate limiter Wait returned an error: context deadline exceeded
...
✗ bootstrap failed with 1 health check failure(s)

If there were an error message like "dns resolving error" or "failed to checkout and determine revision", or something at the same verbosity level, I'd be very happy, rather than just a plain "context deadline exceeded". Thanks for the attention.

@stefanprodan stefanprodan changed the title flux bootstrap failed (caused by DNS config issue) bootstrap: More details for flux-system context deadline exceeded Nov 17, 2023
@stefanprodan stefanprodan reopened this Nov 17, 2023
@stefanprodan stefanprodan added enhancement New feature or request area/bootstrap Bootstrap related issues and pull requests labels Nov 17, 2023