
bootstrap: More details for flux-system context deadline exceeded #4411

Closed
1 task done
gecube opened this issue Nov 17, 2023 · 9 comments · Fixed by #4422
Assignees
Labels
area/bootstrap Bootstrap related issues and pull requests enhancement New feature or request

Comments


gecube commented Nov 17, 2023

Describe the bug

Good day colleagues!

Please check the log below:

flux bootstrap gitlab --owner=gecube --repository=****** --path=mail-cluster --components-extra="image-reflector-controller,image-automation-controller" 
► connecting to https://gitlab.com
► cloning branch "main" from Git repository "https://gitlab.com/gecube/******.git"
✔ cloned repository
► generating component manifests
✔ generated component manifests
✔ committed sync manifests to "main" ("0bb95c27ed1f4b4c538d865877e647222d8c8af9")
► pushing component manifests to "https://gitlab.com/gecube/******.git"
► installing components in "flux-system" namespace
✔ installed components
✔ reconciled components
► determining if source secret "flux-system/flux-system" exists
► generating source secret
✔ public key: ecdsa-sha2-nistp384 AAAAE2VjZHNhLXNoYTItbmlzdHAzODQAAAAIbmlzdHAzODQAAABhBH0A5BfvqdYEujW3R2WfChwEtr9qJ2bsSHmUGSgpvELWmz3/rqFISfJCNQNXhR+SwfsshFjyBY6bLddgt9aWaukqhr/yuQMwE81ytcPRhkE9Vw014qRdmYzNEAZ0ymzlGw==
✔ configured deploy key "flux-system-main-flux-system-./mail-cluster" for "https://gitlab.com/gecube/*****"
► applying source secret "flux-system/flux-system"
✔ reconciled source secret
► generating sync manifests
✔ generated sync manifests
✔ committed sync manifests to "main" ("e315002da1f3a64793f9c3d84e4cf952b93734a5")
► pushing sync manifests to "https://gitlab.com/gecube/******.git"
► applying sync manifests
✔ reconciled sync configuration
◎ waiting for Kustomization "flux-system/flux-system" to be reconciled
✗ client rate limiter Wait returned an error: context deadline exceeded
► confirming components are healthy
✔ helm-controller: deployment ready
✔ image-automation-controller: deployment ready
✔ image-reflector-controller: deployment ready
✔ kustomize-controller: deployment ready
✔ notification-controller: deployment ready
✔ source-controller: deployment ready
✔ all components are healthy
✗ bootstrap failed with 1 health check failure(s)

It looks like I was hit by a rate limit.

Steps to reproduce

Prepare the empty cluster on single node with kubeadm + cilium.

Before installation of FluxCD:

kubectl get pods  -A
NAMESPACE     NAME                                               READY   STATUS    RESTARTS   AGE
kube-system   cilium-9kchm                                       1/1     Running   0          31m
kube-system   cilium-operator-5f75d7d7ff-g7blj                   1/1     Running   0          31m
kube-system   coredns-6b5b8ddb57-9p287                           1/1     Running   0          29m
kube-system   coredns-6b5b8ddb57-pgbxd                           1/1     Running   0          29m
kube-system   etcd-ubuntu-standard-4-8-60gb                      1/1     Running   2          32m
kube-system   kube-apiserver-ubuntu-standard-4-8-60gb            1/1     Running   1          32m
kube-system   kube-controller-manager-ubuntu-standard-4-8-60gb   1/1     Running   1          32m
kube-system   kube-scheduler-ubuntu-standard-4-8-60gb            1/1     Running   2          32m

After installation of FluxCD:

kubectl get pods -A
NAMESPACE     NAME                                               READY   STATUS    RESTARTS   AGE
flux-system   helm-controller-5f964c6579-pvkg2                   1/1     Running   0          16m
flux-system   image-automation-controller-7764f8957c-l8pbh       1/1     Running   0          16m
flux-system   image-reflector-controller-84c449dc57-xxn67        1/1     Running   0          16m
flux-system   kustomize-controller-9c588946c-729r5               1/1     Running   0          16m
flux-system   notification-controller-76dc5d768-vdmmn            1/1     Running   0          16m
flux-system   source-controller-6c49485888-cthlh                 1/1     Running   0          16m
kube-system   cilium-9kchm                                       1/1     Running   0          84m
kube-system   cilium-operator-5f75d7d7ff-g7blj                   1/1     Running   0          84m
kube-system   coredns-6b5b8ddb57-9p287                           1/1     Running   0          82m
kube-system   coredns-6b5b8ddb57-pgbxd                           1/1     Running   0          82m
kube-system   etcd-ubuntu-standard-4-8-60gb                      1/1     Running   2          85m
kube-system   kube-apiserver-ubuntu-standard-4-8-60gb            1/1     Running   1          85m
kube-system   kube-controller-manager-ubuntu-standard-4-8-60gb   1/1     Running   1          85m
kube-system   kube-scheduler-ubuntu-standard-4-8-60gb            1/1     Running   2          85m

Expected behavior

Add more timeouts and checks so the installation finishes on the first try and returns the proper successful exit code.

Screenshots and recordings

No response

OS / Distro

Ubuntu 22.04

Flux version

v2.1.2

Flux check

► checking prerequisites
✔ Kubernetes 1.28.2 >=1.25.0-0
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.36.2
✔ image-automation-controller: deployment ready
► ghcr.io/fluxcd/image-automation-controller:v0.36.1
✔ image-reflector-controller: deployment ready
► ghcr.io/fluxcd/image-reflector-controller:v0.30.0
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.1.1
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.1.0
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.1.2
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta2
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta1
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ imagepolicies.image.toolkit.fluxcd.io/v1beta2
✔ imagerepositories.image.toolkit.fluxcd.io/v1beta2
✔ imageupdateautomations.image.toolkit.fluxcd.io/v1beta1
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta2
✔ receivers.notification.toolkit.fluxcd.io/v1
✔ all checks passed

Git provider

GitLab SaaS

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct

gecube commented Nov 17, 2023

I believe this could be a very difficult issue to reproduce.


makkes commented Nov 17, 2023

Flux waits 5 minutes by default for the bootstrapping to succeed, which is enough time in most cases. If you need more time, use the --timeout parameter.
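For illustration, the original bootstrap invocation with an extended wait might look like this (the 15m value is an arbitrary example; the repository name is omitted as in the report above):

```shell
# Same bootstrap command as in the report, with --timeout raised from the
# default of 5m0s. The timeout also covers the final health check that
# failed with "context deadline exceeded".
flux bootstrap gitlab \
  --owner=gecube \
  --repository=<repo> \
  --path=mail-cluster \
  --components-extra="image-reflector-controller,image-automation-controller" \
  --timeout=15m
```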


gecube commented Nov 17, 2023

Oh... silly me... it was a cluster misconfiguration.

flux get all -A
NAMESPACE  	NAME                     	REVISION	SUSPENDED	READY	MESSAGE                                                                                                                                                               
flux-system	gitrepository/flux-system	        	False    	False	failed to checkout and determine revision: unable to clone 'ssh://git@gitlab.com/gecube/****': dial tcp: lookup gitlab.com on 10.128.0.10:53: server misbehaving	

NAMESPACE  	NAME                     	REVISION	SUSPENDED	READY	MESSAGE                                    
flux-system	kustomization/flux-system	        	False    	False	Source artifact not found, retrying in 30s

I used an improper configuration for CoreDNS.

Was:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        cache 30
        loop
        reload
        loadbalance
    }

The proper one:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        cache 30
        loop
        reload
        loadbalance
        forward . 8.8.8.8   # <-- add this line
    }

Now everything is working. It is very interesting that the error message from flux bootstrap gives no hint about DNS issues.
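For reference, a quick way to catch this kind of misconfiguration before bootstrapping is to test DNS resolution from inside the cluster. A sketch, assuming the standard Kubernetes DNS-debugging image (the pod name is arbitrary):

```shell
# Run a throwaway pod and resolve gitlab.com through the cluster's CoreDNS.
# With the broken Corefile above (no `forward` plugin), external lookups fail.
kubectl run dns-test --rm -it --restart=Never \
  --image=registry.k8s.io/e2e-test-images/jessie-dnsutils:1.3 \
  -- nslookup gitlab.com
```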

Finally everything is working:

flux get all -A
NAMESPACE  	NAME                     	REVISION          	SUSPENDED	READY	MESSAGE                                           
flux-system	gitrepository/flux-system	main@sha1:e315002d	False    	True 	stored artifact for revision 'main@sha1:e315002d'	

NAMESPACE  	NAME                     	REVISION          	SUSPENDED	READY	MESSAGE                              
flux-system	kustomization/flux-system	main@sha1:e315002d	False    	True 	Applied revision: main@sha1:e315002d	


gecube commented Nov 17, 2023

The second finding is that flux check also does not give any information about misbehaving CoreDNS in the cluster.

stefanprodan commented

flux check assumes your cluster networking is functional; we only check whether the controller pods are healthy. It's not up to Flux to diagnose CoreDNS or anything else outside the Flux pods.


gecube commented Nov 17, 2023

@stefanprodan Hi! I completely agree, but it would be nice to get some additional info from the FluxCD side to debug the issue :-) I understand that it was my fault to forget some config settings, but thinking logically: if there is no access to the main repo defining the flux-system installation itself, Flux is not working, no matter whether the CRDs are installed and all the pods are running.


stefanprodan commented Nov 17, 2023

Flux can be configured to sync from hundreds of Git repos, Helm repos, OCI registries, S3 buckets, etc. Why would we consider that Flux is not working just because some sync fails? It could be a transient network error or an outage of some external service. The Flux CLI gives you all the tools you need to diagnose such issues: flux get, flux events, flux logs. The check command is scoped to the pods being healthy.
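In this case the underlying DNS error was in fact already surfaced by those commands (see the flux get all output earlier in the thread). A sketch of the kind of diagnostic session meant here, assuming a recent Flux CLI:

```shell
# Show the status of all Flux resources; the GitRepository message carried
# the real error: "lookup gitlab.com ... server misbehaving".
flux get all -A

# Drill into events and error-level controller logs for more detail.
flux events --for GitRepository/flux-system -n flux-system
flux logs --level=error -n flux-system
```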

@gecube gecube changed the title flux bootstrap failed flux bootstrap failed (caused by DNS config issue) Nov 17, 2023

makkes commented Nov 17, 2023

Flux can't diagnose your Kubernetes cluster for misconfiguration; it assumes a functioning cluster. There are a lot of guides and tools out there for diagnosing cluster misconfigurations, e.g. the official Kubernetes documentation.

@makkes makkes closed this as completed Nov 17, 2023

gecube commented Nov 17, 2023

@makkes I completely agree and understand that there is no need to check for transient issues (I thought about them), and I share the opinion that FluxCD is not a diagnostic tool for the cluster. The issue is that, as a Flux user, I want meaningful error messages. Getting any hint from the bootstrap subcommand that the cluster config is wrong is currently impossible:

✔ reconciled sync configuration
◎ waiting for Kustomization "flux-system/flux-system" to be reconciled
✗ client rate limiter Wait returned an error: context deadline exceeded
...
✗ bootstrap failed with 1 health check failure(s)

If there were an error message like "dns resolving error" or "failed to checkout and determine revision", or something at the same verbosity level, I'd be very happy, rather than just a plain "context deadline exceeded". Thanks for the attention.

@stefanprodan stefanprodan changed the title flux bootstrap failed (caused by DNS config issue) bootstrap: More details for flux-system context deadline exceeded Nov 17, 2023
@stefanprodan stefanprodan reopened this Nov 17, 2023
@stefanprodan stefanprodan added enhancement New feature or request area/bootstrap Bootstrap related issues and pull requests labels Nov 17, 2023