
Source controller is unstable #2072

Closed · 1 task done
ANKAMBLE opened this issue Nov 9, 2021 · 8 comments · Fixed by fluxcd/source-controller#626

Comments

@ANKAMBLE

ANKAMBLE commented Nov 9, 2021

Describe the bug

We are seeing the source-controller pod restart frequently (roughly every hour), and it takes 15-30 minutes to recover from the error below:
failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/ocp-fleet/2998cb88140643569d3264e00b51a0ac552d856f.tar.gz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/ocp-fleet/2998cb88140643569d3264e00b51a0ac552d856f.tar.gz": dial tcp 172.30.237.29:80: i/o timeout

Steps to reproduce

We have Flux multi-tenancy enabled on an OpenShift cluster with multiple namespaces. We created the following four repositories for maintaining infrastructure and application release files and Helm charts:

  1. ocp-cluster repo - contains flux system cluster definition
  2. ocp-fleet repo - contains infrastructure and tenants definitions
  3. ocp-tenant repo - contains specific namespace related definitions
  4. ocp-app repo - contains application specific helm charts

Expected behavior

Flux should be able to connect to the four repositories defined above and update the cluster/namespaces accordingly on each attempt. Instead, we randomly see failures with the error below:
failed to download artifact, error: GET http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/ocp-fleet/2998cb88140643569d3264e00b51a0ac552d856f.tar.gz giving up after 10 attempt(s): Get "http://source-controller.flux-system.svc.cluster.local./gitrepository/flux-system/ocp-fleet/2998cb88140643569d3264e00b51a0ac552d856f.tar.gz": dial tcp 172.30.237.29:80: i/o timeout

Screenshots and recordings

(screenshot attached)

OS / Distro

RHCOS, OpenShift 4.6

Flux version

v0.17.2

Flux check

(screenshot of the flux check output attached)

Git provider

No response

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@stefanprodan
Member

What's the reason for source-controller restarts? If it's OOM you should increase the memory limits.

@ANKAMBLE
Author

How can we check if it is failing due to OOM? I don't see OOM errors in the logs, but I do see the pod's memory utilization reaching around 98% at times.

@stefanprodan
Member

Please see https://sysdig.com/blog/troubleshoot-kubernetes-oom/. If the memory goes to 98%, then you need to increase the limit; setting it to 2Gi should fix it.
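
For reference, a minimal sketch of how one can confirm the restarts are OOM kills, assuming the default flux-system namespace and the standard app=source-controller label used by the Flux manifests:

```sh
# If the previous container termination reason is OOMKilled, the pod is
# being killed for exceeding its memory limit.
kubectl -n flux-system get pod -l app=source-controller \
  -o jsonpath='{.items[0].status.containerStatuses[0].lastState.terminated.reason}'

# The same information shows up under "Last State" in:
kubectl -n flux-system describe pod -l app=source-controller
```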

@chanwit
Contributor

chanwit commented Nov 11, 2021

@ANKAMBLE my question might not be directly related to your issue.

Did you use flux bootstrap or install Flux via the Operator Hub?

@ANKAMBLE
Author

We used flux bootstrap. The issue got resolved after increasing the memory limit.

@chanwit
Contributor

chanwit commented Nov 11, 2021

@ANKAMBLE glad to know your problem got solved.

@stefanprodan
Member

source-controller memory usage is governed by the number of Helm charts it needs to build. The default limit of 1Gi works fine for charts that are already bundled (originating from Helm repositories). For charts in Git, users have to increase the limit. I think we could add this to the FAQ page.
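
As an illustration, a hedged sketch of raising the limit through the flux-system kustomization.yaml that flux bootstrap generates. The 2Gi value comes from the suggestion above; the resources list and inline-patch syntax are assumptions based on a default bootstrap layout and a kustomize version that supports inline JSON 6902 patches:

```yaml
# flux-system/kustomization.yaml
# Sketch: bump the source-controller memory limit from the default 1Gi to 2Gi.
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - gotk-components.yaml
  - gotk-sync.yaml
patches:
  - target:
      kind: Deployment
      name: source-controller
    patch: |
      - op: replace
        path: /spec/template/spec/containers/0/resources/limits/memory
        value: 2Gi
```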

@gecube

gecube commented Dec 25, 2021

Is it possible to add a VPA object for governing the source-controller memory limit?
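
For context, if the Vertical Pod Autoscaler components were installed in the cluster, such an object might look like the sketch below. This is an illustration only, not something Flux ships:

```yaml
# Illustrative sketch: let the VPA manage source-controller resource requests/limits.
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: source-controller
  namespace: flux-system
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: source-controller
  updatePolicy:
    updateMode: "Auto"
```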
