Skip to content
This repository has been archived by the owner on Nov 1, 2022. It is now read-only.

Flux 1.22.2+ - kustomize build commands causing flux pod evictions due to disk space consumption in tmp directory #3500

Closed
mmcaya opened this issue Jul 10, 2021 · 10 comments · Fixed by #3509
Assignees
Labels
Milestone

Comments

@mmcaya
Copy link

mmcaya commented Jul 10, 2021

Describe the bug

After upgrading to flux 1.22.2, k8s clusters immediately saw a spike in flux pod disk consumption due to /tmp not being properly cleaned up after sync loops involving kustomize build commands in our .flux.yml.
Disk space was quickly consumed leading to flux pod evictions due to disk pressure

Initial investigation suggests the update from #3381 is errantly or prematurely cancelling the sync loop, leaving orphaned data in the /tmp directory typically cleared by kustomize directly when it completes execution.

I haven't traced through the entire code execution path, but the context used during the calls to kustomize build (or whatever is in the generators section of the .flux.yml) via execCommand (link below) already had a context timeout using the same sync timeout settings as the PR noted above, which now means the same context was wrapped with a timeout twice.

See: https://github.com/fluxcd/flux/blob/master/pkg/manifests/configfile.go#L492

Issue is not present when reverting to 1.22.1.

I've also locally tried using both kustomize 3.8.4 and 3.8.10 with flux 1.22.2 to eliminate that as the potential culprit (as it was also upgraded in flux 1.22.2), and so far have seen the issue with either kustomize version

To Reproduce

Steps to reproduce the behaviour:

  1. Provide Flux install instructions
    Current flux settings
 spec:
      containers:
      - args:
        - --log-format=fmt
        - --ssh-keygen-dir=/var/fluxd/keygen
        - --k8s-secret-name=XXX
        - --memcached-hostname=XXX
        - --sync-state=git
        - --sync-timeout=10m
        - --memcached-service=XXX
        - --git-url=git@github.com:XXX
        - --git-branch=release
        - --git-path=overlays/XXX
        - --git-readonly=false
        - --git-user=Weave Flux
        - --git-email=support@weave.works
        - --git-set-author=false
        - --git-poll-interval=5m
        - --git-timeout=20s
        - --sync-interval=5m
        - --git-ci-skip=false
        - --git-label=XXX-flux-sync
        - --manifest-generation=true
        - --registry-poll-interval=5m
        - --registry-rps=200
        - --registry-burst=125
        - --registry-trace=false
  1. Provide a GitHub repository with Kubernetes manifests

Sample manifests to reproduce issue:

.
├── bases
│   └── aws-load-balancer-controller
│       └── kustomization.yaml
└── overlays
    ├── aws-load-balancer-controller
    │   └── kustomization.yaml
    └── .flux.yaml

.flux.yaml

version: 1
patchUpdated:
  generators:
    - command: kustomize build aws-load-balancer-controller
  patchFile:  flux-patch.yaml

bases/aws-load-balancer-controller/kustomization.yaml

resources:
  - github.com/kubernetes-sigs/aws-load-balancer-controller/config/default/?ref=v2.2.1

overlays/aws-load-balancer-controller/kustomization.yaml

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../bases/aws-load-balancer-controller

Expected behavior

Any generator commands should not prematurely exit from context timeout tied to cancel calls of parent functions, and should rely on the context timeout and error handling already present in execCommand
As of now, we are pinned to version 1.22.1 until this issue gets patched, or we are able to start our migration to flux v2.

Logs

Sample tmp dir output from a live 1.22.2 instance

Right after startup

/home/flux # date
Fri Jul  9 19:57:07 UTC 2021
/home/flux # du -hs /tmp
72.2M   /tmp

After initial sync (~4 minutes later), already containing orphaned data

/home/flux # date
Fri Jul  9 20:01:01 UTC 2021

/home/flux # du -hs /tmp/*
532.0K  /tmp/flux-gitclone621756422
1.3M    /tmp/flux-working107927848
8.0K    /tmp/getter106386092
155.4M  /tmp/getter405598756
12.0K   /tmp/kustomize-937816237
12.0K   /tmp/kustomize-979226010

/home/flux # du -hs /tmp
157.2M  /tmp

30 Minutes later with orphaned data growth

/home/flux # date
Fri Jul  9 20:32:48 UTC 2021

/home/flux # du -hs /tmp/*
544.0K  /tmp/flux-gitclone621756422
1.3M    /tmp/flux-working082270969
8.0K    /tmp/getter013765201
8.0K    /tmp/getter106386092
155.4M  /tmp/getter198844007
8.0K    /tmp/getter246770219
155.4M  /tmp/getter329558908
8.0K    /tmp/getter388119449
155.4M  /tmp/getter405598756
8.0K    /tmp/getter494579190
8.0K    /tmp/getter588493785
8.0K    /tmp/getter616006697
100.3M  /tmp/getter679815951
8.0K    /tmp/getter932683696
480.0K  /tmp/kustomize-770103074
12.0K   /tmp/kustomize-937816237
12.0K   /tmp/kustomize-979226010

/home/flux # du -hs /tmp
583.5M  /tmp

For comparison, here is data from flux 1.22.1 working as expected on the same git repo with no data being orphaned

Initial Startup

/home/flux # date
Fri Jul  9 21:12:56 UTC 2021
/home/flux # du -hs /tmp
1.8M    /tmp
/home/flux # du -hs /tmp/*
532.0K  /tmp/flux-gitclone740260762
1.3M    /tmp/flux-working391222833

~30 minutes later

/home/flux # date
Fri Jul  9 21:45:35 UTC 2021
/home/flux # du -hs /tmp
1.8M    /tmp
/home/flux # du -hs /tmp/*
544.0K  /tmp/flux-gitclone740260762
1.3M    /tmp/flux-working323903471

~50 minutes later

/home/flux # date
Fri Jul  9 22:02:51 UTC 2021
/home/flux # du -hs /tmp/*
544.0K  /tmp/flux-gitclone740260762
1.3M    /tmp/flux-working208442515
/home/flux # du -hs /tmp
1.8M    /tmp

Additional context

  • Flux version: 1.22.2+
  • Kubernetes version: 1.18,1.19
@mmcaya mmcaya added blocked-needs-validation Issue is waiting to be validated before we can proceed bug labels Jul 10, 2021
@kingdonb
Copy link
Member

Thank you for the report. Very sorry for the inconvenience.

The details that you've provided are helpful and really make a difference, the bisect and the other investigation you've done. 🙇 I have created a prerelease build with the PR reverted that is in question, can you please confirm this resolves the issue?

The image is at docker.io/kingdonb/flux:revert-pr-3381-1be3ff15 or you can build it from the latest commit on the revert-pr-3381 branch.

I will take a look later and try to understand the details, but for now my main concern is to determine if this report is correct, then revert the bad PR and get a new release candidate ready for 1.23.1. We should be able to publish it early this week if the report can be confirmed.

@mmcaya
Copy link
Author

mmcaya commented Jul 12, 2021

Thanks for the quick response.

I've also locally tried using both kustomize 3.8.4 and 3.8.10 with flux 1.22.2 to eliminate that as the potential culprit (as it was also upgraded in flux 1.22.2), and so far have seen the issue with either kustomize version

Following up on this note from the original description ^^^

I've continued testing both the original branch and new patch on different kustomize versions as reproducing the issue consistently in a local setup was still difficult, and confirmed kustomize v3.8.8+ is causing problems for larger checkouts. They added a hard coded timeout of 27s to all git calls used in remote loading, which very much seems related to the issue I reported.

See:

The original patch/PR I referenced may be a red herring here, at least for the specific issue we are encountering. I'll comment with additional details as I have time to test.

@kingdonb
Copy link
Member

kingdonb commented Jul 13, 2021

Thank you for following up with more details. I've linked a few issues to this one for visibility. If I read your last note correctly, the issue would be solved by a revert of the Kustomize version included here, to the release: kustomize v3.8.7

That's very convenient as we have at least one user who initially requested at least v3.8.5, in #3457 (which has been renamed since it was opened after much debate.)

kingdonb pushed a commit that referenced this issue Jul 14, 2021
According to #3500, there are breaking changes in the patch release of
Kustomize v3.8.8 which were included in the Flux 1.23.0 release that
upgraded from Kustomize v3.8.4 to v3.8.10.

A PATCH or MINOR release in semver is meant to signal that compatibility
is maintained, but it appears the semver wasn't totally honest and a
behavior was changed in a destructive way for some users.

Other Flux users were looking (#3457) for functionality that was added
in v3.8.5, (so I guess that could have been our first clue semver isn't
strictly honored) rather than revert the update entirely, we'd like to
at least upgrade that far.

Signed-off-by: Kingdon Barrett <kingdon@weave.works>
@kingdonb
Copy link
Member

There is docker.io/kingdonb/flux:revert-kustomize-3-8-10-b30133a2 now which implements #3504, if you can please test and let me know if this resolves your issue.

@mmcaya
Copy link
Author

mmcaya commented Jul 14, 2021

There is docker.io/kingdonb/flux:revert-kustomize-3-8-10-b30133a2 now which implements #3504, if you can please test and let me know if this resolves your issue.

Thanks, I'll have this on my list for tomorrow (7/15) to review and provide confirmation of resolution to the issue.

@mmcaya
Copy link
Author

mmcaya commented Jul 16, 2021

There is docker.io/kingdonb/flux:revert-kustomize-3-8-10-b30133a2 now which implements #3504, if you can please test and let me know if this resolves your issue.

Built and deployed the above patch, which has been running for 8+ hours without any /tmp dir growth issues. kustomize 3.8.7 seems to be operating as expected.

@kingdonb kingdonb removed the blocked-needs-validation Issue is waiting to be validated before we can proceed label Jul 16, 2021
@kingdonb
Copy link
Member

Thank you for the confirmation! I am currently wrestling with e2e tests and will try to get at least a release candidate out before the weekend. Should expect to see Flux 1.23.1 out some time next week with this change included.

@aleclerc-sonrai
Copy link

We hit some backwards compatibilty issues going from 1.22.0 -> 1.23.0 which Seems to be kustomize related. Reverting back for now but anxious for 1.23.1 release

kingdonb pushed a commit that referenced this issue Jul 21, 2021
According to #3500, there are breaking changes in the patch release of
Kustomize v3.8.8 which were included in the Flux 1.23.0 release that
upgraded from Kustomize v3.8.4 to v3.8.10.

A PATCH or MINOR release in semver is meant to signal that compatibility
is maintained, but it appears the semver wasn't totally honest and a
behavior was changed in a destructive way for some users.

Other Flux users were looking (#3457) for functionality that was added
in v3.8.5, (so I guess that could have been our first clue semver isn't
strictly honored) rather than revert the update entirely, we'd like to
at least upgrade that far.

Signed-off-by: Kingdon Barrett <kingdon@weave.works>
kingdonb pushed a commit to spron-in/flux that referenced this issue Jul 21, 2021
According to fluxcd#3500, there are breaking changes in the patch release of
Kustomize v3.8.8 which were included in the Flux 1.23.0 release that
upgraded from Kustomize v3.8.4 to v3.8.10.

A PATCH or MINOR release in semver is meant to signal that compatibility
is maintained, but it appears the semver wasn't totally honest and a
behavior was changed in a destructive way for some users.

Other Flux users were looking (fluxcd#3457) for functionality that was added
in v3.8.5, (so I guess that could have been our first clue semver isn't
strictly honored) rather than revert the update entirely, we'd like to
at least upgrade that far.

Signed-off-by: Kingdon Barrett <kingdon@weave.works>
kingdonb pushed a commit that referenced this issue Jul 22, 2021
According to #3500, there are breaking changes in the patch release of
Kustomize v3.8.8 which were included in the Flux 1.23.0 release that
upgraded from Kustomize v3.8.4 to v3.8.10.

A PATCH or MINOR release in semver is meant to signal that compatibility
is maintained, but it appears the semver wasn't totally honest and a
behavior was changed in a destructive way for some users.

Other Flux users were looking (#3457) for functionality that was added
in v3.8.5, (so I guess that could have been our first clue semver isn't
strictly honored) rather than revert the update entirely, we'd like to
at least upgrade that far.

Signed-off-by: Kingdon Barrett <kingdon@weave.works>
@kingdonb
Copy link
Member

@aleclerc-sonrai @mmcaya The PR which is expected to resolve this has just been merged to master

fluxcd/flux-prerelease:master-f9fe2abb will be ready in a few seconds, which shall resolve this issue. I expect the 1.23.1 release to likely be out tomorrow, but since there are no chart changes you can upgrade to this image in advance, if you want to resolve this faster by using a prerelease.

We've already had one confirmation given that this PR evidently fixes the issue, from @mmcaya (thank you!)

@aleclerc-sonrai if you can try out the prerelease and let us know, in case there was a different issue in your case.

Thanks everyone for your patience.

@kingdonb kingdonb linked a pull request Jul 22, 2021 that will close this issue
@kingdonb
Copy link
Member

The helm chart has been updated in #3510

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants