
Error on git-sync running as a CronJob #439

Closed
michal-jagiello-tmpl opened this issue Aug 9, 2021 · 17 comments

Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@michal-jagiello-tmpl

Hi,
I'm using git-sync v3.3.4 in a CronJob as an initContainer. Here is the definition:

- name: clone-results-repo
  image: "{{ .Values.init.cloneRepo.image.repository }}:{{ .Values.init.cloneRepo.image.tag }}"
  imagePullPolicy: {{ .Values.init.cloneRepo.image.pullPolicy }}
  volumeMounts:
    - name: persistent-storage
      mountPath: /git
  env:
  - name: GIT_SYNC_REPO
    value: {{ .Values.git.url }}
  - name: GIT_SYNC_ONE_TIME
    value: "true"
  - name: GIT_SYNC_BRANCH
    value: my_awesome_branch
  - name: GIT_SYNC_DEPTH
    value: "1"
  - name: GIT_SYNC_USERNAME
    valueFrom:
      secretKeyRef:
        name: {{ include "my-awesome-app.fullname" . }}-git-credentials
        key: GIT_PULL_USERNAME
  - name: GIT_SYNC_PASSWORD
    valueFrom:
      secretKeyRef:
        name: {{ include "my-awesome-app.fullname" . }}-git-credentials
        key: GIT_PULL_PASSWORD
  - name: GIT_SYNC_ROOT
    value: /git
  - name: GIT_SYNC_TIMEOUT
    value: "99999"

There is another container which also mounts the persistent-storage volume and consumes the data from the cloned repo. The issue is that after a few successful executions I always get the same error:

INFO: detected pid 1, running init handler
I0809 06:29:23.648815      11 main.go:507] "level"=0 "msg"="starting up" "pid"=11 "args"=["/git-sync"]
I0809 06:29:24.077876      11 main.go:1003] "level"=0 "msg"="update required" "rev"="HEAD" "local"="1172cc4eeed3a3dd6d5e8fb65f3c15134adf9f32" "remote"="bfa07ea5354c25fa7e267dbcb6bbb305f2bd315f"
I0809 06:29:24.077969      11 main.go:690] "level"=0 "msg"="syncing git" "rev"="HEAD" "hash"="bfa07ea5354c25fa7e267dbcb6bbb305f2bd315f"
E0809 06:29:24.702409      11 main.go:172] "msg"="too many failures, aborting" "error"="Run(git gc --prune=all): exit status 128: { stdout: "", stderr: "fatal: gc is already running on machine 'my-awesome-app-1628460000-pr7v8' pid 49 (use --force if not)\n" }" "failCount"=0

and the pod my-awesome-app-1628460000-pr7v8 does not exist anymore.
The repository has about 7 GB of data (if it matters).

@thockin
Member

thockin commented Aug 9, 2021 via email

@michal-jagiello-tmpl
Author

@thockin no, unfortunately not :| Is it that git gc runs asynchronously? Can that process somehow be interrupted?

@thockin
Member

thockin commented Aug 10, 2021 via email

@thockin
Member

thockin commented Aug 10, 2021

I mean, I could just add a --force and cross my fingers. Or I could add a --skip-gc flag for cases that are not going to be long-lived. But that seems likely to end in repo bloat. If you never ever GC it because you never have time, then what?

@thockin
Member

thockin commented Aug 10, 2021

You could try --git-config=gc.autoDetach:false ? That could become the default if it works...

Or we could catch this specific case ("already running") and not treat it as fatal.

Without a repro, it's scary.
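For reference, the flag form is just an extra entry in the git-sync container args (a sketch; everything else stays as in the configs in this thread):

  args:
    - --git-config=gc.autoDetach:false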

@michal-jagiello-tmpl
Author

I've run that cron with

env:
....
  - name: GIT_SYNC_GIT_CONFIG
    value: "gc.autoDetach:false"

but the docs say:

gc.autoDetach
    Make git gc --auto return immediately and run in background if the system supports it. Default is true.

I see that you call git gc --prune=all here.
Maybe the solution could be to add a --disable-git-gc flag, for users who are absolutely sure what they are doing, and I could take care of GC myself. I'd run git gc once a day instead of every few hours, as sketched below?
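A rough sketch of that idea as a separate daily CronJob against the same volume (not a git-sync feature; all names below are placeholders, and it assumes the repository's .git lives directly under /git as in the config above):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: results-repo-gc            # placeholder
spec:
  schedule: "0 3 * * *"            # once a day, off-peak
  concurrencyPolicy: Forbid        # never start a second GC while one is still running
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: git-gc
            image: alpine/git      # any image with a git binary
            command: ["git", "-C", "/git", "gc"]
            volumeMounts:
            - name: persistent-storage
              mountPath: /git
          volumes:
          - name: persistent-storage
            persistentVolumeClaim:
              claimName: results-repo-pvc   # placeholder PVC name

You would still have to make sure it doesn't overlap with a running sync.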

@thockin
Member

thockin commented Aug 11, 2021 via email

@christophercutajar

Over the past few days we started to experience the same issue!

Logs:

+ playbook-slackbot-deployment-69756b6699-sc5c4 › playbook-sync
playbook-slackbot-deployment-69756b6699-sc5c4 playbook-sync 2021-10-19T13:38:03.158022829+02:00 INFO: detected pid 1, running init handler
playbook-slackbot-deployment-69756b6699-sc5c4 playbook-sync 2021-10-19T13:38:03.197820105+02:00 I1019 11:38:03.197508      11 main.go:507] "level"=0 "msg"="starting up" "pid"=11 "args"=["/git-sync"]
playbook-slackbot-deployment-69756b6699-sc5c4 playbook-sync 2021-10-19T13:38:03.197887375+02:00 I1019 11:38:03.197689      11 main.go:860] "level"=0 "msg"="cloning repo" "origin"="https://github.com/<org>/<repo>.git" "path"="/git"
playbook-slackbot-deployment-69756b6699-sc5c4 playbook-sync 2021-10-19T13:39:47.561229307+02:00 I1019 11:39:47.560916      11 main.go:690] "level"=0 "msg"="syncing git" "rev"="HEAD" "hash"="dd21cb48350c2354a4b36ad535173ff962e75fad"
playbook-slackbot-deployment-69756b6699-sc5c4 playbook-sync 2021-10-19T13:42:26.640201065+02:00 E1019 11:42:26.639933      11 main.go:172] "msg"="too many failures, aborting" "error"="Run(git gc --prune=all): context deadline exceeded: { stdout: "", stderr: "" }" "failCount"=0
- playbook-slackbot-deployment-69756b6699-sc5c4 › playbook-sync

Config:

- name: playbook-sync
  image: k8s.gcr.io/git-sync/git-sync:v3.3.4
  env:
  - name: GIT_SYNC_USERNAME
    value: "user"
  - name: GIT_SYNC_PASSWORD
    valueFrom:
      secretKeyRef:
        name: playbooks-bot-tokens
        key: github
  - name: GIT_SYNC_ROOT
    value: "/git"
  - name: GIT_SYNC_REPO
    value: "https://github.com/<org>/<repo>.git"
  - name: GIT_SYNC_BRANCH
    value: "master"
  volumeMounts:
  - name: playbooks-shared-data
    mountPath: /git

@christophercutajar

Increasing GIT_SYNC_TIMEOUT to 300 from the default of 120 seems to have helped resolve the issue.
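For reference, that's just the timeout env var on the git-sync container:

  - name: GIT_SYNC_TIMEOUT
    value: "300"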

@thockin
Member

thockin commented Oct 19, 2021 via email

@jdavidheiser

I also just hit this - it's definitely a case where the garbage collection was interrupted while running on persistent storage. You can repro by kicking off a GC and then killing the pod before the GC finishes. Manually running git gc --force on the pod in between crashes was enough to get the repo back into a state where it stopped crashing.
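A sketch of that manual recovery step (pod and container names are placeholders; /git is the GIT_SYNC_ROOT used in the configs above):

  kubectl exec <pod-name> -c <git-sync-container> -- git -C /git gc --force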

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 18, 2022
@thockin thockin removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2022
@thockin
Member

thockin commented Feb 19, 2022

So there seem to be a few issues with GC:

  1. GC can take too long and timeout
  2. GC can auto-detach and be running in the background on the next sync
  3. GC can auto-detach and be terminated in an init container (leaving some stale metadata)

At least for #3 I was able to force a repro, and git seems smart enough to realize that the remembered PID is dead, so that's not an issue.

To fix #2 we can set autoDetach to false. That converts #2 into #1.

We probably want to use --auto on the "every sync" GC and only run more aggressive GC periodically, async to the main loop. That means we probably need some flag to control GC strategy. I'll have to think more on how to do this.
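Roughly, the distinction being described (a sketch, not what git-sync currently runs):

  # per-sync, cheap: a no-op unless git's own thresholds say work is needed
  git -C /git gc --auto

  # periodic, aggressive: full repack and prune, run asynchronously to the sync loop
  git -C /git gc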

@thockin
Member

thockin commented Feb 19, 2022

Also, we should probably set pruneExpire to something other than "all" or "now" (e.g. 1.hour.ago).
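Concretely (a sketch), that would change the GC call from what the error logs above show to something with a grace period:

  git gc --prune=all          # today, per the logs above
  git gc --prune=1.hour.ago   # keep loose objects newer than an hour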


@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 20, 2022
@thockin thockin removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 23, 2022

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 21, 2022
@thockin thockin added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 21, 2022
@thockin
Member

thockin commented Jun 13, 2023

So, the v4 branch has a LOT of changes around GC. I can't figure out how to force this situation to happen now. I'm going to close this and if someone can make it happen again (once I cut a v4, that is) then we can re-examine.

@thockin thockin closed this as completed Jun 13, 2023