
Error on git-sync running as a CronJob #439

Closed
michal-jagiello-tmpl opened this issue Aug 9, 2021 · 17 comments

Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@michal-jagiello-tmpl

Hi,
I'm using git-sync v3.3.4 in a CronJob as an initContainer. Here is the definition:

- name: clone-results-repo
  image: "{{ .Values.init.cloneRepo.image.repository }}:{{ .Values.init.cloneRepo.image.tag }}"
  imagePullPolicy: {{ .Values.init.cloneRepo.image.pullPolicy }}
  volumeMounts:
    - name: persistent-storage
      mountPath: /git
  env:
  - name: GIT_SYNC_REPO
    value: {{ .Values.git.url }}
  - name: GIT_SYNC_ONE_TIME
    value: "true"
  - name: GIT_SYNC_BRANCH
    value: my_awesome_branch
  - name: GIT_SYNC_DEPTH
    value: "1"
  - name: GIT_SYNC_USERNAME
    valueFrom:
      secretKeyRef:
        name: {{ include "my-awesome-app.fullname" . }}-git-credentials
        key: GIT_PULL_USERNAME
  - name: GIT_SYNC_PASSWORD
    valueFrom:
      secretKeyRef:
        name: {{ include "my-awesome-app.fullname" . }}-git-credentials
        key: GIT_PULL_PASSWORD
  - name: GIT_SYNC_ROOT
    value: /git
  - name: GIT_SYNC_TIMEOUT
    value: "99999"

There is another container which also mounts the persistent-storage volume and consumes the data from the cloned repo. The issue is that after a few successful executions I always get the same error:

INFO: detected pid 1, running init handler
I0809 06:29:23.648815      11 main.go:507] "level"=0 "msg"="starting up" "pid"=11 "args"=["/git-sync"]
I0809 06:29:24.077876      11 main.go:1003] "level"=0 "msg"="update required" "rev"="HEAD" "local"="1172cc4eeed3a3dd6d5e8fb65f3c15134adf9f32" "remote"="bfa07ea5354c25fa7e267dbcb6bbb305f2bd315f"
I0809 06:29:24.077969      11 main.go:690] "level"=0 "msg"="syncing git" "rev"="HEAD" "hash"="bfa07ea5354c25fa7e267dbcb6bbb305f2bd315f"
E0809 06:29:24.702409      11 main.go:172] "msg"="too many failures, aborting" "error"="Run(git gc --prune=all): exit status 128: { stdout: "", stderr: "fatal: gc is already running on machine 'my-awesome-app-1628460000-pr7v8' pid 49 (use --force if not)\n" }" "failCount"=0

and the pod my-awesome-app-1628460000-pr7v8 does not exist anymore.
The repository has about 7 GB of data (if it matters).

@thockin
Member

thockin commented Aug 9, 2021 via email

@michal-jagiello-tmpl
Author

@thockin no, unfortunately not :| Is it that git gc runs asynchronously? Can that process somehow be interrupted?

@thockin
Member

thockin commented Aug 10, 2021 via email

@thockin
Member

thockin commented Aug 10, 2021

I mean, I could just add a --force and cross my fingers. Or I could add a --skip-gc flag for cases that are not going to be long-lived. But that seems likely to end in repo bloat. If you never ever GC it because you never have time, then what?

@thockin
Member

thockin commented Aug 10, 2021

You could try --git-config=gc.autoDetach:false ? That could become the default if it works...

Or we could catch this specific case ("already running") and not treat it as fatal.

Without a repro, it's scary.
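For reference, the flag form is just an extra entry in the git-sync container args (a sketch; everything else stays as in the configs in this thread):

  args:
    - --git-config=gc.autoDetach:false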

@michal-jagiello-tmpl
Author

I've run that cron with

env:
....
  - name: GIT_SYNC_GIT_CONFIG
    value: "gc.autoDetach:false"

but the docs say:

gc.autoDetach
    Make git gc --auto return immediately and run in background if the system supports it. Default is true.

I see that you call git gc --prune=all here.
Maybe the solution could be to add a --disable-git-gc flag, for users who are absolutely sure what they are doing, and I could take care of GC myself. I'd run git gc once a day instead of every few hours, as sketched below?
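A rough sketch of that idea as a separate daily CronJob against the same volume (not a git-sync feature; all names below are placeholders, and it assumes the repository's .git lives directly under /git as in the config above):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: results-repo-gc            # placeholder
spec:
  schedule: "0 3 * * *"            # once a day, off-peak
  concurrencyPolicy: Forbid        # never start a second GC while one is still running
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: git-gc
            image: alpine/git      # any image with a git binary
            command: ["git", "-C", "/git", "gc"]
            volumeMounts:
            - name: persistent-storage
              mountPath: /git
          volumes:
          - name: persistent-storage
            persistentVolumeClaim:
              claimName: results-repo-pvc   # placeholder PVC name

You would still have to make sure it doesn't overlap with a running sync.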

@thockin
Member

thockin commented Aug 11, 2021 via email

@christophercutajar

Over the past few days we started to experience the same issue!

Logs:

+ playbook-slackbot-deployment-69756b6699-sc5c4 › playbook-sync
playbook-slackbot-deployment-69756b6699-sc5c4 playbook-sync 2021-10-19T13:38:03.158022829+02:00 INFO: detected pid 1, running init handler
playbook-slackbot-deployment-69756b6699-sc5c4 playbook-sync 2021-10-19T13:38:03.197820105+02:00 I1019 11:38:03.197508      11 main.go:507] "level"=0 "msg"="starting up" "pid"=11 "args"=["/git-sync"]
playbook-slackbot-deployment-69756b6699-sc5c4 playbook-sync 2021-10-19T13:38:03.197887375+02:00 I1019 11:38:03.197689      11 main.go:860] "level"=0 "msg"="cloning repo" "origin"="https://github.com/<org>/<repo>.git" "path"="/git"
playbook-slackbot-deployment-69756b6699-sc5c4 playbook-sync 2021-10-19T13:39:47.561229307+02:00 I1019 11:39:47.560916      11 main.go:690] "level"=0 "msg"="syncing git" "rev"="HEAD" "hash"="dd21cb48350c2354a4b36ad535173ff962e75fad"
playbook-slackbot-deployment-69756b6699-sc5c4 playbook-sync 2021-10-19T13:42:26.640201065+02:00 E1019 11:42:26.639933      11 main.go:172] "msg"="too many failures, aborting" "error"="Run(git gc --prune=all): context deadline exceeded: { stdout: "", stderr: "" }" "failCount"=0
- playbook-slackbot-deployment-69756b6699-sc5c4 › playbook-sync

Config:

- name: playbook-sync
  image: k8s.gcr.io/git-sync/git-sync:v3.3.4
  env:
  - name: GIT_SYNC_USERNAME
    value: "user"
  - name: GIT_SYNC_PASSWORD
    valueFrom:
      secretKeyRef:
        name: playbooks-bot-tokens
        key: github
  - name: GIT_SYNC_ROOT
    value: "/git"
  - name: GIT_SYNC_REPO
    value: "https://github.com/<org>/<repo>.git"
  - name: GIT_SYNC_BRANCH
    value: "master"
  volumeMounts:
  - name: playbooks-shared-data
    mountPath: /git

@christophercutajar

Increasing GIT_SYNC_TIMEOUT to 300 from the default of 120 seems to have helped resolve the issue.
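For reference, that's just the timeout env var on the git-sync container:

  - name: GIT_SYNC_TIMEOUT
    value: "300"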

@thockin
Member

thockin commented Oct 19, 2021 via email

@jdavidheiser

I also just hit this - it's definitely a case where the garbage collection was interrupted while running on persistent storage. You can repro by kicking off a GC and then killing the pod before the GC finishes. Manually running git gc --force on the pod in between crashes was enough to get the repo back into a state where it stopped crashing.
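A sketch of that manual recovery step (pod and container names are placeholders; /git is the GIT_SYNC_ROOT used in the configs above):

  kubectl exec <pod-name> -c <git-sync-container> -- git -C /git gc --force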

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 18, 2022
@thockin thockin removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 19, 2022
@thockin
Member

thockin commented Feb 19, 2022

So there seem to be a few issues with GC:

  1. GC can take too long and timeout
  2. GC can auto-detach and be running in the background on the next sync
  3. GC can auto-detach and be terminated in an init container (leaving some stale metadata)

At least for #3 I was able to force a repro, and git seems smart enough to realize that the remembered PID is dead, so that's not an issue.

To fix #2 we can set autoDetach to false. That converts #2 into #1.

We probably want to use --auto on the "every sync" GC and only run more aggressive GC periodically, async to the main loop. That means we probably need some flag to control GC strategy. I'll have to think more on how to do this.
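Roughly, the distinction being described (a sketch, not what git-sync currently runs):

  # per-sync, cheap: a no-op unless git's own thresholds say work is needed
  git -C /git gc --auto

  # periodic, aggressive: full repack and prune, run asynchronously to the sync loop
  git -C /git gc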

@thockin
Member

thockin commented Feb 19, 2022

Also, we should probably set pruneExpire to something other than "all" or "now" (e.g. 1.hour.ago).
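Concretely (a sketch), that would change the GC call from what the error logs above show to something with a grace period:

  git gc --prune=all          # today, per the logs above
  git gc --prune=1.hour.ago   # keep loose objects newer than an hour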


@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 20, 2022
@thockin thockin removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 23, 2022

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Aug 21, 2022
@thockin thockin added lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 21, 2022
@thockin
Member

thockin commented Jun 13, 2023

So, the v4 branch has a LOT of changes around GC. I can't figure out how to force this situation to happen now. I'm going to close this and if someone can make it happen again (once I cut a v4, that is) then we can re-examine.

@thockin thockin closed this as completed Jun 13, 2023