
QUESTION: What is terminating the DinD container? #616

Closed
jwalters-gpsw opened this issue Jun 8, 2021 · 9 comments

@jwalters-gpsw

About 5% of the time my workflows fail because the workspace is "dirty" (it contains the contents of a prior checkout, which causes an untar to fail). Because it's inconsistent (sometimes the job succeeds, sometimes it doesn't) I'm guessing it must be a timing issue with the restarts of the pod's containers. I thought I would modify the runner container to wait on the DinD container restart, but I couldn't figure out what was triggering the DinD container to restart in the first place.

Something in runsvc.sh?

Log of the DinD container shutdown below.

runner | Runner listener exited with error code 0
runner | Runner listener exit with 0 return code, stop the service, no retry needed.
docker | time="2021-06-08T22:14:39.462407666Z" level=info msg="Processing signal 'terminated'"
docker | time="2021-06-08T22:14:39.463396975Z" level=info msg="Daemon shutdown complete"
docker | time="2021-06-08T22:14:39.463454993Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=moby
docker | time="2021-06-08T22:14:39.463507528Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
docker | time="2021-06-08T22:14:39.463605980Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
docker | rpc error: code = Unknown desc = Error: No such container: 10da79bf7edf42ac2629bea2a8dd9b4e603357090b05f62c29ac5628e53a308e
@mumoshu
Collaborator

mumoshu commented Jun 8, 2021

@jwalters-gpsw Hey. This shouldn't happen as long as you use it normally. Are you bind mounting /var/lib/docker and/or /var/run/docker.sock onto runner pods?

@jwalters-gpsw
Author

Not doing anything special with the controller Helm install or the RunnerDeployment; no changes to the controller install. It happens a small percentage of the time. I have also tried moving the runner to our private Docker repo (to avoid rate limit issues) and it still occasionally happens. Here is the RunnerDeployment:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: v0.0.1-runnerdeployment
spec:
  replicas: 6
  template:
    spec:
      group: OrgGroup
      labels:
      - stable
      - v0.0.1
      organization: xxx
      resources:
        limits:
          cpu: "2"
          memory: "4Gi"
        requests:
          cpu: "600m"
          memory: "300Mi"

@jwalters-gpsw
Author

jwalters-gpsw commented Jun 9, 2021

For more context, the failing job pulls down a tarfile of the workspace produced by a previous job and untars it. The untar fails because there are already files in the workspace. There is no checkout in the job.

  lint-js:
    needs: build
    runs-on: [self-hosted, stable]

    steps:
      - name: Download workspace
        uses: actions/download-artifact@v2
        with:
          name: workspace-${{ env.node-version }}
          path: ~/
      - name: Restore workspace
        run: tar -xhf ~/workspace.tar
      - name: Use Node.js ${{ env.node-version }}
        uses: actions/setup-node@v2.1.5
        with:
          node-version: ${{ env.node-version }}
      - run: npm install -g yarn
      - run: make lint-js
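
If the root cause turns out to be leftover files rather than the runner lifecycle, one workaround is to clear the work directory before extracting the archive. A minimal sketch, assuming the stale files live in the job's default $GITHUB_WORKSPACE; this step would go right before the existing Restore workspace step:

      - name: Clean workspace
        # Remove anything left over from a previous job on this runner,
        # including dotfiles, so the untar starts from an empty directory.
        run: find "$GITHUB_WORKSPACE" -mindepth 1 -delete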

@toast-gear
Collaborator

toast-gear commented Jun 9, 2021

Have you got an example we can look at? Can you confirm that in this example the runner that gets assigned and fails is unique? (See the Set up job section of the workflow log.) I'm wondering if the --once flag we use isn't working as designed; truly guaranteed run-once jobs are going to be enabled by GitHub soon.

@mumoshu
Collaborator

mumoshu commented Jun 9, 2021

Yeah, this might be related to #466. A swarm of Actions jobs can result in occasional failures because ephemeral GitHub Actions runners mistakenly dequeue jobs while shutting down and are then unable to complete them.

@jwalters-gpsw Could you try upgrading to the latest controller, 0.19.0, and setting ephemeral: false? I think the race issue with ephemeral runners can be alleviated by that setting.
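
For reference, a minimal sketch of where that field would go in the RunnerDeployment posted above, assuming it sits alongside the other runner options under spec.template.spec (controller 0.19.0):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: v0.0.1-runnerdeployment
spec:
  replicas: 6
  template:
    spec:
      # ephemeral: false keeps the runner registered so it can run multiple
      # jobs without restarting (and racing a shutdown) between them.
      ephemeral: false
      organization: xxx
      group: OrgGroup
      labels:
      - stable
      - v0.0.1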

@mumoshu
Collaborator

mumoshu commented Jun 9, 2021

QUESTION: What is terminating the DinD container?

So, back to your original question: it is Kubernetes that stops the dind container, as part of pod deletion. By default a runner is ephemeral, which means it tries its best to shut down after a single job run.

@jwalters-gpsw
Author

I've hooked the system up to Datadog and am now capturing all the logs (which were disappearing after the job run). You can close this issue until I have something more specific and concrete.

On the shutdown of the DinD container: I want to know what the actual mechanism shutting it down is. Is it a SIGTERM, and if so, where is it sent from?

@mumoshu
Collaborator

mumoshu commented Jun 11, 2021

@jwalters-gpsw It follows the standard Kubernetes pod termination process, as documented in https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination. Does that answer your question?
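
To spell out the mechanism in plain pod terms (a sketch of generic Kubernetes behaviour, not the exact pod spec the controller generates): on pod deletion the kubelet runs any preStop hook, then the container runtime sends each container's stop signal (SIGTERM by default) to its main process, which is what the "Processing signal 'terminated'" line from dockerd above reflects; any container still running after terminationGracePeriodSeconds is killed with SIGKILL. The relevant pod fields look roughly like:

spec:
  terminationGracePeriodSeconds: 30   # default; SIGKILL follows if containers are still running after this
  containers:
  - name: docker
    lifecycle:
      preStop:
        exec:
          # Optional hook that runs before the stop signal is sent.
          command: ["sh", "-c", "sleep 5"]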

@stale

stale bot commented Jul 11, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 11, 2021
@stale stale bot closed this as completed Jul 25, 2021