
QUESTION: What is terminating the DinD container? #616

Closed
jwalters-gpsw opened this issue Jun 8, 2021 · 9 comments

@jwalters-gpsw

About 5% of the time my workflows fail because the workspace is "dirty" (it contains the contents of a prior checkout, which causes an untar to fail). Because it's inconsistent (sometimes the job succeeds, sometimes it doesn't) I'm guessing it must be a timing issue with the restarts of the pod's containers. I thought I would modify the runner container to wait on the DinD container restart, but I couldn't figure out what was triggering the DinD container to restart in the first place.

Something in runsvc.sh?

Log of the DinD container shutdown below.

runner | Runner listener exited with error code 0
runner | Runner listener exit with 0 return code, stop the service, no retry needed.
docker | time="2021-06-08T22:14:39.462407666Z" level=info msg="Processing signal 'terminated'"
docker | time="2021-06-08T22:14:39.463396975Z" level=info msg="Daemon shutdown complete"
docker | time="2021-06-08T22:14:39.463454993Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=moby
docker | time="2021-06-08T22:14:39.463507528Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
docker | time="2021-06-08T22:14:39.463605980Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
docker | rpc error: code = Unknown desc = Error: No such container: 10da79bf7edf42ac2629bea2a8dd9b4e603357090b05f62c29ac5628e53a308e
@mumoshu
Collaborator

mumoshu commented Jun 8, 2021

@jwalters-gpsw Hey. This shouldn't happen as long as you use it normally. Are you bind mounting /var/lib/docker and/or /var/run/docker.sock onto runner pods?

@jwalters-gpsw
Author

Not doing anything special with the controller Helm install or the RunnerDeployment; no changes to the controller install. It happens a small percentage of the time. I have also tried moving the runner to our private Docker repo (to avoid rate limit issues) and it still occasionally happens. Here is the RunnerDeployment:

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: v0.0.1-runnerdeployment
spec:
  replicas: 6
  template:
    spec:
      group: OrgGroup
      labels:
      - stable
      - v0.0.1
      organization: xxx
      resources:
        limits:
          cpu: "2"
          memory: "4Gi"
        requests:
          cpu: "600m"
          memory: "300Mi"

@jwalters-gpsw
Author

jwalters-gpsw commented Jun 9, 2021

For more context, the failing job pulls down a tarfile of the workspace produced by a previous job and untars it. The untar fails because there are already files in the workspace. There is no checkout in the job.

  lint-js:
    needs: build
    runs-on: [self-hosted, stable]

    steps:
      - name: Download workspace
        uses: actions/download-artifact@v2
        with:
          name: workspace-${{ env.node-version }}
          path: ~/
      - name: Restore workspace
        run: tar -xhf ~/workspace.tar
      - name: Use Node.js ${{ env.node-version }}
        uses: actions/setup-node@v2.1.5
        with:
          node-version: ${{ env.node-version }}
      - run: npm install -g yarn
      - run: make lint-js
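
If the root cause turns out to be leftover files rather than the runner lifecycle, one workaround is to clear the work directory before extracting the archive. A minimal sketch, assuming the stale files live in the job's default $GITHUB_WORKSPACE; this step would go right before the existing Restore workspace step:

      - name: Clean workspace
        # Remove anything left over from a previous job on this runner,
        # including dotfiles, so the untar starts from an empty directory.
        run: find "$GITHUB_WORKSPACE" -mindepth 1 -delete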

@toast-gear
Collaborator

toast-gear commented Jun 9, 2021

Have you got an example we can look at? Can you confirm that in this example the runner that gets assigned and fails is unique? (See the Set up job section of the workflow log.) I'm wondering if the --once flag we use isn't working as designed; truly guaranteed run-once jobs are going to be enabled by GitHub soon.

@mumoshu
Collaborator

mumoshu commented Jun 9, 2021

Yeah, this might be related to #466. A swarm of Actions jobs can result in occasional failures because ephemeral GitHub Actions runners mistakenly dequeue jobs while shutting down and are then unable to complete them.

@jwalters-gpsw Could you try upgrading to the latest controller, 0.19.0, and setting ephemeral: false? I think the race issue with ephemeral runners can be alleviated by that setting.
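
For reference, a minimal sketch of where that field would go in the RunnerDeployment posted above, assuming it sits alongside the other runner options under spec.template.spec (controller 0.19.0):

apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: v0.0.1-runnerdeployment
spec:
  replicas: 6
  template:
    spec:
      # ephemeral: false keeps the runner registered so it can run multiple
      # jobs without restarting (and racing a shutdown) between them.
      ephemeral: false
      organization: xxx
      group: OrgGroup
      labels:
      - stable
      - v0.0.1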

@mumoshu
Collaborator

mumoshu commented Jun 9, 2021

QUESTION: What is terminating the DinD container?

So, back to your original question: it is Kubernetes that stops the dind container, as part of pod deletion. By default a runner is ephemeral, which means it tries its best to shut down after a single job run.

@jwalters-gpsw
Author

I've hooked the system up to Datadog and am now capturing all the logs (which were disappearing after the job run). You can close this issue until I have something more specific and concrete.

On the shutdown of the DinD container: I want to know what the actual mechanism shutting it down is. Is it a SIGTERM, and if so, where is it sent from?

@mumoshu
Collaborator

mumoshu commented Jun 11, 2021

@jwalters-gpsw It follows the standard Kubernetes pod termination process, as documented in https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination. Does that answer your question?
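
To spell out the mechanism in plain pod terms (a sketch of generic Kubernetes behaviour, not the exact pod spec the controller generates): on pod deletion the kubelet runs any preStop hook, then the container runtime sends each container's stop signal (SIGTERM by default) to its main process, which is what the "Processing signal 'terminated'" line from dockerd above reflects; any container still running after terminationGracePeriodSeconds is killed with SIGKILL. The relevant pod fields look roughly like:

spec:
  terminationGracePeriodSeconds: 30   # default; SIGKILL follows if containers are still running after this
  containers:
  - name: docker
    lifecycle:
      preStop:
        exec:
          # Optional hook that runs before the stop signal is sent.
          command: ["sh", "-c", "sleep 5"]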

@stale

stale bot commented Jul 11, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jul 11, 2021
@stale stale bot closed this as completed Jul 25, 2021