Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

native healthchecks are unreliable/inconsistent #5915

Closed
cjeanner opened this issue Apr 21, 2020 · 3 comments · Fixed by #6009
Closed

native healthchecks are unreliable/inconsistent #5915

cjeanner opened this issue Apr 21, 2020 · 3 comments · Fixed by #6009
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.

Comments

@cjeanner
Copy link

Is this a BUG REPORT or FEATURE REQUEST? (leave only one on its own line)

/kind bug

Description
Using native podman healthcheck feature is unreliable since there are inconsistencies between live state and the state reported within "inspect". For instance:

podman inspect neutron_api | jq '.[0]["State"]["Healthcheck"]'                                                                                                                                              
{                                                                                                                                                                                                                                             
  "Status": "healthy",
  "FailingStreak": 0,
  "Log": [
    {
      "Start": "2020-04-21T11:21:38.122154731Z",
      "End": "2020-04-21T11:21:39.199659552Z",
      "ExitCode": 0,
      "Output": ""
    },
    {
      "Start": "2020-04-21T11:21:38.917118406Z",
      "End": "2020-04-21T11:21:39.575425172Z",
      "ExitCode": 0,
      "Output": ""
    },
    {
      "Start": "2020-04-21T11:21:39.251824671Z",
      "End": "2020-04-21T11:21:39.767073492Z",
      "ExitCode": 0,
      "Output": ""
    },
    {
      "Start": "2020-04-21T11:21:47.97086488Z",
      "End": "2020-04-21T11:21:48.443987962Z",
      "ExitCode": 1,
      "Output": ""
    },
    {
      "Start": "2020-04-21T11:21:47.987444237Z",
      "End": "2020-04-21T11:21:48.497487617Z",
      "ExitCode": 0,
      "Output": ""
    }
  ]
}

while:

[CentOS-8.1 - root@undercloud ~]# podman healthcheck run neutron_api
unhealthy

The right state is the latter, I've broken on purpose the healthcheck in order to do some tests....

Steps to reproduce the issue:

  1. Get a container with a working healthcheck
  2. Break that healthcheck
  3. Check the healthcheck using both "podman healthcheck run" and "podman inspect"

Describe the results you received:
We don't get the right result in the "inspect" output

Describe the results you expected:
We should get the correct result

Additional information you deem important (e.g. issue happens only occasionally):
The full inspect output, since there are other healthcheck-related things in the Config block. Maybe we did it wrong?
https://gist.github.com/cjeanner/1b1ba82eae7331c2c688ded4dcf8d111

Output of podman version:

Version:            1.6.4
RemoteAPI Version:  1
Go Version:         go1.13.4
OS/Arch:            linux/amd64

Output of podman info --debug:

debug:
  compiler: gc
  git commit: ""
  go version: go1.13.4
  podman version: 1.6.4
host:
  BuildahVersion: 1.12.0-dev
  CgroupVersion: v1
  Conmon:
    package: conmon-2.0.6-1.module_el8.1.0+298+41f9343a.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.0.6, commit: 2721f230f94894671f141762bd0d1af2fb263239'
  Distribution:
    distribution: '"centos"'
    version: "8"
  MemFree: 868073472
  MemTotal: 29264334848
  OCIRuntime:
    name: runc
    package: runc-1.0.0-64.rc9.module_el8.1.0+298+41f9343a.x86_64
    path: /usr/bin/runc
    version: 'runc version spec: 1.0.1-dev'
  SwapFree: 0
  SwapTotal: 0
  arch: amd64
  cpus: 16
  eventlogger: journald
  hostname: undercloud.localdomain
  kernel: 4.18.0-147.8.1.el8_1.x86_64
  os: linux
  rootless: false
  uptime: 4h 17m 54.55s (Approximately 0.17 days)
registries:
  blocked: null
  insecure:
  - undercloud.ctlplane.localdomain
  - 192.168.24.1
  - 192.168.24.3
  - undercloud.ctlplane
  search:
  - registry.redhat.io
  - registry.access.redhat.com
  - registry.fedoraproject.org
  - registry.centos.org
  - docker.io
store:
  ConfigFile: /etc/containers/storage.conf
  ContainerStore:
    number: 84
  GraphDriverName: overlay
  GraphOptions: {}
  GraphRoot: /var/lib/containers/storage
  GraphStatus:
    Backing Filesystem: xfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  ImageStore:
    number: 35
  RunRoot: /var/run/containers/storage
  VolumePath: /var/lib/containers/storage/volumes

Package info (e.g. output of rpm -q podman or apt list podman):

podman-1.6.4-4.module_el8.1.0+298+41f9343a.x86_64
@openshift-ci-robot openshift-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Apr 21, 2020
@mheon
Copy link
Member

mheon commented Apr 21, 2020

@baude PTAL

@cjeanner
Copy link
Author

Hello!
Sooo I took some more time to get a reproducer, and detected some other inconsistencies.

Reproducer

Dockerfile:

FROM alpine:latest

RUN apk add --no-cache vim
COPY healthcheck /healthcheck
RUN chmod +x /healthcheck

COPY init.sh /init.sh
RUN chmod +x /init.sh

CMD ["/init.sh"]

healthcheck

#!/bin/sh
echo 'This is a success'
exit 0

init.sh

#!/bin/sh
while :; do
  sleep 30
done

Methodology

  1. build the container using podman build -t healthcheck .
  2. run the container using podman run --health-cmd /healthcheck --health-interval 60s --health-retries 3 -d --name healthcheck healthcheck
  3. check the healthcheck state using both podman healthcheck run healthcheck and podman inspect healthcheck | jq '.[0]["State"]'

Break the healthcheck

  1. podman exec -uroot -ti healthcheck sh
  2. vi healthcheck
  3. modify its message and exit code - something like echo "this is a failure" and exit 1 is enough
  4. Check the healthcheck via podman inspect healthcheck | jq '.[0]["State"]'

You can run the following in order to always have the healthcheck status as reported by podman inspect: watch "podman inspect healthcheck | jq '.[0][\"State\"]'"

Issues

  1. podman inspect doesn't report the right state. From time to time you might see a failure, but you won't get the output, only the exit-code set to 1, and the FailingStreak will raise to 1. It will later go back to 0 as the failure is removed, replaced by a wrong success
  2. Running podman healthcheck run healthcheck actually DOES show the failure in podman inspect. Running the healthcheck manually over 3 times will toggle the container to "unhealthy"
  3. Starting a container with a failing healthcheck will keep it in "Starting" state without any failure reported in the inspect output. In fact, it seems that it doesn't refresh the healthcheck. In my case, there isn't any report of new healthcheck run since I started the container, a couple of minutes ago. Last report is from yesterday:
{
  "OciVersion": "1.0.1-dev",
  "Status": "running",
  "Running": true,
  "Paused": false,
  "Restarting": false,
  "OOMKilled": false,
  "Dead": false,
  "Pid": 70624,
  "ConmonPid": 70612,
  "ExitCode": 0,
  "Error": "",
  "StartedAt": "2020-04-23T07:50:06.660901138Z",
  "FinishedAt": "0001-01-01T00:00:00Z",
  "Healthcheck": {
    "Status": "starting",
    "FailingStreak": 0,
    "Log": [
      {
        "Start": "2020-04-22T14:46:01.170792682Z",
        "End": "2020-04-22T14:46:01.618066347Z",
        "ExitCode": 0,
        "Output": ""
      },
      {
        "Start": "2020-04-22T14:46:02.233518944Z",
        "End": "2020-04-22T14:46:02.690192626Z",
        "ExitCode": 0,
        "Output": ""
      },
      {
        "Start": "2020-04-22T14:46:03.171000823Z",
        "End": "2020-04-22T14:46:03.616105101Z",
        "ExitCode": 0,
        "Output": ""
      },
      {
        "Start": "2020-04-22T14:46:04.112180387Z",
        "End": "2020-04-22T14:46:04.54862729Z",
        "ExitCode": 0,
        "Output": ""
      },
      {
        "Start": "2020-04-22T14:46:04.97645628Z",
        "End": "2020-04-22T14:46:05.431782197Z",
        "ExitCode": 0,
        "Output": ""
      }
    ]
  }
}

@cjeanner
Copy link
Author

cjeanner commented Apr 23, 2020

After even more digging, I found a potential reason:
it seems a log is pushed here: /var/log/containers/stdouts/healthcheck.log
I'm not really sure how this is done/managed - but apparently every healthcheck is overriding the file on its own. It's not in an append fashion, meaning container-1 will override container-2 result.

Since the log doesn't have any mention of the container, it might explain why "podman inspect" doesn't report the right state.

Now, about the location:
/var/log/containers/stdouts is the base directory of the container log we've set up in ALL our containers.
For instance:

"LogPath": "/var/log/containers/stdouts/swift_proxy.log"

So. Question: is the healthcheck.log location built on some "dirname(LogPath)"? This would explain a lot. And a correction would be really simple and easy, something like:
os.path.join(os.path.dirname(LogPath), container_name_healthcheck.log))
(of course, this is just to show the idea)

Does it make any sense? Did I just find the root cause? :)

baude added a commit to baude/podman that referenced this issue Apr 27, 2020
instead of using the container log path to derive where to put the healthchecks, we now put them into the rundir to avoid collision of health check log files when the log path is set by user.

Fixes: containers#5915

Signed-off-by: Brent Baude <bbaude@redhat.com>
baude added a commit to baude/podman that referenced this issue Apr 27, 2020
instead of using the container log path to derive where to put the healthchecks, we now put them into the rundir to avoid collision of health check log files when the log path is set by user.

Fixes: containers#5915

Signed-off-by: Brent Baude <bbaude@redhat.com>
snj33v pushed a commit to snj33v/libpod that referenced this issue May 31, 2020
instead of using the container log path to derive where to put the healthchecks, we now put them into the rundir to avoid collision of health check log files when the log path is set by user.

Fixes: containers#5915

Signed-off-by: Brent Baude <bbaude@redhat.com>
mheon pushed a commit to mheon/libpod that referenced this issue Jun 24, 2020
instead of using the container log path to derive where to put the healthchecks, we now put them into the rundir to avoid collision of health check log files when the log path is set by user.

Fixes: containers#5915

Signed-off-by: Brent Baude <bbaude@redhat.com>
mheon pushed a commit to mheon/libpod that referenced this issue Sep 16, 2020
instead of using the container log path to derive where to put the healthchecks, we now put them into the rundir to avoid collision of health check log files when the log path is set by user.

Fixes: containers#5915

Signed-off-by: Brent Baude <bbaude@redhat.com>
@github-actions github-actions bot added the locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments. label Sep 23, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Sep 23, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
kind/bug Categorizes issue or PR as related to a bug. locked - please file new issue/PR Assist humans wanting to comment on an old issue or PR with locked comments.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants