Script checks can overlap #3565
Hi @nh2, that's not the intended behavior. Will take a look.
kyhavlov added a commit that referenced this issue on Oct 10, 2017
Kill the subprocess spawned by a script check once the timeout is reached. Previously Consul just marked the check critical and left the subprocess around. Fixes #3565.
slackpad pushed a commit that referenced this issue on Oct 11, 2017
* Kill check processes after the timeout is reached

  Kill the subprocess spawned by a script check once the timeout is reached. Previously Consul just marked the check critical and left the subprocess around. Fixes #3565.

* Set err to non-nil when timeout occurs
* Fix check timeout test
* Kill entire process subtree on check timeout
* Add a docs note about windows subprocess termination
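In outline, the fix described in this PR amounts to starting the check in its own process group and signalling the whole group when the timeout fires. Below is a rough Go sketch of that technique; it is an illustration written for this summary, not Consul's actual code, and the shell invocation and function name are made up:

```go
package checks

import (
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// runScriptCheck starts the check command in its own process group so that,
// if the timeout is hit, the shell and anything it spawned can be killed
// together instead of being left running. Unix-only sketch; Windows
// termination works differently, as the docs note in the PR points out.
func runScriptCheck(command string, timeout time.Duration) error {
	cmd := exec.Command("/bin/sh", "-c", command)
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true} // new process group

	if err := cmd.Start(); err != nil {
		return err
	}

	done := make(chan error, 1)
	go func() { done <- cmd.Wait() }()

	select {
	case err := <-done:
		return err // the check finished before the timeout
	case <-time.After(timeout):
		// A negative pid signals every process in the group (the subtree).
		_ = syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
		<-done // reap the child so it does not linger as a zombie
		return fmt.Errorf("check timed out after %s", timeout)
	}
}
```

On Unix, the negative pid passed to `syscall.Kill` targets the whole process group, which is what lets children of the check script be cleaned up too; as the last bullet above notes, Windows needs a different mechanism.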
For people who read this, here's a question about whether this entire problem is indeed already solved: #3567 (comment)
nh2 added a commit to nh2/nixops-gluster-example that referenced this issue on Oct 16, 2017
So far, after a full reboot of all machines, some would sometimes have failed systemd units. Key changes:

* A mount-only machine is added to test that this use case works. This made me find all the below troubles.
* Fix SSH hang by using a .mount unit instead of the fstab converter. This apparently works around NixOS/nixpkgs#30348 for me. No idea why the fstab converter would have this problem. The nasty `pam_systemd(sshd:session): Failed to create session: Connection timed out` error would slow down SSH logins by 25 seconds, also making reboots slower (because nixops keys upload uses SSH). It would also show things like `session-1.scope` as failed in systemctl.
* More robustly track (via Consul) whether the Gluster volume is already mountable from the client (that is, up and running on the servers). This has come a long way; to implement this, I've tried:
  * manual sessions, but those have a 10 second minimum TTL which gets auto-extended even longer when rebooting, so I tried
  * script checks, which don't kill the subprocess even when you give a `timeout` and don't allow setting a TTL, so I tried
  * TTL checks + a manual update script, and not even those set the check to failed when the TTL expires.

  See my filed Consul bugs:
  * hashicorp/consul#3569
  * hashicorp/consul#3563
  * hashicorp/consul#3565

  So I am using a more specific workaround now: a TTL check + manual update script, AND a script (`consul-scripting-helper.py waitUntilService --wait-for-index-change`) run by a service (`glusterReadyForClientMount.service`) that waits until the TTL of a check for the service is observed to be bumped at least once during the lifetime of the script. When the script observes a TTL bump, we can be sure that at least one of the gluster servers has its volume up.
* `gluster volume status VOLUME_NAME detail | grep "^Online.*Y"` is used to check whether the volume is actually up.
* Use consul's DNS feature to automatically pick an available server for the mount. dnsmasq is used to forward DNS queries for the *.consul domain to the consul agent. `allow_stale = false` is used to ensure that the DNS queries are not outdated.
* Create `/etc/ssl/dhparam.pem` to avoid spurious Gluster warnings (see https://bugzilla.redhat.com/show_bug.cgi?id=1398237).
* `consul-scripting-helper.py` received some fixes and extra loops to retry when Consul is down.
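To make the `--wait-for-index-change` part of the workaround concrete: it boils down to a Consul blocking query that watches the index of the service's health checks and returns once that index moves, i.e. once the TTL update script has touched the check at least once. Here is a rough Go equivalent of the idea (not a transcription of `consul-scripting-helper.py`; the local agent address, wait time, and function name are assumptions):

```go
package watch

import (
	"fmt"
	"net/http"
	"time"
)

// waitForCheckIndexChange blocks until the Consul index covering a service's
// health checks changes at least once, i.e. until something (such as a TTL
// update script) touches one of the checks. Sketch only: the agent address
// and wait time are assumptions, and error handling is minimal.
func waitForCheckIndexChange(service string) error {
	client := &http.Client{Timeout: 10 * time.Minute}
	base := "http://127.0.0.1:8500/v1/health/checks/" + service

	// First request establishes the current index.
	resp, err := client.Get(base)
	if err != nil {
		return err
	}
	resp.Body.Close()
	index := resp.Header.Get("X-Consul-Index")
	if index == "" {
		return fmt.Errorf("no X-Consul-Index returned for service %q", service)
	}

	// Blocking query: the agent holds the request open until the index
	// changes or the wait time elapses, so keep asking until it has moved.
	for {
		resp, err := client.Get(fmt.Sprintf("%s?index=%s&wait=5m", base, index))
		if err != nil {
			time.Sleep(time.Second) // Consul may be down; retry, as the helper script does
			continue
		}
		resp.Body.Close()
		if next := resp.Header.Get("X-Consul-Index"); next != "" && next != index {
			return nil // the check was bumped (or otherwise changed) at least once
		}
	}
}
```

Blocking queries return the current index in the `X-Consul-Index` header; passing it back via `?index=` makes the agent hold the request until something changes, so the loop wakes up roughly when the TTL check is bumped rather than busy-polling.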
nh2 added a commit to nh2/nixops-gluster-example that referenced this issue on Oct 16, 2017 (same commit message as above)
nh2 added a commit to nh2/nixops-gluster-example that referenced this issue on Oct 16, 2017

Same commit message as above, with one addition: This commit also switches to using `services.glusterfs.tlsSettings` as implemented in NixOS/nixpkgs#27340, which revealed a lot of the above issues.
nh2 added two commits to nh2/nixops-gluster-example that referenced this issue on Oct 22, 2017 (same commit message as above)
To my surprise, Consul runs script checks in an overlapping fashion instead of serially.

This can lead to lots of processes and memory leakage if the script check is short or the `timeout` is long. It also means that an older script check can overwrite the result written by a younger script check.

For example, assume you put `date` into the script check. Then it can theoretically happen that the date output captured by the script shows time going backwards (for example, if the first `date` was CPU-scheduled unluckily).

I would have assumed that Consul runs no more than 1 instance of a script check at any given time. At least not unless some special flag is set.
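For clarity, the behavior I would have assumed corresponds roughly to the scheduling policy below, where a tick is simply skipped while a previous run is still in flight. This is a hypothetical sketch of the expected semantics, not how Consul actually schedules script checks:

```go
package checks

import (
	"log"
	"os/exec"
	"time"
)

// runSerially runs a check command at the given interval but never overlaps
// two runs: if the previous run is still in flight when the interval ticks,
// that tick is skipped. Hypothetical sketch of the expected behavior.
func runSerially(command string, interval time.Duration) {
	running := make(chan struct{}, 1) // holds a token while a run is active

	for range time.Tick(interval) {
		select {
		case running <- struct{}{}: // token acquired: no run currently in flight
			go func() {
				defer func() { <-running }()
				out, err := exec.Command("/bin/sh", "-c", command).CombinedOutput()
				log.Printf("check finished: err=%v output=%q", err, out)
			}()
		default:
			log.Println("previous check still running; skipping this interval")
		}
	}
}
```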
`consul version` for both Client and Server

Client: irrelevant
Server: Consul v0.9.3

`consul info` for both Client and Server

Client: irrelevant
Server:
Operating system and Environment details
NixOS
Description of the Issue (and unexpected/desired result)
Consul runs script checks in an overlapping fashion instead of serially.
Reproduction steps

Define a check with `interval` `1s`, make it `sleep 10` seconds. Observe processes in `ps` or `htop`.
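For reference, here is a minimal check definition that should reproduce this with Consul 0.9.x (this uses the pre-1.0 `script` field; the check name is arbitrary):

```json
{
  "check": {
    "name": "overlap-demo",
    "script": "sleep 10",
    "interval": "1s"
  }
}
```

With a 1-second interval and a 10-second script, `ps` should show on the order of ten concurrent `sleep` processes if runs overlap.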