

Script checks can overlap #3565

Closed
nh2 opened this issue Oct 10, 2017 · 2 comments


nh2 commented Oct 10, 2017

To my surprise, Consul runs script checks in an overlapping fashion instead of serially.

This can lead to many lingering processes and memory leakage if the check interval is short or the timeout is long.

It also means that an older, still-running script check can overwrite the result already written by a newer one.

For example, put `date` into the script check. It can then theoretically happen that the `date` output captured by successive check runs shows time going backwards (for example, if the first `date` was CPU-scheduled unluckily).

I would have assumed that Consul runs no more than one instance of a given script check at any time.

At least not unless some special flag is set.
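The expected non-overlapping behavior can be sketched as follows (an illustrative Python sketch, not Consul's actual Go implementation): start the next run of a check only after the previous run has returned, sleeping only for whatever remains of the interval.

```python
import time

def run_serially(check, interval, rounds):
    """Run `check` up to `rounds` times, never overlapping: the next
    run starts only after the previous one has returned, and only the
    remainder of the interval (if any) is slept."""
    results = []
    for _ in range(rounds):
        start = time.monotonic()
        results.append(check())  # next iteration waits for this to return
        remaining = interval - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)
    return results

def slow_check():
    time.sleep(0.06)          # a check that takes longer than its interval
    return time.monotonic()

# With a 20 ms interval but a 60 ms check, runs still never overlap;
# the effective period simply stretches to the check's duration.
timestamps = run_serially(slow_check, interval=0.02, rounds=3)
gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
```

Under this scheme the check can never pile up processes, at the cost of the effective interval growing when the check is slow.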

consul version for both Client and Server

Client: irrelevant
Server: Consul v0.9.3

consul info for both Client and Server

Client: irrelevant

Server:

# consul info
agent:
	check_monitors = 1
	check_ttls = 0
	checks = 1
	services = 1
build:
	prerelease = 
	revision = 
	version = 0.9.3
consul:
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 10.0.0.2:8300
	server = true
raft:
	applied_index = 95568
	commit_index = 95568
	fsm_pending = 0
	last_contact = 0
	last_log_index = 95568
	last_log_term = 495
	last_snapshot_index = 90278
	last_snapshot_term = 479
	latest_configuration = [{Suffrage:Voter ID:10.0.0.3:8300 Address:10.0.0.3:8300} {Suffrage:Voter ID:10.0.0.1:8300 Address:10.0.0.1:8300} {Suffrage:Voter ID:10.0.0.2:8300 Address:10.0.0.2:8300}]
	latest_configuration_index = 1
	num_peers = 2
	protocol_version = 2
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 495
runtime:
	arch = amd64
	cpu_count = 1
	goroutines = 231
	max_procs = 1
	os = linux
	version = go1.9
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 165
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 468
	members = 5
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 231
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

NixOS

Description of the Issue (and unexpected/desired result)

Consul runs script checks in an overlapping fashion instead of serially.

Reproduction steps

Define a script check with an interval of 1 second whose script sleeps for 10 seconds. Observe the accumulating processes in `ps` or `htop`.
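For reference, a minimal check definition of the kind described, in pre-1.0 Consul agent configuration syntax (`script`, `interval`, and the nesting under `check` are the field names of that era; the check name is made up):

```json
{
  "check": {
    "name": "overlap-demo",
    "script": "sleep 10",
    "interval": "1s"
  }
}
```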

@slackpad (Contributor)

Hi @nh2, that's not the intended behavior. Will take a look.

@slackpad slackpad added this to the 1.0 milestone Oct 10, 2017
kyhavlov added a commit that referenced this issue Oct 10, 2017
Kill the subprocess spawned by a script check once the timeout is reached. Previously Consul just marked the check critical and left the subprocess around.

Fixes #3565.
slackpad pushed a commit that referenced this issue Oct 11, 2017
* Kill check processes after the timeout is reached

Kill the subprocess spawned by a script check once the timeout is reached. Previously Consul just marked the check critical and left the subprocess around.

Fixes #3565.

* Set err to non-nil when timeout occurs

* Fix check timeout test

* Kill entire process subtree on check timeout

* Add a docs note about windows subprocess termination
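The "kill entire process subtree on check timeout" approach described in the commit above can be illustrated like this (a Python sketch of the idea, not Consul's actual Go code; POSIX-only): run the script in its own process group, and signal the whole group when the timeout expires so that nothing the script spawned survives.

```python
import os
import signal
import subprocess

def run_check(cmd, timeout):
    # start_new_session=True puts the script in its own process group,
    # so we can later signal the script *and* anything it spawned.
    proc = subprocess.Popen(cmd, shell=True, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Kill the entire process group, then reap the shell.
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        proc.wait()
        return None  # timed out: the caller marks the check critical

# A check whose script (and background child) would run far longer than
# its timeout gets reaped instead of lingering:
status = run_check("sleep 30 & sleep 30", timeout=0.2)
```

Without the process-group kill, signalling only `proc.pid` would leave the backgrounded `sleep` running, which is exactly the leak reported in this issue.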
nh2 (Author) commented Oct 11, 2017

For people reading this, there is an open question about whether this entire problem is indeed already solved: #3567 (comment)

nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 16, 2017
So far, after a full reboot of all machines, some would sometimes
have failed systemd units.

Key changes:

* A mount-only machine is added to test that this use case works.
  This made me find all the below troubles.

* Fix SSH hang by using .mount unit instead of fstab converter.
  This apparently works around NixOS/nixpkgs#30348 for me.
  No idea why the fstab converter would have this problem.
  The nasty
    pam_systemd(sshd:session): Failed to create session: Connection timed out
  error would slow down SSH logins by 25 seconds, also making reboots slower
  (because nixops keys upload uses SSH).
  It would also show things like `session-1.scope` as failed in systemctl.

* More robustly track (via Consul) whether the Gluster volume is already
  mountable from the client (that is, up and running on the servers).
  This has come a long way; to implement this, I've tried:
  * manual sessions, but those have a 10-second minimum TTL which gets
    auto-extended even longer when rebooting, so I tried
  * script checks, which don't kill the subprocess even when you give a
    `timeout` and don't allow setting a TTL, so I tried
  * TTL checks + a manual update script, and not even those set the check
    to failed when the TTL expires.
  See my filed Consul bugs:
  * hashicorp/consul#3569
  * hashicorp/consul#3563
  * hashicorp/consul#3565
  So I am using a more specific workaround now:
  A TTL check + manual update script, AND a script
  (`consul-scripting-helper.py waitUntilService --wait-for-index-change`)
  run by a service (`glusterReadyForClientMount.service`)
  that waits until the TTL of a check for the service is observed
  to be bumped at least once during the lifetime of the script.
  When the script observes a TTL bump, we can be sure that at least
  one of the gluster servers has its volume up.

* `gluster volume status VOLUME_NAME detail | grep "^Online.*Y"` is used
  to check whether the volume is actually up.

* Using consul's DNS feature to automatically pick an available server
  for the mount.
  dnsmasq is used to forward DNS queries to the *.consul domain
  to the consul agent.
  `allow_stale = false` is used to ensure that the DNS queries
  are not outdated.

* Create `/etc/ssl/dhparam.pem` to avoid spurious Gluster warnings
  (see https://bugzilla.redhat.com/show_bug.cgi?id=1398237).

* `consul-scripting-helper.py` received some fixes and extra loops
  to retry when Consul is down.
nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 16, 2017
(same commit message as above)
nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 16, 2017
(same commit message as above, with this addition:)

This commit also switches to using `services.glusterfs.tlsSettings`
as implemented in NixOS/nixpkgs#27340
which revealed a lot of the above issues.
nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 22, 2017
(same commit message as above)
nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 22, 2017
(same commit message as above)