

Script checks can overlap #3565

Closed
nh2 opened this issue Oct 10, 2017 · 2 comments


nh2 commented Oct 10, 2017

To my surprise, Consul runs script checks in an overlapping fashion instead of serially.

This can lead to many lingering processes and memory leakage if the check interval is short or the timeout is long.

It also means that an older, still-running script check can overwrite the result already written by a newer one.

For example, put `date` into the script check. It can then theoretically happen that the `date` output captured by successive check runs shows time going backwards (for example, if the first `date` was CPU-scheduled unluckily).

I would have assumed that Consul runs no more than one instance of a given script check at any time.

At least not unless some special flag is set.
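The expected non-overlapping behavior can be sketched as follows (an illustrative Python sketch, not Consul's actual Go implementation): start the next run of a check only after the previous run has returned, sleeping only for whatever remains of the interval.

```python
import time

def run_serially(check, interval, rounds):
    """Run `check` up to `rounds` times, never overlapping: the next
    run starts only after the previous one has returned, and only the
    remainder of the interval (if any) is slept."""
    results = []
    for _ in range(rounds):
        start = time.monotonic()
        results.append(check())  # next iteration waits for this to return
        remaining = interval - (time.monotonic() - start)
        if remaining > 0:
            time.sleep(remaining)
    return results

def slow_check():
    time.sleep(0.06)          # a check that takes longer than its interval
    return time.monotonic()

# With a 20 ms interval but a 60 ms check, runs still never overlap;
# the effective period simply stretches to the check's duration.
timestamps = run_serially(slow_check, interval=0.02, rounds=3)
gaps = [b - a for a, b in zip(timestamps, timestamps[1:])]
```

Under this scheme the check can never pile up processes, at the cost of the effective interval growing when the check is slow.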

consul version for both Client and Server

Client: irrelevant
Server: Consul v0.9.3

consul info for both Client and Server

Client: irrelevant

Server:

# consul info
agent:
	check_monitors = 1
	check_ttls = 0
	checks = 1
	services = 1
build:
	prerelease = 
	revision = 
	version = 0.9.3
consul:
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 10.0.0.2:8300
	server = true
raft:
	applied_index = 95568
	commit_index = 95568
	fsm_pending = 0
	last_contact = 0
	last_log_index = 95568
	last_log_term = 495
	last_snapshot_index = 90278
	last_snapshot_term = 479
	latest_configuration = [{Suffrage:Voter ID:10.0.0.3:8300 Address:10.0.0.3:8300} {Suffrage:Voter ID:10.0.0.1:8300 Address:10.0.0.1:8300} {Suffrage:Voter ID:10.0.0.2:8300 Address:10.0.0.2:8300}]
	latest_configuration_index = 1
	num_peers = 2
	protocol_version = 2
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 495
runtime:
	arch = amd64
	cpu_count = 1
	goroutines = 231
	max_procs = 1
	os = linux
	version = go1.9
serf_lan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 165
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 468
	members = 5
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = false
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 231
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

NixOS

Description of the Issue (and unexpected/desired result)

Consul runs script checks in an overlapping fashion instead of serially.

Reproduction steps

Define a script check with an interval of 1 second whose script sleeps for 10 seconds. Observe the accumulating processes in `ps` or `htop`.
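For reference, a minimal check definition of the kind described, in pre-1.0 Consul agent configuration syntax (`script`, `interval`, and the nesting under `check` are the field names of that era; the check name is made up):

```json
{
  "check": {
    "name": "overlap-demo",
    "script": "sleep 10",
    "interval": "1s"
  }
}
```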

@slackpad (Contributor)

Hi @nh2, that's not the intended behavior. Will take a look.

@slackpad slackpad added this to the 1.0 milestone Oct 10, 2017
kyhavlov added a commit that referenced this issue Oct 10, 2017
Kill the subprocess spawned by a script check once the timeout is reached. Previously Consul just marked the check critical and left the subprocess around.

Fixes #3565.
slackpad pushed a commit that referenced this issue Oct 11, 2017
* Kill check processes after the timeout is reached

Kill the subprocess spawned by a script check once the timeout is reached. Previously Consul just marked the check critical and left the subprocess around.

Fixes #3565.

* Set err to non-nil when timeout occurs

* Fix check timeout test

* Kill entire process subtree on check timeout

* Add a docs note about windows subprocess termination
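The "kill entire process subtree on check timeout" approach described in the commit above can be illustrated like this (a Python sketch of the idea, not Consul's actual Go code; POSIX-only): run the script in its own process group, and signal the whole group when the timeout expires so that nothing the script spawned survives.

```python
import os
import signal
import subprocess

def run_check(cmd, timeout):
    # start_new_session=True puts the script in its own process group,
    # so we can later signal the script *and* anything it spawned.
    proc = subprocess.Popen(cmd, shell=True, start_new_session=True)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Kill the entire process group, then reap the shell.
        os.killpg(os.getpgid(proc.pid), signal.SIGKILL)
        proc.wait()
        return None  # timed out: the caller marks the check critical

# A check whose script (and background child) would run far longer than
# its timeout gets reaped instead of lingering:
status = run_check("sleep 30 & sleep 30", timeout=0.2)
```

Without the process-group kill, signalling only `proc.pid` would leave the backgrounded `sleep` running, which is exactly the leak reported in this issue.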
nh2 (Author) commented Oct 11, 2017

For people reading this, there is an open question about whether this entire problem is indeed already solved: #3567 (comment)

nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 16, 2017
So far, after a full reboot of all machines, some would sometimes
have failed systemd units.

Key changes:

* A mount-only machine is added to test that this use case works.
  This made me find all the below troubles.

* Fix SSH hang by using .mount unit instead of fstab converter.
  This apparently works around NixOS/nixpkgs#30348 for me.
  No idea why the fstab converter would have this problem.
  The nasty
    pam_systemd(sshd:session): Failed to create session: Connection timed out
  error would slow down SSH logins by 25 seconds, also making reboots slower
  (because nixops keys upload uses SSH).
  It would also show things like `session-1.scope` as failed in systemctl.

* More robustly track (via Consul) whether the Gluster volume is already
  mountable from the client (that is, up and running on the servers).
  This has come a long way; to implement this, I've tried:
  * manual sessions, but those have a 10-second minimum TTL which gets
    auto-extended even longer when rebooting, so I tried
  * script checks, which don't kill the subprocess even when you give a
    `timeout` and don't allow setting a TTL, so I tried
  * TTL checks + a manual update script, and not even those set the check
    to failed when the TTL expires.
  See my filed Consul bugs:
  * hashicorp/consul#3569
  * hashicorp/consul#3563
  * hashicorp/consul#3565
  So I am using a more specific workaround now:
  A TTL check + manual update script, AND a script
  (`consul-scripting-helper.py waitUntilService --wait-for-index-change`)
  run by a service (`glusterReadyForClientMount.service`)
  that waits until the TTL of a check for the service is observed
  to be bumped at least once during the lifetime of the script.
  When the script observes a TTL bump, we can be sure that at least
  one of the gluster servers has its volume up.

* `gluster volume status VOLUME_NAME detail | grep "^Online.*Y"` is used
  to check whether the volume is actually up.

* Using consul's DNS feature to automatically pick an available server
  for the mount.
  dnsmasq is used to forward DNS queries to the *.consul domain
  to the consul agent.
  `allow_stale = false` is used to ensure that the DNS queries
  are not outdated.

* Create `/etc/ssl/dhparam.pem` to avoid spurious Gluster warnings
  (see https://bugzilla.redhat.com/show_bug.cgi?id=1398237).

* `consul-scripting-helper.py` received some fixes and extra loops
  to retry when Consul is down.
nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 16, 2017
(same commit message as above)
nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 16, 2017
(same commit message as above, with this addition:)

This commit also switches to using `services.glusterfs.tlsSettings`
as implemented in NixOS/nixpkgs#27340
which revealed a lot of the above issues.
nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 22, 2017
(same commit message as above)
nh2 added a commit to nh2/nixops-gluster-example that referenced this issue Oct 22, 2017
(same commit message as above)