
consul reload - watch configs leaks file descriptors #4010

Closed · toomasp opened this issue Apr 4, 2018 · 7 comments

toomasp commented Apr 4, 2018

Duplicating #3018 as requested

Description of the Issue (and unexpected/desired result)

  • on consul reload, Consul opens a new set of connections for every watch without closing the old ones
  • with enough watch configs and enough reloads, Consul eventually hits the max file descriptor limit
  • the logs then fill with "too many open files" errors (a sketch of the suspected pattern follows this list)

Reproduction steps

Have some watches and do a consul reload

consul version for both Client and Server

Client: Consul v1.0.6
Server: Consul v1.0.6

consul info for both Client and Server

Client:

agent:
	check_monitors = 1
	check_ttls = 0
	checks = 1
	services = 1
build:
	prerelease = 
	revision = 9a494b5f
	version = 1.0.6
consul:
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 4
	goroutines = 60
	max_procs = 4
	os = linux
	version = go1.9.3
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 89
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1412
	members = 95
	query_queue = 0
	query_time = 16

Server:

agent:
	check_monitors = 2
	check_ttls = 0
	checks = 2
	services = 2
build:
	prerelease = 
	revision = 9a494b5f
	version = 1.0.6
consul:
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 10.65.177.81:8300
	server = true
raft:
	applied_index = 8159411
	commit_index = 8159411
	fsm_pending = 0
	last_contact = 0
	last_log_index = 8159411
	last_log_term = 448
	last_snapshot_index = 8154709
	last_snapshot_term = 445
	latest_configuration = [{Suffrage:Voter ID:99581ecb-1c8d-d9c0-4c82-c86d1c2be39c Address:10.65.177.82:8300} {Suffrage:Voter ID:fdda80c0-fb2f-eff0-0813-af4215e6c026 Address:10.65.177.81:8300} {Suffrage:Voter ID:3978bc89-f2d2-112a-be2d-3a1f85e39c05 Address:10.65.177.80:8300}]
	latest_configuration_index = 314183
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 448
runtime:
	arch = amd64
	cpu_count = 1
	goroutines = 494
	max_procs = 1
	os = linux
	version = go1.9.3
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 89
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 1412
	members = 95
	query_queue = 0
	query_time = 16
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 94
	members = 3
	query_queue = 0
	query_time = 1

Operating system and Environment details

# uname -a
Linux infra-conf-02.infraci.ptec 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
# cat /etc/redhat-release 
CentOS Linux release 7.4.1708 (Core) 

Log Fragments

# netstat -antlp|grep consul|grep 8500|wc -l
6
# su - consul.io -c 'consul reload'
Configuration reload triggered
# netstat -antlp|grep consul|grep 8500|wc -l
9
# su - consul.io -c 'consul reload'
Configuration reload triggered
# netstat -antlp|grep consul|grep 8500|wc -l
14
# su - consul.io -c 'consul reload'
Configuration reload triggered
# netstat -antlp|grep consul|grep 8500|wc -l
15
# su - consul.io -c 'consul reload'
Configuration reload triggered
# netstat -antlp|grep consul|grep 8500|wc -l
24
# su - consul.io -c 'consul reload'
Configuration reload triggered
# netstat -antlp|grep consul|grep 8500|wc -l
34

The count keeps growing on every reload, but if you wait a while (over a minute or so) it drops back down.

# netstat -antlp|grep consul|grep 8500|wc -l; su - consul.io -c 'consul reload'; netstat -antlp|grep consul|grep 8500|wc -l; echo sleeping; sleep 60; netstat -antlp|grep consul|grep 8500|wc -l
21
Configuration reload triggered
33
sleeping
23

mkeeler commented Apr 5, 2018

@toomasp Does the fd count ever drop back down to the pre-reload value given enough time?


toomasp commented Apr 5, 2018

@mkeeler It seems to depend on the watches configured. One of my coworkers is currently investigating which kinds of watches cause the FD count not to drop back down and what is special about them. We will get back to you once we know more.

Otherwise yes, with some watches, it returns to the pre-reload value given enough time.

@mihkelader

The coworker here :)

I have bisected the issue to commit 10e0be6.

The following minimal config reproduces the issue on CentOS 7 and openSUSE Tumbleweed running the official linux-amd64 Consul binary.

Configuration:
{ "bind_addr": "127.0.0.1", "data_dir": "/tmp", "server": true, "bootstrap_expect": 1, "watches": [ { "type": "services", "args": [ "/usr/bin/true" ] } ] }

  • Run consul agent -config-file consul.json
  • Run consul reload (several times)
  • Observe the growing number of established connections: netstat -tnp | grep ESTA.*consul | wc (a scripted version of this loop follows below)
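The reload loop above can also be scripted against the agent's HTTP API instead of shelling out to consul reload each time. A rough Go sketch, assuming the agent listens on the default 127.0.0.1:8500 and that netstat is available (the pipeline mirrors the one in the step above):

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"time"

	"github.com/hashicorp/consul/api"
)

func main() {
	// DefaultConfig points at the local agent on 127.0.0.1:8500.
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}
	for i := 1; i <= 5; i++ {
		// Same effect as running `consul reload` (PUT /v1/agent/reload).
		if err := client.Agent().Reload(); err != nil {
			log.Fatal(err)
		}
		time.Sleep(time.Second) // give the reloaded watches time to reconnect
		out, err := exec.Command("sh", "-c",
			"netstat -tnp 2>/dev/null | grep 'ESTA.*consul' | wc -l").Output()
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("established connections after reload %d: %s", i, out)
	}
}
```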


ab-fuze commented May 14, 2018

The issue is still present in 1.0.7 and 1.1.0.

mkeeler modified the milestones: Upcoming, 1.2.0 · May 16, 2018

mkeeler commented Jun 1, 2018

@mihkelader Does this still "leak" fds with master? I think it may not have been leaking but rather erroneously duplicating the watches.

See #4179 and the fix.

For me at least, with a build of master the problem seems to be resolved. Please let me know if it's fixed for you too and I will close this out.
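Whether the old plans were leaking or the watches were being duplicated, the corrected shape is the same: tear down the previous generation of plans before starting the new one. A simplified sketch of that shape, not the actual change from #4179 (activePlans is a hypothetical registry):

```go
package main

import "github.com/hashicorp/consul/watch"

// activePlans tracks the plans started by the previous (re)load.
var activePlans []*watch.Plan

func reloadWatches(watchConfigs []map[string]interface{}) error {
	// Stop the previous generation first: Stop() cancels each plan's
	// blocking query so its connection to the agent can close, and the
	// watch set is rebuilt from scratch instead of being duplicated.
	for _, p := range activePlans {
		p.Stop()
	}
	activePlans = nil

	for _, params := range watchConfigs {
		plan, err := watch.Parse(params)
		if err != nil {
			return err
		}
		plan.Handler = func(idx uint64, data interface{}) { /* handle update */ }
		go func() { _ = plan.Run("127.0.0.1:8500") }()
		activePlans = append(activePlans, plan)
	}
	return nil
}

func main() {
	cfgs := []map[string]interface{}{{"type": "services"}}
	_ = reloadWatches(cfgs) // initial load
	_ = reloadWatches(cfgs) // reload: the old plan is stopped before the new one starts
	select {}
}
```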

@mihkelader

Yes, the problem is gone now. We no longer run out of file descriptors when running Consul built from master.


mkeeler commented Jun 4, 2018

@mihkelader Glad to hear that it's fixed for you.

mkeeler closed this as completed · Jun 4, 2018