recurring log "serf: attempting reconnect" to left server #3361

Closed
danilobuerger opened this issue Aug 4, 2017 · 15 comments · Fixed by #6028
Labels
needs-investigation The issue described is detailed and complex. type/bug Feature does not function as expected

Comments

@danilobuerger

consul version

Server: 0.9.0

consul info

Server:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = 
	revision = b79d951
	version = 0.9.0
consul:
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 172.31.1.83:8300
	server = true
raft:
	applied_index = 182734
	commit_index = 182734
	fsm_pending = 0
	last_contact = 0
	last_log_index = 182734
	last_log_term = 4
	last_snapshot_index = 180231
	last_snapshot_term = 4
	latest_configuration = [{Suffrage:Voter ID:61567a07-7122-5ebd-677b-e5f437e9558c Address:172.31.3.20:8300} {Suffrage:Voter ID:c95cdd96-f493-3c9e-35d7-c9b0370ccbf9 Address:172.31.1.83:8300} {Suffrage:Voter ID:bbcbad42-a097-0fd0-f813-7e68c46d7178 Address:172.31.7.163:8300}]
	latest_configuration_index = 174747
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 4
runtime:
	arch = amd64
	cpu_count = 1
	goroutines = 102
	max_procs = 1
	os = linux
	version = go1.8.3
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 4
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 4
	member_time = 34
	members = 10
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 1
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 12
	members = 4
	query_queue = 0
	query_time = 1

Operating system and Environment details

Amazon AMI 2017.03.1

Description of the Issue (and unexpected/desired result)

consul members:

[ec2-user@ip-172-31-1-83 ~]$ consul members
Node              Address             Status  Type    Build  Protocol  DC
ip-172-31-1-151   172.31.1.151:8301   left    server  0.9.0  2         eu-west-1
ip-172-31-1-83    172.31.1.83:8301    alive   server  0.9.0  2         eu-west-1
ip-172-31-10-77   172.31.10.77:8301   alive   client  0.9.0  2         eu-west-1
ip-172-31-11-106  172.31.11.106:8301  left    client  0.9.0  2         eu-west-1
ip-172-31-11-9    172.31.11.9:8301    alive   client  0.9.0  2         eu-west-1
ip-172-31-2-237   172.31.2.237:8301   left    client  0.9.0  2         eu-west-1
ip-172-31-3-20    172.31.3.20:8301    alive   server  0.9.0  2         eu-west-1
ip-172-31-4-153   172.31.4.153:8301   alive   client  0.9.0  2         eu-west-1
ip-172-31-7-163   172.31.7.163:8301   alive   server  0.9.0  2         eu-west-1
ip-172-31-7-176   172.31.7.176:8301   left    client  0.9.0  2         eu-west-1

I am seeing recurring logs of:

2017/08/04 12:00:41 [INFO] serf: attempting reconnect to ip-172-31-1-151.eu-west-1 172.31.1.151:8302
2017/08/04 12:02:44 [INFO] serf: attempting reconnect to ip-172-31-1-151.eu-west-1 172.31.1.151:8302
2017/08/04 12:03:17 [INFO] serf: attempting reconnect to ip-172-31-1-151.eu-west-1 172.31.1.151:8302
2017/08/04 12:03:50 [INFO] serf: attempting reconnect to ip-172-31-1-151.eu-west-1 172.31.1.151:8302
2017/08/04 12:06:23 [INFO] serf: attempting reconnect to ip-172-31-1-151.eu-west-1 172.31.1.151:8302

However, the server has left.
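
To quantify how often this happens and against which target, the log can be tallied per node. This is a minimal sketch, assuming the log format shown above (node name and address as the last two fields); `count_reconnects` is a hypothetical helper name, not part of Consul.

```shell
# Hypothetical helper: count "serf: attempting reconnect" lines per target.
# Reads Consul log text on stdin and prints "<count> <node> <address>".
count_reconnects() {
  awk '/serf: attempting reconnect to/ {
         # Assumption: node name and address are the last two fields.
         key = $(NF-1) " " $NF
         seen[key]++
       }
       END { for (k in seen) print seen[k], k }' | sort -k1,1nr -k2
}

# Example usage (assumes the agent log is available as a file):
#   count_reconnects < /path/to/consul.log
```

Note that the target in the log is named `<node>.<datacenter>` and uses port 8302, the default Serf WAN port, while `consul members` above lists the LAN pool on port 8301.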

Reproduction steps

I have an AWS Auto Scaling group (desired count 3) with the following Consul config:

{
  "server": true,
  "datacenter": "${AWS::Region}",
  "data_dir": "/var/consul",
  "encrypt": "${EncryptionKey}",
  "bootstrap_expect": 3,
  "retry_join_ec2": {
    "region": "${AWS::Region}",
    "tag_key": "consul",
    "tag_value": "server"
  },
  "raft_protocol": 3,
  "disable_update_check": true
}

I then increase the desired count to 4 and then back to 3. After that, the terminated Consul server has left the cluster, but the "serf: attempting reconnect" logs keep coming.

@sdot257

sdot257 commented Sep 1, 2017

I'm having the same issue. We're also using an ASG, and the nodes keep attempting to connect to all the servers that have "left" the cluster.

@sdot257

sdot257 commented Sep 1, 2017

@danilobuerger hey, as a side note, I'm curious about the bootstrap_expect option: is it needed if one uses the retry_join_ec2 option?

@danilobuerger
Author

I don't know if it's needed.

@webengineer

consul force-leave 172.31.1.151 should fix it.

@danilobuerger
Author

@webengineer it does not.

@slackpad
Contributor

This should be fixed by #3611.

@danilobuerger
Author

@slackpad Nope. I just tried with consul 1.0.2, same problem. Logs keep on coming, force-leaving them as suggested by the CHANGELOG doesn't work either.

@slackpad slackpad reopened this Dec 20, 2017
@TomGudman

TomGudman commented Jan 29, 2018

Same issue with 1.0.3.

@slackpad slackpad added this to the Next milestone Feb 2, 2018
@slackpad slackpad added the type/bug Feature does not function as expected label Feb 2, 2018
@stanvarlamov

1.0.3: after force-leave, the reconnect attempts were still there for many hours; after an OS-level restart of the server's process, they were gone immediately.

@psyhomb

psyhomb commented Mar 22, 2018

force-leave doesn't work for me either (v1.0.3):

# consul members
...
ip-10-28-11-230.ec2.internal  10.28.11.230:8301  left    server  1.0.3  2         aws  <all>
...
# consul monitor
2018/03/22 16:54:29 [INFO] Force leaving node: ip-10-28-11-230.ec2.internal
2018/03/22 16:55:23 [INFO] serf: attempting reconnect to ip-10-28-11-230.ec2.internal.aws 10.28.11.230:8302
2018/03/22 16:56:23 [INFO] serf: attempting reconnect to ip-10-28-11-230.ec2.internal.aws 10.28.11.230:8302
2018/03/22 16:57:23 [INFO] serf: attempting reconnect to ip-10-28-11-230.ec2.internal.aws 10.28.11.230:8302

@avoidik

avoidik commented Apr 13, 2018

for member in $(consul members -status=failed | awk 'NR>1{print $1;}'); do
  consul force-leave "$member"
done
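
The loop above only picks up members in the `failed` state, whereas the node in this issue is reported as `left`. A minimal sketch for listing members by any status, assuming the `consul members` table layout shown earlier in this thread (status in the third column); `members_with_status` is a hypothetical helper name.

```shell
# Hypothetical helper: print node names whose Status column equals $1.
# Reads `consul members` table output on stdin and skips the header row.
members_with_status() {
  awk -v want="$1" 'NR > 1 && $3 == want { print $1 }'
}

# Example usage (needs a running agent):
#   consul members | members_with_status left
#   consul members -wan | members_with_status failed
```

Filtering the text after the fact makes it easy to check both the LAN and WAN pools with the same helper before force-leaving anything.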

@1oglop1

1oglop1 commented Aug 27, 2018

I have the same issue here: version 1.2.2

@flyinprogrammer

flyinprogrammer commented Sep 28, 2018

Still seeing this issue in 1.2.3


edit:

running consul force-leave <node name>.<dc>
seemed to clean up the issue, as after doing this it mentions:

consul: Handled member-leave event for server "<node name>.<dc>" in area "wan"

and now life is happy again.
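
The `<node name>.<dc>` form can be derived from the LAN member table. A rough sketch, assuming the column layout from the `consul members` output earlier in this thread (status, type, and DC in columns 3, 4, and 7); `wan_force_leave_cmds` is a hypothetical helper that only prints the commands so they can be reviewed before running.

```shell
# Hypothetical helper: for every server listed as "left", print the
# force-leave command using the "<node name>.<dc>" name.
# Reads `consul members` table output on stdin; executes nothing.
wan_force_leave_cmds() {
  awk 'NR > 1 && $3 == "left" && $4 == "server" {
         print "consul force-leave " $1 "." $7
       }'
}

# Example usage (needs a running agent):
#   consul members | wan_force_leave_cmds        # review the output
#   consul members | wan_force_leave_cmds | sh   # then run it
```

Whether this clears the reconnect attempts on a given version is not guaranteed; other reports in this thread still needed an agent restart.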

@banks banks added the needs-investigation The issue described is detailed and complex. label Nov 28, 2018
@aashitvyas

We are seeing the same issue on version 1.2.2.

@vilva42

vilva42 commented Jun 5, 2019

Seeing this on 1.5.1 when I use -retry-join with an ASG. Stopping and starting Consul on the server makes the "attempting reconnect" messages go away.
