recurring log "serf: attempting reconnect" to left server #3361

danilobuerger · 2017-08-04T12:10:29Z

`consul version`

Server: 0.9.0

`consul info`

Server:

agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease = 
	revision = b79d951
	version = 0.9.0
consul:
	bootstrap = false
	known_datacenters = 1
	leader = true
	leader_addr = 172.31.1.83:8300
	server = true
raft:
	applied_index = 182734
	commit_index = 182734
	fsm_pending = 0
	last_contact = 0
	last_log_index = 182734
	last_log_term = 4
	last_snapshot_index = 180231
	last_snapshot_term = 4
	latest_configuration = [{Suffrage:Voter ID:61567a07-7122-5ebd-677b-e5f437e9558c Address:172.31.3.20:8300} {Suffrage:Voter ID:c95cdd96-f493-3c9e-35d7-c9b0370ccbf9 Address:172.31.1.83:8300} {Suffrage:Voter ID:bbcbad42-a097-0fd0-f813-7e68c46d7178 Address:172.31.7.163:8300}]
	latest_configuration_index = 174747
	num_peers = 2
	protocol_version = 3
	protocol_version_max = 3
	protocol_version_min = 0
	snapshot_version_max = 1
	snapshot_version_min = 0
	state = Leader
	term = 4
runtime:
	arch = amd64
	cpu_count = 1
	goroutines = 102
	max_procs = 1
	os = linux
	version = go1.8.3
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 4
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 4
	member_time = 34
	members = 10
	query_queue = 0
	query_time = 1
serf_wan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 1
	failed = 1
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 12
	members = 4
	query_queue = 0
	query_time = 1

Operating system and Environment details

Amazon AMI 2017.03.1

Description of the Issue (and unexpected/desired result)

consul members:

[ec2-user@ip-172-31-1-83 ~]$ consul members
Node              Address             Status  Type    Build  Protocol  DC
ip-172-31-1-151   172.31.1.151:8301   left    server  0.9.0  2         eu-west-1
ip-172-31-1-83    172.31.1.83:8301    alive   server  0.9.0  2         eu-west-1
ip-172-31-10-77   172.31.10.77:8301   alive   client  0.9.0  2         eu-west-1
ip-172-31-11-106  172.31.11.106:8301  left    client  0.9.0  2         eu-west-1
ip-172-31-11-9    172.31.11.9:8301    alive   client  0.9.0  2         eu-west-1
ip-172-31-2-237   172.31.2.237:8301   left    client  0.9.0  2         eu-west-1
ip-172-31-3-20    172.31.3.20:8301    alive   server  0.9.0  2         eu-west-1
ip-172-31-4-153   172.31.4.153:8301   alive   client  0.9.0  2         eu-west-1
ip-172-31-7-163   172.31.7.163:8301   alive   server  0.9.0  2         eu-west-1
ip-172-31-7-176   172.31.7.176:8301   left    client  0.9.0  2         eu-west-1

I am seeing recurring logs of:

2017/08/04 12:00:41 [INFO] serf: attempting reconnect to ip-172-31-1-151.eu-west-1 172.31.1.151:8302
2017/08/04 12:02:44 [INFO] serf: attempting reconnect to ip-172-31-1-151.eu-west-1 172.31.1.151:8302
2017/08/04 12:03:17 [INFO] serf: attempting reconnect to ip-172-31-1-151.eu-west-1 172.31.1.151:8302
2017/08/04 12:03:50 [INFO] serf: attempting reconnect to ip-172-31-1-151.eu-west-1 172.31.1.151:8302
2017/08/04 12:06:23 [INFO] serf: attempting reconnect to ip-172-31-1-151.eu-west-1 172.31.1.151:8302

However, the server has left.
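
Note that the `consul members` table above shows LAN addresses (port 8301), while the reconnect attempts go to port 8302, the default Serf WAN port. As a minimal sketch for cross-checking, the helper below filters a members table read from stdin for servers in the `left` state. The function name `left_servers` is made up for illustration, and it assumes the column layout shown above (status in column 3, type in column 4).

```shell
# Hypothetical helper: print the node names of servers whose status is "left".
# Reads `consul members`-style output on stdin and skips the header row.
left_servers() {
  awk 'NR > 1 && $3 == "left" && $4 == "server" { print $1 }'
}
```

It could be fed from either pool, for example `consul members | left_servers` or `consul members -wan | left_servers`, to see whether the departed server is still tracked on the WAN side.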

Reproduction steps

I have an AWS Auto Scaling group (desired count 3) with the following Consul config:

{
  "server": true,
  "datacenter": "${AWS::Region}",
  "data_dir": "/var/consul",
  "encrypt": "${EncryptionKey}",
  "bootstrap_expect": 3,
  "retry_join_ec2": {
    "region": "${AWS::Region}",
    "tag_key": "consul",
    "tag_value": "server"
  },
  "raft_protocol": 3,
  "disable_update_check": true
}

I then increase the desired count to 4 and then reduce it back to 3. Afterwards, the terminated Consul server has left the cluster, but the "serf: attempting reconnect" logs keep coming.

sdot257 · 2017-09-01T14:14:09Z

I'm having the same issue. We're also using an ASG, and the nodes are attempting to connect to all the servers that have "left" the cluster.

sdot257 · 2017-09-01T14:41:22Z

@danilobuerger As a side note, I'm curious about the bootstrap_expect option: is it needed if one uses the retry_join_ec2 option?

danilobuerger · 2017-09-01T17:04:55Z

I don't know if it's needed.

webengineer · 2017-09-11T08:44:45Z

consul force-leave 172.31.1.151 should fix it.

danilobuerger · 2017-09-11T08:46:14Z

@webengineer it does not.

slackpad · 2017-12-18T20:25:20Z

This should be fixed by #3611.

danilobuerger · 2017-12-20T11:48:54Z

@slackpad Nope. I just tried with Consul 1.0.2 and see the same problem. The logs keep coming, and force-leaving the nodes as suggested by the CHANGELOG doesn't work either.

TomGudman · 2018-01-29T04:38:00Z

Same issue with 1.0.3.

stanvarlamov · 2018-02-03T04:40:35Z

On 1.0.3, the reconnect attempts went away after a force-leave (though they were still there for many hours) and after an OS-level restart of the server's process (gone immediately).

psyhomb · 2018-03-22T17:06:01Z

force-leave doesn't work for me either (v1.0.3):

# consul members
...
ip-10-28-11-230.ec2.internal  10.28.11.230:8301  left    server  1.0.3  2         aws  <all>
...

# consul monitor
2018/03/22 16:54:29 [INFO] Force leaving node: ip-10-28-11-230.ec2.internal
2018/03/22 16:55:23 [INFO] serf: attempting reconnect to ip-10-28-11-230.ec2.internal.aws 10.28.11.230:8302
2018/03/22 16:56:23 [INFO] serf: attempting reconnect to ip-10-28-11-230.ec2.internal.aws 10.28.11.230:8302
2018/03/22 16:57:23 [INFO] serf: attempting reconnect to ip-10-28-11-230.ec2.internal.aws 10.28.11.230:8302

avoidik · 2018-04-13T15:36:39Z

# Force-leave every member that Consul currently reports as failed.
for member in $(consul members -status=failed | awk 'NR>1{print $1}'); do
  consul force-leave "$member"
done

1oglop1 · 2018-08-27T08:22:24Z

I have the same issue here on version 1.2.2.

flyinprogrammer · 2018-09-28T17:06:51Z

Still seeing this issue in 1.2.3

edit:

Running consul force-leave <node name>.<dc> seemed to clean up the issue; after doing this, the log shows:

consul: Handled member-leave event for server "<node name>.<dc>" in area "wan"

and now life is happy again.
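
The name after "attempting reconnect to" in the log lines earlier in this thread already has the `<node name>.<dc>` form used above. As a rough sketch, the helper below pulls those names out of log lines read from stdin so they can be reviewed before any force-leave. The function name `reconnect_targets` is invented for illustration, and it assumes the log format shown in this issue.

```shell
# Hypothetical helper: list the unique member names that serf keeps trying
# to reconnect to. Reads Consul log lines on stdin, one name per output line.
reconnect_targets() {
  awk '/serf: attempting reconnect to/ {
    # The member name is the field right after the words "reconnect to".
    for (i = 1; i < NF; i++)
      if ($i == "reconnect" && $(i + 1) == "to") { print $(i + 2); break }
  }' | sort -u
}
```

For example, with a saved log file (the path `consul.log` is a placeholder), `reconnect_targets < consul.log` prints each stale name once, and each can then be passed to `consul force-leave` by hand.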

aashitvyas · 2018-12-18T03:56:58Z

We are seeing the same issue on version 1.2.2.

vilva42 · 2019-06-05T17:19:59Z

Seeing this on 1.5.1 when I use -retry-join with ASG. Stopping and starting consul on the server fixes the attempting reconnect error.

ozlevka-work mentioned this issue Oct 19, 2017

Consul multi node and HA EricomSoftwareLtd/Shield#26

Merged

slackpad closed this as completed Dec 18, 2017

slackpad reopened this Dec 20, 2017

slackpad added this to the Next milestone Feb 2, 2018

slackpad added the type/bug Feature does not function as expected label Feb 2, 2018

avoidik mentioned this issue Apr 13, 2018

Cluster becomes unresponsive and does not elect new leader after disk latency spike on leader #3552

Closed

banks added the needs-investigation The issue described is detailed and complex. label Nov 28, 2018

hanshasselberg assigned schristoff May 14, 2019

schristoff mentioned this issue Jun 26, 2019

Remove failed nodes from serfWAN #6028

Merged

schristoff closed this as completed in #6028 Jun 28, 2019

qbiqing mentioned this issue Aug 18, 2020

chore(fluentd): redirect logs from hashi services dsaidgovsg/terraform-modules#257

Merged
