
Unable to deregister orphan checks and service instances after accessorid token is lost and set to anonymous #9577

Closed
idrennanvmware opened this issue Jan 15, 2021 · 33 comments
Labels: theme/acls (ACL and token generation), type/bug (Feature does not function as expected)

Comments

@idrennanvmware

idrennanvmware commented Jan 15, 2021

Consul info for both Client and Server

Consul 1.9.1

After an unexpected cluster scenario (several nodes had their VMs crash), we have a number of orphaned services and checks in the Consul service catalog of one cluster (the cluster is otherwise healthy). However, we are completely unable to deregister them.

Here's what we see in the logs for Consul

 check=service:_nomad-task-9523fe99-72eb-5742-a85a-c17f759c04c8-group-device-log-consumer-group-device-log-consumer-device_log_consumer_http-s>
 check=service:_nomad-task-a8ca2701-18f0-074b-cfd1-fb7a315f8ecc-group-device-log-api-group-device-log-api-api_http-sidecar-proxy:1 accessorID=>
 check=_nomad-check-583c1b1a627be50caeb61c1bb2de8ba2ecad9d1e accessorID=00000000-0000-0000-0000-000000000002
 check=_nomad-check-91cb4d33d9a27e0cea69eaea8c00cf17702ccde4 accessorID=00000000-0000-0000-0000-000000000002
 check=fabio-uem-services-qe-genesis-csl2-29998-ttl accessorID=00000000-0000-0000-0000-000000000002
 check=_nomad-check-d6a48d1f861f0e27a18b9f99608110f2fd1ff4c0 accessorID=00000000-0000-0000-0000-000000000002
 check=service:_nomad-task-a8ca2701-18f0-074b-cfd1-fb7a315f8ecc-group-device-log-api-group-device-log-api-api_http-sidecar-proxy:2 accessorID=>
 check=_nomad-check-d7f9b9aa185bbc9dfdee2bafa655816d37f34ef0 accessorID=00000000-0000-0000-0000-000000000002
 check=_nomad-check-3d4c0b7928a4fd3306f6e88a0af2bf931ca11adc accessorID=00000000-0000-0000-0000-000000000002
 check=_nomad-check-485e79b78ae284830003473c8300f23c3ef9d22f accessorID=00000000-0000-0000-0000-000000000002

Note the accessorID: these services were not registered with that token, yet we are unable to clean them up using either the CLI or the HTTP API.

./consul services deregister -id _nomad-task-a8ca2701-18f0-074b-cfd1-fb7a315f8ecc-group-device-log-api-group-device-log-api-api_http
Error registering service "": Unexpected response code: 500 (Unknown service {"_nomad-task-a8ca2701-18f0-074b-cfd1-fb7a315f8ecc-group-device-log-api-group-device-log-api-api_http" {}})

We tried some other variants of the above as well (shooting in the dark):

./consul services deregister -id _nomad-task-a8ca2701-18f0-074b-cfd1-fb7a315f8ecc-group-device-log-api-group-device-log-api-api_http-sidecar-proxy:2
Error registering service "": Unexpected response code: 500 (Unknown service {"_nomad-task-a8ca2701-18f0-074b-cfd1-fb7a315f8ecc-group-device-log-api-group-device-log-api-api_http-sidecar-proxy:2" {}})

root [ /usr/local/bin ]# ./consul services deregister -id _nomad-task-a8ca2701-18f0-074b-cfd1-fb7a315f8ecc-group-device-log-api-group-device-log-api-api_http-sidecar-proxy
Error registering service "": Unexpected response code: 500 (Unknown service {"_nomad-task-a8ca2701-18f0-074b-cfd1-fb7a315f8ecc-group-device-log-api-group-device-log-api-api_http-sidecar-proxy" {}})

We have tried deregistering at both the agent and the catalog level with no success (the endpoints involved are sketched below). We even shut down every single service registered in Nomad, but the registrations still remain.
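For reference, the two deregistration paths look roughly like this (illustrative only: the node name and service ID are placeholders, and these are the standard Consul HTTP endpoints rather than our exact invocations):

    # agent-level deregister (acts on the local agent's view)
    curl -X PUT -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
      http://127.0.0.1:8500/v1/agent/service/deregister/<service-id>

    # catalog-level deregister (acts directly on the servers' catalog)
    curl -X PUT -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
      --data '{"Node": "<node-name>", "ServiceID": "<service-id>"}' \
      http://127.0.0.1:8500/v1/catalog/deregister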

We're at a bit of a loss on how to proceed, but we have kept the cluster in this state (fortunately it's not prod) so we can try any suggestions on how to recover and clean up the registry.

Thanks!

@idrennanvmware
Author

Additional attempt: we tried a graceful leave of the agent. Still no success in removing the checks, either after the leave or after joining it back in.

@fredwangwang
Contributor

Here are more logs we saw (from a different time):

2021-01-15T14:58:56.206Z [WARN]  agent: Service deregistration blocked by ACLs: service=_nomad-task-494dd868-0b0b-15b2-dbd6-1b19d91995e0-filebeat-task-control-plane-logging- accessorID=00000000-0000-0000-0000-000000000002
2021-01-15T14:58:56.206Z [WARN]  agent: Service deregistration blocked by ACLs: service=_nomad-task-127c43e1-e5c5-36f6-8669-0141541c0ef7-kafka-telegraf-task-telegraf-kafka-sidecar- accessorID=00000000-0000-0000-0000-000000000002
2021-01-15T14:58:56.208Z [WARN]  agent: Service deregistration blocked by ACLs: service=_nomad-task-50fa8306-19a4-bf93-a607-2464605697dd-ingestion-service-api-task-ingestion-service-api-ingestion_service_http accessorID=00000000-0000-0000-0000-000000000002
2021-01-15T14:58:56.208Z [WARN]  agent: Service deregistration blocked by ACLs: service=_nomad-task-9ccbb3f2-d7d1-6dff-67e9-8fc9e7af7ee7-postgres-telegraf-task-telegraf-postgres-sidecar- accessorID=00000000-0000-0000-0000-000000000002
2021-01-15T14:58:56.208Z [WARN]  agent: Service deregistration blocked by ACLs: service=_nomad-task-fecff501-2ab3-ecae-3f58-c49016e8b7ed-redis-uem-rate-limiting-db accessorID=00000000-0000-0000-0000-000000000002
2021-01-15T14:58:56.208Z [WARN]  agent: Check deregistration blocked by ACLs: check=_nomad-check-826514bade19df62a49b31259d946f0e3483e8eb accessorID=00000000-0000-0000-0000-000000000002
2021-01-15T14:58:56.208Z [WARN]  agent: Check deregistration blocked by ACLs: check=_nomad-check-2e3314a1008447e4cdbba614a26c28d1212aea21 accessorID=00000000-0000-0000-0000-000000000002

basically the same behavior as described in this thread:
https://discuss.hashicorp.com/t/wrong-consul-token-sometimes-used-for-deregistering-checks-services/19076

@fredwangwang
Contributor

fredwangwang commented Jan 15, 2021

Here is a memory dump of the State.services:

map[github.com/hashicorp/consul/agent/structs.ServiceID]*github.com/hashicorp/consul/agent/local.ServiceState [
	{ID: "_nomad-server-sabnopufu6xn2vcfgkjrn7gpzyszwfnf", EnterpriseMeta: github.com/hashicorp/consul/agent/structs.EnterpriseMeta {}}: *{
		Service: *(*"github.com/hashicorp/consul/agent/structs.NodeService")(0xc001a3c3c0),
		Token: "8ce25f9a-073a-f68f-75f9-31a194d06f2e",
		InSync: true,
		Deleted: false,
		WatchCh: chan struct {} {
			qcount: 1,
			dataqsiz: 1,
			buf: *[1]struct struct {} [
				{},
			],
			elemsize: 0,
			closed: 0,
			elemtype: *runtime._type {size: 0, ptrdata: 0, hash: 670477339, tflag: tflagExtraStar|tflagRegularMemory (10), align: 1, fieldAlign: 1, kind: 25, equal: runtime.memequal0, gcdata: *1, str: 120957, ptrToThis: 2418592},
			sendx: 0,
			recvx: 0,
			recvq: waitq<struct {}> {
				first: *sudog<struct {}> nil,
				last: *sudog<struct {}> nil,},
			sendq: waitq<struct {}> {
				first: *sudog<struct {}> nil,
				last: *sudog<struct {}> nil,},
			lock: runtime.mutex {
				lockRankStruct: runtime.lockRankStruct {},
				key: 0,},},},
	{ID: "_nomad-client-o3snibdanfasqkr4pzngjbylhgnhiy7z", EnterpriseMeta: github.com/hashicorp/consul/agent/structs.EnterpriseMeta {}}: *{
		Service: *(*"github.com/hashicorp/consul/agent/structs.NodeService")(0xc001a3c500),
		Token: "8ce25f9a-073a-f68f-75f9-31a194d06f2e",
		InSync: true,
		Deleted: false,
		WatchCh: chan struct {} {
			qcount: 1,
			dataqsiz: 1,
			buf: *[1]struct struct {} [
				{},
			],
			elemsize: 0,
			closed: 0,
			elemtype: *runtime._type {size: 0, ptrdata: 0, hash: 670477339, tflag: tflagExtraStar|tflagRegularMemory (10), align: 1, fieldAlign: 1, kind: 25, equal: runtime.memequal0, gcdata: *1, str: 120957, ptrToThis: 2418592},
			sendx: 0,
			recvx: 0,
			recvq: waitq<struct {}> {
				first: *sudog<struct {}> nil,
				last: *sudog<struct {}> nil,},
			sendq: waitq<struct {}> {
				first: *sudog<struct {}> nil,
				last: *sudog<struct {}> nil,},
			lock: runtime.mutex {
				lockRankStruct: runtime.lockRankStruct {},
				key: 0,},},},
	{ID: "consul-template", EnterpriseMeta: github.com/hashicorp/consul/agent/structs.EnterpriseMeta {}}: *{
		Service: *(*"github.com/hashicorp/consul/agent/structs.NodeService")(0xc00104e780),
		Token: "8ce25f9a-073a-f68f-75f9-31a194d06f2e",
		InSync: true,
		Deleted: false,
		WatchCh: chan struct {} {
			qcount: 1,
			dataqsiz: 1,
			buf: *[1]struct struct {} [
				{},
			],
			elemsize: 0,
			closed: 0,
			elemtype: *runtime._type {size: 0, ptrdata: 0, hash: 670477339, tflag: tflagExtraStar|tflagRegularMemory (10), align: 1, fieldAlign: 1, kind: 25, equal: runtime.memequal0, gcdata: *1, str: 120957, ptrToThis: 2418592},
			sendx: 0,
			recvx: 0,
			recvq: waitq<struct {}> {
				first: *sudog<struct {}> nil,
				last: *sudog<struct {}> nil,},
			sendq: waitq<struct {}> {
				first: *sudog<struct {}> nil,
				last: *sudog<struct {}> nil,},
			lock: runtime.mutex {
				lockRankStruct: runtime.lockRankStruct {},
				key: 0,},},},
	{ID: "_nomad-server-2qktkkm6fr2qxgxzegjmwl6b3xxdl3f3", EnterpriseMeta: github.com/hashicorp/consul/agent/structs.EnterpriseMeta {}}: *{
		Service: *(*"github.com/hashicorp/consul/agent/structs.NodeService")(0xc001a3c780),
		Token: "8ce25f9a-073a-f68f-75f9-31a194d06f2e",
		InSync: true,
		Deleted: false,
		WatchCh: chan struct {} {
			qcount: 1,
			dataqsiz: 1,
			buf: *[1]struct struct {} [
				{},
			],
			elemsize: 0,
			closed: 0,
			elemtype: *runtime._type {size: 0, ptrdata: 0, hash: 670477339, tflag: tflagExtraStar|tflagRegularMemory (10), align: 1, fieldAlign: 1, kind: 25, equal: runtime.memequal0, gcdata: *1, str: 120957, ptrToThis: 2418592},
			sendx: 0,
			recvx: 0,
			recvq: waitq<struct {}> {
				first: *sudog<struct {}> nil,
				last: *sudog<struct {}> nil,},
			sendq: waitq<struct {}> {
				first: *sudog<struct {}> nil,
				last: *sudog<struct {}> nil,},
			lock: runtime.mutex {
				lockRankStruct: runtime.lockRankStruct {},
				key: 0,},},},
	{ID: "_nomad-server-omrhoe5kekio37t5s3pghimnbfwpy3t7", EnterpriseMeta: github.com/hashicorp/consul/agent/structs.EnterpriseMeta {}}: *{
		Service: *(*"github.com/hashicorp/consul/agent/structs.NodeService")(0xc001a3c8c0),
		Token: "8ce25f9a-073a-f68f-75f9-31a194d06f2e",
		InSync: true,
		Deleted: false,
		WatchCh: chan struct {} {
			qcount: 1,
			dataqsiz: 1,
			buf: *[1]struct struct {} [
				{},
			],
			elemsize: 0,
			closed: 0,
			elemtype: *runtime._type {size: 0, ptrdata: 0, hash: 670477339, tflag: tflagExtraStar|tflagRegularMemory (10), align: 1, fieldAlign: 1, kind: 25, equal: runtime.memequal0, gcdata: *1, str: 120957, ptrToThis: 2418592},
			sendx: 0,
			recvx: 0,
			recvq: waitq<struct {}> {
				first: *sudog<struct {}> nil,
				last: *sudog<struct {}> nil,},
			sendq: waitq<struct {}> {
				first: *sudog<struct {}> nil,
				last: *sudog<struct {}> nil,},
			lock: runtime.mutex {
				lockRankStruct: runtime.lockRankStruct {},
				key: 0,},},},
	{ID: "_nomad-task-2b12fe99-52b4-17d7-7109-916990937ad6-device-state-ka...+57 more", EnterpriseMeta: github.com/hashicorp/consul/agent/structs.EnterpriseMeta {}}: *{
		Service: *github.com/hashicorp/consul/agent/structs.NodeService nil,
		Token: "",
		InSync: true,
		Deleted: true,
		WatchCh: chan struct {} {},},
	{ID: "_nomad-task-3aa949d3-71d6-28ac-8c3b-131a363f9947-linux-sampling-...+48 more", EnterpriseMeta: github.com/hashicorp/consul/agent/structs.EnterpriseMeta {}}: *{
		Service: *github.com/hashicorp/consul/agent/structs.NodeService nil,
		Token: "",
		InSync: true,
		Deleted: true,
		WatchCh: chan struct {} {},},
	{ID: "vault:10.0.1.2:8200", EnterpriseMeta: github.com/hashicorp/consul/agent/structs.EnterpriseMeta {}}: *{
		Service: *(*"github.com/hashicorp/consul/agent/structs.NodeService")(0xc001a3ca00),
		Token: "8ce25f9a-073a-f68f-75f9-31a194d06f2e",
		InSync: true,
		Deleted: false,
		WatchCh: chan struct {} {
			qcount: 1,
			dataqsiz: 1,
			buf: *[1]struct struct {} [
				{},
			],
			elemsize: 0,
			closed: 0,
			elemtype: *runtime._type {size: 0, ptrdata: 0, hash: 670477339, tflag: tflagExtraStar|tflagRegularMemory (10), align: 1, fieldAlign: 1, kind: 25, equal: runtime.memequal0, gcdata: *1, str: 120957, ptrToThis: 2418592},
			sendx: 0,
			recvx: 0,
			recvq: waitq<struct {}> {
				first: *sudog<struct {}> nil,
				last: *sudog<struct {}> nil,},
			sendq: waitq<struct {}> {
				first: *sudog<struct {}> nil,
				last: *sudog<struct {}> nil,},
			lock: runtime.mutex {
				lockRankStruct: runtime.lockRankStruct {},
				key: 0,},},},
	{ID: "_nomad-task-f2c25dd3-333e-d9a7-c22c-d97fb6eae9f9-device-state-ev...+53 more", EnterpriseMeta: github.com/hashicorp/consul/agent/structs.EnterpriseMeta {}}: *{
		Service: *github.com/hashicorp/consul/agent/structs.NodeService nil,
		Token: "",
		InSync: true,
		Deleted: true,
		WatchCh: chan struct {} {},},
	{ID: "_nomad-task-4979a6ba-a714-83c6-a182-39f6bdedf083-group-leader_co...+35 more", EnterpriseMeta: github.com/hashicorp/consul/agent/structs.EnterpriseMeta {}}: *{
		Service: *github.com/hashicorp/consul/agent/structs.NodeService nil,
		Token: "",
		InSync: true,
		Deleted: true,
		WatchCh: chan struct {} {},},
	{ID: "_nomad-task-2b12fe99-52b4-17d7-7109-916990937ad6-device-state-ka...+70 more", EnterpriseMeta: github.com/hashicorp/consul/agent/structs.EnterpriseMeta {}}: *{
		Service: *github.com/hashicorp/consul/agent/structs.NodeService nil,
		Token: "",
		InSync: true,
		Deleted: true,
		WatchCh: chan struct {} {},},
	{ID: "_nomad-task-d1ccbc84-9e5b-81e2-89be-9617715fd682-linux-sampling-...+68 more", EnterpriseMeta: github.com/hashicorp/consul/agent/structs.EnterpriseMeta {}}: *{
		Service: *github.com/hashicorp/consul/agent/structs.NodeService nil,
		Token: "",
		InSync: true,
		Deleted: true,
		WatchCh: chan struct {} {},},
	{ID: "_nomad-task-6985c71e-3307-bb97-6b4f-30ea47d1e7e8-group-replica_c...+51 more", EnterpriseMeta: github.com/hashicorp/consul/agent/structs.EnterpriseMeta {}}: *{
		Service: *github.com/hashicorp/consul/agent/structs.NodeService nil,
		Token: "",
		InSync: true,
		Deleted: true,
		WatchCh: chan struct {} {},},
]

All of the "ghost" service registration entries (i.e. the _nomad-task-.* entries, which have all been stopped in Nomad) have an empty Token.

@dnephin
Contributor

dnephin commented Jan 15, 2021

This sounds like it may be similar to #8078. The logging output is a bit different, but that may be related to the version of Consul (the first comment mentions this "changed in 1.8.0 to output the accessor id of the anonymous token"). There's a bit of discussion in that issue that may be helpful.

@idrennanvmware
Author

idrennanvmware commented Jan 15, 2021

Thanks @dnephin, it does seem very similar. The thing is that we can't deploy to this cluster anymore because the jobs always have something unhealthy (the old check). I didn't see anything in that thread about how to resolve that; did I miss something? We tried our deregister actions with a management token in Consul.

@idrennanvmware
Author

idrennanvmware commented Jan 16, 2021

So we were able to resolve the issue (we still don't know why it happened) by temporarily granting elevated access to the anonymous token (obviously that's highly undesirable). We would definitely like to get to the bottom of this and have a better understanding of how it happens and, ideally, how to prevent it.

Spoke too soon. They're back

@idrennanvmware
Author

idrennanvmware commented Jan 16, 2021

So I tried the HTTP API via curl and ran

curl http://127.0.0.1:8500/v1/catalog/service/enrollment-api

and got back (this service is most definitely not running)

[
   {
      "ID":"f2a6aa0c-b694-5e13-b809-53a13d91a1ac",
      "Node":"x",
      "Address":"x",
      "Datacenter":"x",
      "TaggedAddresses":{
         "lan":"x",
         "lan_ipv4":"x",
         "wan":"x",
         "wan_ipv4":"x"
      },
      "NodeMeta":{
         X
      },
      "ServiceKind":"",
      "ServiceID":"_nomad-task-67da3bbd-b53c-9571-f22e-0fa107a71397-enrollment-api-task-enrollment-api-api_http",
      "ServiceName":"enrollment-api",
      "ServiceTags":[
         "http",
         "urlprefix-/enrollment/v1"
      ],
      "ServiceAddress":"X",
      "ServiceTaggedAddresses":{
         "lan_ipv4":{
            "Address":"X",
            "Port":24681
         },
         "wan_ipv4":{
            "Address":"X",
            "Port":24681
         }
      },
      "ServiceWeights":{
         "Passing":1,
         "Warning":1
      },
      "ServiceMeta":{
         "external-source":"nomad"
      },
      "ServicePort":24681,
      "ServiceEnableTagOverride":false,
      "ServiceProxy":{
         "MeshGateway":{
            
         },
         "Expose":{
            
         }
      },
      "ServiceConnect":{
         
      },
      "CreateIndex":6195951,
      "ModifyIndex":6195951
   }
]

tried

./consul services deregister -id="_nomad-task-67da3bbd-b53c-9571-f22e-0fa107a71397-enrollment-api-task-enrollment-api-api_http"

got back

Error registering service "": Unexpected response code: 500 (Unknown service {"_nomad-task-67da3bbd-b53c-9571-f22e-0fa107a71397-enrollment-api-task-enrollment-api-api_http" {}})

So I'm kind of stuck in an endless loop at this stage. Even with NO jobs running (just a regular Consul, Vault, Nomad setup) I see these 'ghost/orphan' checks and services, AND they look healthy (but nothing is running!).

Adding a screenshot of the orphaned services. The first two (mesh enabled) we can't even click on, but they do show in the catalog results (nowhere else). The third and fourth can be clicked on and show an instance (but no instance is running). In addition, I checked each node's "services" and "checks" folders; none of the orphaned services have an entry there.

[Screenshot: orphaned services shown in the Consul UI, taken 2021-01-15]

I also noticed that if I cycle the Consul nodes (restart the consul service so that the leader changes), the orphaned registrations shown in the UI change, but there are always some. It's as though there's a local cache somewhere feeding this.

Relevant logs suggest the issue may be related to "delta will cause a negative count":

2021-01-16T04:44:35.844Z [INFO]  agent.server: Handled event for server in area: event=member-update server=uem-services-qe-genesis-csl1.uem-services-qe-genesis area=wan
2021-01-16T04:44:37.001Z [INFO]  agent: Deregistered service: service=_nomad-task-3cce6ac8-6a47-b9ff-1f39-06042abccad2-group-device-log-scheduler-group-device-log-scheduler-scheduler_http-sidecar-proxy
2021-01-16T04:44:37.005Z [WARN]  agent.fsm: DeleteNodeService failed: error="failed to insert usage entry for "service-names": delta will cause a negative count"
2021-01-16T04:44:37.007Z [INFO]  agent: Deregistered service: service=_nomad-task-310c6b7f-dceb-cc9b-be5d-e736309d2758-group-device-state-consumer-group-device-log-device-state-consumer-device_state_consumer_http-sidecar-proxy
2021-01-16T04:44:37.012Z [WARN]  agent.fsm: DeleteNodeService failed: error="failed to insert usage entry for "service-names": delta will cause a negative count"
2021-01-16T04:44:37.013Z [INFO]  agent: Deregistered service: service=_nomad-task-fba52610-d5a1-6b6f-6d66-a295ff4a5c67-group-ingress-group-ingress-inbound
2021-01-16T04:44:37.018Z [WARN]  agent.fsm: DeleteNodeService failed: error="failed to insert usage entry for "service-names": delta will cause a negative count"
2021-01-16T04:44:37.019Z [INFO]  agent: Deregistered service: service=_nomad-task-67da3bbd-b53c-9571-f22e-0fa107a71397-enrollment-api-task-enrollment-api-api_http
2021-01-16T04:44:37.025Z [WARN]  agent.fsm: DeleteNodeService failed: error="failed to insert usage entry for "service-names": delta will cause a negative count"

@idrennanvmware
Author

idrennanvmware commented Jan 16, 2021

Just checked for the log message and I see #9433 has a similar one (different scenario). Is this related?

Addendum: ran some more tests. Interestingly, there are ALWAYS four services that produce the log error above. The names of the services can switch around depending on deployment order and stops, but there are always four.

That's about as much info as we can give at this point. Thanks for taking the time to read it all 👍

@fredwangwang
Contributor

Just a wild guess at what's going on:

  1. A Nomad job stops and Nomad sends a deregistration request to the Consul client as a post-stop hook.
  2. The Consul client receives the request correctly.
  3. The Consul client removes the service entry from its local agent state and issues the DeleteNodeService RPC request to the Consul server to remove it from the catalog.
  4. The Consul server responds HTTP 200 / true (or the RPC equivalent), so the Consul client thinks everything is fine at this point.
  5. However, the Consul server is actually unable to process the request: agent.fsm: DeleteNodeService failed: error="failed to insert usage entry for "service-names": delta will cause a negative count". So the change gets dropped by the Consul server and the entry still exists in the catalog.
  6. The next time anti-entropy runs, the Consul client tries to remove the service registration again, but because it no longer has local state for the service it has lost the token associated with it, so it falls back to the anonymous token. That is when we get agent: Service deregistration blocked by ACLs: service=_nomad-task-... accessorID=00000000-0000-0000-0000-000000000002.

@dnephin Is this probable?

Btw, we never saw this happen when using Consul 1.8.5; it only became an issue when running Consul 1.9.1. We also switched to Raft protocol 3 when doing the upgrade.

@hynek

hynek commented Jan 17, 2021

Just quickly adding some context (I'm the author of the forum thread) since it just happened again to us (Nomad v1.0.2, Consul v1.9.1):

  • I get the exact same error back when I try to deregister using consul services deregister.
  • I can resolve it by draining and rebooting the cluster node. Given Nomad's bin packing and where it's most likely to happen (the busiest servers), this is super disruptive. To emphasize: a nomad stop / deploy of the affected services is not enough (the healthy services go away, the zombie stays). Just draining is not enough either.
  • We have a bunch of [WARN] agent.fsm: DeleteNodeService failed: error="failed to insert usage entry for "service-names": delta will cause a negative count" entries on the cluster leader.

Btw, we never saw this happen when using Consul 1.8.5; it only became an issue when running Consul 1.9.1. We also switched to Raft protocol 3 when doing the upgrade.

This is good to know, I'll check with ops if we can downgrade. Fixing the cluster on Sunday 5:30am is getting old. :(

@hynek

hynek commented Jan 17, 2021

I've been pointed to #9440. Could it be that this bug is already fixed? I don't have time to dig into the diff, but the correlation with the warning that was fixed there is quite strong.

@idrennanvmware
Author

I get the exact same error back when I try to deregister using consul services deregister.

Interestingly enough, this is not the case for us (check out the commands in the first post; we actually get the 500 back). That said, this is AFTER having tried to fix the deregistration by temporarily elevating our anonymous token, so it may be an unreliable comparison.

This is good to know, I'll check with ops if we can downgrade. Fixing the cluster on Sunday 5:30am is getting old. :(

If you take this path, do you mind updating this thread on whether the downgrade is successful and the issue disappears? Like you, we are spending a good amount of time playing whack-a-mole, and we aren't relying on any 1.9 features at this time, so downgrading is an option if we need it (although it would take a while to propagate through our systems).

Also, curiously, what is the frequency with which you see this happen?

@hynek

hynek commented Jan 18, 2021

I get the exact same error back when I try to deregister using consul services deregister.
Interestingly enough, this is not the case for us (check out the commands in the first post; we actually get the 500 back). That said, this is AFTER having tried to fix the deregistration by temporarily elevating our anonymous token, so it may be an unreliable comparison.

I'm confused right now but maybe I just worded it wrong. What I meant is:

"I tried your consul services deregister command line and I got something resembling Error registering service "": Unexpected response code: 500 (Unknown service {"_nomad-task-67da3bbd-b53c-9571-f22e-0fa107a71397-enrollment-api-task-enrollment-api-api_http" {}}) back."

Makes sense?

This is good to know, I'll check with ops if we can downgrade. Fixing the cluster on Sunday 5:30am is getting old. :(
If you take this path, do you mind updating this thread on whether the downgrade is successful and the issue disappears? Like you, we are spending a good amount of time playing whack-a-mole, and we aren't relying on any 1.9 features at this time, so downgrading is an option if we need it (although it would take a while to propagate through our systems).

I'm hoping for 1.9.2 to fix this, as #9440 makes me somewhat hopeful.

It would be great if we could get an estimate of when it might drop, @dnephin @blake?

Also, curiously, what is the frequency with which you see this happen?

So, this is a bit fuzzy to answer, but I've only noticed it with two services that consist of 82 and 39 allocations each, with the added property that 80 and 38 of those allocations, respectively, are groups with count=1. Which seems to be exactly what might trigger #9440, if I'm squinting at the diff correctly.

As a caveat I must add that these services are the only ones where it's reliably detectable, because they use a home-grown DNS-SD-based mesh solution (I need to stop procrastinating and look at Nomad's solution :() that relies on each service existing only once and that crashes when there's a zombie service hanging around. So I basically learn about this happening via Sentry. I can't rule out, though, that it happens more often and some load balancer is hiding it from me because the health checks are failing and the instance is just kept out of rotation. Which is also my suspicion as to why not many more people are running into this problem.

The frequency definitely correlates with how often I deploy these two services.

@ianmdrennan

ianmdrennan commented Jan 18, 2021

I need to stop procrastinating to look at Nomad's solution :()

We love the mesh integration FWIW (it will be awesome when they support mesh gateways and terminating gateways natively, but ingress and regular mesh is pretty sweet). HOWEVER, the issues we are seeing in our environment also involve native mesh services, so I don't think it'll shield you from this particular issue.

Another question - do you ever restart your nomad agents, and if so - do any of the above line up with anything like that?

@hynek

hynek commented Jan 19, 2021

I need to stop procrastinating to look at Nomad's solution :()

We love the mesh integration FWIW (it will be awesome when they support mesh gateways and terminating gateways natively, but ingress and regular mesh is pretty sweet). HOWEVER, the issues we are seeing in our environment also involve native mesh services, so I don't think it'll shield you from this particular issue.

Well, I didn't assume it would; however, I'd kind of expect Envoy to be smart enough to handle the situation of two backends with one of them hard down…

Another question - do you ever restart your nomad agents, and if so - do any of the above line up with anything like that?

Rarely, although I did it once to no avail when trying to fix this situation (see my forum post).

When I upgrade Nomad or Docker, I drain the node, run a full upgrade and reboot it (we're on metal). I've run into this bug while draining in this situation before.

@pySilver

Got hit by the same issue again. Consul became useless once the node IP changed; the services simply will not go away, and neither will the node. I wish there were a tool to force service/node/check removal. It is so painful to deal with stale/zombie items created by Nomad. I have tried literally everything to delete them and nothing has ever worked.

@hynek

hynek commented Jan 21, 2021

I've updated to Consul 1.9.2 and deployed the service that previously triggered this bug a few times, and it hasn't manifested. So here's to hoping… 🤞

@pySilver

Hope so! Still, we need a tool that would allow easier data modification/removal.

Btw, have you tried something like this to clear the zombies? Power down the Consul client nodes (and/or execute consul leave) so only the server nodes are alive, disable ACLs, and attempt to remove the stale data.
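Roughly something like this on the servers, I think (a sketch only: it needs a config change plus a restart, and ACLs should be re-enabled afterwards):

    # temporary server config change to stop ACL enforcement
    acl {
      enabled = false
    }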

@idrennanvmware
Author

We are watching this thread carefully. We will be upgrading to 1.9.2 late next week if no issues arise that look like blockers.

@lisongmin

Maybe the same problem here, with Consul 1.9.2.
Deregistering via consul services deregister failed, but deregistering via the API succeeded.

root@bypass-route:~# consul version
Consul v1.9.2
Revision 6530cf370
Protocol 2 spoken by default, understands 2 to 3 (agent will automatically use protocol >2 when speaking to compatible agents)

The steps are below:

  1. I want to deregister vault-sidecar-proxy on the local node, but it failed with an "unknown service" error

    root@bypass-route:~# consul catalog services --node bypass-route
    consul
    nomad
    nomad-client
    vault
    vault-sidecar-proxy
    root@bypass-route:~# consul services deregister -id vault-sidecar-proxy
    Error registering service "": Unexpected response code: 500 (Unknown service "vault-sidecar-proxy")
    
  2. Queried the service ID via the API; the service ID is not wrong.

    root@bypass-route:~# curl -L -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" http://localhost:8500/v1/catalog/service/vault-sidecar-proxy|jq
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   867  100   867    0     0   211k      0 --:--:-- --:--:-- --:--:--  211k
    [
      {
        ...
        "ServiceKind": "connect-proxy",
        "ServiceID": "vault-sidecar-proxy",
        "ServiceName": "vault-sidecar-proxy",
        ...
      }
    ]
    
  3. try again, no luck

    root@bypass-route:~# consul services deregister -id vault-sidecar-proxy
    Error registering service "": Unexpected response code: 500 (Unknown service "vault-sidecar-proxy")
    
  4. Deregistered via the API, which succeeded

    root@bypass-route:~# curl -L -X PUT -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" --data '{"Node": "bypass-route", "ServiceID": "vault-sidecar-proxy"}'  http://localhost:8500/v1/catalog/deregister
    true
    
  5. Queried via consul catalog services to confirm that vault-sidecar-proxy has been deleted:

root@bypass-route:~# consul catalog services
consul
nomad
nomad-client
vault

@lisongmin

Following up on my previous comment: it seems I should deregister via the agent endpoint instead

curl -X PUT -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" http://127.0.0.1:8500/v1/agent/service/deregister/vault-sidecar-proxy

@dnephin added the theme/acls (ACL and token generation) and type/bug (Feature does not function as expected) labels on Jan 26, 2021
@idrennanvmware
Author

We are upgrading to 1.9.2 right now and will report back here. We have a number of systems in our pipeline showing the deregistration error (anonymous token) and leaving orphaned services.

@fredwangwang
Contributor

tl;dr: upgrading to 1.9.2 mostly resolves the issue.

After upgrading to 1.9.2 we still had deregistration blocked by ACLs: service=_nomad-task-... accessorID=00000000-0000-0000-0000-000000000002 spamming the logs.
So these are the steps we followed:

  1. Temporarily escalate the privileges of the anonymous token (00000000-0000-0000-0000-000000000002) by giving it service_prefix "" { policy = "write" } node_prefix "" { policy = "write" } agent_prefix "" { policy = "write" } (see the sketch after this list). This stopped the spamming immediately, and we are no longer seeing the negative-count error from the server, which seems to confirm that #9440 (Fix bug in usage metrics when multiple service instances are changed in a single transaction) fixed the cause.
  2. Redeploy the Nomad jobs; if a deployment fails, purge the job to force cleanup of the registration entries. (lisongmin's approach in #9577 (comment) might also work for force-deregistering, but we didn't try it.)
  3. Restore the anonymous token's original privileges.
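A minimal sketch of step 1, in case it helps anyone (the policy name is arbitrary, and the exact acl CLI flags may vary slightly between Consul versions):

    # anon-escalation.hcl -- temporary, deliberately broad; reverted in step 3
    service_prefix "" { policy = "write" }
    node_prefix    "" { policy = "write" }
    agent_prefix   "" { policy = "write" }

    # create the policy and attach it to the anonymous token
    consul acl policy create -name anon-escalation -rules @anon-escalation.hcl
    consul acl token update -id 00000000-0000-0000-0000-000000000002 \
      -policy-name anon-escalation -merge-policies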

@pySilver

@fredwangwang same issue here; the logs are full of "Check deregistration blocked by ACLs".

@hynek

hynek commented Jan 27, 2021

To clarify: are those new cases or existing ones that didn't get resolved by the upgrade?

I've double-checked our cluster and we have zero, but I had also resolved them before upgrading.

@pySilver

pySilver commented Jan 27, 2021

@hynek that's hard to say. I will monitor the cluster logs for new cases. For the moment I've resolved all the issues by temporarily disabling ACLs (and then re-enabling them).

@dnephin
Contributor

dnephin commented Feb 1, 2021

Thank you for all the bug reports! Sorry it has taken us some time to respond.

I'm not sure that "agent.fsm: DeleteNodeService failed" is related, but it was definitely fixed in 1.9.2.

The original issue of deregistrations being blocked by ACL tokens has been around since 1.7.2 as you can see from issues #7669 and #8078.

Some context about the problem:

  • Consul has two views of services and checks. There is a centralized "catalog" view that is maintained by the Servers, and there is a local "agent" view on each client agent.
  • Generally, the centralized catalog view is updated by the individual agents periodically.
  • To perform an update of a service (including deleting it) the client agent uses the ACL token that was used to register the service. If no token was used (or the reference to the token is gone) it tries to use the default token configured for that agent, and when no default token is specified, it uses the anonymous token.

This problem can occur when the ACL token that was used to register the service is deleted before the service itself.

To fix the problem I believe there are a couple of options. You should be able to use the catalog deregister API instead of the agent deregister API. Once the service is removed from the catalog, the local agents will stop attempting the deregistration, and the log lines will stop.

Another option is to set the default token on all client agents, so that the sync between client and server uses that token instead of the anonymous token. That token will need the node:write permission for the node name used to register the client agent.
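For example, something like this in each client agent's configuration (a sketch only; the token value is a placeholder and needs node:write on that agent's node name):

    acl {
      tokens {
        default = "<token-with-node-write-for-this-agent>"
      }
    }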

If one or both of those options do not work, please do let me know. There may be more to this issue.

I think the suggestion from @mkeeler in one of the linked issues would be a good fix. Instead of using the service token, we could default to using the agent token. That way removing the ACL token used to create the service won't cause this issue. However, it will still require the client agents be configured with either the default token or an agent token.
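For completeness, the default and agent tokens can also be set at runtime rather than in the config file, roughly like this (a sketch; the token value is a placeholder, and it only survives agent restarts if token persistence is enabled):

    consul acl set-agent-token default "<token-with-node-write-for-this-agent>"
    consul acl set-agent-token agent "<token-with-node-write-for-this-agent>"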

@idrennanvmware
Author

Hi @dnephin !

We were just looking at the Consul 1.9.4 release notes and we don't see this change in the changelog. Did we misread it?

https://github.com/hashicorp/consul/blob/master/.changelog/9683.txt

Just wanted to check before we upgraded our Consul Systems

Thanks!

@dnephin
Contributor

dnephin commented Mar 5, 2021

Hmm, ya, it looks like I missed the backport label on that PR, so it did not make it into the 1.9.4 release. All that PR did was to change the fallback to use the agent token before the default token. Setting a default token with node:write can still be used as a workaround for the problem.

@idrennanvmware
Author

@dnephin is this slated for the 1.10.0 release?

@dnephin
Contributor

dnephin commented Mar 22, 2021

Yes, this will be in the 1.10 release and any future 1.9.x releases.

@blake
Member

blake commented Jun 18, 2021

This fix was released in Consul 1.9.5. https://github.com/hashicorp/consul/releases/tag/v1.9.5

@cshintov

For me, none of the above methods worked. I had to restart the Nomad agents, which successfully removed the stale checks and services from Consul.
