Unable to deregister orphan checks and service instances after accessorid token is lost and set to anonymous #9577
Additional attempt: tried actually doing a graceful leave of the agent. Still no success in removing the checks, either after the leave or when we joined it back in.
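(For reference, the graceful leave and rejoin referred to here are the standard agent commands; the server address below is just a placeholder.)

```sh
# Gracefully leave the cluster on the affected agent...
consul leave

# ...and later rejoin it via any live agent in the cluster
consul join <server-or-agent-address>
```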
Here are more logs we saw (from a different time):
Basically the same behavior as described in this thread:
Here is a memory dump of the
For all the "ghost" service registration entries, i.e.
This sounds like it may be similar to #8078. The logging output is a bit different, but that may be related to the version of Consul (the first comment mentions this "changed in 1.8.0 to output the accessor id of the anonymous token"). There's a bit of discussion in that issue that may be helpful.
Thanks @dnephin, it does seem very similar. The thing is that we can't deploy to this cluster anymore because the jobs always have something unhealthy (the old check). I didn't see anything in that thread on how to resolve that - did I miss something? We tried our deregister actions with a management token in Consul.
Spoke too soon. They're back.
So I tried using the curl API. Ran:

got back (this service is most definitely not running):

tried:

got back:

So I'm kinda stuck in an endless loop at this stage. If I have NO jobs running (just a regular Consul, Vault, Nomad setup) I still see these 'ghost/orphan' checks and services AND they look healthy (but nothing is running!). Adding a screenshot of the orphaned services. The first 2 (mesh enabled) we can't even click on, but they do show in the results for the catalog (nowhere else). The 3rd and 4th can be clicked on, and there's an instance (but no instance running). In addition I checked each node's "service and check" folders; none of the orphaned services have an entry there. I also noticed that if I cycle the Consul nodes (restart the consul service so that the leader changes) then the registrations in the UI change (the orphans), but there are always some. It's as though there's a local cache somewhere feeding this.

Relevant logs indicate the issue may be something with "delta will cause a negative count".
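For reference, the calls being described appear to be the agent deregistration endpoints. This is only an illustrative sketch, not the exact commands from this comment; the service/check IDs and the token are hypothetical placeholders:

```sh
# Ask the local agent to deregister the orphaned service instance
# (needs a token with write permission on that service)
curl -sS -X PUT \
  -H "X-Consul-Token: ${CONSUL_HTTP_TOKEN}" \
  http://127.0.0.1:8500/v1/agent/service/deregister/ghost-service-id

# Same for the orphaned check
curl -sS -X PUT \
  -H "X-Consul-Token: ${CONSUL_HTTP_TOKEN}" \
  http://127.0.0.1:8500/v1/agent/check/deregister/ghost-check-id
```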
Just checked for the log message and I see #9433 has a similar one (different scenario). Is this related?

Addendum: ran some more tests. Interestingly there are ALWAYS 4 services that have the log error above. The names of the services can switch around depending on deployment order and stops, but there are always four. That's about as much info as we can give at this point - thanks for taking the time to read it all 👍
Just a wild guess of what's going on:
@dnephin Is this probable? Btw, we never saw this happen when using Consul 1.8.5; this only became an issue when running Consul 1.9.1. We also switched to Raft protocol 3 when doing the upgrade.
Just quickly adding some context (I'm the author of the forum thread) since it just happened again to us (Nomad v1.0.2, Consul v1.9.1):
This is good to know, I'll check with ops if we can downgrade. Fixing the cluster at 5:30am on a Sunday is getting old. :(
I've been pointed to #9440 - could it be that this bug is already fixed? I don't have the time to dig into the diff, but the correlation with the warning that was fixed there is quite strong.
Interestingly enough, this is not the case with us (check out the commands in the first post - we actually get the 500 back). Now, this is AFTER having tried to fix the deregister by temporarily elevating our anonymous token, so it may be an unreliable comparison.
If you take this path, do you mind updating this thread on whether the downgrade is successful and whether you see the issue disappear? Like you, we are spending a good amount of time playing whack-a-mole, and we aren't relying on any 1.9 features at this time, so downgrading is an option if we need it (although it would take a while to propagate through our systems). Also, curiously, what is the frequency with which you see this happen?
I'm confused right now, but maybe I just worded it wrong. What I meant is: "I tried your
Makes sense?
I'm hoping 1.9.2 will fix this, as #9440 makes me somewhat hopeful. It would be great if we could get an estimate of when it might drop, @dnephin @blake?
So, this is a bit fuzzy to answer, but I've only noticed it with two services that consist of 82 and 39 allocations each, with the added property that 80 and 38 of those allocations, respectively, are groups with

As a caveat I must add that these services are the only ones where it's reliably detectable, because they use a home-grown DNS-SD-based mesh solution (I need to stop procrastinating and look at Nomad's solution :() that relies on each service existing only once and that crashes when there's a zombie service hanging around. So I basically learn about this happening via Sentry. I can't rule out, though, that it happens more often but some load balancer is hiding it from me because the health checks are failing and it's just kept out of rotation - which is also my suspicion as to why not many more people are running into this problem. The frequency definitely correlates with how often I deploy these two services.
> I need to stop procrastinating and look at Nomad's solution :(

We love the mesh integration FWIW (it will be awesome when they support Mesh Gateway and Terminating Gateway natively, but ingress and regular mesh is pretty sweet). HOWEVER, the issues we are seeing in our environment are also native mesh services - don't think it'll shield you from this particular issue. Another question: do you ever restart your Nomad agents, and if so, do any of the above line up with anything like that?
Well, I didn't assume it would; however, I'd kinda expect Envoy to be smart enough to navigate the situation of two backends, one of them being hard down…
Rarely, although I did it once to no avail when trying to fix this situation (see my forum post). When I upgrade Nomad or Docker, I drain the node, run a full upgrade and reboot it (we're on metal). I've run into this bug while draining in this situation before.
Got hit again by the same issue. Consul became useless once the node IP changed. The services simply will not go away, and neither will the node. Oh, I wish there were a tool to force service/node/check removal. It is so painful to deal with stale/zombie items created by Nomad. I have tried literally everything to delete them and nothing ever worked.
I've updated to Consul 1.9.2 and deployed the service that triggered this bug before a few times and it didn't manifest. So here's to hope… 🤞
Hope so! Still, we need a tool that would allow easier data modification/removal. Btw, have you tried something like this to clear zombies? Power down the Consul client nodes (and/or execute consul leave) so only the server nodes are alive, disable ACLs, and attempt to remove the stale data.
We are watching this thread carefully. We will be upgrading to 1.9.2 late next week if no issues arise that look like blockers.
Maybe the same problem here, with Consul 1.9.2.
The steps are below:
Seems I should remove it via
We are upgrading to 1.9.2 right now - will report back here. We have a number of systems in our pipeline showing the deregistration error (anonymous token) and leaving orphan services.
tl;dr: upgrading to 1.9.2 helps to resolve the issue - mostly. After upgrading to 1.9.2 we still have
@fredwangwang same issue here. Logs are full of "Check deregistration blocked by ACLs".
To clarify: are those new cases, or existing ones that didn't get resolved by the upgrade? I've double-checked our cluster and we have zero, but I've also resolved them before upgrading.
@hynek that's hard to say. I will monitor the cluster logs for new cases. At the moment I've resolved all issues by temporarily disabling ACLs (and enabling them back afterwards).
Thank you for all the bug reports! Sorry it has taken us some time to respond. I'm not sure that "agent.fsm: DeleteNodeService failed" is related, but it was definitely fixed in 1.9.2. The original issue of deregistrations being blocked by ACL tokens has been around since 1.7.2 as you can see from issues #7669 and #8078. Some context about the problem:
This problem can occur when the ACL token that was used to register the service is deleted before the service itself.

To fix the problem I believe there are a couple of options. You should be able to use the catalog deregister API instead of the agent deregister API. Once the service is removed from the catalog, the local agents will stop attempting the de-registration, and the log lines will stop.

Another option is to set the default token on all client agents, so that the sync between client and server uses that token instead of the anonymous token. That token will need service:write permissions for the services being deregistered.

If one or both of those options do not work, please do let me know. There may be more to this issue. I think the suggestion from @mkeeler in one of the linked issues would be a good fix. Instead of using the service token, we could default to using the agent token.
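To illustrate the first option, a minimal sketch of a catalog deregistration call; the datacenter, node name, service ID, and token are hypothetical placeholders:

```sh
# Remove the orphaned service directly from the catalog, using a token
# that has write permission on the service (e.g. a management token).
curl -sS -X PUT \
  -H "X-Consul-Token: ${CONSUL_HTTP_TOKEN}" \
  -d '{
        "Datacenter": "dc1",
        "Node": "node-with-ghost-service",
        "ServiceID": "ghost-service-id"
      }' \
  http://127.0.0.1:8500/v1/catalog/deregister
```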
Hi @dnephin! We were just looking at the Consul 1.9.4 release notes and we don't see this merged in the changelog - did we misread it? https://github.com/hashicorp/consul/blob/master/.changelog/9683.txt Just wanted to check before we upgraded our Consul systems. Thanks!
Hmm, ya, it looks like I missed the backport label on that PR, so it did not make it into the 1.9.4 release. All that PR did was to change the fallback to use the agent token before the default token. Setting a default token with the necessary service:write permissions should still work as a workaround in the meantime.
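A rough sketch of what that default-token workaround could look like; the policy name and the broad service_prefix rule are assumptions for illustration, so scope them to your environment:

```sh
# Create a policy that allows writing (and therefore deregistering)
# the services in question
consul acl policy create \
  -name "nomad-services-write" \
  -rules 'service_prefix "" { policy = "write" }'

# Create a token carrying that policy
consul acl token create \
  -description "client agent default token" \
  -policy-name "nomad-services-write"

# On each client agent, set that token as the default token so the
# client/server sync no longer falls back to the anonymous token
consul acl set-agent-token default <secret-id-from-the-token-above>
```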
@dnephin is this slated for the 1.10.0 release?
Yes, this will be in the 1.10 release and any future 1.9.x releases.
This fix was released in Consul 1.9.5: https://github.com/hashicorp/consul/releases/tag/v1.9.5
For me none of the above methods worked. I had to restart the Nomad agents, which successfully removed the stale checks and services from Consul.
Consul info for both Client and Server
Consul 1.9.1
After experiencing an unexpected cluster scenario (we had some nodes whose VMs crashed), we have a number of orphaned services and checks in the Consul service discovery of a cluster (the clusters are in a healthy state apart from these checks). However, we are completely unable to deregister them.
Here's what we see in the logs for Consul:
Note the AccessorID - these services were not registered with that token, but interestingly enough, using the CLI and HTTP we are unable to clean these up.
We tried some other variants of the above too (shooting in the dark):
We have tried deregistering at both the agent and the catalog with no success. We even shut down every single service registered in Nomad, but the job registrations still remain.
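For reference, this is the kind of CLI deregistration attempt being described; the service ID is a hypothetical placeholder and not necessarily the exact command we ran:

```sh
# Ask the local agent to deregister the orphaned service by ID
# (requires a token with service:write for that service)
consul services deregister -id "ghost-service-id"
```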
We're at a bit of a loss on how to proceed here, but we have kept the cluster in this state (fortunately it's not prod) so we can try any suggestions on how to recover and clean up the registry.
Thanks!