-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Enabling Auth Plugin failing with context deadline exceeded #19192
Comments
I wonder if this is actually a continuation from the problem reported on: #18323 (comment) If we compare the issue we were having before (the Now, instead of panicking, it looks like the Please, let me know if you need any extra information. |
I recall reading somewhere that there were changes in Vault 1.12 which mean plugins may now be started briefly just so Vault can retrieve some metadata about them. Therefore, I wouldn't take the fact it is being started and then killed soon after as necessarily an indication that there is any problem. Although... maybe this is itself the problem. It may be that your plugin is attempting initialization that writes to storage, when Vault only wants to run it briefly to query metadata, and storage writing is not enabled. I am seeing in your logs that it appears to be attempting an AppRole login, and then exactly 60 seconds later that is timing out:
I believe you might find that you now need to move some initialization logic to not run when the plugin is initialized, but when it first receives a request, to remedy this. |
Hello @maxb , thanks for your response. Yes, as pointed out, we are indeed injecting the Throughout our analyse, we identified that the context is hanging after invoking, For reference,
and we configured our client as:
You should note that even after increasing the Client timeout to 200s it still failed after 1m30s.
I hope this complementary information is useful. |
So, the real issue here, is why is your request for an AppRole login taking an incredibly long time. Fundamentally, an AppRole login is just a few reads/writes to Vault's internal storage. If that's taking anywhere close to a minute to complete (EDIT: especially given you're using Are you able to log in using the AppRole RoleID and SecretID that the plugin is trying to use, if you manually make the login request yourself, outside of the plugin? I believe the 1m30s timeout you are now observing, is probably the default timeout applied to all requests by the Vault server - i.e. since you have increased the client timeout beyond this, you now see the Vault server deciding to time out your requests. Just out of curiosity... how are you getting the RoleID and SecretID into the plugin? AFAIK you usually have to enable a plugin, and then configure it in a separate request to a config endpoint? |
Yes! In fact, that's how I figured that the secret-id was expiring. I was able to successfully create the login token using the
For our local environment we are running this in a docker container, right. There is a script that is invoked with the The TL;DR is:
This is the operation that actually hangs in the backend. Which I described to you before. Once again @maxb , thanks so much for trying to assist. |
OK... I think the key question remains "Why is the AppRole login hanging?"... unfortunately I'm running out of ideas on how to investigate that further. One thing that might be interesting, just to eliminate things, would be take Docker and the associated docker-compose environment, and your local Vault configuration out of the picture. A Vault server can be started in dev mode with plugins enabled and full logging just by running
When you do this, it automatically scans the directory and registers plugins in it so you skip the requirement to use
I've skipped assigning any policy to the test approle here, since all we really care about is whether the login works. I'm not sure if this will be fruitful or not, but hopefully it should be fairly easy to try out, just in case. Beyond that, I think your next step should be to remove your AppRole login from executing during the plugin The rest of my reply is not directly related to the issue at hand, it's just reactions to a couple of the background details: On the use of On the need for a Vault plugin to itself log in to Vault: I too have encountered this need, but it's pretty unusual, as it is only needed when one Vault plugin wants to consume other Vault APIs as part of its operation. It's particularly unusual for an auth method, as usually they just need to confirm the user's credentials and return a confirmed username and metadata to Vault. Are you able to indulge my curiousity on what this plugin needs to call back in to Vault for? |
Hello again @maxb, (sorry for the late reply)
Yes, I reckon that's what we should probably be looking at. I'm also start to run out of ideas. :( After adding a little bit more of logs, I noticed
In regards to your suggestion of testing the plugin out of the docker context, I reckon it would be a bit harder to do, because the plugin does expect the
Hmmmm, I'm not sure I follow the question here... Based on the policy attached to I'll continue trying to dig on my end on this issue, but I'm also running out of ideas at this stage. :( |
Yes, that would be expected given the previously shared logs.
Um, yes, but that's why I wrote out commands to create and supply those in my previous comment? |
Yep, but what I mean is that the server itself won't even be up when I run: The server won't even be started successfully. See logs attached. I have also tried increasing I noticed that even methods the methods logical.read they are hanging. Very confusing. Example:
The final logs were printed as follow:
|
@maxb, a quick update. I ended up cloning If I figured anything else I will update here, but that's what I have got so far. :( |
I do not understand... of course the server itself won't be up until you run the command to start the server... that is what the
Did you mean to paste something here?
From the posted logs, I am more convinced than ever that your plugin is performing initialization actions too soon. Do what I said in an earlier comment:
|
Howdy @maxb , 👋
I ended up attaching the logs and forgot to remove it, sorry about that 😭 .
I'm still trying to get my head around this to be honest. Was there any change surround plugin initialisations that we should be aware of and I might be missing it? I don't think it will be easy for us to refactor it, I'd be happy to have a look at it, but I'm quite not 100% sure if this will solve the issue.
In regards to your previous comment, I'm sort of confused if removing the App Role login would help with anything. One of my previous comments, I mentioned that I left the App Role Login mechanism aside and replaced it with the Root Token. Which means that we create Also, I was not able to find any
There is something super odd happening when invoking the http request. For some reason it just gets stuck. I'll try to revisit the initialisation of the plugin, but at this stage, I'm no longer sure what I should try. Once again, thanks for trying to assist. |
Can you share more of the source code? Ideally as much as you can, and just redact any particularly internal bits, but definitely |
@heliomartins Hello! What version of the Vault sdk and api are you using? Vault 1.12 did introduce a change to how plugins are loaded. Prior to Vault 1.12 plugins are lazy loaded on startup. This means that plugin processes are killed after a successful mount and then respawned when a request is routed to them. Vault 1.12 introduced AutoMTLS for secrets/auth plugins so we do not lazy load them on startup anymore. I will get this added to our 1.12 upgrade guide ASAP. |
@maxb , unfortunately I won't be able to share all the contents of the plugin, but I extracted as much as I could related to plugin initialisation. I hope this helps. @fairclothjm , thanks for jumping in: We are trying to update to the latest version
|
Ah, I see you're already trying to detect the plugin being launched in metadata mode and disable certain initialization in that case. I suspect the issue is that this is no longer enough of a workaround. I believe there's kind of an unwritten rule that Vault plugins should only be filling in the fields of the A relatively simple refactor that would allow you to prove this, would be to move your clientFactory/SysLink initialization to a helper method of your That would ensure it only gets triggered once Vault routes a request which needs it to your plugin. |
I don't suppose you would have any example for this scenario would you? Or even pointing to any documentation that perhaps could help me attempt this migration? |
I did some other tests Today, and I noticed that if we invoke those I'm attaching the logs with the difference between the initialisation using The following logs does seem interesting:
I wonder if there is anything we need to do to handle this auto mTLS thing. @fairclothjm, is there any documentation guiding on how to do that? |
One more update: Max, I was able to perform the test (disregarding the docker compose) by doing:
The problem still persists. :( One of my teammates suggested trying to get a crash dump where it is hanging. See the coroutines crash logs. Not sure if this will help though. At this stage, I'm not sure what else I should try. |
Here is a toy example of what I'm suggesting you do: https://github.com/maxb/simple-vault-plugin/blob/main/main.go |
I think @maxb is right on the money here. I would suggest trying out his proposed changes to your plugin. Maybe you could move your initialization code to the InitializeFunc which you would pass to the Max is correct that plugins should not be doing any initialization with external dependencies in the Factory function. I will also add that having a dependency on MetadataMode in your plugin is not recommended. If Vault ever removes the concept of MetadataMode then you will again run into issues. @maxb Thank you for all the help here! |
I updated my example to use I went with |
Hello folks, I hope you all had a good weekend. @maxb, thank you very much for sharing this snippet. I'm going to have a look into this. In the meanwhile, when you or @fairclothjm get the chance, do you mind elaborating a little bit more why we should not be using Metadata mode? I'm not 100% sure I understand the impact of using it. I'm afraid we don't have much context why it was created like that. Maybe an old issue on Vault that was later resolved? Thanks once again for assisting with this. |
Hello @maxb and @fairclothjm, unfortunately we were still not able to successfully enable the I've just pushed a simplified version of the plugin to GitHub (https://github.com/heliobmartins/vault-context-deadline) that should allow you to reproduce the problem. The README.md contains the steps to reproduce the issue. Also, you should be able to see the Pull Request I raised to use the InitializeFunc (heliobmartins/vault-context-deadline#1) which fails with the same error. I hope the repository help you to reproduce the error. |
I believe the point is that it is unclear whether the environment variable to signify metadata mode can be considered a public API or not, and thus, whether there is a commitment to it continuing to exist and/or behave the same way in the future.
Ah, now it is really getting interesting. Experimenting with this, a few things become clear:
These observations strongly hint at lock contention within Vault. I suspect some change in Vault 1.12 has introduced additional lock contention so that it is now impossible to make other requests to Vault whilst an auth backend is being enabled. |
A bit of git bisect idenfies #16157 as the culprit - which adds taking a read-lock on |
Nice sleuthing @maxb! I agree with your analysis, though I'm a bit hesitant to vilify #16157. As you say now pretty much all requests need to grab that lock, but even before that change many requests would, e.g. pathInternalUIMountsRead is invoked for most |
No vilification intended - just the observation that that's the commit at which the behaviour changed. |
Right, I just meant that while that PR is the culprit in a sense, insofar as it's made an existing problem worse, I'd prefer to focus on reducing how long we hold the lock than on trying to change what that PR is doing. |
@heliobmartins Now that we understand the underlying issue, you might want to give the
at the top of all your request handling methods, but it should be enough to defer initialization to first use of the auth method, working around this locking issue. |
Hello @maxb, 👋 🤦 I probably should've pushed a reproducible version of the plugin earlier. My bad. I will try to understand a bit more how I'm going to refactor the plugin to use the Now that you better understand the issue, do you think that this is a bug that was introduced in Vault core? Or do you still think that this is related to our plugin initialisation? Because if that's the latter, I reckon I probably need to raise a ticket in our backlog to improve it as soon as possible, because ideally, we should be following the best practices for Plugin Development anyway. Thanks once again for all your help. I will get back to you soon. |
Hello @maxb, just a quick follow-up here, we've also tried to implement the Adding the reference for the changes here. What are the potential consequences if we actually try to add a new context prior invoking the initialisation? Example: https://github.com/heliobmartins/vault-context-deadline/pull/4/files /\ With the above, we were actually able to load the plugin at least:
|
Please refer back to https://github.com/maxb/simple-vault-plugin/blob/main/main.go - the whole point here is that you do not trigger the initialization anywhere that gets called during In your 4 you have created a race condition, as you now have no guarantee initialization has actually completed before the first actual request to your plugin occurs.
Vault never documented either way, whether it was allowed to depend upon the Vault API during enabling a backend. You relied on this, but Vault has subsequently, accidentally, introduced lock contention which makes things break. Is this is a bug in Vault? I think it is unclear whether it is or not, as the plugin development docs do not rigorously document what a plugin author may or may not rely upon. I do think there are some issues in your plugin that would ideally addressed:
|
Hey @maxb, 👋 Thanks for all your input.
Does Hashicorp provide any documentation to follow this best practices? I mean, I couldn't even find any reference for
As I mentioned in my comment #19192 (comment) this is not the real plugin, it is actually missing a lot of things there... What I pushed is just a reproducible version of the problem we are trying to investigate. The Paths are not meant to be relevant, since the whole problem we are trying to identify is the
We are making usage of
In regards to token expiration, etc... this isbeing handled by our plugin, through a set of methods. (Maybe this is not ideal, but I'm not 100% sure what was the motivation on doing like that. Unfortunately, I didn't belong to the team responsible for the plugin when it was implemented.
Maybe this is something our team needs to revisit... We probably need to understand more what's being done here.
This is also probably we might need to understand better. I might not have all the context here. I'm going to re-try the sync.Once suggestion, but I'm not 100% sure if this is going to work as expected due to the complexity of our plugin. It would be nice to understand if this extra Once again, thanks for your support in trying to assist here. |
I don't think so... everything I know about this, I have learnt by reading the source code - both that of the plugin SDK, and all the built in plugins within the Vault source tree.
@ncabatoff has indicated in earlier comments in this issue that the currently preferred direction is to mostly leave it as is, but try to reduce the length of time it is being held during the auth enable operation, so that it is no longer being held during the Regardless, moving your plugin's initialization to happen on first request via |
Hello @maxb 👋 I ended up spending some time trying to refactor the plugin to use I just have a couple of follow up questions after making the change...
I'm not sure I understand this part. Does this means we (plugin developers) can actually control the timeout of something through the InitializeFunc? Once again, thanks so much for your help. |
The That's clearly not great, but it's the error handling situation you already had with your previous implementation, so I proposed the smallest possible change to fix the issue without further refactoring. Longer term, you would replace
Contexts are being used here mostly to control the scope of cancellation. Reverting to
Um, not really. Or, all of it? You can't really summarize general familiarization with a few specific points.
If Vault itself only held the relevant lock for certain critical sections of enabling a new auth method, and not whilst its
The token is a short-lived credential which can (presumably?) be replaced at any time by doing another AppRole login, yet you are writing it to clustered network storage (the Vault store). It's not exactly wrong, but it's an inefficient choice, for which I can't see an obvious reason.
When it comes to communication between Vault and any Vault plugin, the communication necessary to actually integrate with Vault as a plugin is via gRPC. But your plugin is a bit unusual, as it interacts with Vault both as a plugin (via gRPC), and as a regular client of the Vault API (via HTTP). That's fine, and there are no alternatives - Vault does not provide any special way for plugins to interact with the regular Vault API, so the rare plugins that need that, have to interact with it in two ways. But, you're not forwarding HTTP requests from another source - you're just making them. |
Hello again Max, Once again, thank you very much for providing such assistance on this issue.
✅ In that case I assume we are handling it correct. 🙌🏽
Alright, no worries! I understand it can be hard. What I was mainly asking is if you would be able to point in Vault repository, where should we start reading to get to know more about plugin implementations. But don't worry about this.
❓ In that case, it would be more efficient to just issue Vault Tokens using the App Roles based on requests? Maybe we could reduce the TTL of these batch tokens, instead of have to handle operations such as Token Renew, etc? Is this how we should be handling it?
✅ Alright! So it is fine and it is not wrong what we are doing. Thanks We'll hopefully be rolling out the changes to Production Today. I will follow-up with the final result. 🙏🏾 |
Hello again @maxb , As promised, we released the changes to Prod and it looks good so far. Thank you once again for all your support throughout this issue. In regards to the Github issue, I will let you guys decide if this extra |
Hello again team, 👋
Describe the bug
Vault is returning
error: code = Canceled desc = context canceled" took=59.750225246s
when trying to register an externalauth
plugin. Thedebug
is followed by the error2023-02-15T00:58:57.417Z [ERROR] secrets.system.system_7f9e5cfe: error occurred during enable credential: path=$PLUGIN_PATH/ error="rpc error: code = Canceled desc = context canceled"
To Reproduce
Steps to reproduce the behavior:
vault plugin register -sha256=${PLUGIN_SHA256} -command=${PLUGIN_COMMAND} auth ${PLUGIN_MOUNT}
eval vault auth enable -path=${PLUGIN_MOUNT} ${PLUGIN_MOUNT}
Expected behavior
The registration of the plugin should and it should not be returning
error occurred during enable credential
Environment:
vault status
):vault version
):Other Dependencies
Additional context
We are currently running a Vault Cluster with version 1.11.4 in our
production
andstaging
environment and we are trying to upgrade toVault 1.12.3
following the upgrade guide https://developer.hashicorp.com/vault/docs/upgrading/plugins.The error we are currently facing is on our local environment (we haven't yet tried to update the cluster since our Integration tests that depends on the external plugin is failing because of this issue).
After enabling
LOG_LEVEL=trace
we noticed that during the initialisation it looks like Vault core is successfully loading the plugin and for some reasoncore
kills it. (not sure if this expected)During the initialisation, we can confirm through the logs that the registration of the plugin was successful:
Running
vault plugin info auth $PLUGIN_NAME
does return the proper information of the plugin.Note_1: We are using App Roles to enable the plugin
Note_2: I'm not sure if this would help, but I extracted
vault-debug-logs
and you can find it attached.Note_3: Increasing the timeout of Vault Client didn't help either. We still got the same error.
vault-debug-2023-02-14T23-26-56Z.tar.gz
Once again, thanks for your help into this.
Helio
The text was updated successfully, but these errors were encountered: