-
Notifications
You must be signed in to change notification settings - Fork 29.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AsyncLocalStorage does not work with http.Agent #34401
Comments
I think removing |
@Flarna even if we create a patch for this particular issue - similar problems would still be caused by other core or user-land modules. Whenever there is a pooling involved - there is a potential for a thread-local storage to get out of sync. From my experience, it is hard to impossible to debug and leads to security issues. I've personally seen developers using CLS as a storage for security data and application users performing activities for other random users as a result. The issue with having it in the core is that core API comes with a promise both in the form of the documentation and provided by the fact that it is a part of the core and thus is can be trusted. Should we make a promise for something as frail and susceptible to issues as thread-local storage? |
Perhaps, instead of shipping its version of CLS implementation - Node.js core should be more vocal about why CLS should not be used in user code at all? |
I understand your point that we removing The problem with not using CLS is that it disallows some usecases - concrete any APM tool I know uses CLS and I'm quite sure there is more then these. |
I think removal of APM and any form of tracing/monitoring does not have an effect on the logic of the application so it is indeed safer. However, imagine a different scenario: thread-local storage is used for logging and the support team uses logs when the ground-truth is unavailable. What would happen if the logs of one request will get mixed with the logs of another? One can't argue that this is an incorrect use of the thread-local storage, because, well, it is what it was advertised for. However, evidently there are still possible ways in which this could go wrong even if this doesn't change the behavior of the server in general. FWIW, I know plenty of tools that use CLS and contributed to removal of CLS at least in some of them. The fact that there exists software that uses faulty API/module doesn't mean that this functionality has to be present in broken form. |
I have some exposure to CLS impl & integrations, and I'm coming at this from a high-level/black-box point of view. CLS is high value in my estimation. It seems pretty clear that things that do pooling need to take steps to ensure reasonable CLS semantics for their callers. In the use-case mentioned, the Agent instance should have a CLS (if needed) branched from its point of creation; the http.request() should have CLS branched from its point of creation, with any Agent interactions passing the request's CLS through so they can be used in alignment with expected semantics. Choosing one of these contexts accurately inside the Agent would depend on the situation, though. For instance, a) pool activity isn't request-specific, so logging cleanup of idle connections would use the Agent's CLS logger if that were a thing. Again with a hypothetical, b) logging "yay, this connection is still active, we'll hang onto it longer" might ALSO use the Agent's CLS logger even though an in-scope request having its own CLS context was the causation for it. ...but then c) it would use that request's CLS/async context explicitly, for a callback such as the one mentioned in this issue. Maybe the issue title should be centered on http.request() and/or http.Agent not playing their expected cooperative role with CLS? |
@rjharmon what are you are saying is that we have a leaky abstraction that requires a full-stack support. Yet most Node.js apps are built around public npm modules that do not yet provide (and many won't for a long time) such support. Sure, it is possible to "fix" http.Agent to make it play better with this particular CLS and even introduce an API for doing so, but until whole ecosystem catches up - there are still going to be incompatibilities in the 3rd party modules and these modules may not even be in the direct dependencies of the application (as it was in my case some time ago). |
Chiming in on this as a former user of CLS for sensitive data: this is a tough issue to thread. There's a legitimate desire for CLS for operations like logging, APM, or associating resources with requests (like postgres connections, for example.) Doing this requires a lot of care, as @indutny notes -- if a dependency anywhere in your tree does any sort of connection pooling & doesn't use the AsyncResource API to correctly re-associate asyncwrap ids, context might be lost or swapped incorrectly. It's very hard to debug when this happens. However. Authors have been doing this with Node for a very long time – even before AsyncLocalStorage, through monkeypatching or via domains. There's a real desire for this feature; in my experience Node has not had much success in telling authors not to do something that they want to do. Providing a blessed API for doing this work gives Node an place to document the potential pitfalls, and maybe some amount of leverage over the situation (could Node provide an I'm in complete agreement that the feature, as it is now, is dangerous. However, it's a feature that authors continue to try to implement themselves. By acknowledging that authors want this functionality, we can better educate them on the pitfalls. Hopefully, eventually, we can work with the community to make the feature less dangerous. Backing out the feature would be doing node's users a disservice by continuing to let them fend for themselves. |
Fair points, for sure. I'd agree that the http module has a leaky abstraction. Possibly that's true more generally of AsyncLocalStorage too; that's not really how I'm seeing it, though I recognize I may be seeing wrongly. The facts of this issue's existence, and that both modules are included with nodejs, makes these distinctions pretty much boring, though. With some clear guidelines, maybe we can help those 3rd-party module providers shorten "a long time" somewhat (perhaps in a way like @chrisdickinson elaborates) - the API seems straightforward enough, once one gets their head around it. Helping them to identify how/whether they're impacted, and to add not-v14 and not->v13.whatever clauses to their package.json might also be in order. Shipping broken bundled modules in a v14-LTS in October would be a real bummer. |
@chrisdickinson beautiful take on it! I can see the motivation behind having it in core. What I disagree with is that 3rd party modules will catch up and provide support for this. As you said single 3rd party module in indirect dependencies may mix up the async contexts for whole app and result in very unpredictable and dangerous behavior. Having this API means that the ecosystem is going to fracture into modules that were modified to support this and the modules without the support. If we are to assume full cooperation from all module authors - there are still pure JavaScript code that is no longer (or lightly) maintained, but does not depend on Node.js core specifically and thus is perfectly usable. It is okay to have a breaking an API change and we do it from time to time. What's unfortunate in the case of CLS is that the fact that the module does not support it is completely invisible to the module user up to the moment when it blows up in production. |
@rjharmon unfortunately I can't agree with you. http module implementation has its pitfalls, but in the context of this particular issue it suffers from the same symptoms that the rest of JavaScript modules might. There is nothing specifically incorrect about how it is written. |
There's already many modules which properly use AsyncResource to bind the context and that's with async_hooks being marked as experimental. With AsyncLocalStorage we have a simpler API surface area than async_hooks and so it has a better chance of leaving experimental and thus a better chance of convincing module authors that they should use AsyncResource to support it properly. Removing AsyncLocalStorage from core doesn't fix the issues, it just drops them all on the users to fix in userland where it can't be fixed safely or in a way that benefits everyone. The cost of fixing context propagation issues is large and so individuals tend to just leave them broken which has resulted in an ecosystem full of APM vendors that are likely to crash your app, which is obviously undesirable. By having it in core, we're committing to a goal of reaching greater stability and are thus making more fixes to do that, which is evidenced by the creation of executionAsyncResource, the thenables fix I made to V8, the promise hook fast path when not using the destroy hook in async_hooks--yes, it's not perfect, but in core we can work toward something more solid than the horribly broken and fragmented solutions the ecosystem has had for the last decade. We've made more progress on async_hooks performance, usability and stability in the last six months than in the entire lifetime of Node.js leading up to that, because we finally were able to get some of this stuff in core. |
@Qard I feel like this doesn't address the fact that one 3rd party module that does not use Note that it is not the same as being imperfect or having issues that can be fixed. In my opinion, the problem that was brought up is a serious design flaw. Similar issue (but in original CLS module and |
It depends how you think of "unstable" here. APM vendors are used to coding very defensively, and we DO communicate the importance of using AsyncResource to maintain context. It's not going to crash your app unless you use it fairly recklessly. The feature is also considered experimental and arguably should maybe stay that way until a language spec can be decided and proven. As long as it remains experimental, I think it's not reasonable to expect the level of stability you seem to be expecting from it. Besides, it's not like there isn't plenty of other APIs in Node.js that can blow up in your face if you use them wrong. |
"unstable" not in a sense of crashing the app, of course. As mentioned previously, the instability is due to information mixing between different contexts caused by 3rd party modules out of user's control. The feature is experimental and so is whole |
@indutny every single mid-large company I consulted with use an APM and each one of them must use We could design a better API for this use case, but I'm failing to see one that will not require a mechanism similar to |
I agree that a better API would be desirable but I am not sure what such an API would look like. I think the "single library can mess things for everyone else" is a problem Node.js has in general - one way we can help is provide better tooling to figure out where the chain "broke".
Like Matteo said - We haven't given people a choice, it's either "use async hooks" or "don't use promises, async await and modern JavaScript" (or transpile bluebird promises and use domains with them or some other abomination which wouldn't help since domains are deprecated). |
As Matteo said, basically every production Node.js app in existence has some sort of APM running in it, and every single one of them needs to have a CLS implementation of some sort. For years we left that to the community to implement and it resulted in lots of very unstable diagnostics tools. I know dozens of companies that switched to Go exclusively because of this. Our small core ethos has been actively harmful to any attempts at building a good developer experience for the platform. Every other popular platform has features marked as experimental that can be just as "unsafe" as this. Some features like this are large and complicated enough that it's pretty much impossible for them to be built completely stable from the start. The needs and edge cases are very complex and most importantly require a lot of experimentation and iteration to land on an adequate solution which solves the appropriate set of problems without being unstable or becoming a leaky abstraction. Promises iterated in userland for awhile before a clearly defined spec came along and introduced it to the language. The same path is being followed here with AsyncLocalStorage informing the spec proposals to introduce corresponding functionality at the language level. The async_hooks API we have now is an evolution of AsyncListener, because the first attempt had some really difficult to understand edge cases that didn't surface until long after the feature was released and people started to use it. In the last few years we've had similar challenges with async_hooks, like the emergence of promises and async/await causing issues, particularly the fact that promise hooks wasn't designed with thenables in mind so V8 needed a fix to make thenables work. There's always going to be new features coming along that throw a wrench in your diagnostics efforts--that's just the nature of the space. That doesn't mean you should throw your diagnostics features away, it just means you should acknowledge that you are working with low-level constructs and that has associated risks and complications that need to be considered. This feature is in extremely high demand. Denying it only serves to make Node.js as a platform worse to use for the majority of users. |
It seems to me that we are just re-iterating following two views on it:
I've heard a lot of opinions about the first bullet point, but most of what have been said (except @chrisdickinson 's comment and mine) does not even attempt to address the second issue. You have said that people moved to Go because there was no CLS, but it happens in reverse order as well - people come to Node.js from Go and Java and look for a thread-local storage to help them write business logic. The easiest thing to do - is to just use CLS instead of changing the paradigm about the asynchronous state. As long as there is a support for this in core - these developers would use it regardless of what risks it poses for their applications. |
As a person with some Java background, I'd say that using IMO the same rule should be applied to CLS APIs in node ecosystem. Storing any business logic related data in CLS context should be considered an anti-pattern and should be avoided in favor of passing that data explicitly across application modules. |
I don't think anyone here disagreed with the second point or claimed to like this sort of state. I think everyone was focusing on the first point because it means that in order to build production systems using modern JavaScript in Node.js they must have something like AsyncLocalStorage (or async_hooks). I think the main reason Node hasn't iterated more on the API is that not a lot of people are spearheading this effort. I think for starters Node can explore better documentation and clarifications on when to avoid using it because right now users are using it (and userland alternatives) anyway. We can also explore tools to figure out where chains "break" to help developers fix their code. |
I'm not claiming that this is a right thing to do in either of languages/platforms, but of course APIs like that are used for purposes that we may not deem to be correct. For Java it would be "okay", however, because ThreadLocal has specific guarantees about its contents and won't switch them out of blue. For the time being, we could go with a large disclaimer or warning section for AsyncLocalStorage. Currently we have no mention in the documentation of the fact that it is generally unsafe to use it for anything that can affect the logic of the application. The way docs are worded now:
brings wrong impression that ALS can be used for storing any kind of asynchronous state (which it is not). Can everyone here agree that at least the language of the documentation should emphasize unsafety of that feature and warn users of potential security hazards of ALS if used incorrectly? |
Not really. If we can find where the chain broke, why would it broke in the first place? This is precisely the issue with ALS that worries me. |
I'm not claiming this, but it seems that instead of discussing major fault of the API we are glorifying a subset of its applications. |
I agree that some warnings/notes regarding the risks and when/why it can fail would be helpful. And maybe even how to improve/reduce this. In special as userland modules can "break" a core API. There are also warnings/notes in docs to avoid sync FS APIs and that workers will not help to speed up async tasks. I know this is not the same but it's also a wrong use of APIs we warn.
I doubt that we can find something generic here as the problem is usually not breaking the chain, it's wrong linking. And there are even cases where correct/wrong is a matter of taste (I don't go further into details here as this is a different topic).
I don't think it's glorifying. Yes, APM tools use a bad infrastructure - but there is nothing better available in Javascript language. |
So is the use of home-grown cryptography. I don't see how marking it as "not secure for public use" is hurting APM tools. What such note means is that the person using this should very well be aware of the risks that it poses and understand that when used in an incorrect way ALS will leak sensitive information. |
Using "not secure for public use" can cause some users to think APM tools that use ALS are insecure, while that is not exactly true. How about saying "Users should avoid using it directly. Improper utilization may lead to security issues."? Or something like it... |
I know at least about one other edge case that breaks CLS-like APIS: queues. I have seen apps having a kind of rate-limiting feature by queueing requests, in that case, the current store can be missasociated. My general opinion on the Async Hooks/AsyncLocalStorage situation is probably the same as expressed by many here:
The use of AsyncLocalStorage is challenging for a lot of people. I was asked to review implementation a few times since it was released into node. But I don't think that's really a bad thing since in my mind, this API is mostly here for people building tools (APM, RASP, Framework maintainers, Tracing tool builders,...) rather than directly by the end user. |
We have a troubleshooting section in the doc: https://nodejs.org/api/async_hooks.html#async_hooks_troubleshooting, which is not much, but better than nothing. To provide package maintainers with more information we could write a guide (https://nodejs.org/en/docs/guides/) dedicated to ALS. It could provide an overview of the API with necessary (not all of them) implementation details, mention performance cost and other quirks, like potential context loss due to various reasons (might be a good idea to list some of them), name some good use cases, as well as anti-patterns, and describe integration guidelines for package maintainers. |
We just need better docs. The AsyncResource docs explain a bit about why you should be using that to fix gaps in the graph, but you wouldn't know to look for that if you're looking at the AsyncLocalStorage part of the docs. It's not particularly clear that the two APIs are linked. As for the associated risks of context loss or adopting an incorrect context, that seems like a somewhat lesser risk to me than literally every production Node.js app on the internet deploying with a pile of ridiculously unsafe monkey patches and introducing thousands of subtle behavioural differences to the entire core API. I'd rather have the risk we can at least quantify and control. The productive action here would be to identify where we could improve the docs and help to find solutions to the soundness problems of the current API and contribute solutions to Node.js core or the corresponding TC39 proposal. Regardless of how you feel about this API, every Node.js app out there will be using it whether it's in core or not, and keeping it out of core just means it's going to be a lot less safe and a lot less maintained. I agree, it's not perfect, but there's no better alternative. If we had bytecode instrumentation I'd just use that, but we don't. |
Better documentation for |
Yep, I mean we should have some more prominent references to |
I'm seeing a similar issue as OP and it's not clear from this discussion where the breakage is or how one would go about dealing with it. My understanding of the docs is that |
@puzpuzpuz I don't have a reproducer. My concern is mainly about |
@lagnat I see. Then I would say that whether you face a similar (context loss) issue really depends on how pooling (or queuing) is implemented and exposed in the public API. Say, if operations in a driver are queued and then sent over a single TCP socket, there might be ALS context loss because of that. But if those operations are exposed via a Promise-based API, there will be no issues. If you have a concrete scenario where things go wrong, I can try to advice how to deal with it. We also have this short troubleshooting guide in the doc which might be helpful in case of context loss issues: https://nodejs.org/api/async_hooks.html#async_hooks_troubleshooting |
What steps will reproduce the bug?
How often does it reproduce? Is there a required condition?
Always. No conditions.
What is the expected behavior?
I expect this code to print:
What do you see instead?
Additional information
This is very much expected behavior of CLS storage. Any sort of pooling would result in similar issues. What's concerning is that it can happen with use of an existing core primitive (
http.Agent
). In fact, use of any agent withmaxSockets !== Infinity
orkeepAlive
set totrue
would result in similar behavior.This behavior can be documented, but I'd imagine that such documentation would look like an incompatibility within Node.js core. Given that whole
async_hooks
module experimental - should we consider removingAsyncLocalStorage
instead?cc @nodejs/collaborators
The text was updated successfully, but these errors were encountered: