fix(sub-channels): clear the cleanup interval when all channels are unrefed. #1083
Conversation
Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement. After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.
Regards,
I signed it
Have you confirmed that this problem is not just a Jest bug? My own tests always terminate quickly without even explicitly closing channels.
There might be a bit of an overstatement here: it probably won't fix #1083. However, it will still fix one of the issues we're experiencing.
We're trying to introduce some gradual changes to help resolve the issues with the library that we are facing at the moment. I hope that this change is small enough to be reviewed with confidence, and given the examples above you'll see the value of incorporating it into the library. We tested it on our internal projects too and it seems to work perfectly fine.
Clearing the require cache is something I hadn't considered. I'm OK with turning the timer on and off based on whether there is at least one subchannel in the pool, but I don't like forcing the pool to clear out every time a channel is closed. Part of the point here is to allow subchannel reuse to improve connection time, including when a channel is closed and then a new one is opened. If all of the channels are closed, then the subchannel pool should still get completely cleared out the next time the timer runs.
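To make that idea concrete, here is a minimal sketch, not the actual grpc-js implementation: the `Subchannel` interface, its `refcount` field, and the interval length are placeholders. The cleanup interval only runs while the pool holds at least one subchannel, so an idle process has nothing scheduled at all.

```ts
const REF_CHECK_INTERVAL_MS = 10_000; // assumed interval length

interface Subchannel {
  refcount: number; // placeholder for the library's real ref tracking
}

class SubchannelPoolSketch {
  private subchannels: Subchannel[] = [];
  private cleanupTimer: NodeJS.Timer | null = null;

  /** Start the periodic cleanup if it is not already running. */
  private ensureCleanupTask(): void {
    if (this.cleanupTimer === null) {
      const timer = setInterval(
        () => this.unrefUnusedSubchannels(),
        REF_CHECK_INTERVAL_MS
      );
      timer.unref(); // never keep the process alive just for cleanup
      this.cleanupTimer = timer;
    }
  }

  add(subchannel: Subchannel): void {
    this.subchannels.push(subchannel);
    this.ensureCleanupTask();
  }

  /** Drop unreferenced subchannels; stop the timer once the pool is empty. */
  unrefUnusedSubchannels(): void {
    this.subchannels = this.subchannels.filter(s => s.refcount > 0);
    if (this.subchannels.length === 0 && this.cleanupTimer !== null) {
      clearInterval(this.cleanupTimer);
      this.cleanupTimer = null;
    }
  }
}
```

With this shape, closing the last channel eventually empties the pool and the interval stops itself on its next run; no explicit clearing at channel close is required.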
Great. To align on this, so my colleagues and I can improve the PR tomorrow: the expectation is that instead of trying to unref as much as possible and checking whether we unrefed everything, we should first check whether we can unref everything and only do so if that's the case?
I'm not suggesting any logic change. I'm suggesting using the existing logic, but turning the timer on if there is a subchannel in the pool and off when the last subchannel is removed from the pool. A natural consequence of this is that if there are no open channels, every subchannel will be removed and then the timer will be disabled. We can also talk about changing the
I'm not sure if that would solve the issue. Let me clarify my expectations first, so we're on the same page: if all gRPC clients are closed (using the close method), no resources allocated by gRPC should remain and no timers should still be pending. Do you agree that it should be the case?
No, I do not agree with that. It is beneficial to have subchannels stick around for a brief time after the last channel closes in case a new channel is created.
Could you share the use cases that you have in mind that would benefit from the current behavior? My understanding is that it would help if users are rapidly opening and deliberately closing connections using the exact same options (addresses, credentials, ...). Are there some other use cases?
Yes, the use-case for this is that someone closes a client connected to a specific target, and then later (but not much later) creates a new identical client with the same target.
Do you feel that's an expected way of using the library, or is it an optimization for users misusing it? I'm starting to wonder if we should have this timer at all; maybe we should only clean up when a client is explicitly closed?
The timer is still needed to handle clients that aren't closed explicitly, and subchannels that were previously used by still-active channels but have become unused.
Thanks for the explanation. Do you think that explicitly closing the client and recreating it repeatedly is an expected way of using this library?
No, closing and recreating identical clients is not an expected or intended use of this library. I'm retracting my previous comment here; I agree now that there is no significant inherent value in preserving subchannels past the channel lifetime. Still, I believe that the timer is necessary for other usage patterns and that cleaning up the subchannel pool on channel close should be essentially unnecessary. So far, as far as I am aware, the only benefit to cleaning up immediately is to avoid this problem with Jest, and I do not want to start addressing issues in other libraries and tools in our code. My preference then is to keep the functionality simple by only having the timer. I do still support turning the timer off while the pool has no subchannels. In addition, memory could be further saved by creating a new empty
Maybe using Jest in the issue wasn't the best idea; I didn't intend to focus on it. It affects any testing setup where you clear the require cache or have memory leak detection implemented. Conceptually, it seems that “I have no resources allocated if I cleaned everything up” would be the expected behavior. Maybe we could find some middle ground here, so we can have this behavior, but implemented in a way that is more to your liking? Do you feel that the current implementation is too complex? Maybe we should split it in two parts, so that it's easier to review?
I simply don't think it's necessary to clean everything up as soon as channels close. I don't think there is a significant impact even if you clean the require cache a hundred times in one process, and I expect that this will be noise in most leak checks. I really don't think it matters that much if resources are cleaned up immediately after a channel closes, as opposed to several seconds later. They're already unrefed, so they won't hold the process open.
I understand that it doesn’t matter for you. I think we have both pros and cons here. Pros:
Cons:
Are there any things I should add to this list? Maybe we could work together to come up with an implementation that is simpler and cleaner? I tried to make the smallest amount of changes possible and to take care of naming and a clean code structure, but I have a feeling that it's not up to your standards? Could you maybe provide me with some more detailed feedback if you feel we can improve this PR in such a way that you'd be comfortable merging it?
Here's how I see it: Option 1 (the current behavior of this PR): Use the timer and a call at channel close to clear out the pool. Enable the timer only when the pool contains at least one subchannel. Pros:
Option 2 (my proposal): Use only the timer to clear out the pool. Enable the timer only when the pool contains at least one subchannel. Pros:
For me, the relative simplicity of the second option outweighs the brief and occasional value of the first.
For us, as users of this library, the solution you propose would force us to either:
The first option would dramatically lower our productivity and increase CI costs. The second option would increase the complexity of the test setup and might interfere with timers used by other libraries, which could result in unexpected behavior and an inability to test the real behavior of those libraries. Maybe we could make eager cleanup optional if you're concerned about the additional CPU usage? For example: `grpc.setOptions({ eagerSubchannelCleanup: true })`. I'm trying to find a solution that would allow my company to keep using this library.
I don't understand these requirements you are describing. My own tests finish promptly without even explicitly closing clients, because these objects we're discussing are unrefed. And I don't see what a fake timer would do.
I understand that your particular testing setup exits, and I do acknowledge that unrefing allows the process to exit cleanly if there are no other tasks in the event loop. However:
Are you saying that you are currently using one of these memory leak detection tools and the only excess memory usage it is detecting is those unrefed subchannels and timers? And I honestly don't see the problem with temporarily having 20+ timers pending. They might consume a little CPU when they run for the last time, but then that's it.
No, I'm not saying it's the only leak it detects. As my company is using this library, we'd like to contribute to it the best way we can. As an example, I recently provided a meaningful reproduction that helped fix an issue we and other users were facing: #1085 (that's what is being released today). This PR provides a solution to one of the issues affecting us. We'd like to continue investing in this library by both providing reproductions of issues and creating pull requests to fix some of them when we are able to pinpoint a root cause.
I am just trying to understand what actual problem would be solved by the change you propose. If the goal is to fix Jest reporting an error because it doesn't understand that the timer is unrefed as reported in #1080, that should be taken up with Jest; I don't want to publish code that primarily addresses bugs in other libraries. If there is some other problem separate from #1080 that is preventing tests from finishing cleanly, tell me what that problem is. It may not even be related to the subchannel pool. If this library was blocking the adoption of leak checking in tests because it was the only memory sticking around at the end of the test, I would consider that a strong reason to make this change or one like it. But if it's just one leak among many then I don't think it's grpc's responsibility to make fixes for that purpose. If the existence of many pending timers as a result of clearing the require cache is causing significant problems with running tests, tell me what those problems are and I'll be more inclined to make this change.
Sure. Here's an illustration, simple JavaScript that you can actually execute with node --expose-gc (it calls global.gc()):

```js
const assert = require('assert');

let heapUsed = NaN;

const beforeEach = () => {
  global.gc();
  heapUsed = process.memoryUsage().heapUsed;
};

const afterEach = () => {
  global.gc();
  console.log(`Before: ${heapUsed}, after: ${process.memoryUsage().heapUsed}.`);
};

const definitelyNotLeaking = () => {
  const buffer = Array.from({ length: 1024 * 1024 });
  assert(buffer.length === 1048576, "Math doesn't work anymore.");
};

const definitelyLeaking = () => {
  const buffer = Array.from({ length: 1024 * 1024 });
  setTimeout(() => {
    console.log(
      `Hey look, test finished, but I'm still running! - ${buffer.length}`,
    );
  }, 500).unref();
};

beforeEach();
definitelyNotLeaking();
afterEach();

beforeEach();
definitelyNotLeaking();
afterEach();

beforeEach();
definitelyNotLeaking();
afterEach();

beforeEach();
definitelyLeaking();
afterEach();

beforeEach();
definitelyLeaking();
afterEach();
```

You should see that for the definitelyNotLeaking runs the heap usage reported before and after each run is roughly the same, while for the definitelyLeaking runs the “after” number is noticeably higher, because the pending (even though unrefed) timer still holds the buffer in its closure.
As you can see, the memory usage grows radically with definitelyLeaking. From what I've seen, most memory leak detection tools use V8's C++ API directly, as it exposes more functionality, but I hope this simple example will help you understand the issue.
OK, yes, I do understand that a timer that closes over things will hold those things in memory for as long as the timer is pending. And I know that unrefing the timer doesn't change that because that is addressing a different problem. And I do understand that you can write a function that checks how much memory is allocated and see that more memory is allocated. What I am trying to understand here is the actual practical impact of these temporary memory usage spikes, in terms of test failures or integration problems or application issues or users' bug reports or whatever.
If you have such a tool enabled, every test that uses gRPC, even if it is cleaned up correctly (.close() etc.), will fail the memory check unless you (given we'd start disabling the timer in the way you proposed):
What's more, if you clear the require cache, memory allocated by gRPC will start to pile up (as I described above, if a test execution takes half a second, you have up to 20 pending timers). In our scenario, tests run on a 32-core VM in full parallelization and isolation. That means we'd have 32 * 20 = 640 gRPC instances allocated at the same time, which is significant additional memory pressure. I haven't seen any other library behave this way, where after deliberately closing it, resources remain allocated (even for a few seconds). Do you know of other libraries that do so?
I understand that if you were to use such a tool and set it up that way, its report would include those objects. I just want to know what the actual impact of this is. Is this preventing you from using such a memory leak detection tool effectively? As for the memory usage, subchannels use something on the order of kilobytes or tens of kilobytes of memory. So in the situation you describe, you will see a peak of something on the order of several megabytes of total excess memory usage. That shouldn't be a problem on just about any modern system.
Yes, it does prevent usage of such tools. It's very hard to determine where a leak comes from, as it's usually a chain of references from one thing to another. You probably know how hard it sometimes is to understand a memory leak even with a debugger; it's just as hard for these tools to determine where a leak comes from, so you only get a "yes/no" response. At least I personally have never come across a tool that allows you to exclude certain paths. The situation with native grpc was a little different, as it's a native add-on and memory leak detection tools don't track native memory, only the actual JS heap. So at the moment I can't enable memory leak detection for any tests, as we're using Google SDKs extensively, especially Firestore and PubSub (with emulators). Having an actually clean state (synchronously) would allow me and other developers to use these tools. It's really useful for us to detect memory leaks in CI :) I know at my company we're a little paranoid about tests, but given the speed we're developing at (sometimes dozens of features and deployments daily) and a "master = production" mentality, we really need to take good care of our tests. We're trying our best to contribute to open source, especially to the tools we use daily. You can take a look at my public history (e.g. 25 merged and 1 open PR to firebase-functions) and my company's public profile with PubSub, Firestore and Flatbuffers containers, GitHub Actions, etc. We created this PR because it's important functionality for us, and we prefer to contribute to the community instead of just using forks, as we believe that's more sustainable for both the community and our company. We also believe it would ease adoption of memory leak detection tools for others, which might be a good idea for many. If this implementation is not acceptable to you, maybe you could propose a different solution (e.g. what I mentioned above, adding it as an option)? We'd be happy to discuss it and adapt our approach.
I'm sorry for all of the back-and-forth. I feel sufficiently convinced now that cleaning the pool up when channels close is the best way to improve this behavior for some use-cases.
packages/grpc-js/src/channel.ts
```
@@ -288,6 +288,10 @@ export class ChannelImplementation implements Channel {
  close() {
    this.resolvingLoadBalancer.destroy();
    this.updateState(ConnectivityState.SHUTDOWN);

    if (this.subchannelPool !== undefined) {
```
Type-wise, `this.subchannelPool` can't be `undefined`, so this check is redundant.
Fixed in 93f8169.
```ts
    const allSubchannelsUnrefed = this.unrefUnusedSubchannels();

    if (allSubchannelsUnrefed && this.cleanupTimer !== undefined) {
      clearInterval(this.cleanupTimer);
```
The condition should be that the timer is started if there is at least one subchannel in the pool, and stopped if there are no longer any subchannels in the pool. The timer should not be disabled just because a single channel is closed. So, this check should be in `unrefUnusedSubchannels`. That allows a simplification where `unrefUnusedSubchannels` returns nothing, and is public, and is the function that `Channel#close` calls directly.
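For illustration, a minimal sketch of that shape, assuming the structure suggested above and not necessarily the code as finally committed: `Channel#close` just delegates to the pool, and the pool manages its own interval.

```ts
// Sketch only: how Channel#close could look with the suggested structure.
// `subchannelPool` and `unrefUnusedSubchannels` follow the names used in
// this PR; everything else is elided.
class ChannelImplementationSketch {
  constructor(private subchannelPool: { unrefUnusedSubchannels(): void }) {}

  close() {
    // ...existing shutdown steps (destroy load balancer, update state)...
    // No timer bookkeeping here: the pool stops its own cleanup interval
    // when unrefUnusedSubchannels() finds it empty.
    this.subchannelPool.unrefUnusedSubchannels();
  }
}
```

The pool-side counterpart is an `unrefUnusedSubchannels` that clears its own interval once the pool is empty, as sketched earlier in the thread.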
Good one.
Fixed in 5f271de
```ts
  /**
   * A timer of a task performing a periodic subchannel cleanup.
   */
  private cleanupTimer: NodeJS.Timer | undefined;
```
I slightly prefer not allowing `undefined` here, and instead having a separate boolean indicating whether the timer is running. I find it cleaner type-wise to allow as few types as possible. The downside is that the constructor needs to do `cleanupTimer = ...` and `clearInterval(cleanupTimer)`, and starting the timer is slightly more code, but the upside is that you never need to check the value of `cleanupTimer` itself.
I understand that you prefer it to be more explicit, rather than having `undefined` indicate that it's not running? I'd personally prefer to also destroy the reference to the timer once we're no longer running it, as it ideally should not be used any more. I thought we might be more explicit, but destroy the reference too:

```ts
type CleanupTask = { running: true; timer: NodeJS.Timer } | { running: false };
```

WDYT?
The reason I like keeping the timer reference is that it allows the code for disabling the timer to work even if the timer is already disabled, which simplifies the code:

```ts
clearInterval(timer);
running = false;
```
Hi @murgatroid99,
I couldn't find any documentation explaining whether the callback of `setInterval` is garbage collected when `clearInterval` is called on its timer. This makes me uncertain whether we should keep the timer on the instance or remove it. I consulted my colleagues and no one was certain about it either, so I think it might cause more confusion than clarity.
What do you think? Maybe we could leave it as is?
That's a fair point. One alternative would be to do the same thing I suggested in the constructor and replace it with a do-nothing interval and then immediately cancel it, but that kind of removes the benefit I mentioned so I'm OK with keeping it as it is here.
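For reference, a minimal sketch of that alternative, purely illustrative (the class and method names here are invented): the field always holds a real timer, a dummy interval is created and immediately cleared in the constructor, and a separate flag tracks whether cleanup is running, so disabling it never needs a null/undefined check.

```ts
class CleanupTimerSketch {
  // Always a real timer object, so no null/undefined checks are needed.
  private cleanupTimer: NodeJS.Timer;
  private cleanupRunning = false;

  constructor() {
    // Dummy interval, cancelled immediately; it only exists so the field
    // is initialized without widening its type.
    this.cleanupTimer = setInterval(() => {}, 1000);
    clearInterval(this.cleanupTimer);
  }

  startCleanup(task: () => void, intervalMs: number): void {
    if (!this.cleanupRunning) {
      this.cleanupTimer = setInterval(task, intervalMs);
      this.cleanupTimer.unref(); // don't keep the process alive for cleanup
      this.cleanupRunning = true;
    }
  }

  stopCleanup(): void {
    // clearInterval on an already-cleared timer is a no-op, so this is
    // safe to call unconditionally.
    clearInterval(this.cleanupTimer);
    this.cleanupRunning = false;
  }
}
```

As noted above, the PR keeps the nullable field instead, since it is unclear whether a cleared interval's callback (and anything it closes over) is released while a reference to the timer is still held.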
OK, actually, this is a little minor, but can we have the alternate type be `null` instead of `undefined`? I dislike using `undefined` as an explicit type, and `null` would be more consistent with other code in this project. It shouldn't change much; this line can just be `private cleanupTimer: NodeJS.Timer | null = null`.
I changed it to null and merged master; as far as I remember, you made a fix to the macOS builds.
```ts
   * Ensure that the cleanup task is spawned.
   */
  ensureCleanupTask(): void {
    if (this.global === true && this.cleanupTimer === undefined) {
```
`=== true` shouldn't be needed here. `global` is always a boolean.
Fixed in 93f8169.
Thanks for the feedback! We'll rebase on master and address your comments on Monday :)
I don't think you need to rebase. The relevant parts of these files haven't changed since you opened this PR.
@murgatroid99 Thank you for your feedback; this should be ready for another review.
Thank you for your pull request. Before we can look at your contribution, we need to ensure all contributors are covered by a Contributor License Agreement. After the following items are addressed, please respond with a new comment here, and the automated system will re-verify.
Regards,
@thelinuxfoundation I signed it.
@murgatroid99 Could you re-run Kokoro?
Yay, thanks! 😃
I have published this change in grpc-js 0.6.11.
Thanks!
Fixes #1080
This forces a cleanup of subchannels whenever a channel is closed and disables the background task when there are no more refed subchannels. This ensures that if all clients using grpc are closed, no background tasks are left running.
Co-authored-by: Natan Sągol m@merlinnot.com
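For illustration, a sketch of the intended user-visible effect (the target address is a placeholder and error handling is omitted):

```ts
import * as grpc from '@grpc/grpc-js';

// Placeholder target; any client created and then closed behaves the same.
const client = new grpc.Client(
  'localhost:50051',
  grpc.credentials.createInsecure()
);

// ... make calls ...

client.close();
// With this change, closing the last client unrefs its subchannels, clears
// the pool, and stops the periodic cleanup interval, so no gRPC-owned
// timers or subchannels remain allocated afterwards.
```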