-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
When sharing the terminal with child nodes, wait for the children to terminate before exiting ourselves. #6053
Conversation
for the children to terminate before exiting ourselves. This avoids issues where the child changes terminal configuration after our own exit.
Deploy-MSBuild with the Core note at the bottom. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for making this change!
src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs
Show resolved
Hide resolved
src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs
Show resolved
Hide resolved
src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs
Outdated
Show resolved
Hide resolved
src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs
Outdated
Show resolved
Hide resolved
src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs
Show resolved
Hide resolved
src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs
Outdated
Show resolved
Hide resolved
src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs
Outdated
Show resolved
Hide resolved
/// <summary> | ||
/// The amount of time to wait for an out-of-proc node to exit. | ||
/// </summary> | ||
private const int TimeoutForWaitForExit = 30000; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think this this is the right timeout? Node creation takes longer than exiting, I would have thought.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends if we expect it to be common for nodes to fail to shut down when asked to.
If we think in the common case they should be able to shut down, we can keep the long timeout.
If in the common case they won't shut down when asked to and we don't care about the reason, we can use a lower timeout.
With these changes the scenario from microsoft/vstest#2282 (comment) results in this behavior:
Note that every process that gets started ( |
src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs
Outdated
Show resolved
Hide resolved
src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs
Outdated
Show resolved
Hide resolved
@tmds This might make the code more complicated, but can you please log a warning whenever a node has to be forcefully killed after the timeout? Nodes should terminate gracefully when sent the shutdown packet. A node not terminating when asked to is a failure, since the node should terminate running targets / tasks and exit. The warning should contain the node id and process id. A different approach is to not kill the node at all, but log an error including the node id and process id. Then, users can dump the node process and look at its call stacks. And hopefully open an issue on msbuild to get it fixed :) |
src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs
Show resolved
Hide resolved
@@ -761,7 +787,7 @@ public void SendData(INodePacket packet) | |||
#else | |||
_serverToClientStream.WriteAsync(writeStreamBuffer, i, lengthToWrite); | |||
#endif | |||
return; | |||
break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as an FYI, this will conflict with another PR:
https://github.com/dotnet/msbuild/pull/6023/files#diff-f0e57cbf17e0f7fb207f255deb11445afccbb5c7bf0499a24b8065ff057fcdf3L764
I'm happy to resolve conflicts if needed in either PR (depending on whichever one goes first).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also keep in mind @tmds that the logic in SendData here is as follows: for large packets, we break them up in 1 MB chunks, and we write all chunks synchronously, except for the last one. The last chunk is fire-and-forget (the WriteAsync starts the write and returns). For small packets (the vast majority), the write is async and we never wait for the write to finish. In fact we see cases where the next invocation comes in and issues another write before the previous write has finished. We're just getting lucky that the framework seems to be good at serializing this.
In my PR #6023 I optimize memory allocations, which required to make all the writes synchronous (so we can reuse the byte array buffer that we pass to Write). To compensate for that, we made the entire SendData method fire-and-forget.
I haven't look into this PR deeply enough to decide whether one needs to await writing packets to the socket connected to the node process before any shutdown/killing happens. Before my PR 6023 it would be a bit harder to await each async write before killing or shutting down. After my PR you can just await the _packetWriteDrainTask
here: https://github.com/dotnet/msbuild/pull/6023/files#diff-f0e57cbf17e0f7fb207f255deb11445afccbb5c7bf0499a24b8065ff057fcdf3R774 But again, I'm not sure if blocking or awaiting here is needed or not. Can we be in a situation where we still have outstanding packet writes to the node and the node dies? Do we need a try/catch somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, we shouldn't set _isExiting
when we were unable to flush all pending data.
If we get to that point, we assume the node will act on our request to terminate.
And if it doesn't we'll end up waiting TimeoutForWaitForExit
.
Let's look at how this needs to be changed after #6023 is merged.
Adding @dsplaisted as an expert on cross-node communication - this is a very tricky area so needs careful review. Things can break if you look at it too hard. And we need to think of Windows, MacOS and Linux. |
We'll need to decide in which order we want to merge this vs. #6023, because after the first PR goes in the second PR will need some work to adapt and resolve conflicts and re-test. I'm OK to go in later if needed. |
Team triage: |
I've added a test.
How do I get hold of a logger in |
@KirillOsenkov I've removed timeouts on I can add back the timeouts if you consider them necessary on some paths. |
I'd probably count this as related to communication, so I'd use CommunicationsUtilities.Trace. It won't show up in the normal log, but it would show up if you turn on comm traces. It won't show up as an actual warning, but I'm not sure how to do that from here. |
src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good!
src/Build/BackEnd/Components/Communications/NodeProviderOutOfProcBase.cs
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM pending making the warning string localized. In theory this is a breaking change. Up until now an msbuild node that received a shutdown message kept on executing long running user tasks while the main node disconnected async. After this change, long running user tasks in out of proc nodes will be killed if they don't finish within 30 seconds of the node receiving a shutdown message. I think it's a benign breakage and won't affect anybody, but worst case it will and then we'll have to redo this.
4b17058
to
1b0177c
Compare
7f7375c
to
6f67a09
Compare
@Forgind @KirillOsenkov ptal at the recent changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks for all your work on this!
I wanted to verify this once more using the deploy script, but it is no longer working for me. My patched sdk crashes with:
I see there is a |
We hit that internally, too. Our deploy-msbuild copies files over but for .NET Core, dependencies have to be listed in MSBuild.deps.json as well. If you're patching a an older SDK without StringTools this dependency is missing and runtime is not even trying to find the assembly. I put an updated MSBuild.deps.json below. (Credit to ladipro and rokonec) |
@Forgind thanks. I'm not sure if I did something wrong, but I still got an exception. I managed to get past it by using another SDK at that step.
As shown in the trace: every process that gets started ( |
Sounds good! We'll merge this pretty soon. |
This is an attempt to fix microsoft/vstest#2282.
@Forgind ptal.
How can I use the build output of this repository and combine it with an sdk so I can try out some things?
cc @rainersigwald