Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

When sharing the terminal with child nodes, wait for the children to terminate before exiting ourselves. #6053

Merged
merged 19 commits into from
Feb 25, 2021

Conversation

tmds
Copy link
Member

@tmds tmds commented Jan 21, 2021

This is an attempt to fix microsoft/vstest#2282.

@Forgind ptal.

How can I use the build output of this repository and combine it with an sdk so I can try out some things?

cc @rainersigwald

for the children to terminate before exiting ourselves.

This avoids issues where the child changes terminal configuration
after our own exit.
@Forgind
Copy link
Member

Forgind commented Jan 21, 2021

Deploy-MSBuild with the Core note at the bottom.

Copy link
Member

@Forgind Forgind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making this change!

@tmds
Copy link
Member Author

tmds commented Jan 25, 2021

Deploy-MSBuild with the Core note at the bottom.

@Forgind @cdmihai I'm giving it a try and it is not working as expected yet. I'll be making some more changes.

/// <summary>
/// The amount of time to wait for an out-of-proc node to exit.
/// </summary>
private const int TimeoutForWaitForExit = 30000;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this this is the right timeout? Node creation takes longer than exiting, I would have thought.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends if we expect it to be common for nodes to fail to shut down when asked to.
If we think in the common case they should be able to shut down, we can keep the long timeout.
If in the common case they won't shut down when asked to and we don't care about the reason, we can use a lower timeout.

scripts/Deploy-MSBuild.ps1 Show resolved Hide resolved
@tmds
Copy link
Member Author

tmds commented Jan 26, 2021

With these changes the scenario from microsoft/vstest#2282 (comment) results in this behavior:

execve("/tmp/dotnet-5.0/dotnet", ["dotnet", "test"], 0x7ffe69586fa8 /* 62 vars */) = 0
[pid 115692] execve("/tmp/dotnet-5.0/dotnet", ["dotnet", "exec", "/tmp/dotnet-5.0/sdk/5.0.102/MSBu"..., "-maxcpucount", "-verbosity:m", "-restore", "-target:VSTest", "-nodereuse:false", "-nologo", "-distributedlogger:Microsoft.Dot"...], 0x7f2d08022e00 /* 67 vars */) = 0
[pid 115716] execve("/tmp/dotnet-5.0/dotnet", ["/tmp/dotnet-5.0/dotnet", "/tmp/dotnet-5.0/sdk/5.0.102/MSBu"..., "/tmp/dotnet-5.0/sdk/5.0.102/MSBu"..., "/nologo", "/nodemode:1", "/nodeReuse:false", "/low:false"], 0x7f1c28065ac0 /* 68 vars */) = 0
[pid 115741] execve("/tmp/dotnet-5.0/dotnet", ["/tmp/dotnet-5.0/dotnet", "/tmp/dotnet-5.0/sdk/5.0.102/MSBu"..., "/tmp/dotnet-5.0/sdk/5.0.102/MSBu"..., "/nologo", "/nodemode:1", "/nodeReuse:false", "/low:false"], 0x7f1bfc000f60 /* 68 vars */) = 0
  Determining projects to restore...
  All projects are up-to-date for restore.
  Project2 -> /tmp/repro/TestTerminalIssues/Project2/bin/Debug/net5.0/Project2.dll
Test run for /tmp/repro/TestTerminalIssues/Project2/bin/Debug/net5.0/Project2.dll (.NETCoreApp,Version=v5.0)
[pid 115808] execve("/tmp/dotnet-5.0/dotnet", ["dotnet", "exec", "/tmp/dotnet-5.0/sdk/5.0.102/vste"..., "--testAdapterPath:/home/tmds/.nu"..., "--framework:.NETCoreApp,Version="..., "/tmp/repro/TestTerminalIssues/Pr"...], 0x7f8888003400 /* 67 vars */) = 0
  Project1 -> /tmp/repro/TestTerminalIssues/Project1/bin/Debug/net5.0/Project1.dll
Test run for /tmp/repro/TestTerminalIssues/Project1/bin/Debug/net5.0/Project1.dll (.NETCoreApp,Version=v5.0)
[pid 115816] execve("/tmp/dotnet-5.0/dotnet", ["dotnet", "exec", "/tmp/dotnet-5.0/sdk/5.0.102/vste"..., "--testAdapterPath:/home/tmds/.nu"..., "--framework:.NETCoreApp,Version="..., "/tmp/repro/TestTerminalIssues/Pr"...], 0x7f1bc8001e40 /* 68 vars */) = 0
Microsoft (R) Test Execution Command Line Tool Version 16.8.3
Copyright (c) Microsoft Corporation.  All rights reserved.

Microsoft (R) Test Execution Command Line Tool Version 16.8.3
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...
A total of 1 test files matched the specified pattern.
Starting test execution, please wait...
A total of 1 test files matched the specified pattern.
[pid 115846] execve("/tmp/dotnet-5.0/dotnet", ["/tmp/dotnet-5.0/dotnet", "exec", "--runtimeconfig", "/tmp/repro/TestTerminalIssues/Pr"..., "--depsfile", "/tmp/repro/TestTerminalIssues/Pr"..., "/home/tmds/.nuget/packages/micro"..., "--port", "45027", "--endpoint", "127.0.0.1:045027", "--role", "client", "--parentprocessid", "115808", "--telemetryoptedin", "false"], 0x7f5a4c0090f0 /* 67 vars */) = 0
[pid 115850] execve("/tmp/dotnet-5.0/dotnet", ["/tmp/dotnet-5.0/dotnet", "exec", "--runtimeconfig", "/tmp/repro/TestTerminalIssues/Pr"..., "--depsfile", "/tmp/repro/TestTerminalIssues/Pr"..., "/home/tmds/.nuget/packages/micro"..., "--port", "44499", "--endpoint", "127.0.0.1:044499", "--role", "client", "--parentprocessid", "115816", "--telemetryoptedin", "false"], 0x7fa7d8007f40 /* 68 vars */) = 0
[pid 115841] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=115846, si_uid=1000, si_status=0, si_utime=126, si_stime=18} ---
[pid 115835] wait4(115846, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 115846

Passed!  - Failed:     0, Passed:     1, Skipped:     0, Total:     1, Duration: 2 ms - /tmp/repro/TestTerminalIssues/Project2/bin/Debug/net5.0/Project2.dll (net5.0)
[pid 115759] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=115808, si_uid=1000, si_status=0, si_utime=76, si_stime=14} ---
[pid 115731] wait4(115808, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 115808
[pid 115843] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=115850, si_uid=1000, si_status=0, si_utime=124, si_stime=23} ---
[pid 115838] wait4(115850, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 115850

Passed!  - Failed:     0, Passed:     1, Skipped:     0, Total:     1, Duration: 5 ms - /tmp/repro/TestTerminalIssues/Project1/bin/Debug/net5.0/Project1.dll (net5.0)
[pid 115769] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=115816, si_uid=1000, si_status=0, si_utime=86, si_stime=14} ---
[pid 115701] wait4(115816, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 115816
[pid 115705] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=115741, si_uid=1000, si_status=0, si_utime=182, si_stime=39} ---
[pid 115701] wait4(115741, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 115741
[pid 115705] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=115716, si_uid=1000, si_status=0, si_utime=293, si_stime=61} ---
[pid 115701] wait4(115716, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 115716
[pid 115672] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=115692, si_uid=1000, si_status=0, si_utime=377, si_stime=96} ---
[pid 115691] wait4(115692, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 115692

Note that every process that gets started (exec) is explicitly waited for (wait4).
This PR makes it happen for the msbuild node processes: 115716, 115741.

src/Utilities/ProcessExtensions.cs Outdated Show resolved Hide resolved
src/Build/Microsoft.Build.csproj Outdated Show resolved Hide resolved
src/Utilities/ProcessExtensions.cs Show resolved Hide resolved
@cdmihai
Copy link
Contributor

cdmihai commented Jan 28, 2021

@tmds This might make the code more complicated, but can you please log a warning whenever a node has to be forcefully killed after the timeout? Nodes should terminate gracefully when sent the shutdown packet. A node not terminating when asked to is a failure, since the node should terminate running targets / tasks and exit. The warning should contain the node id and process id.

A different approach is to not kill the node at all, but log an error including the node id and process id. Then, users can dump the node process and look at its call stacks. And hopefully open an issue on msbuild to get it fixed :)

@@ -761,7 +787,7 @@ public void SendData(INodePacket packet)
#else
_serverToClientStream.WriteAsync(writeStreamBuffer, i, lengthToWrite);
#endif
return;
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just as an FYI, this will conflict with another PR:
https://github.com/dotnet/msbuild/pull/6023/files#diff-f0e57cbf17e0f7fb207f255deb11445afccbb5c7bf0499a24b8065ff057fcdf3L764

I'm happy to resolve conflicts if needed in either PR (depending on whichever one goes first).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also keep in mind @tmds that the logic in SendData here is as follows: for large packets, we break them up in 1 MB chunks, and we write all chunks synchronously, except for the last one. The last chunk is fire-and-forget (the WriteAsync starts the write and returns). For small packets (the vast majority), the write is async and we never wait for the write to finish. In fact we see cases where the next invocation comes in and issues another write before the previous write has finished. We're just getting lucky that the framework seems to be good at serializing this.

In my PR #6023 I optimize memory allocations, which required to make all the writes synchronous (so we can reuse the byte array buffer that we pass to Write). To compensate for that, we made the entire SendData method fire-and-forget.

I haven't look into this PR deeply enough to decide whether one needs to await writing packets to the socket connected to the node process before any shutdown/killing happens. Before my PR 6023 it would be a bit harder to await each async write before killing or shutting down. After my PR you can just await the _packetWriteDrainTask here: https://github.com/dotnet/msbuild/pull/6023/files#diff-f0e57cbf17e0f7fb207f255deb11445afccbb5c7bf0499a24b8065ff057fcdf3R774 But again, I'm not sure if blocking or awaiting here is needed or not. Can we be in a situation where we still have outstanding packet writes to the node and the node dies? Do we need a try/catch somewhere?

Copy link
Member Author

@tmds tmds Feb 2, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we shouldn't set _isExiting when we were unable to flush all pending data.
If we get to that point, we assume the node will act on our request to terminate.
And if it doesn't we'll end up waiting TimeoutForWaitForExit.

Let's look at how this needs to be changed after #6023 is merged.

@KirillOsenkov
Copy link
Member

Adding @dsplaisted as an expert on cross-node communication - this is a very tricky area so needs careful review. Things can break if you look at it too hard. And we need to think of Windows, MacOS and Linux.

@KirillOsenkov
Copy link
Member

We'll need to decide in which order we want to merge this vs. #6023, because after the first PR goes in the second PR will need some work to adapt and resolve conflicts and re-test. I'm OK to go in later if needed.

@Forgind
Copy link
Member

Forgind commented Feb 1, 2021

Team triage:
Would you mind adding a unit test and a warning message when you're forced to kill a node?

@tmds
Copy link
Member Author

tmds commented Feb 2, 2021

adding a unit test

I've added a test.

and a warning message when you're forced to kill a node?

How do I get hold of a logger in NodeProviderOutOfProcBase?

@tmds
Copy link
Member Author

tmds commented Feb 2, 2021

@KirillOsenkov I've removed timeouts on WaitForExit calls. These timeouts are not needed because the processes that are waited for are expected to terminate in a timely fashion. In case the processes take a bit longer and the timeout is triggered we end up in rare cases like the one causing #5506. By removing timeouts we avoid that.

I can add back the timeouts if you consider them necessary on some paths.

@Forgind
Copy link
Member

Forgind commented Feb 2, 2021

How do I get hold of a logger in NodeProviderOutOfProcBase?

I'd probably count this as related to communication, so I'd use CommunicationsUtilities.Trace. It won't show up in the normal log, but it would show up if you turn on comm traces. It won't show up as an actual warning, but I'm not sure how to do that from here.

Copy link
Member

@Forgind Forgind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good!

src/Shared/ProcessExtensions.cs Show resolved Hide resolved
src/Shared/ProcessExtensions.cs Outdated Show resolved Hide resolved
Copy link
Contributor

@cdmihai cdmihai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM pending making the warning string localized. In theory this is a breaking change. Up until now an msbuild node that received a shutdown message kept on executing long running user tasks while the main node disconnected async. After this change, long running user tasks in out of proc nodes will be killed if they don't finish within 30 seconds of the node receiving a shutdown message. I think it's a benign breakage and won't affect anybody, but worst case it will and then we'll have to redo this.

@KirillOsenkov
Copy link
Member

Hi @tmds apologies that my PR came in first.

I have created a merge PR for your PR that merges it with latest master, to help solve the conflict in NodeProviderOutOfProcBase.cs:
tmds#1

Hope it helps.

@tmds tmds force-pushed the wait_for_child_nodes branch from 4b17058 to 1b0177c Compare February 11, 2021 14:28
@tmds tmds force-pushed the wait_for_child_nodes branch from 7f7375c to 6f67a09 Compare February 11, 2021 15:44
@tmds
Copy link
Member Author

tmds commented Feb 11, 2021

@Forgind @KirillOsenkov ptal at the recent changes.

Copy link
Member

@Forgind Forgind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for all your work on this!

@Forgind Forgind added the merge-when-branch-open PRs that are approved, except that there is a problem that means we are not merging stuff right now. label Feb 11, 2021
@tmds
Copy link
Member Author

tmds commented Feb 12, 2021

I wanted to verify this once more using the deploy script, but it is no longer working for me.

My patched sdk crashes with:

MSBUILD : error MSB1025: An internal failure occurred while running MSBuild.
System.IO.FileNotFoundException: Could not load file or assembly 'Microsoft.NET.StringTools, Version=1.0.0.0, Culture=neutral, PublicKeyToken=b03f5f7f11d50a3a'. The system cannot find the file specified.

I see there is a ./sdk/5.0.102/Microsoft.NET.StringTools.dll file in the SDK. I don't know why it is not found.

@Forgind
Copy link
Member

Forgind commented Feb 12, 2021

We hit that internally, too. Our deploy-msbuild copies files over but for .NET Core, dependencies have to be listed in MSBuild.deps.json as well. If you're patching a an older SDK without StringTools this dependency is missing and runtime is not even trying to find the assembly.

I put an updated MSBuild.deps.json below.

(Credit to ladipro and rokonec)

MSBuild.deps.zip

@tmds
Copy link
Member Author

tmds commented Feb 15, 2021

@Forgind thanks. I'm not sure if I did something wrong, but I still got an exception. I managed to get past it by using another SDK at that step. dotnet test ran without issue on my patched SDK and gave this trace:

[pid 18247] execve("/tmp/dotnet-5/dotnet", ["dotnet", "exec", "/tmp/dotnet-5/sdk/5.0.102/MSBuil"..., "-maxcpucount", "-verbosity:m", "-restore", "-target:VSTest", "-nodereuse:false", "-nologo", "-distributedlogger:Microsoft.Dot"...], 0x5616103eee80 /* 65 vars */) = 0
[pid 18272] execve("/tmp/dotnet-5/dotnet", ["/tmp/dotnet-5/dotnet", "/tmp/dotnet-5/sdk/5.0.102/MSBuil"..., "/tmp/dotnet-5/sdk/5.0.102/MSBuil"..., "/nologo", "/nodemode:1", "/nodeReuse:false", "/low:false"], 0x7f6334066cc0 /* 66 vars */) = 0
[pid 18291] execve("/tmp/dotnet-5/dotnet", ["/tmp/dotnet-5/dotnet", "/tmp/dotnet-5/sdk/5.0.102/MSBuil"..., "/tmp/dotnet-5/sdk/5.0.102/MSBuil"..., "/nologo", "/nodemode:1", "/nodeReuse:false", "/low:false"], 0x7f6334066cc0 /* 66 vars */) = 0
[pid 18351] execve("/tmp/dotnet-5/dotnet", ["dotnet", "exec", "/tmp/dotnet-5/sdk/5.0.102/vstest"..., "--testAdapterPath:/home/tmds/.nu"..., "--framework:.NETCoreApp,Version="..., "/tmp/TestTerminalIssues/Project1"...], 0x7f62f4009740 /* 66 vars */) = 0
[pid 18353] execve("/tmp/dotnet-5/dotnet", ["dotnet", "exec", "/tmp/dotnet-5/sdk/5.0.102/vstest"..., "--testAdapterPath:/home/tmds/.nu"..., "--framework:.NETCoreApp,Version="..., "/tmp/TestTerminalIssues/Project2"...], 0x5574b0bc9d30 /* 65 vars */) = 0
[pid 18384] execve("/tmp/dotnet-5/dotnet", ["/tmp/dotnet-5/dotnet", "exec", "--runtimeconfig", "/tmp/TestTerminalIssues/Project2"..., "--depsfile", "/tmp/TestTerminalIssues/Project2"..., "/home/tmds/.nuget/packages/micro"..., "--port", "40365", "--endpoint", "127.0.0.1:040365", "--role", "client", "--parentprocessid", "18353", "--telemetryoptedin", "false"], 0x7fadcc0069e0 /* 65 vars */) = 0
[pid 18390] execve("/tmp/dotnet-5/dotnet", ["/tmp/dotnet-5/dotnet", "exec", "--runtimeconfig", "/tmp/TestTerminalIssues/Project1"..., "--depsfile", "/tmp/TestTerminalIssues/Project1"..., "/home/tmds/.nuget/packages/micro"..., "--port", "41041", "--endpoint", "127.0.0.1:041041", "--role", "client", "--parentprocessid", "18351", "--telemetryoptedin", "false"], 0x7fc310006890 /* 66 vars */) = 0
[pid 18370] wait4(18390, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 18390
[pid 18371] wait4(18384, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 18384
[pid 18260] wait4(18351, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 18351
[pid 18287] wait4(18353, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 18353
[pid 18260] wait4(18291, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 18291
[pid 18260] wait4(18272, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 18272
[pid 18246] wait4(18247, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 18247

As shown in the trace: every process that gets started (execve) gets waited for (wait4).

@Forgind
Copy link
Member

Forgind commented Feb 19, 2021

Sounds good! We'll merge this pretty soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-when-branch-open PRs that are approved, except that there is a problem that means we are not merging stuff right now.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

'dotnet test' sometimes causes terminal to stop echoing after completion (ubuntu)
5 participants