Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't listen for the Console_CancelKeyPress #1323

Closed
wants to merge 1 commit into from

Conversation

andrewlock
Copy link
Member

Flushing and closing the tracer on cancel key press can lead to flushing the buffer before the app has finished writing spans. CurrentDomain_ProcessExit should always fire and flush, so flushing twice is unnecessary

As an example repro, create a new service worker using dotnet new worker, and update worker.cs to:

public class Worker : BackgroundService
{
    private readonly ILogger<Worker> _logger;
    public Worker(ILogger<Worker> logger, IHostApplicationLifetime lifetime)
    {
        _logger = logger;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        var client = new HttpClient();
        using (var scope = Tracer.Instance.StartActive("background-service")); // this trace is closed AFTER the tracer is flushed
        {
            while (!stoppingToken.IsCancellationRequested)
            {
                var response = await client.GetStringAsync("https://jsonplaceholder.typicode.com/todos/1");
                _logger.LogInformation("Received Json");
                await Task.Delay(100, stoppingToken);
            }
        }
    }
}

Run the app, see the requests being made, and hit ctrl-c. The traces are not sent, as the background-service operation is closed after the Tracer is close and flushed

@DataDog/apm-dotnet

Flushing and closing the tracer on cancel key press can lead to flushing the buffer before the app has finished writing spans. CurrentDomain_ProcessExit should always fire, so flushing twice is unnecessary
@andrewlock andrewlock requested a review from a team as a code owner March 29, 2021 16:30
@lucaspimentel
Copy link
Member

If we do this, we should probably remove it from Datadog.Trace.Tools.Runner as well. WDYT, @tonyredondo?

@tonyredondo
Copy link
Member

If we do this, we should probably remove it from Datadog.Trace.Tools.Runner as well. WDYT, @tonyredondo?

I did a simple test and the behavior changes:

The current code with some Console.WriteLine...

image

Behavior:

image

But if I stop listeing the Ctrl+C handler and comment the code:

image

Behavior:

image

Here CurrentDomain_ProcessExit doesn't get called when I press CTRL+C.

So I cannot remove it. I don't know why this doesn't happen in the tracer... WDYT, @andrewlock ?

@tonyredondo
Copy link
Member

Also, If I don't cancel the tokenSource in the Console_CancelKeyPress event, then it waits until the child process ends. Don't cancel anything.

Code:
image

Behavior:
image

@tonyredondo
Copy link
Member

I wonder then if this behavior can also happen in the tracer, because then we should handle both calls and prepare the tracer for both calls, 1 only flush, 1 flush + close.

@andrewlock
Copy link
Member Author

Here CurrentDomain_ProcessExit doesn't get called when I press CTRL+C.

Unless I'm misunderstanding, this seems really weird @tonyredondo? Why wouldn't the CurrentDomain_ProcessExit event fire, just because you've removed the Console_CancelKeyPress handler?

we should handle both calls and prepare the tracer for both calls, 1 only flush, 1 flush + close.

This seems like a reasonable approach anyway tbh. i.e. keep the Console_CancelKeyPress handler, but change it to call FlushTracesAsync instead of FlushAndCloseAsync. I can't see a downside to that approach, assuming that we will never get a Console_CancelKeyPress without CurrentDomain_ProcessExit on process shutdown (I don't believe we will)

@tonyredondo
Copy link
Member

Unless I'm misunderstanding, this seems really weird @tonyredondo? Why wouldn't the CurrentDomain_ProcessExit event fire, just because you've removed the Console_CancelKeyPress handler?

It doesn't fire If I hit CTRL+C. In a normal exit it will.
I actually have to look about this behavior, I suppose is something between one listening to SIGTERM and the other to SIGKILL. But I need to check the runtime code.

Although I saw this:

  1. Gracefully handling Ctrl+C ignores Program Main exit code. dotnet/runtime#27595
  2. ConsoleLifetime doesn't allow full graceful shutdown for ProcessExit dotnet/runtime#35990

For (1) I think is a problem with the dotnet CLI related to this issue (dotnet/sdk#4779)

I don't know if (2) affect us.

This seems like a reasonable approach anyway tbh. i.e. keep the Console_CancelKeyPress handler, but change it to call FlushTracesAsync instead of FlushAndCloseAsync. I can't see a downside to that approach, assuming that we will never get a Console_CancelKeyPress without CurrentDomain_ProcessExit on process shutdown (I don't believe we will)

👍🏻

@tonyredondo
Copy link
Member

tonyredondo commented Mar 30, 2021

So:

  1. SIGTERM is handled by AppDomain.CurrentDomain.ProcessExit
  2. SIGINT and SIGQUIT is handled by Console.CancelKeyPress
  3. SIGKILL can't be caught or ignored.

dotnet/runtime#4950 (comment)

Source code:

  1. For the Console.CancelKeyPress: https://github.com/dotnet/runtime/blob/79ae74f5ca5c8a6fe3a48935e85bd7374959c570/src/libraries/Native/Unix/System.Native/pal_signal.c#L109-L122 : where g_ctrlCallback is registered here called from here and called from here and here
  2. Follow AppDomain.ProcessExit is a bit more complicated, but it raises from the native side here and called from the finalizer thread here. That line is executed when a shutdown is requested (_ASSERTE), and the request is fulfilled on normal shutdowns or forced shutdowns controlled by the EEPolicy::HandleExitProcess (here) and that ForceEEShutdown is called from the ExceptionHandling (here). And finally we got the SIGTERM reference here

@andrewlock
Copy link
Member Author

Ah, so in a .NET Core app, they handle the console key press and do a normal exit, so that might explain the difference with what we're seeing I guess?

Yea there was a time when SIGTERM was not handled. There was some other bug related to handling the CancelKeyPress event, I don't remember the details. You can ignore these.

Urgh, going to be so many edge-cases here...

So what's the danger if we only call FlushTracesAsync, and don't call FlushAndCloseAsync (i.e. if we get a SIGINT/SIGQUIT without the associated SIGTERM)?

I'm also wondering how this would work in (for example) Kubernetes. There the container gets SIGTERM-ed and then SIGKILL-ed after x seconds. It's possible (and in some cases, necessary) to extend that duration significantly.

As I understand it, in that case, we would get the AppDomain.CurrentDomain.ProcessExit with the SIGTERM, and then keep receiving messages for 30s, before the app finally exits. And we won't get any notification of that closing by default (ASP.NET Core has IHostApplicationLifetime that you can hook into, but that's not an easy solution for us)

@tonyredondo
Copy link
Member

tonyredondo commented Mar 30, 2021

Ah, so in a .NET Core app, they handle the console key press and do a normal exit, so that might explain the difference with what we're seeing I guess?

Yeah I think this can be caused by a bug in the .NET CLI.

Yea there was a time when SIGTERM was not handled. There was some other bug related to handling the CancelKeyPress event, I don't remember the details. You can ignore these.

Urgh, going to be so many edge-cases here...

S C A R Y

So what's the danger if we only call FlushTracesAsync, and don't call FlushAndCloseAsync (i.e. if we get a SIGINT/SIGQUIT without the associated SIGTERM)?

I think we should do that, FlushTracesAsync for SIGINT/SIGQUIT. and FlushAndCloseAsync for SIGTERM.
The Console.CancelKeyPress eventargs has the following:

        //
        // Summary:
        //     Gets or sets a value that indicates whether simultaneously pressing the System.ConsoleModifiers.Control
        //     modifier key and the System.ConsoleKey.C console key (Ctrl+C) or the Ctrl+Break
        //     keys terminates the current process. The default is false, which terminates the
        //     current process.
        //
        // Returns:
        //     true if the current process should resume when the event handler concludes; false
        //     if the current process should terminate. The default value is false; the current
        //     process terminates when the event handler returns. If true, the current process
        //     continues.
        public bool Cancel { get; set; }

I think we can control the behavior by setting this to true and doing a grateful shutdown if we want.

As I understand it, in that case, we would get the AppDomain.CurrentDomain.ProcessExit with the SIGTERM, and then keep receiving messages for 30s, before the app finally exits.

Maybe I'm missing something but if we are already handling the SIGTERM we just can end the process before receiving the SIGKILL signal. Isn't the SIGKILL signal being sent only if the pid is still alive?

@andrewlock
Copy link
Member Author

I think we can control the behavior by setting this to true and doing a grateful shutdown if we want.

That seems like a dangerous thing for us to be messing with in people's apps though 😟 I feel like we're just stuck with handling the various combinations that could happen.

Maybe I'm missing something but if we are already handling the SIGTERM we just can end the process before receiving the SIGKILL signal. Isn't the SIGKILL signal being sent only if the pid is still alive?

The point is that we will get the SIGTERM but the app won't stop handling requests for (in my example) 30s. That means we will lose those 30s of traces while the aspnetcore app is still running, before it exits gracefully. (SIGKILL doesn't matter here, we won't see it if config is correct and app closes gracefully before it's killed).

e.g. Timeline of events:

Time, s Event
-1 Kubernetes starts a new deployment
0 Kubernetes sends a SIGTERM to the old pod
^ Tracer detects the SIGTERM, and performs flush and close
^ ASP.NET Core detects SIGTERM and starts shutting down, calling IHostApplicationLifetime.Stopping
^ A handler attached to the Stopping event blocks for 30s, to give time for the Ingress to drain requests from the pod. Traffic is still handled, but we won't be flushing any traces
30 The event handler exits, allowing aspnetcore to finish shutdown
31 The app exits normally
- (SIGTERM is never called as the app exited within a 60s threshold set on the pods (for example)

@tonyredondo
Copy link
Member

That seems like a dangerous thing for us to be messing with in people's apps though 😟 I feel like we're just stuck with handling the various combinations that could happen.

Yes, you're right, I agree we shouldn't be playing with the customer app behavior.

The point is that we will get the SIGTERM but the app won't stop handling requests for (in my example) 30s. That means we will lose those 30s of traces while the aspnetcore app is still running, before it exits gracefully. (SIGKILL doesn't matter here, we won't see it if config is correct and app closes gracefully before it's killed).

Oh now I get it!. Yes we have a problem here... hooking to IHostApplicationLifetime not gonna be easy from our side.

@andrewlock
Copy link
Member Author

Closing this for now as this is overly simplistic and ignores other edge cases. @tonyredondo is looking at an alternative, more robust, approach playing with the ProcessExit delegate list... hopefully that pans out! 🤞

@andrewlock andrewlock closed this Apr 1, 2021
@andrewlock andrewlock deleted the andrew/remove-console-cancel-event branch April 1, 2021 13:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants