Improve performance of the runtime state dump #399
Conversation
Dumping the orchestration state for the telemetry event type TaskOrchestrationDispatcher-ExecuteUserOrchestration-Begin is expensive. It scales linearly with the number of history events, meaning orchestrations with large histories start to spend most of their time writing this verbose telemetry item, which is not ideal. Instead, we now just log the number of events, as this is a constant-time operation.
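In rough terms, the change trades the linear-time dump for a constant-time count. The sketch below is illustrative only: the message text and the Events property access are assumptions rather than the exact code in this PR, though the TraceHelper.TraceInstance call shape matches the snippet quoted later in this thread.

// Sketch: log only the event count (constant time) instead of serializing
// the full history dump (linear in history size).
TraceHelper.TraceInstance(
    TraceEventType.Verbose,
    "TaskOrchestrationDispatcher-ExecuteUserOrchestration-Begin",
    runtimeState.OrchestrationInstance,
    "Executing user orchestration with {0} history events",
    runtimeState.Events.Count);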
Wow, I have known about this verbose tracing for a while but had not gone the next step to see whether it might be impactful! Thanks for finding and making this fix. To avoid impacting anyone who might depend on this behavior, I suggest we employ conditional compilation. For …
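A minimal sketch of what that conditional compilation could look like (illustrative only; the DEBUG symbol, the Events property, and the parameterless GetOrchestrationRuntimeStateDump() call are assumptions, not the code that was ultimately merged):

#if DEBUG
// Debug bits: keep the full, verbose state dump (linear in history size).
string stateDump = DataConverter.Serialize(runtimeState.GetOrchestrationRuntimeStateDump(), true);
#else
// Release bits: log only the event count (constant time).
string stateDump = string.Format("Event count: {0}", runtimeState.Events.Count);
#endif
TraceHelper.TraceInstance(
    TraceEventType.Verbose,
    "TaskOrchestrationDispatcher-ExecuteUserOrchestration-Begin",
    runtimeState.OrchestrationInstance,
    "Executing user orchestration: {0}",
    stateDump);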
Should the verbose tracing for DurableTask.Core even be enabled by default? I am not sure we need it. If it is not enabled by default, this whole problem has an easy solution: check the verbosity level before doing the expensive formatting, e.g. as in

if (TraceHelper.IsVerboseEnabled)
{
    // Pay the serialization cost only when verbose tracing is actually on.
    TraceHelper.TraceInstance(
        TraceEventType.Verbose,
        "TaskOrchestrationDispatcher-ExecuteUserOrchestration-Begin",
        runtimeState.OrchestrationInstance,
        "Executing user orchestration: {0}",
        DataConverter.Serialize(runtimeState.GetOrchestrationRuntimeStateDump(this.efficientTelemetry), true));
}
Unfortunately, it looks like ETW doesn't give you very good data on whether the EventSource is enabled unless you have a specific listener attached. See here. Since TraceHelper's … I think that, as a quick workaround in the meantime, I will go with Chris's suggestion.
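For context, the guard being discussed would look roughly like the following. It uses the standard System.Diagnostics.Tracing API; durableTaskEventSource is a hypothetical stand-in for whatever EventSource instance backs TraceHelper, not an actual DurableTask.Core type. The catch described above is that with ETW the result reflects whatever session or listener happens to be attached.

using System.Diagnostics.Tracing;

// Hypothetical EventSource instance; the name is for illustration only.
if (durableTaskEventSource.IsEnabled(EventLevel.Verbose, EventKeywords.All))
{
    // Do the expensive formatting only when a verbose consumer is attached.
}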
As far as I understand, the general idea (for both EventSource and ILogger) is that the log level is determined by the subscribed consumers, not by the producer. I think we should try to follow this standard pattern where possible. In this particular case I would suggest that, for our DF-hosted scenarios, we stop collecting verbose tracing from DurableTask.Core. I don't think it gets stored in the Kusto logs anyway, so I am not sure why we should generate those events.
I believe you are correct that this behavior is determined by the consumer. The problem I think we're facing is that we do not control the consumer. Rather, the consumer is a piece of infrastructure that runs on the Azure VMs (we call it "MDS"). We explicitly ask it to listen to …
I see. In that case perhaps the next best solution is to extend TraceHelper so that a lower limit on the tracing level can be specified by a configuration parameter. This is what I did for the EventSourced backend. We don't have a nice way to pass configuration parameters right now, though; I was discussing that with @ConnorMcMahon last week.
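A hypothetical sketch of that suggestion; MinimumTraceLevel and ShouldTrace are made-up names and do not exist in DurableTask.Core today:

// Members that could be added to TraceHelper. The host would set the floor
// from a configuration parameter at startup.
public static TraceEventType MinimumTraceLevel { get; set; } = TraceEventType.Information;

public static bool ShouldTrace(TraceEventType eventType)
{
    // For the basic severity levels (Critical..Verbose), lower enum values
    // are more severe, so this returns true only for events at least as
    // severe as the configured minimum.
    return eventType <= MinimumTraceLevel;
}

Call sites could then guard the expensive dump with if (TraceHelper.ShouldTrace(TraceEventType.Verbose)) before doing any serialization.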
@sebastianburckhardt Sorry to jump in here (hi @cgillum), but any chance of a private preview/alpha for Microsoft.Azure.DurableTask.EventSourced with durable entities? I'm trying to make a critical-path decision about whether to move to Orleans in the next 6 months... :P
Switched to Orleans. |
Dumping the orchestration state for the telemetry event type TaskOrchestrationDispatcher-ExecuteUserOrchestration-Begin is expensive. It scales linearly with the number of history events, meaning orchestrations with large histories start to spend most of their time writing this verbose telemetry item, which is not ideal. Instead, we now just log the number of events for release bits, as this is a constant-time operation.