Improve performance of the runtime state dump #399
Conversation
Dumping the orchestration state for the telemetry event type TaskOrchestrationDispatcher-ExecuteUserOrchestration-Begin is expensive. It scales linearly with the number of history events, meaning orchestrations with large histories start to spend most of their time writing this verbose telemetry item, which is not ideal. Instead, we now just log the number of events, as this is a constant-time operation.
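In rough terms, the change trades the linear-time dump for a constant-time count. The sketch below is illustrative only: the message text and the Events property access are assumptions rather than the exact code in this PR, though the TraceHelper.TraceInstance call shape matches the snippet quoted later in this thread.

// Sketch: log only the event count (constant time) instead of serializing
// the full history dump (linear in history size).
TraceHelper.TraceInstance(
    TraceEventType.Verbose,
    "TaskOrchestrationDispatcher-ExecuteUserOrchestration-Begin",
    runtimeState.OrchestrationInstance,
    "Executing user orchestration with {0} history events",
    runtimeState.Events.Count);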
Wow, I have known about this verbose tracing for a while but had not gone the next step to see whether it might be impactful! Thanks for finding and making this fix. To avoid impacting anyone who might depend on this behavior, I suggest we employ conditional compilation. For …
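A minimal sketch of what that conditional compilation could look like (illustrative only; the DEBUG symbol, the Events property, and the parameterless GetOrchestrationRuntimeStateDump() call are assumptions, not the code that was ultimately merged):

#if DEBUG
// Debug bits: keep the full, verbose state dump (linear in history size).
string stateDump = DataConverter.Serialize(runtimeState.GetOrchestrationRuntimeStateDump(), true);
#else
// Release bits: log only the event count (constant time).
string stateDump = string.Format("Event count: {0}", runtimeState.Events.Count);
#endif
TraceHelper.TraceInstance(
    TraceEventType.Verbose,
    "TaskOrchestrationDispatcher-ExecuteUserOrchestration-Begin",
    runtimeState.OrchestrationInstance,
    "Executing user orchestration: {0}",
    stateDump);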
Should the verbose tracing for DurableTask.Core even be enabled by default? I am not sure we need it. If it is not enabled by default, this whole problem has an easy solution: check the verbosity level before doing the expensive formatting, e.g. as in

if (TraceHelper.IsVerboseEnabled)
{
    // Pay the serialization cost only when verbose tracing is actually on.
    TraceHelper.TraceInstance(
        TraceEventType.Verbose,
        "TaskOrchestrationDispatcher-ExecuteUserOrchestration-Begin",
        runtimeState.OrchestrationInstance,
        "Executing user orchestration: {0}",
        DataConverter.Serialize(runtimeState.GetOrchestrationRuntimeStateDump(this.efficientTelemetry), true));
}
Unfortunately, it looks like ETW doesn't give you very good data on whether the EventSource is enabled unless you have a specific listener attached. See here. Since TraceHelper's … I think that, as a quick workaround in the meantime, I will go with Chris's suggestion.
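For context, the guard being discussed would look roughly like the following. It uses the standard System.Diagnostics.Tracing API; durableTaskEventSource is a hypothetical stand-in for whatever EventSource instance backs TraceHelper, not an actual DurableTask.Core type. The catch described above is that with ETW the result reflects whatever session or listener happens to be attached.

using System.Diagnostics.Tracing;

// Hypothetical EventSource instance; the name is for illustration only.
if (durableTaskEventSource.IsEnabled(EventLevel.Verbose, EventKeywords.All))
{
    // Do the expensive formatting only when a verbose consumer is attached.
}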
As far as I understand, the general idea (for both EventSource and ILogger) is that the log level is determined by the subscribed consumers, not by the producer. I think we should try to follow this standard pattern where possible. In this particular case I would suggest that, for our DF-hosted scenarios, we stop collecting verbose tracing from DurableTask.Core. I don't think it gets stored in the Kusto logs anyway, so I am not sure why we should generate those events.
I believe you are correct that this behavior is determined by the consumer. The problem I think we're facing is that we do not control the consumer. Rather, the consumer is a piece of infrastructure that runs on the Azure VMs (we call it "MDS"). We explicitly ask it to listen to …
I see. In that case perhaps the next best solution is to extend TraceHelper so that a lower limit on the tracing level can be specified by a configuration parameter. This is what I did for the EventSourced backend. We don't have a nice way to pass configuration parameters right now, though; I was discussing that with @ConnorMcMahon last week.
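A hypothetical sketch of that suggestion; MinimumTraceLevel and ShouldTrace are made-up names and do not exist in DurableTask.Core today:

// Members that could be added to TraceHelper. The host would set the floor
// from a configuration parameter at startup.
public static TraceEventType MinimumTraceLevel { get; set; } = TraceEventType.Information;

public static bool ShouldTrace(TraceEventType eventType)
{
    // For the basic severity levels (Critical..Verbose), lower enum values
    // are more severe, so this returns true only for events at least as
    // severe as the configured minimum.
    return eventType <= MinimumTraceLevel;
}

Call sites could then guard the expensive dump with if (TraceHelper.ShouldTrace(TraceEventType.Verbose)) before doing any serialization.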
@sebastianburckhardt Sorry to jump in here (hi @cgillum), but any chance of a private preview/alpha for Microsoft.Azure.DurableTask.EventSourced with durable entities? I'm trying to make a critical-path decision about whether to move to Orleans in the next 6 months... :P
Switched to Orleans. |
Dumping the orchestration state for the telemetry event type TaskOrchestrationDispatcher-ExecuteUserOrchestration-Begin is expensive. It scales linearly with the number of history events, meaning orchestrations with large histories start to spend most of their time writing this verbose telemetry item, which is not ideal. Instead, we now just log the number of events for release bits, as this is a constant-time operation.