
Netherite backend: Client request was cancelled because host is shutting down. #2603

Closed
vany0114 opened this issue Sep 26, 2023 · 233 comments
Labels: netherite, P1 Priority 1, Performance, Reliability (Durable functions get stuck or don't run as expected)

Comments

@vany0114

vany0114 commented Sep 26, 2023

Description

@davidmrdavid as discussed in #2534, I'm opening a separate issue to troubleshoot this problem. It is related to that thread because it is a consequence/side effect of the mitigation that prevents an orchestrator from getting stuck for hours while locking an entity.

Expected behavior

Signal the entity, raise the event, or start up a new orchestration normally.

Actual behavior

When the mitigation kicks in, attempts to signal an entity (one of the ones we lock), raise an event, or even start new orchestrations fail with a System.OperationCanceledException that says: "Client request was cancelled because host is shutting down." Any ideas?

We have a retry strategy in place with exponential backoff that waits up to 2.5 secs, but the issue seems to last longer than that (all of those operations failed after ~2.6 secs). Since this is an Event Grid trigger, the EG retry strategy comes into play after that, but I still need to review whether the EG retries succeeded, because I'm not sure how long this issue can last; as you can see, when that error appeared there were intermittent OperationCanceledException errors for around ~9 hours.
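
For reference, the retry policy we use is roughly shaped like this (a sketch with illustrative delays; the exact values differ slightly but add up to ~2.5 secs):

using System;
using Polly;
using Polly.Retry;

// Illustrative only: an exponential-backoff retry whose waits add up to ~2.5 secs.
AsyncRetryPolicy retryPolicy = Policy
    .Handle<Exception>()
    .WaitAndRetryAsync(new[]
    {
        TimeSpan.FromMilliseconds(500),
        TimeSpan.FromSeconds(1),
        TimeSpan.FromSeconds(1),
    });

// e.g. await retryPolicy.ExecuteAsync(() => starter.SignalEntityAsync(entityId, "operation", input));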

Known workarounds

None

App Details

  • Durable Functions extension version (e.g. v1.8.3): 2.10.0
  • Azure Functions runtime version (1.0 or 2.0): 2
  • Programming language used: C#

Screenshots

[screenshots]

If deployed to Azure

  • Timeframe issue observed:
  • Function App name: orders-saga
  • Azure region: West-US
@vany0114
Author

Just for clarity, when you say "I've noticed it too", what does "it" refer to? Does it refer to the mitigation working

Hi @davidmrdavid, yes, that's what I was referring to; the worst latency I've seen locking an entity is ~130 secs.

However, if this issue seems different from the original problem described in this thread, it would be best for clarity if you could create a new ticket for it!

It's a different error, but it's a side effect/consequence of the original one (I brought it up in previous messages in that thread). Any time the orchestrator got stuck, this other issue always showed up, and it prevented us from creating new orders since we weren't able to start new orchestrations or signal the lock entities. So it seems that while the mitigation is happening we get this other intermittent error.

Some operation ids:

  • c39e5c653d26446c28c0a67bc276d8a8 - 2023-09-22 00:45:07.6454820
  • 985de0f9e2616a8080f466bc912ccf09 - 2023-09-22 01:53:39.1764030
  • 4cdf33f69718829644f852265dadc9b4 - 2023-09-22 01:33:45.3770374
  • 5486b501a333711c2d9a64397863899d - 2023-09-22 05:13:46.2640682
  • 7f8ddac7ee6f389b88730a2eceac9c98 - 2023-09-21 23:37:38.4313180
  • 332d0e1c7842b5306e623bdef81e234a - 2023-09-21 23:57:59.6733037

Error details:

System.OperationCanceledException: Client request was cancelled because host is shutting down.
   at DurableTask.Netherite.Client.PerformRequestWithTimeout(IClientRequestEvent request) in /_/src/DurableTask.Netherite/OrchestrationService/Client.cs:line 259
   at DurableTask.Netherite.Client.SendTaskOrchestrationMessageBatchAsync(UInt32 partitionId, IEnumerable`1 messages) in /_/src/DurableTask.Netherite/OrchestrationService/Client.cs:line 511
   at DurableTask.Netherite.NetheriteOrchestrationService.DurableTask.Core.IOrchestrationServiceClient.SendTaskOrchestrationMessageAsync(TaskMessage message)
   at DurableTask.Core.TaskHubClient.RaiseEventAsync(OrchestrationInstance orchestrationInstance, String eventName, Object eventData) in /_/src/DurableTask.Core/TaskHubClient.cs:line 695
   at Microsoft.Azure.WebJobs.Extensions.DurableTask.DurableClient.SignalEntityAsyncInternal(DurableClient durableClient, String hubName, EntityId entityId, Nullable`1 scheduledTimeUtc, String operationName, Object operationInput) in D:\a\_work\1\s\src\WebJobs.Extensions.DurableTask\ContextImplementations\DurableClient.cs:line 344
   at Polly.AsyncPolicy.<>c__DisplayClass40_0.<<ImplementationAsync>b__0>d.MoveNext()
--- End of stack trace from previous location ---
   at Polly.Retry.AsyncRetryEngine.ImplementationAsync[TResult](Func`3 action, Context context, CancellationToken cancellationToken, ExceptionPredicates shouldRetryExceptionPredicates, ResultPredicates`1 shouldRetryResultPredicates, Func`5 onRetryAsync, Int32 permittedRetryCount, IEnumerable`1 sleepDurationsEnumerable, Func`4 sleepDurationProvider, Boolean continueOnCapturedContext)
   at Polly.AsyncPolicy.ExecuteAsync(Func`3 action, Context context, CancellationToken cancellationToken, Boolean continueOnCapturedContext)
   at Curbit.Orders.Saga.Triggers.ProviderOrderProgressedIntegrationEventHandler(EventGridEvent eventGridEvent, IDurableEntityClient starter, ILogger logger) in D:\a\1\sourceRepo\src\dotnet\saga\Curbit.Orders.Orchestrators\Triggers.cs:line 44
   at Microsoft.Azure.WebJobs.Host.Executors.VoidTaskMethodInvoker`2.InvokeAsync(TReflected instance, Object[] arguments) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\VoidTaskMethodInvoker.cs:line 20
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionInvoker`2.InvokeAsync(Object instance, Object[] arguments) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionInvoker.cs:line 52
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.InvokeWithTimeoutAsync(IFunctionInvoker invoker, ParameterHelper parameterHelper, CancellationTokenSource timeoutTokenSource, CancellationTokenSource functionCancellationTokenSource, Boolean throwOnTimeout, TimeSpan timerInterval, IFunctionInstance instance) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 581
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithWatchersAsync(IFunctionInstanceEx instance, ParameterHelper parameterHelper, ILogger logger, CancellationTokenSource functionCancellationTokenSource) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 527
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 306
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.ExecuteWithLoggingAsync(IFunctionInstanceEx instance, FunctionStartedMessage message, FunctionInstanceLogEntry instanceLogEntry, ParameterHelper parameterHelper, ILogger logger, CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 352
   at Microsoft.Azure.WebJobs.Host.Executors.FunctionExecutor.TryExecuteAsync(IFunctionInstance functionInstance, CancellationToken cancellationToken) in D:\a\_work\1\s\src\Microsoft.Azure.WebJobs.Host\Executors\FunctionExecutor.cs:line 108

Please let me know if that information helps.

@vany0114
Author

@davidmrdavid this is getting worse. While it's still intermittent, it's happening more often than before; yesterday there were a bunch of these errors throughout the day, as you can see below.

[screenshot]

As you can see, the incoming vs. outgoing messages don't match in the Event Hub.
[screenshot]

And orchestration latency increased from ~3 secs to ~100 secs.
[screenshot]

@davidmrdavid
Contributor

Thanks for creating this issue. I'm sharing it internally so it gets an owner, as I'm unfortunately not personally available for the rest of this week.

@lilyjma lilyjma added P1 Priority 1 netherite and removed Needs: Triage 🔍 labels Sep 26, 2023
@vany0114
Author

@davidmrdavid really appreciate it! I'll stay tuned.

@sebastianburckhardt
Collaborator

it fails with a System.OperationCanceledException that says: Client request was cancelled because host is shutting down. Any ideas?

This error message indicates that the entire host on which this client is running is shutting down and therefore cannot process any more requests. The intended way to handle this situation is to retry the request on a different host. In particular, a local retry loop does not help; the request has to be retried on a different host.

Based on your description, I would assume that the EG trigger retry should take care of that.

@vany0114
Author

@sebastianburckhardt thanks for the explanation. Yes, I've reviewed some of the affected requests and indeed the EG trigger retry is doing its job. On the other hand, do you know why that error happens so often?

@davidmrdavid
Contributor

Hey @vany0114, I wanted to provide an update here.

It appears to me that, in most cases, when you receive the error "Client request was cancelled because host is shutting down.", the VM where this error was emitted previously logged the following message: "A function timeout has occurred. Host is shutting down.".

In other words, the timeline of events seems to be as follows:

  1. a given Azure Function, such as OrderDeliveredIntegrationEventHandler starts executing
  2. A minute later, the function times out
  3. The Host starts to shut down in response to the timeout. This is because a timeout may suggest that the VM is unhealthy, so we restart it for proactive recovery's sake.
  4. While shutting down, any other in-progress functions will fail with the error "Client request was cancelled because host is shutting down"

So, to answer your question: you are receiving the "client request cancelled" errors because some azure function invocations are timing out, which in turn triggers a VM restart.

So now the question becomes: why are function invocations timing out? I don't know for certain, but I have some theories that are partially backed by our telemetry.

In particular, I see that during periods of high failure, your app is exceeding the 'healthy' CPU usage threshold: it's using over 90% of the CPU available on your VMs. As you can imagine, this can slow down processing, which can in turn lead to timeouts. Furthermore, on our other thread, I theorized that high CPU utilization was perhaps to blame for the periodic hangs in the "compaction" step of checkpointing. I also see evidence of compounding effects: the timeouts cause VMs to restart, which creates a backlog of processing that needs to be consumed on the next VM, which creates delays for new requests, which can cause new timeouts, and thus a cycle.

At this point, I think it would be ideal to provide your app with more CPU capacity; there's a good chance that would help stabilize it. I don't know for certain that this is the root cause of the timeouts, but so far I don't see any logical errors in our telemetry that explain the problem otherwise, so I think this is our next best bet. I'm also hesitant to provide a new private package with a more aggressive mitigation (i.e. one that fires more quickly), because the mitigation terminates processing for a durable partition on a given VM, which adds some delay to processing and therefore risks exacerbating the timeouts.

@vany0114: Are you able to upgrade/scale up your application to use more powerful VMs? In particular, VMs with more CPU capacity (e.g., more cores) should help improve the stability of your app. It's not necessary to go to the highest plan; just a higher one should help. Again, I don't know for certain that this will make the problem disappear, but it should remove one of the sources of instability I'm seeing. I also recommend proactively monitoring your CPU utilization to ensure it remains under healthy limits.

@vany0114
Author

vany0114 commented Oct 6, 2023

@davidmrdavid Thanks for the detailed update! I'll move to the premium plan then because right now we're using a consumption plan.

@vany0114
Author

vany0114 commented Oct 9, 2023

@davidmrdavid I'm reviewing this issue, particularly in our dev environment, which is where we hit it most often. As you can see, almost all requests are failing because of the timeout and shutdown issues.

[screenshot]

I'm bringing this up because the traffic in our development environment is not nearly as high as in production; actually, it's very low.

Development:
[screenshot]

Production:
[screenshot]

In terms of infrastructure, the only difference is the number of throughput units the Event Hubs namespace uses in dev vs. production (1 and 4 respectively); both function apps (dev and prod) are deployed on a consumption plan.

This behavior seems interesting to me because I wouldn't expect any issues in our development environment given the low number of requests we process there, but it's the opposite: the behavior in our dev environment is actually worse than in production. As a matter of fact, we reset it (func app, storage, and Event Hubs namespace) last week and look at it now; it's totally unhealthy. For instance, look at the behavior of the EH messages.

[screenshot]

BTW, here are some insights into memory and CPU consumption over the last 24 hours. There is indeed high memory consumption during certain hours (something I'd like to review to try to reduce allocations on our end), but it's not that high for the whole day. On the other hand, the CPU seems healthy.

Memory consumption:
[screenshot]

CPU consumption:
[screenshot]

I hope this information helps with the diagnosis.

@davidmrdavid
Contributor

Just saw this, @vany0114. I'll take a comparative look at the dev environment tomorrow.

@davidmrdavid
Contributor

In the meantime: any context on what could be causing those large memory allocations? Are you aware of any large allocations on your end, or is that surprising?

@vany0114
Author

In the meantime: any context on what could be causing those large memory allocations? Are you aware of any large allocations on your end, or is that surprising?

@davidmrdavid Not sure yet, I need to look into it, but I think it might be the main input the orchestration is created with (which is also persisted as part of the durable entity state, so it may have an impact when it gets serialized/deserialized). But again, I'm not sure; I'll need to do a memory profile or something to get more insight.

@vany0114
Author

@davidmrdavid today I redeployed the func app to move it to a premium plan and also redeployed the Event Hubs namespace to use a standard tier again; however, I've been seeing this error a lot ever since:

An item with the same key has already been added. Key: durabletasktrigger-netherite-ordersagav3

[screenshots]

Do you know what it means?

@vany0114
Author

FYI, I had to go back to a consumption plan due to that issue.

@davidmrdavid
Contributor

Thanks @vany0114.

Regarding the error: "An item with the same key has already been added. Key: durabletasktrigger-netherite-ordersagav3".

Yes, I've seen that before. My understanding is that this is a transient error in the ScaleController that is safe to ignore, although noisy. Unfortunately, it cannot be fixed on the application side; the ScaleController team needs to make a release for it to disappear.

Did you experience that error affecting the correctness of your app, or was it just polluting the Application Insights logs? Just confirming.

As for the test app: I took a preliminary look but I'm not entirely ready to share a conclusion there just yet. I'm starting an internal thread about it. I'll keep you posted.

@vany0114
Author

Did you experience that error affecting the correctness of your app, or was it just polluting the Application Insights logs. Just confirming.

@davidmrdavid I initially thought it was just noise, but then I started seeing misbehavior, basically with an ack/event that the func app waits for: even though the ack was being published on time, the func app somehow was not processing it and it just timed out, so that's why I moved back to the consumption plan. I'm curious why I don't see that error when the app is deployed on a consumption plan 🤔

Unfortunately, it cannot be fixed on the application side; the ScaleController team needs to make a release for it to disappear.

The Azure support engineer told me I could update the extensions package since it was already fixed, but based on your answer it seems I cannot, so I'll wait until you give me the green light. Then I'll try moving the func app to a premium plan again.

@vany0114
Author

@davidmrdavid I initially thought it was just noise, but then I started seeing misbehavior, basically with an ack/event that the func app waits for: even though the ack was being published on time, the func app somehow was not processing it and it just timed out, so that's why I moved back to the consumption plan. I'm curious why I don't see that error when the app is deployed on a consumption plan 🤔

@davidmrdavid FWIW here is an example of an affected instance that I was referring to:
{"siteUID":"24948821-5187-436c-a080-b0994b10d39e","checkNumber":"157","openedAt":"2023-10-10","courseNumber":"1"}

[screenshot]

@davidmrdavid
Contributor

davidmrdavid commented Oct 12, 2023

@vany0114: Regarding that ScaleController log, that will be fixed once their code takes on a dependency on a Netherite release with this fix (which I just approved): microsoft/durabletask-netherite#316 (review) .

Again, my understanding is that this bug can be safely ignored, so I need to double-check that affected instance you shared; I could have been wrong. Looking into that now.

I did look into your staging app. I don't have a clear picture of what's wrong with it yet, but I do see a lot of OOMs. At first, I thought maybe you had some kind of persistent error loop due to trying to deserialize too large of a blob, but I see the app has been slowly healing over the past day, which invalidates that theory. Did you do anything to your staging app in the past day to help it recover?

@vany0114
Author

Did you do anything to your staging app in the past day to help it recover?

Yes, I had to reset it because our dev environment wasn't working because of that. Could the fact that the func app shares its service plan with other apps have something to do with it? In our dev environment, most of the func apps are deployed in the same service plan, FYI.

@davidmrdavid
Contributor

@vany0114: ok, it's good to know it was reset, as that helps validate that my initial findings were probably not entirely off. I'll continue my investigation while ignoring the latest, healthier, data.

As an aside: There's an incoming Netherite release that I think should include the mitigation in your preview package, as well as the fix for the ScaleController error. I'll keep you posted on that.

Regarding your question on using the same app service plan: I think it's fine to do this.

@vany0114
Author

vany0114 commented Oct 13, 2023

@davidmrdavid I'd like to know your thoughts on the next course of action to fix this problem definitively. This is hurting us really badly, and resetting the app every week is not sustainable; for instance, the day is just starting and it's already timing out... I've tried everything you have advised (which I appreciate a lot) and the issue persists, so I don't really know what else to do to stabilize our app 😞

[screenshots]

@davidmrdavid
Contributor

@vany0114: Yes, I want us to get your application to a better state soon. I'm also concerned about its current state.

Here are my thoughts:

  1. I'm confident that moving to more powerful VMs should partially help. Naturally, more CPU and more memory will reduce the chance of OOMs and high-CPU-consumption warnings. I realize that you were seeing errors in the Scale Controller when you did this; I believe I can provide you with instructions to circumvent those errors (even though I think they are benign), so I'll work to get those to you ASAP.

If you decide to scale up to Elastic Premium, please "reset your app"/ clear your taskhub after doing this. This will ensure that the newer VMs can start in a healthy state and, as such, be able to "keep up with the incoming work" instead of trying to do that while dealing with a pre-existing backlog.

  2. Your app is shutting down VMs a lot. Our logs indicate that this is due to function timeouts. I think these shutdowns are creating compounding effects: if a partition is busy, it will cause a timeout, which will cause partitions on that VM to fail to respond to requests, which will cause more timeouts on the clients' VMs, and so on in a potentially self-sustaining cycle.

I think it's critical that we stop these timeout-induced shutdowns. @vany0114: can you change your code so that you manage the timeout yourself in the code instead of relying on the Functions runtime timeout setting? This way, when a request times out, it will not cause the VM to restart, which should help with the instability.

Addressing this will also make investigating easier for us. Right now, the timeout-induced shutdowns are extremely noisy, which makes it difficult to identify the root problem. This is my main recommendation.

  3. Can you please set the Netherite setting "TakeStateCheckpointWhenStoppingPartition" to false in host.json? This is a setting inside the storageProvider section that applies when you select Netherite as your backend (see the sketch at the end of this comment). This may reduce the incidence of OOMs when your VMs are shutting down (due to timeouts), since we won't try to checkpoint partitions as part of shutdown, which is very memory intensive.

Those are my immediate thoughts with your current setup. Finally, I think it's key to make sure your application isn't trying to save moderately large state in Entities (or in any Durable APIs), as per this best practice: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-best-practice-reference#keep-function-inputs-and-outputs-as-small-as-possible. If you have reason to suspect your code may be violating this constraint, this would also be an area to invest in. If you have moderately large data being communicated through DF APIs, please use indirection to materialize it, as described in the linked doc.
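
For reference, the host.json change in item 3 would look roughly like this (a sketch showing only the relevant keys, assuming the standard Netherite storageProvider layout):

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "storageProvider": {
        "type": "Netherite",
        "TakeStateCheckpointWhenStoppingPartition": false
      }
    }
  }
}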

@vany0114
Author

vany0114 commented Oct 17, 2023

Hi @davidmrdavid

Thanks for your thoughts. I think we can start by scaling up to Elastic Premium, so I'll wait for your instructions to circumvent the Scale Controller errors.

Re handling the timeouts in the Event Grid triggers: absolutely, that's something we can do. I just have a question: can I re-throw the timeout exception so that Event Grid can retry, or should I re-throw a different exception? I'm asking because I don't know whether re-throwing the same timeout exception would still cause the shutdown issue.

Edit: actually it will throw the Polly TimeoutRejectedException, so I think we should be good there.
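
For reference, the shape of what we're planning is roughly the following (a sketch; the timeout value and the helper name are placeholders, not our production code):

using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Polly;
using Polly.Timeout;

public static class Timeouts
{
    // Enforce our own timeout with Polly so a slow request surfaces as TimeoutRejectedException
    // (which Event Grid will retry) instead of tripping the Functions host timeout and restarting the VM.
    private static readonly AsyncTimeoutPolicy TimeoutPolicy =
        Policy.TimeoutAsync(TimeSpan.FromSeconds(90), TimeoutStrategy.Pessimistic);

    public static Task SignalWithTimeoutAsync(
        IDurableEntityClient starter, EntityId entityId, string operationName, object input) =>
        TimeoutPolicy.ExecuteAsync(
            _ => starter.SignalEntityAsync(entityId, operationName, input),
            CancellationToken.None);
}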

Can you please set the Netherite setting "TakeStateCheckpointWhenStoppingPartition" to false in host.json? This is a setting inside the storageProvider segment when you select Netherite as your backend. This may reduce the incidence of OOMs when your VMs are shutting down (due to timeouts) as we won't try to checkpoint partitions as part of shutdown, which is very memory intensive.

Will do

As for the improvement to avoid saving large state in the Entities, that's something we'll definitely do, but it will take us some time since it's not a trivial refactor on our end.

@vany0114
Author

I realize that you were seeing errors in the Scale Controller when you did this; I believe that I can provide you with instructions to circumvent the errors you were seeing

@davidmrdavid since I'm planning to make a release tomorrow, could you please provide those instructions? FYI, the release will contain the timeout handling as well as the TakeStateCheckpointWhenStoppingPartition setting change.

@vany0114
Author

vany0114 commented May 10, 2024

@davidmrdavid Thanks for the heads up and for sharing this. I have been looking into it, and indeed there's a bug on our end: a serialization issue makes the instance ids not match, which is causing a lot of events to be discarded, e.g.:

Example 1

Orchestrator instance id:

{"id":"00100045-06c0-0000-1805-090000000001","storeId":"86ae9330-4a0b-456e-808d-d9694a2db164","checkNumber":"10069"}

Event raised instance id:

{"checkNumber":"10069","storeId":"86ae9330-4a0b-456e-808d-d9694a2db164","id":"00100045-06c0-0000-1805-090000000001"}

Example 2

Orchestrator instance id:

{"brandId":"f6306b87-5975-ec11-aaad-0e885b099c11","orderId":"26666867702235136","storeNumber":"1"}

Event raised instance id:

{"orderId":"26666867702235136","storeNumber":"1","brandId":"f6306b87-5975-ec11-aaad-0e885b099c11"}

It's not intrinsically a problem, but the volume is so large (tens of thousands of discarded events) that it is at least suspicious. At the very least, it contributes somewhat to partitions being busy with work that isn't meaningful / that is to be discarded.

You're right, we put a validation in place to avoid raising events when the orchestration has already completed. Though I have a question: is there a way to know whether a given instance exists, so that we also avoid raising events for instances that don't exist? It should not happen in normal scenarios, but as you said, it helps to avoid overwhelming the orchestrator with meaningless work. This is what we did:

// This helps partitions avoid being busy with work that isn't meaningful/that is to be discarded.
var statusResponse = context.GetStatusAsync(instanceId, showInput: false);
if (statusResponse.IsCompleted) return;

await ResilienceStrategy.ExecuteAsync(() => context.RaiseEventAsync(instanceId, eventName, integrationEvent));

FYI, these changes will be released next week, Monday or Tuesday.
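
For illustration only (a sketch of the general idea, not necessarily the exact change we shipped), one way to make the instance id deterministic is to canonicalize the key object before serializing it:

using System;
using System.Linq;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class InstanceIds
{
    // Serialize the key with its properties in a fixed (alphabetical) order so the same
    // logical key always produces the same orchestration instance id, regardless of the
    // declaration order of the source type.
    public static string ToCanonicalInstanceId(object key)
    {
        var json = JObject.FromObject(key);
        var ordered = new JObject(json.Properties().OrderBy(p => p.Name, StringComparer.Ordinal));
        return ordered.ToString(Formatting.None);
    }
}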

@davidmrdavid
Contributor

davidmrdavid commented May 10, 2024

@vany0114: ah, I see, that's interesting. It sounds like you were obtaining JSON that serialized its keys out of order? In any case, glad we found and fixed a bug :)

Regarding your validation:

// This helps partitions avoid being busy with work that isn't meaningful/that is to be discarded.
var statusResponse = context.GetStatusAsync(instanceId, showInput: false);
if (statusResponse.IsCompleted) return;

This works, but I think it doesn't catch the case where the instanceID never existed (as it was in the bug). I'm juggling a few different contexts here so I can't remember exactly how it works in .NET in-process, but a quick test of the response value of GetStatusAsync with a non-existent instanceId as input should tell you the behavior. I suspect it either returns an exception, or null; I lean towards expecting the exception.

However, there's a subtlety here: in Netherite, querying the status of an orchestrator does send an event to the partition, so while you may be avoiding the discarded message (which is good!), you will still "burden" the partition with a status read request and then, if the orchestrator is in fact available, you'll send the event-raised message right after that. So ultimately, if your goal is to minimize the partition burden, it may suffice for you to still optimistically send that raised event and rely on the discarding logic if the orchestrator already completed or doesn't exist. I may need @sebastianburckhardt to correct me here, but he's OOF for a few more days, so that'll need to wait. What really matters is that your instanceIDs have the expected format, so that if they're received, they lead to meaningful work.

In summary: I think you should be good deploying just the instanceID deserialization fix, and at this time I believe that may be more performant. In any case - my main goal was to reduce the number of unexplained warnings, and this change (irrespective of you adding the additional step of checking if the instanceID exists) does that :-) . That'll make it easier for me to read your app's health signals in the future.

@vany0114
Author

@davidmrdavid

This works, but I think it doesn't catch the case where the instanceID never existed (as it was in the bug). I'm juggling a few different contexts here so I can't remember exactly how it works in .NET in-process, but a quick test of the response value of GetStatusAsync with a non-existent instanceId as input should tell you the behavior. I suspect it either returns an exception, or null; I lean towards expecting the exception.

I tested it out and it does not throw an exception or return a null response; it returns a response with the WaitingForActivation status, which is different from a "not found" status.

However, there's a subtlety here: in Netherite, querying the status of an orchestrator does send an event to the partition, so while you may be avoiding the discarded message (which is good!), you will still "burden" the partition with a status read request and then, if the orchestrator is in fact available, you'll send the event-raised message right after that.

Yeah, I was thinking kind of the same thing; I'll remove that validation then.

@davidmrdavid
Contributor

davidmrdavid commented May 10, 2024

@vany0114: in your sample:

// This helps partitions avoid being busy with work that isn't meaningful/that is to be discarded.
var statusResponse = context.GetStatusAsync(instanceId, showInput: false);
if (statusResponse.IsCompleted) return;

Don't you need to await the GetStatusAsync call? I want to rule out that the WaitingForActivation status is coming from .NET's own Task API rather than from DF.
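
For reference, if you do keep a check like that, the awaited version would look roughly like this (a sketch; I'm assuming the in-process client API and that a non-existent instance comes back as null, which is worth verifying):

// Await the call so we inspect the orchestration's runtime status, not the Task wrapper.
var status = await context.GetStatusAsync(instanceId, showInput: false);

// Skip raising the event if the instance never existed or has already reached a terminal state.
if (status == null ||
    status.RuntimeStatus == OrchestrationRuntimeStatus.Completed ||
    status.RuntimeStatus == OrchestrationRuntimeStatus.Terminated ||
    status.RuntimeStatus == OrchestrationRuntimeStatus.Failed)
{
    return;
}

await ResilienceStrategy.ExecuteAsync(() => context.RaiseEventAsync(instanceId, eventName, integrationEvent));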

@vany0114
Author

@davidmrdavid BTW, do you think that explains the high number of outgoing messages in the Event Hub?

@vany0114
Author

@vany0114: in your sample:

// This helps partitions avoid being busy with work that isn't meaningful/that is to be discarded.
var statusResponse = context.GetStatusAsync(instanceId, showInput: false);
if (statusResponse.IsCompleted) return;

Don't you need to await the GetStatusAsync call? I want to rule out that the WaitingForActivation status is coming from .NET's own Task API rather than from DF.

You're totally right 🤦 but I did remove it as per our previous conversation.

@davidmrdavid
Contributor

@vany0114:

BTW do you think it explains the high outgoing messages in the event hub?

Not sure, but that is a signal I've also noticed throughout this thread. It seems to correlate with partition slowdowns in general; I think we've seen it both in "partition stuck/corrupted" incidents as well as in general temporary slowdowns, but I'm not yet certain what it represents. It's not clear to me how a stuck/slowed partition increases the rate of outgoing messages; I'll need to consult with the rest of the team.

@davidmrdavid
Contributor

In general, there have usually been other, clearer explanations of what's going wrong when you report a problem (usually it suffices to look at the partition logs and see whether they're busy, or online at all), so I've been relying on those. But I do want to understand the high EH activity.

It sounds like you're drawing a correlation between high EH activity and this issue with the raised events? Can you tell me more about that? Maybe there's something application-specific I'm missing that would explain it.

@vany0114
Author

It sounds like you're drawing a correlation between high EH activity and this issue with the raised events? Can you tell me more about that? Maybe there's something application-specific I'm missing that would explain it.

Nope, I'm not saying there's a correlation, but it seems suspicious to me; maybe the correlation is more with our timeouts and/or the transient shutdown issue?

@davidmrdavid
Contributor

davidmrdavid commented May 10, 2024

Right. Here's my summarized mental model:

The explanation of the spike in EH outgoing message activity is still unknown to me, but it hasn't been a blocker yet for identifying issues in the app.

The shutdowns have been mostly explained by two reasons:
(1) An OOM error leads to a shutdown, which may slow down processing in a partition and cause client requests to time out.
(2) Timeouts also lead to shutdowns, which is why you were facing a "domino effect" where sufficient OOM errors would cause seemingly self-perpetuating shutdowns, at least for a little while.

We've mitigated most of this by:
(A) getting more memory capacity, which should prevent OOMs
(B) using your timeout handler, which prevents platform-initiated restarts on a timeout.

Now the remaining questions are:
(1) How to control memory usage further for you to be able to scale down to a lower plan. This is tricky as memory usage is not flat, it's a function of traffic and your app's architecture.

(2) Have we stabilized Netherite itself: are there still partition corruption issues to worry about? Things are looking better here for us ever since we updated to the latest FASTER, but obviously this is a critical point for us.

(3) Can we recover from these issues without you needing to 'reset the app'? If so, can we automate it?

Let me know if you feel I'm missing anything

@davidmrdavid
Contributor

Returning back to the purging history conversation:

We should definitely be purging data periodically; that should help speed up all sorts of FASTER queries, which is good for stability. You could set your purging of data to occur once a week if you're worried about the partition burden, but I think it should occur. In fact, now I wonder if the fact that you're not purging data could explain why we see periods of 'good performance' and then degradation kicking in a few weeks later.

And yes, there's a subtlety with entities in purging: entities by design "live forever" unless explicitly deleted. So a purge operation, if I'm not mistaken, will not delete that data. If you're creating transient entities, you should delete them with the delete API. If the entities only live for the lifetime of a given orchestrator, the orchestrator could signal the delete operation on the entities it used as it's about to complete.

If you already have a large number of transient entities that are completed but taking up storage, you may need to call the "list entities" API and purge the ones you don't need; maybe you can look at each entity's last update timestamp and use that to conservatively delete really old ones?
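
For illustration, a scheduled purge could look roughly like this in the in-process model (a sketch; the schedule and the 30-day retention window are placeholders):

using System;
using System.Threading.Tasks;
using DurableTask.Core;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class HistoryPurge
{
    // Runs daily at 03:00 UTC and purges history for terminal instances older than 30 days.
    [FunctionName("PurgeOldInstanceHistory")]
    public static async Task Run(
        [TimerTrigger("0 0 3 * * *")] TimerInfo timer,
        [DurableClient] IDurableOrchestrationClient client,
        ILogger log)
    {
        var cutoff = DateTime.UtcNow.AddDays(-30);

        var result = await client.PurgeInstanceHistoryAsync(
            DateTime.MinValue,
            cutoff,
            new[] { OrchestrationStatus.Completed, OrchestrationStatus.Terminated, OrchestrationStatus.Failed });

        log.LogInformation("Purged {Count} orchestration histories.", result.InstancesDeleted);
    }
}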

@vany0114
Author

Right. Here's my summarized mental model:

The explanation of the spike in EH outgoing message activity is still unknown to me, but it hasn't been a blocker yet for identifying issues in the app.

The shutdowns have been mostly explained by two reasons: (1) An OOM error leads to a shutdown, which may slow down processing in a partition and cause client requests to time out. (2) Timeouts also lead to shutdowns, which is why you were facing a "domino effect" where sufficient OOM errors would cause seemingly self-perpetuating shutdowns, at least for a little while.

We've mitigated most of this by: (A) getting more memory capacity, which should prevent OOMs (B) using your timeout handler, which prevents platform-initiated restarts on a timeout.

Now the remaining questions are: (1) How to control memory usage further for you to be able to scale down to a lower plan. This is tricky as memory usage is not flat, it's a function of traffic and your app's architecture.

(2) Have we stabilized Netherite itself: are there still partition corruption issues to worry about? Things are looking better here for us ever since we updated to the latest FASTER, but obviously this is a critical point for us.

(3) Can we recover from these issues without you needing to 'reset the app'? If so, can we automate it?

Let me know if you feel I'm missing anything

@davidmrdavid I think that's it, you perfectly summarized it. I would just add a couple more questions:

  • Is Netherite suitable for apps running on a consumption plan? At least for in-process apps, the memory footprint is huge.
  • Is there an ETA for the Netherite isolated worker library? I'm asking because the memory footprint seems to be dramatically lower in that implementation.

@davidmrdavid
Contributor

davidmrdavid commented May 10, 2024

Is there an ETA for the Netherite isolated worker library? I'm asking because the memory footprint seems to be dramatically lower in that implementation.

I merged the fix for this issue yesterday, I just haven't had time to perform the release steps. However, I worry the memory footprint you're measuring might be a red herring: the isolated process model has two processes, the "host" and your "app" (compared to the in-process model, where both co-exist in the same process). I worry the memory reduction you're seeing is in your app's process but is missing the "host" process's own memory usage. I'm saying this because the same Netherite code runs in both models (in-proc vs. isolated), so it doesn't make sense to me that the memory footprint would be different; I suspect it's more likely that the memory footprint simply moved elsewhere (from the app process to the host process).

Again, I haven't measured this yet, but from first principles I worry this may be a red herring.

@vany0114
Author

Returning back to the purging history conversation:

We should definitely be purging data periodically; that should help speed up all sorts of FASTER queries, which is good for stability. You could set your purging of data to occur once a week if you're worried about the partition burden, but I think it should occur. In fact, now I wonder if the fact that you're not purging data could explain why we see periods of 'good performance' and then degradation kicking in a few weeks later.

And yes, there's a subtlety with entities in purging: entities by design "live forever" unless explicitly deleted. So a purge operation, if I'm not mistaken, will not delete that data. If you're creating transient entities, you should delete them with the delete API. If the entities only live for the lifetime of a given orchestrator, the orchestrator could signal the delete operation on the entities it used as it's about to complete.

If you already have a large number of transient entities that are completed but taking up storage, you may need to call the "list entities" API and purge the ones you don't need; maybe you can look at each entity's last update timestamp and use that to conservatively delete really old ones?

@davidmrdavid yeah, that makes a lot of sense; in the next release we'll be purging history once a day. Re the entities, we always delete them once the orchestration finishes.

@vany0114
Author

Is there an ETA for the Netherite isolated worker library? I'm asking because the memory footprint seems to be dramatically lower in that implementation.

I merged the fix for this issue yesterday, I just haven't had time to perform the release steps. However, I worry the memory footprint you're measuring might be a red herring: the isolated process model has two processes, the "host" and your "app" (compared to the in-process model, where both co-exist in the same process). I worry the memory reduction you're seeing is in your app's process but is missing the "host" process's own memory usage. I'm saying this because the same Netherite code runs in both models (in-proc vs. isolated), so it doesn't make sense to me that the memory footprint would be different; I suspect it's more likely that the memory footprint simply moved elsewhere (from the app process to the host process).

Again, I haven't measured this yet, but from first principles I worry this may be a red herring.

I see, thanks for the clarification.

@davidmrdavid
Contributor

@vany0114:

Is Netherite suitable for apps running on a consumption plan? At least for in-process apps, the memory footprint is huge.

This depends. Netherite does have higher memory requirements, and consumption has lower memory capacity, so clearly that poses a bigger challenge than the default backend; but an app under the right circumstances (config, traffic, architecture) can use Netherite on the Consumption plan. That said, I am definitely thinking of writing "best practices" guidance that pushes users to reserve Netherite for higher-capacity plans, like Elastic Premium. I think users who gravitate towards Netherite are naturally trying to "push the system" to the limit (Netherite is supposed to be for high-throughput scenarios, after all), and naturally this will work better when the system has more capacity to give.

For your app specifically, it's proving difficult to return to the consumption plan. I don't think it's impossible, but this thread shows it isn't easy. That said, on my end I spent a large part of this investigation focusing on the partition corruption issues rather than debugging your app's performance, as the former was a permanent error (+ data loss) and the latter could be 'mitigated' by using the Elastic Premium plan. So now that the corruption issue has not occurred in a while, I'm hoping we'll be able to experiment with tuning your Netherite config to see if it can bring you down to the consumption plan.

The question, though, is whether you'd prefer we do that over trying to get more performance out of the Azure Storage backend (which will definitely run on consumption, but may not give you the throughput you want). That is one of the reasons I asked you to schedule a meeting with us (and I see it's scheduled for next week).

@davidmrdavid
Contributor

davidmrdavid commented May 10, 2024

Here are a few things, some possible today and some needing implementation and design on our end, that should help facilitate getting Netherite onto the consumption plan. I'll write them down here for future reference.

(1) Using a 'shadow store' for client queries

As we discussed just now, getting the state of an orchestration sends a request to the partition. Ideally, we would be able to perform queries (i.e. reading the state of entities and orchestrations) without needing to burden the partition itself. We can do this by creating a table in Azure Storage, similar to the Instances table in the default backend, that lists all instanceIDs and their state. Then, client queries would simply read from there instead of sending a request to the partition. This table is what we internally call a 'shadow store', because it's a 'copy' of (some of) the data in the partition. At worst, the table may be a bit out of sync/delayed with respect to the actual status on the partition, but the benefit is great: keeping partitions focused on 'real' work.

This doesn't exist today, but the good news is that you can implement it easily on the application side: if your orchestrators and entities post their state to a table throughout their lifecycle, you can query that table to determine the status of orchestrations/entities instead of going through the partition (see the sketch at the end of this comment). This could help if queries turn out to be a big part of the load, which we haven't established yet.

(2) Smarter load balancing
We know issues arise when a given partition is overburdened. In theory, random instanceIDs are supposed to be load balanced across partitions, but it seems to me you manually craft your instanceIDs, and that can cause more load to land on specific partitions (we hash the instanceID string to determine which partition will handle it, so non-random strings will not necessarily be uniformly load balanced). I wonder if this may be contributing to the 'partitions stuck' issues we see; I'm not sure whether you have partitions doing a lot more work than others. If so, perhaps we can find a way to better load balance requests; a naive way to do that is to leverage system-generated instanceIDs. Again, more investigation is needed to determine whether this is worth investing in.

(3) Partition throttling

I think part of the challenge today is that Netherite does not throttle its incoming work. This is part of why you get timeouts on your client requests: Netherite accepts the work, optimistic that it will complete it before the timeout, but it does not. If it had throttled your request and failed it immediately (assuming it was too busy to take on new work), that would probably be a better signal to you as a user, and it could also help the partition remain healthy by keeping memory usage low, at the expense of outright rejecting client requests instead of timing out on them.

This is tricky to implement because the throttling shouldn't be too sensitive either, but if we make it user-configurable, you could make it as sensitive as you'd like to help control the load on each partition. Naively speaking, with a sensitive enough throttle you're effectively reducing the traffic that reaches each Netherite partition, which may be the key for your app to run on lower-capacity VMs. Of course, you have to balance this with your throughput needs.
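
To make (1) concrete, a minimal application-side sketch could be an activity that each orchestrator calls whenever its status changes; clients then query the table instead of the partition. The table name, schema, and connection setting below are placeholders, not an existing API:

using System;
using System.Threading.Tasks;
using Azure.Data.Tables;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class ShadowStore
{
    public class StatusUpdate
    {
        public string InstanceId { get; set; }
        public string Status { get; set; }
    }

    // Records the caller's status in a plain Azure Table so that client-side status checks
    // can read this table instead of sending a query event to a Netherite partition.
    [FunctionName("UpsertShadowStatus")]
    public static async Task UpsertShadowStatus([ActivityTrigger] StatusUpdate update)
    {
        var table = new TableClient(
            Environment.GetEnvironmentVariable("AzureWebJobsStorage"), "OrchestrationShadowStatus");
        await table.CreateIfNotExistsAsync();

        await table.UpsertEntityAsync(new TableEntity("saga", update.InstanceId)
        {
            ["Status"] = update.Status,
            ["UpdatedUtc"] = DateTime.UtcNow
        });
    }
}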

@vany0114
Author

vany0114 commented May 10, 2024

This depends. Netherite does have higher memory requirements, and consumption has lower memory capacity, so clearly that poses a bigger challenge than the default backend; but an app under the right circumstances (config, traffic, architecture) can use Netherite on the Consumption plan. That said, I am definitely thinking of writing "best practices" guidance that pushes users to reserve Netherite for higher-capacity plans, like Elastic Premium. I think users who gravitate towards Netherite are naturally trying to "push the system" to the limit (Netherite is supposed to be for high-throughput scenarios, after all), and naturally this will work better when the system has more capacity to give.

@davidmrdavid Yep, that's understandable; everything has costs and trade-offs, and in this case the cost of high throughput is high memory consumption. I agree that it depends on the specifics, as you said: traffic, architecture, and configuration are all factors in memory consumption. And yeah, documenting a best-practices guide would be great for future users; I wish I had known all of the things we've discussed in this thread earlier to avoid these headaches (fortunately your support has been super helpful and I much appreciate it 🙏). For example, knowing about Netherite's higher memory requirements would have been super helpful before deciding to move to it. However, in our case it wasn't only throughput that made us move to Netherite; it was reliability too. We used to use Azure Storage as the backend and found ourselves resetting the app several times a year (3 or 4), not as often as with Netherite, so we thought that with Netherite we wouldn't have that issue.

Even though we've learned a lot about Netherite and the Durable framework the hard way, because of the bugs we found in Netherite/FASTER and the lack of documentation we were talking about, it has helped us improve our service (and hopefully improve Netherite itself) and to start considering the trade-offs of moving back to Azure Storage as the backend. My concern, other than throughput, is facing similar problems there, such as partition corruption and the other bugs that were fixed as part of this long discussion (if they apply to that backend).

I'm hoping we'll be able to experiment with tuning your Netherite config to see if it can bring you down to the consumption plan.

Yep, let's give it a try, I'm willing to do so. I hope that after the release we're going to do next week, which includes the fix for the discarded events and the purging changes, the app stabilizes even more and memory consumption stays as flat as possible (or at least grows logarithmically).

Lastly, I have a question for you: is it possible to shard data in a durable func app? I'm curious to know your thoughts, since that's pretty common, for example in multi-tenancy architectures; I've been thinking about it as another improvement to help with the load.

@vany0114
Author

but it seems to me you manually craft your instanceIDs

Yep, we do, because we need them to raise events, etc. I guess we could use the auto-generated ids and keep a mapping between the orchestration ids and our own ids? That, BTW, could be another thing worth mentioning in the documentation IMHO.

Re the partition, yep, that'd be great.

@davidmrdavid
Contributor

davidmrdavid commented May 10, 2024

guess we could use the autogenerated ids and keep a mapping between the orchestration ids and our own id?

Possibly - I'd first need to establish that you have a skewed distribution of partition load, with some partitions being significantly more burdened than others. Otherwise, this indirection will be a waste. I'll try to see if the data supports this possibility.

... that btw could be another thing worth mentioning in the documentation IMHO.

It is technically mentioned here (https://microsoft.github.io/durabletask-netherite/#/placement) but those aren't the official MSDocs docs so you're right :) . This GitHub page has a lot of worthwhile details though so I recommend reading through it.

Lastly, I have a question for you: is it possible to shard data in a durable func app? I'm curious to know your thoughts, since that's pretty common, for example in multi-tenancy architectures; I've been thinking about it as another improvement to help with the load.

Hmmm, if by sharding you mean assigning different segments of data to different machines, then that's exactly what our partitions do: each partition holds the data for a different range of instanceIDs, and each partition may be assigned to a different machine (if you're sufficiently scaled out). Of course, for this to work best, instanceIDs need to be random so they're evenly balanced across partitions.

And regarding this comment: #2603 (comment). Thanks, we really want to get y'all to a stable place; I'm not happy with having to 'reset your app', since the point of Durable Functions is that we should tolerate all sorts of transient failures, so I'm glad we're making forward progress on that. Yes, I think this thread has driven a lot of improvement to Netherite.

@vany0114
Author

It is technically mentioned here (https://microsoft.github.io/durabletask-netherite/#/placement) but those aren't the official MSDocs docs so you're right :) . This GitHub page has a lot of worthwhile details though so I recommend reading through it.

I was aware of that page but had never read the placement section; I didn't know about the !n syntax, that's cool.

Hmmm, if by sharding you mean assigning different segments of data to different machines, then that's exactly what our partitions do: each partition holds the data for a different range of instanceIDs, and each partition may be assigned to a different machine (if you're sufficiently scaled out). Of course, for this to work best, instanceIDs need to be random so they're evenly balanced across partitions.

I meant sharding data at the Event Hub level, so that, for instance, we can have several instances of the app running, each tied to a specific Event Hub depending on the tenant... I guess it would also require each app instance to be tied to a different storage account; maybe that would be possible using vertically partitioned deployments.

@vany0114
Author

@davidmrdavid any thoughts? ☝️

@davidmrdavid
Contributor

@vany0114:

I meant sharding data at the Event Hub level, so that, for instance, we can have several instances of the app running, each tied to a specific Event Hub depending on the tenant... I guess it would also require each app instance to be tied to a different storage account; maybe that would be possible using vertically partitioned deployments.

If I understand correctly, you're asking if each partition could have its own Event Hub while sharing a single Azure Storage account. In theory, I think that's possible, but in your case I don't think it would really help reduce the partition load. My impression is that Event Hubs actually keeps up with the load quite well; it's the VM that does not have enough capacity to keep up with the load it is receiving and may throw OOM exceptions when a surge of events comes in.

@vany0114
Author

it's the VM that does not have enough capacity to keep up with the load it is receiving and may throw OOM exceptions when a surge of events comes in.

@davidmrdavid Exactly, that's why I'm thinking of a vertically partitioned deployment to help distribute the load across the workers, so that we can have some tenants tied to a given instance, i.e. saga-01 would serve requests for tenants 1, 2, and 3; saga-02 would serve requests for tenants 4, 5, and 6; etc.

I think that with that approach we would probably be able to use Netherite on a consumption plan; does that make sense?

@vany0114
Author

@davidmrdavid bubbling this up ☝️

@davidmrdavid
Contributor

Hey @vany0114: I believe we discussed this "vertical deployment" during our call, did we not? If it refers to possibly breaking down the app into several smaller deployments and load balancing across them, then I could see that helping, but some experimentation will be needed.

Also, as per our call, I think we agreed to work towards bringing your app to EP1. I'm curious whether we should start a new thread to focus on that. This GitHub issue has grown rather long, and if we're pivoting towards a new goal, it might be best to start fresh so folks don't need to catch up on so much context. Would you agree? If so, once you're ready, I'd appreciate it if we could move to a new issue.

@vany0114
Author

Hey @vany0114: I believe we discussed this "vertical deployment" during our call, did we not? If it refers to possibly breaking down the app into several smaller deployments and load balancing across them, then I could see that helping, but some experimentation will be needed.

Also, as per our call, I think we agreed to work towards bringing your app to EP1. I'm curious whether we should start a new thread to focus on that. This GitHub issue has grown rather long, and if we're pivoting towards a new goal, it might be best to start fresh so folks don't need to catch up on so much context. Would you agree? If so, once you're ready, I'd appreciate it if we could move to a new issue.

@davidmrdavid done: #2832

@davidmrdavid
Contributor

Thanks @vany0114. Let's continue our conversation there, then. I'll be closing this issue to help reset context; this thread is getting hard to follow for those without pre-existing context.
