Netherite backend: Client request was cancelled because host is shutting down. #2603
Hi @davidmrdavid, yes, that's what I was referring to; the worst latency I've seen locking an entity is ~130 secs.
It's a different error, but it's a side effect/consequence of the original one (I've brought it up in previous messages in this thread). Any time we experienced the orchestrator getting stuck, this other issue showed up as well, preventing us from creating new orders since we weren't able to start new orchestrations or signal/lock entities. So it seems that while the mitigation is happening we get this other intermittent error. Some operation ids:
Error details:
Please let me know if that information helps. |
@davidmrdavid this is getting worse; while it's intermittent, it's happening more often than before. Yesterday there were a bunch of these errors throughout the day, as you can see below: the incoming vs the outgoing messages in the Event Hub don't match, and orchestration latency increased from ~3 secs to ~100 secs. |
Thanks for creating this issue. I'm sharing this internally so it gets an owner, as I'm unfortunately not personally available for the rest of this week. |
@davidmrdavid really appreciate it! I'll stay tuned. |
This error message is indicating that the entire host on which this client is running is shutting down and can therefore not process any more requests. The intended way to handle this situation is to retry the request on a different host. In particular, a local retry loop does not help - the request has to be retried on a different host. Based on your description, I would assume that the EG trigger retry should take care of that. |
@sebastianburckhardt Thanks for the explanation, yes I've reviewed some of the affected requests and indeed the EG trigger retry is doing its thing. On the other hand, do you know why that error is happening that often? |
Hey @vany0114, I wanted to provide an update here. It appears to me that, in most cases, when you receive the error "Client request was cancelled because host is shutting down.", the VM where this error was emitted previously logged the following message: "A function timeout has occurred. Host is shutting down.". In other words, the timeline of events seems to be as follows:
So, to answer your question: you are receiving the "client request cancelled" errors because some Azure Function invocations are timing out, which in turn triggers a VM restart. So now the question becomes: why are function invocations timing out? I don't know this for certain, but I have some theories that are partially backed by our telemetry. In particular, I see that during periods of high failure, your app is exceeding the 'healthy' CPU usage threshold: it's using over 90% of the CPU available on your VMs. As you can imagine, this can slow down processing, which can in turn lead to timeouts. Furthermore, on our other thread, I theorized that high CPU utilization was perhaps to blame for the periodic hangs in the "compaction" step of checkpointing. I also see evidence of compounding effects: the timeouts cause VMs to restart, which creates a backlog of processing, which means the backlog needs to be consumed on the next VM, which creates delays for new requests, which can cause new timeouts, and thus a cycle.

At this point, I think it would be ideal to provide your app with more CPU capacity, which should help stabilize it. I think there's a good chance this will help. I don't know for certain that this is the root cause of the timeouts, but so far I don't see any logical errors in our telemetry that explain the problem otherwise, so I think this is our next best bet. I'm also hesitant to provide a new private package with a more aggressive mitigation (i.e. one that fires more quickly), as the mitigation is to terminate processing for a durable partition on a given VM, which adds some delay to processing and therefore risks exacerbating the timeouts.

@vany0114: Are you able to upgrade/scale-up your application to utilize more powerful VMs? In particular, VMs with more CPU capacity (e.g. # of cores) should help improve the stability of your app. It's not necessary to go to the highest plan; just a higher one should help. Again, I don't know for certain that this will make the problem disappear, but it should remove one of the sources of instability I'm seeing. I also recommend proactively monitoring your CPU utilization to ensure it remains under healthy limits. |
@davidmrdavid Thanks for the detailed update! I'll move to the premium plan then because right now we're using a consumption plan. |
@davidmrdavid I'm reviewing this issue, particularly in our dev environment, which is where we hit it most often: as you can see, almost all requests are failing because of the timeout and shutting-down issues. I'm bringing this up because the traffic in our development environment is not as high as in production; actually, it's very low. In terms of infrastructure, the only difference is the throughput units that the Event Hub namespace uses in dev vs production (1 and 4 respectively), since both function apps (dev and prod) are deployed on a consumption plan. This behavior seems interesting to me because I wouldn't expect any issues in our development environment given the low number of requests we process there, but it's the opposite: the behavior in our dev environment is actually worse than production. As a matter of fact, we reset it (func app, storage, and Event Hub namespace) last week and look at it now; it's totally unhealthy. For instance, look at the behavior of the EH messages. BTW, here are some insights regarding memory and CPU consumption over the last 24 hours: there's indeed high memory consumption during certain hours (which is something I'd like to review to try to reduce allocations on our end), but it's not that high the whole day. On the other hand, the CPU seems healthy. I hope this information helps you with the diagnosis. |
Just saw this, @vany0114. I'll take a comparative look at the dev environment tomorrow. |
In the meantime: any context on what could be causing those large memory allocations? Are you aware of any large allocations on your end, or is that surprising? |
@davidmrdavid Not sure yet, I need to look into it, but I think it might be the main input that the orchestration is created with (which is also persisted as part of the durable entity state, and may have an impact when it gets serialized/deserialized). But again, I'm not sure; I'll need to do a memory profile or something to get more insights. |
@davidmrdavid today I redeployed the func app to move it to a premium plan and also redeployed the Event Hub NS to use a standard tier again; however, I've been seeing this error a lot ever since:
Do you know what it means? |
I had to go back to a consumption plan due to that issue FYI |
Thanks @vany0114. Regarding the error: "An item with the same key has already been added. Key: durabletasktrigger-netherite-ordersagav3". Yes, I've seen that before. My understanding is that this is a transient error in the ScaleController that is safe to ignore, although noisy. Unfortunately, it cannot be fixed on the application side; the ScaleController team needs to make a release for it to disappear. Did you experience that error affecting the correctness of your app, or was it just polluting the Application Insights logs? Just confirming. As for the test app: I took a preliminary look but I'm not entirely ready to share a conclusion there just yet. I'm starting an internal thread about it. I'll keep you posted. |
@davidmrdavid I initially thought it was just noise, but I started seeing misbehavior, basically with an ack/event that the func app waits for: even though the ack was being published in a timely manner, the func app somehow was not processing it and it just timed out, so that's why I moved back to the consumption plan. I'm curious why I don't see that error when the app is deployed on a consumption plan 🤔
I was told by the Azure support engineer that I could update the extensions package since it was already fixed, but based on your answer it seems I cannot, so I'll wait until you give me the green light to do so; then I'll try moving the func app to a premium plan again. |
@davidmrdavid FWIW here is an example of an affected instance that I was referring to: |
@vany0114: Regarding that ScaleController log, that will be fixed once their code takes on a dependency on a Netherite release with this fix (which I just approved): microsoft/durabletask-netherite#316 (review) . Again, my understanding is that this bug can be safely ignored, so I need to double check that affected instance you shared. Looking into that now. I could have been wrong. I did look into your staging app. I don't have a clear picture of what's wrong with it yet, but I do see a lot of OOMs. At first, I thought maybe you had some kind of persistent error loop due to trying to deserialize too large of a blob, but I see the app has been slowly healing over the past day, which invalidates that theory. Did you do anything to your staging app in the past day to help it recover? |
Yes, I had to reset it because our dev environment wasn't working because of that. Could the fact that the func app shares the same service plan with other apps have something to do with it? In our dev environment, most of the func apps are deployed in the same service plan, FWIW. |
@vany0114: ok, it's good to know it was reset, as that helps validate that my initial findings were probably not entirely off. I'll continue my investigation while ignoring the latest, healthier, data. As an aside: There's an incoming Netherite release that I think should include the mitigation in your preview package, as well as the fix for the ScaleController error. I'll keep you posted on that. Regarding your question on using the same app service plan: I think it's fine to do this. |
@davidmrdavid I'd like to know your thoughts on what the next course of action would be to fix this problem definitively. This is hurting us really badly, and resetting the app every week is not scalable; for instance, the day is just starting and it's already timing out... I've tried everything you have advised (which I appreciate a lot) and the issue persists, so I don't really know what else to do to stabilize our app 😞 |
@vany0114: Yes, I want us to get your application to a better state soon. I'm also concerned about its current state. Here are my thoughts:
If you decide to scale up to Elastic Premium, please "reset your app"/ clear your taskhub after doing this. This will ensure that the newer VMs can start in a healthy state and, as such, be able to "keep up with the incoming work" instead of trying to do that while dealing with a pre-existing backlog.
I think it's critical that we stop these timeout-induced shutdowns. @vany0114: can you change your code so that you manage the timeout yourself in code instead of relying on the Functions runtime timeout setting (see the sketch after this comment)? This way, when a request times out, it will not cause the VM to restart, which should help with the instability. Addressing this will also make investigating easier for us. Right now, the timeout-induced shutdowns are extremely noisy, which makes it difficult to identify the root problem. This is my main recommendation.
Those are my immediate thoughts with your current set up. Finally, I think it's key to make sure your application isn't trying to save moderately large state in Entities (or in any Durable APIs), as per this best practice: https://learn.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-best-practice-reference#keep-function-inputs-and-outputs-as-small-as-possible . If you have reason to suspect your code may be violating this constraint, I think this would also be an area to invest in. If you have moderately large data being communicated through DF APIs, please use indirection to materialize it, as described in the linked doc. |
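For illustration, here is a minimal sketch of what managing the timeout in code could look like for an Event Grid trigger that starts an orchestration. This is not the app's actual code: it assumes the in-process model with Microsoft.Azure.WebJobs.Extensions.DurableTask and the older Microsoft.Azure.EventGrid.Models event type, and the function name, orchestrator name, the way the instance id is derived, and the 30-second budget are all hypothetical.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.EventGrid.Models;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class CreateOrderTrigger
{
    // Self-managed timeout, kept well below the host's functionTimeout so the runtime
    // never has to kill the worker because of this call (30s is an assumption).
    private static readonly TimeSpan ClientTimeout = TimeSpan.FromSeconds(30);

    [FunctionName("CreateOrderTrigger")]
    public static async Task Run(
        [EventGridTrigger] EventGridEvent eventGridEvent,
        [DurableClient] IDurableOrchestrationClient client,
        ILogger log)
    {
        // Hypothetical: however the app actually derives its orchestration instance id.
        string instanceId = eventGridEvent.Subject;

        // Race the durable client call against our own timer instead of letting the
        // invocation run into the runtime's functionTimeout (which recycles the host).
        Task<string> startTask = client.StartNewAsync("OrderSagaOrchestrator", instanceId, eventGridEvent.Data);
        Task completed = await Task.WhenAny(startTask, Task.Delay(ClientTimeout));

        if (completed != startTask)
        {
            // Fail the invocation ourselves so Event Grid retries the delivery on a
            // (possibly different) healthy host, without tripping the runtime timeout.
            log.LogWarning("Durable client call for {InstanceId} exceeded {Timeout}; failing fast for EG retry.",
                instanceId, ClientTimeout);
            throw new TimeoutException($"Durable client call for '{instanceId}' exceeded {ClientTimeout}.");
        }

        await startTask; // propagate any real failure from the client call
    }
}
```

The specific exception type shouldn't matter for the retry, since Event Grid retries on any function failure; keeping the runtime functionTimeout as a larger backstop still seems reasonable.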
Thanks for your thoughts. I think we can start by scaling up to Elastic Premium, so I'll wait for your instructions on circumventing the Scale Controller errors. Re handling the timeouts in the Event Grid triggers: absolutely, that's something we can do. I just have a question: can I re-throw the timeout exception so that Event Grid can retry, or should I re-throw a different exception? I'm asking because I don't know whether re-throwing the same timeout exception would still cause the shutting-down issue. Edit: actually it will throw the Polly
Will do. As for the improvement to avoid saving large states in the Entities, that's something we'll definitely do, but it will take us some time since that's not a trivial refactor on our end. |
@davidmrdavid since I'm planning on making a release tomorrow, can you please provide those instructions? FYI, the release will contain the timeout handling as well as the TakeStateCheckpointWhenStoppingPartition setting change. |
@davidmrdavid Thanks for the heads up and for sharing this. I have been looking into it and indeed there's a bug on our end that is causing us to discard a lot of events, due to a serialization issue that makes the instance ids not match, i.e.:

Example 1
Orchestrator instance id:
Event raised instance id:

Example 2
Orchestrator instance id:
Event raised instance id:
You're right, we put a validation in place to avoid raising events when the orchestration is already completed. Though I have a question: is there a way to know whether a given instance exists, so that we also avoid raising events for instances that do not exist? It should not happen in normal scenarios, but as you said, it helps avoid overwhelming the orchestrator with meaningless work. This is what we did:

    // This helps partitions avoid being busy with work that isn't meaningful/that is to be discarded.
    var statusResponse = context.GetStatusAsync(instanceId, showInput: false);
    if (statusResponse.IsCompleted) return;

    await ResilienceStrategy.ExecuteAsync(() => context.RaiseEventAsync(instanceId, eventName, integrationEvent));

These changes will be released next week, Monday or Tuesday, FYI. |
@vany0114: ah, I see, that's interesting. It sounds like you were obtaining a JSON that was serializing its keys out of order? In any case, glad we found and fixed a bug :) . Regarding your validation:

    // This helps partitions avoid being busy with work that isn't meaningful/that is to be discarded.
    var statusResponse = context.GetStatusAsync(instanceId, showInput: false);
    if (statusResponse.IsCompleted) return;

This works, but I think it doesn't catch the case where the instanceID never existed (as it was in the bug). I'm juggling a few different contexts here so I can't remember exactly how it works in .NET in-process, but a quick test of the response value of

However, there's a subtlety here: in Netherite, querying the status of an orchestrator does send an event to the partition, so while you may be avoiding the discarded message (which is good!), you will still "burden" the partition with a status read request and then, if the orchestrator is in fact available, you'll send another raised event right after that. So ultimately, if your goal is to minimize the partition burden, it may suffice to still optimistically send that raised event and rely on the discarding logic if the orchestrator already completed or doesn't exist. I may need @sebastianburckhardt to correct me here, but he's OOF for a few more days, so that'll need to wait. What really matters is that your instanceIDs have the expected format, so that if they're received, they lead to meaningful work.

In summary: I think you should be good deploying just the instanceID deserialization fix, and at this time I believe that may be more performant. In any case, my main goal was to reduce the number of unexplained warnings, and this change (irrespective of you adding the additional step of checking if the instanceID exists) does that :-) . That'll make it easier for me to read your app's health signals in the future. |
I tested it out and it does not return an exception nor a
Yeah I was thinking kinda the same, I'll remove that validation then. |
@vany0114: in your sample:

    // This helps partitions avoid being busy with work that isn't meaningful/that is to be discarded.
    var statusResponse = context.GetStatusAsync(instanceId, showInput: false);
    if (statusResponse.IsCompleted) return;

Don't you need to await the GetStatusAsync call? As written, statusResponse is a Task, so IsCompleted reflects the task, not the orchestration's status. |
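For reference, if the status check were kept, the call would need to be awaited and the orchestration's runtime status inspected. A minimal sketch assuming the in-process IDurableOrchestrationClient API; the helper name is hypothetical, and the Polly-based ResilienceStrategy wrapper from the earlier snippet is omitted for brevity:

```csharp
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class RaiseEventHelper
{
    // Awaits the status query before deciding whether raising the event is worthwhile.
    // A null status means the instance doesn't exist (never started, or already purged).
    public static async Task RaiseIfRunningAsync(
        IDurableOrchestrationClient client, string instanceId, string eventName, object payload)
    {
        DurableOrchestrationStatus status = await client.GetStatusAsync(instanceId, showInput: false);

        if (status == null ||
            status.RuntimeStatus == OrchestrationRuntimeStatus.Completed ||
            status.RuntimeStatus == OrchestrationRuntimeStatus.Terminated ||
            status.RuntimeStatus == OrchestrationRuntimeStatus.Failed)
        {
            return; // instance missing or finished: the raised event would just be discarded
        }

        await client.RaiseEventAsync(instanceId, eventName, payload);
    }
}
```

That said, as noted above, in Netherite the status query itself sends an event to the partition, so optimistically raising the event and relying on the discard logic may be the cheaper option.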
@davidmrdavid BTW do you think it explains the high outgoing messages in the event hub? |
You're totally right 🤦 but I did remove it as per our previous conversation. |
Not sure, but that is a signal I've also noticed throughout this thread. It seems to correlate with partition slowdowns in general; I think we've seen it both in "partition stuck/corrupted" instances as well as in general temporary slowdowns, but I'm not yet certain what it represents. It's not clear to me how a stuck/slowed partition increases the rate of outgoing messages; I'll need to consult with the rest of the team. |
In general, there have usually been other, clearer explanations of what's going wrong when you report a problem (usually it suffices to look at the partition logs and see if they're busy, or online at all), so I've been relying on those. But I do want to understand the high EH activity. It sounds like you're drawing a correlation between high EH activity and this issue with the raised events? Can you tell me more about that? Maybe there's something application-specific I'm missing that would explain it. |
Nope, I'm not saying there's a correlation, but it seems suspicious to me; maybe the correlation is more with our timeout and/or the transient shutdown issue? |
Right. Here's my summarized mental model:

The explanation of the spike in EH outgoing message activity is still unknown to me, but it hasn't been a blocker yet for identifying issues in the app.

The shutdowns have been mostly explained by two reasons:

We've mitigated most of this by:

Now the remaining questions are:

(2) Have we stabilized Netherite itself: are there still partition corruption issues to worry about? Things are looking better here for us ever since we updated to the latest FASTER, but obviously this is a critical point for us.

(3) Can we recover from these issues without you needing to 'reset the app'? If so, can we automate it?

Let me know if you feel I'm missing anything |
Returning to the purging-history conversation: we should definitely be purging data periodically; that should help speed up all sorts of FASTER queries, which is good for stability. You could set your purging of data to occur once a week if you're worried about the partition burden, but I think it should occur. In fact, now I wonder if the fact that you're not purging data could explain why we have periods of 'good performance' and then a degradation kicking in a few weeks later.

And yes, there's a subtlety with entities in purging: entities by design "live forever" unless explicitly deleted. So a purge operation, if I'm not mistaken, will not delete that data. If you're creating transient entities, you should delete them with the

If you already have a large number of transient entities that are completed but taking up storage, you may need to call the "list entities" API and purge the ones you don't need; maybe you can look at each entity's last update timestamp and use that to conservatively delete really old ones? |
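For illustration, a minimal sketch of a scheduled purge using the in-process PurgeInstanceHistoryAsync API. The function name, timer schedule, and 14-day retention are placeholders to adapt, and, as noted above, entities are not removed by this and still need to be deleted explicitly.

```csharp
using System;
using System.Threading.Tasks;
using DurableTask.Core;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Extensions.Logging;

public static class HistoryPurgeJob
{
    // Runs daily at 03:00 UTC and purges history for orchestrations in a terminal
    // state that were created more than 14 days ago (both values are assumptions).
    [FunctionName("HistoryPurgeJob")]
    public static async Task Run(
        [TimerTrigger("0 0 3 * * *")] TimerInfo timer,
        [DurableClient] IDurableOrchestrationClient client,
        ILogger log)
    {
        DateTime cutoff = DateTime.UtcNow.AddDays(-14);

        var result = await client.PurgeInstanceHistoryAsync(
            DateTime.MinValue,   // createdTimeFrom
            cutoff,              // createdTimeTo
            new[]                // only purge terminal states
            {
                OrchestrationStatus.Completed,
                OrchestrationStatus.Failed,
                OrchestrationStatus.Terminated,
                OrchestrationStatus.Canceled,
            });

        log.LogInformation("Purged {Count} orchestration instances created before {Cutoff}.",
            result.InstancesDeleted, cutoff);
    }
}
```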
@davidmrdavid I think that's it, you perfectly summarized it. I would just add a couple more questions: |
I merged the fix for this issue yesterday; I just haven't had time to perform the release steps. However, I worry the memory footprint you're measuring might be a red herring: the isolated process model has two processes, the "host" and your "app" (compared to the in-process model where both co-exist in the same process). I worry the memory reduction you're seeing is in your app's process, but it's missing the "host" process's own memory usage. I'm saying this because it's the same Netherite code that's running in both models (in-proc vs isolated), so it doesn't make sense to me that the memory footprint would be different; I suspect it's more likely the memory footprint was simply moved elsewhere (from the app process to the host process). Again, I haven't measured this yet, but from first principles I worry this may be a red herring. |
@davidmrdavid yeah that makes a lot of sense; in the next release we'll be purging history once a day. Re the entities, we always delete them once the orchestration finishes. |
I see, thanks for the clarification. |
This depends. Netherite does have higher memory requirements, and consumption has lower memory capacity, so that clearly poses a bigger challenge than with the default backend; but an app under the right circumstances (config, traffic, architecture) can use Netherite in the Consumption plan. That said, I am definitely thinking of writing a "best practices" guide that pushes users to reserve Netherite for higher-capacity plans, like Elastic Premium. I think users who gravitate towards Netherite are naturally trying to "push the system" to the limit (Netherite is supposed to be for high-throughput scenarios, after all) and naturally this will work better when the system has more capacity to give.

For your app specifically: it's proving to be difficult to return to the consumption plan. I don't think it's impossible, but this thread shows it isn't easy. That said, on my end I did spend a large part of this investigation focusing on the partition corruption issues over debugging your app's performance, as the former was a permanent error (+ data loss) and the latter could be 'mitigated' by using the Elastic Premium plan. So now that the corruption issue has not occurred in a while, I'm hoping we'll be able to experiment with tuning your Netherite config to see if it can bring you down to the consumption plan. The question, though, is whether you'd prefer that over trying to get more performance out of the Azure Storage backend (which will definitely run in consumption, but may not give you the throughput you want); that is one of the reasons why I asked you to schedule a meeting with us (and I see it's scheduled for next week). |
Here are a few things, some of them possible today, some needing implementation and design on our end, that should help facilitate getting Netherite on the consumption plan. I'll write them down here for future reference.

(1) Using a 'shadow store' for client queries

As we discussed just now, getting the state of an orchestration sends a request to the partition. Ideally, we would be able to perform queries (i.e. reading the state of entities and orchestrations) without needing to burden the partition itself. We can do this by creating a table in Azure Storage, similar to the

This doesn't exist today, but the good news is that you can implement it easily on the application side: if your orchestrators and entities post their state to a table throughout their lifecycle, you can query that table to determine the status of orchestrations/entities instead of going through the partition. This could help if we believe queries are a big part of the load, which we have not established yet. A minimal sketch of this idea is shown after this comment.

(2) Smarter load-balancing

(3) Partition throttling

I think part of the challenge today is that Netherite does not throttle its incoming work. This is part of why you get timeouts in your client requests: Netherite is accepting that work, and being optimistic that it'll complete it before the timeout, but it does not. If it had throttled your request and failed it immediately (assuming it was too busy to take on new work), that would probably be a better signal to you as a user, and could also help the partition remain healthy by keeping memory usage low at the expense of outright rejecting client requests instead of timing out on them. This is tricky to implement because the throttling shouldn't be too sensitive either, but if we make it user-configurable, you could make it as sensitive as you'd like to help control the load on each partition. Naively speaking, with a sensitive enough throttle you're effectively reducing the traffic that reaches each Netherite partition, which may be the key for your app to run on lower-capacity VMs. Of course, you have to balance this with your throughput needs. |
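As referenced under (1), here is a minimal sketch of such an application-side shadow store using Azure.Data.Tables. The table name, partition key, and stored columns are hypothetical, not something the Durable extension provides.

```csharp
using System;
using System.Threading.Tasks;
using Azure.Data.Tables;

// Application-side "shadow store": orchestrators/entities record their latest status here
// (e.g. from an activity function), and client-side triggers read it instead of sending a
// status query to the Netherite partition.
public class OrchestrationStatusCache
{
    private readonly TableClient _table;

    public OrchestrationStatusCache(string storageConnectionString)
    {
        _table = new TableClient(storageConnectionString, "OrchestrationStatusCache");
        _table.CreateIfNotExists();
    }

    // Called at interesting lifecycle points (started, waiting for ack, completed, ...).
    public Task UpsertAsync(string instanceId, string status) =>
        _table.UpsertEntityAsync(new TableEntity("orders", instanceId)
        {
            ["Status"] = status,
            ["UpdatedUtc"] = DateTimeOffset.UtcNow,
        });

    // Returns null when no status has been recorded for the instance.
    public async Task<string> TryGetStatusAsync(string instanceId)
    {
        var response = await _table.GetEntityIfExistsAsync<TableEntity>("orders", instanceId);
        return response.HasValue ? response.Value.GetString("Status") : null;
    }
}
```

Orchestrators would call UpsertAsync through an activity at the lifecycle points they care about, and client-side triggers would call TryGetStatusAsync instead of querying the partition with GetStatusAsync.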
@davidmrdavid Yep, that's understandable, everything has a cost and trade-offs; in this case, the cost of high throughput is high memory consumption, and I agree that it depends on the specifics, as you said: traffic, architecture, and configuration are all factors in memory consumption. And yeah, documenting a best-practices guide would be great for future users; I wish I had known all of the things we've discussed in this thread early on to avoid these headaches (fortunately your support has been super helpful and much appreciated 🙏). For example, knowing about the higher memory requirements that Netherite has would have been super helpful before deciding to move to it. However, in our case it was not only because of throughput that we decided to move to Netherite, it was because of reliability too: we used to use Azure Storage as the backend and we found ourselves resetting the app several times a year (3 or 4), not as much as with Netherite, so we thought that with Netherite we wouldn't have that issue. Even though we've learned a lot about Netherite and the Durable framework the hard way, because of the bugs we've found in Netherite/FASTER and the lack of documentation we were talking about, it has helped us to improve our service either way (and hopefully to improve Netherite itself) and to start considering the trade-offs of moving back to Azure Storage as the backend as a possibility. My concern, other than throughput, is facing similar problems, such as partition corruption and other bugs that were fixed as part of this long discussion (if they apply to that backend).
Yep, let's give it a try, I'm willing to do so. I hope that after the release we're going to do next week, which includes the fix for the discarded events and the purging stuff, the app stabilizes even more and memory consumption stays as flat as possible (or at least grows logarithmically).

Lastly, I have a question for you: is it possible to shard data in a durable func app? I'm curious to know your thoughts about it, since that's pretty common, for example in multi-tenancy architectures; I've been thinking about it as another improvement to help with the load. |
Yep, we do, because we need it to raise events, etc. I guess we could use the autogenerated ids and keep a mapping between the orchestration ids and our own id? That, BTW, could be another thing worth mentioning in the documentation IMHO. Re the partition, yep, that'd be great. |
Possibly - I'd first need to establish that you have a skewed distribution of partition load, with some partitions being significantly more burdened than others. Otherwise, this indirection will be a waste. I'll try to see if the data supports this possibility.
It is technically mentioned here (https://microsoft.github.io/durabletask-netherite/#/placement) but those aren't the official MSDocs docs so you're right :) . This GitHub page has a lot of worthwhile details though so I recommend reading through it.
Hmmm - if by sharding you mean assigning different segments of data to different machines, then that's exactly what our partitions do: each partition holds the data for a different range of instanceIDs, and each partition may be assigned to a different machine (if you're sufficiently scaled out). Of course, for this to work best, instanceIDs need to be random so they're evenly balanced across partitions (a sketch of the autogenerated-id mapping idea from earlier follows this comment).

And regarding this comment: #2603 (comment) . Thanks, we really want to get y'all to a stable place; I'm not happy with having to 'reset your app', the point of durable functions is that we should tolerate all sorts of transient failures, so I'm glad we're making forward progress on that. Yes, I think this thread has driven a lot of improvement to Netherite. |
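As mentioned above, a small sketch of the autogenerated-id idea: let the framework pick a random instance id (which hashes evenly across partitions) and keep an application-side mapping from the business order id to it. The table name, orchestrator name, and helper shape are hypothetical, and error handling is omitted.

```csharp
using System.Threading.Tasks;
using Azure.Data.Tables;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

// Starts orchestrations with autogenerated instance ids and records an orderId -> instanceId
// mapping so events can still be raised for that order later.
public class OrderOrchestrationStarter
{
    private readonly TableClient _map;

    public OrderOrchestrationStarter(string storageConnectionString)
    {
        _map = new TableClient(storageConnectionString, "OrderInstanceMap");
        _map.CreateIfNotExists();
    }

    public async Task<string> StartAsync(IDurableOrchestrationClient client, string orderId, object input)
    {
        // No explicit instance id: the framework generates a random one.
        string instanceId = await client.StartNewAsync("OrderSagaOrchestrator", input);

        await _map.UpsertEntityAsync(new TableEntity("orders", orderId) { ["InstanceId"] = instanceId });
        return instanceId;
    }

    public async Task RaiseOrderEventAsync(IDurableOrchestrationClient client, string orderId, string eventName, object payload)
    {
        var row = await _map.GetEntityAsync<TableEntity>("orders", orderId);
        await client.RaiseEventAsync(row.Value.GetString("InstanceId"), eventName, payload);
    }
}
```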
I was aware of that page but I never read the placement stuff, I didn't know the
I meant sharding data at the Event Hub level, so that, for instance, we can have several instances of the app running, each tied to a specific Event Hub depending on the tenant... I guess it would also require each app to be tied to a different storage account; maybe that would be possible using vertically partitioned deployments. |
@davidmrdavid any thoughts? ☝️ |
If I understand correctly, you're asking if each partition could have its own Event Hub and also share a single Azure Storage account. In theory, I think that's possible, but for your case I don't think it would really help reduce the partition load. My impression is that Event Hubs actually keeps up with the load quite well; it's the VM that does not have enough capacity to keep up with the load it is receiving and may throw OOM exceptions when a surge of events comes in. |
@davidmrdavid Exactly, that's why I'm thinking of a vertically partitioned deployment to help distribute the load to the workers, so that we can have some tenants tied to a given instance, i.e.:

I think that with that approach we would probably be able to use Netherite on a consumption plan; does that make sense? |
@davidmrdavid bubbling this up ☝️ |
Hey @vany0114: I believe we discussed this "vertical deployment" during our call, did we not? If it refers to possibly breaking the app down into several smaller deployments and load balancing across them, then I could see that helping, but some experimentation will be needed. Also, as per our call, I think we agreed to work towards bringing your app to EP1. I'm curious whether we should start a new thread to focus on that. This GitHub issue has grown rather long, and if we're pivoting towards a new goal, it might be best to start fresh so folks don't need to catch up on so much context. Would you agree? If so, once you're ready, I'd appreciate it if we could move to a new issue. |
@davidmrdavid done: #2832 |
Thanks @vany0114. Let's look to continue our conversation there then. I'll be closing this issue to help reset context, this thread is getting hard to follow for those without pre-existing context. |
Description
@davidmrdavid as discussed in #2534, I'm opening another issue to troubleshoot this other issue, since it is related: it is a consequence/side effect of the mitigation that avoids an orchestrator getting stuck for hours locking an entity.
Expected behavior
Actual behavior
When the mitigation happens, what I've seen is that when it tries to signal an entity (one of the ones we lock), raise an event, or even start up new orchestrations, it fails with a System.OperationCanceledException that says: Client request was cancelled because host is shutting down.

Any ideas? We have a retry strategy in place with an exponential backoff that waits up to 2.5 secs, but it seems that the issue takes longer (because all those operations failed after 2.6 secs). Since this is an Event Grid trigger, the EG retry strategy comes into play after that, but I would need to review whether the EG retries succeeded or not because I'm not sure how long this issue might take; as you can see, when that error appeared there were intermittent OperationCanceledException errors for around ~9 hours.
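For context, a minimal sketch of what such a retry strategy could look like with Polly. The ResilienceStrategy name matches the snippet quoted earlier in this thread, but the retry count and backoff values here are assumptions, not the app's actual configuration. As noted elsewhere in the thread, a local retry cannot help once the whole host is shutting down; the Event Grid retry on a different host is the real recovery path.

```csharp
using System;
using System.Threading.Tasks;
using Polly;
using Polly.Retry;

public static class ResilienceStrategy
{
    // Exponential backoff capped at 2.5s per wait; retry count and delays are assumptions.
    private static readonly AsyncRetryPolicy RetryPolicy = Policy
        .Handle<OperationCanceledException>()
        .WaitAndRetryAsync(
            retryCount: 3,
            sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Min(2.5, 0.5 * Math.Pow(2, attempt - 1))));

    public static Task ExecuteAsync(Func<Task> action) => RetryPolicy.ExecuteAsync(action);
}
```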
Known workarounds
None
App Details
Screenshots
If deployed to Azure
orders-saga