
Orchestration getting stuck while getting the lock #2534

Closed
vany0114 opened this issue Aug 7, 2023 · 166 comments
Labels: Needs: Investigation 🔍 (a deeper investigation needs to be done by the project maintainers) · Needs: Triage 🔍 · P1 (Priority 1) · Reliability (Durable functions get stuck or don't run as expected)

Comments

@vany0114

vany0114 commented Aug 7, 2023

Description

The full description with code snippets, screenshots, issue samples, etc. is in #2530

Expected behavior

Acquire the lock in seconds at most, not minutes or hours.

Actual behavior

It seems the orchestration is getting stuck while acquiring the locks of the entities involved in the orchestration.
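For context, the orchestration uses the durable entities critical-section pattern; a minimal sketch of that pattern (in-process C#, with placeholder entity names rather than the app's real @order@/@scheduledkey@/@ticketnumberkey@/@originorderkey@ entities) looks like this:

using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class OrderSagaSketch
{
    // Placeholder entity names; the real app locks entities such as
    // @order@..., @scheduledkey@..., @ticketnumberkey@..., @originorderkey@...
    [FunctionName("OrderSagaSketch")]
    public static async Task RunOrchestrator(
        [OrchestrationTrigger] IDurableOrchestrationContext context)
    {
        var order = new EntityId("order", "some-key");
        var scheduled = new EntityId("scheduledkey", "some-key");

        // LockAsync sends a lock request to each entity and waits for every
        // acknowledgement before entering the critical section; the hang
        // described above happens during this wait.
        using (await context.LockAsync(order, scheduled))
        {
            // Critical section: operations on the locked entities.
            await context.CallEntityAsync(order, "Reserve");
        }
    }
}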

Known workarounds

Reset the durable storage account and the func app storage account

App Details

  • Durable Functions extension version (e.g. v1.8.3): 2.10.0
  • Azure Functions runtime version (1.0 or 2.0): 2
  • Programming language used: C#

If deployed to Azure

  • Timeframe issue observed:
  • Function App name: orders-saga
  • Azure region: West-US
  • Azure storage account name: ordersagav2
@ghost ghost added the Needs: Triage 🔍 label Aug 7, 2023
@davidmrdavid
Contributor

@vany0114: while we start this investigation, can you share with us when this issue first started (rough timestamp in UTC)? Does it correlate with any code changes (e.g. partial classes, or trying out the new partition manager)?

@vany0114
Author

vany0114 commented Aug 7, 2023

@davidmrdavid It seems the issue started around 2023-08-07 13:34 UTC

@davidmrdavid
Contributor

Thanks! Any known code/config changes around that time? @vany0114?

@vany0114
Author

vany0114 commented Aug 7, 2023

@davidmrdavid Nope, we haven't touched anything since the last deployment on August 2nd, which BTW was the last time the issue showed up and we had to reset everything.

@vany0114
Author

vany0114 commented Aug 7, 2023

@davidmrdavid BTW I'll re-deploy the func app today to try to mitigate this issue, but this time it will use Netherite as the storage provider. That means the old ordersagav2 storage account will still be there for troubleshooting, but the func app's storage account will be reset; would that be a problem for you guys while you look into it?
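For reference, the switch is a host.json change along these lines (a sketch based on the Netherite provider docs; the hub name and the app-setting names on the right are placeholders):

"extensions": {
  "durableTask": {
    "hubName": "ordersaganetherite",
    "storageProvider": {
      "type": "Netherite",
      "StorageConnectionName": "AzureWebJobsStorage",
      "EventHubsConnectionName": "EventHubsConnection"
    }
  }
}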

@davidmrdavid
Contributor

@vany0114: that shouldn't be an issue on our end, as our historical telemetry is stored independently of your app's live state, so you're good to go ahead :).

I'm not certain at this point in time that this is an issue with the storage provider itself, so I can't comment on whether changing to Netherite will help, but it probably won't hurt to try it out and may tell us more about the problem. Just an FYI.

@vany0114
Author

vany0114 commented Aug 7, 2023

Agreed, I'm not sure either; I just want to try something different this time, and as you said, it may tell us more about the problem.

@vany0114
Author

vany0114 commented Aug 7, 2023

@davidmrdavid BTW it would be great if you guys could also take a look at microsoft/durabletask-netherite#294, since that's related to the Netherite implementation.

@davidmrdavid
Contributor

A few small updates:
(1) I don't see your app using the new table partition manager. I'm only calling this out because you mentioned this new feature in the discussion thread. Just to confirm: is this expected or did you take an action to try and use the new partition manager? Just triple checking the expectation.

(2) I'm not immediately spotting a partition manager-type issue. In other words, the queues that are hosting your orchestrator and entity messages seem to be receiving new messages just fine. So I'm starting to think the issue here might have to do with the entity message processing, not with receiving messages. Still need to validate this.

@davidmrdavid
Contributor

davidmrdavid commented Aug 7, 2023

Actually, I'm seeing abnormally high message ages in "control queue 3". Let me drill into that. If that queue is hosting any of the entities you're trying to lock, maybe that's to blame.

@vany0114
Author

vany0114 commented Aug 7, 2023

(1) I don't see your app using the new table partition manager. I'm only calling this out because you mentioned this new feature in the discussion thread. Just to confirm: is this expected or did you take an action to try and use the new partition manager? Just triple checking the expectation.

Yeah, I rolled back that change because I thought it was the culprit, but it wasn't. Sorry, I forgot to mention that.

@davidmrdavid
Contributor

Actually, I'm seeing abnormally high message ages in "control queue 3". Let me drill into that. If that queue is hosting any of the entities you're trying to lock, maybe that's to blame.

I'm not able to draw a correlation from this observation to the symptoms you experienced, so let's ignore this for now.

Also, I've been focusing on this set of data, from the other thread:

{"storeId":"615c29df-5e7c-458c-9486-ef9480d2bf3b","courseUid":"c77bb860-04f7-46d3-86d9-d8c0e65fe856","checkNumber":"90007"} same, the instance was stuck 21 minutes:

* @ticketnumberkey@615c29df-5e7c-458c-9486-ef9480d2bf3b_2023-08-07_90007-1
* @originorderkey@615c29df-5e7c-458c-9486-ef9480d2bf3b_2023-08-07_90007-1
* @scheduledkey@615c29df-5e7c-458c-9486-ef9480d2bf3b_2023-08-07_90007-1
* @order@{"storeId":"615c29df-5e7c-458c-9486-ef9480d2bf3b","courseUid":"c77bb860-04f7-46d3-86d9-d8c0e65fe856","checkNumber":"90007"}

So that orchestrator is stuck from about ~11:33 to ~11:54, and I see in my telemetry that EntityId @scheduledkey@615c29df-5e7c-458c-9486-ef9480d2bf3b_2023-08-07_90007-1 is the last Entity to respond to that lock request, precisely at about ~11:54, whereas others responded around ~11:33. So the question became: why did it take us so long to acquire the lock on this final entity?

Well, I see that entity is managed by control queue 01. According to our telemetry, control queue 01 was being listened to by "VM 19" from ~13:15 to ~13:31. Then, no VM actively listens to new messages from that queue until ~13:54, when "VM 20" takes over.

To summarize, it appears that no new messages are being processed from that queue from ~13:31 to ~13:54.
So now what I need to answer is the following:
(1) Why does VM19 stop listening to this queue in the first place? Normally, this happens because the VM is in the process of shutting down: another worker may steal ownership of its assigned queues, and the VM therefore stops listening for new messages.
(2) Why does it take so long for VM20 to start listening to queue 01? My guess is that VM 19 took too long to restart, which prevented VM20 from being able to listen to that queue (only one VM can listen to a single queue at a time, otherwise we risk "split brain" issues).

That's what I've been able to find so far. Assuming I understand the problem correctly, this means, to me, that simply restarting your app should suffice to mitigate this situation when using the Azure Storage backend.

I'll continue looking into the logs to answer the two open questions above; I'll keep you posted.

@vany0114
Author

vany0114 commented Aug 8, 2023

@davidmrdavid I much appreciate the update 🙏

@vany0114
Author

vany0114 commented Aug 8, 2023

[...] this means, to me, that simply restarting your app should suffice to mitigate this situation when using the Azure Storage backend.

Unfortunately, it didn't work, so I'm gonna reset it again.

@davidmrdavid
Contributor

Hi @vany0114 - if you have a new time range of when your app was stuck, as well as the timestamp of your restart, that would help us greatly.

@vany0114
Author

vany0114 commented Aug 8, 2023

@davidmrdavid It has been stuck since yesterday (2023-08-07 13:34 UTC); I restarted it when I left the previous comment. Now the func app is using Netherite and it's working fine, but I guess you'll be able to find a lot of instances stuck in the old storage account.

@davidmrdavid
Contributor

@vany0114: I'll be looking at the logs but I need a bit more details.
In this post, you mentioned that some orchestrators were temporarily stuck / greatly delayed when acquiring locks. Those issues are from around ~13:34 UTC, and I think each self-healed within 30min if I understand correctly.

So when you say that it's been stuck since 2023-08-07 13:34 UTC, do you mean that there are some instanceIDs that never recovered? Or have you been experiencing intermittent delayed orchestrations since that time? If you have an orchestrator that has been stuck since 2023-08-07 13:34 UTC and remains stuck until now, that would be a useful instanceID for us to have. Thanks!

@vany0114
Author

vany0114 commented Aug 8, 2023

It's not intermittent; it was happening with basically every request processed by the orchestrator. Here are two more examples where the instances were stuck for hours.

  • {"id":"3107849517381639","businessDate":"2023-08-07T00:00:00","locationToken":"5zAQ1KZzYkqmea9XbBkLbA=="} This was suck for more than 10 hours.
    image

  • {"id":"3107849515304964","businessDate":"2023-08-07T00:00:00","locationToken":"5zAQ1KZzYkqmea9XbBkLbA=="} and this one more than 6 hours
    image

As a result, our processed-orders rate was affected; you can see here how dramatically it dropped, because the orchestrator was holding almost all of them.
image

@davidmrdavid please let me know if the information provided helps

@davidmrdavid davidmrdavid added the Needs: Investigation 🔍 and P1 labels and removed the Needs: Triage 🔍 label Aug 8, 2023
@nytian nytian self-assigned this Aug 11, 2023
@nytian
Collaborator

nytian commented Aug 14, 2023

Hi @vany0114. My team member @davidmrdavid is OOF and I am taking over this case. Can you tell me what kind of id "id":"3107849517381639" is? And is "businessDate":"2023-08-07T00:00:00" in UTC? Thanks!

I queried with the function app name and timestamps since 2023-08-07T00:00:00 UTC, but I didn't find any message with an age of several hours. So I think the orchestration instanceIds of the two examples above would be helpful for my investigation.

@vany0114
Author

Hi @nytian, the ids I've shared are actually the orchestration ids:

  • {"id":"3107849517381639","businessDate":"2023-08-07T00:00:00","locationToken":"5zAQ1KZzYkqmea9XbBkLbA=="}
  • {"id":"3107849515304964","businessDate":"2023-08-07T00:00:00","locationToken":"5zAQ1KZzYkqmea9XbBkLbA=="}

The orchestration ids are basically small serialized JSON objects.

@nytian
Collaborator

nytian commented Aug 14, 2023

I queried the two InstanceIds above. The orchestrator awoke when it received the external event, which is expected.
image

As you can see, the age of the ExternalEventRaised message is only 900, which means the message got picked up by the orchestration in less than 1 second. I didn't find any problem here. It seems the orchestrator was just waiting for the external event to be raised, and the time gap was simply that wait.

So I am curious what the expected behavior for this event is. Is it supposed to be raised that much later? And how do you raise the event? Thanks!
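(For reference, by "raising the event" I mean an explicit client-side call along these lines; the event name and instance id below are placeholders, not something taken from your app:)

using System.Threading.Tasks;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;

public static class RaiseEventSketch
{
    // Sends an ExternalEventRaised message to a running orchestration instance.
    // The client instance would come from a [DurableClient] binding.
    public static Task RaiseAsync(IDurableOrchestrationClient client)
    {
        var instanceId = "some-orchestration-instance-id"; // placeholder
        return client.RaiseEventAsync(instanceId, "SomeEvent", new { ok = true });
    }
}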

@vany0114
Author

@nytian I'm not sure if you've already read the whole thread and all the documentation provided, but the problem to figure out here is why that event took so long. That's not an event our orchestrator is explicitly waiting for; it seems that internally, when an entity is locked, the orchestration waits for some sort of ack, and that is what causes the orchestrations to hang.

image

@vany0114
Author

@nytian @davidmrdavid any updates?

@vany0114
Author

vany0114 commented Aug 19, 2023

More examples from today

  • {"storeId":"0cbddcbd-6df1-4e00-9e4a-e88a876a8996","courseUid":"bf2933ca-0770-4e19-bf75-ea93baed75c6","checkNumber":"26944"}
    image

  • {"storeId":"9c4e79c3-d748-4c11-8975-c6f5910118a4","courseUid":"76082c77-de76-4999-a69a-a04c221599e8","checkNumber":"1023"}
    image

  • {"storeId":"5c7d0ceb-d19f-48e9-a62a-ed8702d3ff95","courseUid":"5e44fac5-239b-406f-b400-e29885b3833f","checkNumber":"13624"}
    image

FWIW it was working great after we started using Netherite; however, yesterday we made a deployment around 2023-08-18T12:45 UTC, and after that it seems the problem has arisen again.

@davidmrdavid can we get some help ASAP, please? This is happening in our production environment.
cc @cgillum

@vany0114
Author

vany0114 commented Aug 20, 2023

It happened again today. One weird thing I noticed is a spike in the Netherite Event Hub incoming messages between 8:32 am and 9:37 am UTC, which does not match the traffic volume during that time window. (FWIW, the problem from yesterday showed up in the same time window.)

image

As you can see, the volume of requests in the event hub does not match the incoming traffic at that time. You can also see that for about an hour the response time was degraded; it was consistently 1 minute (because of our timeout configuration).
image

After that, it seems the orchestrator auto-healed and started operating normally again.

This is one of the instances affected during that time window (same problem, locking an entity): {"storeId":"7794ca1e-7e8c-49ef-8e2a-763fc114a99c","courseUid":"cb03b4f3-a26a-4465-84d1-9e44a7ae65e2","checkNumber":"28584"}

image

@davidmrdavid @cgillum

@vany0114
Author

vany0114 commented Jul 2, 2024

@davidmrdavid got it, thanks! And what about the partition count? Would increasing it to the maximum (16) help with throughput?

@davidmrdavid
Contributor

@vany0114: If starting with a brand new task hub - then yes it can help increase throughput when scaled out, but will also increase storage costs (more queues to load balance and poll). But please note that increasing the partition count is not a safe thing to do for a pre-existing task hub (instanceIDs land on a queue based on partition count, so if you change it, you risk having old messages now "in the wrong queue"), so if you change that please do so for a brand new and empty task hub.
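For concreteness, that setting lives under storageProvider in host.json; a sketch of the change, to be applied only together with a brand-new, empty task hub (the hub name below is just a placeholder):

"extensions": {
  "durableTask": {
    "hubName": "ordersaganew",
    "storageProvider": {
      "partitionCount": 16
    }
  }
}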

@davidmrdavid
Contributor

@vany0114 sorry for the delay here, we've been under a lot of deadlines (still are) but I carved out some time to tackle this now. I think I understand the problem.

So I looked at the "stuck orchestrator" with instanceID "{"orderId":"27250874282721285","storeNumber":"0292","brandId":"ca4c7477-4000-e811-80dd-0e1f910c8464"}"

this tried to take a lock on this entity "@order@{"orderId":"27250874282721285","storeNumber":"0292","brandId":"ca4c7477-4000-e811-80dd-0e1f910c8464"}'" (+ 4 others)

at 2024-06-03 23:22:03.7759128 UTC and did not receive the lock for all entities until 2024-06-03 23:32:04.5303100 UTC, so 10 minutes, as you said.

From the logs, I can actually see that the lock over all entities was obtained at "2024-06-03 23:27:04.4272879", but the message to notify the orchestrator of this successful locking took ~5 minutes to be dequeued. So the real lock time took about ~5 minutes, but it took another ~5 to deliver the message.

I think the most critical thing to debug is why it took ~5 min to deliver the message. From what I see in my telemetry, it appears your queues, at times, get pretty backed up to the point of creating this "5 minute wait" (sometimes it's much longer) that newly inserted messages are paying before getting processed.

This is all I have so far. I have some rough ideas for how to optimize this, but I'll need to think it through. I'll post another update tomorrow with some ideas, as well as instructions on how you can view your queue backlog yourself.

@davidmrdavid
Contributor

davidmrdavid commented Jul 9, 2024

Hey @vany0114 - are you using the Durable Functions v3.0.0-rc extension by any chance? My logs indicate that this is the case. Please don't change anything in your app just yet / attempt to downgrade if you are (it is not safe to downgrade from v3 to v2 due to breaking changes!) - but if so then you'd be on a preview package that isn't production ready and may be experiencing some known bugs (none that would explain the queue backlog by themselves, but they wouldn't help nonetheless). Please let me know if this is the case and we can figure out a strategy to bring you back to a production-ready package.

@vany0114
Author

vany0114 commented Jul 9, 2024

Hey @vany0114 - are you using the Durable Functions v3.0.0-rc extension by any chance? My logs indicate that this is the case. Please don't change anything in your app just yet / attempt to downgrade if you are (it is not safe to downgrade from v3 to v2 due to breaking changes!) - but if so then you'd be on a preview package that isn't production ready and may be experiencing some known bugs (none that would explain the queue backlog by themselves, but they wouldn't help nonetheless). Please let me know if this is the case and we can figure out a strategy to bring you back to a production-ready package.

Hey @davidmrdavid that's correct, we're using 3.0.0-rc.1

@vany0114
Author

vany0114 commented Jul 9, 2024

@vany0114 sorry for the delay here, we've been under a lot of deadlines (still are) but I carved out some time to tackle this now. I think I understand the problem.

So I looked at the "stuck orchestrator" with instanceID "{"orderId":"27250874282721285","storeNumber":"0292","brandId":"ca4c7477-4000-e811-80dd-0e1f910c8464"}"

this tried to take a lock on this entity "@order@{"orderId":"27250874282721285","storeNumber":"0292","brandId":"ca4c7477-4000-e811-80dd-0e1f910c8464"}'" (+ 4 others)

at 2024-06-03 23:22:03.7759128 UTC and did not receive the lock for all entities until 2024-06-03 23:32:04.5303100 UTC, so 10 minutes, as you said.

From the logs, I can actually see that the lock over all entities was obtained at "2024-06-03 23:27:04.4272879", but the message to notify the orchestrator of this successful locking took ~5 minutes to be dequeued. So the real lock time took about ~5 minutes, but it took another ~5 to deliver the message.

I think the most critical thing to debug is why it took ~5 min to deliver the message. From what I see in my telemetry, it appears your queues, at times, get pretty backed up to the point of creating this "5 minute wait" (sometimes it's much longer) that newly inserted messages are paying before getting processed.

This is all I have so far. I have some rough ideas for how to optimize this, but I'll need to think it through. I'll post another update tomorrow with some ideas, as well as instructions on how you can view your queue backlog yourself.

@davidmrdavid FWIW we're experiencing that high latency for around ~5% of requests; below you can see the p95, p50, and average processing times.

image

The weird thing is that when it happens, the delay is almost always consistently 5 minutes. I thought that maybe there was some sort of threshold that you guys have internally, but it may just be a coincidence, as sometimes it takes longer.

@davidmrdavid
Contributor

@vany0114:

Hey @davidmrdavid that's correct, we're using 3.0.0-rc.1

Hmm, yeah I think this may be part of the problem. Like I mentioned, this is a preview package (as per the suffix) with breaking changes. Please do not downgrade to a 2.x package at this time; that will cause other issues, since right now you have messages using an encoding that will not work with the 2.x package.

My first recommendation is: can you please upgrade to the -rc.2 package? That will give me some confidence you're working with the latest iteration of DF V3, which should be better.

I'm seeing some errors in the scale controller (the component that scales your app to keep up with traffic) that may be explained by the use of this -rc package as well: since the package is not GA, the scale controller is not fully compatible with it.

I think we can bypass the scale controller by enabling "runtime scale monitoring", which essentially makes the app itself make scaling decisions and send them to the scale controller, instead of having the scale controller do that by itself. You can turn that on as described here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-azure-sql-trigger?tabs=isolated-process%2Cpython-v2%2Cportal&pivots=programming-language-csharp#enable-runtime-driven-scaling .

Using your staging/dev environment, where I assume you also have the 3.x package, could you try this? Then please let me know; I'll monitor that the errors disappear, and from there I'll give you the green light to make the change in prod as well. I think that'll help with throughput (as scaling should improve), though it may not be our final fix.

@vany0114
Author

vany0114 commented Jul 9, 2024

@davidmrdavid

My first recommendation is: can you please upgrade to the -rc.2 package? That will give me some confidence you're working with the latest iteration of DF V3, which should be better.

I just did it in our dev environment.

I think we can bypass the scale controller by enabling "runtime scale monitoring", which essentially makes the app itself make scaling decisions and send them to the scale controller, instead of having the scale controller do that by itself. You can turn that on as described here: https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-azure-sql-trigger?tabs=isolated-process%2Cpython-v2%2Cportal&pivots=programming-language-csharp#enable-runtime-driven-scaling .

I'm afraid we cannot do so since the func app is running under a consumption plan; my understanding is that "runtime scale monitoring" is only available in a premium tier.

@davidmrdavid
Contributor

I'm afraid we cannot do so since the func app is running under a consumption plan; my understanding is that "runtime scale monitoring" is only available in a premium tier.

You may be right, now that I think about it. Let me see what I can do in the backend.

Can you please send me your dev and prod app names again? It's been a bit.

@vany0114
Author

I'm afraid we cannot do so since the func app is running under a consumption plan; my understanding is that "runtime scale monitoring" is only available in a premium tier.

You may be right, now that I think about it. Let me see what I can do in the backend.

Can you please send me your dev and prod app names again? It's been a bit.

@davidmrdavid 👇
func-orders-saga
func-orders-saga-dev

@davidmrdavid
Contributor

Hi @vany0114 - sorry to keep you waiting. Ok so the situation in the backend is a bit complicated: the scaling component is in the midst of a large upgrade that we can't delay and so we're not going to have an easy time providing you with a backend-side way to make this preview DF extension work with it.

Ultimately, I think the issues in the scale controller are a big contributor to the delays you're seeing.
As mentioned previously - downgrading is not safe today. Naturally, it's in our plans to make DF extension 2.x "forwards compatible" with 3.x before 3.x becomes GA, and that'd allow you to downgrade to the GA package, but that's not immediately available.

We have 2 options:
(1) If you can wait another week or so - I can provide you with a DF v2.x release that you can downgrade to, which the scale controller should be able to interoperate with.
(2) If you cannot wait - and a "taskhub restart" like the ones we did with Netherite are 'ok', then you can downgrade to the 2.x package locally, change the taskhub locally, and then deploy that payload. Again - the 2.x package works well with the scaler, so this would allow you to reset your app to a compatible task hub.

Please let me know if you prefer (1) or (2) and we'll proceed accordingly. After the fact, I'll review your app's performance again to see what surfaces after that.

@vany0114
Author

@davidmrdavid I could reset the app again. This is how the configuration in host.json looks (do I need to change something there?):

"extensions": {
  "durableTask": {
    "hubName": "ordersagav4",
    "useGracefulShutdown": true,
    "storageProvider": {
      "partitionCount": 16,
      "useLegacyPartitionManagement": false,
      "useTablePartitionManagement": true,
      "connectionStringName": "SAGA_DURABLETASK_STORAGE_CONNECTIONSTRING"
    },
    "tracing": {
      "DistributedTracingEnabled": true,
      "Version": "V2"
    }
  }
},

Also, can you please explain what you mean by "change the taskhub locally, and then deploy that payload"?

Note:
FWIW the reason we used the v3.x package was that back in the day we were using the v2.x version and we experienced these huge delays as well; that's why we opened this issue originally (more context here). So I thought that the latest version would probably behave better.

@vany0114
Author

@davidmrdavid can you please help us review what happened yesterday ASAP? The orchestrations experienced huge delays, between ~5 and ~15 hours 😨... The only thing that has changed recently is that we upgraded from -rc.1 to -rc.2, as you initially suggested.

image

@davidmrdavid
Contributor

davidmrdavid commented Jul 22, 2024

@vany0114 - I'm taking a look.

In the meantime - let's get you mitigated by "resetting your app". All you need to do is two things:
(1) Downgrade to the 2.x track of the package
(2) Change your hubName property in the host.json you shared.

Also, can you please explain what you mean by "change the taskhub locally, and then deploy that payload"?

I meant to change your host.json, sorry for making it confusing.
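Concretely (a sketch, using a placeholder hub name): downgrade the Durable Functions extension package to the 2.x track, and in the host.json you shared change only the hubName so the app starts against a brand-new, empty task hub, keeping the rest of the durableTask settings as they are:

"extensions": {
  "durableTask": {
    "hubName": "ordersagav5"
  }
}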

@davidmrdavid
Contributor

@vany0114: did you reset your app?

@vany0114
Author

vany0114 commented Jul 23, 2024

@vany0114: did you reset your app?

Not yet in production, just in our dev environment. I'm waiting for our next release, which will probably take place tomorrow; I'll keep you posted.

The reason I haven't done so in production yet is that it seems the func app auto-healed on Sunday, after it experienced that huge latency all Saturday long.

Have you found something so far?

@vany0114
Author

@davidmrdavid I did reset it.

@davidmrdavid
Contributor

Thanks @vany0114.
It was a bit difficult to analyze the app before the reset, as the overwhelming signal was the lack of proper scaling, which does mess up the health signals I usually look for. Now that it's been reset, I will both try to identify what happened last week and keep an eye on the health of the current version of the app.

I've also instructed the team to reach out to you internally to see if we can work more directly to optimize your Azure Storage app. I think we wanted to do this with Netherite, but that didn't happen once you switched back to Azure Storage. I think it may be worth pursuing a more direct engagement nonetheless, irrespective of backend, to make sure you have the right debugging tools as well as the right settings for performance. In any case, I'll investigate on both fronts; I'll send you an analysis tomorrow, and we'll be reaching out directly via email as well.

@davidmrdavid
Contributor

davidmrdavid commented Jul 26, 2024

@vany0114:
I've been taking a look at your app's health over the last 12 hours. What immediately jumps out at me is that the majority of your "slow" requests (requests / queue messages that take a while to start processing) are trying to reach entities. Those messages are up to 5 minutes old when they start processing, whereas all other messages seem to run rather quickly. What that tells me is that your entity processing speed is the main thing we should be focusing on.

What's odd to me is that this "5 minute" ceiling seems very consistent. Just to confirm - are you by any chance sending scheduled entity signals (requests meant to fire at a particular time in an entity)? That could explain some of this consistency in the delay.

In any case - I'm building an internal list of observations and will follow up with some suggestions.

@vany0114
Author

Hi @davidmrdavid , thanks for the update

are you by any chance sending scheduled entity signals (requests meant to fire at a particular time in an entity)?

Yes, we're doing so sometimes (under certain conditions) to perform a check in the next 5 minutes.
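Roughly, that scheduled check is a delayed entity signal along these lines (a sketch only; the function, entity, and operation names are placeholders, not our real ones):

using System;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.DurableTask;
using Microsoft.Azure.WebJobs.Extensions.Http;

public static class ScheduleCheckSketch
{
    [FunctionName("ScheduleCheckSketch")]
    public static Task Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequest req,
        [DurableClient] IDurableEntityClient client)
    {
        var entity = new EntityId("scheduledkey", "some-key");
        // The signal is enqueued now but only becomes visible ~5 minutes later,
        // so it shows up in telemetry as a roughly 5-minute-old message.
        return client.SignalEntityAsync(entity, DateTime.UtcNow.AddMinutes(5), "Check", null);
    }
}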

Regarding the version downgrade, so far it has been working way better; the kinda-consistent 5-minute delay went away.
image

@davidmrdavid
Contributor

Hi @vany0114. I'm happy to hear the perf is looking better. This makes sense to me: the scaling logic simply wasn't working before (it is not compatible with the DF v3 package), so things should be better with the GA package :-).

Regarding the scheduled entity signals: thanks for clarifying that you do use them. That explains the "delayed" messages I was seeing (they're delayed because they're scheduled to fire in ~5 minutes). @vany0114 - you mentioned you upgraded to the DF v3 package because of perf issues. Since you downgraded back to DF v2, have you encountered any of these perf issues again? I could really use a recent reproduction with DF v2.

@vany0114
Author

@davidmrdavid So far I haven't seen any perf issues, fortunately. We'll see how long this performance lasts; what I've seen is that as time goes by, the response time starts degrading.

@davidmrdavid
Contributor

@vany0114: and usually the perf issues you see are related to entity locking, right? In any case, as soon as you get another repro, please let me know. I think the perf issues you had with DF v3 are easily attributed to the subpar scaling you were experiencing, as well as downstream effects of that.

In the meantime, I'll look to re-review your report here (#2534 (comment)), assuming that the telemetry is still available to me.

@vany0114
Author

@vany0114: and usually the perf issues you see are related to entity locking, right?

@davidmrdavid Most of the time, yes. The other scenario I've seen is that you kick off an orchestrator and it might take minutes to start.

@davidmrdavid
Contributor

Just a heads up - this is still on my radar; I owe you a package with a small bug fix that I think could have added a few seconds to your last "entity lock took too long" report. I just have not had cycles yet to get to it. ETA: mid-next week.

@davidmrdavid
Contributor

davidmrdavid commented Aug 13, 2024

Hey @vany0114, I started working on fixing that performance bug you experienced on 6/3 with orchestrator ID {"orderId":"27250874282721285","storeNumber":"0292","brandId":"ca4c7477-4000-e811-80dd-0e1f910c8464"}'; and noticed that you were already using Durable Functions extension v3 at that time. Is that right?

As I mentioned during our call, I had just fixed the suspicious exception you encountered in the v3 package before we met, and I was looking into backporting it to v2 just now, but it appears to me the exception is not possible in the v2 package. From inspecting your logs more closely, I noticed they call out that you were using DF v3 at the time, which explains the bug in the first place.

Please let me know if this doesn't match your records. If this is the case, then there's no hotfix package I can provide (as there's no bug to hotfix for the GA package) and I'll be waiting for another report of slowdowns. Please let me know if this checks out on your end, thanks!

@vany0114
Author

vany0114 commented Aug 13, 2024

Hey @vany0114, I started working on fixing that performance bug you experienced on 6/3 with orchestrator ID {"orderId":"27250874282721285","storeNumber":"0292","brandId":"ca4c7477-4000-e811-80dd-0e1f910c8464"}'; and noticed that you were already using Durable Functions extension v3 at that time. Is that right?

As I mentioned during our call, I had just fixed the suspicious exception you encountered in the v3 package before we met, and I was looking into backporting it to v2 just now, but it appears to me the exception is not possible in the v2 package. From inspecting your logs more closely, I noticed they call out that you were using DF v3 at the time, which explains the bug in the first place.

Please let me know if this doesn't match your records. If this is the case, then there's no hotfix package I can provide (as there's no bug to hotfix for the GA package) and I'll be waiting for another report of slowdowns. Please let me know if this checks out on your end, thanks!

Hi @davidmrdavid! That's correct, the hours-long delay issue we experienced was with the v3 package. Since we downgraded to v2 it has been working well 🤞 but I'm glad you fixed it in v3.

@davidmrdavid
Contributor

@vany0114 - I assume things are still working well? Would it make sense to resolve this thread and open a new one in case a new issue occurs?

@vany0114
Author

vany0114 commented Sep 9, 2024

Closing since things have been working fine with the v2 package.

@vany0114 vany0114 closed this as completed Sep 9, 2024