Update function state change traces to improve orchestration monitoring query #2302

Merged: 13 commits, Oct 10, 2023

bachuv (Collaborator) commented Nov 1, 2022

This PR adds a new RuntimeStatus field to track function state changes. This helps us improve our monitoring query in the portal (Functions -> OrchestratorFunction -> Monitor tab -> App Insights query).

Issues that are fixed in this PR:

  • The portal shows the incorrect status for Terminated orchestrations
  • We don't support showing the Suspended runtime state

New query to list the status of orchestration instances:

traces
| where customDimensions.Category == "Host.Triggers.DurableTask"
| extend functionName = tostring(customDimensions["prop__functionName"])
| extend instanceId = tostring(customDimensions["prop__instanceId"])
| extend state = tostring(customDimensions["prop__state"])
| extend isReplay = tobool(tolower(customDimensions["prop__isReplay"]))
| extend hubName = tostring(tolower(customDimensions["prop__hubName"]))
| extend runtimeStatus = tostring((customDimensions["prop__runtimeStatus"]))
| where isReplay != true
| where functionName == "<functionName>"
| where state != "Awaited"
| where runtimeStatus != ""
| summarize arg_max(timestamp, *) by instanceId
| order by timestamp asc
| project timestamp, instanceId, runtimeStatus
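
To make the query's logic easier to follow, here is a rough Python sketch of what its `where` filters and the `summarize arg_max(timestamp, *) by instanceId` step do: drop replayed, `Awaited`, and status-less traces, then keep only the newest remaining trace per instance. The helper name `latest_runtime_status`, the function name `MyOrchestrator`, and all sample rows (including which `state` appears next to which `runtimeStatus`) are hypothetical and invented for illustration; this is not real telemetry and only approximates KQL semantics.

```python
from datetime import datetime


def latest_runtime_status(rows, function_name):
    """Approximate the query: filter traces, keep the newest row per instanceId."""
    latest = {}
    for row in rows:
        # Mirrors the `where` clauses of the query.
        if row["isReplay"] or row["functionName"] != function_name:
            continue
        if row["state"] == "Awaited" or row["runtimeStatus"] == "":
            continue
        # Mirrors `summarize arg_max(timestamp, *) by instanceId`.
        current = latest.get(row["instanceId"])
        if current is None or row["timestamp"] > current["timestamp"]:
            latest[row["instanceId"]] = row
    # Mirrors `order by timestamp asc` followed by `project ...`.
    ordered = sorted(latest.values(), key=lambda r: r["timestamp"])
    return [(r["timestamp"], r["instanceId"], r["runtimeStatus"]) for r in ordered]


def trace(second, instance_id, state, runtime_status, is_replay=False):
    """Build one hypothetical trace row (values are made up for this sketch)."""
    return {
        "timestamp": datetime(2023, 8, 7, 10, 0, second),
        "functionName": "MyOrchestrator",
        "instanceId": instance_id,
        "state": state,
        "runtimeStatus": runtime_status,
        "isReplay": is_replay,
    }


SAMPLE_ROWS = [
    trace(0, "abc", "Scheduled", "Pending"),
    trace(1, "abc", "Started", "Running"),
    trace(2, "def", "Started", "Running"),
    trace(3, "def", "Awaited", "Running"),                 # dropped: Awaited
    trace(4, "def", "Completed", "Terminated"),
    trace(5, "abc", "Started", "Running", is_replay=True),  # dropped: replay
    trace(9, "abc", "Completed", "Completed"),
    trace(10, "abc", "Completed", ""),                     # dropped: no status
]

if __name__ == "__main__":
    for timestamp, instance_id, status in latest_runtime_status(SAMPLE_ROWS, "MyOrchestrator"):
        print(timestamp.isoformat(), instance_id, status)
```

With these made-up rows, each instance contributes exactly one line, showing the last runtime status recorded for it outside of replay.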


Pull request checklist

  • My changes do not require documentation changes
    • Otherwise: Documentation PR is ready to merge and referenced in pending_docs.md
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • I have added all required tests (Unit tests, E2E tests)
  • My changes do not require any extra work to be leveraged by OutOfProc SDKs
    • Otherwise: That work is being tracked here: #issue_or_pr_in_each_sdk
  • My changes do not change the version of the WebJobs.Extensions.DurableTask package
    • Otherwise: major or minor version updates are reflected in /src/Worker.Extensions.DurableTask/AssemblyInfo.cs

@bachuv bachuv self-assigned this Nov 1, 2022
@bachuv bachuv marked this pull request as ready for review November 16, 2022 21:09
Review thread on test/Common/TestHelpers.cs (outdated, resolved)
davidmrdavid (Contributor) left a comment

Thanks for doing this work! I added a request for a manual test; please see my comment below.

bachuv (Collaborator, Author) commented Jun 28, 2023

As an update, I removed the functionState property to make it simpler and updated the query in the description based on that change. I'm now looking into removing the additional "orchestrator state changed" log.

bachuv (Collaborator, Author) commented Aug 7, 2023

I looked into removing the additional state trace log and instead updating the existing Activity.Tags with the correct runtime status, so that the portal query used to populate the orchestration state wouldn't need to change. I realized that doesn't work for client APIs like Terminate and Suspend, though. For those APIs, I tried updating the Tags in the relevant DurableClient methods, but I was seeing different values for Activity.Current, which meant the runtime status value wouldn't get updated correctly. Let me know if there's another way of approaching this that I haven't considered.

I'm looking into updating the existing state change trace with a new State field now.

@bachuv bachuv requested a review from davidmrdavid August 7, 2023 16:31
bachuv (Collaborator, Author) commented Aug 7, 2023

I was able to remove the additional state trace log by adding a RuntimeStatus field to the existing state trace log instead. I also updated the runtime status query.

I haven't updated all of the E2E tests since it's a tedious process, but I will update them once I get some reviews on this new approach of adding a runtime status field.

traces
| where customDimensions.Category == "Host.Triggers.DurableTask"
| extend functionName = tostring(customDimensions["prop__functionName"])
| extend instanceId = tostring(customDimensions["prop__instanceId"])
| extend state = tostring(customDimensions["prop__state"])
| extend isReplay = tobool(tolower(customDimensions["prop__isReplay"]))
| extend hubName = tostring(tolower(customDimensions["prop__hubName"]))
| extend runtimeStatus = tostring((customDimensions["prop__runtimeStatus"]))
| where isReplay != true
| where state != "Awaited"
| where runtimeStatus != ""
| summarize arg_max(timestamp, *) by instanceId
| order by timestamp asc
| project timestamp, instanceId, runtimeStatus

davidmrdavid (Contributor) left a comment

Left 2 small questions

Review thread on test/Common/TestHelpers.cs (outdated, resolved)
@bachuv bachuv changed the title Adding function state change traces to improve orchestration monitoring query Update function state change traces to improve orchestration monitoring query Aug 8, 2023
@bachuv bachuv requested a review from nytian August 9, 2023 23:00
@davidmrdavid davidmrdavid self-requested a review August 22, 2023 16:20
davidmrdavid (Contributor) left a comment

With the exception of the ResumingOrchestration log, the code changes look good to me! It is certainly confusing that we log both a State and RuntimeStatus field. We'll need to tackle that eventually. Other than that, the overall change strategy makes sense to me. I think you should be good to update the test code

Review thread on src/WebJobs.Extensions.DurableTask/EndToEndTraceHelper.cs (outdated, resolved)
@bachuv bachuv requested a review from cgillum August 23, 2023 18:38
bachuv (Collaborator, Author) commented Oct 10, 2023

@davidmrdavid, I know the Netherite smoke test is expected to fail right now, but is the Python smoke test expected to fail?

davidmrdavid (Contributor) commented

@bachuv: it isn't, but the error suggests the problem was a failure in downloading the docker image. I'm re-running the test now to see if it's transient.

davidmrdavid (Contributor) left a comment

LGTM so long as all smoke tests (except the Netherite one) pass

cgillum (Member) left a comment

LGTM!

@bachuv bachuv merged commit 08d2180 into dev Oct 10, 2023
18 of 19 checks passed
@bachuv bachuv deleted the vabachu/addingfunctionstatetraces branch October 10, 2023 19:17
nytian pushed a commit that referenced this pull request Oct 26, 2023
…ng query (#2302)

This PR adds a new RuntimeStatus field to track function state changes. This helps us improve our monitoring query in the portal (Functions -> OrchestratorFunction -> Monitor tab -> App Insights query).