
node: fix task caller lifecycle in Triggers schedule #2605

Merged
merged 3 commits into master from fix/triggers_stage_invalid_task_lifecycle on Dec 21, 2024

Conversation

@canepat (Member) commented on Dec 16, 2024

Fixes #2604

The problem was hidden behind the lifecycle of the trigger task caller: its coroutine frame could be destroyed before being spawned.
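
For context, a hedged sketch of the shape that triggers this; names such as TriggerCallback and the callback's call signature are assumptions, simplified from the actual Triggers stage code:

```cpp
// Hypothetical, simplified sketch of the problematic pattern (not the actual silkworm code).
Task<void> Triggers::schedule(TriggerCallback callback) {  // TriggerCallback: assumed callable type
    // task_caller is a local lambda; its captures (`this`, `callback`) live in the closure object.
    auto task_caller = [this, callback = std::move(callback)]() -> Task<void> {
        co_await callback(*current_tx_);  // assumed call shape
    };
    // task_caller() creates a coroutine whose frame only *references* the closure.
    // With a plain `return`, schedule() finishes immediately and destroys task_caller,
    // so the spawned coroutine may later read dangling captures.
    return concurrency::spawn_task(ioc_, task_caller());
}
```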

@canepat added the maintenance label (Some maintenance work: fix, refactor, rename, test...) on Dec 16, 2024
@canepat requested a review from battlmonstr on December 16, 2024 00:43
@canepat marked this pull request as ready for review on December 16, 2024 00:43
- return concurrency::spawn_task(ioc_, task_caller());
+ co_return co_await concurrency::spawn_task(ioc_, task_caller());
Contributor commented:

This is an interesting side effect. I wonder why this change makes any difference?

The call to schedule is already awaited at the call site.

An additional await here delays the destruction of some locals, right? Which ones? task_caller is not relevant after calling it. trigger is moved into the task, so it is not destroyed unless the spawned task is destroyed. The spawned task is moved inside co_spawn. I hope that co_spawn does not destroy the passed task prematurely, but maybe it does? Is it possible to verify this with a unit test?
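
(For reference, the local that likely matters here is the task_caller closure itself: a lambda coroutine copies its parameters into the coroutine frame, but the captures stay in the closure object, which the frame only references. A standalone sketch of that rule, independent of asio and of this codebase:)

```cpp
#include <coroutine>
#include <cstdio>
#include <memory>
#include <utility>

// Minimal lazy task type: the coroutine suspends immediately and runs only when resumed.
struct LazyTask {
    struct promise_type {
        LazyTask get_return_object() {
            return LazyTask{std::coroutine_handle<promise_type>::from_promise(*this)};
        }
        std::suspend_always initial_suspend() { return {}; }
        std::suspend_always final_suspend() noexcept { return {}; }
        void return_void() {}
        void unhandled_exception() {}
    };

    explicit LazyTask(std::coroutine_handle<promise_type> h) : handle{h} {}
    LazyTask(LazyTask&& other) noexcept : handle{std::exchange(other.handle, nullptr)} {}
    LazyTask(const LazyTask&) = delete;
    ~LazyTask() {
        if (handle) handle.destroy();
    }

    std::coroutine_handle<promise_type> handle;
};

LazyTask make_task() {
    auto value = std::make_shared<int>(42);
    // `value` is stored in the closure; the coroutine frame keeps only a reference to the closure.
    auto lambda = [value]() -> LazyTask { std::printf("%d\n", *value); co_return; };
    return lambda();  // the closure `lambda` is destroyed when make_task returns
}

int main() {
    LazyTask task = make_task();
    task.handle.resume();  // undefined behavior: the closure (and its capture) no longer exists
}
```

That would explain why the extra co_await helps: it keeps schedule()'s own frame, and with it the task_caller closure, alive until the spawned coroutine has completed.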

If it really does, maybe we should modify spawn_task to await co_spawn inside spawn_task itself, so that this mistake is not possible in other places (we use spawn_task without awaiting in a bunch of places) or in future code.
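
(A sketch of what that variant could look like, assuming Task<T> is roughly boost::asio::awaitable<T> and that spawn_task forwards to boost::asio::co_spawn; the real silkworm types and the eager/lazy start semantics of spawn_task may differ, so treat this only as an illustration of the idea:)

```cpp
#include <boost/asio/awaitable.hpp>
#include <boost/asio/co_spawn.hpp>
#include <boost/asio/io_context.hpp>
#include <boost/asio/use_awaitable.hpp>
#include <utility>

template <typename T>
using Task = boost::asio::awaitable<T>;  // assumption: may not match silkworm's actual Task alias

// Hypothetical "awaiting" spawn_task: the passed task is moved into co_spawn and its
// completion is co_awaited here, so the wrapper's frame (and anything feeding the task)
// stays alive until the spawned work has actually finished.
template <typename T>
Task<T> spawn_task_awaiting(boost::asio::io_context& ioc, Task<T> task) {
    co_return co_await boost::asio::co_spawn(ioc, std::move(task), boost::asio::use_awaitable);
}
```

Note that this only removes the foot-gun if callers still co_await the returned task; fire-and-forget call sites would need a different completion token (e.g. boost::asio::detached).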

I'd like to fill the gap in my understanding here. I know this is not a high priority, so maybe let's create an issue about researching and generalizing this and leave it as an exercise for a future reader?

Contributor commented:

This could potentially be the reason for some sentry discovery crashes, because SerialNodeDb follows the same pattern.

Contributor commented:

Let's try to move the callback from the capture list to the lambda parameter list and see if it works without co_await (see the links below).
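
(A hedged sketch of the suggested change inside Triggers::schedule, with assumed names and call shape; the key point is that a coroutine's parameters are copied into its frame, while captures only live in the closure object:)

```cpp
// Before: `callback` is a capture, so it lives in the closure, a local of schedule().
auto task_caller = [this, callback = std::move(callback)]() -> Task<void> {
    co_await callback(*current_tx_);  // assumed call shape
};
return concurrency::spawn_task(ioc_, task_caller());

// After: `callback` is a parameter, so it is copied into the coroutine frame at the call.
// Note that `this` is still captured here, so it can still dangle with the closure.
auto task_caller = [this](auto callback) -> Task<void> {
    co_await callback(*current_tx_);
};
return concurrency::spawn_task(ioc_, task_caller(std::move(callback)));
```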

@canepat (Member, Author) commented:

Yep, we must move both this and the callback, because the whole capture list is affected (indeed, the issue was happening on this->current_tx_).
Moreover, it seems to be systematically reproducible in unit tests, so I'm going to add a test and fix the problem by moving from the capture list to the lambda parameters.
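
(A sketch of the resulting shape, with nothing left in the capture list; names and the call shape are assumed:)

```cpp
// Everything the coroutine touches is passed as a parameter, so it is copied into the
// coroutine frame and cannot dangle when the closure or schedule()'s locals go away.
// (The Triggers object itself must still outlive the spawned task.)
auto task_caller = [](Triggers* self, auto callback) -> Task<void> {
    co_await callback(*self->current_tx_);  // assumed call shape
};
return concurrency::spawn_task(ioc_, task_caller(this, std::move(callback)));
```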

@canepat merged commit 2d84274 into master on Dec 21, 2024
5 checks passed
@canepat deleted the fix/triggers_stage_invalid_task_lifecycle branch on December 21, 2024 10:10
Labels
maintenance: Some maintenance work (fix, refactor, rename, test...)
Development

Successfully merging this pull request may close these issues.

silkworm: assertion failed in Triggers stage
2 participants