
Fix tokio task leak #151

Merged: 7 commits merged into pyth-network:main from fixTokioTaskLeak on Mar 14, 2025

Conversation

@rodoufu (Contributor) commented Feb 27, 2025

While running agent v2.12.1, it crashed after some time with an out-of-memory error and was killed by the OS. The machine I was using has 48 GB of RAM.

This was happening on both mainnet and testnet.

I did not see anything particularly special about the resources it was using:
[Screenshot: tokio-console resources view]

But the number of tasks seems large.
[Screenshot: tokio-console tasks view]

One can see 7,721 tasks, many of which have been idle for some time.
Observing it over time, the number of tasks kept increasing until the binary was killed.
The idle tasks are created at https://github.com/pyth-network/pyth-agent/blob/main/src/agent/services/oracle.rs#L132, as can be seen in the image, where the subscriber is handling handle_price_account_update.
That line creates tokio tasks without keeping track of the returned JoinHandle; when tasks are created faster than they complete, this leads to a leak.
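For illustration, here is a minimal, self-contained sketch of that pattern (the names are placeholders, not the actual agent code):

```rust
use std::time::Duration;

struct PriceAccountUpdate;

async fn handle_price_account_update(_update: PriceAccountUpdate) {
    // Stand-in for the real handler's slow downstream work.
    tokio::time::sleep(Duration::from_secs(5)).await;
}

fn on_update(update: PriceAccountUpdate) {
    // The JoinHandle returned by tokio::spawn is dropped immediately.
    // If updates arrive faster than the handler completes, spawned
    // tasks accumulate without bound and memory grows with them.
    tokio::spawn(handle_price_account_update(update));
}
```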

After the proposed change, I can see a much more comfortable number of tasks:
[Screenshot: tokio-console tasks view after the fix]
The number of tasks is now stable at around 100, being 114 in the attached image.
It is worth mentioning that I used 100 worker tasks to wait for the previously leaked ones to finish, so this number can be much smaller with a different configuration; see the sketch below.
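Roughly, the idea is a fixed pool of worker tasks draining a queue, instead of one new task per update. A minimal sketch of that shape, with placeholder names and not the exact PR code:

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Mutex};

struct PriceAccountUpdate;

async fn handle_price_account_update(_update: PriceAccountUpdate) {
    // ... actual handling ...
}

const NUM_WORKERS: usize = 100;

/// Spawns a fixed number of workers sharing one receiver, so the task
/// count is capped at NUM_WORKERS regardless of the update rate.
fn spawn_workers() -> mpsc::Sender<PriceAccountUpdate> {
    let (tx, rx) = mpsc::channel::<PriceAccountUpdate>(1_000);
    let rx = Arc::new(Mutex::new(rx));
    for _ in 0..NUM_WORKERS {
        let rx = rx.clone();
        tokio::spawn(async move {
            loop {
                // The lock guard is a temporary dropped at the end of
                // this statement, so the channel is released before
                // the (slow) handling below.
                let next = rx.lock().await.recv().await;
                match next {
                    Some(update) => handle_price_account_update(update).await,
                    None => break, // channel closed: all senders dropped
                }
            }
        });
    }
    tx
}
```

Producers then enqueue with tx.send(update).await, which also gives backpressure when the bounded queue fills up.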

To reproduce the tokio-console setup, one can follow the instructions in https://github.com/tokio-rs/console, which are straightforward (a sketch of the code change follows the list):

  • Install tokio-console with cargo install --locked tokio-console
  • Add console-subscriber = "0.3.0" as a dependency
  • Add console_subscriber::init(); as the first line of the main function
  • Run the binary with RUSTFLAGS="--cfg tokio_unstable" cargo run --bin agent -- --config <config file path>
  • Run tokio-console to watch its data
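For the console_subscriber::init() step, the wiring is a one-liner at the top of main (a sketch; the agent's real main has more setup):

```rust
#[tokio::main]
async fn main() {
    // Must run before any tasks are spawned so the console layer can
    // record them; the build needs RUSTFLAGS="--cfg tokio_unstable".
    console_subscriber::init();

    // ... existing agent startup ...
}
```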

@rodoufu (Author) commented Mar 5, 2025

Everything seems to be green this time.

$ pre-commit --version && pre-commit run --all-files && echo $?
pre-commit 4.1.0

Trim Trailing Whitespace.................................................Passed
Fix End of Files.........................................................Passed
Check for added large files..............................................Passed
rustfmt..................................................................Passed
Integration Test Artifact Checksums......................................Passed

0

@rodoufu (Author) commented Mar 5, 2025

I also have another PR open for this project, and I would love to hear your thoughts on it.

cc @aditya520 @ali-bahjati

@rodoufu force-pushed the fixTokioTaskLeak branch from 208d34f to ff9eeba on March 5, 2025 at 14:10
@rodoufu (Author) commented Mar 12, 2025

@Riateche I've removed the synchronization and kept only the change that drops the locks before the .await, which made a great difference for me here.
Could you please take another look?
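For context, this is the general shape of that kind of change (an illustrative sketch with made-up names, not the PR's actual code):

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

async fn slow_io(_data: Vec<u64>) {
    // Stand-in for a network call or other slow operation.
}

async fn process(shared: Arc<Mutex<Vec<u64>>>) {
    // Copy what is needed inside a short critical section...
    let snapshot = {
        let guard = shared.lock().await;
        guard.clone()
    }; // ...and drop the guard here, before the slow await below.

    // Holding the guard across this await would block every other
    // task that needs the lock for the whole duration of the call.
    slow_io(snapshot).await;
}
```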

@ali-bahjati merged commit 05ad3b4 into pyth-network:main on Mar 14, 2025
2 checks passed
@rodoufu deleted the fixTokioTaskLeak branch on March 14, 2025 at 13:17