Don't panic on run_forever exceptions #562

Open
wants to merge 1 commit into
base: main
dulinriley
Contributor

@dulinriley dulinriley commented Jul 16, 2025

If the event loop running in run_forever raises an exception, don't panic the thread with "unwrap". Instead, inspect and log
the error, then exit the thread after stopping and closing the loop. Nothing inspects those errors anyway.
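
The change itself is in the Rust code that drives the loop, and that code is not shown here. As a rough Python-only analogy of the same control flow (log the error, stop and close the loop, let the thread exit), here is a minimal sketch. The names `run_loop_thread`, `start_loop_thread`, the `errors` list, and the logger name are hypothetical and exist only for illustration.

```python
import asyncio
import logging
import threading

# Hypothetical logger name, chosen only for this sketch.
logger = logging.getLogger("asyncio-event-loop")


def run_loop_thread(loop: asyncio.AbstractEventLoop, errors: list) -> None:
    """Thread body: run the loop, and log (rather than re-raise) any failure."""
    try:
        loop.run_forever()
    except BaseException as exc:  # also catches SystemExit and KeyboardInterrupt
        logger.error("event loop stopped by %s: %s", type(exc).__name__, exc)
        errors.append(exc)  # kept only so a caller or test can look at it
    finally:
        loop.stop()   # harmless if the loop has already stopped
        loop.close()  # release the loop's resources before the thread exits


def start_loop_thread(loop: asyncio.AbstractEventLoop, errors: list) -> threading.Thread:
    """Start run_loop_thread on a named daemon thread and return the thread."""
    thread = threading.Thread(
        target=run_loop_thread,
        args=(loop, errors),
        name="asyncio-event-loop",
        daemon=True,
    )
    thread.start()
    return thread
```

In asyncio, ordinary exceptions raised by callbacks are routed to the loop's exception handler, while SystemExit and KeyboardInterrupt propagate out of run_forever, which is why the sketch catches BaseException rather than Exception.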

Also, sys.exit in Python raises SystemExit, a special exception that inherits from BaseException but not from Exception.
In that case we don't want to set the Rust panic flag; we can treat it like a normal Python exception.
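
To make the class hierarchy concrete, here is a small sketch showing that an `except Exception` handler does not see SystemExit, while `except BaseException` does. The helper name `classify` is hypothetical and is not part of this PR.

```python
import sys


def classify(callback):
    """Call `callback` and report which kind of handler catches what it raises."""
    try:
        try:
            callback()
        except Exception as exc:
            # Ordinary errors (ValueError, RuntimeError, ...) land here.
            return ("Exception", exc)
    except BaseException as exc:
        # SystemExit and KeyboardInterrupt skip the handler above and land here.
        return ("BaseException only", exc)
    return ("no error", None)


kind, exc = classify(lambda: sys.exit(2))
print(kind, type(exc).__name__, exc.code)
```

Code that only handles Exception would therefore miss a sys.exit call made inside the loop, which is the case this paragraph describes.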

This was discovered while working on
python/tests/test_actor_error.py::test_actor_mesh_supervision_handling,
but it doesn't fully fix the problems there. Currently the run_forever thread panics while destructing thread-local
storage, and I'm not sure where that is coming from.

Full exception message:

thread 'asyncio-event-loop' panicked at /rustc/50aa04180709189a03dde5fd1c05751b2625ed37/library/std/src/thread/local.rs:281:25:
cannot access a Thread Local Storage value during or after destruction: AccessError
stack backtrace:
   0:        0x12cbd2a68 - <std::sys::backtrace::BacktraceLock::print::DisplayBacktrace as core::fmt::Display>::fmt::h374015554d140b31
   1:        0x12cbf0e34 - core::fmt::write::h5f6f0f1b7b128cd8
   2:        0x12cbcf3b8 - std::io::Write::write_fmt::h242299ee639fd4b9
   3:        0x12cbd291c - std::sys::backtrace::BacktraceLock::print::ha6db2935f346d396
   4:        0x12cbd3fbc - std::panicking::default_hook::{{closure}}::hc9150bcd0bf44b0f
   5:        0x12cbd3e0c - std::panicking::default_hook::ha1fa36f4dd660c20
   6:        0x12cbd4a9c - std::panicking::rust_panic_with_hook::hf730ed456c6c4163
   7:        0x12cbd46c8 - std::panicking::begin_panic_handler::{{closure}}::h4a0c8619b621322e
   8:        0x12cbd2f20 - std::sys::backtrace::__rust_end_short_backtrace::haef08a1a5e1b8627
   9:        0x12cbd4370 - __rustc[f67a3b4e60d8f4c4]::rust_begin_unwind
  10:        0x12cc57df8 - core::panicking::panic_fmt::hafe8c07966b2184a
  11:        0x12cc56798 - std::thread::local::panic_access_error::h6ae8c05a78fefcb7
  12:        0x12cc48f30 - thread_local::thread_id::get_slow::h47ee7ca1576a16d9
  13:        0x12ca098a8 - <tracing_subscriber::registry::sharded::Registry as tracing_core::subscriber::Subscriber>::exit::hea47d35f3aa4c807
  14:        0x12c990cf4 - <tracing_subscriber::layer::layered::Layered<L,S> as tracing_core::subscriber::Subscriber>::exit::h822bc566bdf64486
  15:        0x12c24a078 - std::sys::thread_local::native::eager::destroy::h1ab06624e43f4db7
  16:        0x12cbdaaf8 - std::sys::thread_local::guard::apple::enable::run_dtors::h091d7e1ba5b25935

E0716 14:53:26.191446 22372 hyperactor/src/proc.rs:991] _17nXz8doP7Pz[0].error[5]: actor failure: serving _17nXz8doP7Pz[0].error[5]: processing error: asyncio.exceptions.CancelledError

fatal runtime error: thread local panicked on drop, aborting

The stack makes it seem that "tracing_subscriber" somehow had a registry on this run_forever thread,
and that it failed while running its Drop.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Jul 16, 2025