
create a new scope executor for every scope #7564

Closed · wants to merge 1 commit

Conversation

@hymm (Contributor) commented Feb 8, 2023

Objective

  • Fix the freeze/infinite loop on startup caused by the debug_asset_server feature (see the linked issue at the bottom of this thread).

Solution

  • Create a new scope executor for every usage of scope and remove the reused thread-local one (see the sketch below).
  • This does seem to fix the above issue: running cargo run --example load_gltf --features debug_asset_server deadlocks without this PR, but works with it. I'm unsure of the root cause of the deadlock, though, so this is not a guaranteed fix.
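
A rough sketch of the two shapes, using async_executor directly with illustrative names rather than the real bevy_tasks internals (assumes the async-executor and futures-lite crates):

use async_executor::LocalExecutor;

// Before (simplified): every scope on a thread reused one thread-local
// executor, so nested scopes shared executor state across invocations.
thread_local! {
    static SCOPE_EXECUTOR: LocalExecutor<'static> = LocalExecutor::new();
}

// After (simplified): each scope call constructs its own executor, which is
// dropped when the scope ends, so nested scopes cannot observe each other.
fn scope_with_fresh_executor(work: impl FnOnce(&LocalExecutor<'_>)) {
    let executor = LocalExecutor::new();
    work(&executor);
}

fn main() {
    scope_with_fresh_executor(|ex| {
        let task = ex.spawn(async { 2 + 2 });
        assert_eq!(futures_lite::future::block_on(ex.run(task)), 4);
    });
}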

Changelog

  • Change scope back to create a new scope executor for every scope.

@hymm (Contributor, Author) commented Feb 8, 2023

This test passes on main, but if my mental model of what is causing the deadlock were right, it should deadlock.

// Uses the bevy_ecs test helpers make_function_system, SystemOrder, and MainThreadExecutor.
#[test]
fn can_run_nested_multithreaded_schedules() {
    let mut world = World::default();

    world.init_resource::<MainThreadExecutor>();
    world.init_resource::<SystemOrder>();

    // Inner schedule: a single ordinary system.
    let mut inner_schedule = Schedule::default();
    inner_schedule.set_executor_kind(ExecutorKind::MultiThreaded);
    inner_schedule.add_system(make_function_system(0));

    // Outer schedule: one exclusive (&mut World) system that runs the inner
    // schedule from inside the outer multithreaded executor.
    let mut outer_schedule = Schedule::default();
    outer_schedule.set_executor_kind(ExecutorKind::MultiThreaded);
    outer_schedule.add_system(move |world: &mut World| {
        inner_schedule.run(world);
    });

    outer_schedule.run(&mut world);

    assert_eq!(world.resource::<SystemOrder>().0, vec![0]);
}

@alice-i-cecile alice-i-cecile added the C-Bug (An unexpected or incorrect behavior) and A-Tasks (Tools for parallel and async work) labels Feb 8, 2023
@alice-i-cecile alice-i-cecile added this to the 0.10 milestone Feb 8, 2023
@superdump superdump requested a review from cart February 8, 2023 18:23
@superdump superdump added the P-Regression (Functionality that used to work but no longer does. Add a test for this!) label Feb 8, 2023
@hymm (Contributor, Author) commented Feb 9, 2023

There are definitely some timing issues associated with the bug. I eventually added enough dbg!'s that things slowed down enough to run without hitting the deadlock. I'm not sure if that's what's making my test avoid deadlocking; running with cargo test --release didn't make it deadlock.

@superdump (Contributor) commented:

Are there any downsides to this PR? The debug asset server is very important for rendering development; the ability to iterate on shaders live saves a ton of time, given that bevy_render, bevy_pbr, etc. take a long time to compile.

@superdump (Contributor) commented:

This looks like it removes reuse of a per-thread executor and instead creates a new thread executor every time a scope is used. That sounds like it has the potential for a performance regression, since a new thread executor is created for the duration of each scope.

@hymm (Contributor, Author) commented Feb 22, 2023

I won't have time to look into this issue more for at least a week; I do plan on investigating further when I do have time. But in case no one else figures anything out, I consider this change to be low risk: the change to reuse a thread-local executor instead of creating a new one was made during this release cycle in #7087, so this is basically just reverting that change.

@shuoli84 (Contributor) commented:

The following code can reproduce the deadlock.

use bevy_app::App;
use bevy_ecs::prelude::*;

// Runs the sub-app's full update from inside the outer app's update;
// NonSendMut forces this system onto the main thread.
fn run_sub_app(mut sub_app: NonSendMut<DebugApp>) {
    sub_app.app.update();
}

struct DebugApp {
    app: App,
}

fn main() {
    let mut app = App::new();

    let sub_app = App::new();
    app.insert_non_send_resource(DebugApp { app: sub_app });
    app.add_system(run_sub_app);

    // Deadlocks here: the nested update never returns.
    app.update();
}

@shuoli84 (Contributor) commented:

I think the reason is that run_sub_app is exclusive and the sub-app's apply_system_buffers is also exclusive, so they deadlock.

@shuoli84 (Contributor) commented:

I wondered how the LocalExecutor on the main thread gets ticked, until I saw the code in tick_global_task_pools_on_main_thread. That system runs in CoreSet::Last, so if the main thread is block_on-ing something in CoreSet::Update, and that something spawns a task onto main's local executor, the task needs main's tick in order to advance.
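
To illustrate the ticking requirement (a minimal sketch using the async-executor and futures-lite crates, not Bevy's actual code): a task spawned onto a LocalExecutor makes no progress until something drives that executor.

use async_executor::LocalExecutor;
use futures_lite::future;

fn main() {
    let local = LocalExecutor::new();
    let task = local.spawn(async { 7 });

    // This would hang forever: nothing is ticking `local`.
    // let value = future::block_on(task);

    // This completes: drive the executor while waiting on the task.
    let value = future::block_on(local.run(task));
    assert_eq!(value, 7);
}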

@hymm (Contributor, Author) commented Feb 22, 2023

> I think the reason is that run_sub_app is exclusive and the sub-app's apply_system_buffers is also exclusive, so they deadlock.

There are two executors here, with one running inside the other. The exclusivity is per executor, so running two systems that want exclusive access in different executors is allowed.

> I wondered how the LocalExecutor on the main thread gets ticked, until I saw the code in tick_global_task_pools_on_main_thread. That system runs in CoreSet::Last, so if the main thread is block_on-ing something in CoreSet::Update, and that something spawns a task onto main's local executor, the task needs main's tick in order to advance.

They're also ticked inside the bevy_tasks::scope that the multithreaded executor runs inside. tick_global_task_pools_on_main_thread exists so that the local executors not ticked in a scope still get ticked occasionally.
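
A minimal sketch of that occasional ticking (using the async-executor and futures-lite crates, not Bevy's actual system): try_tick runs at most one queued task and returns immediately, which is the pattern a once-per-frame system can use to keep an otherwise-idle local executor from starving.

use async_executor::LocalExecutor;
use futures_lite::future;

fn main() {
    let local = LocalExecutor::new();
    let task = local.spawn(async { "done" });

    // Run queued tasks one at a time until the queue is empty, the way a
    // per-frame system would tick a local executor.
    while local.try_tick() {}

    // The task has already completed, so this returns immediately.
    assert_eq!(future::block_on(task), "done");
}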

@hymm (Contributor, Author) commented Feb 22, 2023

The test code above is a little out of date; the inner schedule should have some apply_system_buffers in it. My investigation showed that the deadlock was happening during the startup schedule of the debug app, typically during the 2nd or 3rd apply_system_buffers in that schedule. The schedule doesn't otherwise have any systems in it. But even after adding some apply_system_buffers to the test, it doesn't deadlock.

@NiklasEi (Member) commented:

I have run into issues with integration tests of bevy_asset_loader freezing up on latest main. This PR fixes those issues.

@james7132 (Member) left a comment:

Comparing this to the state prior to pipelined rendering's merge, this is indeed just a revert. If this fixes the usage of the debug asset server, we shouldn't ship 0.10 without it.

@shuoli84 (Contributor) commented:

I think I found something interesting. The problem is actually that the async-executor entered a state in which it cannot be woken just by a spawn.

Normally, spawn_exclusive_system_task triggers/notifies the main thread's ThreadExecutor:

spawn thread: ThreadId(4) -> Executor { id: 9, active: 1, global_tasks: 0, local_runners: [], sleepers: 2 } 0 f: bevy_ecs::schedule::executor::multi_threaded::MultiThreadedExecutor::spawn_exclusive_system_task::{{closure}}
executor[9] notify wake 2 Waker { data: 0x600001da4850, vtable: 0x107eb9698 } now: "\"count:2 free_ids:[] wakers:1  <wakers: 1 Waker { data: 0x600001dd2fb0, vtable: 0x107eb9698 }>\\n\""

But when the problem happens, the spawn can't trigger the executor. The "notify no effort" line is a log I added in async_executor; it means the executor believes it is already notified, so it does nothing.

spawn thread: ThreadId(4) -> Executor { id: 9, active: 1, global_tasks: 0, local_runners: [], sleepers: 2 } 0 f: bevy_ecs::schedule::executor::multi_threaded::MultiThreadedExecutor::spawn_exclusive_system_task::{{closure}}
executor[9] notify no effort 

Why? That's something I am still figuring out. It appears the executor's state.sleepers got messed up; the following is its state. We can see the sleepers count is 2, but there is only 1 waker.

  9 ThreadExecutor : active:1 sleepers:"\"count:2 free_ids:[] wakers:1  <wakers: 2 Waker { data: 0x600001da4850, vtable: 0x107eb9698 }>\\n\""

When the ticker/sleeper has just been notified and is running, this is not a problem. But in our case the main thread is already parked, and then it is a problem: the executor believes the ticker is running or already notified, so it won't notify it again, and the main thread stays parked forever.
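
A simplified model of the accounting these logs point at (my own sketch, not async-executor's real code): notify is skipped whenever the sleeper count exceeds the number of stored wakers, because the executor assumes the extra sleeper was already woken.

// Hypothetical model of an executor's sleeper bookkeeping.
struct Sleepers {
    count: usize,              // tickers registered as sleeping
    wakers: Vec<&'static str>, // wakers stored for parked tickers (stand-ins for real Wakers)
}

impl Sleepers {
    // Mirrors the "notify no effort" branch: only wake a ticker if every
    // sleeping ticker still has its waker stored.
    fn notify(&mut self) -> Option<&'static str> {
        if self.count == self.wakers.len() {
            self.wakers.pop()
        } else {
            None // assumes some ticker was already notified and is running
        }
    }
}

fn main() {
    // Healthy state: count == wakers.len(), so notify wakes the main ticker.
    let mut healthy = Sleepers { count: 1, wakers: vec!["main"] };
    assert_eq!(healthy.notify(), Some("main"));

    // Broken state from the logs: count is 2 but only 1 waker is stored, so
    // notify does nothing even though the main thread is parked.
    let mut stuck = Sleepers { count: 2, wakers: vec!["main"] };
    assert_eq!(stuck.notify(), None);
}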

@shuoli84 (Contributor) commented:

These are the other executors' states. Only executor 9's count is wakers.len() + 1.

  1 shared executor : active:0 sleepers:"\"count:1 free_ids:[] wakers:1  <wakers: 1 Waker { data: 0x600001dc0040, vtable: 0x107eb9698 }>\\n\""
  2 shared executor : active:0 sleepers:"\"count:1 free_ids:[] wakers:1  <wakers: 1 Waker { data: 0x600001dc0070, vtable: 0x107eb9698 }>\\n\""
  3 shared executor : active:3 sleepers:"\"count:1 free_ids:[2, 3] wakers:1  <wakers: 1 Waker { data: 0x600001dc8070, vtable: 0x107eb9698 }>\\n\""
  5 local executor : active:0 sleepers:"\"count:1 free_ids:[] wakers:1  <wakers: 1 Waker { data: 0x600001dc0070, vtable: 0x107eb9698 }>\\n\""
  4 local executor : active:0 sleepers:"\"count:1 free_ids:[] wakers:1  <wakers: 1 Waker { data: 0x600001dc0040, vtable: 0x107eb9698 }>\\n\""
  6 local executor : active:0 sleepers:"\"count:1 free_ids:[] wakers:1  <wakers: 1 Waker { data: 0x600001dc8070, vtable: 0x107eb9698 }>\\n\""
  8 ThreadExecutor : active:0 sleepers:"\"count:1 free_ids:[1] wakers:1  <wakers: 2 Waker { data: 0x600001da4f40, vtable: 0x107eb9698 }>\\n\""
  7 ThreadExecutor : active:2 sleepers:"\"count:0 free_ids:[1] wakers:0 \""
  9 ThreadExecutor : active:1 sleepers:"\"count:2 free_ids:[] wakers:1  <wakers: 2 Waker { data: 0x600001da4850, vtable: 0x107eb9698 }>\\n\""
  10 local executor : active:0 sleepers:"\"count:0 free_ids:[] wakers:0 \""
  11 ThreadExecutor : active:0 sleepers:"\"count:0 free_ids:[1, 2] wakers:0 \""

@shuoli84 (Contributor) commented:

I think I found the root cause: there are two ThreadExecutors for the main thread, and when one task captures two tickers, one from each executor, it ends up in the situation I mentioned above.

// The thread-local one (the name argument here comes from my instrumented
// async_executor fork used for the logs above):
thread_local! {
    static LOCAL_EXECUTOR: async_executor::LocalExecutor<'static> =
        async_executor::LocalExecutor::new("local executor");
    static THREAD_EXECUTOR: Arc<ThreadExecutor<'static>> = Arc::new(ThreadExecutor::new());
}

// ...and also the MainThreadExecutor resource:
#[derive(Resource, Default, Clone)]
pub struct MainThreadExecutor(pub Arc<ThreadExecutor<'static>>);

And the troublesome task is created by this code:

external_ticker.tick().or(scope_ticker.tick()).await;
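
For reference, or here is futures-lite's combinator: both ticker futures are polled by the same task with the same waker, which is how a single task ends up registered as a sleeper in two executors at once. A minimal sketch of the combinator itself (toy futures, not Bevy's tickers):

use futures_lite::{future, FutureExt};

fn main() {
    let external = async { "external ticker won" };
    let scope = async { "scope ticker won" };

    // `or` polls `external` first, then `scope`, sharing one waker between
    // them; it completes with whichever future finishes first.
    let winner = future::block_on(external.or(scope));
    assert_eq!(winner, "external ticker won");
}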

@shuoli84 (Contributor) commented:

Just opened #7825; check it out?

@james7132 (Member) commented:

Closing in favor of #7825.

@james7132 james7132 closed this Mar 1, 2023
@hymm hymm deleted the fix-nested-schedules branch October 5, 2023 16:34
Successfully merging this pull request may close these issues:

  • Feature debug_asset_server causes freeze/infinite loop on startup