
add session end tasks support and use for remote cache writes #16952

Merged: 18 commits merged into pantsbuild:main from tail_tasks_api on Sep 29, 2022

Conversation

@tdyas (Contributor) commented Sep 21, 2022

Add the concept of "session end tasks" to Pants sessions so that Rust code can schedule async tasks that must complete before a particular Pants run is considered complete. (Internally, these are called "tail tasks".)

First use case: Schedule remote cache writes as session end tasks.
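
A minimal, self-contained sketch of the idea (the field type and the 5-second timeout mirror the diffs discussed below; the rest is a simplified assumption, not the engine's actual Scheduler/Session code):

use std::{sync::Arc, time::Duration};

use futures::future::{self, BoxFuture, FutureExt};
use parking_lot::Mutex;

// Futures registered during a run that must complete before the run is considered done.
type TailTasks = Arc<Mutex<Vec<BoxFuture<'static, ()>>>>;

#[tokio::main]
async fn main() {
    let tail_tasks: TailTasks = Arc::new(Mutex::new(Vec::new()));

    // During the run: spawn a remote cache write and register it as a session end task.
    let write_fut = tokio::spawn(async {
        // ... write the action result to the remote cache ...
    });
    tail_tasks.lock().push(write_fut.map(|res| res.unwrap()).boxed());

    // At the end of the run: wait (bounded by a timeout) for all registered tasks.
    let tasks: Vec<_> = tail_tasks.lock().drain(..).collect();
    let _ = tokio::time::timeout(Duration::from_secs(5), future::join_all(tasks)).await;
}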

@tdyas added the category:internal label (CI, fixes for not-yet-released features, etc.) on Sep 21, 2022
@tdyas (Contributor, Author) commented Sep 21, 2022

cc @somdoron

@Eric-Arellano (Contributor) left a comment

Thank you! It's neat how little code this took

@@ -331,6 +336,23 @@ impl Scheduler {
})
}

async fn wait_for_tail_tasks(tasks: Vec<BoxFuture<'static, ()>>) {
if !tasks.is_empty() {
Contributor:

You could invert this and early return if is_empty(). That will make the function easier to understand, with less nesting.
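
A sketch of the suggested shape, reusing the names and the 5-second timeout from the diff above (this illustrates the review suggestion, not necessarily the code that landed):

use std::time::Duration;

use futures::future::{self, BoxFuture};
use tokio::time;

async fn wait_for_tail_tasks(tasks: Vec<BoxFuture<'static, ()>>) {
    // Early return instead of wrapping the whole body in `if !tasks.is_empty()`.
    if tasks.is_empty() {
        return;
    }
    log::trace!("waiting for {} tail tasks to complete", tasks.len());
    match time::timeout(Duration::from_secs(5), future::join_all(tasks)).await {
        Ok(_) => log::trace!("tail tasks completed successfully"),
        Err(_) => log::trace!("tail tasks failed to complete within timeout"),
    }
}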

let timeout_fut = time::timeout(Duration::from_secs(5), joined_tail_tasks_fut);
match timeout_fut.await {
Ok(_) => {
log::trace!("tail tasks completed successfully");
Contributor:

Probably debug?

Contributor (Author):

ack

@@ -331,6 +336,23 @@ impl Scheduler {
})
}

async fn wait_for_tail_tasks(tasks: Vec<BoxFuture<'static, ()>>) {
if !tasks.is_empty() {
log::trace!("waiting for {} tail tasks to complete", tasks.len());
Contributor:

Probably debug?

Member:

Yea, definitely.

Maybe even INFO, honestly? Consider only rendering this if there are tasks which haven't already finished, by polling each of them individually — likely using https://docs.rs/tokio/latest/tokio/task/struct.JoinHandle.html#method.is_finished
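
A rough sketch of that idea; it assumes the tail tasks are tracked as tokio JoinHandles rather than the boxed futures in the current diff, so the types here are an assumption:

use tokio::task::JoinHandle;

// Only report tasks that are actually still running at the end of the run.
fn report_pending_tail_tasks(handles: &[JoinHandle<()>]) {
    let pending = handles.iter().filter(|handle| !handle.is_finished()).count();
    if pending > 0 {
        log::info!("waiting for {} tail tasks to complete", pending);
    }
}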

Contributor (Author):

Would it be useful to print out "names" of the tasks that were delayed? (And then I'd add a newtype to carry the task future and its name.)

Member:

Very, yea.

log::trace!("tail tasks completed successfully");
}
Err(_) => {
log::trace!("tail tasks failed to complete within timeout");
Contributor:

Maybe warn?

Contributor (Author):

Possibly. I have another comment suggesting adding names for the tasks. A warning here would be better for the user if it identified what Pants was doing that timed out.

src/rust/engine/src/scheduler.rs (outdated comment, resolved)
@stuhood (Member) left a comment

Thanks a lot!

Comment on lines 499 to 500
let mut tail_tasks = context.tail_tasks.lock();
tail_tasks.push(write_fut.boxed());
@stuhood (Member) commented Sep 22, 2022:

A bit shorter, and a bit less error prone (because it reduces the chances of lines inserted at the end of this function accidentally being under the lock):

Suggested change:
- let mut tail_tasks = context.tail_tasks.lock();
- tail_tasks.push(write_fut.boxed());
+ context.tail_tasks.lock().push(write_fut.boxed());

Contributor (Author):

Will do.

Contributor:

This is a new Rust improvement? I thought before you had to create a distinct variable for the Mutex Guard

Member:

Possibly? I'm not sure if this was an NLL improvement, or whether it already worked.
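
A small illustration of why the one-liner compiles: the temporary guard returned by lock() lives only until the end of the statement, so no named variable is needed. (This uses std's Mutex; the PR's code uses parking_lot, whose lock() does not return a Result.)

use std::sync::Mutex;

fn push_under_lock(tasks: &Mutex<Vec<u32>>) {
    // The temporary MutexGuard is dropped at the end of this statement,
    // releasing the lock as soon as the push completes.
    tasks.lock().unwrap().push(42);
}

fn main() {
    let tasks = Mutex::new(Vec::new());
    push_under_lock(&tasks);
    assert_eq!(*tasks.lock().unwrap(), vec![42]);
}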

workunit_store: WorkunitStore,
build_id: String,
run_id: RunId,
tail_tasks: Arc<Mutex<Vec<BoxFuture<'static, ()>>>>,
Member:

This is probably worth a newtype.
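
A hypothetical sketch of such a newtype (the name TailTasks matches what the PR later introduces, but these fields and methods are an illustration, not the final implementation):

use std::sync::Arc;

use futures::future::BoxFuture;
use parking_lot::Mutex;

#[derive(Clone, Default)]
pub struct TailTasks(Arc<Mutex<Vec<BoxFuture<'static, ()>>>>);

impl TailTasks {
    // Register a future that must complete before the session's run is finished.
    pub fn add(&self, task: BoxFuture<'static, ()>) {
        self.0.lock().push(task);
    }

    // Take ownership of everything registered so far, to await at the end of the run.
    pub fn drain(&self) -> Vec<BoxFuture<'static, ()>> {
        self.0.lock().drain(..).collect()
    }
}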

Comment on lines 325 to 327
// Wait for tail tasks to complete.
let tail_tasks = session.tail_tasks().lock().drain(..).collect();
Self::wait_for_tail_tasks(tail_tasks).await;
Member:

Hm! So, this scoping works, and is probably the simplest possible implementation.

It might be slightly more efficient to make it an explicit method called in a superset of the locations where Scheduler.shutdown is called... something like Session.await_tail_tasks. It would be called at the end of a run, even if the whole Scheduler was not being torn down.

But given how simple this is, I'm fine either way.

@@ -471,7 +471,7 @@ impl crate::CommandRunner for CommandRunner {
   {
     let command_runner = self.clone();
     let result = result.clone();
-    let _write_join = self.executor.spawn(in_workunit!(
+    let write_fut = self.executor.spawn(in_workunit!(
@stuhood (Member) commented Sep 22, 2022:

I think that a tail task should likely be added by an explicit method of Executor (e.g. executor.spawn_tail(&mut tail_tasks, fut)), for two reasons:

  1. If the task is not spawned, it would likely be an error, because the work would not even start until shutdown. That would potentially be a neat edge case to hack, because you could queue up lazy work until shutdown, but IMO you would want it to be explicit instead.
  2. Being able to poll is_finished on a JoinHandle returned directly by the Executor (or a wrapper type if need be) would be very helpful in terms of the debug output at the end of the run (see other comment).

Contributor (Author):

Is the executor associated 1:1 with a session?

@stuhood (Member) commented Sep 22, 2022:

No, it isn't: an executor may be used across multiple sessions. Hence passing the tail_tasks as an argument in the snippet: executor.spawn_tail(&mut tail_tasks, fut).

But if the return type was specific, then you could do it as:

let tail_task = executor.spawn_tail(fut);
// Only compiles because there is a wrapper type which matches.
session.tail_tasks.add(tail_task);
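
A hypothetical sketch of that API shape (Executor's internals are simplified here; spawn_tail and TailTask illustrate the suggestion and are not code from this PR):

use std::future::Future;

use tokio::runtime::Handle;
use tokio::task::JoinHandle;

// Wrapper type: only a value of this type can be added to the session's tail tasks,
// so the work is guaranteed to have been spawned already.
pub struct TailTask(pub JoinHandle<()>);

pub struct Executor {
    handle: Handle,
}

impl Executor {
    pub fn spawn_tail<F>(&self, future: F) -> TailTask
    where
        F: Future<Output = ()> + Send + 'static,
    {
        // Spawning here (rather than lazily at session end) means the work starts
        // immediately, and the JoinHandle lets the session check is_finished() later.
        TailTask(self.handle.spawn(future))
    }
}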

src/rust/engine/src/scheduler.rs (outdated comment, resolved)
@tdyas (Contributor, Author) commented Sep 23, 2022

Just discovered tokio::task::JoinSet, which seems relevant.
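
For reference, a minimal example of tokio::task::JoinSet (stabilized around tokio 1.21, which a later commit on this PR upgrades to): it owns the spawned tasks and lets you await them as a group, which is roughly the shape of the tail-task problem here.

use tokio::task::JoinSet;

#[tokio::main]
async fn main() {
    let mut set: JoinSet<u32> = JoinSet::new();
    for i in 0..3u32 {
        // Tasks start running as soon as they are spawned onto the set.
        set.spawn(async move { i * 2 });
    }
    // join_next() resolves as each task finishes, and returns None once the set is empty.
    while let Some(result) = set.join_next().await {
        println!("task finished with {}", result.unwrap());
    }
}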

@tdyas (Contributor, Author) commented Sep 26, 2022

There has been some significant refactoring of this PR. It is worth a re-review.

@tdyas (Contributor, Author) commented Sep 26, 2022

The latest commits introduce a TailTasks type which encapsulates both spawning and waiting for tail tasks.

let mut inner = match self.inner.lock().take() {
Some(inner) => inner,
None => {
log::debug!("Session end tasks awaited multiple times!");
Contributor (Author):

Apparently this line is triggered even when running ./pants --version. Maybe there is a bug in the session shutdown code?

Member:

wait is being called inside of Scheduler.execute, which is used multiple times within a single Session. For example: if you run multiple goals sequentially, it is executed as a loop over calls to Scheduler.execute.

If you wanted to only wait at the "end" of a Session (rather than once per execute call), you'd need to do something more like #16952 (comment)

@Eric-Arellano (Contributor) left a comment

LGTM, but probably worth Stu's approval.

@stuhood (Member) left a comment

Thanks!

src/rust/engine/.cargo/config (outdated comment, resolved)
src/rust/engine/Cargo.toml (outdated comment, resolved)
src/rust/engine/process_execution/src/remote_cache.rs (outdated comment, resolved)
src/rust/engine/task_executor/src/lib.rs (outdated comment, resolved)
src/rust/engine/.cargo/config (outdated comment, resolved)
src/rust/engine/Cargo.toml (comment, resolved)
src/rust/engine/task_executor/src/lib.rs (outdated comment, resolved)
src/python/pants/bin/local_pants_runner.py (outdated comment, resolved)
@tdyas mentioned this pull request on Sep 28, 2022
tdyas pushed a commit that referenced this pull request Sep 28, 2022
Upgrade Tokio to v1.21.1 plus upgrade related Tokio ecosystem crates.

Besides bug fixes, the main motivation to upgrade is to gain access to `tokio::task::JoinSet` for use in #16952.
Tom Dyas added 6 commits September 27, 2022 23:24 (commits tagged [ci skip-build-wheels])
Tom Dyas added 9 commits September 27, 2022 23:24 (commits tagged [ci skip-build-wheels])
Tom Dyas added 2 commits September 28, 2022 01:35 (commits tagged [ci skip-build-wheels])
@stuhood (Member) left a comment

Thanks!

@@ -417,6 +417,9 @@ thread_local! {
static THREAD_DESTINATION: RefCell<Arc<Destination>> = RefCell::new(Arc::new(Destination(Mutex::new(InnerDestination::Logging))))
}

// Note: The behavior of this task_local! invocation is affected by the `tokio_no_const_thread_local`
Member:

Stale.

@tdyas (Contributor, Author) commented Sep 28, 2022

It appears that log_cache_error is not working with tail tasks. https://github.com/pantsbuild/pants/actions/runs/3144144927/jobs/5110386492#step:13:598

I can add regular logging to show that the tail tasks are completing, but errors from log_cache_error are not being displayed.

@stuhood (Member) commented Sep 28, 2022

> It appears that log_cache_error is not working with tail tasks. https://github.com/pantsbuild/pants/actions/runs/3144144927/jobs/5110386492#step:13:598
>
> I can add regular logging to show that the tail tasks are completing, but errors from log_cache_error are not being displayed.

Mmm... that would be because the task local / thread local information is not being propagated to the new task. See the use of Self::future_with_correct_context here:

///
/// Run a Future on a tokio Runtime as a new Task, and return a Future handle to it.
///
/// Unlike tokio::spawn, if the background Task panics, the returned Future will too.
///
/// If the returned Future is dropped, the computation will still continue to completion: see
/// https://docs.rs/tokio/0.2.20/tokio/task/struct.JoinHandle.html
///
pub fn spawn<O: Send + 'static, F: Future<Output = O> + Send + 'static>(
  &self,
  future: F,
) -> impl Future<Output = O> {
  self
    .handle
    .spawn(Self::future_with_correct_context(future))
    .map(|r| r.expect("Background task exited unsafely."))
}
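
A generic illustration of the underlying issue, using tokio's task_local! directly (this is not Pants' actual logging code): a task-local value is not visible inside a plain tokio::spawn unless the spawned future is re-wrapped in a scope carrying that value, which is roughly what future_with_correct_context does for Executor::spawn.

tokio::task_local! {
    static LOG_DESTINATION: &'static str;
}

#[tokio::main]
async fn main() {
    LOG_DESTINATION
        .scope("pantsd-client", async {
            // Visible here, because we are inside the scope that set it.
            LOG_DESTINATION.with(|dest| println!("outer task sees: {dest}"));

            // A bare tokio::spawn would not inherit the value; re-wrap the spawned
            // future in its own scope to propagate it.
            let dest = LOG_DESTINATION.with(|dest| *dest);
            tokio::spawn(LOG_DESTINATION.scope(dest, async move {
                LOG_DESTINATION.with(|dest| println!("spawned task sees: {dest}"));
            }))
            .await
            .unwrap();
        })
        .await;
}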

@tdyas (Contributor, Author) commented Sep 28, 2022

> Mmm... that would be because the task local / thread local information is not being propagated to the new task. See the use of Self::future_with_correct_context here: […]

That was it. Thanks!

It apparently worked in an earlier iteration of the PR only because of the double spawn: one of the spawns went through Executor, which did wrap the task with future_with_correct_context.

@tdyas changed the title from "add tail tasks API and use for remote cache writes" to "add session end tasks support and use for remote cache writes" on Sep 29, 2022
@tdyas merged commit b837d95 into pantsbuild:main on Sep 29, 2022
@tdyas deleted the tail_tasks_api branch on September 29, 2022 02:00