[TIR] Asynchronous stage in software pipeline #12171

masahi · 2022-07-25T07:19:14Z

This PR implements the asynchronous pipeline feature proposed in apache/tvm-rfcs#80, along with lowering for CUDA asynchronous global-to-shared memory copies.

The main change is in inject_software_pipeline, where the necessary synchronization annotations are inserted according to the user-provided list of async stages, software_pipeline_async_stages.
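For intuition, the commit/wait behavior that these annotations express can be modeled with a small queue. The sketch below is a toy C++ model, not TVM code: AsyncQueueModel, SimulatePipeline, and the slot numbering are hypothetical names. It assumes that committed groups complete in commit order and that a wait names the maximum number of groups allowed to stay in flight, which is modeled on CUDA's cp.async commit/wait groups.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Toy model of a single async queue (hypothetical; not part of TVM).
class AsyncQueueModel {
 public:
  // Record one asynchronous copy in the currently open group.
  void IssueCopy(int buffer_slot) { open_group_.push_back(buffer_slot); }

  // Close the open group and mark it as in flight.
  void Commit() {
    in_flight_.push_back(open_group_);
    open_group_.clear();
  }

  // Wait until at most `max_in_flight` committed groups remain pending.
  // Groups finish in commit order; returns the slots that became readable.
  std::vector<int> Wait(std::size_t max_in_flight) {
    std::vector<int> completed;
    while (in_flight_.size() > max_in_flight) {
      const std::vector<int>& oldest = in_flight_.front();
      completed.insert(completed.end(), oldest.begin(), oldest.end());
      in_flight_.pop_front();
    }
    return completed;
  }

  std::size_t InFlight() const { return in_flight_.size(); }

 private:
  std::vector<int> open_group_;
  std::deque<std::vector<int>> in_flight_;
};

// Simulate a loop whose async producer runs `lead` iterations ahead of the
// consumer. Returns the slots in the order the consumer is allowed to read.
std::vector<int> SimulatePipeline(int num_iters, int lead) {
  assert(num_iters >= 0 && lead >= 0);
  AsyncQueueModel queue;
  std::vector<int> readable;
  // Prologue: start the first `lead` copies before any consumer runs.
  for (int i = 0; i < std::min(lead, num_iters); ++i) {
    queue.IssueCopy(i);
    queue.Commit();
  }
  for (int i = 0; i < num_iters; ++i) {
    // Steady state: prefetch the copy for iteration i + lead, if it exists.
    if (i + lead < num_iters) {
      queue.IssueCopy(i + lead);
      queue.Commit();
    }
    // Only copies for later iterations may stay pending. The allowed count
    // shrinks in the epilogue, when no new copies are being issued.
    const int allowed = std::min(lead, num_iters - 1 - i);
    for (int slot : queue.Wait(static_cast<std::size_t>(allowed))) {
      readable.push_back(slot);
    }
  }
  return readable;
}
```

In this model, SimulatePipeline(5, 2) waits with an allowed in-flight count of 2 during the main loop and then 1 and 0 in the last two iterations, so each wait releases exactly the slot the consumer is about to read.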

@vinx13 @junrushao1994 @csullivan @JosephTheOctonaut @wrongtest-intellif @kparzysz-quic

vinx13 · 2022-07-25T17:21:32Z

src/tir/transforms/thread_storage_sync.cc

@@ -384,6 +426,9 @@ class ThreadSyncInserter : public StmtExprMutator {

 Stmt ThreadSync(Stmt stmt, std::string storage_scope) {
  StorageScope sync_scope = StorageScope::Create(storage_scope);
+  if (sync_scope.rank == StorageRank::kShared && sync_scope.tag == "") {


Do we need to check sync_scope.tag? I assume it also works for dynamic shared memory.

The check only ensures that this code path is hit exactly once. ThreadSyncAfterWaitQueueInserter just looks for async_wait_queue_scope and inserts syncthreads after it. So, assuming that all shared memory, including dynamic shared memory, is protected by async_wait_queue_scope (which InjectSoftwarePipeline should guarantee), all necessary syncthreads will be inserted.

Since ThreadSync is called twice, once for shared and once for shared.dyn (tvm/src/driver/driver_api.cc, lines 530 to 531 in 7ef6811):

mixed_pass_list.push_back(tir::transform::ThreadSync("shared"));
mixed_pass_list.push_back(tir::transform::ThreadSync("shared.dyn"));

we get two syncthreads without this check.

Thinking about it more now, this assumes that async_wait_queue_scope on GPU is always associated with shared memory. This should be fine as long as the only async operation is copying into shared memory. I have to admit this is a bit hacky, but something like this is needed for correctness.
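To make the "hit only once" argument concrete, here is a minimal C++ sketch of the situation. It is a toy model, not the actual pass: ScopeSketch, ParseScope, ThreadSyncSketch, RunBothPasses, and the string-based statement list are all hypothetical stand-ins, and the guard_on_tag flag exists only to compare behavior with and without the check.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed storage scope: in this toy,
// "shared" has an empty tag and "shared.dyn" has the tag ".dyn".
struct ScopeSketch {
  bool is_shared;
  std::string tag;
};

ScopeSketch ParseScope(const std::string& s) {
  const std::string prefix = "shared";
  if (s.compare(0, prefix.size(), prefix) != 0) return {false, ""};
  return {true, s.substr(prefix.size())};
}

// Each string stands in for one statement of the lowered program.
using Program = std::vector<std::string>;

// Toy version of the pass: insert a barrier after every async wait marker.
// With guard_on_tag set, only the plain "shared" invocation does the work.
Program ThreadSyncSketch(const Program& in, const std::string& scope_str,
                         bool guard_on_tag) {
  const ScopeSketch scope = ParseScope(scope_str);
  const bool run = scope.is_shared && (!guard_on_tag || scope.tag.empty());
  if (!run) return in;
  Program out;
  for (const std::string& stmt : in) {
    out.push_back(stmt);
    if (stmt == "async_wait_queue") out.push_back("syncthreads");
  }
  return out;
}

// Mirror the driver, which runs the pass for "shared" and then "shared.dyn".
Program RunBothPasses(const Program& in, bool guard_on_tag) {
  return ThreadSyncSketch(ThreadSyncSketch(in, "shared", guard_on_tag),
                          "shared.dyn", guard_on_tag);
}

int CountSyncs(const Program& p) {
  int n = 0;
  for (const std::string& stmt : p) {
    if (stmt == "syncthreads") ++n;
  }
  return n;
}
```

Running RunBothPasses on a program containing a single async_wait_queue marker yields one syncthreads with the guard and two without it, which is the duplication the tag check is meant to avoid.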

vinx13 · 2022-07-25T18:34:03Z

src/tir/transforms/inject_software_pipeline.cc

+      new_block = Downcast<Block>(
+          Substitute(new_block, {{pipeline_loop_->loop_var, normalized_access_index}}));
+
+      if (pipeline_info_[block].async) {


Can we refactor the async-pipeline-related logic into separate functions to make the original EmitImpl logic more concise?

OK, I moved the bulk of the logic into two functions. EmitImpl itself is now short.

masahi · 2022-07-27T00:45:12Z

src/tir/transforms/inject_software_pipeline.cc

+
+  // Given pipelined blocks and async-related information, generate final loop statements with async
+  // scopes (if any).
+  Array<Stmt> CompletePipelineLoopStatements(


I'm not entirely happy with the choice of this name; suggestions for a better one are welcome.

vinx13 · 2022-07-27T18:42:51Z

src/tir/transforms/inject_software_pipeline.cc

+      new_block = Downcast<Block>(
+          Substitute(new_block, {{pipeline_loop_->loop_var, normalized_access_index}}));
+
+      if (pipeline_info_[block].async) {


It would be great to also refactor this if statement into separate functions.

It's possible, but since this code block touches a lot of state defined in this loop, the extracted function would look rather messy, like this:

void UpdateForAsync(Block block, Block new_block, int stage, size_t new_blocks_size,
                    PrimExpr normalized_access_index, PrimExpr inbound,
                    arith::Analyzer* ana_normalized,
                    std::map<int, AsyncStateLocal>* async_states_local,
                    std::unordered_map<const BufferNode*, int>* buffer_to_commit_group) { ...

And a reader would need to go back and forth between this function and EmitImpl anyway to understand what these variables mean and how they are used.

So I think making this change would hurt readability rather than help it.

* [TIR] Support asynchronous stages in software pipeline transform
* Support interleaved async producers separated by a consumer
* clean up
* adding doc
* adding doc
* simplifying
* make wait count computation a two pass process
* commit_stage -> commit_queue, wait_stage -> wait_queue
* make async_commit_queue special scope stmt
* codegen async_commit_queue in cuda
* clean up
* clean up
* Move block predicate outside of commit_queue
* updating test
* test updated
* changed async_wait to an annotation
* update doc
* update meaning of software_pipeline_async_stages
* update test
* fixing codegen
* more fix
* remove one of tests that have async and sync ops in the same stage
* format
* lint and other fix
* Define attr::software_pipeline_async_stages
* populate wait count in a separate function
* fold variabel consumed into AsyncStateLocal
* introduce CompletePipelineLoopStatements function for further refactor

masahi force-pushed the async-sync branch 2 times, most recently from e656cbe to 1baf10d Compare July 25, 2022 08:05

masahi marked this pull request as ready for review July 25, 2022 11:07

vinx13 reviewed Jul 25, 2022

masahi changed the title ~~[TIR] Asynchrounos stage in software pipeline~~ [TIR] Asynchronous stage in software pipeline Jul 26, 2022

masahi commented Jul 27, 2022

masahi force-pushed the async-sync branch from 13e77d1 to 2f26fb2 Compare July 27, 2022 01:53

masahi added 23 commits July 27, 2022 15:47

[TIR] Support asynchronous stages in software pipeline transform

dc5e2ef

Support interleaved async producers separated by a consumer

1054638

clean up

fcb75a5

adding doc

ab78c35

adding doc

b2ade84

simplifying

769632b

make wait count computation a two pass process

67f81a7

commit_stage -> commit_queue, wait_stage -> wait_queue

8c01129

make async_commit_queue special scope stmt

9d0f7d6

codegen async_commit_queue in cuda

a5a4bfc

clean up

6e0b442

clean up

75f8a38

Move block predicate outside of commit_queue

8f04f70

updating test

bc4f073

test updated

c80bbd9

changed async_wait to an annotation

7e50d2f

update doc

b4289a3

update meaning of software_pipeline_async_stages

be51062

update test

d446581

fixing codegen

8228587

more fix

07dd0b2

remove one of tests that have async and sync ops in the same stage

dca56c6

format

8a0ff51

masahi added 5 commits July 27, 2022 15:47

lint and other fix

468566f

Define attr::software_pipeline_async_stages

bf13acf

populate wait count in a separate function

787f608

fold variabel consumed into AsyncStateLocal

44bbb12

introduce CompletePipelineLoopStatements function for further refactor

d4ae91a

masahi force-pushed the async-sync branch from 2f26fb2 to d4ae91a Compare July 27, 2022 06:47

vinx13 reviewed Jul 27, 2022

vinx13 approved these changes Jul 27, 2022

vinx13 merged commit 3c737fb into apache:main Jul 28, 2022

AndrewZhaoLuo mentioned this pull request Oct 4, 2022

TVM v0.10.0.rc0 Release Candidate Notes #12979

Closed
