Avoid input cache when resized #28
Conversation
}
producer->cacheAfter();
cached.insert(producer);
// Resized tensors are those created by operations like pad and
This is an unrelated change. Instead of using cacheAfter, just create a copy of producer and use it as the input to the resize expr.
Example:
tv1 = sum(tv0)
tv2 = some_resize_op(tv1);
tv3 = some_other_op(tv1);
When tv1 is promoted to Global, we want to avoid reducing to a global memory tensor, so with cacheAfter:
tv1 = sum(tv0);
tv4 = tv1
tv4->setMemoryType(Global)
tv2 = some_resize_op(tv4)
tv3 = some_other_op(tv4);
This way, the reduction is done using Local, but some_other_op doesn't actually need to use the gmem copy of tv1. This should be just fine:
tv1 = sum(tv0);
tv4 = tv1
tv4->setMemoryType(Global)
tv2 = some_resize_op(tv4)
tv3 = some_other_op(tv1);
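A minimal sketch of that alternative (assumptions: producer and resize_expr name the tensors in the surrounding pass, set() is the identity-copy op, and a helper like ir_utils::replaceValInExprInputs exists to rewire a single expression input; the exact helper name and signature may differ across nvFuser versions):
// Sketch only: promote just the resize path instead of calling cacheAfter().
TensorView* gmem_copy = set(producer);        // gmem_copy = producer
gmem_copy->setMemoryType(MemoryType::Global); // promote only the copy
// Rewire just the resize expression to read the Global copy; every other
// consumer keeps reading the original (Local) producer.
ir_utils::replaceValInExprInputs(resize_expr, producer, gmem_copy);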
Is this just cacheFork?
Don't think so. There's some similarity, though here we don't change fusion outputs.
I think we can extend the interface of cacheFork to do this?
If we have:
tv1 = sum(tv0)
tv2 = some_resize_op(tv1);
tv3 = some_other_op(tv1);
Can we just do
tv1->setMemoryType(Global);
tv4 = tv1->cacheFork(/*keep_global=*/{tv2->definition()})
And we get:
tv0 -> tv4 (local) -> tv1 -> tv2
tv4 (fork) -> tv3
Just an idea; I have no strong opinion on doing it this way.
cacheFork is also designed to change Fusion outputs, which shouldn't be done in this case. We could make that optional, of course, but this is a simple transformation, so I don't feel it's worth consolidating into an extended cacheFork.
cached.insert(producer);
// Resized tensors are those created by operations like pad and
// slice. If it has no defining expression, it must be a fusion
// input, and no need of the memory type promotion
Can a fusion input have a resize rfactor? I thought fusion inputs never have an rfactor domain.
Yes. When a fusion is segmented, a resized tensor may be an output of one segment and an input to the next segment.
C++ test copied from #27
If the producer of a resized tensor is an input cache and we need to promote it to global memory, don't cache the input at all; instead, read directly from the global memory input. It doesn't make any sense to cache a global memory input in global memory.
This fixes #27, which is caused by the grid sync inserted for an input to pad when that input is a copy of a fusion input.
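A rough sketch of that logic (illustrative only: it assumes an input cache shows up as a LoadStoreOp copy of a fusion input and that a helper like ir_utils::replaceValInExprInputs is available to rewire an expression input; this is not the exact code of the PR):
// If the producer of the resized tensor is just a cache of a fusion input,
// read the fusion input directly instead of promoting the cache.
auto* copy_def = dynamic_cast<LoadStoreOp*>(producer->definition());
if (copy_def != nullptr && copy_def->in()->isFusionInput()) {
  // The fusion input already lives in global memory; re-caching it in global
  // memory would only add a grid sync before the pad.
  ir_utils::replaceValInExprInputs(resize_expr, producer, copy_def->in());
} else {
  // Otherwise keep the existing behavior: cache the producer and promote the
  // cache to global memory.
  auto* cache = producer->cacheAfter();
  cache->setMemoryType(MemoryType::Global);
}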