IterDomain resize for pad, cat, slice #3

jjsjann123 · 2023-03-14T09:21:25Z

Cherry-picking from: csarofeen/pytorch#2480

Author: Naoya Maruyama naoyam@users.noreply.github.com
Date: Mon Mar 13 17:12:01 2023 -0700

IterDomain resize for pad, cat, slice (#2480)

* fix tests for multicluster fusion

jjsjann123 · 2023-03-14T09:43:52Z

I'm getting a segfault from this commit in the C++ test `NVFuserTest.FusionReshapeReductionShmoo_CUDA`.

It even repros on csarofeen/devel, so it looks like the original PR is busted as well.

jjsjann123 · 2023-03-14T09:55:34Z

I'm marking this one as draft to block merging, since there's a failing test.

naoyam · 2023-03-14T16:22:32Z

I can't repro any failure with the test. I'm using the PyT A100 node. What does the crash look like?

naoyam

There are C++ test failures due to string mismatch of generated code. Strangely, when I run all the C++ tests together, I don't see any error on my container, but I see errors on the nightly container, and those failing tests do fail on my container when run individually. The diff seems to happen only with bool predicate scalar variables. It seems there's some non-determinism in expr simplification. I'll work with Xiang when he's back.

Fixes #8 Based on #3 --------- Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com> Co-authored-by: Jacob Hinkle <jhinkle@nvidia.com>

```
Traceback (most recent call last):
  File "/opt/pytorch/nvfuser/nvfuser/__init__.py", line 122, in execute
    result = self._execute(
RuntimeError: isSame(values_[it.first], it.second) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/evaluator_common.cpp":314, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Precomputed values failed to validate.
Something unexpected changed between the compilation and execution.
nan != nan
Exception raised from validate at /opt/pytorch/nvfuser/csrc/evaluator_common.cpp:314 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x8d (0x7fdc9919fe3b in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x53 (0x7fdc992ded63 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #2: nvfuser::PrecomputedValues::validate() + 0x172 (0x7fdc993190f2 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #3: nvfuser::PrecomputedValues::evaluate() + 0x66 (0x7fdc9931fde6 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #4: nvfuser::FusionExecutor::inferOutputSizes(nvfuser::Fusion*, nvfuser::KernelArgumentHolder const&) + 0x8d (0x7fdc992ea12d in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #5: nvfuser::FusionKernelRuntime::compileFusionParallel(nvfuser::KernelArgumentHolder) + 0x46d (0x7fdc9943a6ad in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #6: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef<c10::IValue> const&, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0xa8d (0x7fdc99443c9d in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #7: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef<c10::IValue> const&, bool, bool, std::optional<signed char>) const + 0x331 (0x7fdc997450e1 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #8: <unknown function> + 0xeec2e (0x7fdbe8274c2e in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x16e137 (0x7fdbe82f4137 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
<omitting python frames>
frame #38: <unknown function> + 0x29d90 (0x7fdd26ea0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #39: __libc_start_main + 0x80 (0x7fdd26ea0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
```
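The `nan != nan` line in the message above is consistent with a validation step that compares a stored value against a freshly bound one using ordinary floating-point equality, under which NaN never equals itself. The following is only a minimal Python sketch of that pitfall. The function names are hypothetical and this is not how nvFuser's `isSame` is implemented.

```python
import math


def is_same_strict(a, b):
    """Plain equality: under IEEE 754 rules, NaN compares unequal to NaN."""
    return a == b


def is_same_nan_aware(a, b):
    """Hypothetical variant that treats two NaN floats as the same value."""
    both_float = isinstance(a, float) and isinstance(b, float)
    if both_float and math.isnan(a) and math.isnan(b):
        return True
    return a == b


nan = float("nan")
print(is_same_strict(nan, nan))     # False
print(is_same_nan_aware(nan, nan))  # True
```

Whether a NaN-aware comparison is the right fix depends on whether a NaN input should be considered valid at all, which the traceback alone does not settle.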

Merge upstream 191223

This introduces a thread-local global memory allocator for each device and uses it whenever there is an intermediate tensor needed which requires zero-initialization. To enable it, use `NVFUSER_ENABLE=reuse_zeroed_memory`. You can monitor the allocator using `NVFUSER_DUMP=global_zeroed_memory`.

Before we enable this feature by default, we need to have high confidence that every kernel using zero-initialized memory will always clean up its semaphores. This is currently only the case for serial grid reductions, as far as I know.

This enables the basic functionality of #1829. However, it does not modify existing algorithms to clean up their memory. See `NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling`, which succeeds when using serial grid reduction, but fails (in debug mode) when using `gridReduce` (note that this test is updated to behave differently in this PR):

```
# NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling
Running main() from /opt/pytorch/nvfuser/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = SerialGridReductionTest.Scheduling
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from SerialGridReductionTest
[ RUN ] SerialGridReductionTest.Scheduling
[global zeroed memory] Resizing arena to 512 bytes
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Resizing arena to 16384 bytes
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
unknown file: Failure
C++ exception with description "nnz.equal(0) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/global_allocator.cpp":88, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Global memory arena was not properly zeroed. Found 2048 bytes that are not zero
Exception raised from checkZeroed at /opt/pytorch/nvfuser/csrc/global_allocator.cpp:88 (most recent call first):
frame #0: <unknown function> + 0x2fde9e (0x556cdb95de9e in build/nvfuser_tests)
frame #1: <unknown function> + 0x2fe0df (0x556cdb95e0df in build/nvfuser_tests)
frame #2: <unknown function> + 0x3f3720 (0x556cdba53720 in build/nvfuser_tests)
frame #3: <unknown function> + 0x3f33df (0x556cdba533df in build/nvfuser_tests)
frame #4: <unknown function> + 0x3f38ed (0x556cdba538ed in build/nvfuser_tests)
frame #5: <unknown function> + 0x315e67 (0x556cdb975e67 in build/nvfuser_tests)
frame #6: <unknown function> + 0x7c5780 (0x556cdbe25780 in build/nvfuser_tests)
frame #7: <unknown function> + 0x7c5877 (0x556cdbe25877 in build/nvfuser_tests)
frame #8: <unknown function> + 0x138f8cc (0x556cdc9ef8cc in build/nvfuser_tests)
frame #9: <unknown function> + 0x1457f0b (0x556cdcab7f0b in build/nvfuser_tests)
frame #10: <unknown function> + 0x14519fd (0x556cdcab19fd in build/nvfuser_tests)
frame #11: <unknown function> + 0x142de24 (0x556cdca8de24 in build/nvfuser_tests)
frame #12: <unknown function> + 0x142e93f (0x556cdca8e93f in build/nvfuser_tests)
frame #13: <unknown function> + 0x142f345 (0x556cdca8f345 in build/nvfuser_tests)
frame #14: <unknown function> + 0x143f86c (0x556cdca9f86c in build/nvfuser_tests)
frame #15: <unknown function> + 0x1458e98 (0x556cdcab8e98 in build/nvfuser_tests)
frame #16: <unknown function> + 0x1452ac7 (0x556cdcab2ac7 in build/nvfuser_tests)
frame #17: <unknown function> + 0x143de6d (0x556cdca9de6d in build/nvfuser_tests)
frame #18: <unknown function> + 0x1407ca0 (0x556cdca67ca0 in build/nvfuser_tests)
frame #19: <unknown function> + 0x1407c19 (0x556cdca67c19 in build/nvfuser_tests)
frame #20: <unknown function> + 0x29d90 (0x7f616c7d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f616c7d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x11e9d5 (0x556cdb77e9d5 in build/nvfuser_tests)
" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1711120799 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='SerialGridReductionTest.Scheduling'
[ FAILED ] SerialGridReductionTest.Scheduling (5669 ms)
[----------] 1 test from SerialGridReductionTest (5669 ms total)
```

This test runs with serial grid reduction, then with `gridReduce`. Each time it runs two grid reductions. Both serial grid reductions succeed because the semaphore buffer is properly zeroed. The `gridReduce` succeeds the first time since the memory pool calls `at::zeros` again to request a larger buffer size (`gridReduce` requires more semaphores since there is one per thread segment vs. one for each block segment). However, the second call to `gridReduce` fails because it has not cleaned up its semaphores. Hacking that function to force `PERSISTENT=1` would clean up the semaphores, resulting in success in this case. I'm leaving those kinds of modifications for a follow-up.
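The sequence described above (a resize hides leftover data, a same-size reuse exposes it) can be illustrated with a toy model. This is only a sketch under stated assumptions: `ZeroedArena`, `clean_kernel`, and `dirty_kernel` are hypothetical names, the buffer is a host-side `bytearray` rather than device memory, and the zero check is placed at allocation time, which is a guess that matches the log ordering but is not confirmed by the source.

```python
class ZeroedArena:
    """Toy model of a reusable buffer whose users must leave it zeroed."""

    def __init__(self):
        self._buf = bytearray()
        self._used = 0

    def allocate(self, nbytes):
        start, end = self._used, self._used + nbytes
        if end > len(self._buf):
            # Growing requests a brand-new zero-filled buffer, so any bytes
            # left dirty in the old one are silently discarded.
            self._buf = bytearray(end)
        # Debug-style check (assumed placement): the range must already be zero.
        nonzero = sum(1 for b in self._buf[start:end] if b)
        if nonzero:
            raise RuntimeError(
                f"arena was not properly zeroed: found {nonzero} nonzero bytes"
            )
        self._used = end
        return memoryview(self._buf)[start:end]

    def reset(self):
        # Make the whole arena available again without re-zeroing it.
        self._used = 0


def clean_kernel(sem):
    sem[0] = 1  # use a semaphore
    sem[0] = 0  # clean it up before finishing


def dirty_kernel(sem):
    sem[0] = 1  # use a semaphore and leave it set


arena = ZeroedArena()
for _ in range(2):  # both "serial" runs succeed
    clean_kernel(arena.allocate(512))
    arena.reset()

dirty_kernel(arena.allocate(16384))  # succeeds: the resize gave fresh zeros
arena.reset()
try:
    arena.allocate(16384)  # same size, no resize, leftover byte is detected
except RuntimeError as err:
    print("second dirty run failed:", err)
```

In this model the fix is on the kernel side, not the allocator side: every user of the arena has to restore zeros itself, which is the guarantee the description says only serial grid reductions currently provide.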

Type error was detected with #3263 while I was testing it with a Debug build.

```
pytest -v tests/python/test_python_frontend.py -k test_pad_dynamic
```

It has a fusion of:

```
Inputs:
  T0_g_float[ bS0{1}, iS1{i1}, iS2{i2} ]
Outputs:
  T1_g_float[ bS11{1}, iS12{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}, iS13{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )} ]

%kernel_math {
f7 = (float)(7);
f9 = float(2.5) * f7;
i11 = (int64_t)(f9);
i14 = (nvfuser_index_t)(i11);
i16 = (nvfuser_index_t)(i11);
i18 = (nvfuser_index_t)(i11);
i20 = (nvfuser_index_t)(i11);
T2_l_float[ bS3{1}, iS5{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf, iS7{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf ]
   = pad( T0_g_float[ bS0{1}, iS1{i1}, iS2{i2} ], {0, 0, i14, i16, i18, i20} )
T1_g_float[ bS11{1}, iS12{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}, iS13{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )} ]
   = Set( T2_l_float[ bS3{1}, iS5{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf, iS7{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf ], cache_op=Streaming )
} // %kernel_math
```

Stack trace:

```
#0 __cxxabiv1::__cxa_throw (obj=0xabc6ea0, tinfo=0x7ffeba81c248 <typeinfo for nvfuser::nvfError>, dest=0x7ffeba059370 <nvfuser::nvfError::~nvfError()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:80
#1 0x00007ffeba058665 in nvfuser::nvfCheckFail (func=0x7ffeb9927c33 "as", file=0x7ffeb99d9a1b "/raid/nmaruyama/debug1/csrc/utils.h", line=119, msg=0x7ffeb99c36e6 " INTERNAL ASSERT FAILED at \"/raid/nmaruyama/debug1/csrc/utils.h\":119, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. ") at /raid/nmaruyama/debug1/csrc/exceptions.cpp:283
#2 0x00007ffeb9c1be4b in nvfuser::nvfErrorFail (func=0x7ffeb9927c33 "as", file=0x7ffeb99d9a1b "/raid/nmaruyama/debug1/csrc/utils.h", line=119, condMsg=0x7ffeb99c36e6 " INTERNAL ASSERT FAILED at \"/raid/nmaruyama/debug1/csrc/utils.h\":119, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. ") at /raid/nmaruyama/debug1/csrc/exceptions.h:229
#3 0x00007ffeb9c1bbe4 in nvfuser::PolymorphicBase::as<nvfuser::TensorView> (this=0xac07490) at /raid/nmaruyama/debug1/csrc/utils.h:119
#4 0x00007ffeba54c67f in nvfuser::(anonymous namespace)::isLoadGlobalToLocal (expr=0xabd7c50) at /raid/nmaruyama/debug1/csrc/scheduler/cache_policy_refiner.cpp:61
#5 0x00007ffeba54c599 in nvfuser::refineCachePolicy (fusion=0xabef940) at /raid/nmaruyama/debug1/csrc/scheduler/cache_policy_refiner.cpp:153
```
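The extent expressions in the fusion printout are long, but the scalar chain `f7`, `f9`, `i11` reduces to a single integer pad width. A small Python sketch of that arithmetic follows. The helper names are hypothetical, and it assumes the `(int64_t)` cast truncates toward zero, as a C++ float-to-integer cast does.

```python
def pad_width(scale=2.5, base=7):
    """Mirror the scalar chain f7 -> f9 -> i11 from the printed kernel math."""
    f7 = float(base)   # f7 = (float)(7)
    f9 = scale * f7    # f9 = float(2.5) * f7
    i11 = int(f9)      # i11 = (int64_t)(f9), truncating toward zero
    return i11


def padded_extent(extent, width):
    """Extent of a padded axis: original extent plus the width on each side."""
    return extent + width + width


w = pad_width()
print(w)                     # 17
print(padded_extent(10, w))  # 44
```

So for an input axis of extent `i1`, the printed `iS12` extent amounts to `i1 + 34`. The sketch says nothing about the failing `as<TensorView>` cast itself, which the stack trace places in `isLoadGlobalToLocal`.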

samnordmann and others added 2 commits March 14, 2023 02:07

Fix multidevice tests (#2574)

e050187

* fix tests for multicluster fusion

IterDomain resize for pad, cat, slice (#2480)

839a299

jjsjann123 changed the title ~~IterDomain resize for pad, cat, slice (#2480)~~ IterDomain resize for pad, cat, slice Mar 14, 2023

jjsjann123 requested a review from naoyam March 14, 2023 09:24

jjsjann123 mentioned this pull request Mar 14, 2023

IterDomain resize for pad, cat, slice csarofeen/pytorch#2480

Merged

3 tasks

jjsjann123 marked this pull request as draft March 14, 2023 09:55

jjsjann123 mentioned this pull request Mar 14, 2023

persistent_use_of_buffer is accumulated over all the resolution points. #4

Merged

Base automatically changed from cherry_pick_2574 to main March 14, 2023 10:22

bug fix

7ee79cf

This was referenced Mar 15, 2023

IterDomain resize for pad, cat, slice #9

Closed

Add pad to Python frontend #10

Merged

naoyam marked this pull request as ready for review March 15, 2023 15:30

Merge branch 'main' into cherry_pick_2480

667bb3b

naoyam approved these changes Mar 15, 2023


naoyam merged commit 979517a into main Mar 15, 2023

naoyam deleted the cherry_pick_2480 branch March 15, 2023 15:58

jacobhinkle added a commit that referenced this pull request Mar 24, 2023

Add pad to Python frontend (#10)

4c18efc

Fixes #8 Based on #3 --------- Co-authored-by: Naoya Maruyama <nmaruyama@nvidia.com> Co-authored-by: Jacob Hinkle <jhinkle@nvidia.com>

liqiangxl mentioned this pull request May 8, 2023

MatmulSASSTest on H100 #306

Closed

jjsjann123 mentioned this pull request Nov 9, 2023

Codegen error: Exception raised from inferShape at /opt/pytorch/nvfuser/csrc/executor.cpp:532 #1277

Closed

cowanmeg pushed a commit to cowanmeg/Fuser that referenced this pull request Dec 20, 2023

Merge pull request NVIDIA#3 from samnordmann/merge_upstream_191223

c7bb6d4

Merge upstream 191223

DariusMaham mentioned this pull request Apr 18, 2024

Internal assert failed #2107

Closed

wujingyue mentioned this pull request May 8, 2024

Write a sharded transformer block in nvFuser API. #2199

Closed

This was referenced Jun 6, 2024

Squeezed IterDomain ?S536{1} must concretize to IterType::Broadcast but found ?S536{1}. #2359

Closed

Merging IterDomains requires that their iteration types match. #2317

Closed

naoyam mentioned this pull request Sep 5, 2024

Add slice tests to demonstrate manual scheduling #2898

Merged

wujingyue mentioned this pull request Oct 18, 2024

OpInfo has problems testing define_tensor. #3225

Closed
