Fix multidevice tests #2

jjsjann123 · 2023-03-14T09:17:53Z

Cherry-picking from: csarofeen/pytorch#2574

Author: samnordmann <snordmann@nvidia.com>
Date: Mon Mar 13 17:44:39 2023 +0100

Fix multidevice tests (#2574)

* fix tests for multicluster fusion

* fix tests for multicluster fusion

jjsjann123 · 2023-03-14T10:20:20Z

Is NVFuserTest.MultiClusterFusion_CUDA flaky and non-deterministic?

[  SKIPPED ] NVFuserTest.FusionMultiGPU_Reduce_CUDA (0 ms)
[ RUN      ] NVFuserTest.MultiClusterFusion_CUDA
unknown file: Failure
C++ exception with description "obtained_string_aDag == ref_string_aDag INTERNAL ASSERT FAILED at "/jiej/playground/gitlab/github_nvfuser/test/test_multicluster_fusion.cpp":138, please report a bug to PyTorch. the obtained AggregateDag is not the one expected
Exception raised from TestBody at /jiej/playground/gitlab/github_nvfuser/test/test_multicluster_fusion.cpp:138 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fe4f2dadafc in /opt/pytorch/pytorch/build/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fe4f2d706c8 in /opt/pytorch/pytorch/build/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) + 0x57 (0x7fe4f2dab817 in /opt/pytorch/pytorch/build/lib/libc10.so)
frame #3: <unknown function> + 0x50a9ed (0x5599b483a9ed in ./bin/nvfuser_tests)
frame #4: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x51 (0x5599b4880441 in ./bin/nvfuser_tests)
frame #5: <unknown function> + 0x542590 (0x5599b4872590 in ./bin/nvfuser_tests)
frame #6: <unknown function> + 0x542a25 (0x5599b4872a25 in ./bin/nvfuser_tests)
frame #7: <unknown function> + 0x543181 (0x5599b4873181 in ./bin/nvfuser_tests)
frame #8: testing::internal::UnitTestImpl::RunAllTests() + 0x10e9 (0x5599b4874909 in ./bin/nvfuser_tests)
frame #9: testing::UnitTest::Run() + 0x98 (0x5599b4874e28 in ./bin/nvfuser_tests)
frame #10: <unknown function> + 0x14c80c (0x5599b447c80c in ./bin/nvfuser_tests)
frame #11: __libc_start_main + 0xf3 (0x7fe4f2839083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #12: _start + 0x2e (0x5599b44b4d1e in ./bin/nvfuser_tests)
" thrown in the test body.
[  FAILED  ] NVFuserTest.MultiClusterFusion_CUDA (0 ms)

I saw this in one of my full test runs, but can't repro it afterwards... I'm gonna merge it as-is. If the failure shows up again, we can revisit this.

- post-review changes, part #2

```
Traceback (most recent call last):
  File "/opt/pytorch/nvfuser/nvfuser/__init__.py", line 122, in execute
    result = self._execute(
RuntimeError: isSame(values_[it.first], it.second) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/evaluator_common.cpp":314, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Precomputed values failed to validate. Something unexpected changed between the compilation and execution. nan != nan
Exception raised from validate at /opt/pytorch/nvfuser/csrc/evaluator_common.cpp:314 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x8d (0x7fdc9919fe3b in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x53 (0x7fdc992ded63 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #2: nvfuser::PrecomputedValues::validate() + 0x172 (0x7fdc993190f2 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #3: nvfuser::PrecomputedValues::evaluate() + 0x66 (0x7fdc9931fde6 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #4: nvfuser::FusionExecutor::inferOutputSizes(nvfuser::Fusion*, nvfuser::KernelArgumentHolder const&) + 0x8d (0x7fdc992ea12d in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #5: nvfuser::FusionKernelRuntime::compileFusionParallel(nvfuser::KernelArgumentHolder) + 0x46d (0x7fdc9943a6ad in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #6: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef<c10::IValue> const&, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0xa8d (0x7fdc99443c9d in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #7: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef<c10::IValue> const&, bool, bool, std::optional<signed char>) const + 0x331 (0x7fdc997450e1 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #8: <unknown function> + 0xeec2e (0x7fdbe8274c2e in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x16e137 (0x7fdbe82f4137 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
<omitting python frames>
frame #38: <unknown function> + 0x29d90 (0x7fdd26ea0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #39: __libc_start_main + 0x80 (0x7fdd26ea0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
```

Automatic pipeline creation

This introduces a thread-local global memory allocator for each device and uses it whenever an intermediate tensor is needed that requires zero-initialization.

To enable, use `NVFUSER_ENABLE=reuse_zeroed_memory`. You can monitor the allocator using `NVFUSER_DUMP=global_zeroed_memory`.

Before we enable this feature by default, we need to have high confidence that every kernel using zero-initialized memory will always clean up its semaphores. This is currently only the case for serial grid reductions, as far as I know.

This enables the basic functionality of #1829. However, it does not modify existing algorithms to clean up their memory. See `NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling`, which succeeds when using serial grid reduction, but fails (in debug mode) when using `gridReduce` (note that this test is updated to behave differently in this PR):

```
# NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling
Running main() from /opt/pytorch/nvfuser/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = SerialGridReductionTest.Scheduling
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from SerialGridReductionTest
[ RUN      ] SerialGridReductionTest.Scheduling
[global zeroed memory] Resizing arena to 512 bytes
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Resizing arena to 16384 bytes
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
unknown file: Failure
C++ exception with description "nnz.equal(0) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/global_allocator.cpp":88, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Global memory arena was not properly zeroed. Found 2048 bytes that are not zero
Exception raised from checkZeroed at /opt/pytorch/nvfuser/csrc/global_allocator.cpp:88 (most recent call first):
frame #0: <unknown function> + 0x2fde9e (0x556cdb95de9e in build/nvfuser_tests)
frame #1: <unknown function> + 0x2fe0df (0x556cdb95e0df in build/nvfuser_tests)
frame #2: <unknown function> + 0x3f3720 (0x556cdba53720 in build/nvfuser_tests)
frame #3: <unknown function> + 0x3f33df (0x556cdba533df in build/nvfuser_tests)
frame #4: <unknown function> + 0x3f38ed (0x556cdba538ed in build/nvfuser_tests)
frame #5: <unknown function> + 0x315e67 (0x556cdb975e67 in build/nvfuser_tests)
frame #6: <unknown function> + 0x7c5780 (0x556cdbe25780 in build/nvfuser_tests)
frame #7: <unknown function> + 0x7c5877 (0x556cdbe25877 in build/nvfuser_tests)
frame #8: <unknown function> + 0x138f8cc (0x556cdc9ef8cc in build/nvfuser_tests)
frame #9: <unknown function> + 0x1457f0b (0x556cdcab7f0b in build/nvfuser_tests)
frame #10: <unknown function> + 0x14519fd (0x556cdcab19fd in build/nvfuser_tests)
frame #11: <unknown function> + 0x142de24 (0x556cdca8de24 in build/nvfuser_tests)
frame #12: <unknown function> + 0x142e93f (0x556cdca8e93f in build/nvfuser_tests)
frame #13: <unknown function> + 0x142f345 (0x556cdca8f345 in build/nvfuser_tests)
frame #14: <unknown function> + 0x143f86c (0x556cdca9f86c in build/nvfuser_tests)
frame #15: <unknown function> + 0x1458e98 (0x556cdcab8e98 in build/nvfuser_tests)
frame #16: <unknown function> + 0x1452ac7 (0x556cdcab2ac7 in build/nvfuser_tests)
frame #17: <unknown function> + 0x143de6d (0x556cdca9de6d in build/nvfuser_tests)
frame #18: <unknown function> + 0x1407ca0 (0x556cdca67ca0 in build/nvfuser_tests)
frame #19: <unknown function> + 0x1407c19 (0x556cdca67c19 in build/nvfuser_tests)
frame #20: <unknown function> + 0x29d90 (0x7f616c7d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f616c7d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x11e9d5 (0x556cdb77e9d5 in build/nvfuser_tests)
" thrown in the test body.
To reproduce: NVFUSER_TEST_RANDOM_SEED=1711120799 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='SerialGridReductionTest.Scheduling'
[  FAILED  ] SerialGridReductionTest.Scheduling (5669 ms)
[----------] 1 test from SerialGridReductionTest (5669 ms total)
```

This test runs with serial grid reduction, then with `gridReduce`. Each time it runs two grid reductions. Both serial grid reductions succeed because the semaphore buffer is properly zeroed. The `gridReduce` succeeds the first time since the memory pool calls `at::zeros` again to request a larger buffer size (`gridReduce` requires more semaphores since there is one per thread segment vs one for each block segment). However, the second call to `gridReduce` fails because it has not cleaned up its semaphores. Hacking that function to force `PERSISTENT=1` would clean up the semaphores, resulting in success in this case. I'm leaving those kinds of modifications for a follow-up.
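To make the failure mode above concrete, here is a small host-side C++ model of the arena behaviour visible in the dump output: growing the arena yields fresh zeros, resetting only rewinds the offset, and a debug-style check counts leftover non-zero bytes. All names (`ZeroedArena`, `cleanKernel`, `leakyKernel`) are hypothetical and this is not the code in `global_allocator.cpp`; it is only a sketch of the contract, assuming the arena never re-zeroes memory on reset.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of a zeroed-memory arena (illustrative only).
class ZeroedArena {
 public:
  // Hand out `bytes` bytes starting at the current offset. If the arena is
  // too small, it is replaced by a larger, freshly zeroed buffer (the role
  // that `at::zeros` plays in the description above).
  std::size_t allocate(std::size_t bytes) {
    if (allocated_ + bytes > storage_.size()) {
      storage_.assign(allocated_ + bytes, 0);
    }
    const std::size_t offset = allocated_;
    allocated_ += bytes;
    return offset;
  }

  // Rewind the offset so the same bytes get reused. Nothing is re-zeroed:
  // the arena relies on its users to have cleaned up.
  void reset() { allocated_ = 0; }

  // Debug-style check: number of bytes in the arena that are not zero.
  std::size_t countNonZero() const {
    return static_cast<std::size_t>(std::count_if(
        storage_.begin(), storage_.end(),
        [](std::uint8_t b) { return b != 0; }));
  }

  std::uint8_t* data() { return storage_.data(); }

 private:
  std::vector<std::uint8_t> storage_;
  std::size_t allocated_ = 0;
};

// Models a kernel that uses its semaphores and zeroes them before exiting.
void cleanKernel(std::uint8_t* sem, std::size_t n) {
  std::fill(sem, sem + n, std::uint8_t{1});
  std::fill(sem, sem + n, std::uint8_t{0});
}

// Models a kernel that leaves its semaphores set.
void leakyKernel(std::uint8_t* sem, std::size_t n) {
  std::fill(sem, sem + n, std::uint8_t{1});
}
```

In this model a `leakyKernel` run followed by `reset()` leaves non-zero bytes behind, while a subsequent larger `allocate()` hides the leak once because the buffer is replaced by fresh zeros. That is the same sequence as in the log: the first `gridReduce` passes after the resize to 16384 bytes, and the second one trips the check.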

Type error was detected with #3263 while I was testing it with a Debug build.

```
pytest -v tests/python/test_python_frontend.py -k test_pad_dynamic
```

It has a fusion of:

```
Inputs:
  T0_g_float[ bS0{1}, iS1{i1}, iS2{i2} ]
Outputs:
  T1_g_float[ bS11{1}, iS12{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}, iS13{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )} ]

%kernel_math {
f7 = (float)(7);
f9 = float(2.5) * f7;
i11 = (int64_t)(f9);
i14 = (nvfuser_index_t)(i11);
i16 = (nvfuser_index_t)(i11);
i18 = (nvfuser_index_t)(i11);
i20 = (nvfuser_index_t)(i11);
T2_l_float[ bS3{1}, iS5{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf, iS7{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf ]
   = pad( T0_g_float[ bS0{1}, iS1{i1}, iS2{i2} ], {0, 0, i14, i16, i18, i20} )
T1_g_float[ bS11{1}, iS12{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}, iS13{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )} ]
   = Set( T2_l_float[ bS3{1}, iS5{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf, iS7{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf ], cache_op=Streaming )
} // %kernel_math
```

Stack trace:

```
#0 __cxxabiv1::__cxa_throw (obj=0xabc6ea0, tinfo=0x7ffeba81c248 <typeinfo for nvfuser::nvfError>, dest=0x7ffeba059370 <nvfuser::nvfError::~nvfError()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:80
#1 0x00007ffeba058665 in nvfuser::nvfCheckFail (func=0x7ffeb9927c33 "as", file=0x7ffeb99d9a1b "/raid/nmaruyama/debug1/csrc/utils.h", line=119, msg=0x7ffeb99c36e6 " INTERNAL ASSERT FAILED at \"/raid/nmaruyama/debug1/csrc/utils.h\":119, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. ") at /raid/nmaruyama/debug1/csrc/exceptions.cpp:283
#2 0x00007ffeb9c1be4b in nvfuser::nvfErrorFail (func=0x7ffeb9927c33 "as", file=0x7ffeb99d9a1b "/raid/nmaruyama/debug1/csrc/utils.h", line=119, condMsg=0x7ffeb99c36e6 " INTERNAL ASSERT FAILED at \"/raid/nmaruyama/debug1/csrc/utils.h\":119, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. ") at /raid/nmaruyama/debug1/csrc/exceptions.h:229
#3 0x00007ffeb9c1bbe4 in nvfuser::PolymorphicBase::as<nvfuser::TensorView> (this=0xac07490) at /raid/nmaruyama/debug1/csrc/utils.h:119
#4 0x00007ffeba54c67f in nvfuser::(anonymous namespace)::isLoadGlobalToLocal (expr=0xabd7c50) at /raid/nmaruyama/debug1/csrc/scheduler/cache_policy_refiner.cpp:61
#5 0x00007ffeba54c599 in nvfuser::refineCachePolicy (fusion=0xabef940) at /raid/nmaruyama/debug1/csrc/scheduler/cache_policy_refiner.cpp:153
```
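The trace shows the assert firing inside `PolymorphicBase::as<nvfuser::TensorView>`, called from `isLoadGlobalToLocal`. A common way to avoid this class of failure is to test the dynamic type before downcasting. The sketch below uses hypothetical stand-in types (`Val`, `TensorView`, `Scalar`) rather than nvFuser's real classes, and its `as<T>()` always checks, whereas the report above only says the error showed up in a Debug build. It is not the fix that was applied, just an illustration of the pattern.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins for an IR value hierarchy (not nvFuser's types).
struct Val {
  virtual ~Val() = default;

  // True if the dynamic type of this object is T or derives from T.
  template <typename T>
  bool isA() const {
    return dynamic_cast<const T*>(this) != nullptr;
  }

  // Checked downcast: throws instead of returning a wrongly typed pointer.
  template <typename T>
  const T* as() const {
    const T* casted = dynamic_cast<const T*>(this);
    if (casted == nullptr) {
      throw std::logic_error("as<T>() called on a value of another type");
    }
    return casted;
  }
};

struct TensorView : Val {
  int num_dims = 0;
};

struct Scalar : Val {
  double value = 0.0;
};

// Guarded query: checks the type first, so a scalar argument yields `false`
// rather than tripping the checked cast.
bool isTensorWithRank(const Val* v, int rank) {
  return v->isA<TensorView>() && v->as<TensorView>()->num_dims == rank;
}
```

Calling `isTensorWithRank` on a `Scalar` returns `false`, while calling `as<TensorView>()` on it directly throws, which is the analogue of the assert in frame #3.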

Example error message:

```CUDA
[ RUN      ] TMemTest.AddKernelSameRegion
unknown file: Failure
C++ exception with description " INTERNAL ASSERT FAILED at "/home/gaoxiang/Fuser/csrc/runtime/compiled_kernel.cpp":169, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. 
// Codegen generated utilities
namespace tmem {
__device__ __inline__ void alloc(uint32_t in0, uint32_t in1) {
  asm volatile("tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;\n"::"r"(in0), "r"(in1));
}
__device__ __inline__ void relinquishAllocPermit() {
  asm volatile("tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;\n");
}
__device__ __inline__ void store(uint32_t in0, Array<float, 1, 1> in1) {
  asm volatile(
    "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n"
    :
    :"r"(in0),
     "f"(in1[0])
  );
}
__device__ __inline__ void waitStore() {
  asm volatile("tcgen05.wait::st.sync.aligned;\n");
}
__device__ __inline__ void load(Array<float, 1, 1>& out0, uint32_t in0) {
  asm(
    "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n"
    :"=f"(out0[0])
    :"r"(in0)
  );
}
__device__ __inline__ void waitLoad() {
  asm volatile("tcgen05.wait::ld.sync.aligned;\n");
}
} // namespace tmem
__global__ void nvfuser_none_f0_c0_r0_g0(Tensor<float, 1, 1> T0, Tensor<float, 1, 1> T4, Tensor<float, 1, 1> T9) {
  alignas(16) extern __shared__ char array[];
  const unsigned smem_offset = 0;
  nvfuser_index_t i0;
  i0 = ((nvfuser_index_t)threadIdx.x) + (32 * ((nvfuser_index_t)blockIdx.x));
  bool b1;
  b1 = i0 < T0.logical_size[0LL];
  uint32_t* T10 = reinterpret_cast<uint32_t*>(array + smem_offset + 0);
  tmem::alloc((uint32_t)(toSmem(T10)), (uint32_t)(32));
  tmem::relinquishAllocPermit();
  __syncthreads();
  Array<float, 1, 1> T1;
  T1[0] = 0;
  if (b1) {
    T1[0] = T0[((T0.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32 * T0.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
  }
  TMemTensor T2(T10[0], 0, (uint16_t)(0));
  tmem::store((uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0}), (*reinterpret_cast<Array<float, 1, 1>*>(&T1[0])));
  tmem::waitStore();
  Array<float, 1, 1> T3;
  tmem::load((*reinterpret_cast<Array<float, 1, 1>*>(&T3[0])), (uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0}));
  tmem::waitLoad();
  asm volatile("tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1;\n"::"r"(T10[0]), "r"((uint32_t)(32)));
  Array<float, 1, 1> T5;
  T5[0] = 0;
  if (b1) {
    T5[0] = T4[((T4.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32 * T4.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
  }
  TMemTensor T6(T10[0], 0, (uint16_t)(1));
  tmem::store((uint32_t)(T6 + Array<uint16_t, 2, 1>{0, 0}), (*reinterpret_cast<Array<float, 1, 1>*>(&T5[0])));
  tmem::waitStore();
  Array<float, 1, 1> T7;
  tmem::load((*reinterpret_cast<Array<float, 1, 1>*>(&T7[0])), (uint32_t)(T6 + Array<uint16_t, 2, 1>{0, 0}));
  tmem::waitLoad();
  Array<float, 1, 1> T8;
  T8[0] = T3[0] + T7[0];
  if (b1) {
    T9[i0] = T8[0];
  }
}
}

CUDA NVRTC compile error: ptxas application ptx input, line 48; error : Instruction 'tcgen05.alloc' not supported on .target 'sm_89'
ptxas application ptx input, line 48; error : Feature '.cta_group::1' not supported on .target 'sm_89'
ptxas application ptx input, line 52; error : Instruction 'tcgen05.relinquish_alloc_permit' not supported on .target 'sm_89'
ptxas application ptx input, line 52; error : Feature '.cta_group::1' not supported on .target 'sm_89'
ptxas application ptx input, line 69; error : Feature '.32x32b' not supported on .target 'sm_89'
ptxas application ptx input, line 69; error : Instruction 'tcgen05.st' not supported on .target 'sm_89'
ptxas application ptx input, line 73; error : Instruction 'tcgen05.wait' not supported on .target 'sm_89'
ptxas application ptx input, line 77; error : Feature '.32x32b' not supported on .target 'sm_89'
ptxas application ptx input, line 77; error : Instruction 'tcgen05.ld' not supported on .target 'sm_89'
ptxas application ptx input, line 81; error : Instruction 'tcgen05.wait' not supported on .target 'sm_89'
ptxas application ptx input, line 86; error : Instruction 'tcgen05.dealloc' not supported on .target 'sm_89'
ptxas application ptx input, line 86; error : Feature '.cta_group::1' not supported on .target 'sm_89'
ptxas application ptx input, line 101; error : Feature '.32x32b' not supported on .target 'sm_89'
ptxas application ptx input, line 101; error : Instruction 'tcgen05.st' not supported on .target 'sm_89'
ptxas application ptx input, line 105; error : Instruction 'tcgen05.wait' not supported on .target 'sm_89'
ptxas application ptx input, line 109; error : Feature '.32x32b' not supported on .target 'sm_89'
ptxas application ptx input, line 109; error : Instruction 'tcgen05.ld' not supported on .target 'sm_89'
ptxas application ptx input, line 113; error : Instruction 'tcgen05.wait' not supported on .target 'sm_89'
ptxas fatal : Ptx assembly aborted due to errors

Exception raised from invoke at /home/gaoxiang/Fuser/csrc/runtime/compiled_kernel.cpp:169 (most recent call first):
frame #0: <unknown function> + 0x1f3e89 (0x5f8f19a46e89 in ./bin/test_nvfuser)
frame #1: <unknown function> + 0x5fc9ac (0x5f8f19e4f9ac in ./bin/test_nvfuser)
frame #2: <unknown function> + 0x920965 (0x5f8f1a173965 in ./bin/test_nvfuser)
frame #3: <unknown function> + 0x923318 (0x5f8f1a176318 in ./bin/test_nvfuser)
frame #4: <unknown function> + 0x935e30 (0x5f8f1a188e30 in ./bin/test_nvfuser)
frame #5: <unknown function> + 0x100f4f9 (0x5f8f1a8624f9 in ./bin/test_nvfuser)
frame #6: <unknown function> + 0x1267437 (0x5f8f1aaba437 in ./bin/test_nvfuser)
frame #7: <unknown function> + 0x1250676 (0x5f8f1aaa3676 in ./bin/test_nvfuser)
frame #8: <unknown function> + 0x12508b5 (0x5f8f1aaa38b5 in ./bin/test_nvfuser)
frame #9: <unknown function> + 0x125115b (0x5f8f1aaa415b in ./bin/test_nvfuser)
frame #10: <unknown function> + 0x125ee25 (0x5f8f1aab1e25 in ./bin/test_nvfuser)
frame #11: <unknown function> + 0x1267ac7 (0x5f8f1aabaac7 in ./bin/test_nvfuser)
frame #12: <unknown function> + 0x125099f (0x5f8f1aaa399f in ./bin/test_nvfuser)
frame #13: <unknown function> + 0x3cafcb (0x5f8f19c1dfcb in ./bin/test_nvfuser)
frame #14: <unknown function> + 0x27488 (0x7a5456a35488 in /usr/lib/libc.so.6)
frame #15: __libc_start_main + 0x8c (0x7a5456a3554c in /usr/lib/libc.so.6)
frame #16: <unknown function> + 0x3cb535 (0x5f8f19c1e535 in ./bin/test_nvfuser)
" thrown in the test body.
To reproduce: NVFUSER_TEST_RANDOM_SEED=1740626485 NVFUSER_TEST_ATEN_RANDOM_SEED=0 test_nvfuser --gtest_filter='TMemTest.AddKernelSameRegion'
[  FAILED  ] TMemTest.AddKernelSameRegion (67 ms)
```
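The ptxas errors say the `tcgen05` instructions are rejected for `.target 'sm_89'`. One way a test could avoid reaching NVRTC on such a device is to compare the device's compute capability against an allow-list first. The helper below is a hypothetical sketch: the name `archSupported` is made up, and the allow-list entry `{10, 0}` used in the usage note is an assumption for illustration, since the log only establishes that `sm_89` is not supported.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Compute capability as (major, minor), e.g. {8, 9} for sm_89.
using ComputeCapability = std::pair<int, int>;

// Returns true only if `device` appears in the caller-provided allow-list.
// An allow-list is used instead of a ">= minimum" comparison because a newer
// architecture does not necessarily support every older feature.
bool archSupported(
    ComputeCapability device,
    const std::vector<ComputeCapability>& supported) {
  return std::find(supported.begin(), supported.end(), device) !=
      supported.end();
}
```

For example, with an assumed allow-list of `{{10, 0}}`, `archSupported({8, 9}, ...)` returns `false`. In a GoogleTest body the `false` branch would typically call `GTEST_SKIP()`, and the device's `(major, minor)` pair can be read from the `major` and `minor` fields filled in by `cudaGetDeviceProperties`.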

Fix multidevice tests (#2574)

e050187

* fix tests for multicluster fusion

jjsjann123 requested review from naoyam and samnordmann March 14, 2023 09:18

jjsjann123 merged commit 48b0cb4 into main Mar 14, 2023

jjsjann123 deleted the cherry_pick_2574 branch March 14, 2023 10:22

liqiangxl mentioned this pull request May 8, 2023

MatmulSASSTest on H100 #306

Closed

drzejan2 added a commit that referenced this pull request May 19, 2023

Matmul Scheduler - alpha scaling epilogue

2392789

- post-review changes, part #2

drzejan2 added a commit that referenced this pull request May 19, 2023

Matmul Scheduler - alpha scaling epilogue

81c90b2

- post-review changes, part #2

jjsjann123 mentioned this pull request Nov 9, 2023

Codegen error: Exception raised from inferShape at /opt/pytorch/nvfuser/csrc/executor.cpp:532 #1277

Closed

jacobhinkle mentioned this pull request Dec 1, 2023

NVFUSER_DUMP=ca_map fails #1434

Closed

cowanmeg pushed a commit to cowanmeg/Fuser that referenced this pull request Dec 20, 2023

Merge pull request NVIDIA#2 from samnordmann/automatic_pipeline_creation

a5d6c64

Automatic pipeline creation

drzejan2 added a commit that referenced this pull request Jan 4, 2024

[wip] attempts to create IR nodes, #2

db142f0

drzejan2 added a commit that referenced this pull request Jan 9, 2024

[wip] attempts to create IR nodes, #2

8053c82

DariusMaham mentioned this pull request Apr 18, 2024

Internal assert failed #2107

Closed

wujingyue mentioned this pull request May 8, 2024

Write a sharded transformer block in nvFuser API. #2199

Closed

This was referenced Jun 6, 2024

Squeezed IterDomain ?S536{1} must concretize to IterType::Broadcast but found ?S536{1}. #2359

Closed

Merging IterDomains requires that their iteration types match. #2317

Closed

naoyam mentioned this pull request Sep 5, 2024

Add slice tests to demonstrate manual scheduling #2898

Merged

wujingyue mentioned this pull request Oct 18, 2024

OpInfo has problems testing define_tensor. #3225

Closed
