jjsjann123
Collaborator

Cherry-picking from: csarofeen/pytorch#2574

Author: samnordmann snordmann@nvidia.com
Date: Mon Mar 13 17:44:39 2023 +0100

Fix multidevice tests (#2574)

* fix tests for multicluster fusion

* fix tests for multicluster fusion
@jjsjann123
Collaborator Author

Is NVFuserTest.MultiClusterFusion_CUDA flaky and non-deterministic?

[  SKIPPED ] NVFuserTest.FusionMultiGPU_Reduce_CUDA (0 ms)
[ RUN      ] NVFuserTest.MultiClusterFusion_CUDA
unknown file: Failure
C++ exception with description "obtained_string_aDag == ref_string_aDag INTERNAL ASSERT FAILED at "/jiej/playground/gitlab/github_nvfuser/test/test_multicluster_fusion.cpp":138, please report a bug to PyTorch. the obtained AggregateDag is not the one expected
Exception raised from TestBody at /jiej/playground/gitlab/github_nvfuser/test/test_multicluster_fusion.cpp:138 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fe4f2dadafc in /opt/pytorch/pytorch/build/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fe4f2d706c8 in /opt/pytorch/pytorch/build/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, char const*) + 0x57 (0x7fe4f2dab817 in /opt/pytorch/pytorch/build/lib/libc10.so)
frame #3: <unknown function> + 0x50a9ed (0x5599b483a9ed in ./bin/nvfuser_tests)
frame #4: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x51 (0x5599b4880441 in ./bin/nvfuser_tests)
frame #5: <unknown function> + 0x542590 (0x5599b4872590 in ./bin/nvfuser_tests)
frame #6: <unknown function> + 0x542a25 (0x5599b4872a25 in ./bin/nvfuser_tests)
frame #7: <unknown function> + 0x543181 (0x5599b4873181 in ./bin/nvfuser_tests)
frame #8: testing::internal::UnitTestImpl::RunAllTests() + 0x10e9 (0x5599b4874909 in ./bin/nvfuser_tests)
frame #9: testing::UnitTest::Run() + 0x98 (0x5599b4874e28 in ./bin/nvfuser_tests)
frame #10: <unknown function> + 0x14c80c (0x5599b447c80c in ./bin/nvfuser_tests)
frame #11: __libc_start_main + 0xf3 (0x7fe4f2839083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #12: _start + 0x2e (0x5599b44b4d1e in ./bin/nvfuser_tests)
" thrown in the test body.
[  FAILED  ] NVFuserTest.MultiClusterFusion_CUDA (0 ms)

I saw this in one of my full test runs, but couldn't repro it afterwards... I'm gonna merge it as-is. If the failure shows up again, we can revisit this.

@jjsjann123 jjsjann123 merged commit 48b0cb4 into main Mar 14, 2023
@jjsjann123 jjsjann123 deleted the cherry_pick_2574 branch March 14, 2023 10:22
@liqiangxl liqiangxl mentioned this pull request May 8, 2023
drzejan2 added a commit that referenced this pull request May 19, 2023
- post-review changes, part #2
drzejan2 added a commit that referenced this pull request May 19, 2023
- post-review changes, part #2
wujingyue added a commit that referenced this pull request Oct 11, 2023
```
Traceback (most recent call last):
  File "/opt/pytorch/nvfuser/nvfuser/__init__.py", line 122, in execute
    result = self._execute(
RuntimeError: isSame(values_[it.first], it.second) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/evaluator_common.cpp":314, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Precomputed values failed to validate.
Something unexpected changed between the compilation and execution.
nan != nan
Exception raised from validate at /opt/pytorch/nvfuser/csrc/evaluator_common.cpp:314 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x8d (0x7fdc9919fe3b in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x53 (0x7fdc992ded63 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #2: nvfuser::PrecomputedValues::validate() + 0x172 (0x7fdc993190f2 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #3: nvfuser::PrecomputedValues::evaluate() + 0x66 (0x7fdc9931fde6 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #4: nvfuser::FusionExecutor::inferOutputSizes(nvfuser::Fusion*, nvfuser::KernelArgumentHolder const&) + 0x8d (0x7fdc992ea12d in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #5: nvfuser::FusionKernelRuntime::compileFusionParallel(nvfuser::KernelArgumentHolder) + 0x46d (0x7fdc9943a6ad in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #6: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef<c10::IValue> const&, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0xa8d (0x7fdc99443c9d in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #7: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef<c10::IValue> const&, bool, bool, std::optional<signed char>) const + 0x331 (0x7fdc997450e1 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #8: <unknown function> + 0xeec2e (0x7fdbe8274c2e in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x16e137 (0x7fdbe82f4137 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
<omitting python frames>
frame #38: <unknown function> + 0x29d90 (0x7fdd26ea0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #39: __libc_start_main + 0x80 (0x7fdd26ea0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
```
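The `nan != nan` line in the trace above reflects IEEE 754 semantics: NaN compares unequal to everything, including itself, so a plain equality check can never validate a NaN value against itself. A minimal Python sketch of the effect follows. `naive_is_same` and `is_same` are hypothetical helpers for illustration only, not nvFuser's `isSame`:

```python
import math


def naive_is_same(a, b):
    # Plain equality: always False when either side is NaN.
    return a == b


def is_same(a, b):
    """NaN-aware comparison (hypothetical helper, not nvFuser's isSame)."""
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
    return a == b


nan = float("nan")
print(naive_is_same(nan, nan))  # False
print(is_same(nan, nan))        # True
print(is_same(1.5, 1.5))        # True
```

Whether treating two NaNs as "the same" is the right fix depends on what the validation is meant to guard against; the sketch only shows why the naive check trips.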
cowanmeg pushed a commit to cowanmeg/Fuser that referenced this pull request Dec 20, 2023
drzejan2 added a commit that referenced this pull request Jan 4, 2024
drzejan2 added a commit that referenced this pull request Jan 9, 2024
jacobhinkle added a commit that referenced this pull request Mar 22, 2024
This introduces a thread-local global memory allocator for each device
and uses it whenever there is an intermediate tensor needed which
requires zero-initialization.

To enable it, use `NVFUSER_ENABLE=reuse_zeroed_memory`. You can monitor the
allocator using `NVFUSER_DUMP=global_zeroed_memory`.

Before we enable this feature by default, we need to have high
confidence that every kernel using zero-initialized memory will always
clean up its semaphores. This is currently only the case for serial
grid reductions, as far as I know.

This enables the basic functionality of #1829. However, it does not
modify existing algorithms to clean up their memory. See
`NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory
build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling`,
which succeeds when using serial grid reduction, but fails (in debug
mode) when using `gridReduce` (note that this test is updated to behave
differently in this PR):
```
# NVFUSER_ENABLE=reuse_zeroed_memory NVFUSER_DUMP=global_zeroed_memory build/nvfuser_tests --gtest_filter=SerialGridReductionTest.Scheduling                                                       
Running main() from /opt/pytorch/nvfuser/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = SerialGridReductionTest.Scheduling
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from SerialGridReductionTest
[ RUN      ] SerialGridReductionTest.Scheduling
[global zeroed memory] Resizing arena to 512 bytes
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 512 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Resizing arena to 16384 bytes
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
[global zeroed memory] Resetting allocated bytes to 0
[global zeroed memory] Allocating byte range: 0 to 16384 bytes
unknown file: Failure
C++ exception with description "nnz.equal(0) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/global_allocator.cpp":88, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Global memory arena was not properly zeroed. Found 2048 bytes that are not zero
Exception raised from checkZeroed at /opt/pytorch/nvfuser/csrc/global_allocator.cpp:88 (most recent call first):
frame #0: <unknown function> + 0x2fde9e (0x556cdb95de9e in build/nvfuser_tests)
frame #1: <unknown function> + 0x2fe0df (0x556cdb95e0df in build/nvfuser_tests)
frame #2: <unknown function> + 0x3f3720 (0x556cdba53720 in build/nvfuser_tests)
frame #3: <unknown function> + 0x3f33df (0x556cdba533df in build/nvfuser_tests)
frame #4: <unknown function> + 0x3f38ed (0x556cdba538ed in build/nvfuser_tests)
frame #5: <unknown function> + 0x315e67 (0x556cdb975e67 in build/nvfuser_tests)
frame #6: <unknown function> + 0x7c5780 (0x556cdbe25780 in build/nvfuser_tests)
frame #7: <unknown function> + 0x7c5877 (0x556cdbe25877 in build/nvfuser_tests)
frame #8: <unknown function> + 0x138f8cc (0x556cdc9ef8cc in build/nvfuser_tests)
frame #9: <unknown function> + 0x1457f0b (0x556cdcab7f0b in build/nvfuser_tests)
frame #10: <unknown function> + 0x14519fd (0x556cdcab19fd in build/nvfuser_tests)
frame #11: <unknown function> + 0x142de24 (0x556cdca8de24 in build/nvfuser_tests)
frame #12: <unknown function> + 0x142e93f (0x556cdca8e93f in build/nvfuser_tests)
frame #13: <unknown function> + 0x142f345 (0x556cdca8f345 in build/nvfuser_tests)
frame #14: <unknown function> + 0x143f86c (0x556cdca9f86c in build/nvfuser_tests)
frame #15: <unknown function> + 0x1458e98 (0x556cdcab8e98 in build/nvfuser_tests)
frame #16: <unknown function> + 0x1452ac7 (0x556cdcab2ac7 in build/nvfuser_tests)
frame #17: <unknown function> + 0x143de6d (0x556cdca9de6d in build/nvfuser_tests)
frame #18: <unknown function> + 0x1407ca0 (0x556cdca67ca0 in build/nvfuser_tests)
frame #19: <unknown function> + 0x1407c19 (0x556cdca67c19 in build/nvfuser_tests)
frame #20: <unknown function> + 0x29d90 (0x7f616c7d4d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #21: __libc_start_main + 0x80 (0x7f616c7d4e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x11e9d5 (0x556cdb77e9d5 in build/nvfuser_tests)
" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1711120799 NVFUSER_TEST_ATEN_RANDOM_SEED=0 nvfuser_tests --gtest_filter='SerialGridReductionTest.Scheduling'
[  FAILED  ] SerialGridReductionTest.Scheduling (5669 ms)
[----------] 1 test from SerialGridReductionTest (5669 ms total)
```
This test runs with serial grid reduction, then with `gridReduce`. Each
time it runs two grid reductions. Both serial grid reductions succeed
because the semaphore buffer is properly zeroed. The `gridReduce`
succeeds the first time since the memory pool calls `at::zeros` again to
request a larger buffer size (`gridReduce` requires more semaphores
since there is one per thread segment vs. one for each block
segment). However, the second call to `gridReduce` fails because it has
not cleaned up its semaphores. Hacking that function to force
`PERSISTENT=1` would clean up the semaphores, resulting in success in
this case. I'm leaving those kinds of modifications for a follow-up.
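
The sequence described above can be modeled with a toy arena. This is only a sketch under assumptions: `ZeroedArena`, `clean_kernel`, and `leaky_kernel` are hypothetical names, a Python `bytearray` stands in for device memory, and none of it is nvFuser's actual allocator. It shows why a kernel that leaves a semaphore set passes right after a resize (fresh zeros) but breaks the next allocation that reuses the same bytes.

```python
class ZeroedArena:
    """Toy model of a reusable zero-initialized arena (hypothetical)."""

    def __init__(self):
        self.buf = bytearray()
        self.offset = 0  # bytes handed out since the last reset

    def allocate(self, nbytes, check_zeroed=True):
        end = self.offset + nbytes
        if end > len(self.buf):
            # Growing swaps in a fresh all-zero buffer, which hides stale data.
            self.buf = bytearray(end)
        view = memoryview(self.buf)[self.offset:end]
        if check_zeroed:
            # Debug-style check: reused memory must still be all zeros.
            dirty = sum(1 for b in view if b != 0)
            if dirty:
                raise RuntimeError(f"arena not zeroed: {dirty} non-zero bytes")
        self.offset = end
        return view

    def reset(self):
        # Only the bookkeeping is reset; the memory is NOT re-zeroed.
        self.offset = 0


def clean_kernel(sem):
    sem[0] = 1  # take the semaphore
    sem[0] = 0  # release it before exiting


def leaky_kernel(sem):
    sem[0] = 1  # never cleared


arena = ZeroedArena()
for _ in range(2):
    clean_kernel(arena.allocate(512))  # both runs pass
    arena.reset()

leaky_kernel(arena.allocate(16384))  # passes: the resize produced fresh zeros
arena.reset()
try:
    arena.allocate(16384)  # reuses the bytes the leaky kernel dirtied
except RuntimeError as err:
    print(err)
```

In this model, the reuse is only safe if every kernel behaves like `clean_kernel`, which is the condition the message above says must hold before the feature can be enabled by default.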
naoyam added a commit that referenced this pull request Nov 6, 2024
A type error was detected with #3263 while I was testing it with a Debug
build.

```
pytest -v tests/python/test_python_frontend.py -k test_pad_dynamic
```

It has a fusion of:

```
Inputs:
  T0_g_float[ bS0{1}, iS1{i1}, iS2{i2} ]
Outputs:
  T1_g_float[ bS11{1}, iS12{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}, iS13{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )} ]

%kernel_math {
f7 = (float)(7);
f9 = float(2.5) * f7;
i11 = (int64_t)(f9);
i14 = (nvfuser_index_t)(i11);
i16 = (nvfuser_index_t)(i11);
i18 = (nvfuser_index_t)(i11);
i20 = (nvfuser_index_t)(i11);
T2_l_float[ bS3{1}, iS5{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf, iS7{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf ]
   = pad( T0_g_float[ bS0{1}, iS1{i1}, iS2{i2} ], {0, 0, i14, i16, i18, i20} )
T1_g_float[ bS11{1}, iS12{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}, iS13{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )} ]
   = Set( T2_l_float[ bS3{1}, iS5{( ( i1 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf, iS7{( ( i2 + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) ) + ( (nvfuser_index_t)(( (int64_t)(( float(2.5) * ( (float)(7) ) )) )) ) )}rf ], cache_op=Streaming )
} // %kernel_math
```
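
For reference, every pad width in this fusion comes from the same scalar chain: `float(2.5) * (float)(7)` is 17.5, and the cast to `int64_t` truncates it to 17, so each padded extent is the input extent plus 17 on each side. A small Python sketch of that arithmetic follows. The function names and the example input extent of 10 are arbitrary choices for illustration:

```python
def pad_width(scale=2.5, factor=7):
    f7 = float(factor)  # f7 = (float)(7)
    f9 = scale * f7     # f9 = float(2.5) * f7
    return int(f9)      # i11 = (int64_t)(f9), truncates toward zero


def padded_extent(extent, left, right):
    # Matches the printed extents: ( ( i + left ) + right )
    return extent + left + right


w = pad_width()
print(w)                        # 17
print(padded_extent(10, w, w))  # 44
```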

Stack trace:
```
#0  __cxxabiv1::__cxa_throw (obj=0xabc6ea0, tinfo=0x7ffeba81c248 <typeinfo for nvfuser::nvfError>, dest=0x7ffeba059370 <nvfuser::nvfError::~nvfError()>) at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:80
#1  0x00007ffeba058665 in nvfuser::nvfCheckFail (func=0x7ffeb9927c33 "as", file=0x7ffeb99d9a1b "/raid/nmaruyama/debug1/csrc/utils.h", line=119,
    msg=0x7ffeb99c36e6 " INTERNAL ASSERT FAILED at \"/raid/nmaruyama/debug1/csrc/utils.h\":119, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. ") at /raid/nmaruyama/debug1/csrc/exceptions.cpp:283
#2  0x00007ffeb9c1be4b in nvfuser::nvfErrorFail (func=0x7ffeb9927c33 "as", file=0x7ffeb99d9a1b "/raid/nmaruyama/debug1/csrc/utils.h", line=119,
    condMsg=0x7ffeb99c36e6 " INTERNAL ASSERT FAILED at \"/raid/nmaruyama/debug1/csrc/utils.h\":119, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. ") at /raid/nmaruyama/debug1/csrc/exceptions.h:229
#3  0x00007ffeb9c1bbe4 in nvfuser::PolymorphicBase::as<nvfuser::TensorView> (this=0xac07490) at /raid/nmaruyama/debug1/csrc/utils.h:119
#4  0x00007ffeba54c67f in nvfuser::(anonymous namespace)::isLoadGlobalToLocal (expr=0xabd7c50) at /raid/nmaruyama/debug1/csrc/scheduler/cache_policy_refiner.cpp:61
#5  0x00007ffeba54c599 in nvfuser::refineCachePolicy (fusion=0xabef940) at /raid/nmaruyama/debug1/csrc/scheduler/cache_policy_refiner.cpp:153
```
zasdfgbnm added a commit that referenced this pull request Feb 27, 2025
Example error message:

```CUDA
[ RUN      ] TMemTest.AddKernelSameRegion
unknown file: Failure
C++ exception with description " INTERNAL ASSERT FAILED at "/home/gaoxiang/Fuser/csrc/runtime/compiled_kernel.cpp":169, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. 
// Codegen generated utilities

namespace tmem {
__device__ __inline__ void alloc(uint32_t in0, uint32_t in1) {
  asm volatile("tcgen05.alloc.cta_group::1.sync.aligned.shared::cta.b32 [%0], %1;\n"::"r"(in0), "r"(in1));
}
__device__ __inline__ void relinquishAllocPermit() {
  asm volatile("tcgen05.relinquish_alloc_permit.cta_group::1.sync.aligned;\n");
}
__device__ __inline__ void store(uint32_t in0, Array<float, 1, 1> in1) {
  asm volatile(
    "tcgen05.st.sync.aligned.32x32b.x1.b32 [%0], {%1};\n"
    :
    :"r"(in0),
     "f"(in1[0])
  );
}
__device__ __inline__ void waitStore() {
  asm volatile("tcgen05.wait::st.sync.aligned;\n");
}
__device__ __inline__ void load(Array<float, 1, 1>& out0, uint32_t in0) {
  asm(
    "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n"
    :"=f"(out0[0])
    :"r"(in0)
  );
}
__device__ __inline__ void waitLoad() {
  asm volatile("tcgen05.wait::ld.sync.aligned;\n");
}
} // namespace tmem
__global__ void nvfuser_none_f0_c0_r0_g0(Tensor<float, 1, 1> T0, Tensor<float, 1, 1> T4, Tensor<float, 1, 1> T9) {
  alignas(16) extern __shared__ char array[];
  const unsigned smem_offset = 0;
  nvfuser_index_t i0;
  i0 = ((nvfuser_index_t)threadIdx.x) + (32 * ((nvfuser_index_t)blockIdx.x));
  bool b1;
  b1 = i0 < T0.logical_size[0LL];
  uint32_t* T10 = reinterpret_cast<uint32_t*>(array + smem_offset + 0);
  tmem::alloc((uint32_t)(toSmem(T10)), (uint32_t)(32));
  tmem::relinquishAllocPermit();
  __syncthreads();
  Array<float, 1, 1> T1;
  T1[0] = 0;
  if (b1) {
    T1[0]
       = T0[((T0.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32 * T0.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
  }
  TMemTensor T2(T10[0], 0, (uint16_t)(0));
  tmem::store((uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0}), (*reinterpret_cast<Array<float, 1, 1>*>(&T1[0])));
  tmem::waitStore();
  Array<float, 1, 1> T3;
  tmem::load((*reinterpret_cast<Array<float, 1, 1>*>(&T3[0])), (uint32_t)(T2 + Array<uint16_t, 2, 1>{0, 0}));
  tmem::waitLoad();
  asm volatile("tcgen05.dealloc.cta_group::1.sync.aligned.b32 %0, %1;\n"::"r"(T10[0]), "r"((uint32_t)(32)));
  Array<float, 1, 1> T5;
  T5[0] = 0;
  if (b1) {
    T5[0]
       = T4[((T4.alloc_stride[0LL] * ((nvfuser_index_t)threadIdx.x)) + ((32 * T4.alloc_stride[0LL]) * ((nvfuser_index_t)blockIdx.x)))];
  }
  TMemTensor T6(T10[0], 0, (uint16_t)(1));
  tmem::store((uint32_t)(T6 + Array<uint16_t, 2, 1>{0, 0}), (*reinterpret_cast<Array<float, 1, 1>*>(&T5[0])));
  tmem::waitStore();
  Array<float, 1, 1> T7;
  tmem::load((*reinterpret_cast<Array<float, 1, 1>*>(&T7[0])), (uint32_t)(T6 + Array<uint16_t, 2, 1>{0, 0}));
  tmem::waitLoad();
  Array<float, 1, 1> T8;
  T8[0]
    = T3[0]
    + T7[0];
  if (b1) {
    T9[i0]
       = T8[0];
  }
}
}

CUDA NVRTC compile error: ptxas application ptx input, line 48; error   : Instruction 'tcgen05.alloc' not supported on .target 'sm_89'
ptxas application ptx input, line 48; error   : Feature '.cta_group::1' not supported on .target 'sm_89'
ptxas application ptx input, line 52; error   : Instruction 'tcgen05.relinquish_alloc_permit' not supported on .target 'sm_89'
ptxas application ptx input, line 52; error   : Feature '.cta_group::1' not supported on .target 'sm_89'
ptxas application ptx input, line 69; error   : Feature '.32x32b' not supported on .target 'sm_89'
ptxas application ptx input, line 69; error   : Instruction 'tcgen05.st' not supported on .target 'sm_89'
ptxas application ptx input, line 73; error   : Instruction 'tcgen05.wait' not supported on .target 'sm_89'
ptxas application ptx input, line 77; error   : Feature '.32x32b' not supported on .target 'sm_89'
ptxas application ptx input, line 77; error   : Instruction 'tcgen05.ld' not supported on .target 'sm_89'
ptxas application ptx input, line 81; error   : Instruction 'tcgen05.wait' not supported on .target 'sm_89'
ptxas application ptx input, line 86; error   : Instruction 'tcgen05.dealloc' not supported on .target 'sm_89'
ptxas application ptx input, line 86; error   : Feature '.cta_group::1' not supported on .target 'sm_89'
ptxas application ptx input, line 101; error   : Feature '.32x32b' not supported on .target 'sm_89'
ptxas application ptx input, line 101; error   : Instruction 'tcgen05.st' not supported on .target 'sm_89'
ptxas application ptx input, line 105; error   : Instruction 'tcgen05.wait' not supported on .target 'sm_89'
ptxas application ptx input, line 109; error   : Feature '.32x32b' not supported on .target 'sm_89'
ptxas application ptx input, line 109; error   : Instruction 'tcgen05.ld' not supported on .target 'sm_89'
ptxas application ptx input, line 113; error   : Instruction 'tcgen05.wait' not supported on .target 'sm_89'
ptxas fatal   : Ptx assembly aborted due to errors

Exception raised from invoke at /home/gaoxiang/Fuser/csrc/runtime/compiled_kernel.cpp:169 (most recent call first):
frame #0: <unknown function> + 0x1f3e89 (0x5f8f19a46e89 in ./bin/test_nvfuser)
frame #1: <unknown function> + 0x5fc9ac (0x5f8f19e4f9ac in ./bin/test_nvfuser)
frame #2: <unknown function> + 0x920965 (0x5f8f1a173965 in ./bin/test_nvfuser)
frame #3: <unknown function> + 0x923318 (0x5f8f1a176318 in ./bin/test_nvfuser)
frame #4: <unknown function> + 0x935e30 (0x5f8f1a188e30 in ./bin/test_nvfuser)
frame #5: <unknown function> + 0x100f4f9 (0x5f8f1a8624f9 in ./bin/test_nvfuser)
frame #6: <unknown function> + 0x1267437 (0x5f8f1aaba437 in ./bin/test_nvfuser)
frame #7: <unknown function> + 0x1250676 (0x5f8f1aaa3676 in ./bin/test_nvfuser)
frame #8: <unknown function> + 0x12508b5 (0x5f8f1aaa38b5 in ./bin/test_nvfuser)
frame #9: <unknown function> + 0x125115b (0x5f8f1aaa415b in ./bin/test_nvfuser)
frame #10: <unknown function> + 0x125ee25 (0x5f8f1aab1e25 in ./bin/test_nvfuser)
frame #11: <unknown function> + 0x1267ac7 (0x5f8f1aabaac7 in ./bin/test_nvfuser)
frame #12: <unknown function> + 0x125099f (0x5f8f1aaa399f in ./bin/test_nvfuser)
frame #13: <unknown function> + 0x3cafcb (0x5f8f19c1dfcb in ./bin/test_nvfuser)
frame #14: <unknown function> + 0x27488 (0x7a5456a35488 in /usr/lib/libc.so.6)
frame #15: __libc_start_main + 0x8c (0x7a5456a3554c in /usr/lib/libc.so.6)
frame #16: <unknown function> + 0x3cb535 (0x5f8f19c1e535 in ./bin/test_nvfuser)
" thrown in the test body.

To reproduce: NVFUSER_TEST_RANDOM_SEED=1740626485 NVFUSER_TEST_ATEN_RANDOM_SEED=0 test_nvfuser --gtest_filter='TMemTest.AddKernelSameRegion'
[  FAILED  ] TMemTest.AddKernelSameRegion (67 ms)
```
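
The ptxas output above is repetitive: every error is a `tcgen05` instruction or one of its modifiers being rejected for the `sm_89` target. Below is a small sketch for condensing such a log, assuming the line format shown above. The name `summarize_ptxas_errors` is hypothetical, and the sample log is a four-line excerpt of the message above:

```python
import re
from collections import defaultdict

# Matches lines such as:
#   ... line 48; error   : Instruction 'tcgen05.alloc' not supported on .target 'sm_89'
PTXAS_ERROR = re.compile(
    r"line (\d+); error\s*: (Instruction|Feature) '([^']+)' "
    r"not supported on \.target '([^']+)'"
)


def summarize_ptxas_errors(log):
    """Map (kind, name, target) to the sorted PTX line numbers that hit it."""
    summary = defaultdict(list)
    for match in PTXAS_ERROR.finditer(log):
        line_no, kind, name, target = match.groups()
        summary[(kind, name, target)].append(int(line_no))
    return {key: sorted(lines) for key, lines in summary.items()}


log = """\
ptxas application ptx input, line 48; error   : Instruction 'tcgen05.alloc' not supported on .target 'sm_89'
ptxas application ptx input, line 48; error   : Feature '.cta_group::1' not supported on .target 'sm_89'
ptxas application ptx input, line 73; error   : Instruction 'tcgen05.wait' not supported on .target 'sm_89'
ptxas application ptx input, line 81; error   : Instruction 'tcgen05.wait' not supported on .target 'sm_89'
"""

for (kind, name, target), lines in summarize_ptxas_errors(log).items():
    print(f"{kind} {name} unsupported on {target}: PTX lines {lines}")
```

Grouped this way, the log reads as a single cause: the kernel was assembled for a target that does not provide the `tcgen05` instructions.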