MultipleDoubleBuffer_CUDA might fail sometimes #39

Closed
zasdfgbnm opened this issue Mar 20, 2023 · 0 comments · Fixed by #47

I have no idea why, but I have been seeing the following failure non-deterministically since we became our own repo.

[ RUN      ] LoopRotationTest.MultipleDoubleBuffer_CUDA
unknown file: Failure
C++ exception with description "aten_output_tensor.allclose( fusion_output_tensor.to(aten_output_tensor.dtype()), tolerance_values.second, tolerance_values.first, true) INTERNAL ASSERT FAILED at "/home/gaoxiang/Fuser/test/test_gpu_validator.h":400, please report a bug to PyTorch. 

Validation error in output 0 on line 712 in file /home/gaoxiang/Fuser/test/test_loop_rotation.cpp.
  Detected abs error of: 2.55172
    absolute tolerance was set to 1.68222e-06
    and relative tolerance set to 2.23704e-06
Exception raised from testValidate at /home/gaoxiang/Fuser/test/test_gpu_validator.h:400 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x5c (0x7f33003bccdc in /home/gaoxiang/pytorch-viable/build/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x64 (0x7f33003865f6 in /home/gaoxiang/pytorch-viable/build/lib/libc10.so)
frame #2: c10::detail::torchInternalAssertFail(char const*, char const*, unsigned int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x4f (0x7f33003bacaf in /home/gaoxiang/pytorch-viable/build/lib/libc10.so)
frame #3: <unknown function> + 0x37c27d (0x560a19f3b27d in ./build/bin/nvfuser_tests)
frame #4: <unknown function> + 0x3813af (0x560a19f403af in ./build/bin/nvfuser_tests)
frame #5: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) + 0x87 (0x560a1a096097 in ./build/bin/nvfuser_tests)
frame #6: testing::Test::Run() + 0xf6 (0x560a1a08a6b6 in ./build/bin/nvfuser_tests)
frame #7: <unknown function> + 0x4cb8b5 (0x560a1a08a8b5 in ./build/bin/nvfuser_tests)
frame #8: <unknown function> + 0x4cbfba (0x560a1a08afba in ./build/bin/nvfuser_tests)
frame #9: testing::internal::UnitTestImpl::RunAllTests() + 0x754 (0x560a1a08b9f4 in ./build/bin/nvfuser_tests)
frame #10: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) + 0x87 (0x560a1a096607 in ./build/bin/nvfuser_tests)
frame #11: testing::UnitTest::Run() + 0x91 (0x560a1a08a9d1 in ./build/bin/nvfuser_tests)
frame #12: <unknown function> + 0x119882 (0x560a19cd8882 in ./build/bin/nvfuser_tests)
frame #13: <unknown function> + 0x23790 (0x7f32cec3c790 in /usr/lib/libc.so.6)
frame #14: __libc_start_main + 0x8a (0x7f32cec3c84a in /usr/lib/libc.so.6)
frame #15: _start + 0x25 (0x560a19d079a5 in ./build/bin/nvfuser_tests)
" thrown in the test body.
[  FAILED  ] LoopRotationTest.MultipleDoubleBuffer_CUDA (477 ms)
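For context on how far off the result is: the assertion in the log calls `allclose`, which in PyTorch accepts a value when `|actual - expected| <= atol + rtol * |expected|`. The sketch below plugs the tolerances and absolute error from the log into that formula. The reference value `expected = 10.0` is purely hypothetical, since the log does not report the tensor contents.

```python
def within_tolerance(actual, expected, rtol, atol):
    """allclose-style check for one element: |actual - expected| <= atol + rtol * |expected|."""
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Values copied from the failure log above.
ATOL = 1.68222e-06
RTOL = 2.23704e-06
ABS_ERROR = 2.55172

# Hypothetical reference value; the real tensor contents are not in the log.
expected = 10.0
actual = expected + ABS_ERROR
print(within_tolerance(actual, expected, RTOL, ATOL))

# Magnitude the reference value would need for this error to pass the check.
min_magnitude = (ABS_ERROR - ATOL) / RTOL
print(f"{min_magnitude:.3e}")
```

Unless the reference values are on the order of a million or larger, an absolute error of 2.55172 is far outside these tolerances, which suggests wrong data in the output rather than accumulated rounding error.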
zasdfgbnm added a commit that referenced this issue Mar 22, 2023
wujingyue added a commit that referenced this issue Oct 11, 2023
```
Traceback (most recent call last):
  File "/opt/pytorch/nvfuser/nvfuser/__init__.py", line 122, in execute
    result = self._execute(
RuntimeError: isSame(values_[it.first], it.second) INTERNAL ASSERT FAILED at "/opt/pytorch/nvfuser/csrc/evaluator_common.cpp":314, please report a bug with repro script to NVFuser at https://github.com/NVIDIA/Fuser/issues. Precomputed values failed to validate.
Something unexpected changed between the compilation and execution.
nan != nan
Exception raised from validate at /opt/pytorch/nvfuser/csrc/evaluator_common.cpp:314 (most recent call first):
frame #0: nvfuser::nvfCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x8d (0x7fdc9919fe3b in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #1: nvfuser::nvfErrorFail(char const*, char const*, unsigned int, char const*, std::string const&) + 0x53 (0x7fdc992ded63 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #2: nvfuser::PrecomputedValues::validate() + 0x172 (0x7fdc993190f2 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #3: nvfuser::PrecomputedValues::evaluate() + 0x66 (0x7fdc9931fde6 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #4: nvfuser::FusionExecutor::inferOutputSizes(nvfuser::Fusion*, nvfuser::KernelArgumentHolder const&) + 0x8d (0x7fdc992ea12d in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #5: nvfuser::FusionKernelRuntime::compileFusionParallel(nvfuser::KernelArgumentHolder) + 0x46d (0x7fdc9943a6ad in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #6: nvfuser::FusionExecutorCache::runFusionWithInputs(c10::ArrayRef<c10::IValue> const&, std::optional<nvfuser::PrimDataType>, std::optional<signed char>) + 0xa8d (0x7fdc99443c9d in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #7: nvfuser::python_frontend::FusionDefinition::execute(c10::ArrayRef<c10::IValue> const&, bool, bool, std::optional<signed char>) const + 0x331 (0x7fdc997450e1 in /usr/local/lib/python3.10/site-packages/torch/lib/libnvfuser_codegen.so)
frame #8: <unknown function> + 0xeec2e (0x7fdbe8274c2e in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
frame #9: <unknown function> + 0x16e137 (0x7fdbe82f4137 in /opt/pytorch/nvfuser/nvfuser/_C.cpython-310-x86_64-linux-gnu.so)
<omitting python frames>
frame #38: <unknown function> + 0x29d90 (0x7fdd26ea0d90 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #39: __libc_start_main + 0x80 (0x7fdd26ea0e40 in /usr/lib/x86_64-linux-gnu/libc.so.6)
```
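The `nan != nan` line in the traceback above is consistent with IEEE 754 semantics, where NaN compares unequal to every value, including itself. The sketch below illustrates that behavior in Python. Both helper functions are hypothetical and are not nvFuser's `isSame`, whose implementation is not shown here.

```python
import math


def naive_same(a, b):
    """Plain equality: reports NaN as different from NaN."""
    return a == b


def nan_aware_same(a, b):
    """Hypothetical variant that treats two NaNs as the same value."""
    if math.isnan(a) and math.isnan(b):
        return True
    return a == b


nan = float("nan")
print(naive_same(nan, nan))      # False
print(nan_aware_same(nan, nan))  # True
print(nan_aware_same(nan, 1.0))  # False
```

Whether a NaN-aware comparison is the right behavior for validating precomputed values is a design question that the traceback alone does not settle.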