This repository has been archived by the owner on Mar 21, 2024. It is now read-only.
We're using Thrust with Torch 1.7.1 and MinkowskiEngine 0.5.4 and are hitting a deterministic issue that makes the library unusable for long-running processes.
When we run parallel_for on large arrays, there are several memory allocation steps. If one of them hits an out-of-memory (OOM) error, the process calls terminate instead of raising an error that could be handled. This is usually the result of an exception being thrown inside a noexcept function.
#0 0x00007f07d8108fb7 in raise () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00007f07d810a921 in abort () from /lib/x86_64-linux-gnu/libc.so.6
#2 0x00007f07d8afd957 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007f07d8b03ae6 in std::rethrow_exception(std::__exception_ptr::exception_ptr) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007f07d8b02b49 in __cxa_throw_bad_array_new_length () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007f07d8b034b8 in __gxx_personality_v0 () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007f07d84cb573 in _Unwind_GetTextRelBase () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#7 0x00007f07d84cbad1 in _Unwind_RaiseException () from /lib/x86_64-linux-gnu/libgcc_s.so.1
#8 0x00007f07d8b03d47 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#9 0x000056485a33fb4d in thrust::cuda_cub::throw_on_error(cudaError, char const*) ()
#10 0x000056485a35c134 in void thrust::cuda_cub::parallel_for<thrust::cuda_cub::par_t, thrust::cuda_cub::for_each_f<thrust::pointer<thrust::tuple<float, int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::par_t, thrust::use_default, thrust::use_default>, thrust::detail::wrapped_function<thrust::detail::allocator_traits_detail::gozer, void> >, long>(thrust::cuda_cub::execution_policy<thrust::cuda_cub::par_t>&, thrust::cuda_cub::for_each_f<thrust::pointer<thrust::tuple<float, int, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type, thrust::null_type>, thrust::cuda_cub::par_t, thrust::use_default, thrust::use_default>, thrust::detail::wrapped_function<thrust::detail::allocator_traits_detail::gozer, void> >, long) ()
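To illustrate the suspected mechanism (an exception escaping a noexcept function ends in std::terminate instead of reaching a catch block), here is a minimal plain C++ sketch. It is not Thrust code: allocate_or_throw, try_allocate and allocate_noexcept are hypothetical names made up for this example, and std::bad_alloc stands in for the CUDA out-of-memory error.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical stand-in for an allocation step that can fail with OOM.
// Throws std::bad_alloc when the request exceeds the available budget.
void allocate_or_throw(std::size_t bytes, std::size_t available) {
    if (bytes > available) {
        throw std::bad_alloc();
    }
}

// Not marked noexcept: the exception propagates normally, so the caller
// (or this wrapper) can catch it and recover.
bool try_allocate(std::size_t bytes, std::size_t available) {
    try {
        allocate_or_throw(bytes, available);
        return true;
    } catch (const std::bad_alloc&) {
        return false;  // OOM handled; the process keeps running
    }
}

// Marked noexcept: if allocate_or_throw throws here, the exception cannot
// leave this function and the runtime calls std::terminate, which aborts.
// A try/catch around a call to this function would never see the error.
void allocate_noexcept(std::size_t bytes, std::size_t available) noexcept {
    allocate_or_throw(bytes, available);
}
```

Calling try_allocate(100, 10) returns false, while calling allocate_noexcept(100, 10) would abort the process in the same way as the raise/abort frames at the top of the backtrace above.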
I found that the issue is not present on CUDA 11.1, so we have started migrating the codebase over. I'm still reporting it in case it is useful for someone else who encounters this issue.
Yes, it's fixed in CUDA>=11.0.221 (tested with 11.1 too), but we needed to build a custom PyTorch to fix some known issues similar to isl-org/Open3D#3324 and #1401. We solved those by building PyTorch with -Xcompiler=-fno-gnu-unique.