This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Intermittent compilation failures with thrust, cuda 10.2 and MSVC 2019 #1090

Closed
leezu opened this issue Apr 4, 2020 · 33 comments

@leezu

leezu commented Apr 4, 2020

We experience intermittent compilation failures on our CI server.
The CXX compiler identification is MSVC 19.25.28612.0. The CUDA compiler identification is NVIDIA 10.2.89.

Retrying the compilation typically succeeds. Our CI server now retries compiling the project up to 5 times to avoid this issue. (The issue has never occurred 5 times in a row yet.)

The error looks as follows:

[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2993: 'T': illegal type for non-type template parameter '__formal'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): note: see reference to class template instantiation 'thrust::detail::allocator_traits_detail::has_value_type<T>' being compiled
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2065: 'U1': undeclared identifier
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2923: 'std::_Select<__formal>::_Apply': 'U1' is not a valid template type argument for parameter '<unnamed-symbol>'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C4430: missing type specifier - int assumed. Note: C++ does not support default-int
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2144: syntax error: 'unknown-type' should be preceded by ')'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2144: syntax error: 'unknown-type' should be preceded by ';'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2238: unexpected token(s) preceding ';'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2059: syntax error: ')'
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2988: unrecognizable template declaration/definition
[2020-03-29T04:47:50.014Z] C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin/../include\thrust/detail/allocator/allocator_traits.h(42): error C2059: syntax error: '<end Parse>'

This affects at least two projects

@brycelelbach
Collaborator

Hi, it's going to be very hard for us to debug this without more information and a reproducer. Please see https://github.com/brycelelbach/cpp_bug_reporting_guidelines

At the very least we need full logs and we need to see the code that you are trying to build. Thanks!

@brycelelbach
Collaborator

@allisonvacanti can you take a look at this and see if you have any thoughts?

@leezu
Author

leezu commented May 1, 2020

It can be reproduced by building MXNet or PyTorch from source. I understand that's not really minimal, sorry. I see that pytorch/pytorch#25393 contains a recent deep dive into this issue and claims a relation to #1030.

One example of a complete error log is at http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-17808/runs/15/nodes/39/log/?start=0

@alliepiper
Collaborator

@leezu I recently pushed a ton of MSVC fixes (including a fix for #1030 and the atanh issue mentioned in pytorch/pytorch#25393 (comment)). Might be worth trying to build against a checkout of thrust master.

I don't recall encountering any issues involving __formal, though. I suspect it's not a thrust bug, since you're seeing this in the Eigen headers too. If this can be repro'd in a small thrust-only reproducer I will take a look.

@mikoro

mikoro commented May 2, 2020

For me this started appearing after some MSVC 2019 update a few months ago. The error is random and it comes and goes whenever you are editing the files. Usually you can fix the error by adding some random int dummy = 0; to some header. This changes some state in the compiler and the problem goes away for a while. Also changing the compiler flags will sometimes make the error appear or go away. Sometimes recompiling the file helps, sometimes not.

It seems that yesterday I was able to fix it permanently. I just cloned the latest thrust repo with git clone --recurse-submodules to get the cub also. Then I deleted the old thrust include folder from the CUDA toolkit directory and replaced it with the thrust folder from the git master. I also copied the cub folder there.

@alliepiper
Collaborator

That is bizarre! Hopefully it stays fixed for good, in which case the next toolkit release should resolve this.

For a less invasive workaround, you can also include the cloned thrust/cub directories directly rather than modifying the toolkit path (-I[thrust src dir] -I[thrust src dir]/dependencies/cub).

@leezu
Author

leezu commented May 2, 2020

Based on preliminary data, the issue also goes away with thrust 1.9.8, which apparently will ship with CUDA 11. At least, the CI run testing apache/mxnet#18218 did not experience any intermittent compilation failures on MSVC.

@leezu
Author

leezu commented May 3, 2020

Unfortunately the issue still occurs with thrust 1.9.8, though apparently more rarely:

[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(348): error C2993: 'KeysIt1': illegal type for non-type template parameter '__formal'
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(394): note: see reference to class template instantiation 'thrust::cuda_cub::__set_operations::SetOpAgent<KeysIt1,KeysIt2,ValuesIt1,ValuesIt2,KeysOutputIt,ValuesOutputIt,Size,CompareOp,SetOp,HAS_VALUES>::PtxPlan<Arch>' being compiled
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(827): note: see reference to class template instantiation 'thrust::cuda_cub::__set_operations::SetOpAgent<KeysIt1,KeysIt2,ValuesIt1,ValuesIt2,KeysOutputIt,ValuesOutputIt,Size,CompareOp,SetOp,HAS_VALUES>' being compiled
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(348): error C2065: 'OtherIncrementable': undeclared identifier
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(348): error C2993: 'KeysIt1': illegal type for non-type template parameter '__formal'
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(348): error C2923: 'std::_Select<__formal>::_Apply': 'OtherIncrementable' is not a valid template type argument for parameter '<unnamed-symbol>'
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(348): error C2993: 'unknown-type': illegal type for non-type template parameter 'PTX_ARCH'
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(356): error C2993: 'KeysIt1': illegal type for non-type template parameter '__formal'
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(356): error C2065: 'OtherIncrementable': undeclared identifier
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(356): error C2993: 'KeysIt1': illegal type for non-type template parameter '__formal'
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(356): error C2923: 'std::_Select<__formal>::_Apply': 'OtherIncrementable' is not a valid template type argument for parameter '<unnamed-symbol>'
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(356): error C2993: 'unknown-type': illegal type for non-type template parameter 'PTX_ARCH'
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(364): error C2993: 'KeysIt1': illegal type for non-type template parameter '__formal'
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(393): note: see reference to class template instantiation 'thrust::cuda_cub::__set_operations::SetOpAgent<KeysIt1,KeysIt2,ValuesIt1,ValuesIt2,KeysOutputIt,ValuesOutputIt,Size,CompareOp,SetOp,HAS_VALUES>::PtxPlan<Arch>::TempStorage' being compiled
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(364): error C2065: 'OtherIncrementable': undeclared identifier
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(364): error C2993: 'KeysIt1': illegal type for non-type template parameter '__formal'
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(364): error C2923: 'std::_Select<__formal>::_Apply': 'OtherIncrementable' is not a valid template type argument for parameter '<unnamed-symbol>'
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(364): error C2993: 'unknown-type': illegal type for non-type template parameter 'PTX_ARCH'
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(365): error C2993: 'KeysIt1': illegal type for non-type template parameter '__formal'
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(365): error C2065: 'OtherIncrementable': undeclared identifier
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(365): error C2993: 'KeysIt1': illegal type for non-type template parameter '__formal'
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(365): error C2923: 'std::_Select<__formal>::_Apply': 'OtherIncrementable' is not a valid template type argument for parameter '<unnamed-symbol>'
[2020-05-03T01:49:29.282Z] C:/Windows/TEMP/tmp60g75_h0/thrust-1.9.8\thrust/system/cuda/detail/set_operations.h(365): error C2993: 'unknown-type': illegal type for non-type template parameter 'PTX_ARCH'

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-18146/runs/48/nodes/40/steps/77/log/?start=0

@leezu
Author

leezu commented May 4, 2020

This also affects https://github.com/NVlabs/cub

[2020-05-04T00:27:22.078Z] C:\jenkins_slave\workspace\build-gpu\3rdparty\nvidia_cub\cub\util_type.cuh(891): error C2993: 'T': illegal type for non-type template parameter '__formal'
[2020-05-04T00:27:22.078Z] C:\jenkins_slave\workspace\build-gpu\3rdparty\nvidia_cub\cub\util_type.cuh(909): note: see reference to class template instantiation 'cub::BinaryOpHasIdxParam<T,BinaryOp>' being compiled
[2020-05-04T00:27:22.078Z] C:\jenkins_slave\workspace\build-gpu\3rdparty\nvidia_cub\cub\util_type.cuh(891): error C2065: '__T0': undeclared identifier
[2020-05-04T00:27:22.078Z] C:\jenkins_slave\workspace\build-gpu\3rdparty\nvidia_cub\cub\util_type.cuh(891): error C2923: 'std::_Select<__formal>::_Apply': '__T0' is not a valid template type argument for parameter '<unnamed-symbol>'

http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-18146/runs/51/nodes/40/steps/84/log/?start=0

@mikoro

mikoro commented May 4, 2020

Actually same for me, I did get the same error once in the last few days. It seems to appear less frequently with the newer thrust though.

leezu added a commit to apache/mxnet that referenced this issue May 4, 2020
Updating thrust alone did not help. Similar issues (though less often) still
occur with updated thrust, and also with nvidia cub. Tracked upstream at
NVIDIA/thrust#1090
@alliepiper
Collaborator

Have you tried reporting an issue on the MSVC STL github repo? It sounds like something somewhere is passing a typename to their internal std::_Select utility. If it's not a bug on their end, they might know of a way to track things down.

@leezu
Author

leezu commented May 5, 2020

I opened microsoft/STL#792

@alliepiper
Collaborator

alliepiper commented May 5, 2020

@leezu @mikoro I just found something that might somehow tie into this. Do either of your codebases use the thrust::detail::is_allocator<T> trait?

https://github.com/thrust/thrust/blob/master/thrust/detail/allocator/allocator_traits.h#L383-L386

If so, please link me the code that does and try making this change to see if the problem goes away:

Replace this section of the linked code

    : allocator_traits_detail::has_value_type<T>

with:

    : typename allocator_traits_detail::has_value_type<T>::type
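
To make the suggested change concrete, here is a minimal sketch of the two spellings. The `has_value_type` below is a simplified sizeof-based stand-in written for illustration, not Thrust's macro-generated trait, and `is_allocator_direct` / `is_allocator_via_type` are hypothetical names. The sketch omits the `typename` keyword in the base-specifier: a base class name is already assumed to be a type, and at least GCC rejects the keyword in that position.

```cpp
#include <type_traits>

namespace sketch_is_allocator {

// Simplified sizeof-based detector for a nested value_type.
// Illustrative stand-in only; not copied from Thrust.
template <class T>
struct has_value_type
{
  typedef char yes_type;
  struct no_type { char pad[2]; };

  // Viable only when S::value_type names a type we can form a pointer to.
  template <class S> static yes_type test(typename S::value_type*);
  // Fallback that always matches, but ranks last in overload resolution.
  template <class S> static no_type test(...);

  static const bool value = sizeof(test<T>(0)) == sizeof(yes_type);
  typedef std::integral_constant<bool, value> type;
};

// Original spelling: inherit from the detector class itself.
template <class T>
struct is_allocator_direct : has_value_type<T> {};

// Suggested spelling: inherit from the detector's nested ::type, so the
// base class is a plain integral_constant rather than the detector.
template <class T>
struct is_allocator_via_type : has_value_type<T>::type {};

// Hypothetical test types.
struct with_value_type { typedef int value_type; };
struct without_value_type {};

} // namespace sketch_is_allocator
```

Both spellings report the same `value`; the difference is only which class ends up as the base, which is why the change is cheap to try when hunting for a compiler-state-dependent failure.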

@leezu
Author

leezu commented May 5, 2020

Thanks @allisonvacanti. MXNet does not seem to use the trait, at least not directly: https://github.com/apache/incubator-mxnet/search?q=is_allocator&unscoped_q=is_allocator

@mikoro

mikoro commented May 6, 2020

> Do either of your codebases use the thrust::detail::is_allocator<T> trait?

I searched through all our codebase and there was not anything using that trait.

One additional bit of info: I have seen the error coming from compiling multiple different .cu files. So it is not always just one .cu file that gives the problems. What is common to all these .cu files that have produced the error is that they end up including the following thrust headers:

#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

@alliepiper
Collaborator

This is a long shot, but what about thrust::detail::allocator_traits_detail::has_value_type<T>?

That's the trait the error message mentioned, but it's only used internally by is_allocator<T>, which does not appear to be used at all internally. Since these aren't used, I can't figure out why the compiler is even trying to instantiate this trait.

Both of the vector headers would bring in the allocators, so that makes some sense.

@CaseyCarter

CaseyCarter commented May 6, 2020

> This is a long shot, but what about thrust::detail::allocator_traits_detail::has_value_type<T>?
>
> That's the trait the error message mentioned, but it's only used internally by is_allocator<T>, which does not appear to be used at all internally. Since these aren't used, I can't figure out why the compiler is even trying to instantiate this trait.

I suppose this could be a diagnostic error? Has someone tried building with is_allocator and has_value_type commented out to see whether the problem goes away or if the diagnostic targets a different source location?

MSVC had some diagnostic location issues in roughly the 16.5 time frame, but (1) I had the impression that use of C++ Concepts was necessary to trigger it (which is certainly not the case here), and (2) I thought it was fixed before we released. If we can confirm "moving diagnostics" I can dig in to this more with the compiler team.

@alliepiper
Collaborator

@CaseyCarter I've built Thrust proper with a static_assert that would trigger if has_value_type was used and confirmed that we don't use it internally.

I'm also curious what @mikoro and @leezu will see if they remove these. The bits that need to be commented out are:

https://github.com/thrust/thrust/blob/master/thrust/detail/allocator/allocator_traits.h#L42 and
https://github.com/thrust/thrust/blob/master/thrust/detail/allocator/allocator_traits.h#L383-L386
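
The static_assert check described in this comment can be sketched roughly as follows. This is only an illustration of the general technique, not the code that was actually used; `always_false` and `tripwire_trait` are hypothetical names.

```cpp
#include <type_traits>

namespace sketch_tripwire {

// Always false, but dependent on T, so the assertion below is only
// evaluated when the enclosing template is actually instantiated.
template <class T>
struct always_false : std::false_type {};

// Hypothetical tripwire standing in for a trait believed to be unused.
template <class T>
struct tripwire_trait
{
  static_assert(always_false<T>::value,
                "tripwire_trait<T> was instantiated");
  static const bool value = false;
};

// Naming a specialization through a pointer does not instantiate it,
// so this line compiles cleanly.
typedef tripwire_trait<int>* not_instantiated;

// Uncommenting the next line instantiates the class template and
// makes the build fail with the message above.
// static const bool fired = tripwire_trait<int>::value;

} // namespace sketch_tripwire
```

If a build with such a tripwire in place succeeds, no translation unit instantiated the trait, which is the property being confirmed here.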

@brycelelbach
Collaborator

I'm getting lots of reports from people seeing this same issue. I don't understand how this is just showing up now? Did a new release of MSVC 2019 go out?

@brycelelbach
Collaborator

pytorch/pytorch#38024 may be related

@brycelelbach
Collaborator

@CaseyCarter did y'all put out a release recently? Like in the past month or two?

@BillyONeal

@brycelelbach
Collaborator

Okay squad I'm starting to see wider impacts from this in the PyTorch ecosystem. @allisonvacanti could you try what Casey suggested?

@alliepiper
Collaborator

I haven't been able to reproduce this yet, but I have confirmed that the trait mentioned in the diagnostic is not instantiated by thrust directly, and doesn't appear to be used in either of the affected projects. I've been using the same MSVC2019 and NVCC versions as the bug reports for several weeks now, it's odd that we aren't seeing the same behavior. We still need to track down why this is happening.

I noticed that the internal bug report yesterday is pointing at function_traits.h:42 instead of allocator_traits.h:42.

So clearly the issue is related to life, the universe, and everything.

Or maybe it's just because both of those files use __THRUST_DEFINE_HAS_NESTED_TYPE for the first time at line 42.

I'll put together a patch with @StephanTLavavej's suggestion and replace the implementation of that macro with the equivalent void_t code. We'll need someone who can reproduce the issue to test it and see if it helps.
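
For readers unfamiliar with the void_t approach mentioned above, a minimal sketch of a nested-type detector written that way might look like this. It is not the actual patch; the namespace and test types are hypothetical, and `void_t` is spelled via a helper struct so the sketch only needs C++11.

```cpp
#include <type_traits>

namespace sketch_void_t {

// C++11-friendly void_t: maps any list of well-formed types to void.
template <class... Ts> struct make_void { typedef void type; };
template <class... Ts> using void_t = typename make_void<Ts...>::type;

// Primary template: chosen when T::value_type is not a valid type.
template <class T, class = void>
struct has_value_type : std::false_type {};

// Partial specialization: viable only if T::value_type names a type.
template <class T>
struct has_value_type<T, void_t<typename T::value_type>> : std::true_type {};

// Hypothetical test types.
struct with_value_type { typedef int value_type; };
struct without_value_type {};

static_assert(has_value_type<with_value_type>::value,
              "nested value_type should be detected");
static_assert(!has_value_type<without_value_type>::value,
              "missing value_type should not be detected");
static_assert(!has_value_type<int>::value,
              "non-class types have no nested types");

} // namespace sketch_void_t
```

Compared with a sizeof-based overload test, this form relies on partial specialization matching rather than on overload resolution between a template and an ellipsis function.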

@peterjc123

Just to add a bit more info. This issue also occurs with MSVC 2017 (C++ toolchain 14.11 / 14.16). But I agree that it is a bit difficult to reproduce.

@alliepiper alliepiper self-assigned this May 8, 2020
@alliepiper
Collaborator

My local build just started failing with the error:

cmd.exe /C "C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin\nvcc.exe  -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -I..\ -I..\dependencies\cub -ftemplate-backtrace-limit 64 -gencode arch=compute_75,code=sm_75 -Xcompiler=/WX -Xcompiler=/wd4244 -Xcompiler=/wd4267 -Xcompiler=/wd4800 -Xcompiler=/wd4146 -Xcompiler=/wd4494 -Xcompiler=/bigobj -Werror all-warnings -Xcudafe --display_error_number -Xcompiler="-MD -O2 -Ob2" -DNDEBUG -x cu -c headers\system\cuda\memory_resource.h.cu -o CMakeFiles\header-test.dir\headers\system\cuda\memory_resource.h.cu.obj -Xcompiler=-FdCMakeFiles\header-test.dir\,-FS && C:\PROGRA~1\NVIDIA~2\CUDA\v10.2\bin\nvcc.exe -ccbin=C:\PROGRA~2\MICROS~1\2019\COMMUN~1\VC\Tools\MSVC\1425~1.286\bin\Hostx64\x64\cl.exe -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_CUDA -DTHRUST_HOST_SYSTEM=THRUST_HOST_SYSTEM_CPP -I..\ -I..\dependencies\cub -ftemplate-backtrace-limit 64 -gencode arch=compute_75,code=sm_75 -Xcompiler=/WX -Xcompiler=/wd4244 -Xcompiler=/wd4267 -Xcompiler=/wd4800 -Xcompiler=/wd4146 -Xcompiler=/wd4494 -Xcompiler=/bigobj -Werror all-warnings -Xcudafe --display_error_number -Xcompiler="-MD -O2 -Ob2" -DNDEBUG -x cu -M headers\system\cuda\memory_resource.h.cu -MT CMakeFiles\header-test.dir\headers\system\cuda\memory_resource.h.cu.obj -o CMakeFiles\header-test.dir\headers\system\cuda\memory_resource.h.cu.obj.d"
memory_resource.h.cu
../thrust/detail/allocator/allocator_traits.inl(102): error C2993: 'T': illegal type for non-type template parameter '__formal'
../thrust/detail/allocator/allocator_traits.inl(102): note: see reference to class template instantiation 'thrust::detail::allocator_traits_detail::has_member_construct2_impl_has_member<T,Result(Arg)>' being compiled
../thrust/detail/allocator/allocator_traits.inl(102): error C2065: 't': undeclared identifier
../thrust/detail/allocator/allocator_traits.inl(102): error C2923: 'std::_Select<__formal>::_Apply': 't' is not a valid template type argument for parameter '<unnamed-symbol>'
../thrust/detail/allocator/allocator_traits.inl(102): error C2062: type 'unknown-type' unexpected

This time it's in a different but similar macro:

https://github.com/thrust/thrust/blob/master/thrust/detail/type_traits/is_call_possible.h#L56-L160

Oddly enough, I don't see this error on master. I changed the __THRUST_DEFINE_HAS_NESTED_TYPE macro to use void_t per STL's suggestion, and then it started happening in a different macro. I reverted the void_t changes, and now everything compiles again.

This is madness.

I'll keep investigating.

alliepiper added a commit to alliepiper/thrust that referenced this issue May 8, 2020
These files are used to reproduce NVIDIA#1090:

- repro-1090.cu  # Source file
- repro-1090.bat # Always repros the error
- repro-1090-short.bat # Fewer options, usually reproduces the error
- repro-1090-expr.bat # Smallest set of options that reliably fail

Some of the paths in the batch script might need to be updated to locate
compilers.

This was reproduced using:
- cl.exe version 19.25.28614 for x64
- nvcc.exe version V10.2.89
@StephanTLavavej

> I changed the __THRUST_DEFINE_HAS_NESTED_TYPE macro to use void_t per STL's suggestion, and then it started happening in a different macro.

If the different macro is also performing SFINAE, what happens if you keep going and change that one to use void_t?

I am being vaguely reminded of an old MSVC bug where the compiler would get confused by Expression SFINAE, and answers generated for one Expression SFINAE question would be used for another unrelated question, because they looked "structurally" identical to the compiler (i.e. same code pattern except for names). Those bugs were fixed, as far as I know, but perhaps this is a lingering occurrence?

@alliepiper
Collaborator

I'll try that next. The new macro is doing a very similar form of SFINAE, but is much more complex and will take some effort to update. Before I go that route, I just found a promising clue -- the error does not seem to happen with newer versions of nvcc.

Now that I have a reproducer, I'm internally escalating this to the nvcc team to see if they remember fixing anything that might have caused this.

Steps to reproduce (using NVCC 10.2.89 and cl 19.25.28614 x64):

> git clone --recursive https://github.com/allisonvacanti/thrust.git
> cd thrust
> git checkout bug/github/intermittent_msvc_error/1090
> git submodule update
> repro-1090-expr.bat

(The path to nvcc in the batch script may need to be adjusted)

There are three batch scripts:

  • repro-1090.bat: The full set of commands that originally triggered the error. Almost always reproduces it.
  • repro-1090-short.bat: A single command that almost always triggers the error.
  • repro-1090-expr.bat: NVCC invoked with the fewest options to somewhat reliably reproduce the error.

The source file that reproduces the error is repro-1090.cu. It just includes a single header file from thrust.

Internal nvbug: 2971098

@alliepiper
Collaborator

It looks like the easiest way to get intermediate files and detailed output from nvcc is -v -keep. Seems obvious now :)

There are indeed differences in the files being fed into MSVC, so it doesn't seem to be an issue with MSVC at this point. @BillyONeal @StephanTLavavej @CaseyCarter, thanks for helping us investigate, it is very much appreciated!

The nvcc folks have been brought up to speed and are trying to reproduce this now. I'll update this issue once we figure something out.

Current Status

For the curious, what's happening in my reproduction of the bug is:

However, the files prepared for MSVC sometimes have issues in this expansion:

$ diff -u success/bad_expansion_snippet.cpp error/bad_expansion_snippet.cpp
--- success/bad_expansion_snippet.cpp   2020-05-08 20:21:05.147434600 -0400
+++ error/bad_expansion_snippet.cpp     2020-05-08 20:20:09.443944300 -0400
@@ -1,4 +1,4 @@
-// Succeeds
+// Fails
 template <class T, class Signature>
 class has_member_construct2_impl_has_member;
 template <class T, class Result>
@@ -55,7 +55,7 @@
   class helper
   {};
   template <class U>
-  static no deduce(U*, helper<Result (base_mixin::*)(Arg), &U::construct>* = 0);
+  static no deduce(U*, helper<Result (base_mixin::*)(Arg), &std::_Select<T>::template _Apply<U, t>::construct>* = 0);
   static yes deduce(...);
 public:
   static const bool value = (sizeof(yes) ==
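
For context, the intact expansion in the diff is an instance of a classic member-detection idiom. A simplified, self-contained sketch of that idiom is shown below, under the assumption that only the presence of a member named `construct` matters, not its signature. All names here are illustrative, and the technique only applies to non-final class types.

```cpp
namespace sketch_member_detect {

typedef char yes;
struct no { char pad[2]; };

// Detects whether class T has any member named `construct`.
template <class T>
class has_member_named_construct
{
  // Contributes a second `construct`, so that lookup in `base`
  // becomes ambiguous exactly when T already has one.
  struct base_mixin { void construct() {} };
  struct base : public T, public base_mixin {};

  template <class U, U Value>
  class helper {};

  // Viable only when &U::construct is unambiguous, i.e. when it
  // refers to base_mixin::construct and T has no such member.
  template <class U>
  static no deduce(U*, helper<void (base_mixin::*)(), &U::construct>* = 0);
  // Fallback, selected when the overload above is removed by SFINAE.
  static yes deduce(...);

public:
  static const bool value =
    sizeof(yes) == sizeof(deduce(static_cast<base*>(0)));
};

// Hypothetical test types.
struct with_construct { void construct(int) {} };
struct without_construct {};

static_assert(has_member_named_construct<with_construct>::value,
              "construct member should be detected");
static_assert(!has_member_named_construct<without_construct>::value,
              "absent construct member should not be detected");

} // namespace sketch_member_detect
```

In the failing preprocessed file, the `&U::construct` argument of `helper` is the piece that was rewritten into the `std::_Select<T>::template _Apply<U, t>` form, which matches the `__formal` and undeclared-identifier errors reported earlier in the thread.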

@alliepiper
Collaborator

I confirmed with the nvcc team that this issue was recently fixed and that the fix will be available in the next version of the CUDA Toolkit. They are not aware of any way to work around these issues in source, unfortunately, so we'll just have to wait.

Closing since this is no longer an actionable thrust bug.

@leezu
Author

leezu commented May 11, 2020

Thank you @allisonvacanti. Could you clarify whether "next version" refers to CUDA 11, or whether it will also be fixed in a minor update to CUDA 10.2?

@alliepiper
Collaborator

@leezu Good question -- the information I was given said 11.0.

@NHarishGit

> I'm getting lots of reports from people seeing this same issue. I don't understand how this is just showing up now? Did a new release of MSVC 2019 go out?

Correct, we are facing a similar issue in OpenCV 4.3.0 as well, possibly stemming from MSVC 2019 16.5.5; see details here: opencv/opencv#17289

samskalicky pushed a commit to apache/mxnet that referenced this issue Oct 2, 2020
* * Fix einsum gradient (#18482)

* [v1.7.x] Backport PRs of numpy features (#18653)

* add zero grad for npi_unique (#18080)

* fix np.clip scalar input case (#17788)

* fix true_divide (#18393)

Co-authored-by: Hao Jin <hjjn.amzn@gmail.com>
Co-authored-by: Xi Wang <xidulu@gmail.com>

* [v1.7.x] backport mixed type binary ops to v1.7.x (#18649)

* Fix Windows GPU CI (#17962)

Update Windows CI to use VS 2019 and enable the 64-bit (x64) toolchain. Previously we were using an older 32-bit toolchain, causing OOM errors during linking. Switching to the x64 toolchain on the older VS version previously used by the CI was attempted in #17912 and did not work. Update to CUDA 10.2 as it is required by VS 2019. Switch to ninja-build on Windows to speed up the build, as ninja-build is now preinstalled. Remove logic to install cmake 3.16 on every PR, as cmake 3.17 is now preinstalled. Add build retries due to CUDA thrust + VS2019 flakiness.

Co-authored-by: vexilligera <vexilligera@gmail.com>

* backport mixed type

Co-authored-by: Leonard Lausen <lausen@amazon.com>
Co-authored-by: vexilligera <vexilligera@gmail.com>

* revise activations (#18700)

* [v1.6] Fix the monitor_callback invalid issue during calibration with variable input shapes (#18632) (#18703)

* Fix the monitor_callback invalid issue during calibration with variable input shapes

* retrigger CI

* Add UT for monitor check and disable codecov

Co-authored-by: Tao Lv <tao.a.lv@intel.com>

* Fail build_windows.py if all retries failed (#18177)

* Update to thrust 1.9.8 on Windows (#18218)

* Update to thrust 1.9.8 on Windows

* Remove debug logic

* Re-enable build retries on MSVC (#18230)

Updating thrust alone did not help. Similar issues (though less often) still
occur with updated thrust, and also with nvidia cub. Tracked upstream at
NVIDIA/thrust#1090

Co-authored-by: Ke Han <38852697+hanke580@users.noreply.github.com>
Co-authored-by: Xingjian Shi <xshiab@connect.ust.hk>
Co-authored-by: Hao Jin <hjjn.amzn@gmail.com>
Co-authored-by: Xi Wang <xidulu@gmail.com>
Co-authored-by: Yijun Chen <chenyijun0902@gmail.com>
Co-authored-by: vexilligera <vexilligera@gmail.com>
Co-authored-by: ciyong <ciyong.chen@intel.com>
Co-authored-by: Tao Lv <tao.a.lv@intel.com>
tcojean added a commit to ginkgo-project/ginkgo that referenced this issue Aug 5, 2021
Fix Readme and disable MSVC-CUDA 10.2

+ Update to the new package status. Simplify the HIP-related INSTALL.md section.
+ Disable the MSVC-CUDA 10.2 job.

There is an issue with the CUDA implementation which prevents a proper
execution. See pytorch/pytorch#25393 and NVIDIA/thrust#1090. Tweaking
the compiler settings would allow getting fewer errors, but it seems
impossible to prevent the errors altogether.

Related PR: #852