Intermittent compilation failures with thrust, cuda 10.2 and MSVC 2019 #1090
Comments
Hi, it's going to be very hard for us to debug this without more information and a reproducer. Please see https://github.com/brycelelbach/cpp_bug_reporting_guidelines. At the very least we need full logs, and we need to see the code that you are trying to build. Thanks!
@allisonvacanti can you take a look at this and see if you have any thoughts?
It can be reproduced by building MXNet or PyTorch from source. I understand that's not really minimal, sorry. I see that pytorch/pytorch#25393 contains a recent deep dive on this issue and claims a relation to #1030. One example of a complete error log is at http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-17808/runs/15/nodes/39/log/?start=0
@leezu I recently pushed a ton of MSVC fixes (including a fix for #1030). I don't recall encountering any issues involving the failure described here, though.
For me this started appearing after some MSVC 2019 update a few months ago. The error is random, and it comes and goes as you edit the files. Usually you can fix the error by making some random edit to the failing file. It seems that yesterday I was able to fix it permanently: I just cloned the latest thrust repo and used it in place of the copy shipped with the toolkit.
That is bizarre! Hopefully it stays fixed for good, in which case the next toolkit release should resolve this. For a less invasive workaround, you can also include the cloned thrust/cub directories directly rather than modifying the toolkit path.
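A minimal sketch of that workaround, assuming clone locations of C:\src\thrust and C:\src\cub (these paths are not from the thread): the cloned directories just need to come before the toolkit's bundled copies on the include path.

```cuda
// Sketch of the include-path workaround; C:\src\thrust and C:\src\cub are
// assumed clone locations. The cloned headers must appear on the include
// path before the copies bundled with the CUDA 10.2 toolkit, e.g.:
//   nvcc -I C:\src\thrust -I C:\src\cub -o smoke_test.exe smoke_test.cu
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main() {
  thrust::device_vector<int> v(100, 1);           // 100 ones on the device
  int sum = thrust::reduce(v.begin(), v.end());   // expect 100
  return sum == 100 ? 0 : 1;
}
```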
Based on preliminary data, the issue also goes away with thrust 1.9.8, which appears to be the version that will ship in CUDA 11. At least the CI runs testing apache/mxnet#18218 did not experience any intermittent compilation failures on MSVC.
Unfortunately the issue still occurs with thrust 1.9.8, though apparently more rarely.
This also affects https://github.com/NVlabs/cub
Actually, same for me: I did get the same error once in the last few days. It seems to appear less frequently with the newer thrust, though.
Updating thrust alone did not help. Similar issues (though less often) still occur with updated thrust, and also with nvidia cub. Tracked upstream at NVIDIA/thrust#1090
Have you tried reporting an issue on the MSVC STL GitHub repo? It sounds like something somewhere is passing a typename to their internal std::_Select utility. If it's not a bug on their end, they might know of a way to track things down.
I opened microsoft/STL#792
@leezu @mikoro I just found something that might somehow tie into this. Do either of your codebases use the `is_allocator` trait? If so, please link me the code that does, and try making this change to see if the problem goes away: replace the linked section of that code with the suggested alternative.
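The before/after snippets from this comment are not reproduced here. Purely as an illustration of the kind of trait being discussed (the name and shape below are assumptions, not the actual Thrust code), an expression-SFINAE allocator-detection trait typically looks something like this:

```cuda
#include <cstddef>
#include <memory>
#include <type_traits>
#include <utility>

// Illustrative only -- NOT the actual Thrust code referenced above.
// Asks whether T has a usable allocate(n) member via expression SFINAE.
template <typename T>
struct looks_like_allocator {
 private:
  template <typename U>
  static std::true_type test(
      int, decltype(std::declval<U&>().allocate(std::size_t{1}))* = nullptr);

  template <typename U>
  static std::false_type test(...);

 public:
  static constexpr bool value = decltype(test<T>(0))::value;
};

static_assert(looks_like_allocator<std::allocator<int>>::value, "");
static_assert(!looks_like_allocator<int>::value, "");
```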
Thanks @allisonvacanti. MXNet does not seem to use the trait, at least not directly: https://github.com/apache/incubator-mxnet/search?q=is_allocator&unscoped_q=is_allocator
I searched through our whole codebase and there was nothing using that trait. One additional bit of info: I have seen the error coming from compiling multiple different .cu files, so it is not always just one .cu file that causes the problems. What is common to all the .cu files that have produced the error is that they end up including a common set of thrust headers.
This is a long shot, but what about the trait the error message mentioned? As far as I can tell it's only used internally by the allocator machinery. Both of the vector headers would bring in the allocators, so that makes some sense.
I suppose this could be a diagnostic error? Has someone tried building with a different compiler version? MSVC had some diagnostic location issues in roughly the 16.5 time frame, but (1) I had the impression that use of C++ Concepts was necessary to trigger it (which is certainly not the case here), and (2) I thought it was fixed before we released. If we can confirm "moving diagnostics", I can dig into this more with the compiler team.
@CaseyCarter I've built Thrust proper with a static_assert that would trigger if the trait in question were ever instantiated. I'm also curious what @mikoro and @leezu will see if they remove these. The bits that need to be commented out are https://github.com/thrust/thrust/blob/master/thrust/detail/allocator/allocator_traits.h#L42 and the corresponding line in function_traits.h.
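A hedged sketch of that kind of check (the trait name below is hypothetical, not the Thrust trait actually modified): poison the trait so that any instantiation becomes a hard compile error, which tells you whether anything in the build reaches it.

```cuda
#include <type_traits>

// Hypothetical canary, for illustration only.
template <typename>
struct always_false : std::false_type {};

// Poisoned trait: any instantiation trips the dependent static_assert, and
// the compiler's instantiation backtrace shows who reached it.
template <typename T>
struct suspect_trait {
  static_assert(always_false<T>::value, "suspect_trait<T> was instantiated");
};
```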
I'm getting lots of reports from people seeing this same issue. I don't understand why this is only showing up now. Did a new release of MSVC 2019 go out?
pytorch/pytorch#38024 may be related
@CaseyCarter did y'all put out a release recently? Like in the past month or two?
16.5 came out in March: https://devblogs.microsoft.com/visualstudio/visual-studio-2019-version-16-5/
Okay squad, I'm starting to see wider impacts from this in the PyTorch ecosystem. @allisonvacanti, could you try what Casey suggested?
I haven't been able to reproduce this yet, but I have confirmed that the trait mentioned in the diagnostic is not instantiated by thrust directly, and it doesn't appear to be used in either of the affected projects. I've been using the same MSVC 2019 and NVCC versions as the bug reports for several weeks now; it's odd that we aren't seeing the same behavior. We still need to track down why this is happening. I noticed that the internal bug report filed yesterday points at function_traits.h:42 instead of allocator_traits.h:42, so clearly the issue is related to life, the universe, and everything. Or maybe it's just because both of those files use the same detection macro on that line. I'll put together a patch with @StephanTLavavej's suggestion and replace the implementation of that macro with the suggested equivalent.
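As a hedged sketch of what that kind of macro replacement could look like (the names below are made up for illustration; the actual Thrust macro and patch are not shown in this thread): the old pattern stamps out a nested-type detector via a macro, and the replacement expresses the same detection directly with the `void_t` partial-specialization idiom.

```cuda
#include <type_traits>

// Illustration only; names are hypothetical, not the Thrust macro in question.
// A local stand-in for std::void_t (which is C++17-only):
template <typename...>
using void_t = void;

// Detect whether T has a nested ::value_type, without a macro:
template <typename T, typename = void>
struct has_value_type : std::false_type {};

template <typename T>
struct has_value_type<T, void_t<typename T::value_type>> : std::true_type {};

struct with_vt    { using value_type = int; };
struct without_vt {};

static_assert( has_value_type<with_vt>::value, "");
static_assert(!has_value_type<without_vt>::value, "");
```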
Just to add a bit more info: this issue also occurs with MSVC 2017 (C++ toolchain 14.11 / 14.16). But I agree that it is a bit difficult to reproduce.
My local build just started failing with the error, this time in a different but similar macro. Oddly enough, I don't see this error on master. This is madness. I'll keep investigating.
These files are used to reproduce NVIDIA#1090:
- repro-1090.cu        # Source file
- repro-1090.bat       # Always repros the error
- repro-1090-short.bat # Fewer options, usually reproduces the error
- repro-1090-expr.bat  # Smallest set of options that reliably fails
Some of the paths in the batch scripts might need to be updated to locate compilers. This was reproduced using:
- cl.exe version 19.25.28614 for x64
- nvcc.exe version V10.2.89
If the different macro is also performing SFINAE, what happens if you keep going and change that one to use the same replacement? I am vaguely reminded of an old MSVC bug where the compiler would get confused by Expression SFINAE, and answers generated for one Expression SFINAE question would be used for another, unrelated question, because they looked "structurally" identical to the compiler (i.e. the same code pattern except for names). Those bugs were fixed, as far as I know, but perhaps this is a lingering occurrence?
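To make the "structurally identical" point concrete, here is a hedged illustration (not code from Thrust or from the reported failures): the two detectors below are the same code pattern and differ only in the member name they probe, so a compiler that memoized SFINAE answers by structure rather than by name could hand back the wrong result for one of them.

```cuda
#include <type_traits>
#include <utility>

// Two expression-SFINAE questions that look "structurally" identical:
// only the probed member name (foo vs. bar) differs.
template <typename T>
auto has_foo(int) -> decltype(std::declval<T&>().foo(), std::true_type{});
template <typename T>
auto has_foo(...) -> std::false_type;

template <typename T>
auto has_bar(int) -> decltype(std::declval<T&>().bar(), std::true_type{});
template <typename T>
auto has_bar(...) -> std::false_type;

struct only_foo { void foo(); };

// A correct compiler must give different answers for the two questions:
static_assert( decltype(has_foo<only_foo>(0))::value, "");
static_assert(!decltype(has_bar<only_foo>(0))::value, "");
```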
I'll try that next. The new macro is doing a very similar form of SFINAE, but is much more complex and will take some effort to update. Before I go that route, I just found a promising clue -- the error does not seem to happen with newer versions of nvcc. Now that I have a reproducer, I'm internally escalating this to the nvcc team to see if they remember fixing anything that might have caused this. Steps to reproduce (using NVCC 10.2.89 and cl 19.25.28614 x64):
(The path to nvcc in the batch scripts may need to be adjusted.) There are three batch scripts; see the commit message above.
The source file that reproduces the error is repro-1090.cu. It just includes a single header file from thrust. Internal nvbug: 2971098 |
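The exact Thrust header and nvcc options are not reproduced in this thread; as a hedged sketch of that kind of minimal reproducer, and of one way to keep nvcc's intermediate files so the input handed to cl.exe can be inspected (the header choice and options here are assumptions):

```cuda
// repro-1090.cu (sketch; the actual header included is an assumption here).
// Compile with --keep so nvcc leaves its intermediate files on disk, and
// --verbose so it prints the exact cl.exe command lines it runs:
//   nvcc --keep --verbose -c repro-1090.cu -o repro-1090.obj
#include <thrust/device_vector.h>
```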
Looking at nvcc's intermediate files and detailed output, there are indeed differences in the files being fed into MSVC, so it doesn't seem to be an issue with MSVC at this point. @BillyONeal @StephanTLavavej @CaseyCarter, thanks for helping us investigate; it is very much appreciated! The nvcc folks have been brought up to speed and are trying to reproduce this now. I'll update this issue once we figure something out. Current status, for the curious: in my reproduction of the bug, nvcc expands the suspect macro while preparing the source files it hands to MSVC.
However, the files prepared for MSVC sometimes have issues in this expansion.
I confirmed with the nvcc team that this issue was recently fixed and that the fix will be available in the next version of the CUDA Toolkit. They are not aware of any way to work around these issues in source, unfortunately, so we'll just have to wait. Closing since this is no longer an actionable thrust bug.
Thank you @allisonvacanti. Could you clarify whether "next version" refers to CUDA 11, or whether the fix will also appear in a minor update to CUDA 10.2?
@leezu Good question -- the information I was given said 11.0.
Correct, we are facing a similar issue in OpenCV 4.3.0 as well, maybe stemming from MSVC 2019 16.5.5; see details here: opencv/opencv#17289
Fix Readme and disable MSVC-CUDA 10.2
+ Update to the new package status. Simplify the HIP-related INSTALL.md section.
+ Disable the MSVC-CUDA 10.2 job. There is an issue with the CUDA implementation which prevents proper execution. See pytorch/pytorch#25393 and NVIDIA/thrust#1090. Tweaking the compiler settings reduces the number of errors, but it seems impossible to prevent them altogether. Related PR: #852
We experience intermittent compilation failures on our CI server.
The CXX compiler identification is MSVC 19.25.28612.0. The CUDA compiler identification is NVIDIA 10.2.89.
Retrying the compilation typically succeeds. Our CI server now retries compiling the project up to 5 times to avoid this issue. (The issue has never occurred 5 times in a row yet.)
The error points at MSVC's internal std::_Select utility; a full example log is linked in the comments above.
This affects at least two projects, MXNet and PyTorch.