Intermittent compilation failures with thrust, cuda 10.2 and MSVC 2019 #1090
Comments
Hi, it's going to be very hard for us to debug this without more information and a reproducer. Please see https://github.com/brycelelbach/cpp_bug_reporting_guidelines. At the very least we need full logs, and we need to see the code that you are trying to build. Thanks!
@allisonvacanti can you take a look at this and see if you have any thoughts?
It can be reproduced by building MXNet or PyTorch from source. I understand that's not really minimal, sorry. I see that pytorch/pytorch#25393 contains a recent deep dive on this issue and claims a relation to #1030. One example of a complete error log is at http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/windows-gpu/branches/PR-17808/runs/15/nodes/39/log/?start=0
@leezu I recently pushed a ton of MSVC fixes (including a fix for #1030). I don't recall encountering any issues involving the failure described here, though.
For me this started appearing after some MSVC 2019 update a few months ago. The error is random, and it comes and goes as you edit the files. Usually you can fix the error by making some random edit to the failing file. It seems that yesterday I was able to fix it permanently: I just cloned the latest thrust repo and used it in place of the copy shipped with the toolkit.
That is bizarre! Hopefully it stays fixed for good, in which case the next toolkit release should resolve this. For a less invasive workaround, you can also include the cloned thrust/cub directories directly rather than modifying the toolkit path.
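A minimal sketch of that workaround, assuming clone locations of C:\src\thrust and C:\src\cub (these paths are not from the thread): the cloned directories just need to come before the toolkit's bundled copies on the include path.

```cuda
// Sketch of the include-path workaround; C:\src\thrust and C:\src\cub are
// assumed clone locations. The cloned headers must appear on the include
// path before the copies bundled with the CUDA 10.2 toolkit, e.g.:
//   nvcc -I C:\src\thrust -I C:\src\cub -o smoke_test.exe smoke_test.cu
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main() {
  thrust::device_vector<int> v(100, 1);           // 100 ones on the device
  int sum = thrust::reduce(v.begin(), v.end());   // expect 100
  return sum == 100 ? 0 : 1;
}
```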
Based on preliminary data, the issue also goes away with thrust 1.9.8, which appears to be the version that will ship in CUDA 11. At least the CI runs testing apache/mxnet#18218 did not experience any intermittent compilation failures on MSVC.
Unfortunately the issue still occurs with thrust 1.9.8, though apparently more rarely.
This also affects https://github.com/NVlabs/cub
Actually, same for me: I did get the same error once in the last few days. It seems to appear less frequently with the newer thrust, though.
Updating thrust alone did not help. Similar issues (though less often) still occur with updated thrust, and also with nvidia cub. Tracked upstream at NVIDIA/thrust#1090
Have you tried reporting an issue on the MSVC STL GitHub repo? It sounds like something somewhere is passing a typename to their internal std::_Select utility. If it's not a bug on their end, they might know of a way to track things down.
I opened microsoft/STL#792
@leezu @mikoro I just found something that might somehow tie into this. Do either of your codebases use the `is_allocator` trait? If so, please link me the code that does, and try making this change to see if the problem goes away: replace the linked section of that code with the suggested alternative.
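The before/after snippets from this comment are not reproduced here. Purely as an illustration of the kind of trait being discussed (the name and shape below are assumptions, not the actual Thrust code), an expression-SFINAE allocator-detection trait typically looks something like this:

```cuda
#include <cstddef>
#include <memory>
#include <type_traits>
#include <utility>

// Illustrative only -- NOT the actual Thrust code referenced above.
// Asks whether T has a usable allocate(n) member via expression SFINAE.
template <typename T>
struct looks_like_allocator {
 private:
  template <typename U>
  static std::true_type test(
      int, decltype(std::declval<U&>().allocate(std::size_t{1}))* = nullptr);

  template <typename U>
  static std::false_type test(...);

 public:
  static constexpr bool value = decltype(test<T>(0))::value;
};

static_assert(looks_like_allocator<std::allocator<int>>::value, "");
static_assert(!looks_like_allocator<int>::value, "");
```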
Thanks @allisonvacanti. MXNet does not seem to use the trait, at least not directly: https://github.com/apache/incubator-mxnet/search?q=is_allocator&unscoped_q=is_allocator
I searched through our whole codebase and there was nothing using that trait. One additional bit of info: I have seen the error coming from compiling multiple different .cu files, so it is not always just one .cu file that causes the problems. What is common to all the .cu files that have produced the error is that they end up including a common set of thrust headers.
This is a long shot, but what about the trait the error message mentioned? As far as I can tell it's only used internally by the allocator machinery. Both of the vector headers would bring in the allocators, so that makes some sense.
I suppose this could be a diagnostic error? Has someone tried building with a different compiler version? MSVC had some diagnostic location issues in roughly the 16.5 time frame, but (1) I had the impression that use of C++ Concepts was necessary to trigger it (which is certainly not the case here), and (2) I thought it was fixed before we released. If we can confirm "moving diagnostics", I can dig into this more with the compiler team.
@CaseyCarter I've built Thrust proper with a static_assert that would trigger if the trait in question were ever instantiated. I'm also curious what @mikoro and @leezu will see if they remove these. The bits that need to be commented out are https://github.com/thrust/thrust/blob/master/thrust/detail/allocator/allocator_traits.h#L42 and the corresponding line in function_traits.h.
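A hedged sketch of that kind of check (the trait name below is hypothetical, not the Thrust trait actually modified): poison the trait so that any instantiation becomes a hard compile error, which tells you whether anything in the build reaches it.

```cuda
#include <type_traits>

// Hypothetical canary, for illustration only.
template <typename>
struct always_false : std::false_type {};

// Poisoned trait: any instantiation trips the dependent static_assert, and
// the compiler's instantiation backtrace shows who reached it.
template <typename T>
struct suspect_trait {
  static_assert(always_false<T>::value, "suspect_trait<T> was instantiated");
};
```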
I'm getting lots of reports from people seeing this same issue. I don't understand why this is only showing up now. Did a new release of MSVC 2019 go out?
pytorch/pytorch#38024 may be related
@CaseyCarter did y'all put out a release recently? Like in the past month or two?
16.5 came out in March: https://devblogs.microsoft.com/visualstudio/visual-studio-2019-version-16-5/
Okay squad, I'm starting to see wider impacts from this in the PyTorch ecosystem. @allisonvacanti, could you try what Casey suggested?
I haven't been able to reproduce this yet, but I have confirmed that the trait mentioned in the diagnostic is not instantiated by thrust directly, and it doesn't appear to be used in either of the affected projects. I've been using the same MSVC 2019 and NVCC versions as the bug reports for several weeks now; it's odd that we aren't seeing the same behavior. We still need to track down why this is happening. I noticed that the internal bug report filed yesterday points at function_traits.h:42 instead of allocator_traits.h:42, so clearly the issue is related to life, the universe, and everything. Or maybe it's just because both of those files use the same detection macro on that line. I'll put together a patch with @StephanTLavavej's suggestion and replace the implementation of that macro with the suggested equivalent.
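As a hedged sketch of what that kind of macro replacement could look like (the names below are made up for illustration; the actual Thrust macro and patch are not shown in this thread): the old pattern stamps out a nested-type detector via a macro, and the replacement expresses the same detection directly with the `void_t` partial-specialization idiom.

```cuda
#include <type_traits>

// Illustration only; names are hypothetical, not the Thrust macro in question.
// A local stand-in for std::void_t (which is C++17-only):
template <typename...>
using void_t = void;

// Detect whether T has a nested ::value_type, without a macro:
template <typename T, typename = void>
struct has_value_type : std::false_type {};

template <typename T>
struct has_value_type<T, void_t<typename T::value_type>> : std::true_type {};

struct with_vt    { using value_type = int; };
struct without_vt {};

static_assert( has_value_type<with_vt>::value, "");
static_assert(!has_value_type<without_vt>::value, "");
```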
Just to add a bit more info: this issue also occurs with MSVC 2017 (C++ toolchain 14.11 / 14.16). But I agree that it is a bit difficult to reproduce.
My local build just started failing with the error, this time in a different but similar macro. Oddly enough, I don't see this error on master. This is madness. I'll keep investigating.
These files are used to reproduce NVIDIA#1090:
- repro-1090.cu        # Source file
- repro-1090.bat       # Always repros the error
- repro-1090-short.bat # Fewer options, usually reproduces the error
- repro-1090-expr.bat  # Smallest set of options that reliably fails
Some of the paths in the batch scripts might need to be updated to locate compilers. This was reproduced using:
- cl.exe version 19.25.28614 for x64
- nvcc.exe version V10.2.89
If the different macro is also performing SFINAE, what happens if you keep going and change that one to use the same replacement? I am vaguely reminded of an old MSVC bug where the compiler would get confused by Expression SFINAE, and answers generated for one Expression SFINAE question would be used for another, unrelated question, because they looked "structurally" identical to the compiler (i.e. the same code pattern except for names). Those bugs were fixed, as far as I know, but perhaps this is a lingering occurrence?
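To make the "structurally identical" point concrete, here is a hedged illustration (not code from Thrust or from the reported failures): the two detectors below are the same code pattern and differ only in the member name they probe, so a compiler that memoized SFINAE answers by structure rather than by name could hand back the wrong result for one of them.

```cuda
#include <type_traits>
#include <utility>

// Two expression-SFINAE questions that look "structurally" identical:
// only the probed member name (foo vs. bar) differs.
template <typename T>
auto has_foo(int) -> decltype(std::declval<T&>().foo(), std::true_type{});
template <typename T>
auto has_foo(...) -> std::false_type;

template <typename T>
auto has_bar(int) -> decltype(std::declval<T&>().bar(), std::true_type{});
template <typename T>
auto has_bar(...) -> std::false_type;

struct only_foo { void foo(); };

// A correct compiler must give different answers for the two questions:
static_assert( decltype(has_foo<only_foo>(0))::value, "");
static_assert(!decltype(has_bar<only_foo>(0))::value, "");
```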
I'll try that next. The new macro is doing a very similar form of SFINAE, but is much more complex and will take some effort to update. Before I go that route, I just found a promising clue -- the error does not seem to happen with newer versions of nvcc. Now that I have a reproducer, I'm internally escalating this to the nvcc team to see if they remember fixing anything that might have caused this. Steps to reproduce (using NVCC 10.2.89 and cl 19.25.28614 x64):
(The path to nvcc in the batch scripts may need to be adjusted.) There are three batch scripts; see the commit message above.
The source file that reproduces the error is repro-1090.cu. It just includes a single header file from thrust. Internal nvbug: 2971098 |
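The exact Thrust header and nvcc options are not reproduced in this thread; as a hedged sketch of that kind of minimal reproducer, and of one way to keep nvcc's intermediate files so the input handed to cl.exe can be inspected (the header choice and options here are assumptions):

```cuda
// repro-1090.cu (sketch; the actual header included is an assumption here).
// Compile with --keep so nvcc leaves its intermediate files on disk, and
// --verbose so it prints the exact cl.exe command lines it runs:
//   nvcc --keep --verbose -c repro-1090.cu -o repro-1090.obj
#include <thrust/device_vector.h>
```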
Looking at nvcc's intermediate files and detailed output, there are indeed differences in the files being fed into MSVC, so it doesn't seem to be an issue with MSVC at this point. @BillyONeal @StephanTLavavej @CaseyCarter, thanks for helping us investigate; it is very much appreciated! The nvcc folks have been brought up to speed and are trying to reproduce this now. I'll update this issue once we figure something out. Current status, for the curious: in my reproduction of the bug, nvcc expands the suspect macro while preparing the source files it hands to MSVC.
However, the files prepared for MSVC sometimes have issues in this expansion.
I confirmed with the nvcc team that this issue was recently fixed and that the fix will be available in the next version of the CUDA Toolkit. They are not aware of any way to work around these issues in source, unfortunately, so we'll just have to wait. Closing since this is no longer an actionable thrust bug.
Thank you @allisonvacanti. Could you clarify whether "next version" refers to CUDA 11, or whether the fix will also appear in a minor update to CUDA 10.2?
@leezu Good question -- the information I was given said 11.0.
Correct, we are facing a similar issue in OpenCV 4.3.0 as well, maybe stemming from MSVC 2019 16.5.5; see details here: opencv/opencv#17289
Fix Readme and disable MSVC-CUDA 10.2
+ Update to the new package status. Simplify the HIP-related INSTALL.md section.
+ Disable the MSVC-CUDA 10.2 job. There is an issue with the CUDA implementation which prevents proper execution. See pytorch/pytorch#25393 and NVIDIA/thrust#1090. Tweaking the compiler settings reduces the number of errors, but it seems impossible to prevent them altogether. Related PR: #852
We experience intermittent compilation failures on our CI server.
The CXX compiler identification is MSVC 19.25.28612.0. The CUDA compiler identification is NVIDIA 10.2.89.
Retrying the compilation typically succeeds. Our CI server now retries compiling the project up to 5 times to avoid this issue. (The issue has never occurred 5 times in a row yet.)
The error points at MSVC's internal std::_Select utility; a full example log is linked in the comments above.
This affects at least two projects, MXNet and PyTorch.