SD test failure #3826
Comments
Already mentioned in #3445 (comment), I forgot to open a ticket. IIRC it works by tweaking the CUDA include paths, but can't remember the exact procedure. It probably involved my custom CUDA library with the patched thrust version. SD on GPU is experimental until this is sorted out. espresso/src/config/features.def Line 32 in b61090a
If it is experimental it should be opt-in and not turned on |
Also related, why is |
I can't see why this should be treated differently. I think a better solution would be to have an |
@jgrad decided to have |
The question is why does the result differ between a GTX 1080 and an RTX 2080. Is it just slightly different numerical precision and the error bound is too tight, or is it an actual bug. |
This commit is quite long, and some of the changes (e.g. the time step) were reverted later. I don't think exploring this commit further would help. On my workstation ( |
My take on the offline discussion at the ESPResSo meeting: Personally, I think this has to converge by mid-August. In the weeks around the ESPResSo school, new users will potentially download the code and run the test suite. By that time, the issue needs to be gone one way or the other. |
Well, I initially suggested pinning the Thrust version by including it as a submodule which would have also saved many hours that were wasted on the |
Good find. Should be easy enough to bisect NVIDIA/thrust@1.9.3...1.9.5 to find the cause. |
That wouldn't have worked as Thrust and CUDA versions are not arbitrarily upward and downward compatible.
CUDA 9.1.85 with Thrust 1.9.2, 1.9.3, 1.9.5 and 1.9.7 exhibits the error. Here is my git-bisect-compatible build script:
#!/bin/bash
cd ~/Documents/espresso/build || exit 192
rm -rf * || exit 192
cmake .. \
-DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda-10.0 -DCUDA_NVCC_EXECUTABLE=/usr/local/cuda-10.0/bin/nvcc \
-DWITH_CUDA=ON -DCMAKE_CXX_FLAGS="-I/tikhome/mkuron/Documents/espresso/thrust" \
-DCUDA_NVCC_FLAGS="-I/tikhome/mkuron/Documents/espresso/thrust" || exit 192
cp ../maintainer/configs/maxset.hpp myconfig.hpp || exit 192
cmake . || exit 192
make -j 24 VERBOSE=1 || exit 125
make python_test_data || exit 192
./pypresso ../testsuite/python/stokesian_dynamics_gpu.py
The way I see it, it's not related to Thrust version, but CUDA version. Disable building Stokesian Dynamics on CUDA less than 10 and we're done. |
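For readers unfamiliar with the exit codes in the build script above: they follow the `git bisect run` convention, where 0 marks a commit as good, 125 skips it, 1 to 127 (other than 125) marks it as bad, and anything else aborts the bisection. A minimal sketch of that mapping follows; the function name `bisect_meaning` is hypothetical and used only for illustration.

```shell
#!/bin/bash
# Hypothetical helper: print how `git bisect run` interprets an exit status.
bisect_meaning() {
    local status="$1"
    if [ "$status" -eq 0 ]; then
        echo "good"      # test passed: commit is good
    elif [ "$status" -eq 125 ]; then
        echo "skip"      # commit cannot be tested (e.g. build failure)
    elif [ "$status" -ge 1 ] && [ "$status" -le 127 ]; then
        echo "bad"       # test failed: commit is bad
    else
        echo "abort"     # e.g. 192: stop the bisection entirely
    fi
}
```

Under this reading, the script aborts on setup problems (192), skips commits that fail to build (125), and lets the exit status of the final test command decide between good and bad.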
I need to amend this statement: there is no error with CUDA 10.0. I misread my CMake log output. The test fails with CUDA 9, either 9.0 from the ICP or CUDA 9.1.85 from the docker image, when the hardware is a GeForce RTX 2080.
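One way to avoid misreading the CMake log is to ask the compiler directly which CUDA release it belongs to. The following is a minimal sketch that assumes the usual `nvcc --version` output containing a line such as `Cuda compilation tools, release 9.1, V9.1.85`; the helper names `cuda_release` and `is_pre_cuda10` are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read `nvcc --version` output on stdin
# and print the release number, e.g. "9.1".
cuda_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: succeed if the major version of a release
# string such as "9.1" is below 10.
is_pre_cuda10() {
    local release="$1"
    [ "${release%%.*}" -lt 10 ]
}
```

For example, `/usr/local/cuda-10.0/bin/nvcc --version | cuda_release` would print the release of that particular toolkit, which can then be passed to `is_pre_cuda10`.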
Indeed, thrust is likely not at fault here. However, it depends also on the hardware. The test runs fine in CI on CUDA 9.
This requires making the CMake logic for the Disabling SD GPU for CUDA 9 is also problematic, because we have committed ourselves to supporting CUDA 9 in bugfix releases until October 2021 (target release date for espresso 4.3). We already disabled SD GPU on ROCm in CI because it doesn't run, and disabled the diffusion test on all GPU jobs due to timeouts. Other GPU features of espresso do not have this special treatment. |
Ah, I forgot about that. So in summary, it only fails when you use CUDA 9 with a Turing GPU. According to Nvidia's documentation, newer GPUs support code compiled with older CUDA versions (Turing was released after CUDA 9.1), but here they clearly don't. We've actually had a similar case before (#1412 (comment)), but the trick employed back then ( I've tried comparing the generated PTX, but that's futile because so much changed between versions.
/usr/local/cuda-10.0/bin/nvcc _deps/stokesian_dynamics-src/src/sd_gpu.cu --ptx -o ptx10.txt -Dsd_gpu_EXPORTS -DSD_USE_THRUST -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -std=c++14 -O3 -I/usr/include -I_deps/stokesian_dynamics-src/include
/usr/bin/nvcc _deps/stokesian_dynamics-src/src/sd_gpu.cu --ptx -o ptx9.txt -Dsd_gpu_EXPORTS -DSD_USE_THRUST -gencode=arch=compute_70,code=sm_70 -gencode=arch=compute_70,code=compute_70 -std=c++14 -O3 -I/usr/include -I_deps/stokesian_dynamics-src/include
diff -u ptx9.txt ptx10.txt |
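Since a full diff of the two PTX files is too noisy, a coarser comparison may still be informative. The sketch below reduces a PTX file to its header directives (`.version`, `.target`, `.address_size`) and the sorted list of kernel entry names; `ptx_summary` is a hypothetical helper, and it only assumes that kernels are declared with the standard `.entry` directive.

```shell
#!/bin/bash
# Hypothetical helper: summarize a PTX file by its header directives
# and the names of its kernel entry points.
ptx_summary() {
    local ptx_file="$1"
    # Header directives, e.g. ".version 6.3", ".target sm_70"
    grep -E '^\.(version|target|address_size)' "$ptx_file"
    # Kernel entry points, e.g. ".visible .entry _Z3fooPf(" -> "_Z3fooPf"
    grep -oE '\.entry[[:space:]]+[A-Za-z0-9_$]+' "$ptx_file" | awk '{print $2}' | sort
}
```

With the files generated above, `diff <(ptx_summary ptx9.txt) <(ptx_summary ptx10.txt)` would show whether the two compilers at least agree on the PTX version, target, and set of kernels, without wading through the instruction-level differences.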
Disable Stokesian Dynamics on GPU until the build system issue is sorted out and the GPU code passes CI on all platforms, as discussed in the [2020-07-28 ESPResSo meeting](https://github.com/espressomd/espresso/wiki/Espresso-meeting-2020-07-28) to avoid failing espresso builds (#3836) and failing python tests (#3826).
Why is all the SD dependency management happening on our side? This should be handled by the CMake build of the external library. |
The Stokesian Dynamics GPU test fails on my machine with
System Info
The GPU is an NVIDIA GeForce RTX 2080.