ROCm 3.3 support #3623

Merged: 5 commits merged into espressomd:python on Apr 3, 2020
Conversation

mkuron (Member) commented on Apr 2, 2020

The ln -s /opt/rocm/bin/hcc* /opt/rocm/hip/bin/ issue has been worked around by properly setting HCC_PATH on the CMake side.
The shutdown issue has been worked around by replacing interrupts with polling (as suggested in ROCm/roctracer#22 (comment)). Something is wrong with the destruction order in our code, but I cannot easily identify what; it is not the missing cudaStreamDestroy, though.
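
For reference, a minimal sketch of what a polling workaround can look like, assuming the ROCr runtime's HSA_ENABLE_INTERRUPT environment variable is the switch in question; the exact mechanism used in this branch may differ (e.g. it could be set in the CI environment instead):

#include <cstdlib>

// Illustration only: ask the ROCr runtime to poll for signal completion
// instead of waiting on interrupts. The variable must be set before the
// HIP/ROCr runtime is initialized, i.e. at the very start of the program
// or in the launching environment.
int main(int argc, char **argv) {
  setenv("HSA_ENABLE_INTERRUPT", "0", /*overwrite=*/1);
  // ... initialize HIP and run the simulation as usual ...
  return 0;
}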

Fixes #3620 (verified with ctest -R save_checkpoint_lb.cpu-p3m.cpu-lj-therm.lb_1 --repeat-until-fail 1000).
Fixes #3587 (verified with ctest -R ek_charged_plate --repeat-until-fail 100).

TODO

codecov bot commented on Apr 2, 2020

Codecov Report

Merging #3623 into python will decrease coverage by <1%.
The diff coverage is n/a.


@@           Coverage Diff            @@
##           python   #3623     +/-   ##
========================================
- Coverage      87%     87%     -1%     
========================================
  Files         524     512     -12     
  Lines       23409   22036   -1373     
========================================
- Hits        20595   19371   -1224     
+ Misses       2814    2665    -149
Impacted Files Coverage Δ
...al_sites/lb_inertialess_tracers_cuda_interface.cpp 0% <0%> (-100%) ⬇️
...rialization/ibm_cuda_particle_velocities_input.hpp 0% <0%> (-100%) ⬇️
src/core/actor/ActorList.cpp 0% <0%> (-80%) ⬇️
src/core/EspressoSystemInterface.cpp 40% <0%> (-51%) ⬇️
...tatics_magnetostatics/magnetic_non_p3m_methods.cpp 0% <0%> (-32%) ⬇️
src/script_interface/shapes/Slitpore.hpp 80% <0%> (-11%) ⬇️
...rc/core/grid_based_algorithms/lb_interpolation.cpp 85% <0%> (-10%) ⬇️
src/core/electrostatics_magnetostatics/dipole.cpp 58% <0%> (-6%) ⬇️
src/core/forces.cpp 96% <0%> (-4%) ⬇️
src/core/energy.cpp 95% <0%> (-3%) ⬇️
... and 153 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc2d279...d0b4c33. Read the comment docs.

mkuron (Member, author) commented on Apr 2, 2020

No idea what's wrong with the ek_fluctuations test: Philox works and the LB thermalization statistics are correct, but the EK thermalization statistics are not. The responsible function does not look suspicious in any way:

__device__ void ek_add_fluctuations_to_flux(unsigned int index,
                                            unsigned int species_index,
                                            unsigned int *neighborindex,
                                            LB_nodes_gpu lb_node,
                                            uint64_t philox_counter) {
  if (index < ek_parameters_gpu->number_of_nodes) {
    float density = ek_parameters_gpu->rho[species_index][index];
    float *flux = ek_parameters_gpu->j;
    float diffusion = ek_parameters_gpu->D[species_index];
    float time_step = ek_parameters_gpu->time_step;
    float agrid = ek_parameters_gpu->agrid;
    float4 random_floats;
    float random;
#ifdef EK_DEBUG
    float *flux_fluc = ek_parameters_gpu->j_fluc;
#endif
    float fluc = 0.0f;
    for (int i = 0; i < 9; i++) {
      if (i % 4 == 0) {
        random_floats = random_wrapper_philox(index, i + 40, philox_counter);
        random = (random_floats.w - 0.5f) * 2.0f;
      } else if (i % 4 == 1) {
        random = (random_floats.x - 0.5f) * 2.0f;
      } else if (i % 4 == 2) {
        random = (random_floats.y - 0.5f) * 2.0f;
      } else if (i % 4 == 3) {
        random = (random_floats.z - 0.5f) * 2.0f;
      }
      float H = 0.0f;
      float HN = 0.0f;
      float neighbor_density =
          ek_parameters_gpu->rho[species_index][neighborindex[i]];
      H = (density >= 0.0f) * min(density, 1.0f);
      HN = (neighbor_density >= 0.0f) * min(neighbor_density, 1.0f);
      float average_density = H * HN * (density + neighbor_density) / 2.0f;
      if (i > 2) {
        fluc = 1.0f *
               powf(2.0f * average_density * diffusion * time_step /
                        (agrid * agrid),
                    0.5f) *
               random * ek_parameters_gpu->fluctuation_amplitude / sqrt(2.0f);
        fluc *=
            !(lb_node.boundary[index] || lb_node.boundary[neighborindex[i]]);
#ifdef EK_DEBUG
        flux_fluc[jindex_getByRhoLinear(index, i)] = fluc;
#endif
        flux[jindex_getByRhoLinear(index, i)] += fluc;
      } else {
        fluc = 1.0f *
               powf(2.0f * average_density * diffusion * time_step /
                        (agrid * agrid),
                    0.5f) *
               random * ek_parameters_gpu->fluctuation_amplitude;
        fluc *=
            !(lb_node.boundary[index] || lb_node.boundary[neighborindex[i]]);
#ifdef EK_DEBUG
        flux_fluc[jindex_getByRhoLinear(index, i)] = fluc;
#endif
        flux[jindex_getByRhoLinear(index, i)] += fluc;
      }
    }
  }
}

Is anyone familiar with that test?
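
In case it helps whoever picks this up: the only slightly non-obvious part of the loop is how one Philox draw (a float4) is shared by four consecutive flux directions, with a fresh draw every fourth iteration at counter offset i + 40. Below is a small host-side illustration of just that indexing pattern; the types and the RNG stub are hypothetical stand-ins, and only the unpacking order matches the kernel.

#include <cstdio>

// Hypothetical stand-in for random_wrapper_philox: the real device function
// returns a float4 keyed on (node index, counter offset, philox_counter).
struct Float4 { float x, y, z, w; };
static Float4 fake_philox(unsigned /*index*/, int /*offset*/,
                          unsigned long long /*counter*/) {
  return {0.1f, 0.2f, 0.3f, 0.4f}; // deterministic dummy values
}

int main() {
  Float4 r{};
  for (int i = 0; i < 9; i++) {
    float random;
    if (i % 4 == 0) { // fresh draws happen at i = 0, 4, 8
      r = fake_philox(0, i + 40, 0);
      random = (r.w - 0.5f) * 2.0f;
    } else if (i % 4 == 1) {
      random = (r.x - 0.5f) * 2.0f;
    } else if (i % 4 == 2) {
      random = (r.y - 0.5f) * 2.0f;
    } else {
      random = (r.z - 0.5f) * 2.0f;
    }
    std::printf("direction %d uses draw %d (component %c), value %+.2f\n",
                i, i / 4, "wxyz"[i % 4], random);
  }
  return 0;
}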

jngrad (Member) commented on Apr 2, 2020

Note: while tinkering with /opt/rocm/bin/hipcc compiler flags in #3582, I get ek_fluctuations to fail reproducibly when passing flags -O3, -Og or -g to the compiler for .cu files.

Review thread on cmake/FindCUDACompiler.cmake (resolved)
Commit: this flag was added by Clang 10
mkuron (Member, author) commented on Apr 2, 2020

I get ek_fluctuations to fail reproducibly when passing flags -O3, -Og or -g to the compiler for .cu files.

Interesting. I'll bisect optimization flags tomorrow to find out what is causing it.

EDIT: -fsave-optimization-record helped me discover that -fno-inline-functions works around the issue. I guess there is an unsafe inline somewhere.
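
Side note, not what this PR does: if a single device function is suspected, a more surgical alternative to -fno-inline-functions is to mark just that function __noinline__ (supported by both CUDA and HIP), keeping optimizations on everywhere else. A self-contained sketch with hypothetical names:

#include <cstdio>

// __noinline__ keeps the compiler from inlining this one device function
// while the rest of the translation unit stays fully optimized.
__device__ __noinline__ float scaled_random(float component) {
  return (component - 0.5f) * 2.0f; // map [0, 1) to [-1, 1)
}

__global__ void demo(float *out) { out[0] = scaled_random(0.25f); }

int main() {
  float *d_out = nullptr;
  float h_out = 0.0f;
  cudaMalloc(&d_out, sizeof(float));
  demo<<<1, 1>>>(d_out);
  cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_out);
  std::printf("%f\n", h_out); // expected: -0.5
  return 0;
}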

Commit: also make sure that the EK uses only 32-bit floats and that it calls its own RNG wrapper and not the LB's
mkuron requested a review from jngrad on Apr 3, 2020, 07:55
mkuron (Member, author) commented on Apr 3, 2020

Fixed now. That nullptr hack seems to have weird interactions with inlining in recent LLVM versions.

jngrad (Member) commented on Apr 3, 2020

ek_fluctuations is now fixed, thanks!

What should we do now with regard to ROCm support? We aren't the only ones affected by the disastrous ROCm versioning strategy. Only supporting ROCm 3.3 might be an issue for users who pinned an earlier ROCm version on their system to stabilize their environment. Should we test both ROCm 3.0 and 3.3 in CI? We could test one of them on a weekly schedule to limit the workload. Only testing 3.3 in CI runs the risk of letting a regression on 3.0 go unnoticed.

jngrad added this to the Espresso 4.1.3 milestone on Apr 3, 2020
jngrad added the BugFix label on Apr 3, 2020
mkuron (Member, author) commented on Apr 3, 2020

Only supporting ROCm 3.3 might be an issue for users who pinned an earlier ROCm version on their system to stabilize their environment.

This has become less of an issue now that you can install multiple ROCm versions side-by-side.

The changes in this merge request certainly won't break compatibility with 3.0. Testing only 3.0 wouldn't be sufficient either, as someone might be pinning an even older version, and testing every release since 2.0 isn't an option. That means we can only support older versions on a best-effort basis and simply guard any problematic changes with version checks. Looking back at my ROCm compatibility patches since v2.0, they primarily deal with CMake issues; silent breakage like the ek_fluctuations test is rather rare.
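
For the record, a sketch of what such a version guard could look like on the code side, assuming the HIP headers of the targeted releases provide HIP_VERSION_MAJOR / HIP_VERSION_MINOR; the guarded macro is hypothetical, and given that past compatibility patches were mostly CMake-side, a CMake-level check may be the more natural place in practice:

#include <cstdio>
#include <hip/hip_runtime.h>

// Keep a workaround for older ROCm releases behind a compile-time version
// check so that newer toolchains take the unmodified code path.
#if HIP_VERSION_MAJOR < 3 || (HIP_VERSION_MAJOR == 3 && HIP_VERSION_MINOR < 3)
#define EK_USE_LEGACY_WORKAROUND 1 // hypothetical pre-ROCm-3.3 code path
#else
#define EK_USE_LEGACY_WORKAROUND 0
#endif

int main() {
  std::printf("legacy workaround enabled: %d\n", EK_USE_LEGACY_WORKAROUND);
  return 0;
}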

jngrad added the automerge (Merge with kodiak) label on Apr 3, 2020
kodiakhq bot merged commit 23fd9f8 into espressomd:python on Apr 3, 2020
mkuron deleted the rocm branch on Apr 3, 2020, 11:38
Labels: automerge (Merge with kodiak), BugFix
Merging this pull request may close these issues: "ROCm: Memory access fault by GPU node" and "CI build failed for merged PR".
2 participants