ROCm 3.3 support #3623

Merged: 5 commits merged into espressomd:python on Apr 3, 2020
Conversation

mkuron (Member) commented on Apr 2, 2020

The ln -s /opt/rocm/bin/hcc* /opt/rocm/hip/bin/ issue has been worked around by properly setting HCC_PATH on the CMake side.
The shutdown issue has been worked around by replacing interrupts with polling (as suggested in ROCm/roctracer#22 (comment)). Something is wrong with the destruction order in our code, but I cannot easily identify what; it is not the missing cudaStreamDestroy, though.
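
For reference, a minimal sketch of what a polling workaround can look like, assuming the ROCr runtime's HSA_ENABLE_INTERRUPT environment variable is the switch in question; the exact mechanism used in this branch may differ (e.g. it could be set in the CI environment instead):

#include <cstdlib>

// Illustration only: ask the ROCr runtime to poll for signal completion
// instead of waiting on interrupts. The variable must be set before the
// HIP/ROCr runtime is initialized, i.e. at the very start of the program
// or in the launching environment.
int main(int argc, char **argv) {
  setenv("HSA_ENABLE_INTERRUPT", "0", /*overwrite=*/1);
  // ... initialize HIP and run the simulation as usual ...
  return 0;
}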

Fixes #3620 (verified with ctest -R save_checkpoint_lb.cpu-p3m.cpu-lj-therm.lb_1 --repeat-until-fail 1000).
Fixes #3587 (verified with ctest -R ek_charged_plate --repeat-until-fail 100).

TODO

codecov bot commented on Apr 2, 2020

Codecov Report

Merging #3623 into python will decrease coverage by <1%.
The diff coverage is n/a.


@@           Coverage Diff            @@
##           python   #3623     +/-   ##
========================================
- Coverage      87%     87%     -1%     
========================================
  Files         524     512     -12     
  Lines       23409   22036   -1373     
========================================
- Hits        20595   19371   -1224     
+ Misses       2814    2665    -149
Impacted Files Coverage Δ
...al_sites/lb_inertialess_tracers_cuda_interface.cpp 0% <0%> (-100%) ⬇️
...rialization/ibm_cuda_particle_velocities_input.hpp 0% <0%> (-100%) ⬇️
src/core/actor/ActorList.cpp 0% <0%> (-80%) ⬇️
src/core/EspressoSystemInterface.cpp 40% <0%> (-51%) ⬇️
...tatics_magnetostatics/magnetic_non_p3m_methods.cpp 0% <0%> (-32%) ⬇️
src/script_interface/shapes/Slitpore.hpp 80% <0%> (-11%) ⬇️
...rc/core/grid_based_algorithms/lb_interpolation.cpp 85% <0%> (-10%) ⬇️
src/core/electrostatics_magnetostatics/dipole.cpp 58% <0%> (-6%) ⬇️
src/core/forces.cpp 96% <0%> (-4%) ⬇️
src/core/energy.cpp 95% <0%> (-3%) ⬇️
... and 153 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc2d279...d0b4c33. Read the comment docs.

mkuron (Member, author) commented on Apr 2, 2020

No idea what's wrong with the ek_fluctuations test: Philox works and the LB thermalization statistics are correct, but the EK thermalization statistics are not. The responsible function does not look suspicious in any way:

__device__ void ek_add_fluctuations_to_flux(unsigned int index,
                                            unsigned int species_index,
                                            unsigned int *neighborindex,
                                            LB_nodes_gpu lb_node,
                                            uint64_t philox_counter) {
  if (index < ek_parameters_gpu->number_of_nodes) {
    float density = ek_parameters_gpu->rho[species_index][index];
    float *flux = ek_parameters_gpu->j;
    float diffusion = ek_parameters_gpu->D[species_index];
    float time_step = ek_parameters_gpu->time_step;
    float agrid = ek_parameters_gpu->agrid;
    float4 random_floats;
    float random;
#ifdef EK_DEBUG
    float *flux_fluc = ek_parameters_gpu->j_fluc;
#endif
    float fluc = 0.0f;
    for (int i = 0; i < 9; i++) {
      if (i % 4 == 0) {
        random_floats = random_wrapper_philox(index, i + 40, philox_counter);
        random = (random_floats.w - 0.5f) * 2.0f;
      } else if (i % 4 == 1) {
        random = (random_floats.x - 0.5f) * 2.0f;
      } else if (i % 4 == 2) {
        random = (random_floats.y - 0.5f) * 2.0f;
      } else if (i % 4 == 3) {
        random = (random_floats.z - 0.5f) * 2.0f;
      }
      float H = 0.0f;
      float HN = 0.0f;
      float neighbor_density =
          ek_parameters_gpu->rho[species_index][neighborindex[i]];
      H = (density >= 0.0f) * min(density, 1.0f);
      HN = (neighbor_density >= 0.0f) * min(neighbor_density, 1.0f);
      float average_density = H * HN * (density + neighbor_density) / 2.0f;
      if (i > 2) {
        fluc = 1.0f *
               powf(2.0f * average_density * diffusion * time_step /
                        (agrid * agrid),
                    0.5f) *
               random * ek_parameters_gpu->fluctuation_amplitude / sqrt(2.0f);
        fluc *=
            !(lb_node.boundary[index] || lb_node.boundary[neighborindex[i]]);
#ifdef EK_DEBUG
        flux_fluc[jindex_getByRhoLinear(index, i)] = fluc;
#endif
        flux[jindex_getByRhoLinear(index, i)] += fluc;
      } else {
        fluc = 1.0f *
               powf(2.0f * average_density * diffusion * time_step /
                        (agrid * agrid),
                    0.5f) *
               random * ek_parameters_gpu->fluctuation_amplitude;
        fluc *=
            !(lb_node.boundary[index] || lb_node.boundary[neighborindex[i]]);
#ifdef EK_DEBUG
        flux_fluc[jindex_getByRhoLinear(index, i)] = fluc;
#endif
        flux[jindex_getByRhoLinear(index, i)] += fluc;
      }
    }
  }
}

Is anyone familiar with that test?
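
In case it helps whoever picks this up: the only slightly non-obvious part of the loop is how one Philox draw (a float4) is shared by four consecutive flux directions, with a fresh draw every fourth iteration at counter offset i + 40. Below is a small host-side illustration of just that indexing pattern; the types and the RNG stub are hypothetical stand-ins, and only the unpacking order matches the kernel.

#include <cstdio>

// Hypothetical stand-in for random_wrapper_philox: the real device function
// returns a float4 keyed on (node index, counter offset, philox_counter).
struct Float4 { float x, y, z, w; };
static Float4 fake_philox(unsigned /*index*/, int /*offset*/,
                          unsigned long long /*counter*/) {
  return {0.1f, 0.2f, 0.3f, 0.4f}; // deterministic dummy values
}

int main() {
  Float4 r{};
  for (int i = 0; i < 9; i++) {
    float random;
    if (i % 4 == 0) { // fresh draws happen at i = 0, 4, 8
      r = fake_philox(0, i + 40, 0);
      random = (r.w - 0.5f) * 2.0f;
    } else if (i % 4 == 1) {
      random = (r.x - 0.5f) * 2.0f;
    } else if (i % 4 == 2) {
      random = (r.y - 0.5f) * 2.0f;
    } else {
      random = (r.z - 0.5f) * 2.0f;
    }
    std::printf("direction %d uses draw %d (component %c), value %+.2f\n",
                i, i / 4, "wxyz"[i % 4], random);
  }
  return 0;
}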

jngrad (Member) commented on Apr 2, 2020

Note: while tinkering with /opt/rocm/bin/hipcc compiler flags in #3582, I get ek_fluctuations to fail reproducibly when passing flags -O3, -Og or -g to the compiler for .cu files.

Review thread on cmake/FindCUDACompiler.cmake (resolved)
Commit: this flag was added by Clang 10
mkuron (Member, author) commented on Apr 2, 2020

I get ek_fluctuations to fail reproducibly when passing flags -O3, -Og or -g to the compiler for .cu files.

Interesting. I'll bisect optimization flags tomorrow to find out what is causing it.

EDIT: -fsave-optimization-record helped me discover that -fno-inline-functions works around the issue. I guess there is an unsafe inline somewhere.
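
Side note, not what this PR does: if a single device function is suspected, a more surgical alternative to -fno-inline-functions is to mark just that function __noinline__ (supported by both CUDA and HIP), keeping optimizations on everywhere else. A self-contained sketch with hypothetical names:

#include <cstdio>

// __noinline__ keeps the compiler from inlining this one device function
// while the rest of the translation unit stays fully optimized.
__device__ __noinline__ float scaled_random(float component) {
  return (component - 0.5f) * 2.0f; // map [0, 1) to [-1, 1)
}

__global__ void demo(float *out) { out[0] = scaled_random(0.25f); }

int main() {
  float *d_out = nullptr;
  float h_out = 0.0f;
  cudaMalloc(&d_out, sizeof(float));
  demo<<<1, 1>>>(d_out);
  cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_out);
  std::printf("%f\n", h_out); // expected: -0.5
  return 0;
}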

Commit: also make sure that the EK uses only 32-bit floats and that it calls its own RNG wrapper and not the LB's
mkuron requested a review from jngrad on Apr 3, 2020, 07:55
mkuron (Member, author) commented on Apr 3, 2020

Fixed now. That nullptr hack seems to have weird interactions with inlining in recent LLVM versions.

jngrad (Member) commented on Apr 3, 2020

ek_fluctuations is now fixed, thanks!

What should we do now with regard to ROCm support? We aren't the only ones affected by the disastrous ROCm versioning strategy. Only supporting ROCm 3.3 might be an issue for users who pinned an earlier ROCm version on their system to stabilize their environment. Should we test both ROCm 3.0 and 3.3 in CI? We could test one of them on a weekly schedule to limit the workload. Only testing 3.3 in CI runs the risk of letting a regression on 3.0 go unnoticed.

jngrad added this to the Espresso 4.1.3 milestone on Apr 3, 2020
jngrad added the BugFix label on Apr 3, 2020
mkuron (Member, author) commented on Apr 3, 2020

Only supporting ROCm 3.3 might be an issue for users who pinned an earlier ROCm version on their system to stabilize their environment.

This has become less of an issue now that you can install multiple ROCm versions side-by-side.

The changes in this merge request certainly won't break compatibility with 3.0. Testing only 3.0 wouldn't be sufficient either, as someone might be pinning an even older version, and testing every release since 2.0 isn't an option. That means we can only support older versions on a best-effort basis and simply guard any problematic changes with version checks. Looking back at my ROCm compatibility patches since v2.0, they primarily deal with CMake issues; silent breakage like the ek_fluctuations test is rather rare.
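
For the record, a sketch of what such a version guard could look like on the code side, assuming the HIP headers of the targeted releases provide HIP_VERSION_MAJOR / HIP_VERSION_MINOR; the guarded macro is hypothetical, and given that past compatibility patches were mostly CMake-side, a CMake-level check may be the more natural place in practice:

#include <cstdio>
#include <hip/hip_runtime.h>

// Keep a workaround for older ROCm releases behind a compile-time version
// check so that newer toolchains take the unmodified code path.
#if HIP_VERSION_MAJOR < 3 || (HIP_VERSION_MAJOR == 3 && HIP_VERSION_MINOR < 3)
#define EK_USE_LEGACY_WORKAROUND 1 // hypothetical pre-ROCm-3.3 code path
#else
#define EK_USE_LEGACY_WORKAROUND 0
#endif

int main() {
  std::printf("legacy workaround enabled: %d\n", EK_USE_LEGACY_WORKAROUND);
  return 0;
}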

jngrad added the automerge (Merge with kodiak) label on Apr 3, 2020
kodiakhq bot merged commit 23fd9f8 into espressomd:python on Apr 3, 2020
mkuron deleted the rocm branch on Apr 3, 2020, 11:38
Labels: automerge (Merge with kodiak), BugFix
Merging this pull request may close these issues: "ROCm: Memory access fault by GPU node" and "CI build failed for merged PR".
2 participants