ROCm: Memory access fault by GPU node #3620

Closed
espresso-ci opened this issue Apr 1, 2020 · 1 comment · Fixed by #3623

Comments

@espresso-ci

https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/pipelines/11601

@jngrad
Member

jngrad commented Apr 1, 2020

https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/-/jobs/219168

  59/149 Test  #37: mass-and-rinertia_per_particle ............................***Exception: Child aborted  1.24 sec
1270 Memory access fault by GPU node-5 (Agent handle: 0x556fb8a2bdb0) on address 0x7fe79c84c000. Reason: Page not present or supervisor privilege.

@jngrad jngrad changed the title CI build failed for merged PR ROCm: Memory access fault by GPU node Apr 1, 2020
@kodiakhq kodiakhq bot closed this as completed in #3623 Apr 3, 2020
kodiakhq bot added a commit that referenced this issue Apr 3, 2020
The `ln -s /opt/rocm/bin/hcc* /opt/rocm/hip/bin/` issue has been worked around by properly setting `HCC_PATH` on the CMake side.
The shutdown issue has been worked around by replacing interrupts with polling (suggested in a comment on ROCm/roctracer#22). Something is wrong with the destruction order in our code, but I cannot easily identify what. It's not the missing `cudaStreamDestroy` call though.

Fixes #3620 (according to `ctest -R save_checkpoint_lb.cpu-p3m.cpu-lj-therm.lb_1 --repeat-until-fail 1000`).
Fixes #3587 (according to `ctest -R ek_charged_plate --repeat-until-fail 100`).
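
The two `ctest ... --repeat-until-fail N` checks above confirm an intermittent failure is gone by rerunning the same test many times and stopping at the first non-zero exit. A minimal Python sketch of that idea follows. The helper name `repeat_until_fail` and its `extra_env` parameter are hypothetical and not part of the project. The `HSA_ENABLE_INTERRUPT=0` setting in the usage comment is an assumption about how polling is selected in the ROCm runtime; the source does not name the variable.

```python
import os
import subprocess
import sys


def repeat_until_fail(cmd, max_runs, extra_env=None):
    """Run cmd up to max_runs times.

    Returns the 1-based index of the first run that exits with a
    non-zero status, or None if every run succeeds.
    """
    # Start from the current environment and layer optional overrides on top.
    env = dict(os.environ)
    if extra_env:
        env.update(extra_env)
    for run in range(1, max_runs + 1):
        result = subprocess.run(cmd, env=env)
        if result.returncode != 0:
            return run  # stop at the first failure, like --repeat-until-fail
    return None


# Hypothetical usage, mirroring the check quoted above:
#   repeat_until_fail(["ctest", "-R", "ek_charged_plate"], 100,
#                     extra_env={"HSA_ENABLE_INTERRUPT": "0"})  # assumed variable
```

A `None` result only shows that no failure occurred within `max_runs` attempts. It does not prove the underlying race is fixed, which matches the "according to" wording used above.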

**TODO**
- https://github.com/espressomd/docker/blob/master/docker/rocm-python3/Dockerfile-latest needs to be updated to ROCm 3.3 once this pull request is merged.