vader transport appears to leave SHM files lying around after successful termination #7220
Comments
Looks like this problem does not exist in 4.0.2. I haven't figured out which commit corrects the issue, however.
This does not seem to be true.
Okay - I tried backporting the patch from #6565 because it fit much of the description, but it does not actually fix the problem for 3.1.4. I tried testing 3.1.5 but failed to build it due to the GLIBC_PRIVATE issue.
These files are supposed to be cleaned up by PMIx. Not sure why that isn't happening in this case.
FWIW: we discussed this on the weekly OMPI call today:
I examined OMPI v4.0.2 and it appears to be doing everything correctly (ditto for master). I cannot see any reason why it would be leaving those files behind. Even the terminate-by-signal path flows through the cleanup. No real ideas here - can anyone replicate this behavior? I can't on my VMs - it all works correctly.
@rhc54, I can confirm it's working. I think it's the same underlying issue I'm running into in #7308: if another user left behind a segment file and there is a segment file name conflict with my current job, the run aborts with "permission denied" because the existing segment file can't be opened. As @jsquyres pointed out, it seems to be an issue with PMIx 2.x. While @hjelmn is looking into possible workarounds, I'm wondering if we can use PMIx 3.x with Open MPI 3.1.5?
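As a rough diagnostic for the conflict described above, something along these lines can reveal stale segments left behind by another user (the filename pattern and location are assumptions based on the reports in this thread):

```bash
# List vader segments in /dev/shm owned by someone other than the current user;
# a name clash with one of these is what triggers the "permission denied" abort.
# (Segment name pattern assumed from the filenames reported in this issue.)
find /dev/shm -maxdepth 1 -name 'vader_segment.*' ! -user "$USER" -ls
```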
Sorry for the confusion: it was a bug in our setup. I can now confirm that /dev/shm/vader* files are cleaned up after SIGTERM in Open MPI 4.0.2.
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
3.1.4
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
Packaged with Intel OPA 10.10.0.0.445
Please describe the system on which you are running
Two back-to-back Xeon systems, one running RHEL 7.6 and the other RHEL 8.0.
Details of the problem
I was using OMPI to stress-test some minor changes to the OPA PSM library when I discovered that the vader transport appears to be leaking memory-mapped files.
I wrote a bash script to run the OSU micro-benchmarks in a continuous loop, alternating between the PSM2 MTL and the OFI MTL. After a 24-hour run, I ran into "resource exhausted" errors when trying to start new shells, execute shell scripts, etc.
Investigating, I found over 100k shared-memory files in /dev/shm, all of the form
`vader_segment.<hostname>.<hex number>.<decimal number>`
It's not clear at this point that the shared-memory files are the cause of the problems I had, but they certainly shouldn't be there!
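For anyone trying to reproduce this, a quick way to count and inspect the leftover segments (assuming the filename pattern above):

```bash
# Count leftover vader segments; find avoids arg-list limits when there are
# tens of thousands of files.
find /dev/shm -maxdepth 1 -name 'vader_segment.*' | wc -l

# Show owner, size, and timestamp for a sample of them.
find /dev/shm -maxdepth 1 -name 'vader_segment.*' -ls | head
```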
Sample run lines:
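The exact commands are not reproduced here; run lines of roughly this shape exercise the two MTLs in question (hosts, benchmark paths, and process counts below are illustrative assumptions, not the original commands):

```bash
# Illustrative only; hosts, benchmark paths, and process counts are assumptions.
mpirun -np 2 --host node01,node02 --mca pml cm --mca mtl psm2 ./osu_latency
mpirun -np 2 --host node01,node02 --mca pml cm --mca mtl ofi  ./osu_bw
```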
Script that was used to run the benchmarks:
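The original script is not included above; a minimal sketch of the kind of loop described (benchmark directory, hostnames, and MCA options are assumptions):

```bash
#!/bin/bash
# Hypothetical sketch of a continuous stress loop like the one described above;
# the benchmark directory, hostnames, and options are assumptions.
OSU_DIR=/usr/local/libexec/osu-micro-benchmarks/mpi/pt2pt
HOSTS=node01,node02

while true; do
    for mtl in psm2 ofi; do                       # alternate PSM2 and OFI MTLs
        for bench in osu_latency osu_bw osu_bibw; do
            mpirun -np 2 --host "$HOSTS" \
                   --mca pml cm --mca mtl "$mtl" \
                   "$OSU_DIR/$bench"
        done
    done
done
```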