Memory leak with persistent MPI sends and the ob1 "get" protocol #6565
Comments
Ah, did #6550 fix the issue? @aingerson @s-kuberski Can you check the v4.0.x nightly snapshot tarball that will be generated tonight (i.e., in a few hours) and see if the issue is fixed for you? https://www.open-mpi.org/nightly/v4.0.x/ |
Will do! |
I still see the behaviour with the mpirun executable built from the tarball (i.e. |
I've tested the nightly snapshot twice with all of our tests and don't seem to be seeing the problem anymore |
An interesting disparity -- @s-kuberski can you try with the openmpi-v4.0.x-201904100241-811dfc6.tar.bz2 (or later) tarball from https://www.open-mpi.org/nightly/v4.0.x/? |
I will also add that for fun I just tested on the nightly tarball from a couple of days ago (before the fix was applied) and am still not seeing the error anymore... I'm going to do a bit more testing and figure out what exact test is triggering this issue. This is a race condition, right? |
#6550 went into v4.0.x a couple of days ago. |
It looks like it went in yesterday and I tested the snapshot from 4/6 |
I don't think my issue is actually related to this or to what I originally thought the fix was... |
This is the version I tried yesterday. Still a linear increase in the used memory... |
I am able to reproduce this problem on master. Two things I notice so far:
The vader CMA get/put methods are downright simple; I can't see where a leak would happen there (particularly if the receiver is doing the CMA read, but the sender process is the one that is growing without bound). This implies that it's an OB1 issue...? |
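For context, a CMA-style "get" on Linux is essentially a single process_vm_readv call over the caller's own iovecs, which is why it is hard to see where that path could leak. A rough sketch (the helper name and its arguments are illustrative only, not vader's actual code):

#define _GNU_SOURCE
#include <sys/types.h>
#include <sys/uio.h>    /* process_vm_readv() (Linux-specific) */

/* Illustrative helper, not vader source: pull 'len' bytes from address
 * 'remote_addr' in process 'pid' into 'local_buf'.  No intermediate
 * buffer is allocated, so there is nothing obvious to leak here. */
ssize_t cma_get(pid_t pid, void *remote_addr, void *local_buf, size_t len)
{
    struct iovec local  = { .iov_base = local_buf,   .iov_len = len };
    struct iovec remote = { .iov_base = remote_addr, .iov_len = len };
    return process_vm_readv(pid, &local, 1, &remote, 1, 0);
}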
Spoiler: I can reproduce the problem with the TCP BTL. Expanding on the list from above:
Since the problem also happens with the TCP BTL, I'd say that vader is in the clear. This is likely an ob1 problem, or perhaps a general persistent send problem. Here's how I reproduced the problem:
# Vader
# Both "emulated" and "cma" trigger the problem
$ mpirun --mca btl vader,self \
--mca btl_vader_flags send,get,inplace,atomics,fetching-atomics \
--mca btl_vader_single_copy_mechanism emulated \
-np 2 ./leaky-mcleakface 40000000
# TCP
$ mpirun --mca btl tcp,self \
--mca btl_tcp_flags send,get,inplace \
-np 2 ./leaky-mcleakface 40000000 |
I commented out
I ran:
|
For blocking and nonblocking operations, the
For persistent operations, the
Do we need to return
I don't have enough time to investigate more this week... |
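The comment above (partly truncated in this capture) contrasts how fragments are handled for blocking/nonblocking versus persistent operations, and the fix referenced next is described as fixing "the leak of fragments for persistent sends". In outline, that kind of lifetime difference looks roughly like the sketch below; every name here (request_t, frag_alloc, frag_return, ...) is hypothetical and none of it is actual ob1 code.

#include <stdlib.h>

typedef struct { void *get_frag; int persistent; } request_t;

static void *frag_alloc(void)        { return malloc(64); }
static void  frag_return(void *frag) { free(frag); }

/* Every start of a "get"-protocol send allocates a fragment that
 * describes the transfer. */
static void request_start(request_t *req)
{
    req->get_frag = frag_alloc();
}

static void request_complete(request_t *req)
{
    if (req->get_frag != NULL) {
        /* For blocking/nonblocking sends this return happens implicitly
         * when the request itself is destroyed.  A persistent request
         * survives for the next MPI_Start, so unless the fragment is
         * returned explicitly on this path too, each start/wait cycle
         * leaks one fragment -- matching the linear memory growth
         * reported in this issue. */
        frag_return(req->get_frag);
        req->get_frag = NULL;
    }
}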
Fix the leak of fragments for persistent sends (issue #6565)
Resolved on v4.0.x in PR #6634. |
It sounds like this issue can be closed? |
Closing - this went into the 4.0 series and is in master. 3.0 is closed as far as I know. If someone disagrees, please reopen. |
Background information
A memory leak appears when using persistent communication with the vader BTL and large message sizes.
What version of Open MPI are you using? (e.g., v1.10.3, v2.1.0, git branch name and hash, etc.)
4.0.0 on local computer, 2.1.2 and 3.1.3 on clusters
Describe how Open MPI was installed
4.0.0: from tarball
Please describe the system on which you are running
Details of the problem
When running a simulation program with Open MPI, a memory leak appeared that eventually caused the application to crash. The behaviour can be reproduced with the attached code block (a sketch of such a reproducer is given after the list below).
When the vader BTL is used, the used memory increases linearly over time. This bug is directly connected to the message size:
- With btl_vader_eager_limit 4096 and a message size of 4041 bytes, the bug appears. If the eager limit is raised or the message size is decreased, no problem occurs.
- Only btl_vader_single_copy_mechanism set to cma could be tested; no problem is seen if the value is set to none.
- If buffered communication is used and the buffer is detached and attached manually, the problem does not appear.
- Only the shared-memory communication with vader is affected. If the processes are located on different nodes or -mca btl ^vader is set, everything is fine.
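The original reproducer attachment is not included above; a minimal sketch of the kind of program described (a persistent send/recv of a message just above the vader eager limit, started in a long loop) could look like the following. The message size and iteration count here are illustrative assumptions, not values taken from the report.

/* leaky_sketch.c -- hypothetical reproducer sketch, not the original
 * attachment.  Rank 0 repeatedly starts a persistent send of a message
 * just above the default btl_vader_eager_limit (4096 bytes); rank 1
 * posts the matching persistent receive.  Run with exactly 2 ranks on
 * one node so the vader BTL is used. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    const int  count = 4100;       /* bytes; illustrative, just above the eager limit */
    const long iters = 1000000;    /* illustrative iteration count */

    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {
        if (rank == 0) fprintf(stderr, "run with exactly 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    char *buf = malloc(count);
    MPI_Request req;

    if (rank == 0)
        MPI_Send_init(buf, count, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &req);
    else
        MPI_Recv_init(buf, count, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &req);

    /* The reported leak shows up as resident memory growing linearly
     * with the number of start/wait cycles on the sender side. */
    for (long i = 0; i < iters; i++) {
        MPI_Start(&req);
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }

    MPI_Request_free(&req);
    free(buf);
    MPI_Finalize();
    return 0;
}

Per the observations above, running both ranks on one node triggers the growth, while raising btl_vader_eager_limit above the message size or excluding vader with -mca btl ^vader avoids it.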