[Object Spilling] Plasma store uses more shared memory than object_store_memory sometimes when spilling #14182
Comments
I can reproduce this pretty easily by writing big files to /dev/shm ( |
It seems the issue here is pretty fundamental to the way we are using mmap. We are creating a file with unallocated pages in /dev/shm (to avoid immediately using memory). However, this means the application can get SIGBUS at any point if there are no more pages allocatable in /dev/shm. Possible alternatives include:
|
@rkooo567 can you try using the flag? If it works, I think we can add this to our documentation. |
The issue wasn't always reproducible. But I can try with a 100GB shuffle just to see how slow it is when initiated. @clarkzinzow can also verify it with his Uber workload. |
FYI the workaround is to set RAY_PREALLOCATE_PLASMA_MEMORY=1; available in nightly builds (introduced in https://github.com/ray-project/ray/pull/15669/files) |
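To make the difference between lazy allocation and preallocation concrete, here is a minimal Python sketch. It is not Ray's implementation; the helper names (`allocated_bytes`, `compare_lazy_and_preallocated`) are hypothetical, and it assumes a Linux-like system where `os.posix_fallocate` is available. A file grown with `ftruncate` has its full logical size but typically no backing pages, so a later write through an mmap can fail with SIGBUS once /dev/shm is full. A preallocated file reserves its pages up front, so running out of space surfaces as an `OSError` at creation time instead.

```python
import os
import tempfile


def allocated_bytes(fd):
    # st_blocks counts 512-byte units that are actually backed by storage.
    return os.fstat(fd).st_blocks * 512


def compare_lazy_and_preallocated(directory, size):
    """Return {mode: (logical_size, allocated_bytes)} for both strategies."""
    results = {}
    for mode in ("lazy", "preallocated"):
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            if mode == "lazy":
                # Only sets the logical size; pages are allocated on first touch.
                os.ftruncate(fd, size)
            else:
                # Reserves the space now; raises OSError (e.g. ENOSPC) on failure.
                os.posix_fallocate(fd, 0, size)
            results[mode] = (os.fstat(fd).st_size, allocated_bytes(fd))
        finally:
            os.close(fd)
            os.unlink(path)
    return results


if __name__ == "__main__":
    # Prefer /dev/shm when it exists; otherwise fall back to the temp directory.
    target = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
    stats = compare_lazy_and_preallocated(target, 4 * 1024 * 1024)
    for mode, (logical, allocated) in stats.items():
        print(f"{mode}: logical={logical} bytes, allocated={allocated} bytes")
```

The trade-off is that preallocation consumes the memory immediately, which is exactly what the lazy approach was trying to avoid, and it can slow down startup for large object stores.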
What is the problem?
It looks like sometimes (when there is heavy memory pressure) the plasma store uses more memory than object_store_memory, which causes the SIGBUS. For example, I ran one stressful spilling workload and received SIGBUS while my /dev/shm size was 120GB and the object store memory limit was 80GB. When I killed the raylet, all memory was freed.
Reproduction (REQUIRED)
Please provide a short code snippet (less than 50 lines if possible) that can be copy-pasted to reproduce the issue. The snippet should have no external library dependencies (i.e., use fake or mock data / environments):
No reproduction yet, but it should be relatively easy to create one.