Prevent object store from allocating over the specified limit even if there is memory fragmentation #15951
Conversation
(Just re-mentioning it here.) I think you are doing this the right way. The fundamental solution would be to use another allocator library that creates fewer fragments (dlmalloc is rather old). Another option is to change dlmalloc's alignment: it is currently forced to 64-byte alignment, which increases internal fragmentation.
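To make the alignment point concrete, here is a hypothetical sketch (not Ray or dlmalloc code) showing how forcing every allocation up to a 64-byte boundary wastes padding bytes, i.e. internal fragmentation:

```python
# Hypothetical sketch: padding cost of a forced 64-byte alignment.
# Neither function exists in Ray or dlmalloc; they are illustrative only.

ALIGNMENT = 64  # the alignment mentioned in the comment above

def aligned_size(requested: int, alignment: int = ALIGNMENT) -> int:
    """Round a requested size up to the next multiple of `alignment`."""
    return -(-requested // alignment) * alignment

def internal_fragmentation(requested_sizes) -> int:
    """Total bytes wasted by padding each request to the alignment."""
    return sum(aligned_size(s) - s for s in requested_sizes)

# Many small, oddly sized requests waste a large fraction of memory:
sizes = [100, 1, 65, 128, 200]
waste = internal_fragmentation(sizes)  # padding bytes lost to alignment
```

For example, a 65-byte request consumes 128 bytes under 64-byte alignment, so nearly half the space is padding.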
@@ -49,7 +49,7 @@ def create_object(self, size):
     # Submit enough methods on the actor so that they exceed the size of the
     # object store.
     objects = []
-    num_objects = 20
+    num_objects = 40
20 doesn't work due to memory fragmentation (60 MB chunks in a 150 MB store => lots of fragmentation).
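A back-of-the-envelope sketch of those numbers (60 MB objects in a 150 MB store, as stated in the comment above) shows why a large fraction of the store is unusable for this workload even before dlmalloc's own fragmentation is considered:

```python
# Illustrative arithmetic only; the sizes come from the review comment above.
MB = 1024 * 1024
store_capacity = 150 * MB
object_size = 60 * MB

# At most two 60 MB objects fit at a time...
objects_that_fit = store_capacity // object_size
# ...leaving 30 MB that can never hold another 60 MB object.
leftover = store_capacity - objects_that_fit * object_size
unusable_fraction = leftover / store_capacity  # 20% of the store
```

So effective capacity is 120 MB, not 150 MB, which is why the test needs more, smaller objects to reliably overflow the store.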
if (space_needed > 0) {
  // make room. NOTE(ekl) if we can't achieve this after a number of retries,
  // it's because memory fragmentation in dlmalloc prevents us from allocating
  // even if our footprint tracker here still says we have free space.
Why don't we use a different error type here for better error messages (e.g., a FragmentationError, raised once the retry count hits the limit)?
I don't think we should be exposing these low-level details to the user.
Hmm, I agree with that. But is there another good way to tell whether this is what caused an OOM issue? That could help us debug when it happens. Perhaps writing a log message?
I don't think this causes any new OOM issues (previously, you'd get SIGBUS, now you'd get a proper error message or spilling).
Nice!!!
Given that this will effectively decrease the object store capacity for a given object store allocation, are there any changes that we should make to the docs? Should we change the default object store allocation to e.g. 90% of /dev/shm rather than 30% of available system memory? Anything like that?
  RAY_LOG(DEBUG) << "fake_mmap called once already, refusing to allocate: " << size;
  return MFAIL;
}
allocated_once = true;
Shouldn't this only be set if RayConfig::instance().preallocate_plasma_memory() is set? Otherwise, won't every object created after the first one fail to allocate when preallocation is turned off? Given that the tests are passing, I'm probably missing something here.
Plasma always allocates "object_store_size" bytes of memory on the first allocation; the preallocation flag only controls whether we tell the OS to pre-allocate the pages of the file (vs. leaving them as allocate-on-write pages).
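A toy model of the behavior described in that reply (this is not Plasma's actual implementation; `ToyStore` and its fields are illustrative): the full store file is mapped on the first allocation either way, and the preallocate flag only decides whether the pages are faulted in eagerly or lazily on first write.

```python
# Toy model of "map everything up front; preallocate controls page faulting".
class ToyStore:
    def __init__(self, store_size: int, preallocate: bool):
        self.store_size = store_size
        self.preallocate = preallocate
        self.mapped = 0    # bytes of address space reserved from the OS
        self.resident = 0  # bytes actually backed by physical pages

    def first_allocation(self):
        # The whole store is mapped on the first allocation regardless.
        self.mapped = self.store_size
        if self.preallocate:
            # Eagerly fault every page in (analogous to pre-allocating
            # the pages of the backing file).
            self.resident = self.store_size
        # Otherwise pages stay allocate-on-write: resident grows later.

lazy = ToyStore(store_size=100, preallocate=False)
eager = ToyStore(store_size=100, preallocate=True)
lazy.first_allocation()
eager.first_allocation()
```

In both cases `mapped` equals the full store size after the first allocation; only `resident` differs, which is why objects created after the first one still succeed with preallocation off.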
Ah, that's right, thanks!
Co-authored-by: Clark Zinzow <clarkzinzow@gmail.com>
Windows test failure was flaky (and succeeded on the previous build).
@@ -210,6 +210,7 @@ uint8_t *PlasmaStore::AllocateMemory(size_t size, MEMFD_TYPE *fd, int64_t *map_s
   // Try to evict objects until there is enough space.
   uint8_t *pointer = nullptr;
+  int num_tries = 0;
Do we need to retry here? We also retry at a higher level, at the CreateRequestQueue.
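The evict-and-retry pattern being discussed can be sketched as follows. This is a hypothetical simplification, not Ray's AllocateMemory: the names `free_space`, `evict_some`, and `max_tries` are illustrative, and real Plasma tracks space via its allocator rather than a counter.

```python
# Sketch of a bounded evict-and-retry allocation loop.
def allocate_with_eviction(request, free_space, evict_some, max_tries=10):
    """Try to allocate `request` bytes; evict and retry up to `max_tries`.

    `free_space` is a one-element list (mutable cell) holding free bytes.
    Returns True on success, False when retries are exhausted -- e.g. when
    fragmentation prevents allocation even though the footprint tracker
    says space exists.
    """
    for _ in range(max_tries):
        if request <= free_space[0]:
            free_space[0] -= request
            return True
        free_space[0] += evict_some()  # make room, then retry
    return False
```

The question above is whether this inner bounded loop is redundant given that the CreateRequestQueue already retries at a higher level.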
Why are these changes needed?
Currently, we enforce the size of the plasma store at the server. However, dlmalloc can, under the hood, try to allocate more memory than that due to internal fragmentation. This change disables those additional allocations, which could otherwise lead to SIGBUS if we run out of backing pages in /dev/shm.
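The mechanism can be sketched as a single-shot mmap guard, in the spirit of the fake_mmap change shown earlier in this conversation. This is an illustrative Python model, not the C++ implementation; `OneShotMmap` and the `bytearray` stand-in for the mapped region are hypothetical:

```python
# Sketch: after the first mapping of the full store, further mmap requests
# from the allocator are refused, so dlmalloc cannot grow past the limit.
MFAIL = None  # stand-in for dlmalloc's mmap failure sentinel

class OneShotMmap:
    def __init__(self):
        self.allocated_once = False

    def fake_mmap(self, size: int):
        if self.allocated_once:
            # Refuse: this is what prevents fragmentation-driven
            # over-allocation beyond the configured store size.
            return MFAIL
        self.allocated_once = True
        return bytearray(size)  # pretend this is the mapped region

m = OneShotMmap()
first = m.fake_mmap(16)   # succeeds: the one allowed mapping
second = m.fake_mmap(16)  # refused: returns MFAIL
```

With the extra mappings refused, dlmalloc reports allocation failure to Plasma (which can evict, spill, or surface an error) instead of touching pages /dev/shm cannot back, which previously manifested as SIGBUS.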
Related to
#14182