[Object manager] don't abort entire pull request on race condition in concurrent chunk receive #18955

mwtian · 2021-09-28T22:00:10Z

Why are these changes needed?

See #18062 for investigation and background.

This change ensures that there is at most one in-flight operation creating the buffer for an object while multiple pushed chunks of that object are being handled. This avoids the race condition in which multiple operations race to create the buffer for the same object and fail, forcing the pull to be retried.
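
A minimal sketch of the "at most one in-flight create" idea, not the code in this PR. `BufferPoolSketch`, `EnsureBufferExists` as written here, and `CountCreatesForConcurrentChunks` are illustrative names, and `std::mutex` / `std::condition_variable` are used only to keep the example self-contained (the review below asks for the absl equivalents). The first thread to arrive creates the buffer; the others wait on a condition variable and then re-check.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for the object buffer pool; all names are illustrative.
class BufferPoolSketch {
 public:
  // Returns true once a buffer for object_id exists. At most one caller per
  // object performs the creation; concurrent callers wait for it to finish.
  bool EnsureBufferExists(const std::string &object_id, uint64_t size) {
    std::unique_lock<std::mutex> lock(mu_);
    while (buffers_.count(object_id) == 0) {
      if (creating_.count(object_id) > 0) {
        // Another thread is creating this buffer: wait, then re-check.
        cv_.wait(lock, [&] { return creating_.count(object_id) == 0; });
        continue;
      }
      creating_.insert(object_id);
      lock.unlock();
      // Stands in for the slow store call, done without holding the lock.
      std::vector<uint8_t> buffer(size);
      lock.lock();
      ++create_calls_;
      buffers_.emplace(object_id, std::move(buffer));
      creating_.erase(object_id);
      cv_.notify_all();
    }
    assert(lock.owns_lock());
    return true;
  }

  int create_calls() {
    std::lock_guard<std::mutex> lock(mu_);
    return create_calls_;
  }

 private:
  std::mutex mu_;
  std::condition_variable cv_;
  std::unordered_set<std::string> creating_;
  std::unordered_map<std::string, std::vector<uint8_t>> buffers_;
  int create_calls_ = 0;
};

// Simulates num_chunks chunks of one object arriving on separate threads and
// returns how many times the buffer was actually created.
int CountCreatesForConcurrentChunks(int num_chunks) {
  BufferPoolSketch pool;
  std::vector<std::thread> receivers;
  for (int i = 0; i < num_chunks; ++i) {
    receivers.emplace_back([&pool] { pool.EnsureBufferExists("object-1", 1024); });
  }
  for (auto &t : receivers) {
    t.join();
  }
  return pool.create_calls();
}
```

Calling `CountCreatesForConcurrentChunks(8)` exercises eight concurrent receivers for a single object; only one of them runs the creation path.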

The test from #18143 is pulled into this change.

Related issue number

#18062

Checks

I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

ericl · 2021-09-28T23:09:33Z

Instead of retrying at a higher level, which can increase load in 1->many broadcast situations, why not fix the race condition in the first place? E.g., if the chunk already exists (but is unsealed), don't raise an IOError (e.g., comment out the exception).
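
One way to read this suggestion is as an idempotent create: an "already exists" result is treated as success rather than as an error. The sketch below assumes a hypothetical in-memory `StoreSketch` with a `CreateStatus` enum; neither is the real store client interface, and the capacity check exists only to show that genuine failures are still reported.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>

enum class CreateStatus { kOk, kAlreadyExists, kOutOfMemory };

// Hypothetical in-memory store; not the real store client interface.
class StoreSketch {
 public:
  explicit StoreSketch(uint64_t capacity) : capacity_(capacity) {}

  CreateStatus Create(const std::string &object_id, uint64_t size) {
    if (unsealed_.count(object_id) > 0) {
      return CreateStatus::kAlreadyExists;
    }
    if (used_ + size > capacity_) {
      return CreateStatus::kOutOfMemory;
    }
    unsealed_[object_id] = size;
    used_ += size;
    return CreateStatus::kOk;
  }

  bool Contains(const std::string &object_id) const {
    return unsealed_.count(object_id) > 0;
  }

  std::size_t NumObjects() const { return unsealed_.size(); }

 private:
  uint64_t capacity_;
  uint64_t used_ = 0;
  std::unordered_map<std::string, uint64_t> unsealed_;
};

// Returns true if an unsealed buffer for object_id is available after the
// call, whether this caller created it or another chunk got there first.
bool CreateOrReuseBuffer(StoreSketch &store, const std::string &object_id,
                         uint64_t size) {
  const CreateStatus status = store.Create(object_id, size);
  if (status == CreateStatus::kOutOfMemory) {
    return false;
  }
  // kOk and kAlreadyExists both leave a usable buffer in the store.
  assert(store.Contains(object_id));
  return true;
}
```

As noted later in the thread, callers that take the already-exists path still need some way to wait until the first creator has finished setting up its per-object state.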

rkooo567 · 2021-09-29T07:44:16Z

Yeah, we should not reduce the retry interval. That would increase the load on the system a lot, as our retry policy is pretty naive.

I think there are 2 approaches here.

1. Eric's suggestion. I think it is similar to my original approach in [WIP][Core][RFC] Fix pull manager race condition #18143.
2. Retry within the pull manager. For example, in HandlePull, instead of replying to the other raylet right away, we cache the reply, retry creating the chunk after 1~2 seconds, and reply after that. I believe this would be a simpler solution than 1.
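
The second approach (cache the reply, retry once after a delay, then reply) might look roughly like the sketch below. `DeferredReplySketch`, `Handle`, and `Tick` are hypothetical names, time is passed in explicitly instead of coming from a real timer, and the sketch is single-threaded; it only illustrates the control flow, not the actual handler.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical helper: holds a reply back until a retried attempt has run.
class DeferredReplySketch {
 public:
  using Attempt = std::function<bool()>;
  using Reply = std::function<void(bool)>;

  explicit DeferredReplySketch(int64_t retry_delay_ms)
      : retry_delay_ms_(retry_delay_ms) {}

  // Tries once. On success it replies immediately; on failure it caches the
  // reply and schedules a single retry instead of reporting the error.
  void Handle(Attempt attempt, Reply reply, int64_t now_ms) {
    if (attempt()) {
      reply(true);
      return;
    }
    pending_.push_back(
        {now_ms + retry_delay_ms_, std::move(attempt), std::move(reply)});
  }

  // Driven by a timer: runs every retry that is due and sends its cached reply.
  void Tick(int64_t now_ms) {
    std::vector<Pending> still_waiting;
    for (auto &p : pending_) {
      if (p.due_ms <= now_ms) {
        p.reply(p.attempt());
      } else {
        still_waiting.push_back(std::move(p));
      }
    }
    pending_ = std::move(still_waiting);
  }

 private:
  struct Pending {
    int64_t due_ms;
    Attempt attempt;
    Reply reply;
  };

  int64_t retry_delay_ms_;
  std::vector<Pending> pending_;
};
```

The sender sees no reply until the retry has run, so a transient failure never reaches it, at the cost of holding the reply for the length of the delay.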

mwtian · 2021-09-29T15:48:33Z

Thanks @ericl and @rkooo567 for the suggestions. I went with a solution similar to Eric's suggestion and Sang's original approach. The difference is that, instead of allowing multiple in-flight create-buffer ops, only one in-flight create-buffer op is allowed. If we ignored already-exists errors during buffer creation, there would still need to be a way for operations to wait until their corresponding create_buffer_state_[object_id] becomes available, which is more complicated than not issuing duplicate buffer creation requests.

The new logic can differ from the previous logic in thread and memory usage when the race condition occurs. This behavior should be similar to that of the solution that ignores already-exists errors.

ericl · 2021-09-29T18:40:48Z

src/ray/common/ray_config_def.h

@@ -183,8 +183,7 @@ RAY_CONFIG(int64_t, worker_register_timeout_seconds, 30)
 RAY_CONFIG(int64_t, redis_db_connect_retries, 50)
 RAY_CONFIG(int64_t, redis_db_connect_wait_milliseconds, 100)

-/// Timeout, in milliseconds, to wait before retrying a failed pull in the
-/// ObjectManager.
+/// The object manager's global timer interval in milliseconds.


Please revert unrelated changes.

ericl

cc @iycheng for a more detailed review

ericl · 2021-09-29T19:12:47Z

src/ray/object_manager/object_buffer_pool.cc

+  RAY_CHECK(lock.owns_lock());
+
+  // Buffer for object_id already exists.
+  if (create_buffer_state_.contains(object_id)) return ray::Status::OK();


I think our style convention is to always put returns on a new line with braces, and never inline.
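
For illustration, the convention described in this comment applied to a small stand-alone function. `BufferAlreadyCreated` and its parameters are hypothetical and only mirror the shape of the flagged line.

```cpp
#include <string>
#include <unordered_set>

// Flagged form (inline return):
//   if (created.count(object_id) > 0) return true;
//
// Form requested in the review: braces, with the return on its own line.
bool BufferAlreadyCreated(const std::unordered_set<std::string> &created,
                          const std::string &object_id) {
  if (created.count(object_id) > 0) {
    return true;
  }
  return false;
}
```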

ericl · 2021-09-29T19:12:55Z

src/ray/object_manager/object_buffer_pool.cc

+    cond_var->wait(
+        lock, [this, object_id]() { return !create_buffer_ops_.contains(object_id); });
+    // Buffer already created.
+    if (create_buffer_state_.contains(object_id)) return ray::Status::OK();


return on new line

ericl · 2021-09-29T19:14:32Z

src/ray/object_manager/object_buffer_pool.h

  /// Determines the maximum chunk size to be transferred by a single thread.
  const uint64_t default_chunk_size_;
+
+  /// Mutex to protect create_buffer_ops_ and create_buffer_state_.
+  mutable std::mutex pool_mutex_;


Can you change this to use absl condition var support instead of std::condition_variable? https://abseil.io/docs/cpp/guides/synchronization

ericl · 2021-09-29T19:15:55Z

@rkooo567 , solution 2 isn't solving the fundamental issue. Solution 1 is simple, faster, and solves the root "bug" here.

ericl · 2021-09-30T00:52:05Z

src/ray/object_manager/object_buffer_pool.h

+  ray::Status EnsureBufferExists(const ObjectID &object_id,
+                                 const rpc::Address &owner_address, uint64_t data_size,
+                                 uint64_t metadata_size, uint64_t chunk_index)
+      ABSL_EXCLUSIVE_LOCKS_REQUIRED(pool_mutex_);


Btw I think we are omitting the ABSL_ prefix on annotations.

mwtian: Good to know, updated.

ericl

Looks good, but please change the annotations to be consistent with others (no ABSL_ prefix).

ericl · 2021-09-30T17:19:50Z

Windows build broken in master

* Revert "[Object manager] fix comments" This reverts commit 56debfc.
* Revert "[Object manager] don't abort entire pull request on race condition in concurrent chunk receive (#18955)" This reverts commit d12e35c.
* Fix a lint issue

Mingwei Tian added 2 commits September 28, 2021 14:13

reduce retry interval

154d8a1

cleanup

37e25b8

mwtian requested a review from rkooo567 September 28, 2021 22:00

mwtian assigned rkooo567 Sep 28, 2021

mwtian requested a review from ericl September 28, 2021 23:04

mwtian assigned ericl Sep 28, 2021

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 28, 2021

fix

b284d59

allow only 1 create buffer op at a time

a5b51f2

mwtian force-pushed the pull branch from f29eb25 to a5b51f2 Compare September 29, 2021 15:23

mwtian changed the title ~~[Object manager] Retry pull requests with a shorter timeout~~ [Object manager] ensure at most 1 inflight create buffer operation when handling push Sep 29, 2021

mwtian removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 29, 2021

bound destructor time

ab6dd8c

ericl reviewed Sep 29, 2021

ericl requested changes Sep 29, 2021

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 29, 2021

ericl assigned fishbone and unassigned rkooo567 Sep 29, 2021

ericl changed the title ~~[Object manager] ensure at most 1 inflight create buffer operation when handling push~~ [Object manager] don't abort entire pull requests on race condition in receive chunk Sep 29, 2021

ericl changed the title ~~[Object manager] don't abort entire pull requests on race condition in receive chunk~~ [Object manager] don't abort entire pull request on race condition in concurrent chunk receive Sep 29, 2021

Mingwei Tian added 2 commits September 29, 2021 12:55

revert cleanup

ab0d280

Use absl::Mutex and absl::CondVar

aa1e1e4

mwtian removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 29, 2021

ericl reviewed Sep 30, 2021

ericl added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 30, 2021

ericl approved these changes Sep 30, 2021

fix

38f343f

mwtian removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 30, 2021

ericl merged commit d12e35c into ray-project:master Sep 30, 2021

mwtian mentioned this pull request Oct 1, 2021

[Object manager] fix comments #19039

Merged

6 tasks

rkooo567 added a commit to rkooo567/ray that referenced this pull request Oct 4, 2021

Revert "[Object manager] don't abort entire pull request on race cond…

13ad971

…ition in concurrent chunk receive (ray-project#18955)" This reverts commit d12e35c.

mwtian mentioned this pull request Oct 4, 2021

[Bug] dask_on_ray_large_scale_test_no_spilling failed in core nightly #19042

Closed

2 tasks

mwtian deleted the pull branch October 7, 2021 01:17

mwtian mentioned this pull request Oct 8, 2021

[Object manager] don't abort entire pull request on race condition from concurrent chunk receive - #2 #19216

Merged

6 tasks
