[Object manager] don't abort entire pull request on race condition in concurrent chunk receive #18955

Merged (8 commits) on Sep 30, 2021

mwtian
Member

@mwtian mwtian commented Sep 28, 2021

Why are these changes needed?

See #18062 for investigation and background.

This change ensures that at most one buffer-creation operation is in flight for an object while multiple pushed chunks of that object are being handled. This avoids the race condition in which several operations race to create the object's buffer and all but one fail, forcing the pull to be retried.

The test from #18143 is pulled into this change.

Related issue number

#18062

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl
Contributor

ericl commented Sep 28, 2021

Instead of retrying at a higher level, which can increase load in 1->many broadcast situations, why not fix the race condition in the first place? E.g., if the chunk already exists (but is unsealed), don't raise an IOError (e.g., comment out the exception).
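The suggestion above could look roughly like the following. This is only a minimal sketch under stated assumptions: `TolerantStoreSketch`, `ChunkStatus`, `CreateForChunk`, and `Seal` are hypothetical names invented for illustration and are not Ray's object store API. It is single-threaded on purpose, so it shows only the "existing but unsealed is not an error" rule, not the locking a real receive path would need.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Hypothetical status values for illustration; Ray's real code uses ray::Status.
enum class ChunkStatus { kOk, kAlreadySealed };

// Hypothetical in-memory store that tracks whether each object's buffer is sealed.
class TolerantStoreSketch {
 public:
  // Creates the buffer if it is absent. A buffer that already exists but is
  // still unsealed is treated as success instead of raising an error.
  ChunkStatus CreateForChunk(const std::string &object_id) {
    auto it = sealed_.find(object_id);
    if (it == sealed_.end()) {
      sealed_.emplace(object_id, false);
      return ChunkStatus::kOk;
    }
    if (!it->second) {
      // Another chunk created the buffer first; reuse it.
      return ChunkStatus::kOk;
    }
    return ChunkStatus::kAlreadySealed;
  }

  // Marks the object's buffer as complete.
  void Seal(const std::string &object_id) { sealed_[object_id] = true; }

 private:
  std::unordered_map<std::string, bool> sealed_;
};
```

In this sketch a second chunk for the same object gets `kOk` rather than an error, so the pull is not aborted; only a create against an already sealed object is reported.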

@ericl ericl added the @author-action-required label Sep 28, 2021
@rkooo567
Contributor

Yeah, we should not reduce the retry interval. That would increase the load on the system a lot, as our retry policy is pretty naive.

I think there are 2 approaches here.

  1. Eric's suggestion. I think it is similar to my original approach, [WIP][Core][RFC] Fix pull manager race condition #18143.
  2. Retry within the pull manager. For example, in HandlePull, instead of replying to the other raylet right away, we cache the reply, retry creating the chunk after 1~2 seconds, and reply after that. I believe this will be a simpler solution than 1.
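As a rough illustration of approach 2, the sketch below defers the reply until one delayed retry has been attempted. `CreateWithDeferredRetry` and its parameters are hypothetical names, not part of Ray; the blocking sleep is a simplification, since a real raylet would presumably schedule the retry on its event loop rather than block a thread.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: tries to create the chunk, and on failure waits `delay`
// and tries exactly once more. The caller sends its reply only after this
// function returns, which models "cache the reply and retry".
bool CreateWithDeferredRetry(const std::function<bool()> &try_create,
                             std::chrono::milliseconds delay) {
  if (try_create()) {
    return true;
  }
  // Simplification: a blocking sleep stands in for a timer callback.
  std::this_thread::sleep_for(delay);
  return try_create();
}
```

The delay is a parameter so that the 1~2 second value mentioned above is a caller choice rather than something baked into the helper.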

@mwtian mwtian changed the title [Object manager] Retry pull requests with a shorter timeout [Object manager] ensure at most 1 inflight create buffer operation when handling push Sep 29, 2021
@mwtian
Member Author

mwtian commented Sep 29, 2021

Thanks @ericl and @rkooo567 for the suggestions. I went with a solution similar to Eric's and Sang's original approach. The difference is that instead of allowing multiple in-flight create-buffer ops, only one in-flight create-buffer op is allowed for an object. If we ignored already-exists errors during buffer creation, there would still need to be a way for operations to wait until their corresponding create_buffer_state_[object_id] becomes available, which is more complicated than not issuing duplicate buffer creation requests.

Under the race condition, the new logic can differ from the previous logic in thread and memory usage. Its behavior should be similar to that of the solution that ignores already-exists errors.
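The "at most one in-flight creation per object" idea described above can be sketched with a mutex and a condition variable. This is an illustrative sketch, not the code in this PR: `SingleCreatePoolSketch`, `num_creates`, and `CountCreatesForConcurrentChunks` are hypothetical names, object IDs are plain strings, and releasing the lock during the simulated allocation is an assumption made here for the example.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical stand-in for the receive-side buffer pool; not Ray's class.
class SingleCreatePoolSketch {
 public:
  // Blocks until a buffer for object_id exists, creating it at most once.
  void EnsureBufferExists(const std::string &object_id, uint64_t data_size) {
    std::unique_lock<std::mutex> lock(mutex_);
    // Wait while another thread has a creation in flight for this object.
    cv_.wait(lock, [&] { return create_ops_.count(object_id) == 0; });
    if (buffers_.count(object_id) > 0) {
      return;  // Buffer was already created by an earlier chunk.
    }
    create_ops_.insert(object_id);
    lock.unlock();
    // Simulated slow allocation, done without holding the lock (assumption).
    std::vector<uint8_t> buffer(data_size);
    lock.lock();
    buffers_.emplace(object_id, std::move(buffer));
    ++num_creates_;
    create_ops_.erase(object_id);
    cv_.notify_all();
  }

  int num_creates() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return num_creates_;
  }

 private:
  mutable std::mutex mutex_;
  std::condition_variable cv_;
  std::unordered_set<std::string> create_ops_;  // Creations in flight.
  std::unordered_map<std::string, std::vector<uint8_t>> buffers_;
  int num_creates_ = 0;
};

// Simulates num_chunks chunks of one object arriving concurrently and
// returns how many buffer creations actually ran.
int CountCreatesForConcurrentChunks(int num_chunks) {
  SingleCreatePoolSketch pool;
  std::vector<std::thread> threads;
  for (int i = 0; i < num_chunks; ++i) {
    threads.emplace_back([&pool] { pool.EnsureBufferExists("obj", 1024); });
  }
  for (auto &t : threads) {
    t.join();
  }
  return pool.num_creates();
}
```

Whichever thread arrives first performs the creation; the others either wait for it to finish or find the buffer already present, so none of them fails with an already-exists error. The sketch does not model a failed creation, in which case a waiting thread would have to attempt the creation itself.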

@mwtian mwtian removed the @author-action-required label Sep 29, 2021
@@ -183,8 +183,7 @@ RAY_CONFIG(int64_t, worker_register_timeout_seconds, 30)
RAY_CONFIG(int64_t, redis_db_connect_retries, 50)
RAY_CONFIG(int64_t, redis_db_connect_wait_milliseconds, 100)

-/// Timeout, in milliseconds, to wait before retrying a failed pull in the
-/// ObjectManager.
+/// The object manager's global timer interval in milliseconds.
Contributor

Please revert unrelated changes.

Contributor

@ericl ericl left a comment

cc @iycheng for a more detailed review

RAY_CHECK(lock.owns_lock());

// Buffer for object_id already exists.
if (create_buffer_state_.contains(object_id)) return ray::Status::OK();
Contributor

I think our style convention is to always put returns on a new line with braces, and never inline.

cond_var->wait(
    lock, [this, object_id]() { return !create_buffer_ops_.contains(object_id); });
// Buffer already created.
if (create_buffer_state_.contains(object_id)) return ray::Status::OK();
Contributor

return on new line

/// Determines the maximum chunk size to be transferred by a single thread.
const uint64_t default_chunk_size_;

/// Mutex to protect create_buffer_ops_ and create_buffer_state_.
mutable std::mutex pool_mutex_;
Contributor

Can you change this to use absl condition var support instead of std::condition_variable? https://abseil.io/docs/cpp/guides/synchronization

@ericl ericl added the @author-action-required label Sep 29, 2021
@ericl ericl assigned fishbone and unassigned rkooo567 Sep 29, 2021
@ericl
Contributor

ericl commented Sep 29, 2021

@rkooo567, solution 2 doesn't solve the fundamental issue. Solution 1 is simple, faster, and solves the root "bug" here.

@ericl ericl changed the title [Object manager] ensure at most 1 inflight create buffer operation when handling push [Object manager] don't abort entire pull requests on race condition in receive chunk Sep 29, 2021
@ericl ericl changed the title [Object manager] don't abort entire pull requests on race condition in receive chunk [Object manager] don't abort entire pull request on race condition in concurrent chunk receive Sep 29, 2021
@mwtian mwtian removed the @author-action-required label Sep 29, 2021
ray::Status EnsureBufferExists(const ObjectID &object_id,
                               const rpc::Address &owner_address, uint64_t data_size,
                               uint64_t metadata_size, uint64_t chunk_index)
    ABSL_EXCLUSIVE_LOCKS_REQUIRED(pool_mutex_);
Contributor

Btw I think we are omitting the ABSL_ prefix on annotations.

Member Author

Good to know, updated.

Contributor

@ericl ericl left a comment

Looks good, but please change the annotations to be consistent with others (no ABSL_ prefix).

@ericl ericl added the @author-action-required label Sep 30, 2021
@mwtian mwtian removed the @author-action-required label Sep 30, 2021
@ericl
Contributor

ericl commented Sep 30, 2021

Windows build broken in master

@ericl ericl merged commit d12e35c into ray-project:master Sep 30, 2021
@mwtian mwtian mentioned this pull request Oct 1, 2021
rkooo567 added a commit to rkooo567/ray that referenced this pull request Oct 4, 2021
scv119 pushed a commit that referenced this pull request Oct 4, 2021
* Revert "[Object manager] fix comments"

This reverts commit 56debfc.

* Revert "[Object manager] don't abort entire pull request on race condition in concurrent chunk receive (#18955)"

This reverts commit d12e35c.

* Fix a lint issue
@mwtian mwtian deleted the pull branch October 7, 2021 01:17