Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CI test linux://:mutable_object_test is flaky #44396

Closed
can-anyscale opened this issue Apr 1, 2024 · 22 comments · Fixed by #44413, #44720 or #44764
Closed

CI test linux://:mutable_object_test is flaky #44396

can-anyscale opened this issue Apr 1, 2024 · 22 comments · Fixed by #44413, #44720 or #44764
Assignees
Labels
bug Something that is supposed to be working; but isn't ci-test core Issues that should be addressed in Ray Core core-object-store flaky-tracker Issue created via Flaky Test Tracker https://flaky-tests.ray.io/ P0 Issues that should be fixed in short order ray-test-bot Issues managed by OSS test policy stability

Comments

@can-anyscale can-anyscale added bug Something that is supposed to be working; but isn't ci-test core Issues that should be addressed in Ray Core flaky-tracker Issue created via Flaky Test Tracker https://flaky-tests.ray.io/ ray-test-bot Issues managed by OSS test policy stability triage Needs triage (eg: priority, bug/not-bug, and owning component) weekly-release-blocker Issues that will be blocking Ray weekly releases labels Apr 1, 2024
@can-anyscale
Copy link
Collaborator Author

Image

@can-anyscale
Copy link
Collaborator Author

I'm not sure whether it's because of that PR but @jackhumphries is the master mind behind cpp anyway ;)

@jjyao jjyao removed their assignment Apr 1, 2024
@jjyao jjyao added P0 Issues that should be fixed in short order and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels Apr 1, 2024
jackhumphries added a commit to jackhumphries/ray that referenced this issue Apr 2, 2024
Based on a local issue I resolved, I believe this commit should fix the
flaky mutable_object_test in ray-project#44396.

The header of the mutable object was not being explicitly initialized.
In some cases, this caused deadlock due to a spinlock backed by garbage memory.

This commit explicitly initializes the header, which should resolve the
issue.

Tested: mutable_object_test

Signed-off-by: Jack Humphries <1645405+jackhumphries@users.noreply.github.com>
stephanie-wang pushed a commit that referenced this issue Apr 2, 2024
Based on a local issue I resolved, I believe this commit should fix the
flaky mutable_object_test in #44396.

The header of the mutable object was not being explicitly initialized.
In some cases, this caused deadlock due to a spinlock backed by garbage memory.

This commit explicitly initializes the header, which should resolve the
issue.

Tested: mutable_object_test

Signed-off-by: Jack Humphries <1645405+jackhumphries@users.noreply.github.com>
@can-anyscale
Copy link
Collaborator Author

@jjyao
Copy link
Collaborator

jjyao commented Apr 8, 2024

Still flaky.

@jjyao jjyao reopened this Apr 8, 2024
@jjyao jjyao removed the weekly-release-blocker Issues that will be blocking Ray weekly releases label Apr 8, 2024
@anyscalesam anyscalesam reopened this Apr 8, 2024
@can-anyscale
Copy link
Collaborator Author

@jackhumphries
Copy link
Contributor

From initial investigation, the test is "flaky" because it takes a long time to run. In ReadAcquire() in src/ray/object_manager/common.cc, there is a polling loop that all readers enter as they wait on the writer to update the object. In each loop, the readers increment and decrement the semaphore (which has a maximum value of 1), so there is significant contention on the semaphore.

@jackhumphries
Copy link
Contributor

jackhumphries commented Apr 13, 2024

A quick fix to reduce contention would be to add sched_yield() between sem_post() and TryToAcquireSemaphore(). A good long term fix would be to use a futex to sleep on the version number to avoid polling.

@stephanie-wang
Copy link
Contributor

stephanie-wang commented Apr 13, 2024 via email

@jackhumphries
Copy link
Contributor

I'd be inclined to keep it as is, because this test should run quickly. It's just unnecessary contention that shouldn't be there in the first place due to polling.

@stephanie-wang
Copy link
Contributor

stephanie-wang commented Apr 13, 2024 via email

@can-anyscale can-anyscale changed the title CI test linux://:mutable_object_test is flaky CI test linux://:mutable_object_test is consistently_failing Apr 13, 2024
@can-anyscale can-anyscale reopened this Apr 13, 2024
@can-anyscale can-anyscale changed the title CI test linux://:mutable_object_test is flaky CI test linux://:mutable_object_test is consistently_failing Apr 13, 2024
@can-anyscale can-anyscale reopened this Apr 13, 2024
2 similar comments
@can-anyscale can-anyscale changed the title CI test linux://:mutable_object_test is flaky CI test linux://:mutable_object_test is consistently_failing Apr 13, 2024
@can-anyscale can-anyscale reopened this Apr 13, 2024
@can-anyscale
Copy link
Collaborator Author

1 similar comment
@can-anyscale
Copy link
Collaborator Author

@can-anyscale can-anyscale changed the title CI test linux://:mutable_object_test is consistently_failing CI test linux://:mutable_object_test is flaky Apr 13, 2024
@can-anyscale can-anyscale reopened this Apr 13, 2024
@can-anyscale
Copy link
Collaborator Author

CI test linux://:mutable_object_test is flaky. Recent failures:

DataCaseName-linux://:mutable_object_test-END
Managed by OSS Test Policy

@can-anyscale
Copy link
Collaborator Author

@can-anyscale
Copy link
Collaborator Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something that is supposed to be working; but isn't ci-test core Issues that should be addressed in Ray Core core-object-store flaky-tracker Issue created via Flaky Test Tracker https://flaky-tests.ray.io/ P0 Issues that should be fixed in short order ray-test-bot Issues managed by OSS test policy stability
Projects
None yet
5 participants