
[Ray 2.3 release] Last update to results json was too long ago CI flaky failure #31981

Open · cadedaniel opened this issue on Jan 27, 2023 · 5 comments
Labels: P1 (Issue that should be fixed within a few weeks), testing (topics about testing)

cadedaniel (Member):
Traceback (most recent call last):
  File "ray_release/scripts/run_release_test.py", line 153, in main
    no_terminate=no_terminate,
  File "/tmp/release-y0g2zN9kOn/release/ray_release/glue.py", line 404, in run_release_test
    raise pipeline_exception
  File "/tmp/release-y0g2zN9kOn/release/ray_release/glue.py", line 382, in run_release_test
    handle_result(test, result)
  File "/tmp/release-y0g2zN9kOn/release/ray_release/alerts/handle.py", line 39, in handle_result
    raise ResultsAlert(error)
ray_release.exception.ResultsAlert: Last update to results json was too long ago (inf > 300)
cadedaniel added the release-blocker (P0: Issue that blocks the release) and P0 (Issues that should be fixed in short order) labels on Jan 27, 2023
cadedaniel self-assigned this on Jan 27, 2023
cadedaniel (Member, Author) commented on Jan 27, 2023:

From Kai:

The main problem is that the results file can't be fetched:

  ray_release.exception.FileDownloadError: Error downloading file
  /tmp/release_test_out.json to /tmp/tmpx3c1t6pi

The alerting step then checks the last update time, but it defaults to inf if no data is found; that raises the error. So the main problem is that either the file was not written or the download did not work correctly.
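
For context, here is a minimal sketch of how a staleness check like this ends up comparing `inf > 300` when no results data is available. This is illustrative, not the actual `ray_release` alerting code; the function name, field name, and exception type are assumptions:

```python
import math
import time

STALENESS_THRESHOLD_S = 300  # threshold implied by the "(inf > 300)" in the error


def check_last_update(results: dict) -> None:
    # If the results file was never downloaded, there is no timestamp to read,
    # so the "last update" effectively defaults to -inf ...
    last_update = results.get("last_update", -math.inf)
    # ... and the computed age becomes inf, which always exceeds the threshold.
    age = time.time() - last_update
    if age > STALENESS_THRESHOLD_S:
        raise RuntimeError(
            f"Last update to results json was too long ago ({age} > {STALENESS_THRESHOLD_S})"
        )
```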

krfricke (Contributor):
To add to this: the result-fetching procedure works for other tests, so if the download did not work correctly, it could be, for example, because the jobs server died.

cadedaniel (Member, Author):
For long_running_many_actor_tasks, the root-cause exception is swallowed by the exception handler in an exponential-backoff helper. I have a fix in #32014 (a sketch of that kind of logging change is below), but of course that doesn't address the root cause.

Will look into long_running_actor_deaths now.
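
A rough sketch of the kind of change in #32014: log the exception inside the retry helper so the underlying failure (e.g. the `FileDownloadError` above) is visible instead of being silently retried away. The signature and parameter names here are assumptions, not the exact `ray_release` helper:

```python
import logging
import time

logger = logging.getLogger(__name__)


def exponential_backoff_retry(fn, retry_exceptions, initial_retry_delay_s=2, max_retries=3):
    retry_delay_s = initial_retry_delay_s
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except retry_exceptions as e:
            # Logging here is the point of the fix: without it, the root-cause
            # exception is swallowed and only the downstream ResultsAlert surfaces.
            logger.exception(
                "Attempt %d/%d failed (%s), retrying in %ss",
                attempt, max_retries, e, retry_delay_s,
            )
            time.sleep(retry_delay_s)
            retry_delay_s *= 2
    # Final attempt outside the loop so the last exception propagates to the caller.
    return fn()
```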

cadedaniel (Member, Author):
The long_running_actor_deaths failure looks to have the same issue

krfricke pushed a commit that referenced this issue on Jan 28, 2023:

…info on failure (#32014)

It appears the root cause of the flaky failures described in #31981 is suppressed because we're not logging exceptions in `exponential_backoff_retry`.

Signed-off-by: Cade Daniel <cade@anyscale.com>
cadedaniel changed the title from "[Ray 2.3 release] CI flaky failure in long_running_many_actor_tasks[smoke] and long_running_actor_deaths[smoke]" to "[Ray 2.3 release] Last update to results json was too long ago CI flaky failure" on Jan 31, 2023
cadedaniel (Member, Author):
Update: I thought the exception handler around `Executing pip install -q awscli && aws s3 cp /tmp/release_test_out.json s3://ray-release-automation-results/tmp/qjlzkodchv --acl bucket-owner-full-control with {} via ray job submit` was failing, but the failure actually looks to be in a different codepath, in the Anyscale SDK itself (`Could not fetch results from session`). That upload step is sketched below. Will look more later.

In any case, these tests aren't crashing, and we currently don't check for performance regressions on them. So the tests are passing; we're just missing metrics from them because of this bug.
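
For reference, a hedged sketch of what the upload codepath quoted above might look like when driven from the release tooling. The `ray job submit` CLI is real, but the wiring, bucket key, and use of `subprocess` here are illustrative assumptions, not the actual `ray_release` implementation:

```python
import subprocess

RESULTS_JSON = "/tmp/release_test_out.json"
# Example destination taken from the quoted command; the tmp key is generated per run.
S3_TARGET = "s3://ray-release-automation-results/tmp/qjlzkodchv"

entrypoint = (
    "pip install -q awscli && "
    f"aws s3 cp {RESULTS_JSON} {S3_TARGET} --acl bucket-owner-full-control"
)

# Submit the upload as a Ray job against the running cluster. If this step (or the
# separate Anyscale SDK fetch) fails, the results file never reaches the alert check
# and the staleness check falls back to inf.
subprocess.run(["ray", "job", "submit", "--", "bash", "-c", entrypoint], check=True)
```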

cadedaniel added the P1 label and removed the release-blocker and P0 labels on Feb 1, 2023
cadedaniel removed their assignment on Feb 1, 2023
cadedaniel added the testing label on Mar 10, 2023
can-anyscale self-assigned this on Mar 21, 2023
edoakes pushed a commit to edoakes/ray that referenced this issue on Mar 22, 2023:

…info on failure (ray-project#32014)

It appears the root cause of the flaky failures described in ray-project#31981 is suppressed because we're not logging exceptions in `exponential_backoff_retry`.

Signed-off-by: Cade Daniel <cade@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>