
[Ray 2.3 release] Last update to results json was too long ago CI flaky failure #31981

Open · cadedaniel opened this issue on Jan 27, 2023 · 5 comments
Labels: P1 (Issue that should be fixed within a few weeks), testing (topics about testing)

cadedaniel (Member):
Traceback (most recent call last):
  File "ray_release/scripts/run_release_test.py", line 153, in main
    no_terminate=no_terminate,
  File "/tmp/release-y0g2zN9kOn/release/ray_release/glue.py", line 404, in run_release_test
    raise pipeline_exception
  File "/tmp/release-y0g2zN9kOn/release/ray_release/glue.py", line 382, in run_release_test
    handle_result(test, result)
  File "/tmp/release-y0g2zN9kOn/release/ray_release/alerts/handle.py", line 39, in handle_result
    raise ResultsAlert(error)
ray_release.exception.ResultsAlert: Last update to results json was too long ago (inf > 300)
cadedaniel added the release-blocker (P0: Issue that blocks the release) and P0 (Issues that should be fixed in short order) labels on Jan 27, 2023
cadedaniel self-assigned this on Jan 27, 2023
cadedaniel (Member, Author) commented on Jan 27, 2023:

From Kai:

The main problem is that the results file can't be fetched:

  ray_release.exception.FileDownloadError: Error downloading file
  /tmp/release_test_out.json to /tmp/tmpx3c1t6pi

The alerting step then checks the last update time, but it defaults to inf if no data is found; that raises the error. So the main problem is that either the file was not written or the download did not work correctly.
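
For context, here is a minimal sketch of how a staleness check like this ends up comparing `inf > 300` when no results data is available. This is illustrative, not the actual `ray_release` alerting code; the function name, field name, and exception type are assumptions:

```python
import math
import time

STALENESS_THRESHOLD_S = 300  # threshold implied by the "(inf > 300)" in the error


def check_last_update(results: dict) -> None:
    # If the results file was never downloaded, there is no timestamp to read,
    # so the "last update" effectively defaults to -inf ...
    last_update = results.get("last_update", -math.inf)
    # ... and the computed age becomes inf, which always exceeds the threshold.
    age = time.time() - last_update
    if age > STALENESS_THRESHOLD_S:
        raise RuntimeError(
            f"Last update to results json was too long ago ({age} > {STALENESS_THRESHOLD_S})"
        )
```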

krfricke (Contributor):
To add to this: the result-fetching procedure works for other tests, so if the download did not work correctly, it could be, for example, because the jobs server died.

cadedaniel (Member, Author):
For long_running_many_actor_tasks, the root-cause exception is swallowed by the exception handler in an exponential-backoff helper. I have a fix in #32014 (a sketch of that kind of logging change is below), but of course that doesn't address the root cause.

Will look into long_running_actor_deaths now.
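
A rough sketch of the kind of change in #32014: log the exception inside the retry helper so the underlying failure (e.g. the `FileDownloadError` above) is visible instead of being silently retried away. The signature and parameter names here are assumptions, not the exact `ray_release` helper:

```python
import logging
import time

logger = logging.getLogger(__name__)


def exponential_backoff_retry(fn, retry_exceptions, initial_retry_delay_s=2, max_retries=3):
    retry_delay_s = initial_retry_delay_s
    for attempt in range(1, max_retries + 1):
        try:
            return fn()
        except retry_exceptions as e:
            # Logging here is the point of the fix: without it, the root-cause
            # exception is swallowed and only the downstream ResultsAlert surfaces.
            logger.exception(
                "Attempt %d/%d failed (%s), retrying in %ss",
                attempt, max_retries, e, retry_delay_s,
            )
            time.sleep(retry_delay_s)
            retry_delay_s *= 2
    # Final attempt outside the loop so the last exception propagates to the caller.
    return fn()
```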

cadedaniel (Member, Author):
The long_running_actor_deaths failure looks to have the same issue

krfricke pushed a commit that referenced this issue on Jan 28, 2023:

…info on failure (#32014)

It appears the root cause of the flaky failures described in #31981 is suppressed because we're not logging exceptions in `exponential_backoff_retry`.

Signed-off-by: Cade Daniel <cade@anyscale.com>
cadedaniel changed the title from "[Ray 2.3 release] CI flaky failure in long_running_many_actor_tasks[smoke] and long_running_actor_deaths[smoke]" to "[Ray 2.3 release] Last update to results json was too long ago CI flaky failure" on Jan 31, 2023
cadedaniel (Member, Author):
Update: I thought the exception handler around `Executing pip install -q awscli && aws s3 cp /tmp/release_test_out.json s3://ray-release-automation-results/tmp/qjlzkodchv --acl bucket-owner-full-control with {} via ray job submit` was failing, but the failure actually looks to be in a different codepath, in the Anyscale SDK itself (`Could not fetch results from session`). That upload step is sketched below. Will look more later.

In any case, these tests aren't crashing, and we currently don't check for performance regressions on them. So the tests are passing; we're just missing metrics from them because of this bug.
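
For reference, a hedged sketch of what the upload codepath quoted above might look like when driven from the release tooling. The `ray job submit` CLI is real, but the wiring, bucket key, and use of `subprocess` here are illustrative assumptions, not the actual `ray_release` implementation:

```python
import subprocess

RESULTS_JSON = "/tmp/release_test_out.json"
# Example destination taken from the quoted command; the tmp key is generated per run.
S3_TARGET = "s3://ray-release-automation-results/tmp/qjlzkodchv"

entrypoint = (
    "pip install -q awscli && "
    f"aws s3 cp {RESULTS_JSON} {S3_TARGET} --acl bucket-owner-full-control"
)

# Submit the upload as a Ray job against the running cluster. If this step (or the
# separate Anyscale SDK fetch) fails, the results file never reaches the alert check
# and the staleness check falls back to inf.
subprocess.run(["ray", "job", "submit", "--", "bash", "-c", entrypoint], check=True)
```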

cadedaniel added the P1 label and removed the release-blocker and P0 labels on Feb 1, 2023
cadedaniel removed their assignment on Feb 1, 2023
cadedaniel added the testing label on Mar 10, 2023
can-anyscale self-assigned this on Mar 21, 2023
edoakes pushed a commit to edoakes/ray that referenced this issue on Mar 22, 2023:

…info on failure (ray-project#32014)

It appears the root cause of the flaky failures described in ray-project#31981 is suppressed because we're not logging exceptions in `exponential_backoff_retry`.

Signed-off-by: Cade Daniel <cade@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>