Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Ray release test infra] Change exponential_backoff_retry to use error instead of info on failure #32014

Merged

Conversation

cadedaniel
Copy link
Member

@cadedaniel cadedaniel commented Jan 28, 2023

It appears the root cause of flaky failures described in #31981 is suppressed because we're not logging exceptions in exponential_backoff_retry.

...
[INFO 2023-01-23 16:55:10,474] sdk_runner.py: 135  ... command still running ...(3540 seconds) ...
[INFO 2023-01-23 16:55:39,934] sdk_runner.py: 135  ... command still running ...(3570 seconds) ...
�[0m--- :floppy_disk: Fetching results�[0m
[INFO 2023-01-23 16:56:10,439] job_manager.py: 45  Executing pip install -q awscli && aws s3 cp /tmp/release_test_out.json s3://ray-release-automation-results/tmp/iumtsyohrf --acl bucket-owner-full-control with {} via ray job submit
[ERROR 2023-01-23 16:56:11,036] glue.py: 324  Could not fetch results for test command
[ERROR 2023-01-23 16:56:11,037] glue.py: 325  Could not fetch results from session: Error downloading file /tmp/release_test_out.json to /tmp/tmpx3c1t6pi
Traceback (most recent call last):
  File "/tmp/release-QnZJuLnWXj/release/ray_release/command_runner/sdk_runner.py", line 184, in _fetch_json
    self.file_manager.download(path, tmpfile)
  File "/tmp/release-QnZJuLnWXj/release/ray_release/file_manager/job_file_manager.py", line 63, in download
    raise FileDownloadError(f"Error downloading file {source} to {target}")
    ray_release.exception.FileDownloadError: Error downloading file /tmp/release_test_out.json to /tmp/tmpx3c1t6pi
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "/tmp/release-QnZJuLnWXj/release/ray_release/glue.py", line 322, in run_release_test
    command_results = command_runner.fetch_results()
  File "/tmp/release-QnZJuLnWXj/release/ray_release/command_runner/sdk_runner.py", line 195, in fetch_results
    return self._fetch_json(self.result_output_json)
  File "/tmp/release-QnZJuLnWXj/release/ray_release/command_runner/sdk_runner.py", line 192, in _fetch_json
    raise ResultsError(f"Could not fetch results from session: {e}") from e
    ray_release.exception.ResultsError: Could not fetch results from session: Error downloading file /tmp/release_test_out.json to /tmp/tmpx3c1t6pi
...

Testing

#!/usr/bin/env python3

import logging
logger = logging.getLogger()
logger.setLevel(logging.INFO)

try:
    raise ValueError()
except ValueError as e:
    logger.info(f"INFO message message message {e}")
    logger.exception(f"EXCEPTION message message message {e}")
$ ./t
EXCEPTION message message message
Traceback (most recent call last):
  File "./t", line 8, in <module>
    raise ValueError()
ValueError

@@ -121,7 +121,7 @@ def exponential_backoff_retry(
retry_cnt += 1
if retry_cnt > max_retries:
raise
logger.info(
logger.warn(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know if warn will print to the logs? should I use exception or error instead cc @krfricke

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, gunna go for exception, if we have spammy logs due to this can revisit

@cadedaniel cadedaniel marked this pull request as ready for review January 28, 2023 00:10
@cadedaniel cadedaniel changed the title Change exponential_backoff_retry to use warn instead of info on failure [Ray release test infra] Change exponential_backoff_retry to use warn instead of info on failure Jan 28, 2023
Signed-off-by: Cade Daniel <cade@anyscale.com>
@cadedaniel cadedaniel force-pushed the release-test-exception-logging branch from c490ce7 to f3347ef Compare January 28, 2023 00:11
@cadedaniel cadedaniel changed the title [Ray release test infra] Change exponential_backoff_retry to use warn instead of info on failure [Ray release test infra] Change exponential_backoff_retry to use error instead of info on failure Jan 28, 2023
@krfricke krfricke merged commit b5899d4 into ray-project:master Jan 28, 2023
@cadedaniel cadedaniel deleted the release-test-exception-logging branch January 28, 2023 08:10
@cadedaniel
Copy link
Member Author

@krfricke when will this commit be used for release tests? The logs are still obscured in a failure from the most recent weekly test but looking at git log, this PR is included in the weekly tests. https://gist.github.com/cadedaniel/c85c9af3f901e127dee773f9fcdac367

from https://buildkite.com/ray-project/release-tests-branch/builds/1318#0185fc29-1d4c-483a-999b-ede500781c7a

[INFO 2023-01-29 23:33:27,123] job_manager.py: 45  Executing pip install -q awscli && aws s3 cp /tmp/release_test_out.json s3://ray-release-automation-results/tmp/eihwxbtgzy --acl bucket-owner-full-control with {} via ray job submit
[ERROR 2023-01-29 23:33:27,486] glue.py: 324  Could not fetch results for test command
[ERROR 2023-01-29 23:33:27,486] glue.py: 325  Could not fetch results from session: Error downloading file /tmp/release_test_out.json to /tmp/tmp25vitwhv

is this an issue with my PR or has the release test infra not yet picked up my changes?

@krfricke
Copy link
Contributor

This should be picked up but I think the code is just not invoked in this path.

@xwjiang2010 is working on similar fix for this issue: #32055

edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
…info on failure (ray-project#32014)

It appears the root cause of flaky failures described in ray-project#31981 is suppressed because we're not logging exceptions in `exponential_backoff_retry`.

Signed-off-by: Cade Daniel <cade@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-test release test
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants