[heartbeats] Better error handling #2193

Merged
merged 2 commits into from
Jan 4, 2025

Conversation

valayDave
Collaborator

@valayDave valayDave commented Jan 2, 2025

  • Improve error handling so that server-related flakiness no longer crashes the heartbeat thread.
  • If server issues, DNS issues, or intermittent server flakiness crash the heartbeat thread, heartbeats stop and the UI marks the run as failed.
  • This change ensures that the thread does not crash and instead backs off exponentially between calls to the server.
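
The behaviour described in the bullets above could look roughly like the following. This is a minimal sketch, not the code merged in this PR: `heartbeat_loop`, `send`, `should_stop`, and the `interval`, `base`, and `max_backoff` parameters are hypothetical names chosen for illustration, and the real request is replaced by an injected callable so the example runs without a server.

```python
import time
from typing import Callable


def heartbeat_loop(
    send: Callable[[], None],
    should_stop: Callable[[], bool],
    interval: float = 10.0,
    base: float = 2.0,
    max_backoff: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Call send() repeatedly; on failure, back off instead of raising.

    All names and defaults here are assumptions for illustration only.
    """
    failures = 0
    while not should_stop():
        try:
            send()
            # A successful heartbeat resets the backoff.
            failures = 0
            delay = interval
        except Exception:
            # Swallow the error so the loop (and its thread) stays alive,
            # then wait exponentially longer, up to max_backoff seconds.
            failures += 1
            delay = min(base ** failures, max_backoff)
        sleep(delay)
```

In a real thread, `send` would wrap the HTTP call and `should_stop` would typically check a `threading.Event`. Catching a broad `Exception` is deliberate in this sketch, since DNS failures and connection resets surface as different exception types.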

@valayDave valayDave changed the title from "[robust heartbeats]" to "[heartbeats] change from threads to processes [WIP]" Jan 2, 2025
@valayDave valayDave marked this pull request as draft January 2, 2025 23:59
@valayDave valayDave requested a review from savingoyal January 2, 2025 23:59
@valayDave valayDave changed the title from "[heartbeats] change from threads to processes [WIP]" to "[heartbeats] Better error handling" Jan 3, 2025
@valayDave valayDave marked this pull request as ready for review January 3, 2025 23:08
response = requests.post(
    url=self.hb_url, data="{}", headers=self.headers.copy()
)
try:

Should we change time.sleep(4**retry_counter) to be less aggressive? Given how quickly the sleep value grows, the UI may still mark the task as failed.
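
To make the concern above concrete, the sketch below compares the sleep schedule of `4**retry_counter` with a gentler, capped alternative. It is illustrative only: `backoff_delays` is a hypothetical helper, the assumption that `retry_counter` starts at 1 is not confirmed by the PR, and the base of 2 with a 30-second cap is just one possible choice. How long the UI waits before marking a task as failed is not stated in this thread.

```python
from typing import List, Optional


def backoff_delays(base: int, retries: int, cap: Optional[int] = None) -> List[int]:
    """Return the sleep (in seconds) before each retry, optionally capped.

    Assumes retry_counter runs from 1 to `retries`; this is an assumption.
    """
    delays = []
    for retry_counter in range(1, retries + 1):
        delay = base ** retry_counter
        if cap is not None:
            delay = min(delay, cap)
        delays.append(delay)
    return delays


print(backoff_delays(4, 5))          # [4, 16, 64, 256, 1024]
print(backoff_delays(2, 5, cap=30))  # [2, 4, 8, 16, 30]
```

Under these assumptions, five consecutive failures with base 4 add up to 1364 seconds without a heartbeat, while the capped schedule adds up to 60 seconds.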

@savingoyal savingoyal merged commit e92990f into Netflix:master Jan 4, 2025
29 checks passed
2 participants