This repository has been archived by the owner on Nov 1, 2023. It is now read-only.

Node reimaged due to expired heartbeat #1095

Closed
jagunter opened this issue Jul 20, 2021 · 2 comments
Labels
bug Something isn't working

Comments

@jagunter
Member

Information

  • Onefuzz version: 2.27.0
  • OS: Windows

Provide detailed reproduction steps (if any)

Scheduled job 2056f74a-d2c3-478b-a6f5-e8335d33543c to run for 1 week. However, after a day it was stopped with the following error:

    {"code": 468, "errors": ["node reimaged due to expired heartbeat", "scaleset_id:8ab21c8b-a504-4d78-b185-681c55d00fe9 machine_id:97da016a-10db-4b25-bd0a-d4830727bbbf", "last heartbeat:2021-07-20 21:23:11+00:00"]}
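For triage, the fields in this payload can be pulled apart with a few lines of Python. This is a minimal sketch that assumes the `errors` list always has the three-entry shape shown above (a message, space-separated `key:value` IDs, and a `last heartbeat:` timestamp). The function name `parse_reimage_error` is hypothetical and not part of OneFuzz.

```python
import json
from datetime import datetime


def parse_reimage_error(raw: str) -> dict:
    """Split the error payload into its fields (assumes the shape shown above)."""
    payload = json.loads(raw)
    message, ids, heartbeat = payload["errors"]
    # "scaleset_id:<uuid> machine_id:<uuid>" -> {"scaleset_id": ..., "machine_id": ...}
    fields = dict(part.split(":", 1) for part in ids.split())
    # "last heartbeat:2021-07-20 21:23:11+00:00" -> timezone-aware datetime
    last_heartbeat = datetime.fromisoformat(heartbeat.split(":", 1)[1])
    return {
        "code": payload["code"],
        "message": message,
        "last_heartbeat": last_heartbeat,
        **fields,
    }
```

Having the last heartbeat as a timezone-aware `datetime` makes it easy to line it up against service-side logs when checking what happened around that time.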

Expected result

Task to run to completion

Actual result

The node to which the task was assigned was reimaged before the task completed.

@jagunter jagunter added the bug Something isn't working label Jul 20, 2021
@ghost ghost added the Needs: triage label Jul 20, 2021
@bmc-msft
Contributor

This was due to a "503 Service Unavailable" error coming back from the service after multiple retries.

Investigating why the service had issues.

@bmc-msft
Contributor

The Azure Functions integration into SignalR failed, which caused the agent-commands to fail, which caused the supervisor to stop.

At the moment, we're using the Azure Functions integration for SignalR, rather than handling the communication ourselves.

We'll investigate making the supervisor more resilient to the service having issues as well as making the SignalR integration more resilient.
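As an illustration of the first idea, a client can treat a 503 as transient and keep retrying with exponential backoff instead of stopping. The following is a minimal Python sketch, not the OneFuzz agent's code (its actual retry logic is not shown in this issue). `call_with_backoff`, `TransientServiceError`, and the delay values are assumptions made for the example.

```python
import random
import time


class TransientServiceError(Exception):
    """Hypothetical error raised when the service replies with a 503."""


def call_with_backoff(request, max_attempts=8, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call request() until it succeeds, backing off on transient errors."""
    for attempt in range(max_attempts):
        try:
            return request()
        except TransientServiceError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles each attempt, bounded by max_delay;
            # full jitter spreads out retries from many nodes.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter is injectable so the loop can be exercised without real waiting. Whether a long-running supervisor should ever give up, or keep retrying until the service recovers, is a design choice this sketch leaves open through `max_attempts`.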

@ghost ghost locked as resolved and limited conversation to collaborators Aug 20, 2021