This repository has been archived by the owner on Nov 1, 2023. It is now read-only.

Linux job failing due to "onefuzz-supervisor failed: Coordinator.send" #961

Closed
jagunter opened this issue Jun 3, 2021 · 5 comments · Fixed by #976
Labels
bug Something isn't working

Comments

jagunter (Member) commented Jun 3, 2021

The error message suggests this is an issue in OneFuzz itself. Note that the Linux hosts use the smaller OS disk, which is unavoidable until #112 is resolved. This fuzzer expects quite a bit of disk to be available, so disk space may be getting exhausted, though I'm not sure whether disk exhaustion would manifest with this particular error message.

Tried ssh'ing to the host to poke around a bit, but got stuck waiting on the proxy IP.

Information

  • Onefuzz version: 2.18
  • OS: Linux latest

Provide detailed reproduction steps (if any)

  1. Replay job c2427674-c3c7-47cb-a62a-8a925d50821e on prod01 instance

Expected result

Task success or failure due to an issue with the fuzzer

Actual result

    "error": {
        "code": 468,
        "errors": [
            "task failed. exit_status:code=1 signal=None success=False",
            "",
            "onefuzz-supervisor failed: Coordinator.send"
        ]
    },
@jagunter jagunter added the bug Something isn't working label Jun 3, 2021
@ghost ghost added the Needs: triage label Jun 3, 2021
jagunter (Member, Author) commented Jun 3, 2021

Also checked the nodes afterward; they're in the done state instead of halt. I would have expected the latter since the tasks failed, though maybe that is itself a clue.

        {
            "instance_id": "0",
            "machine_id": "4b2e10d7-d04e-4409-a459-8eb74d06c530",
            "state": "done"
        },
        {
            "instance_id": "1",
            "machine_id": "39bd7988-1f05-4037-be36-fb89de6c1cc5",
            "state": "done"
        },
        {
            "instance_id": "2",
            "machine_id": "f6d2c71b-7a99-4fd9-bbbc-6cd8f32e7a54",
            "state": "done"
        },

bmc-msft (Contributor) commented Jun 4, 2021

I've opened two PRs that add additional logging to the error context related to this:
#931, which was released in 2.19.0, and #963, which will be released soon. I'll try to replicate this job on our instance, but the additional information from these PRs should help diagnose this in the future.

jagunter (Member, Author) commented Jun 9, 2021

Error from a more recent run on a later version of OneFuzz (either 2.19 or 2.20):

onefuzz jobs tasks list 45e9851b-66ce-4840-94a2-dd3d97bca029 --query '[].error.errors' 
[
    [
        "prerequisite task failed"
    ],
    [
        "prerequisite task failed"
    ],
    [
        "task failed. exit_status:code=1 signal=None success=False",
        "",
        "onefuzz-supervisor failed: Coordinator.send\n\nCaused by:\n    0: request attempt 1 failed\n    1: HTTP status client error (401 Unauthorized) for url (https://redacted.azurewebsites.net/api/agents/commands)"
    ]
]

bmc-msft (Contributor) commented Jun 9, 2021

@jagunter thanks for the update. This helps.

@bmc-msft bmc-msft linked a pull request Jun 9, 2021 that will close this issue
jagunter (Member, Author) commented

Same error for job 84c4cb68-0885-4dc9-9829-d258d35105ac on OneFuzz version 2.21.

"onefuzz-supervisor failed: PollCommands\n\nCaused by:\n    0: non-status error after ensuring valid access token\n    1: Coordinator.send after refreshing access token\n    2: request attempt 6 failed\n    3: HTTP status client error (401 Unauthorized) for url (https://<redacted>.azurewebsites.net/api/agents/commands)"

@ghost ghost locked as resolved and limited conversation to collaborators Jul 14, 2021