
Health lambdas still time out occasionally #6097

Closed
achave11-ucsc opened this issue Mar 26, 2024 · 5 comments
Assignees
achave11-ucsc

Labels
- [priority] Medium
bug [type] A defect preventing use of the system as specified
debt [type] A defect incurring continued engineering cost
demo [process] To be demonstrated at the end of the sprint
demoed [process] Successfully demonstrated to team
enh [type] New feature or request
infra [subject] Project infrastructure like CI/CD, build and deployment scripts
noise [subject] Causing many false alarms
orange [process] Done by the Azul team

Comments

@achave11-ucsc
Member

Task timed out after the request to the bundles endpoint took too long.

```json
[
    {
        "@timestamp": "2024-03-24 06:35:29.809",
        "@message": "START RequestId: f266d06e-b3e2-4fb4-a5f1-3d5a4d42ea7e Version: $LATEST\n"
    },
    {
        "@timestamp": "2024-03-24 06:35:29.810",
        "@message": "[INFO]\t2024-03-24T06:35:29.810Z\tf266d06e-b3e2-4fb4-a5f1-3d5a4d42ea7e\tazul.health\tGetting health property 'api_endpoints'"
    },
    {
        "@timestamp": "2024-03-24 06:35:29.811",
        "@message": "[INFO]\t2024-03-24T06:35:29.811Z\tf266d06e-b3e2-4fb4-a5f1-3d5a4d42ea7e\tazul.health\tMaking HEAD request to https://service.azul.data.humancellatlas.org/index/bundles?size=1"
    },
    {
        "@timestamp": "2024-03-24 06:36:08.883",
        "@message": "2024-03-24T06:36:08.883Z f266d06e-b3e2-4fb4-a5f1-3d5a4d42ea7e Task timed out after 39.07 seconds\n\n"
    },
    {
        "@timestamp": "2024-03-24 06:36:08.883",
        "@message": "END RequestId: f266d06e-b3e2-4fb4-a5f1-3d5a4d42ea7e\n"
    },
    {
        "@timestamp": "2024-03-24 06:36:08.883",
        "@message": "REPORT RequestId: f266d06e-b3e2-4fb4-a5f1-3d5a4d42ea7e\tDuration: 39073.64 ms\tBilled Duration: 39000 ms\tMemory Size: 128 MB\tMax Memory Used: 121 MB\t\n"
    }
]
```
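For context, the check that timed out boils down to a single HTTP HEAD request against the service's bundles index, as shown in the log above. A minimal sketch of such a probe (not the actual `azul.health` code; the 30-second client-side timeout is an assumed value chosen to stay below the Lambda's ~39-second limit):

```python
import requests

# Illustrative only: probe the bundles index the way the health check's log
# above describes. A client-side timeout below the Lambda timeout lets a slow
# backend query surface as a handled failure instead of a task timeout.
url = 'https://service.azul.data.humancellatlas.org/index/bundles?size=1'
try:
    response = requests.head(url, timeout=30)  # timeout value is an assumption
    print('bundles endpoint up:', response.status_code == 200)
except requests.exceptions.RequestException as e:
    print('bundles endpoint down:', e)
```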
achave11-ucsc added the orange [process] Done by the Azul team label Mar 26, 2024
@achave11-ucsc
Member Author

@hannes-ucsc: "We should consider raising the error threshold."

@hannes-ucsc
Member

Assignee to determine a reasonable threshold based on historic alarm data.
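One way the historic alarm data could be pulled is sketched below with boto3; the alarm name and the 90-day look-back window are placeholders, not the actual Azul resource names:

```python
from datetime import datetime, timedelta, timezone

import boto3

# Count how often the error alarm actually fired, to inform the new threshold.
cloudwatch = boto3.client('cloudwatch')
end = datetime.now(timezone.utc)
start = end - timedelta(days=90)  # assumed look-back window
paginator = cloudwatch.get_paginator('describe_alarm_history')
to_alarm = [
    item
    for page in paginator.paginate(AlarmName='azul-servicecachehealth',  # placeholder name
                                   HistoryItemType='StateUpdate',
                                   StartDate=start,
                                   EndDate=end)
    for item in page['AlarmHistoryItems']
    if 'to ALARM' in item['HistorySummary']
]
print(f'{len(to_alarm)} transitions to ALARM in the last 90 days')
```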

hannes-ucsc added infra [subject] Project infrastructure like CI/CD, build and deployment scripts noise [subject] Causing many false alarms bug [type] A defect preventing use of the system as specified enh [type] New feature or request debt [type] A defect incurring continued engineering cost + [priority] High labels Mar 28, 2024
achave11-ucsc self-assigned this Mar 28, 2024
@achave11-ucsc
Member Author

Excluding events prior to the merge of #5467 into develop on Feb 6 (which added a gateway endpoint for S3 and DynamoDB and significantly reduced execution failures), and the lone event of #5927, which caused a single, transient execution failure, only the prod account has seen execution failures.

The only two failures were on the servicecachehealth Lambda: 1) a ConnectionError on 2024-03-09, and 2) a "Task timed out after 39.07 seconds" on 2024-03-23.

Based on this data, it seems appropriate to set the retry limit for these lambdas to one, given that the occurrence rate is low (only two per month) and is only evident in the prod deployment. The other deployments haven't seen any errors or timeouts since Feb 6 that weren't related to #5927.
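For illustration, the retry limit proposed here could be expressed as follows (placeholder function name; in Azul the equivalent setting would live in the infrastructure code rather than an ad-hoc API call):

```python
import boto3

# Cap asynchronous retries for the health check Lambda at one.
lambda_client = boto3.client('lambda')
lambda_client.put_function_event_invoke_config(
    FunctionName='azul-service-prod-servicecachehealth',  # placeholder name
    MaximumRetryAttempts=1,
)
```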

@dsotirho-ucsc
Contributor

dsotirho-ucsc commented Apr 4, 2024

@hannes-ucsc: "Increase retry for the log forwarder Lambdas (see comment)"

@hannes-ucsc: "Changed my mind, the retry increase for log forwarder lambdas should occur in PR #6217 for #5622 which is all about log forwarder lambdas. The fix for this issue would just involve explicitly setting the retry for the health check lambdas to 0 and increasing the error alarm threshold to one per day."

hannes-ucsc added - [priority] Medium and removed + [priority] High labels Apr 24, 2024
@hannes-ucsc
Member

hannes-ucsc commented Jun 11, 2024

For demo, show that the health check lambdas aren't retried, and that there were no alarms during a day in which either or both of the health check lambdas timed out exactly once.
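The demo could be backed by checks along these lines (a sketch with placeholder names and a placeholder date, not the actual demo script):

```python
from datetime import datetime, timezone

import boto3

# 1) The health check Lambda is not retried on asynchronous invocation errors.
lambda_client = boto3.client('lambda')
config = lambda_client.get_function_event_invoke_config(
    FunctionName='azul-service-prod-servicecachehealth')  # placeholder name
assert config['MaximumRetryAttempts'] == 0

# 2) No transitions to ALARM on a day with exactly one health check timeout.
cloudwatch = boto3.client('cloudwatch')
history = cloudwatch.describe_alarm_history(
    AlarmName='azul-servicecachehealth-errors',  # placeholder name
    HistoryItemType='StateUpdate',
    StartDate=datetime(2024, 6, 10, tzinfo=timezone.utc),  # placeholder date
    EndDate=datetime(2024, 6, 11, tzinfo=timezone.utc))
assert not any('to ALARM' in item['HistorySummary']
               for item in history['AlarmHistoryItems'])
```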

hannes-ucsc added the demo [process] To be demonstrated at the end of the sprint label Jun 11, 2024
achave11-ucsc added the demoed [process] Successfully demonstrated to team label Jun 18, 2024