
Health lambdas still time out occasionally #6097

Closed
achave11-ucsc opened this issue Mar 26, 2024 · 5 comments
Assignees
achave11-ucsc

Labels
- [priority] Medium
bug [type] A defect preventing use of the system as specified
debt [type] A defect incurring continued engineering cost
demo [process] To be demonstrated at the end of the sprint
demoed [process] Successfully demonstrated to team
enh [type] New feature or request
infra [subject] Project infrastructure like CI/CD, build and deployment scripts
noise [subject] Causing many false alarms
orange [process] Done by the Azul team

Comments

@achave11-ucsc
Member

Task timed out after the request to the bundles endpoint took too long.

```json
[
    {
        "@timestamp": "2024-03-24 06:35:29.809",
        "@message": "START RequestId: f266d06e-b3e2-4fb4-a5f1-3d5a4d42ea7e Version: $LATEST\n"
    },
    {
        "@timestamp": "2024-03-24 06:35:29.810",
        "@message": "[INFO]\t2024-03-24T06:35:29.810Z\tf266d06e-b3e2-4fb4-a5f1-3d5a4d42ea7e\tazul.health\tGetting health property 'api_endpoints'"
    },
    {
        "@timestamp": "2024-03-24 06:35:29.811",
        "@message": "[INFO]\t2024-03-24T06:35:29.811Z\tf266d06e-b3e2-4fb4-a5f1-3d5a4d42ea7e\tazul.health\tMaking HEAD request to https://service.azul.data.humancellatlas.org/index/bundles?size=1"
    },
    {
        "@timestamp": "2024-03-24 06:36:08.883",
        "@message": "2024-03-24T06:36:08.883Z f266d06e-b3e2-4fb4-a5f1-3d5a4d42ea7e Task timed out after 39.07 seconds\n\n"
    },
    {
        "@timestamp": "2024-03-24 06:36:08.883",
        "@message": "END RequestId: f266d06e-b3e2-4fb4-a5f1-3d5a4d42ea7e\n"
    },
    {
        "@timestamp": "2024-03-24 06:36:08.883",
        "@message": "REPORT RequestId: f266d06e-b3e2-4fb4-a5f1-3d5a4d42ea7e\tDuration: 39073.64 ms\tBilled Duration: 39000 ms\tMemory Size: 128 MB\tMax Memory Used: 121 MB\t\n"
    }
]
```
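For context, the check that timed out boils down to a single HTTP HEAD request against the service's bundles index, as shown in the log above. A minimal sketch of such a probe (not the actual `azul.health` code; the 30-second client-side timeout is an assumed value chosen to stay below the Lambda's ~39-second limit):

```python
import requests

# Illustrative only: probe the bundles index the way the health check's log
# above describes. A client-side timeout below the Lambda timeout lets a slow
# backend query surface as a handled failure instead of a task timeout.
url = 'https://service.azul.data.humancellatlas.org/index/bundles?size=1'
try:
    response = requests.head(url, timeout=30)  # timeout value is an assumption
    print('bundles endpoint up:', response.status_code == 200)
except requests.exceptions.RequestException as e:
    print('bundles endpoint down:', e)
```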
achave11-ucsc added the orange [process] Done by the Azul team label Mar 26, 2024
@achave11-ucsc
Member Author

@hannes-ucsc: "We should consider raising the error threshold."

@hannes-ucsc
Member

Assignee to determine a reasonable threshold based on historic alarm data.
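One way the historic alarm data could be pulled is sketched below with boto3; the alarm name and the 90-day look-back window are placeholders, not the actual Azul resource names:

```python
from datetime import datetime, timedelta, timezone

import boto3

# Count how often the error alarm actually fired, to inform the new threshold.
cloudwatch = boto3.client('cloudwatch')
end = datetime.now(timezone.utc)
start = end - timedelta(days=90)  # assumed look-back window
paginator = cloudwatch.get_paginator('describe_alarm_history')
to_alarm = [
    item
    for page in paginator.paginate(AlarmName='azul-servicecachehealth',  # placeholder name
                                   HistoryItemType='StateUpdate',
                                   StartDate=start,
                                   EndDate=end)
    for item in page['AlarmHistoryItems']
    if 'to ALARM' in item['HistorySummary']
]
print(f'{len(to_alarm)} transitions to ALARM in the last 90 days')
```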

hannes-ucsc added infra [subject] Project infrastructure like CI/CD, build and deployment scripts noise [subject] Causing many false alarms bug [type] A defect preventing use of the system as specified enh [type] New feature or request debt [type] A defect incurring continued engineering cost + [priority] High labels Mar 28, 2024
achave11-ucsc self-assigned this Mar 28, 2024
@achave11-ucsc
Member Author

Excluding events prior to the merge of #5467 into develop on Feb 6 (which added a gateway endpoint for S3 and DynamoDB and significantly reduced execution failures), and the lone event of #5927, which caused a single, transient execution failure, only the prod account has seen execution failures.

The only two failures were on the servicecachehealth Lambda: 1) a ConnectionError on 2024-03-09, and 2) a "Task timed out after 39.07 seconds" on 2024-03-23.

Based on this data, it seems appropriate to set the retry limit for these lambdas to one, given that the occurrence rate is low (only two per month) and is only evident in the prod deployment. The other deployments haven't seen any errors or timeouts since Feb 6 that weren't related to #5927.
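For illustration, the retry limit proposed here could be expressed as follows (placeholder function name; in Azul the equivalent setting would live in the infrastructure code rather than an ad-hoc API call):

```python
import boto3

# Cap asynchronous retries for the health check Lambda at one.
lambda_client = boto3.client('lambda')
lambda_client.put_function_event_invoke_config(
    FunctionName='azul-service-prod-servicecachehealth',  # placeholder name
    MaximumRetryAttempts=1,
)
```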

@dsotirho-ucsc
Contributor

dsotirho-ucsc commented Apr 4, 2024

@hannes-ucsc: "Increase retry for the log forwarder Lambdas (see comment)"

@hannes-ucsc: "Changed my mind, the retry increase for log forwarder lambdas should occur in PR #6217 for #5622 which is all about log forwarder lambdas. The fix for this issue would just involve explicitly setting the retry for the health check lambdas to 0 and increasing the error alarm threshold to one per day."

hannes-ucsc added - [priority] Medium and removed + [priority] High labels Apr 24, 2024
@hannes-ucsc
Member

hannes-ucsc commented Jun 11, 2024

For demo, show that the health check lambdas aren't retried, and that there were no alarms during a day in which either or both of the health check lambdas timed out exactly once.
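The demo could be backed by checks along these lines (a sketch with placeholder names and a placeholder date, not the actual demo script):

```python
from datetime import datetime, timezone

import boto3

# 1) The health check Lambda is not retried on asynchronous invocation errors.
lambda_client = boto3.client('lambda')
config = lambda_client.get_function_event_invoke_config(
    FunctionName='azul-service-prod-servicecachehealth')  # placeholder name
assert config['MaximumRetryAttempts'] == 0

# 2) No transitions to ALARM on a day with exactly one health check timeout.
cloudwatch = boto3.client('cloudwatch')
history = cloudwatch.describe_alarm_history(
    AlarmName='azul-servicecachehealth-errors',  # placeholder name
    HistoryItemType='StateUpdate',
    StartDate=datetime(2024, 6, 10, tzinfo=timezone.utc),  # placeholder date
    EndDate=datetime(2024, 6, 11, tzinfo=timezone.utc))
assert not any('to ALARM' in item['HistorySummary']
               for item in history['AlarmHistoryItems'])
```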

hannes-ucsc added the demo [process] To be demonstrated at the end of the sprint label Jun 11, 2024
achave11-ucsc added the demoed [process] Successfully demonstrated to team label Jun 18, 2024