fix: Feed builder lambda times out (#1381)
## Problem

The feed builder lambda was timing out and failing consistently, so the
Atom feed was not being updated.

## Solution

After investigation, it was determined that the default memory, 128MB, was
insufficient for the lambda to execute correctly. This PR increases the
memory to 1024MB, a number verified by dev testing, and adds an alarm
so that future failures will be flagged.
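
As a rough illustration of this kind of change, the sketch below shows how a memory bump and a failure alarm might be expressed with `aws-cdk-lib`. It is not the code from this commit: the class name, asset path, runtime, timeout, and alarm thresholds are all assumptions made for the example. Only the 1024MB figure and the `ConstructHub/FeedBuilder/Failure` alarm name come from this change.

```typescript
import { Duration, Stack } from 'aws-cdk-lib';
import { ComparisonOperator, TreatMissingData } from 'aws-cdk-lib/aws-cloudwatch';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import { Construct } from 'constructs';

// Hypothetical stack used only to illustrate the two settings discussed above.
export class FeedBuilderSketch extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);

    const feedBuilder = new lambda.Function(this, 'FeedBuilder', {
      runtime: lambda.Runtime.NODEJS_20_X, // assumed runtime
      handler: 'index.handler', // assumed entry point
      code: lambda.Code.fromAsset('lambda/feed-builder'), // hypothetical path
      memorySize: 1024, // in MB; the Lambda default is 128
      timeout: Duration.minutes(1), // assumed value
    });

    // Alarm as soon as any invocation fails within a 5-minute window
    // (period and threshold are assumptions for this sketch).
    feedBuilder.metricErrors({ period: Duration.minutes(5) }).createAlarm(this, 'FailureAlarm', {
      alarmName: 'ConstructHub/FeedBuilder/Failure',
      alarmDescription: 'The Feed Builder Function is failing.',
      comparisonOperator: ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
      threshold: 1,
      evaluationPeriods: 1,
      treatMissingData: TreatMissingData.NOT_BREACHING,
    });
  }
}
```

Treating missing data as not breaching lets the alarm return to green on its own once the function stops failing, which matches the resolution guidance in the runbook entry for this alarm.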

### Investigation notes

Below you can see the metrics from the lambda function before and after
the memory allocation was increased, at approximately 21:45 UTC.

<img width="1402" alt="Screenshot 2024-01-30 at 3 35 59 PM"
src="https://github.com/cdklabs/construct-hub/assets/139287474/42e47e4e-aaef-4daa-91a1-e3f637dfd152">

Note that when the issue was resolved, the invocations, throttles and
events spiked, then returned to a low steady state. I believe this is due
to the long backlog of work the lambda had to work through, as it hadn't
run correctly in ~seven months.

This fixes #1238.

----

*By submitting this pull request, I confirm that my contribution is made
under the terms of the Apache-2.0 license*
truggeriaws authored Feb 1, 2024
1 parent 3c02595 commit 525e1fb
Showing 5 changed files with 425 additions and 66 deletions.
53 changes: 34 additions & 19 deletions docs/operator-runbook.md
@@ -17,8 +17,8 @@ readily available.
### Storage Disaster

Every deployment of Construct Hub automatically allocates a failover bucket for every bucket it creates and uses.
The failover buckets are created with the exact same properites as the original buckets, but are not activated by default.
They exist in order for operators to perform scheduled snapshots of the original data, in preparation for a disater.
The failover buckets are created with the exact same properties as the original buckets, but are not activated by default.
They exist in order for operators to perform scheduled snapshots of the original data, in preparation for a disaster.

Construct Hub deployments provide CloudFormation outputs that list out the necessary commands you need to run in order to
create those snapshots, and backup your data into the failover buckets.
@@ -57,13 +57,34 @@ When you restore the original data and are ready to go back to the original buck
If the data loss/corruption is self-inflicted and continuous, i.e., Construct Hub misbehaves and mutates its own data in a faulty manner.
In this case switching to the failover won't help because the bad behavior will be applied on the failover buckets.
#### When to use this procedure.
#### When to use this procedure
This procedure is designed to be used as a reaction to a single and isolated corruption/loss event, either by human error or by the system.
**It's imperative that you validate the corruption is not continuous!**
## :rotating_light: ConstructHub Alarms
### `ConstructHub/FeedBuilder/Failure`
#### Description
This alarm goes off when the *Feed Builder Function* fails.
#### Investigation
The classical way of diagnosing Lambda Function failures is to dive into the
logs in CloudWatch Logs. Those can easily be found by looking under the
*Feed Builder Function* section of the backend dashboard, then clicking the *Search
Log Group* button.
For additional recommendations for diving into CloudWatch Logs, refer to the
[Diving into Lambda Function logs in CloudWatch Logs][#lambda-log-dive] section.
#### Resolution
The alarm will automatically go back to green once the Lambda function stops
failing.
### `ConstructHub/Ingestion/DLQNotEmpty`
#### Description
@@ -112,7 +133,6 @@ default). Once the dead-letter queue has cleared up, disable that trigger again.
> trigger and resume investigating. There may be a second problem that was
> hidden by the original one.
### `ConstructHub/Ingestion/Failure`
#### Description
@@ -145,7 +165,6 @@ queue, and caused the
[`ConstructHub/Ingestion/DLQNotEmpty`](#constructhubingestiondlqnotempty) alarm
to go off.
### `ConstructHub/InventoryCanary/Failures`
#### Description
@@ -169,7 +188,6 @@ For additional recommendations for diving into CloudWatch Logs, refer to the
The alarm will automatically go back to green once the Lambda function stops
failing. No further action is needed.
### `ConstructHub/InventoryCanary/NotRunning`
#### Description
@@ -202,7 +220,6 @@ request a quota increase.
The alarm will automatically go back to green once the Lambda function starts
running as scheduled again. No further action is needed.
### `ConstructHub/Orchestration/DLQ/NotEmpty`
#### Description
@@ -276,7 +293,6 @@ Machine for re-processing by running the *Redrive DLQ* function, linked from the
If messages are sent back to the dead-letter queue, perform the investigation
steps again.
### `ConstructHub/Orchestration/CatalogBuilder/ShrinkingCatalog`
#### Description
@@ -335,7 +351,6 @@ will always be different from the version ID you copied from):
}
```
### `ConstructHub/Orchestration/Resource/ExecutionsFailed`
#### Description
@@ -405,7 +420,6 @@ dead-letter queue need to be manually passed to new function invocations.
> :construction: An automated way to replay messages from the dead-letter queue
> will be provided in the future.
### `ConstructHub/Sources/CodeArtifact/*/Fowarder/Failures`
#### Description
@@ -434,7 +448,6 @@ Some messages may have been sent to the dead-letter queue, and caused the
[`ConstructHub/Sources/CodeArtifact/Fowarder/DLQNotEmpty`](#constructhubsourcescodeartifactfowarderdlqnotempty)
alarm to go off.
### `ConstructHub/Sources/NpmJs/Follower/Failures`
#### Description
@@ -459,7 +472,6 @@ For additional recommendations for diving into CloudWatch Logs, refer to the
This alarm will automatically go back to green once the `NpmJs` follower stops
failing. No further action is needed.
### `ConstructHub/Sources/NpmJs/Follower/NotRunning`
#### Description
@@ -553,7 +565,7 @@ For additional recommendations for diving into CloudWatch Logs, refer to the
#### Resolution
Once the root cause of the failures has been addressed, the messgaes from the
Once the root cause of the failures has been addressed, the messages from the
dead-letter queue can be automatically re-processed through the Lambda function
by enabling the SQS Trigger that is automatically configured on the function,
but is disabled by default.
@@ -566,7 +578,7 @@ disable the SQS Trigger again.
## :repeat: Bulk Re-processing
In some cases, it might be useful to re-process indexed packages through parts or
all of the back-end. This section descripts the options offered by the back-end
all of the back-end. This section describes the options offered by the back-end
system and when it is appropriate to use them.
### Overview
@@ -577,7 +589,7 @@ Two workflows are available for bulk-reprocessing:
through the entire pipeline, including re-generating the `metadata.json`
object. This is usually not necessary, unless an issue has been identified
with many indexed packages (incorrect or missing `metadata.json`, incorrectly
identfied construct framework package, etc...). In most cases, re-generating
identified construct framework package, etc...). In most cases, re-generating
the documentation is sufficient.
1. The "re-generate all documentation" workflow re-runs all indexed packages
through the documentation-generation process. This is useful when a new
@@ -612,7 +624,7 @@ object such as the following:
}
```
These informations may be useful to other operations as they observe the side
This information may be useful to other operations as they observe the side
effects of executing these workflows.
--------------------------------------------------------------------------------
@@ -647,6 +659,7 @@ unable to evaluate the metric.
Otherwise, look for traces of the package version in the logs of each step in
the pipeline:
- The NpmJs follower function
- The NpmJs stager function
- The backend orchestration workflow
@@ -695,12 +708,13 @@ to determine what is happening and resolve the problem.
#### Resolution
Once the canary starts unning normally again, the alarm will clear itself
Once the canary starts running normally again, the alarm will clear itself
without requiring any further intervention.
## :information_source: General Recommendations
### Diving into Lambda Function logs in CloudWatch Logs
[#lambda-log-dive]: #diving-into-lambda-function-logs-in-cloudwatch-logs
Diving into Lambda Function logs can seem daunting at first. The following are
@@ -713,9 +727,9 @@ often good first steps to take in such investigations:
points.
- Once you've homed in on a log entry from a failed Lambda execution, identify
the request ID for this trace.
+ Lambda log entries are formatted like so:
- Lambda log entries are formatted like so:
`<timestamp> <request-id> <log-level> <message>`
+ Extract the `request-id` segment (it is a UUID), and copy it in the search
- Extract the `request-id` segment (it is a UUID), and copy it in the search
bar, surrounded by double quotes (`"`)
- Remember, the search bar of CloudWatch Logs requires quoting if the searched
string includes any character that is not alphanumeric or underscore.
@@ -729,6 +743,7 @@ often good first steps to take in such investigations:
https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html
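
The request-ID step described earlier in this section can be sketched in a few lines of TypeScript. This is a minimal illustration, assuming log entries follow the `<timestamp> <request-id> <log-level> <message>` layout with whitespace between fields. The helper names `parseLambdaLogEntry` and `toSearchTerm` are hypothetical and are not part of Construct Hub or any AWS SDK.

```typescript
// Hypothetical helpers: only the log line layout and the quoting rule
// come from the runbook text; everything else is illustrative.
interface LambdaLogEntry {
  timestamp: string;
  requestId: string;
  level: string;
  message: string;
}

const UUID = '[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}';
const ENTRY = new RegExp(`^(\\S+)\\s+(${UUID})\\s+(\\S+)\\s+([\\s\\S]*)$`);

// Splits one log line into its four segments, or returns undefined when the
// line does not match the expected layout (e.g. platform START/END lines).
function parseLambdaLogEntry(line: string): LambdaLogEntry | undefined {
  const match = ENTRY.exec(line);
  if (!match) {
    return undefined;
  }
  const [, timestamp, requestId, level, message] = match;
  return { timestamp, requestId, level, message };
}

// Wraps the value in double quotes unless it consists solely of
// alphanumeric characters and underscores. Values that themselves contain
// double quotes are not handled by this sketch.
function toSearchTerm(value: string): string {
  return /^[A-Za-z0-9_]+$/.test(value) ? value : `"${value}"`;
}
```

Because a UUID contains hyphens, `toSearchTerm` always quotes a request ID, which is the form the search bar expects.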

### Diving into ECS logs in CloudWatch Logs

[#ecs-log-dive]: #diving-into-ecs-logs-in-cloudwatch-logs

ECS tasks emit logs into CloudWatch under a log group called