fix: Feed builder lambda times out (#1381)

## Problem

The feed builder lambda was timing out and consistently failing, causing the atom feed to not be updated.

## Solution

After investigation, it was determined that the default memory, 128MB, was insufficient for the lambda to execute correctly. This PR increases the memory to 1024MB, a number verified by dev testing, and adds an alarm so that future issues will be flagged.

### Investigation notes

Below you can see the metrics from the lambda function before and after increasing the memory allocation, at approximately 21:45 UTC.

<img width="1402" alt="Screenshot 2024-01-30 at 3 35 59 PM" src="https://github.com/cdklabs/construct-hub/assets/139287474/42e47e4e-aaef-4daa-91a1-e3f637dfd152">

Note that when the issue was resolved, the invocations, throttles and events spike, then return to a low steady state. I believe this is due to the long backlog of work the lambda had to work through, as it hasn't run correctly in ~seven months.

This fixes #1238.

----

*By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license*
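
For readers unfamiliar with how such a change looks in CDK, the following is a minimal sketch of the two parts of the fix described above: raising the function memory to 1024MB and alarming on failures. It is not the code from this PR. The stack name, construct IDs, runtime, inline handler, timeout, alarm period and threshold are all assumptions made for illustration; only the 1024MB figure and the alarm name `ConstructHub/FeedBuilder/Failure` come from this change.

```typescript
import { App, Duration, Stack } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as lambda from 'aws-cdk-lib/aws-lambda';

const app = new App();
const stack = new Stack(app, 'FeedBuilderSketch'); // hypothetical stack name

// Hypothetical stand-in for the feed builder function. The runtime, handler
// and timeout are placeholders, not the values used by Construct Hub.
const feedBuilder = new lambda.Function(stack, 'FeedBuilder', {
  runtime: lambda.Runtime.NODEJS_20_X,
  handler: 'index.handler',
  code: lambda.Code.fromInline('exports.handler = async () => {};'),
  memorySize: 1024, // in MB; the Lambda default is 128
  timeout: Duration.minutes(1),
});

// Alarm as soon as at least one invocation fails within a 5-minute period
// (period and threshold are assumptions). Missing data is treated as healthy
// so that a function that simply has not been invoked does not trip the alarm.
feedBuilder
  .metricErrors({ period: Duration.minutes(5) })
  .createAlarm(stack, 'FeedBuilderFailure', {
    alarmName: 'ConstructHub/FeedBuilder/Failure',
    alarmDescription: 'The Feed Builder Function is failing.',
    comparisonOperator:
      cloudwatch.ComparisonOperator.GREATER_THAN_OR_EQUAL_TO_THRESHOLD,
    evaluationPeriods: 1,
    threshold: 1,
    treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  });

app.synth();
```

Alarming on the `Errors` metric rather than on duration is a deliberate choice in this sketch: a timed-out invocation is reported by Lambda as an error, so a repeat of the timeout described above would trip the same alarm.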
cdklabs · Feb 1, 2024 · 525e1fb
1 parent 3c02595
commit 525e1fb

Showing 5 changed files with 425 additions and 66 deletions.
diff --git a/docs/operator-runbook.md b/docs/operator-runbook.md
@@ -17,8 +17,8 @@ readily available.
 ### Storage Disaster
 
 Every deployment of Construct Hub automatically allocates a failover bucket for every bucket it creates and uses.
-The failover buckets are created with the exact same properites as the original buckets, but are not activated by default.
-They exist in order for operators to perform scheduled snapshots of the original data, in preparation for a disater.
+The failover buckets are created with the exact same properties as the original buckets, but are not activated by default.
+They exist in order for operators to perform scheduled snapshots of the original data, in preparation for a disaster.
 
 Construct Hub deployments provide CloudFormation outputs that list out the necessary commands you need to run in order to
 create those snapshots, and backup your data into the failover buckets.
@@ -57,13 +57,34 @@ When you restore the original data and are ready to go back to the original buck
 If the data loss/corruption is self-inflicted and continuous, i.e. Construct Hub misbehaves and mutates its own data in a faulty manner.
 In this case switching to the failover won't help because the bad behavior will be applied on the failover buckets.
 
-#### When to use this procedure.
+#### When to use this procedure
 
 This procedure is designed to be used as a reaction to a single and isolated corruption/loss event, either by human error or by the system.
 **It's imperative you validate the corruption is not continuous!**
 
 ## :rotating_light: ConstructHub Alarms
 
+### `ConstructHub/FeedBuilder/Failure`
+
+#### Description
+
+This alarm goes off when the *Feed Builder Function* fails.
+
+#### Investigation
+
+The classical way of diagnosing Lambda Function failures is to dive into the
+logs in CloudWatch Logs. Those can easily be found by looking under the
+*Feed Builder Function* section of the backend dashboard, then clicking the *Search
+Log Group* button.
+
+For additional recommendations for diving into CloudWatch Logs, refer to the
+[Diving into Lambda Function logs in CloudWatch Logs][#lambda-log-dive] section.
+
+#### Resolution
+
+The alarm will automatically go back to green once the Lambda function stops
+failing.
+
 ### `ConstructHub/Ingestion/DLQNotEmpty`
 
 #### Description
@@ -112,7 +133,6 @@ default). Once the dead-letter queue has cleared up, disable that trigger again.
 > trigger and resume investigating. There may be a second problem that was
 > hidden by the original one.
 
-
 ### `ConstructHub/Ingestion/Failure`
 
 #### Description
@@ -145,7 +165,6 @@ queue, and caused the
 [`ConstructHub/Ingestion/DLQNotEmpty`](#constructhubingestiondlqnotempty) alarm
 to go off.
 
-
 ### `ConstructHub/InventoryCanary/Failures`
 
 #### Description
@@ -169,7 +188,6 @@ For additional recommendations for diving into CloudWatch Logs, refer to the
 The alarm will automatically go back to green once the Lambda function stops
 failing. No further action is needed.
 
-
 ### `ConstructHub/InventoryCanary/NotRunning`
 
 #### Description
@@ -202,7 +220,6 @@ request a quota increase.
 The alarm will automatically go back to green once the Lambda function starts
 running as scheduled again. No further action is needed.
 
-
 ### `ConstructHub/Orchestration/DLQ/NotEmpty`
 
 #### Description
@@ -276,7 +293,6 @@ Machine for re-processing by running the *Redrive DLQ* function, linked from the
 If messages are sent back to the dead-letter queue, perform the investigation
 steps again.
 
-
 ### `ConstructHub/Orchestration/CatalogBuilder/ShrinkingCatalog`
 
 #### Description
@@ -335,7 +351,6 @@ will always be different from the version ID you copied from):
 }
 ```
 
-
 ### `ConstructHub/Orchestration/Resource/ExecutionsFailed`
 
 #### Description
@@ -405,7 +420,6 @@ dead-letter queue need to be manually passed to new function invocations.
 > :construction: An automated way to replay messages from the dead-letter queue
 > will be provided in the future.
 
-
 ### `ConstructHub/Sources/CodeArtifact/*/Fowarder/Failures`
 
 #### Description
@@ -434,7 +448,6 @@ Some messages may have been sent to the dead-letter queue, and caused the
 [`ConstructHub/Sources/CodeArtifact/Fowarder/DLQNotEmpty`](#constructhubsourcescodeartifactfowarderdlqnotempty)
 alarm to go off.
 
-
 ### `ConstructHub/Sources/NpmJs/Follower/Failures`
 
 #### Description
@@ -459,7 +472,6 @@ For additional recommendations for diving into CloudWatch Logs, refer to the
 This alarm will automatically go back to green once the `NpmJs` follower stops
 failing. No further action is needed.
 
-
 ### `ConstructHub/Sources/NpmJs/Follower/NotRunning`
 
 #### Description
@@ -553,7 +565,7 @@ For additional recommendations for diving into CloudWatch Logs, refer to the
 
 #### Resolution
 
-Once the root cause of the failures has been addressed, the messgaes from the
+Once the root cause of the failures has been addressed, the messages from the
 dead-letter queue can be automatically re-processed through the Lambda function
 by enabling the SQS Trigger that is automatically configured on the function,
 but is disabled by default.
@@ -566,7 +578,7 @@ disable the SQS Trigger again.
 ## :repeat: Bulk Re-processing
 
 In some cases, it might be useful to re-process indexed packages through parts or
-all of the back-end. This section descripts the options offered by the back-end
+all of the back-end. This section describes the options offered by the back-end
 system and when it is appropriate to use them.
 
 ### Overview
@@ -577,7 +589,7 @@ Two workflows are available for bulk-reprocessing:
    through the entire pipeline, including re-generating the `metadata.json`
    object. This is usually not necessary, unless an issue has been identified
    with many indexed packages (incorrect or missing `metadata.json`, incorrectly
-   identfied construct framework package, etc...). In most cases, re-generating
+   identified construct framework package, etc...). In most cases, re-generating
    the documentation is sufficient.
 1. The "re-generate all documentation" workflow re-runs all indexed packages
    through the documentation-generation process. This is useful when a new
@@ -612,7 +624,7 @@ object such as the following:
 }
 ```
 
-These informations may be useful to other operations as they observe the side
+This information may be useful to other operations as they observe the side
 effects of executing these workflows.
 
 --------------------------------------------------------------------------------
@@ -647,6 +659,7 @@ unable to evaluate the metric.
 
 Otherwise, look for traces of the package version in the logs of each step in
 the pipeline:
+
 - The NpmJs follower function
 - The NpmJs stager function
 - The backend orchestration workflow
@@ -695,12 +708,13 @@ to determine what is happening and resolve the problem.
 
 #### Resolution
 
-Once the canary starts unning normally again, the alarm will clear itself
+Once the canary starts running normally again, the alarm will clear itself
 without requiring any further intervention.
 
 ## :information_source: General Recommendations
 
 ### Diving into Lambda Function logs in CloudWatch Logs
+
 [#lambda-log-dive]: #diving-into-lambda-function-logs-in-cloudwatch-logs
 
 Diving into Lambda Function logs can seem daunting at first. The following are
@@ -713,9 +727,9 @@ often good first steps to take in such investigations:
   points.
 - Once you've homed in on a log entry from a failed Lambda execution, identify
   the request ID for this trace.
-  + Lambda log entries are formatted like so:
+  - Lambda log entries are formatted like so:
     `<timestamp> <request-id> <log-level> <message>`
-  + Extract the `request-id` segment (it is a UUID), and copy it in the search
+  - Extract the `request-id` segment (it is a UUID), and copy it in the search
     bar, surrounded by double quotes (`"`)
 - Remember, the search bar of CloudWatch Logs requires quoting if the searched
   string includes any character that is not alphanumeric or underscore.
@@ -729,6 +743,7 @@ often good first steps to take in such investigations:
   https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html
 
 ### Diving into ECS logs in CloudWatch Logs
+
 [#ecs-log-dive]: #diving-into-ecs-logs-in-cloudwatch-logs
 
 ECS tasks emit logs into CloudWatch under a log group called