This document aims to help operators navigate a ConstructHub deployment to diagnose (and, where possible, solve) problems. The table of contents of this file should always allow an operator to jump directly to the relevant article. As much as possible, articles should be self-contained, to reduce the need for cross-referencing documents.
Most of the investigation instructions in this document refer to the backend dashboard of ConstructHub. Operators should ensure this dashboard is always readily available.
Every deployment of Construct Hub automatically allocates a failover bucket for every bucket it creates and uses. The failover buckets are created with the exact same properties as the original buckets, but are not activated by default. They exist in order for operators to perform scheduled snapshots of the original data, in preparation for a disaster.
Construct Hub deployments provide CloudFormation outputs that list the commands you need to run in order to create those snapshots and back up your data into the failover buckets.
After the Construct Hub deployment finishes, at a time of your choosing, locate those outputs in the CloudFormation console, under the "Outputs" tab, and run the commands:
aws s3 sync s3://<deny-list-bucket> s3://<failover-deny-list-bucket>
aws s3 sync s3://<ingestion-config-bucket> s3://<failover-ingestion-config-bucket>
aws s3 sync s3://<license-list-bucket> s3://<failover-license-list-bucket>
aws s3 sync s3://<package-data-bucket> s3://<failover-package-data-bucket>
aws s3 sync s3://<staging-bucket> s3://<failover-staging-bucket>
aws s3 sync s3://<website-bucket> s3://<failover-website-bucket>
It is recommended that you run these commands from within the AWS network, preferably in the same region as the deployment.
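The bucket names used in these commands come from the stack's CloudFormation outputs. If you prefer to retrieve them from the command line rather than the console, a sketch like the following should work (replace <construct-hub-stack-name> with the name of your Construct Hub stack):

# List all stack outputs, including the failover backup commands
aws cloudformation describe-stacks \
  --stack-name <construct-hub-stack-name> \
  --query 'Stacks[0].Outputs'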
Once these commands finish, all your data will be backed up into the failover buckets, and you should be ready. When a storage-related disaster strikes, simply activate the failover buckets:
new ConstructHub(this, 'ConstructHub', {
  failoverStorageActive: true,
  ...
});
Then deploy this to your environment. This will swap out all the original buckets with the pre-populated failover buckets. Note that any data that was indexed in Construct Hub after the snapshot was created will not be available immediately once you perform the failover. Construct Hub will pick up discovery from the marker that was included in the last snapshot.
When you restore the original data and are ready to go back to the original buckets, simply remove this property and deploy again.
Note that if the data loss/corruption is self-inflicted and continuous (i.e. Construct Hub misbehaves and mutates its own data in a faulty manner), switching to the failover won't help, because the bad behavior will be applied to the failover buckets as well.
This procedure is designed to be used as a reaction to a single and isolated corruption/loss event, caused either by human error or by the system. It is imperative that you validate the corruption is not continuous!
This alarm goes off when the Feed Builder Function fails.
The classical way of diagnosing Lambda Function failures is to dive into the logs in CloudWatch Logs. Those can easily be found by looking under the Feed Builder Function section of the backend dashboard, then clicking the Search Log Group button.
For additional recommendations for diving into CloudWatch Logs, refer to the Diving into Lambda Function logs in CloudWatch Logs section.
The alarm will automatically go back to green once the Lambda function stops failing.
This alarm goes off when the dead-letter queue for the ingestion function is not empty. This means messages sent by package sources to the ingestion SQS queue have failed processing through the ingestion function.
⚠️ Messages in the dead-letter queue can only be persisted there for up to 14 days. If a problem cannot be investigated and resolved within this time frame, it is recommended to copy those messages out to persistent storage for later re-processing.
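If the investigation might exceed that window, a rough way to copy messages out is to read them with the AWS CLI and store the raw JSON in durable storage of your own. This is only a sketch: the queue URL and backup bucket are placeholders, and receiving messages does not delete them (they become visible again after the visibility timeout expires):

# Receive up to 10 messages, including all attributes, and save the raw JSON
FILE=ingestion-dlq-backup-$(date +%s).json
aws sqs receive-message \
  --queue-url <dead-letter-queue-url> \
  --max-number-of-messages 10 \
  --attribute-names All \
  --message-attribute-names All > "$FILE"

# Copy the saved file to durable storage for later re-processing
aws s3 cp "$FILE" s3://<your-backup-bucket>/ingestion-dlq/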
The dead-letter queue can be accessed by clicking the DLQ button under the Ingestion Function section of the backend dashboard.
Messages in the dead-letter queue have attributes that can be used to determine the cause of the failure. These can be inspected by going into the Send and Receive messages panel of the SQS console.
If the message attributes were not sufficient to determine the failure cause,
the Lambda function logs can be searched using the Search Log Group button
in the backend dashboard. Since messages are sent to the dead-letter queue only
after having caused the Ingestion Function to fail several times, the Errors
metric will have data points that can be used to narrow the time-frame of log
searches.
For additional recommendations for diving into CloudWatch Logs, refer to the Diving into Lambda Function logs in CloudWatch Logs section.
Once the cause of the issue has been identified, and resolved (unless it was a transient problem), messages from the dead-letter queue can be sent back for processing into the Ingestion Function by going to the Lambda console using the link in the backend dashboard, then browsing into Configuration, then Triggers, and enabling the SQS trigger from the dead-letter queue (this trigger is automatically configured by ConstructHub, but is disabled by default). Once the dead-letter queue has cleared up, disable that trigger again.
It is possible messages still fail processing, in which case they will remain in the dead-letter queue. If the queue is not cleared up after you have allowed sufficient time for all messages to be re-attempted, disable the trigger and resume investigating. There may be a second problem that was hidden by the original one.
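If the Lambda console is inconvenient, the same trigger can be toggled from the command line; this is a sketch, with the Ingestion Function name and mapping UUID as placeholders:

# Find the UUID of the (disabled) event source mapping pointing at the dead-letter queue
aws lambda list-event-source-mappings --function-name <ingestion-function-name>

# Enable it so messages are re-driven through the function
aws lambda update-event-source-mapping --uuid <mapping-uuid> --enabled

# Once the dead-letter queue has cleared up, disable it again
aws lambda update-event-source-mapping --uuid <mapping-uuid> --no-enabled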
This alarm goes off when the Ingestion Function fails. It has higher sensitivity than the `ConstructHub/Ingestion/DLQNotEmpty` alarm, and may trigger before messages make it to the dead-letter queue.
It may be indicative of a problem with the package sources (sending broken messages to the ingestion queue), or of a general degradation in the availability of the Ingestion Function's dependencies.
The classical way of diagnosing Lambda Function failures is to dive into the logs in CloudWatch Logs. Those can easily be found by looking under the Ingestion Function section of the backend dashboard, then clicking the Search Log Group button.
For additional recommendations for diving into CloudWatch Logs, refer to the Diving into Lambda Function logs in CloudWatch Logs section.
The alarm will automatically go back to green once the Lambda function stops failing.
Some of the ingestion queue messages may however have made it to the dead-letter queue, and caused the `ConstructHub/Ingestion/DLQNotEmpty` alarm to go off.
The Inventory Canary is failing. This means the graphs in the backend dashboard under the Catalog Overview and Documentation Generation sections may contain inaccurate data (or no data at all).
The classical way of diagnosing Lambda Function failures is to dive into the logs in CloudWatch Logs. Those can easily be found by looking under the Catalog Overview section of the backend dashboard, then clicking the Search Canary Log Group button.
For additional recommendations for diving into CloudWatch Logs, refer to the Diving into Lambda Function logs in CloudWatch Logs section.
The alarm will automatically go back to green once the Lambda function stops failing. No further action is needed.
The Inventory Canary is not running. This function is scheduled to run every 5 minutes, and produces the CloudWatch metrics that back the graphs under the Catalog Overview and Documentation Generation sections of the backend dashboard.
The Inventory Canary is triggered by an EventBridge rule, which can be reviewed in the Lambda console (accessed by clicking the Inventory Canary button under the Catalog Overview title of the backend dashboard), under Configuration then Triggers. Verify the EventBridge trigger exists (if not, it may need to be manually re-created).
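The same check can be performed from the CLI by listing the EventBridge rules that target the canary function; the function ARN and rule name below are placeholders:

# List the EventBridge rules that have the Inventory Canary function as a target
aws events list-rule-names-by-target --target-arn <inventory-canary-function-arn>

# Check whether a given rule is ENABLED or DISABLED
aws events describe-rule --name <rule-name>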
If the rule exists, but the function does not run, the most likely cause is that your account has run out of Lambda concurrent executions. This will manifest as `Throttles` being visible under the Monitoring then Metrics panel of the Lambda console.
If the issue is that the account has run out of Lambda concurrency, consider filing for a quota increase with AWS Support. The AWS documentation provides more information about how to request a quota increase.
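To check how much Lambda concurrency the account has, and how much of it is unreserved, the following command can help (it takes no deployment-specific parameters):

# Reports ConcurrentExecutions (the account limit) and UnreservedConcurrentExecutions
aws lambda get-account-settings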
The alarm will automatically go back to green once the Lambda function starts running as scheduled again. No further action is needed.
The dead-letter queue of the orchestration state machine is not empty. This means that some packages could not be processed by Construct Hub, and therefore might be missing documentation for one or more languages, or may not be referenced in the catalog at all.
⚠️ Messages in the dead-letter queue can only be persisted there for up to 14 days. If a problem cannot be investigated and resolved within this time frame, it is recommended to copy those messages out to persistent storage for later re-processing.
The dead-letter queue can be accessed by clicking the DLQ button under the Orchestration title of the backend dashboard. Messages in the queue include information about the failure that caused the message to be placed here. They include the following elements:
Key | Description
--- | ---
`$TaskExecution` | References to the StateMachine execution
`DocGen[].error` | Error message returned by a DocGen task
`catalogBuilderOutput.error` | Error message returned by the catalog builder
For each language supported by the Construct Hub, there should be an entry under the `DocGen` array. If the `error` field has a value or an empty object (`{}`), it means that this specific language failed. If the information under `error` is not sufficient, a deeper dive into the execution logs of the specific doc gen task is required.
To see the execution logs of a specific task, locate the step function execution by clicking the State Machine button in the backend dashboard, and search for the execution named at `$TaskExecution.Name`.
Open the execution details and locate the failed tasks. Failed tasks are colored orange or red in the state diagram.
Reviewing the logs of various tasks can be useful to obtain more information. Tasks are retried automatically by the state machine, so it might be useful to review a few failures to identify if an error is endemic or transient.
Click on the URL under Resource in the Details tab in order to jump to the AWS console for this specific task execution and view logs from there.
In the case of ECS tasks, the CloudWatch logs for a particular execution can be found by following the links from the state machine execution events to the ECS task, then to the CloudWatch Logs stream for that execution.
If ECS says "We couldn't find the requested content.", the task execution has already been deleted from ECS, and you should be able to go directly to the CloudWatch logs for this task. See the Diving into ECS logs in CloudWatch section for details on how to find the CloudWatch logs for this task based on the task ID.
For Lambda tasks, the request ID can be obtained from the corresponding `TaskSucceeded` or `TaskFailed` event in the state machine execution trace, which can be searched for in the Lambda function's CloudWatch Logs.
For additional recommendations for diving into CloudWatch Logs, refer to the Diving into Lambda Function logs in CloudWatch Logs section.
Once the root cause has been identified and fixed (unless this was a transient issue), messages from the dead-letter queue can be sent back to the State Machine for re-processing by running the Redrive DLQ function, linked from the Orchestration section of the backend dashboard.
If messages are sent back to the dead-letter queue, perform the investigation steps again.
This alarm goes off if the count of packages in the `catalog.json` object, which backs the search experience of ConstructHub, reduces by more than 5 items, meaning packages are no longer accessible for search.
Packages can be removed from `catalog.json` in normal circumstances: when a package is added to the deny-list of the deployment, it will eventually be pruned from the catalog. If many packages are added to the deny-list at the same time, this alarm might go off.
Review the CloudWatch metric associated with the alarm to understand whether the magnitude of the catalog size change corresponds to a known or expected event. If the change corresponds to an expected event (e.g. due to a change in deny-list contents), you can treat the alarm as a false positive.
On the other hand, if the catalog contraction is unexpected, investigate the logs of the Catalog Builder function to identify any unexpected activity.
The package data bucket is configured with object versioning. You can identify a previous "good" version of the `catalog.json` object by reviewing the object history in the S3 console (or using the AWS CLI or SDK). The number of elements in `catalog.json` is reported in a metadata attribute of the object in S3, which can help identify the correct previous version without necessarily having to download all of them for inspection.
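As a sketch of what that inspection can look like with the AWS CLI (the bucket name is a placeholder, and the exact name of the metadata attribute may vary between versions of ConstructHub):

# List all versions of catalog.json currently retained in the bucket
aws s3api list-object-versions \
  --bucket <package-data-bucket> \
  --prefix catalog.json

# Inspect the metadata of a candidate version (the package count is reported there)
aws s3api head-object \
  --bucket <package-data-bucket> \
  --key catalog.json \
  --version-id <version-id>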
When the relevant version has been identified, it can be restored using the following AWS CLI command (replace `<bucket-name>` with the relevant package data bucket name, and `<version-id>` with the S3 version ID you have selected):
$ aws s3api copy-object \
--bucket='<bucket-name>' \
--copy-source='<bucket-name>/catalog.json?versionId=<version-id>' \
--key='catalog.json'
This will produce an output similar to the following (note that the `VersionId` value there is the new current version of the `catalog.json` object, which will always be different from the version ID you copied from):
{
"CopyObjectResult": {
"LastModified": "2015-11-10T01:07:25.000Z",
"ETag": "\"589c8b79c230a6ecd5a7e1d040a9a030\""
},
"VersionId": "YdnYvTCVDqRRFA.NFJjy36p0hxifMlkA"
}
The orchestration state machine has failing executions. This means the workflow has failed on an unexpected error.
This is often the sign there is a bug in the state machine specification, or that some of the state machine's downstream dependencies are experiencing degraded availability.
⚠️ Failed state machine executions may not have succeeded sending their input to the dead-letter queue.
Review the failed state machine executions, which can be found after clicking
the State Machine button under Orchestration in the backend dashboard. You
may use the Filter by status dropdown list to isolate failed executions.
Review the execution trace to find the `ExecutionFailed` event (at the very end of the events list).
If relevant, file a bug report to have the state machine specification fixed.
Failed state machine executions should be manually re-started using the StepFunctions console.
One instance of this alarm exists for each configured CodeArtifact source. It triggers when CodeArtifact events (received via EventBridge) have failed processing through the Forwarder Function enough times to make it to the dead-letter queue. Those events have not been notified to the ingestion queue, and the packages that triggered them have not been ingested.
⚠️ Messages in the dead-letter queue can only be persisted there for up to 14 days. If a problem cannot be investigated and resolved within this time frame, it is recommended to copy those messages out to persistent storage for later re-processing.
Locate the relevant CodeArtifact package source in the backend dashboard, and click the DLQ button to access the dead-letter queue. Messages in the queue have attributes providing information about the last failure that happened before they were sent to the dead-letter queue.
If that information is not sufficient to understand the problem, click the Search Log Group button to dive into the function's logs in CloudWatch Logs.
For additional recommendations for diving into CloudWatch Logs, refer to the Diving into Lambda Function logs in CloudWatch Logs section.
Once the root cause has been fixed, messages from the dead-letter queue need to be sent back to the Forwarder Function for processing; they must be manually passed to new function invocations.
🚧 An automated way to replay messages from the dead-letter queue will be provided in the future.
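Until such automation exists, a rough manual approach is to pull a message from the dead-letter queue, invoke the Forwarder Function with the message body as its payload, then delete the message once the invocation succeeds. This is only a sketch (it uses jq, and the queue URL and function name are placeholders); it assumes the message body is the original EventBridge event the function expects, so verify that against a real message before replaying anything in bulk:

# Pull one message from the dead-letter queue
aws sqs receive-message --queue-url <forwarder-dlq-url> --max-number-of-messages 1 > msg.json

# Extract the body (assumed to be the original event) and invoke the Forwarder Function with it
jq -r '.Messages[0].Body' msg.json > event.json
aws lambda invoke --function-name <forwarder-function-name> \
  --payload file://event.json --cli-binary-format raw-in-base64-out out.json

# If the invocation succeeded, delete the message from the queue
aws sqs delete-message --queue-url <forwarder-dlq-url> \
  --receipt-handle "$(jq -r '.Messages[0].ReceiptHandle' msg.json)"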
One instance of this alarm exists for each configured CodeArtifact source. It triggers when CodeArtifact events (via EventBridge) fail processing through the Forwarder Function, which filters messages and notifies the ingestion queue when appropriate. This means newly published packages from the CodeArtifact repository are not ingested anymore.
Locate the relevant CodeArtifact package source in the backend dashboard, and click the Search Log Group button to dive into the logs of the forwarder function.
For additional recommendations for diving into CloudWatch Logs, refer to the Diving into Lambda Function logs in CloudWatch Logs section.
This alarm will automatically go back to green once the CodeArtifact forwarder stops failing.
Some messages may have been sent to the dead-letter queue, and caused the `ConstructHub/Sources/CodeArtifact/Fowarder/DLQNotEmpty` alarm to go off.
This alarm is only provisioned when the `NpmJs` package source is configured. It triggers when executions encounter failures, preventing new packages from `npmjs.com` from being discovered and ingested.
Click the NpmJs Follower button in the backend dashboard to reach the Lambda console for this function. Under Monitoring, then Logs, you will find a list of links to recent invocations, which is a great place to start for understanding what happens. In most cases, only the latest invocation is relevant.
For additional recommendations for diving into CloudWatch Logs, refer to the Diving into Lambda Function logs in CloudWatch Logs section.
This alarm will automatically go back to green once the `NpmJs` follower stops failing. No further action is needed.
This alarm is only provisioned when the `NpmJs` package source is configured. It triggers when the function is not running at the scheduled rate (every 5 minutes). This means new packages published to `npmjs.com` are not discovered and ingested.
The NpmJs Follower is triggered by an EventBridge rule, which can be reviewed in the Lambda console (accessed by clicking the NpmJs Follower button in the backend dashboard), under Configuration then Triggers. Verify the EventBridge trigger exists (if not, it may need to be manually re-created).
If the rule exists, but the function does not run, the most likely cause is that your account has run out of Lambda concurrent executions. This will manifest as `Throttles` being visible under the Monitoring then Metrics panel of the Lambda console.
If the issue is that the account has run out of Lambda concurrency, consider filing for a quota increase with AWS Support. The AWS documentation provides more information about how to request a quota increase.
The alarm will automatically go back to green once the Lambda function starts running as scheduled again. No further action is needed.
This alarm is only provisioned when the `NpmJs` package source is configured. It triggers when the function has not registered any changes from the `npmjs.com` registry in 10 minutes.
The NpmJs Follower tracks changes from the CouchDB replica at `replicate.npmjs.com/registry`, and uses the `seq` properties to determine what changes have already been processed.
Occasionally, the replica instance will be replaced, and the sequence numbers will be reset by this action. The NpmJs Follower detects this condition and automatically rolls back, so this event should not trigger the alarm.
Look at `npmjs.com` status updates and announcements. This alarm may go off in case a major outage prevents `npmjs.com` from accepting new package versions for more than 10 minutes. There is nothing you can do in this case.
If this has not happened, review the logs of the NpmJs Follower to identify any problem.
For additional recommendations for diving into CloudWatch Logs, refer to the Diving into Lambda Function logs in CloudWatch Logs section.
The alarm will automatically go back to green once the Lambda function starts reporting `npmjs.com` registry changes again. No further action is needed.
This alarm is only provisioned when the `NpmJs` package source is configured. It triggers when the stager function has failed processing an input message 3 times in a row, which resulted in that message being sent to the dead-letter queue.
The package versions that were targeted by those messages have hence not been ingested into ConstructHub.
The NpmJs Stager receives messages from the NpmJs Follower function for each new package version identified. It downloads the npm package tarball, stores it into a staging S3 bucket, then notifies the ConstructHub Ingestion queue so the package version is indexed into ConstructHub.
An `npmjs.com` outage could result in failures to download the tarballs, so start by checking the `npmjs.com` status updates and announcements.
Additionally, review the logs of the NpmJs Stager function to identify any problem.
For additional recommendations for diving into CloudWatch Logs, refer to the Diving into Lambda Function logs in CloudWatch Logs section.
Once the root cause of the failures has been addressed, the messages from the dead-letter queue can be automatically re-processed through the Lambda function by enabling the SQS Trigger that is automatically configured on the function, but is disabled by default.
Once all messages have cleared from the dead-letter queue, do not forget to disable the SQS Trigger again.
In some cases, it might be useful to re-process indexed packages through parts or all of the back-end. This section describes the options offered by the back-end system and when it is appropriate to use them.
Two workflows are available for bulk-reprocessing:
- The "re-ingest everything" workflow can be used to re-process packages
through the entire pipeline, including re-generating the
metadata.json
object. This is usually not necessary, unless an issue has been identified with many indexed packages (incorrect or missingmetadata.json
, incorrectly identified construct framework package, etc...). In most cases, re-generating the documentation is sufficient. - The "re-generate all documentation" workflow re-runs all indexed packages through the documentation-generation process. This is useful when a new language is added to ConstructHub, or the rendered documentation has significantly changed, as it will guarantee all packages are on the latest version of it.
Optionally, it is possible to configure Construct Hub to automatically run the "re-ingest everything" workflow on a periodic basis, so that documentation pages are up to date with the latest changes in our documentation-generation tooling, and to ensure packages are up to date with the latest `packageTags` and `packageLinks` configuration. However, please note that this may be computationally expensive.
new ConstructHub(stack, 'MyConstructHub', {
reprocessFrequency: cdk.Duration.days(1)
});
In the AWS Console, navigate to the StepFunctions console, and identify the ConstructHub workflows. Simply initiate a new execution of the workflow of your choice - the input payload is not relevant, and we recommend setting it to an object such as the following:
{
"requester": "your-username",
"reason": "A short comment explaining why this workflow was ran"
}
This information may be useful to other operators as they observe the side effects of executing these workflows.
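If you prefer the CLI over the console, an execution can also be started with the Step Functions API; the state machine ARN is a placeholder:

aws stepfunctions start-execution \
  --state-machine-arn <workflow-state-machine-arn> \
  --input '{"requester": "your-username", "reason": "A short comment explaining why this workflow was run"}'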
This alarm is only provisioned in case the NpmJs package canary was configured. It triggers when the canary detects that a recently published package version (by default, the tracked package is `construct-hub-probe`) was not discovered and indexed within the predefined SLA period (by default, 5 minutes). This means the hub might not be discovering new package versions.
The alarm will persist as long as any tracked version of the probe package is still missing from the ConstructHub instance past the configured SLA, or if the latest version was ingested out-of-SLA.
If the alarm went off due to insufficient data, the canary might not be emitting metrics properly. In this case, start by ensuring the Lambda function that implements the canary is executing as intended. It is normally scheduled to run every minute, but might have been unable to execute, for example, if your account ran out of Lambda concurrent executions for a while. The Lambda function can be found in the Lambda console: its description contains `Sources/NpmJs/PackageCanary`. If the function runs as intended, dive into the Lambda logs to understand why it might be unable to evaluate the metric.
Otherwise, look for traces of the package version in the logs of each step in the pipeline:
- The NpmJs follower function
- The NpmJs stager function
- The backend orchestration workflow
- The Doc-Gen ECS task logs
- The catalog builder
For additional recommendations for diving into CloudWatch Logs, refer to the Diving into Lambda Function logs in CloudWatch Logs section.
The alarm will automatically go back to green once all outstanding versions of the configured canary package are available in the ConstructHub instance, and the latest revision thereof is within SLA.
If there is a reason why a tracked version cannot possibly be ingested, the S3 object backing the canary state can be deleted, which will effectively re-initialize the canary to track only the latest available version.
This alarm is only provisioned in case the NpmJs package canary was configured. It triggers when the canary is not running as expected, or is reporting failures.
When the NpmJs package canary does not successfully run, the `ConstructHub/Sources/NpmJs/Canary/SLA-Breached` alarm cannot be triggered due to lack of data. This may hence hide customer-visible problems.
In the AWS Console, verify whether the alarm triggered due to `ConstructHub/Sources/NpmJs/Canary/Failing` or `ConstructHub/Sources/NpmJs/Canary/NotRunning`.
If the canary is not running, verify that the scheduled trigger for the NpmJs package canary is correctly enabled. If it is, and the canary is still not running, the account might have run out of available AWS Lambda concurrency, and a limit increase request might be necessary. When that is the case, the function will report this via the `Throttles` metric.
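A quick way to confirm throttling from the CLI (function name and time range are placeholders):

# Sum of throttled invocations for the canary function over the chosen window
aws cloudwatch get-metric-statistics \
  --namespace AWS/Lambda \
  --metric-name Throttles \
  --dimensions Name=FunctionName,Value=<canary-function-name> \
  --start-time <start-time> --end-time <end-time> \
  --period 300 --statistics Sum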
Otherwise, dive into the Lambda logs of the Canary function to determine what is happening and resolve the problem.
Once the canary starts running normally again, the alarm will clear itself without requiring any further intervention.
Diving into Lambda Function logs can seem daunting at first. The following are often good first steps to take in such investigations:
- If possible, narrow down the CloudWatch Logs Search time range around the event you are investigating. Try to keep the time range as narrow as possible, ideally less than one hour wide.
- Start by searching for `"ERROR"` - this will often yield interesting starting points.
- Once you've homed in on a log entry from a failed Lambda execution, identify the request ID for this trace.
  - Lambda log entries are formatted like so: `<timestamp> <request-id> <log-level> <message>`
  - Extract the `request-id` segment (it is a UUID), and copy it in the search bar, surrounded by double quotes (`"`).
- Remember, the search bar of CloudWatch Logs requires quoting if the searched string includes any character that is not alphanumeric or underscore.
- For more information on CloudWatch Log search patterns, refer to the CloudWatch Logs documentation.
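The same kind of search can be run from the CLI with `aws logs filter-log-events`; as a sketch (log group name, request ID, and timestamps are placeholders, with timestamps in epoch milliseconds):

# Find error entries within a narrow time window
aws logs filter-log-events \
  --log-group-name <lambda-log-group-name> \
  --filter-pattern '"ERROR"' \
  --start-time <epoch-ms> --end-time <epoch-ms>

# Pull every entry belonging to one invocation by quoting its request ID
aws logs filter-log-events \
  --log-group-name <lambda-log-group-name> \
  --filter-pattern '"<request-id>"'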
ECS tasks emit logs into CloudWatch under a log group with `ConstructHubOrchestrationTransliteratorLogGroup` in its name, and into a log stream named `transliterator/Resource/$TASKID` (e.g. `transliterator/Resource/6b5c48f0a7624396899c6a3c8474d5c7`).