With tests for deployed code, and group IDs to ensure sequential processing within a group (e.g. per OrderId).
Based on gh/cdk-patterns/serverless/the-scalable-webhook
This is an implementation of the Webhook Buffer pattern - see the end of this README for the original docs this repo was forked from which detail the pattern.
A Webhook Buffer inserts a buffer, via a FIFO queue, in front of an endpoint. Implemented with highly available and scalable AWS services, this buffer shields you from backend scalability issues and downtime - in the event an important backend is down you don't lose any requests sent to the endpoint.
The Webhook Buffer is a passthrough that largely forwards whatever body and headers it receives to the backend. We have added a grouping ID to maintain ordering - i.e. if you use your order_id as the grouping ID then all operations for a given order will remain in order.
The endpoint immediately returns 200 for all requests, so this pattern is not suitable for endpoints that need to return anything to the caller. (ed: technically, it returns as soon as the request is queued.)
It expects a grouping ID to decide which messages can be processed in parallel and which must be serialized (think: order_id). Implementation is with API Gateway and lambdas.
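For illustration, here is a minimal sketch of the publish side - assuming the AWS SDK v2, a `QUEUE_URL` environment variable, an `x-group-id` header, and content-based deduplication on the FIFO queue (these names are illustrative, not the repo's actual contract):

```typescript
import { SQS } from "aws-sdk";
import { APIGatewayProxyEvent, APIGatewayProxyResult } from "aws-lambda";

const sqs = new SQS();

export const handler = async (
  event: APIGatewayProxyEvent
): Promise<APIGatewayProxyResult> => {
  // Forward body and headers untouched; the subscribe lambda replays them.
  const message = JSON.stringify({ body: event.body, headers: event.headers });

  await sqs
    .sendMessage({
      QueueUrl: process.env.QUEUE_URL!, // FIFO queue URL injected by the stack
      MessageBody: message,
      // The grouping ID (e.g. an order_id) keeps that group's messages in
      // order. Assumes content-based deduplication is enabled on the queue.
      MessageGroupId: event.headers["x-group-id"] || "default",
    })
    .promise();

  // 200 is returned as soon as the message is queued, not when it's delivered.
  return { statusCode: 200, body: "queued" };
};
```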
We have added integration with Sentry and DataDog.
We used this project as a chance to compare cdk with terraform. We weren't overly impressed, running into some of cdk's limitations:
- This project does not have a CI/CD pipeline. Deploys are run via `npm run deploy` as below - but they do run post-deploy smoke tests.
- Unlike terraform, `cdk` will not show you a plan or diff and ask if you want to apply. Unless it encounters an error it will just deploy/update immediately. Note that CDK does not deal well with drift in the deployed setup, so if you tweaked things manually and came here to reset them you might find CDK just refuses to apply when resources have changed. See Drift Detection below. CDK will roll back on any error.
- Outputs also differ from `terraform`. They are written to `cdk.out.json`, but this file is local and overwritten on every deploy, so `./get-outputs.ts` is provided to retrieve outputs from AWS without re-deploying environments.
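As a rough illustration, `./get-outputs.ts` boils down to something like this (the stack name format matches the `describe-stacks` command later in this README; check the script itself for the real details):

```typescript
import { CloudFormation } from "aws-sdk";

// Fetch a deployed stack's outputs directly from AWS instead of relying on
// the local, last-deploy-only cdk.out.json.
async function getOutputs(stackName: string): Promise<Record<string, string>> {
  const cfn = new CloudFormation();
  const { Stacks } = await cfn.describeStacks({ StackName: stackName }).promise();
  const outputs: Record<string, string> = {};
  for (const output of Stacks?.[0].Outputs ?? []) {
    if (output.OutputKey && output.OutputValue) {
      outputs[output.OutputKey] = output.OutputValue;
    }
  }
  return outputs;
}

getOutputs("sqs-webhook-buffer-dev").then(console.log).catch(console.error);
```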
View the CDK code in `bin/buffered-webhook.ts` and `lib/buffered-webhook-stack.ts`.
The faux backend is a special backend that you can optionally deploy, used as part of the testing rig. If you use this backend you can run a complete set of test cases, including failure cases, against deployed code. You can also use it during development if the API you wish to buffer doesn't have a test environment.
Assuming you have an AWS profile set up locally called `my-profile`, run the following to configure your local env:

```sh
cdk bootstrap --profile my-profile
```
Configure your environment in `cdk.json`:
```json
...
"<env>": {
  "awsAccount": "1234566788",
  "awsRegion": "us-east-1",
  "environment": "<env>",
  "service": "sqs-webhook-buffer",
  "forwarderArn": "arn:aws:lambda:us-east-1:11111111111:function:serverlessrepo-Datadog-Log-For-loglambdaddfunction-xxxxxxxx",
  "subdomain": "sqs",
  "domain": "mydomain.com",
  "acmCertificateArn": "arn:aws:acm:us-east-1:111111111:certificate/xxxxx-xxxx-xxxx-xxxxx-xxxxxxxxxxxxxx",
  "sentryDsn": "https://xxxxxxxxxxxxxxxxxxx@xxxxxxx.ingest.sentry.io/xxxxxxxx"
},
...
```
Then deploy the code with:
```sh
npm run deploy --env=<env>
# NB: the = sign is required, a space doesn't work
```
This will run post-deployment tests.
Check what will be modified by a deployment:
```sh
npm run diff --env=<env>
```
Examine outputs for an environment with:
```sh
aws cloudformation describe-stacks --stack-name sqs-webhook-buffer-<env> --query "Stacks[0].Outputs"
```
Tear it all down with:
```sh
npm run destroy --env=<env> # NB: this leaves the DynamoDB table behind - delete it manually
```
Run happy path tests against a deployed buffer:
```sh
npm run test --env=dev
# See the Testing section below.
```
Note you may have to deploy a stack before testing it so that the outputs are available in `./cdk.out.json`. If the outputs are missing you will see the error message:

```
Your test suite must contain at least one test.
```
Run a more complete test set including failure cases; this requires the faux backend to be deployed as the backend:
```sh
npm run test-faux-backend --env=dev
```
For a more detailed look at the commands check the scripts section of `package.json`.
- `npm run build` - compile TypeScript to JS
- `npm run watch` - watch for changes and compile
- `npm run test` - perform the Jest unit tests
- `npm run deploy --env=<env>` - deploy this stack to your default AWS account/region and run tests
- `npm run deploy-notest --env=<env>` - same as deploy but without tests
- `npm run diff --env=<env>` - compare deployed stack with current state for the env
- `npm run synth --env=<env>` - emit the synthesized CloudFormation template
The `cdk.json` file tells CDK how to execute the app and contains environment-specific configuration that is accessible through the `Config` interface.
Steps for adding new parameters to the configuration (a sketch follows this list):

- Add the parameter to each env in `cdk.json`
- Declare the parameter in the `Config` interface in `lib/config.ts`
- Assign the parameter in `bin/buffered-webhook.ts`
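A sketch of what that looks like - the fields mirror the `cdk.json` example above, while the context lookup in `bin/buffered-webhook.ts` is an assumption about the wiring:

```typescript
import * as cdk from "@aws-cdk/core";

// lib/config.ts (sketch): one field per key in a cdk.json env block.
export interface Config {
  awsAccount: string;
  awsRegion: string;
  environment: string;
  service: string;
  forwarderArn: string;
  subdomain: string;
  domain: string;
  acmCertificateArn: string;
  sentryDsn: string;
  // myNewParam: string; // <- declare any new parameter here
}

// bin/buffered-webhook.ts (sketch): read the per-env block out of cdk.json.
const app = new cdk.App();
const envName: string = app.node.tryGetContext("env");
const config: Config = app.node.tryGetContext(envName); // assign new params here
```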
Find the code in `lib/buffered-webhook-stack.ts` and `bin/buffered-webhook.ts`.
- CDK has state not by directory but by `Stack`. In particular, if you have a multi-stack CDK project and remove one of the stacks from the code, this will not delete the deployed stack. Within a stack, behaviour appears to be similar to `tf`, however:
  - Possibly, if objects go out of scope and get garbage collected, the resources they create do too? My faux backend lambda wasn't getting created at one point and I'm not sure what fixed it.
- Log groups and databases don't get deleted when you destroy a stack. An example of a terraform version of a similar setup can be found here: https://github.com/blakegreendev/cdktf-typescript-aws-webservice/blob/master/main.ts
- How are ACKs being handled? Many event delivery systems ACK individual messages - I haven't seen that in SQS yet. Ok, I just found it here: https://lumigo.io/blog/sqs-and-lambda-the-missing-guide-on-failure-modes/ - the polling backend does the ACKs invisibly for us, meaning it's all or nothing. Thus it is very important we keep the batch size at 1.
- Lambdas triggered by `SQSEventSource`s can by default take in up to 10 events at a time. As we execute each event in series using `await` (to preserve ordering), this means we can hit the 3000ms lambda timeout when the queue is backed up and all 10 event slots are filled. We have limited the number of events per invocation to 1 to prevent this (see the sketch after this list).
- Sentry isn't catching top-level unhandled exceptions. Lambda adds its own hooks and there's some complexity around getting this working without breaking other things (you need to chain the hooks, etc.).
- We are using `messageId` as the key when storing in DynamoDB. This means if a message can't be delivered it will get retried repeatedly by the subscribe lambda, with each new attempt overwriting the previous attempt in the DB. This has the nice effect that the eventual successful run might be the only one left in the DB. It also means only the latest attempt's data is stored for forensics.
- Deploy vs testing: possibly the tests run immediately after the deploy while requests are still being routed to an old instance of the lambda.
- `logger.ts` is duplicated into the `subscribe` and `publish` lambdas; remember to update both.
- Drift. CloudFormation doesn't detect and fix drift like tf does: aws/aws-cdk#1723. Remediation: https://aws.amazon.com/blogs/mt/implement-automatic-drift-remediation-for-aws-cloudformation-using-amazon-cloudwatch-and-aws-lambda/
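A sketch of the batch-size limit mentioned above, in CDK v1 terms (the construct names are illustrative):

```typescript
import * as lambda from "@aws-cdk/aws-lambda";
import * as sqs from "@aws-cdk/aws-sqs";
import { SqsEventSource } from "@aws-cdk/aws-lambda-event-sources";

declare const subscribeLambda: lambda.Function; // defined elsewhere in the stack
declare const bufferQueue: sqs.Queue;

// One message per invocation: SQS ACKs are all-or-nothing per batch, and a
// serial `await` over 10 messages can blow the lambda timeout when backed up.
subscribeLambda.addEventSource(
  new SqsEventSource(bufferQueue, { batchSize: 1 })
);
```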
We don't use a dead letter queue, because: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-dead-letter-queues.html#sqs-dead-letter-queues-when-to-use
Note that any errors (`4xx`/`5xx`) will result in an exception being thrown and the event being left on the SQS queue.
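Sketched out, the subscribe side's contract with SQS looks something like this (the HTTP client and function name are assumptions):

```typescript
import fetch from "node-fetch"; // assumed HTTP client

// Deliver one buffered event to the backend. Throwing on a 4xx/5xx fails the
// (single-message) batch, so SQS keeps the event and redelivers it once the
// visibility timeout expires.
export async function forwardToBackend(url: string, body: string): Promise<void> {
  const response = await fetch(url, { method: "POST", body });
  if (!response.ok) {
    throw new Error(`Backend returned ${response.status}; leaving event on queue`);
  }
}
```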
We currently just test the happy path code as the error path tests would spam sentry on every deploy.
- Tests always use the same order_id (SQS group ID). This means if you disable `consume_bad_messages`, the failure-path messages will block future testing runs from passing.
Improvements: we could improve the test rig by passing in a tag that tells the code not to generate sentry errors for a given message. We could also pass, in the body, the return code we want the fauxBackend to return for us, and add a flag to the test run to exercise the failure paths as well.
- If I destroy the stack and re-create it, what happens to the DynamoDB database?
- The original database will be left untouched, but the new stack will have a new database (see the sketch after this list).
- A very interesting deep dive into lambda internals: https://www.denialof.services/lambda/ (python)
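That retention behaviour matches a table created with a RETAIN removal policy, which is also CDK's default for DynamoDB - a sketch, with illustrative names:

```typescript
import * as cdk from "@aws-cdk/core";
import * as dynamodb from "@aws-cdk/aws-dynamodb";

declare const stack: cdk.Stack;

// A retained table survives `cdk destroy`; re-creating the stack provisions a
// fresh table under a new generated name and the old one is left orphaned.
new dynamodb.Table(stack, "DeliveryTable", {
  partitionKey: { name: "messageId", type: dynamodb.AttributeType.STRING },
  removalPolicy: cdk.RemovalPolicy.RETAIN, // explicit here, but also the default
});
```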
- Errors/warning counts in the logs for both lambdas
- SQS Queue length
- Max age of events in the SQS queue
- Lambda execution times
- Idea: the buffer could add a header (or some other marker) on calls to the backend; we could then isolate HTTP status codes from the backend and alert on errors there
- Sentry emails
- 500's at the API Gateway endpoint
- Errors in the logs for both lambdas
- Failed lambda runs
- Max age of events in the SQS queue
- Lambda durations approaching 10s - they will time out there.
- Instrumenting Node.js Applications (https://docs.datadoghq.com/serverless/installation/nodejs/?tab=awscdk) - pass a list of lambdas to an instantiation of `Datadog` in your stack. Note you can set `DD_CONSTRUCT_DEBUG_LOGS=true` in the env to debug. This is `datadog-cdk-constructs`.
- Datadog Serverless Macro (https://docs.datadoghq.com/serverless/serverless_integrations/macro/) - augment the constructor of your stack to have DD autodiscover and instrument everything it finds in the stack (see the sketch after this list)
- Datadog Serverless Plugin - this is the currently used implementation - https://docs.datadoghq.com/serverless/serverless_integrations/plugin/ - lower level: instrumenting from within your node code
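For reference, the macro route from option 2 amounts to adding the transform to the stack - a sketch:

```typescript
import * as cdk from "@aws-cdk/core";

class BufferedWebhookStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);
    // Opts every lambda in the stack into Datadog instrumentation, provided
    // the datadog-serverless-macro stack is installed in the account.
    this.addTransform("DatadogServerless");
    // ... lambdas, queue, API Gateway, etc.
  }
}
```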
See the lambdas here: https://app.datadoghq.com/functions (navigate there with Infrastructure -> Serverless). Clicking on a specific lambda and then on an individual invocation gives a much more readable slice of the logs than reading the entire log stream for the lambda; it's also decorated with insights added by DataDog.
Also see DD's high level view here: https://www.datadoghq.com/blog/monitoring-aws-lambda-with-datadog/
To be added:
Alerting:
The metrics are enabled within DD - install the lambda integration and, with the correct IAM perms, they will get picked up automatically.
For logs, as usual there are multiple competing / overlapping DataDog manual pages.
The Datadog serverless macro stack has been installed using the instructions here: https://docs.datadoghq.com/serverless/serverless_integrations/macro/
```sh
$ aws cloudformation create-stack \
    --stack-name datadog-serverless-macro \
    --template-url https://datadog-cloudformation-template.s3.amazonaws.com/aws/serverless-macro/latest.yml \
    --capabilities CAPABILITY_AUTO_EXPAND CAPABILITY_IAM --region=us-east-1
{
    "StackId": "arn:aws:cloudformation:us-east-1:276428873250:stack/datadog-serverless-macro/923376b0-a1de-11eb-97ee-127c6c2d7f71"
}
```
We tried to use the macro path to decorate the lambdas but it fails with the following error:
```
❌ sqs-webhook-buffer-dev failed: Error: Failed to create ChangeSet cdk-deploy-change-set on sqs-webhook-buffer-dev: FAILED, Transform 276428873250::DatadogServerless failed with: Could not execute the lambda function. Make sure you have given CloudWatch Logs permission to execute your function.
    at /usr/local/Cellar/aws-cdk/1.96.0/libexec/lib/node_modules/aws-cdk/lib/api/util/cloudformation.ts:227:11
    at processTicksAndRejections (internal/process/task_queues.js:97:5)
    at waitFor (/usr/local/Cellar/aws-cdk/1.96.0/libexec/lib/node_modules/aws-cdk/lib/api/util/cloudformation.ts:188:20)
    at Object.waitForChangeSet (/usr/local/Cellar/aws-cdk/1.96.0/libexec/lib/node_modules/aws-cdk/lib/api/util/cloudformation.ts:213:15)
    at Object.deployStack (/usr/local/Cellar/aws-cdk/1.96.0/libexec/lib/node_modules/aws-cdk/lib/api/deploy-stack.ts:263:32)
    at CdkToolkit.deploy (/usr/local/Cellar/aws-cdk/1.96.0/libexec/lib/node_modules/aws-cdk/lib/cdk-toolkit.ts:180:24)
    at initCommandLine (/usr/local/Cellar/aws-cdk/1.96.0/libexec/lib/node_modules/aws-cdk/bin/cdk.ts:208:9)
Failed to create ChangeSet cdk-deploy-change-set on sqs-webhook-buffer-dev: FAILED, Transform 276428873250::DatadogServerless failed with: Could not execute the lambda function. Make sure you have given CloudWatch Logs permission to execute your function.
```
The link here says to check that the region of the stack and the forwarder ARN match, but that doesn't seem to be the issue: https://docs.datadoghq.com/serverless/serverless_integrations/plugin/
The `Function Name` facet is a built-in one that shows the lambda name.
https://github.com/blakegreendev/cdktf-typescript-aws-webservice/blob/master/main.ts is an example of a lambda with DynamoDB
This is an example CDK stack to deploy The Scalable Webhook stack described by Jeremy Daly here - https://www.jeremydaly.com/serverless-microservice-patterns-for-aws/#scalablewebhook
An advanced version of this pattern was talked about by Heitor Lessa at re:Invent 2019 as Call me, “Maybe” (Webhook)
If you want a walkthrough of the theory, the code and finally a demo of the deployed implementation check out:
You would use this pattern when you have a non-serverless resource, like an RDS DB, in direct contact with a serverless resource, like a lambda. You need to make sure that your serverless resource doesn't scale up to the point where it DoS-attacks your non-serverless resource.
This is done by putting a queue between them and having a lambda with a throttled concurrency policy pull items off the queue and communicate with your non-serverless resource at a rate it can handle.
NOTE: For this pattern in the cdk deployable construct I have swapped RDS for DynamoDB.
Why? Because it is significantly cheaper/faster for developers to deploy and maintain, I also don't think we lose the essence of the pattern with this swap given we still do the pub/sub deduplication via SQS/Lambda and throttle the subscription lambda. RDS also introduces extra complexity in that it needs to be deployed in a VPC. I am slightly worried developers would get distracted by the extra RDS logic when the main point is the pattern. A real life implementation of this pattern could use RDS MySQL or it could be a call to an on-prem mainframe, the main purpose of the pattern is the throttling to not overload the scale-limited resource.
When people move to the cloud (especially serverless) they tend to think that this means their applications are now infinitely scalable:
For all the right reasons, this just isn't true. If any one person's resources were infinitely scalable, then any one person could consume the whole of AWS, no matter how scalable the platform.
If we weren't using DynamoDB, we would need to know the max connections limit configured for our instance size:
We need to slow down the amount of direct requests to our DB somehow, that is where the scalable webhook comes in:
We can use SQS to hold all requests in a queue as soon as they come in. Again, SQS will have limits:
120,000 in-flight messages with an unlimited backlog will, I think, be an effective enough buffer.
Now we have our messages in a queue but we need to subscribe to the queue and insert the records into the DB. To do this we create a throttled lambda where we set the max number of concurrent executions to whatever scale we are happy with. This should be less than the max connections on our DB and should take into account any other Lambdas running in this account.
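In CDK terms that throttling is just reserved concurrency on the subscribing lambda - a sketch (the runtime, asset path, and the limit of 2 are illustrative):

```typescript
import * as cdk from "@aws-cdk/core";
import * as lambda from "@aws-cdk/aws-lambda";

declare const stack: cdk.Stack;

// Cap concurrency well below the downstream resource's connection limit so a
// flood of queued messages can never overwhelm it.
new lambda.Function(stack, "SubscribeLambda", {
  runtime: lambda.Runtime.NODEJS_12_X,
  handler: "subscribe.handler",
  code: lambda.Code.fromAsset("lambdas/subscribe"),
  reservedConcurrentExecutions: 2,
});
```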
One final improvement that we could make if implementing this in a production system is to delete the Lambda between the API Gateway and SQS. You can do a direct integration which will reduce costs and latency:
If you want an AWS managed service to try and help with this scalability problem you can check out AWS RDS Proxy which is in preview