`awslambdaric` prevents graceful lambda shutdown

# Overview

I've followed the guide described in [graceful-shutdown-with-aws-lambda](https://github.com/aws-samples/graceful-shutdown-with-aws-lambda). Unfortunately, my signal handler is never invoked.

I spent time investigating and it looks like the native code in `awslambdaric` is not bubbling up the signal preventing the signal handler registered from ever running. I don't have experience with writing native code python extensions, so I don't know what needs to change in `runtime_client.cpp` so that this works.

The graceful shutdown guide actually has a note at the end of the readme which says:

    Please be aware that this feature does not work on Python 3.8 and 3.9 runtimes.

I think this is exactly because of the issue I am describing here (and mentioned here https://github.com/aws-samples/graceful-shutdown-with-aws-lambda/issues/2).

# Steps to reproduce.

I took a simple lambda function `handler.py`:

```py
import signal
from types import FrameType

def shutdown(_signum: int, _frame: FrameType | None) -> None:
    print("Shutdown hook invoked")


signal.signal(signal.SIGTERM, shutdown)


def lambda_handler(event, context):
    print(f"Handled event {event}")

    return "SUCCESS"
```

and created a lambda in `eu-central-1` with this as a zip, adding the layer `arn:aws:lambda:eu-central-1:580247275435:layer:LambdaInsightsExtension:38` and invoked it once. I waited until the instance was spun down (10-15 minutes), but there was no log from the shutdown hook.

I also deployed it as a docker image with this Dockerfile:

```Dockerfile
FROM public.ecr.aws/lambda/python:3.10.2023.07.13.16

# download manually before building the image:
# curl -f -s -o lambda-insights-amd64-38.zip "$(aws lambda get-layer-version-by-arn --arn "arn:aws:lambda:eu-central-1:580247275435:layer:LambdaInsightsExtension:38" --query 'Content.Location' --output text)"

COPY lambda-insights-amd64-38.zip /tmp
# hadolint ignore=DL3033
RUN yum install -y unzip && unzip -q -d /opt /tmp/lambda-insights-amd64-38.zip && rm /tmp/*.zip && \
    yum history undo -y last && yum clean all && rm -rf /var/cache/yum

COPY handler.py "$LAMBDA_TASK_ROOT/handler.py"

CMD ["handler.lambda_handler"]
```

I pushed this into ECR and then created a container-based lambda with this image again in `eu-central-1` and invoked it once. Again, I waited until the instance was spun down (10-15 minutes), but there was no log from the shutdown hook.

To prove that SIGTERM was being sent to the runtime, I altered the previous image to just be a simple shell script and updated the container-based lambda from the previous step:

`startup.sh`

```sh
#!/usr/bin/env bash

set -e
set -o pipefail

printf 'Starting lambda handler. Running as pid %s\n' "$BASHPID"

shutdown() {
  printf 'Shutting down gracefully\n'
  exit
}
trap 'shutdown' SIGTERM

while true ; do
  HEADERS="$(mktemp)"
  BODY="$(mktemp)"
  curl -f -sS -LD "$HEADERS" -X GET "http://$AWS_LAMBDA_RUNTIME_API/2018-06-01/runtime/invocation/next" -o "$BODY"
  REQUEST_ID="$(grep -Fi Lambda-Runtime-Aws-Request-Id "$HEADERS" | tr -d '[:space:]' | cut -d: -f2)"
  rm -rf "$HEADERS"

  printf 'Processing request %s\n' "$REQUEST_ID"

  curl -f -sS -X POST "http://$AWS_LAMBDA_RUNTIME_API/2018-06-01/runtime/invocation/$REQUEST_ID/response" -d '"SUCCESS"'
  rm -rf "$BODY"
done
```

with this `Dockerfile`:

```Dockerfile
FROM public.ecr.aws/lambda/python:3.10.2023.07.13.16

# download manually before building the image:
# curl -f -s -o lambda-insights-amd64-38.zip "$(aws lambda get-layer-version-by-arn --arn "arn:aws:lambda:eu-central-1:580247275435:layer:LambdaInsightsExtension:38" --query 'Content.Location' --output text)"

COPY lambda-insights-amd64-38.zip /tmp
# hadolint ignore=DL3033
RUN yum install -y unzip && unzip -q -d /opt /tmp/lambda-insights-amd64-38.zip && rm /tmp/*.zip && \
    yum history undo -y last && yum clean all && rm -rf /var/cache/yum

COPY startup.sh "$LAMBDA_TASK_ROOT/startup.sh"

ENTRYPOINT ["/var/runtime/startup.sh"]
```

I invoked the lambda and got the `"SUCCESS"` response. I waited for it to spin down and checked the cloudwatch logs again:

```
LOGS	Name: cloudwatch_lambda_agent	State: Subscribed	Types: [Platform]
Starting lambda handler. Running as pid 12
Name: cloudwatch_lambda_agent	State: Ready	Events: [INVOKE, SHUTDOWN]
START RequestId: 1127e4c8-97e8-4ef9-9c52-e24673eb40a7 Version: $LATEST
Processing request 1127e4c8-97e8-4ef9-9c52-e24673eb40a7
{""status"":""OK""}
END RequestId: 1127e4c8-97e8-4ef9-9c52-e24673eb40a7
REPORT RequestId: 1127e4c8-97e8-4ef9-9c52-e24673eb40a7	Duration: 304.42 ms	Billed Duration: 590 ms	Memory Size: 128 MB	Max Memory Used: 35 MB	Init Duration: 285.31 ms	
Terminated
Shutting down gracefully
```

Success, so I knew that SIGTERM was being sent. After this, I suspected that it had to be somewhere in the runtime handling for python. While reading the source code for `awslambdaric` I realised part of it was written in cpp and I wondered if this might be the issue.

To test whether this was the problem, I replaced the `runtime_client` written in cpp, with one written in python.

`runtime_client_py.py`

```py
# Reimplementation of runtime_client.cpp in python

import http
import http.client
import os

lambda_runtime_address = os.environ["AWS_LAMBDA_RUNTIME_API"]
user_agent = "awslambdaric"


class RuntimeClient:
    @staticmethod
    def initialize_client(user_agent: str) -> None:
        user_agent = user_agent

    @staticmethod
    def next() -> tuple[bytes, dict]:
        endpoint = "/2018-06-01/runtime/invocation/next"
        runtime_connection = http.client.HTTPConnection(lambda_runtime_address)
        runtime_connection.connect()

        runtime_connection.request("GET", endpoint, headers={"User-Agent": user_agent})
        response = runtime_connection.getresponse()
        response_body = response.read()
        raw_headers = response.headers

        headers = {
            "Lambda-Runtime-Aws-Request-Id": raw_headers.get(
                "Lambda-Runtime-Aws-Request-Id"
            ),
            "Lambda-Runtime-Trace-Id": raw_headers.get("Lambda-Runtime-Trace-Id")
            or None,
            "Lambda-Runtime-Invoked-Function-Arn": raw_headers.get(
                "Lambda-Runtime-Invoked-Function-Arn"
            ),
            "Lambda-Runtime-Deadline-Ms": int(
                raw_headers.get("Lambda-Runtime-Deadline-Ms") or "0"
            ),
            "Lambda-Runtime-Client-Context": raw_headers.get(
                "Lambda-Runtime-Client-Context"
            )
            or None,
            "Lambda-Runtime-Cognito-Identity": raw_headers.get(
                "Lambda-Runtime-Cognito-Identity"
            )
            or None,
            "Content-Type": raw_headers.get("Content-Type") or None,
        }

        return response_body, headers

    @staticmethod
    def post_invocation_result(
        invoke_id: str, result_data: bytes, content_type: str
    ) -> None:
        from awslambdaric.lambda_runtime_client import LambdaRuntimeClientError

        endpoint = f"/2018-06-01/runtime/invocation/{invoke_id}/response"
        runtime_connection = http.client.HTTPConnection(lambda_runtime_address)
        runtime_connection.connect()

        runtime_connection.request(
            "POST",
            endpoint,
            body=result_data,
            headers={"User-Agent": user_agent, "Content-Type": content_type},
        )

        response = runtime_connection.getresponse()

        if response.status != http.HTTPStatus.ACCEPTED:
            raise LambdaRuntimeClientError(endpoint, response.status, None)

    @staticmethod
    def post_error(invoke_id: str, error_response_data: bytes, xray_fault: str) -> None:
        endpoint = f"/2018-06-01/runtime/invocation/{invoke_id}/error"
        runtime_connection = http.client.HTTPConnection(lambda_runtime_address)
        runtime_connection.connect()

        runtime_connection.request(
            "POST",
            endpoint,
            body=error_response_data,
            headers={
                "User-Agent": user_agent,
                "Content-Type": "application/json",
                "Lambda-Runtime-Function-Error-Type": "UnhandledContent",
                "Lambda-Runtime-Function-xray-error-cause": xray_fault,
            },
        )


runtime_client = RuntimeClient()
```

and I put this in an image with this `Dockerfile`:

```Dockerfile
FROM public.ecr.aws/lambda/python:3.10.2023.07.13.16

# download manually before building the image:
# curl -f -s -o lambda-insights-amd64-38.zip "$(aws lambda get-layer-version-by-arn --arn "arn:aws:lambda:eu-central-1:580247275435:layer:LambdaInsightsExtension:38" --query 'Content.Location' --output text)"

COPY lambda-insights-amd64-38.zip /tmp
# hadolint ignore=DL3033
RUN yum install -y unzip && unzip -q -d /opt /tmp/lambda-insights-amd64-38.zip && rm /tmp/*.zip && \
    yum history undo -y last && yum clean all && rm -rf /var/cache/yum

RUN rm "$LAMBDA_RUNTIME_DIR/runtime_client.cpython"*.so && sed -i 's/import runtime_client/from awslambdaric.runtime_client import runtime_client/' "$LAMBDA_RUNTIME_DIR/awslambdaric/lambda_runtime_client.py"
COPY runtime_client_py.py "$LAMBDA_RUNTIME_DIR/awslambdaric/runtime_client.py"
COPY handler.py "$LAMBDA_TASK_ROOT/handler.py"

CMD ["handler.lambda_handler"]
```

I updated the image used by the lambda again and invoked it and waited for it to spin back down. I checked the cloudwatch logs again:

```
LOGS	Name: cloudwatch_lambda_agent	State: Subscribed	Types: [Platform]
EXTENSION	Name: cloudwatch_lambda_agent	State: Ready	Events: [INVOKE, SHUTDOWN]
START RequestId: faab07eb-60f6-47c6-b5ba-0dced4097ec9 Version: $LATEST
Handled event {'test': 'value'}
END RequestId: faab07eb-60f6-47c6-b5ba-0dced4097ec9
REPORT RequestId: faab07eb-60f6-47c6-b5ba-0dced4097ec9	Duration: 48.67 ms	Billed Duration: 383 ms	Memory Size: 128 MB	Max Memory Used: 45 MB	Init Duration: 333.96 ms	
Shutdown hook invoked
```

Success! So indeed it looks like the native code needs some adjustment to handle the signal.

Unfortunately, I can't offer a pull request to fix this since I don't know enough about how this works. Some searching points out that perhaps the code needs to be using [`PyErr_CheckSignals`](https://docs.python.org/3/c-api/exceptions.html#c.PyErr_CheckSignals). `awslambdaric` is also using `aws-lambda-runtime` internally, so there might be some interaction there too that needs even more handling.

Anyway, I hope these reproduction steps can help someone with the skills find the correct solution.

Thanks.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

`awslambdaric` prevents graceful lambda shutdown #105

Overview

Steps to reproduce.

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

awslambdaric prevents graceful lambda shutdown #105

Description

Overview

Steps to reproduce.

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

`awslambdaric` prevents graceful lambda shutdown #105