-
Notifications
You must be signed in to change notification settings - Fork 406
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Idempotency and Lambda timeouts #1038
Comments
@cakepietoast didn't we have that covered in the initial implementation? Unsure if it's a regression or if we accidentally missed that |
@heitorlessa could this be referring to cases where the lambda execution times out before updating the status from INPROGRESS to COMPLETED? One recommendation would be to increase the lambda timeout to be much longer (even if the api gw requests timeout within 25 seconds, the lambda can run for much longer.) Otherwise, the idempotent handler could check the remaining time and clean up before exiting? |
Yep that’s the case. Because Lambda doesn’t offer a timeout hook besides
extensions it’s a tricky endeavour — I had a private event signal utility
for Powertools so I could send traces before it timed out, and it had side
effects due to varying hardware.
Increasing time out is surely one. But if it fails regardless it’ll take
another hour for that specific request to succeed - OR all requests if
Idempotency token is not configured correctly.
@cakepietoast and I were chatting briefly about it. Maybe allowing a
configurable expiration time for INPROGRESS is part of the solution - docs
must accompany to provide guidance on dealing with timeouts and how to make
it more amenable.
Now that we have @idempotent_function, it’s easier to have more control
over these failures too as Lambda Handler can wrap it with a try/except —
still not ideal.
We need to think this through carefully
…On Thu, 7 Oct 2021 at 07:29, Michael Brewer ***@***.***> wrote:
@heitorlessa <https://github.com/heitorlessa> could this be referring to
cases where the lambda execution times out before updating the status from
INPROGRESS to COMPLETED?
One recommendation would be to increase the lambda timeout to be much
longer (even if the api gw requests timeout within 25 seconds, the lambda
can run for much longer.)
Otherwise, the idempotent handler could check the remaining time and clean
up before exiting?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<https://github.com/awslabs/aws-lambda-powertools-python/issues/738#issuecomment-937462731>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAZPQBCLUV4FEO45256SA5LUFUVZZANCNFSM5FMDIRCA>
.
|
Yes, a TTL on |
When the Lambda fails the first retry occurs 1 minute later, and the second retry 2 minutes after that. |
It depends on the event source/invocation type. But what I meant by shorter INPROGRESS was both TTL and timestamp condition, since TTL - as you pointed out - wouldn't be sufficient and reliable |
The issue here is that there's an assumption being made that if Lambda times out, it is safe to retry. Consider an example where a Lambda function starts executing, then code runs that causes one side effects (like charging a payment), then the function times out before completion. If we delete the INPROGRESS record and allow a retry, the payment will be charged again. Put another way, there is no way for the idempotency utility to know whether a retry can safely be made or not - that is domain/implementation specific. We can't change the default behaviour there as customers are likely relying on it.
As Heitor mentions, this isn't a problem - even if the item still persists in DynamoDB, the idempotency utility checks the expiry timestamp to see if the record has expired or not. Taking the above into account, the only solution I can think of is to add an option to allow for setting a shorter timeout on INPROGRESS records so they can be expired on a different schedule to successful requests. We can also add a section to clarify the behavior (what happens when a Lambda function times out?). |
"The issue here is that there's an assumption being made that if Lambda times out, it is safe to retry." - that's the current default AWS behaviour, which is changed when powertools are used.
"We can't change the default behaviour there as customers are likely relying on it." - agreed, but currently you are charging the AWS default.
"Taking the above into account, the only solution I can think of is to add an option to allow for setting a shorter timeout on INPROGRESS records so they can be expired on a different schedule to successful requests." - setting it to the lambda's actual timeout would be about the same as what I suggested. Anything else will potentially interfere with the default retries.
…________________________________
From: Tom McCarthy ***@***.***>
Sent: 13 October 2021 14:14
To: awslabs/aws-lambda-powertools-python ***@***.***>
Cc: n2N8Z ***@***.***>; Author ***@***.***>
Subject: Re: [awslabs/aws-lambda-powertools-python] Idempotency and Lambda timeouts (#738)
When the Lambda fails the first retry occurs 1 minute later, and the second retry 2 minutes after that. So if the Lambda has failed due to a timeout, then the idempotency item needs to be cleaned within 1 minute otherwise the retry will fail.
To take your example - just because the Lambda has timed out doesn't mean there were no side effects. The idempotency utility is designed to safely wrap code that should not be run more than once.
The issue here is that there's an assumption being made that if Lambda times out, it is safe to retry. Consider an example where a Lambda function starts executing, then code runs that causes one side effects (like charging a payment), then the function times out before completion. If we delete the INPROGRESS record and allow a retry, the payment will be charged again. Put another way, there is no way for the idempotency utility to know whether a retry can safely be made or not - that is domain/implementation specific. We can't change the default behaviour there as customers are likely relying on it.
Using TTL to try to have an item reliably deleted in under 1 minute sounds like it will not be reliable - "TTL typically deletes expired items within 48 hours of expiration.<https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/howitworks-ttl.html>" If the timeout timestamp is in the idempotency item then it can reliably be checked on the next run.
As Heitor mentions, this isn't a problem - even if the item still persists in DynamoDB, the idempotency utility checks the expiry timestamp to see if the record has expired or not.
Taking the above into account, the only solution I can think of is to add an option to allow for setting a shorter timeout on INPROGRESS records so they can be expired on a different schedule to successful requests. We can also add a section to clarify the behavior (what happens when a Lambda function times out?).
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub<https://github.com/awslabs/aws-lambda-powertools-python/issues/738#issuecomment-942240247>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AE4J5ADJ3GFQ2FBEXE5C6KLUGVZYVANCNFSM5FMDIRCA>.
Triage notifications on the go with GitHub Mobile for iOS<https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675> or Android<https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub>.
|
One small clarification even though I don't think its the most relevant piece here: Lambda doesn't really have a default for retries - it depends on how it is invoked. For example in the case of a synchronous execution, there are no retries, and the client needs to decide how to handle them.
I don't see it that way. We're adding a utility which makes it simple(r) to implement business logic which causes retries fail if we can't guarantee the code didn't already run. You could also think of the utility as protecting functionality from retries - regardless of where they come from.
"Interfering with retries" is in many ways exactly what the idempotency utility is supposed to do. The utility wraps a piece of your code that you consider unsafe to execute more than once (with the same arguments, within a certain time period). After a Lambda function times out, it may have been executed completely, partially, or not at all. If we execute the function again without knowing more than that, its no longer idempotent: "The property of idempotency means that an operation does not cause additional side effects if it is called more than once with the same input parameters." If you want retries to re-trigger your function by default after a timeout - even if you aren't sure how much of your code has already run - my gut feeling is that you probably shouldn't be using the idempotency utility. That said, I'm always keen to learn about new use cases we might not have thought of, so if you're able to share more details about what you want to achieve I'd be happy to dive deeper. If you prefer Slack please drop by the #lambda-powertools channel on the AWS Developers slack (details are in this project README). |
True, there are no retries for synchronous execution, but for event based execution there are default retries that occur as I've mentioned previously.
The utility is treating explicit errors from the lambda differently to timeout errors, whereas AWS performs retries regardless of whether it was an application error or a timeout.
I guess that's largely true, but I want the Idempotency utility to only stop reprocessing of successfully processed events. If the lambda fails or timesout I want the retries to occur. |
Understood. I'll try to get a PR in this week to add a new, independently configurable "inprogress_expires_after_seconds" arg, which you could then set to the same value as your Lambda timeout. Something like this: persistence_store = DynamoDBPersistenceLayer(
table_name="SomeDynamoDBTable",
expires_after_seconds=3600, # successful idempotent records will expire after 60 minutes
inprogress_expires_after_seconds=30 # in progress idempotent records will expire after 30 seconds
)
@idempotent(persistence_store=persistence_store)
def handler(event, context):
... Can you confirm that would solve for your use case? |
In principle yes, but the details matter: |
@n2N8Z and everyone - We have recently found a better way to solve this using an internal no-op extension. POC below - This allows us to receive a https://gist.github.com/heitorlessa/4aad06c39a1d520ff8c42adc72b0bcd5 |
Sorry, kept meaning to get back to this and forgetting. Why is this proposal better than adding an additional field to the persistence record e.g. function_timeout and storing in it the timestamp when the lambda will timeout: now().timestamp() + int(context.get_remaining_time_in_millis() / 1000), and expand the DDB condition with: OR function_timeout < :now ? |
Yes, it is the Lambda scheduler who orchestrates sandboxes that are recycled when a time out happens. A SIGTERM is only sent if an internal or external extension is registered -- the former is what the POC above does using a no-op internal extension -- The lambda scheduler guarantees 500ms for this logic to run. SIGALRM, in your experience, is https://docs.aws.amazon.com/lambda/latest/dg/runtimes-extensions-api.html Hopefully that explains it. @cakepietoast had thoughts about both solutions not being good enough |
Fundamentally, I'm still of the mind that deleting INPROGRESS records violates idempotency. When "stuck" INPROGRESS records occur, it is because there's no way for the utility to know whether the Lambda failed before, during, or after the unit of work its being asked to perform. As such, in the rare cases where this occurs, you would need to evaluate/investigate to be able to make the correct decision on how to proceed. With that said, the discussion on this thread demonstrates that we can improve this utility to apply to a wider scope of problems, where strict idempotency guarantees are not required. Of the 2 possible solutions we have, I'm in favour of the initial suggestion to use a separate timeout counter for INPROGRESS records. It is simpler, requiring less work for users to implement, and less likely to cause issues similar to the one @n2N8Z mentions in the last comment. I haven't had time to work on this - and am not likely to have time for it in the near future. If anyone would like to submit a PR for it at a time when the repository is accepting feature requests, the team would be happy to review. |
Sounds like we have a winner implementation then - use a timeout counter property and a OR as originally suggested by @n2N8Z. Two clarifications tho:
Given this is borderline bug, I'd be happy to review and contribute to any PR. If no one has the bandwidth, we can revisit in August-onwards timeframe. Thanks a lot everyone |
Changes: - Initial draft on an option to clean up on function timeout close aws-powertools#1038
I am working on an initial draft PR based on what @n2N8Z has suggested #1198 Still untested and more to see the impact of the changes. This does not include fancy extension method @heitorlessa spoke about. If i have time to complete the test coverage and do some end to end testing, then maybe @cakepietoast or @heitorlessa can look at it further. Otherwise the docs will have to be updated with the list of caveats. :) |
Deployed and tested a version of my PR #1198 and it works template.yaml with powertools based on #1198 : AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Testing idempotency timeouts
Globals:
Function:
Timeout: 2
Resources:
IdempotencyTable:
Type: AWS::DynamoDB::Table
Properties:
AttributeDefinitions:
- AttributeName: id
AttributeType: S
KeySchema:
- AttributeName: id
KeyType: HASH
TimeToLiveSpecification:
AttributeName: expiration
Enabled: true
BillingMode: PAY_PER_REQUEST
IdempotencyFunction:
Type: AWS::Serverless::Function
Properties:
FunctionName: IdempotencyFunction
CodeUri: src/
Handler: app.handler
Runtime: python3.9
MemorySize: 512
Layers:
- !Ref PowertoolsLayer
Policies:
- DynamoDBCrudPolicy:
TableName: !Ref IdempotencyTable
Environment:
Variables:
LOG_LEVEL: INFO
POWERTOOLS_SERVICE_NAME: example
DYNAMODB_TABLE: !Ref IdempotencyTable
PowertoolsLayer:
Type: AWS::Serverless::LayerVersion
Properties:
LayerName: MyLambdaLayer
ContentUri: layer/
CompatibleRuntimes:
- python3.9 app.py handler with a sleep longer than the function timeout: import time
from os import environ
from uuid import uuid4
from aws_lambda_powertools.utilities.idempotency import (
DynamoDBPersistenceLayer,
IdempotencyConfig,
idempotent,
)
DYNAMODB_TABLE = environ["DYNAMODB_TABLE"]
config = IdempotencyConfig(function_timeout_clean_up=True)
persistence_layer = DynamoDBPersistenceLayer(table_name=DYNAMODB_TABLE)
@idempotent(persistence_store=persistence_layer, config=config)
def handler(event, context):
time.sleep(3)
uuid_value: str = str(uuid4())
return {"message": uuid_value} |
I have tested this approach on a separate repo and it all seems to be good so far. |
@n2N8Z - i have deployed and tested the PR which fixes this issue. If you would like to have a look to see if it fits what you where expecting. |
This is now released under 1.26.7 version! |
Is your feature request related to a problem? Please describe.
Currently if a lambda timesout the idempotency item is left in the 'INPROGRESS' status and subsequent lambda retries fail with IdempotencyAlreadyInProgressError.
Describe the solution you'd like
Add an additional field to the persistence record e.g. function_timeout and store in it the timestamp when the lambda will timeout: now().timestamp() + int(context.get_remaining_time_in_millis() / 1000)
Expand the DDB condition: OR function_timeout < :now
Describe alternatives you've considered
Create a signal handler in the lambda that raises an exception just before the lambda is about to timeout.
Additional context
N/A
The text was updated successfully, but these errors were encountered: