
Add retry to submit_event in trigger to avoid deadlock #27344

Closed

Conversation

NickYadance (Contributor)

related: #27000

dstandish (Contributor) commented Oct 28, 2022

@NickYadance did you try this fix? Did it resolve the issue? Do you think it's possible to write a test for it? I imagine you might simulate the scenario by managing two sessions. If so, you could write a test that reliably fails without the fix, then add the fix and leave the test.

The reason I ask is that I see the retry is within the scope of the session, and I'm not sure whether, if we get a deadlock error, the session will need to be rolled back.

And another option would be to put @tenacity.retry(stop=tenacity.stop_after_attempt(MAX_DB_TRIES)) as the outermost decorator; then each failed attempt would get rolled back and a new session created for the next attempt. Roughly what I mean is sketched below.
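
An untested sketch of that decorator ordering (MAX_DB_TRIES is a placeholder constant, the function body is elided, and it assumes provide_session rolls back and closes the session when the wrapped function raises):

import tenacity

from airflow.utils.session import provide_session

MAX_DB_TRIES = 3  # placeholder


@tenacity.retry(stop=tenacity.stop_after_attempt(MAX_DB_TRIES), reraise=True)
@provide_session
def submit_event(trigger_id, event, session=None):
    # tenacity sits outside provide_session, so a failed attempt unwinds
    # through provide_session (rollback + close) and the next attempt
    # starts with a brand-new session
    ...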

NickYadance (Contributor, Author)

> @NickYadance did you try this fix? Did it resolve the issue? Do you think it's possible to write a test for it? […]

No, actually. The solution is from here, as these are similar cases.

for attempt in run_with_db_retries():
    with attempt:
        session.query(TaskInstance).filter(
            TaskInstance.state != State.DEFERRED, TaskInstance.trigger_id.isnot(None)
        ).update({TaskInstance.trigger_id: None})

Working on reproducing this...
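
One way the two-session simulation suggested above might look (a rough sketch, not a working test: it assumes two task instance rows t1 and t2 already exist, and a backend with row locks such as Postgres or MySQL, since SQLite won't deadlock this way):

import threading

from airflow.models import TaskInstance
from airflow.settings import Session


def lock_in_order(first, second, barrier):
    session = Session()
    try:
        # take a row lock on the first task instance
        session.query(TaskInstance).filter(
            TaskInstance.task_id == first
        ).with_for_update().one()
        # wait until both sessions hold their first lock, then cross over
        barrier.wait()
        session.query(TaskInstance).filter(
            TaskInstance.task_id == second
        ).with_for_update().one()
        session.commit()
    except Exception as e:
        print(f"failure: {e}")
        session.rollback()


barrier = threading.Barrier(2)
a = threading.Thread(target=lock_in_order, args=("t1", "t2", barrier))
b = threading.Thread(target=lock_in_order, args=("t2", "t1", barrier))
a.start()
b.start()
a.join()
b.join()

One of the two threads should then hit the backend's deadlock detector, which is the error the retry would need to handle.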

dstandish (Contributor) commented Nov 10, 2022

> No, actually. The solution is from here, as these are similar cases. […] Working on reproducing this...

Here's an example that illustrates what I'm saying:

from airflow.models import Log
from airflow.settings import Session

session = Session()
for i in range(3):
    # the first two iterations use a dict, which can't be bound as a
    # parameter; the third uses a string, which can
    val = 'hi' if i == 2 else {}
    print(f"{val=}")
    try:
        session.add(Log(event=val))
        session.commit()
        break
    except Exception as e:
        # note: no rollback here, so the transaction stays in a failed state
        print(f'failure: {e}')

This session.add(Log(event=val)) line fails when val is a dict but succeeds when it is a string.

The first two times through the loop, it runs with the dict; the third time, with the string.

But you will see that it doesn't matter that the third try uses a good value; it will still fail because the transaction has not been rolled back.

The only difference is that the error is (sqlite3.InterfaceError) Error binding parameter 4 - probably unsupported type instead of a deadlock error.

Now... is that a difference that makes a difference? I am not sure, but I think it's likely. If that's true, this won't work, and you'll have to do the retry around the whole transaction instead of within it (which, thankfully, isn't that big of a change).
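
For example, the same loop with a rollback after each failure lets the third attempt go through (a sketch of the retry-around-the-transaction idea, untested):

from airflow.models import Log
from airflow.settings import Session

session = Session()
for i in range(3):
    val = 'hi' if i == 2 else {}
    try:
        session.add(Log(event=val))
        session.commit()
        break
    except Exception as e:
        print(f'failure: {e}')
        # reset the failed transaction so the next attempt starts clean
        session.rollback()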

dstandish (Contributor)

@NickYadance did you give up on this one?

NickYadance (Contributor, Author) commented Nov 25, 2022

Nope. I just rebased my source repo because I'm testing another issue.
