Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add transaction on_commit before signals for alert group actions #3731

Merged
merged 20 commits into from
Jan 31, 2024

Conversation

mderynck
Copy link
Contributor

@mderynck mderynck commented Jan 23, 2024

What this PR does

Add transactions around log record creation and check transaction on_commit before sending signals passing DB id of alert group log records. In cases for delete we can then assume any missing IDs on tasks are from intentionally deleted alert groups and we can stop tasks from retrying endlessly.

Which issue(s) this PR fixes

Checklist

  • Unit, integration, and e2e (if applicable) tests updated
  • Documentation added (or pr:no public docs PR label added if not required)
  • CHANGELOG.md updated (or pr:no changelog PR label added if not required)

@joeyorlando
Copy link
Contributor

this may also close #3729

@mderynck mderynck added the pr:no public docs Added to a PR that does not require public documentation updates label Jan 25, 2024
@mderynck mderynck marked this pull request as ready for review January 25, 2024 02:34
@mderynck mderynck requested a review from a team January 25, 2024 02:34
Copy link
Contributor

@iskhakov iskhakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea, I see that it worked in the past, but do we have a transaction blocks in these places?

@@ -686,11 +687,7 @@ def acknowledge_by_source(self):
f"log record {log_record.pk} with type '{log_record.get_type_display()}', action source: alert"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

logging is incorrect

log_record=log_record.pk,
action_source=action_source,
)
transaction.on_commit(partial(send_alert_group_signal.delay, log_record.pk))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if it called inside transaction block

alert_group_action_triggered_signal.send(
sender=self.acknowledge_by_user,
log_record=log_record.pk,
action_source=action_source,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be sure, are you dropping action_source on purpose here (and below)?
Also checking, we are ok making this async now, right? (I guess so, and it makes sense; will take a look)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll double-check this, I think only slack uses action_source from the signal but force_sync was used instead (for the delete case). In all cases it should be able to be taken from the log record instead of it being passed on the signal, I wanted to avoid changing the signature on send_alert_group_signal/having to add another task with a different signature.

logger.warning(
f"AlertGroupTelegramRepresentative: log record {log_record_id} never created or has been deleted"
)
return
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

log_record=log_record_pk,
action_source=None,
force_sync=True,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, we trigger the action triggered signal and queue the cleanup task after that. Makes sense 👍

@@ -10,7 +11,7 @@
@shared_dedicated_queue_retry_task(
autoretry_for=(Exception,), retry_backoff=True, max_retries=1 if settings.DEBUG else None
)
def delete_alert_group(alert_group_pk, user_pk):
def delete_alert_group(alert_group_pk: int, user_pk: int) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe not for this PR, but wondering if this still needs to be a task?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can keep it for now, I guess something could go wrong in the stop escalation part of delete and we could retry here.

@mderynck
Copy link
Contributor Author

I like the idea, I see that it worked in the past, but do we have a transaction blocks in these places?

We do not, going to test this on dev to see if it improves the issue. It comes down to things should have autocommit but we keep finding cases where the objects are missing.

@mderynck
Copy link
Contributor Author

Added back raising exceptions. In most cases they will not retry and we will get a log message and task failure. In the delete cases only a log message is written, no reason to raise an exception if it was already deleted. There are a couple cases for slack and telegram where it is inside of a signal call, here will still retry to not change the behavior at the top level. If we want to not retry the signal cases we will need to add an intermediate task to isolate the one handler from others.

Copy link
Contributor

@matiasb matiasb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mderynck mderynck merged commit 2a466a0 into dev Jan 31, 2024
20 of 21 checks passed
@mderynck mderynck deleted the mderynck/oncommit-action-triggered-signal branch January 31, 2024 22:54
iskhakov pushed a commit that referenced this pull request Feb 20, 2024
# What this PR does
Add transactions around log record creation and check transaction
on_commit before sending signals passing DB id of alert group log
records. In cases for delete we can then assume any missing IDs on tasks
are from intentionally deleted alert groups and we can stop tasks from
retrying endlessly.

## Which issue(s) this PR fixes

## Checklist

- [x] Unit, integration, and e2e (if applicable) tests updated
- [x] Documentation added (or `pr:no public docs` PR label added if not
required)
- [x] `CHANGELOG.md` updated (or `pr:no changelog` PR label added if not
required)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr:no public docs Added to a PR that does not require public documentation updates
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants