Create a foundation for an "alert" system #1255

kiahna-tucker · 2023-10-23T18:56:27Z

Maintainer

Last Updated: Nov. 9, 2023

The inaugural alert that will be supported by the system has the following requirements:

Provide a means for a user to subscribe to any alert defined under a specific prefix (to which they have admin access). The user will only be presented with the option to receive alerts via email at this time, but Slack channels are a notification delivery method that will need to be supported in the future.
Provide a means for the user to select a period of time over which a task's data processing should be evaluated from a predefined set of options. Accepted time periods are: two hours, four hours, eight hours, 12 hours, 24 hours, and two days.
Send one alert email when the condition for the aforementioned alert is valid.
Send a confirmation email when the condition for the aforementioned alert is no longer valid.
Allow a user to enter an email that is not associated with an account in the system to receive alerts.
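
As a minimal sketch of the predefined evaluation windows listed above, assuming the selected option is validated before it is stored. The labels, `ACCEPTED_INTERVALS`, and `parse_evaluation_interval` are hypothetical names used for illustration only; they are not part of the proposal.

```python
from datetime import timedelta

# Hypothetical labels mapped to the accepted evaluation windows from the requirements.
ACCEPTED_INTERVALS = {
    "2h": timedelta(hours=2),
    "4h": timedelta(hours=4),
    "8h": timedelta(hours=8),
    "12h": timedelta(hours=12),
    "24h": timedelta(hours=24),
    "2d": timedelta(days=2),
}


def parse_evaluation_interval(label: str) -> timedelta:
    """Return the interval for a label, rejecting anything outside the predefined set."""
    try:
        return ACCEPTED_INTERVALS[label]
    except KeyError:
        raise ValueError(f"unsupported evaluation interval: {label!r}") from None
```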

Design

At a high level, the alert system breaks down into three core components: a table containing subscription-related information, a table containing the state relevant to a specific alert type, and a table containing alert activity records.

Alert Subscriptions Table

The alert_subscriptions table stores the set of active alert subscriptions. Each row corresponds to an individual user's subscription. Initially, a user will only be able to subscribe to alerts on a per-tenant basis. That said, the prefix scope may be narrowed to allow more fine-grained control over subscriptions in the future.

Column	Type	Description
`id`	`uuid`	The ID of an alert subscription for a given user.
`detail`	`text` or `null`
`created_at`	`timestamptz`	The time at which the alert subscription was created.
`updated_at`	`timestamptz`	The time at which the alert subscription was last updated.
`catalog_prefix`	`catalog_prefix`	The namespace for which alerts are eligible.
`email`	`text` or `null`	The email being subscribed to alerts within the associated namespace.

Alert Data Processing Table

The alert_data_processing table stores state relevant to the 'data_not_processed_in_interval' alert type. Each row of this table corresponds to the alert state for a specific task.

Column	Type	Description
`catalog_name`	`catalog_name`	The namespace to which the task belongs.
`evaluation_interval`	`interval`	The window of time that task activity should be reviewed.

There is a view associated with this table, alert_data_processing_firing, whose main purpose is to relay the set of alerts (of this type) that need to be sent.
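
A rough sketch of how such a view could behave, run against an in-memory SQLite database as a stand-in for Postgres. Several things here are assumptions made for illustration: the interval is stored as whole hours, `task_activity` is a hypothetical source of "last processed" timestamps, and a task with no recorded activity is treated as firing.

```python
import sqlite3
import time

# Assumptions: SQLite stands in for Postgres, the interval is stored as whole
# hours, and `task_activity` is a hypothetical source of "last processed" times.
SCHEMA = """
CREATE TABLE alert_data_processing (
    catalog_name TEXT PRIMARY KEY,
    evaluation_interval_hours INTEGER NOT NULL
);
CREATE TABLE task_activity (
    catalog_name TEXT PRIMARY KEY,
    last_processed_at INTEGER NOT NULL  -- Unix seconds
);
CREATE VIEW alert_data_processing_firing AS
SELECT
    'data_not_processed_in_interval' AS alert_type,
    adp.catalog_name AS catalog_name,
    adp.evaluation_interval_hours AS evaluation_interval_hours
FROM alert_data_processing AS adp
LEFT JOIN task_activity AS ta ON ta.catalog_name = adp.catalog_name
WHERE ta.last_processed_at IS NULL
   OR ta.last_processed_at
      < CAST(strftime('%s', 'now') AS INTEGER) - adp.evaluation_interval_hours * 3600;
"""


def build_demo_db() -> sqlite3.Connection:
    """Create the schema and seed three illustrative tasks."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    now = int(time.time())
    conn.executemany(
        "INSERT INTO alert_data_processing VALUES (?, ?)",
        [
            ("acmeCo/stale-capture", 2),
            ("acmeCo/healthy-capture", 2),
            ("acmeCo/never-ran", 4),
        ],
    )
    conn.executemany(
        "INSERT INTO task_activity VALUES (?, ?)",
        [
            ("acmeCo/stale-capture", now - 3 * 3600),    # outside its 2h window
            ("acmeCo/healthy-capture", now - 1 * 3600),  # inside its 2h window
        ],
    )
    return conn


def firing_alerts(conn: sqlite3.Connection):
    """Return (alert_type, catalog_name) for every alert the view reports."""
    return conn.execute(
        "SELECT alert_type, catalog_name "
        "FROM alert_data_processing_firing ORDER BY catalog_name"
    ).fetchall()
```

With the seed data above, only the stale task and the task without any activity should be returned; the healthy task drops out of the view without any extra bookkeeping.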

Alert History Table

The alert_history table provides historical context for all alert activity.

Column	Type	Description
`alert_type`	`text`	The name of the alert.
`catalog_name`	`catalog_name`	The namespace to which the alert applies.
`fired_at`	`timestamptz`	The time at which the alert was fired.
`resolved_at`	`timestamptz` or `null`	The time at which the alert was resolved.
`arguments`	`json`	The set of state-related arguments for this alert when it was fired.

Notes

There are two behavioral quirks of the implementation that are considered acceptable:

Disabling or deleting a task for which an alert is firing (i.e., there is a row in the alert_history table for the task where resolved_at is null) will result in a confirmation email being sent.
Unsetting the evaluation_interval of a task for which an alert is firing (i.e., there is a row in the alert_history table for the task where resolved_at is null but the row in the alert_data_processing table for the task has been deleted) will result in a confirmation email being sent.

Questions

Would we like to track errors encountered when attempting to send an alert? Or are we comfortable relying on the error handling capabilities built into the marketing (and transactional) email provider?
How long should data in the alert_history table be retained?

Additional Comments

Given the vulnerabilities identified with Resend late last week, searching for a new marketing (and transactional) email provider is advised for future notification work. Noting this here for record.
A proper CMS solution is a pending item that will be addressed in a future adaptation of this alert system. A proposal was made in the notification_templates section of this comment. It is my understanding that the team is comfortable employing different solutions for email alerts and in-app alerts.
In the event external email verification is addressed in the future, the current modeling does not allow for the UI to reliably surface verified, external emails as alert delivery method options.

jgraettinger · 2023-10-23T23:18:35Z

Maintainer

Thanks. This is a great write-up and exactly the conversation we need to be having. My thoughts below:

notification_subscriptions

using an email address that is not in the system should require some degree of email verification

Would you please unpack this more? Are we unable to send an email to custom-trigger-123@pagerduty.com without verifying it? Or do we risk reputation harm / issues if it bounces repeatedly over time or is marked as spam? Put differently, do we require automated processes in place from day one, or is there opportunity to Mechanical Turk it for a period of time?

column user_id in notification_subscriptions

Given that we do intend to allow for arbitrary emails, why not just store the desired email directly, and avoid the coupling to our auth.users table?

Then, we could have a trigger on BEFORE INSERT/UPDATE that checks the email is actually contained in auth.users, if required for now (or, depending on the previous question, we may be able to skip that). Rationale is that this gets us more-immediately to the data model we want to have, and the trigger can be removed without requiring a data migration.
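
A hedged sketch of that idea, using SQLite triggers from Python as a stand-in for Postgres. The `users` table stands in for auth.users, and the trigger names and helper functions are hypothetical. The point it illustrates is that the email lives directly on the subscription row, so dropping the triggers later needs no data migration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE users (email TEXT PRIMARY KEY);  -- stand-in for auth.users
CREATE TABLE notification_subscriptions (
    id INTEGER PRIMARY KEY,
    catalog_prefix TEXT NOT NULL,
    email TEXT NOT NULL
);
CREATE TRIGGER subscription_email_known_insert
BEFORE INSERT ON notification_subscriptions
FOR EACH ROW
BEGIN
    SELECT RAISE(ABORT, 'email is not a registered account')
    WHERE NOT EXISTS (SELECT 1 FROM users WHERE email = NEW.email);
END;
CREATE TRIGGER subscription_email_known_update
BEFORE UPDATE OF email ON notification_subscriptions
FOR EACH ROW
BEGIN
    SELECT RAISE(ABORT, 'email is not a registered account')
    WHERE NOT EXISTS (SELECT 1 FROM users WHERE email = NEW.email);
END;
"""


def open_db() -> sqlite3.Connection:
    """Create the schema with one registered account."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO users VALUES ('admin@acme.example')")
    conn.commit()
    return conn


def subscribe(conn: sqlite3.Connection, catalog_prefix: str, email: str) -> bool:
    """Insert a subscription; return False if a trigger rejects the email."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO notification_subscriptions (catalog_prefix, email) "
                "VALUES (?, ?)",
                (catalog_prefix, email),
            )
        return True
    except sqlite3.DatabaseError:
        return False


def drop_email_checks(conn: sqlite3.Connection) -> None:
    """Remove the temporary constraint; existing rows are untouched."""
    conn.execute("DROP TRIGGER subscription_email_known_insert")
    conn.execute("DROP TRIGGER subscription_email_known_update")
    conn.commit()
```

While the triggers exist, an address such as custom-trigger-123@pagerduty.com is rejected; after `drop_email_checks`, the same insert succeeds against the unchanged table.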

Would this also avoid the extra notification_subscriptions_ext view?

notification_templates

Having a CMS of some sort in the loop makes plenty of sense. But, can we not use Strapi? It's the CMS system we already have in place, that people who might be editing this content know how to use, and has a reasonable GUI. There's also an email plugin for templated email. There may be issues with it, but we should be sure it isn't fit for purpose before introducing a new thing.

Having an internal notification_templates table isn't a deal-breaker, and can be shipped as a V1, but I do think we'd want to remove it and use something like ☝️ instead. Or, if easier, we could punt on this by having message contents live in-code in the edge function for now. What's the best path from where we are right now?

One other comment is that we should try and stick to established industry terminology, which is:

Firing: an alert is currently active
Acknowledge: an alert has been acknowledged by a human recipient (we're not trying to model this).
Resolved: a previously-firing alert is no longer firing

The PagerDuty docs have a good writeup, but Prometheus and Grafana use identical terminology.

data_processing_notifications

There is a view associated with this table, data_processing_notifications_ext, whose main purpose is to gather the data required to determine whether a notification (of this classification) needs to be sent and compose its templated content.

There's a key question we're circling around here that I want to get at: what is an alert, exactly? We have one kind that we're defining today -- no data processed in an $interval -- and we expect to have others, like telling users that they've gone over free tier limits and are now in a free trial, or that payment failed, and other things that I can't foresee right now.

We should explicitly model and get consensus on what an alert is. I think it's defined as:

Its type (classification), such as data_not_processed_in_interval or payment_failed or in_free_trial. This is a machine-accessible enumeration of alert types which is produced by the specific alert generator, and is known to the CMS system which is generating user content.
Its catalog name, which scopes the entity that produced the error, and is evaluated against notification_subscribers to identify all recipients.
Its template arguments, perhaps as JSON generated by the alert generator and consumed by a CMS template.
(Maybe?) Additional flags which control behavior. An example might be a send_resolved flag which determines whether an email should be sent when the alert is no longer firing.

The natural "key" of an alert is (alert-type, catalog-name).
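
To make the recipient lookup concrete, here is a minimal sketch of evaluating an alert's catalog name against subscriptions. It assumes that a subscription applies when its catalog prefix is a string prefix of the alert's catalog name, and that prefixes end in "/". `Subscription` and `recipients_for` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Subscription:
    catalog_prefix: str  # assumed to end with "/", e.g. "acmeCo/"
    email: str


def recipients_for(catalog_name: str, subscriptions: Iterable[Subscription]) -> List[str]:
    """Return the de-duplicated, sorted emails subscribed to a matching prefix."""
    return sorted(
        {s.email for s in subscriptions if catalog_name.startswith(s.catalog_prefix)}
    )
```

The trailing slash matters in this sketch: without it, a subscription on "acmeCo" would also match alerts under "acmeCorp/".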

Whatever the exact definition is -- and we may want to model it as an explicit DB type -- an observation is that the role of the data_processing_notifications_ext view is to apply alert-specific business logic to generate currently-firing alerts.

This is a really nice pattern and I'm thinking we should rally around it.

To put it in slightly different words, the concrete table data_processing_notifications is an implementation detail in support of the data_not_processed_in_interval alert type, and the REAL API for this alert type is data_processing_notifications_ext:

It's an API that surfaces alerts in a common "alert" shape.
- It returns a row when an alert is currently firing when queried.
- It stops returning a row when that alert has resolved.
As a view, it can be used for sending emails, or to power UI notifications, or for other messaging mechanisms.
We can compose these views via UNION ALL to build a holistic picture of currently-firing alerts within the platform.
- A blessed composed view could represent all "production" alerts in the platform.
Experimenting with new alert types is as simple as defining a view that returns the right "alert" shape.
- This de-risks adding new types of alerts (just a view!), and lets us surface them as "production" by updating another composing view.
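
A small sketch of the composition idea, again using in-memory SQLite in place of Postgres. The backing tables (`stale_tasks`, `failed_payments`), the per-type view names, and `alerts_production` are all hypothetical; the only thing taken from the discussion is the common (alert_type, catalog_name, arguments) shape and the use of UNION ALL.

```python
import sqlite3

SCHEMA = """
-- Hypothetical backing tables; each alert type keeps its own implementation detail.
CREATE TABLE stale_tasks (
    catalog_name TEXT PRIMARY KEY,
    evaluation_interval_hours INTEGER NOT NULL
);
CREATE TABLE failed_payments (tenant TEXT PRIMARY KEY);

-- Each view exposes the same "alert" shape: (alert_type, catalog_name, arguments).
CREATE VIEW data_not_processed_firing AS
SELECT 'data_not_processed_in_interval' AS alert_type,
       catalog_name,
       '{"evaluation_interval_hours": ' || evaluation_interval_hours || '}' AS arguments
FROM stale_tasks;

CREATE VIEW payment_failed_firing AS
SELECT 'payment_failed' AS alert_type,
       tenant || 'payment-failed' AS catalog_name,
       '{}' AS arguments
FROM failed_payments;

-- The composed view is the single place that "blesses" alert types as production.
CREATE VIEW alerts_production AS
SELECT alert_type, catalog_name, arguments FROM data_not_processed_firing
UNION ALL
SELECT alert_type, catalog_name, arguments FROM payment_failed_firing;
"""


def open_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def current_alerts(conn: sqlite3.Connection):
    """Return (alert_type, catalog_name) for every currently-firing alert."""
    return conn.execute(
        "SELECT alert_type, catalog_name FROM alerts_production "
        "ORDER BY alert_type, catalog_name"
    ).fetchall()
```

Adding an experimental alert type in this sketch means defining one more view with the same three columns; promoting it means adding one more UNION ALL branch.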

Side note: individual views are likely to get slow, depending on the business logic, but we can accelerate them as-needed by having select alert types use materialized views which are refreshed using a type-specific pg_cron schedule.

The one place where this ☝️ approach is incompatible with the current design, is the acknowledged column of data_processing_notifications.

I think that "have I sent an email yet?" is separate state that should be tracked internally by the mechanism that's responsible for sending emails. This is a convergence problem -- this email-sending mechanism is running scheduled iterations of:

query a view to figure out all currently-firing alerts, and outer join it with firing alerts that I've already emailed about
for alerts returned by the view which I haven't emailed, I send an email and track the alert
for alerts not returned by the view which I'm currently tracking, I send a "resolved" email and stop tracking the alert.
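
The three steps above can be sketched as a single reconcile function. This is an illustration only: `reconcile`, `send_fired`, and `send_resolved` are hypothetical names, the tracked state is an in-memory set rather than a table, and alerts are identified by their (alert_type, catalog_name) key.

```python
from typing import Callable, Iterable, Set, Tuple

AlertKey = Tuple[str, str]  # (alert_type, catalog_name)


def reconcile(
    firing: Iterable[AlertKey],
    tracked: Set[AlertKey],
    send_fired: Callable[[AlertKey], None],
    send_resolved: Callable[[AlertKey], None],
) -> Set[AlertKey]:
    """Run one scheduled iteration and return the updated set of tracked alerts."""
    firing_now = set(firing)
    updated = set(tracked)

    # Firing but not yet emailed: send the alert email and start tracking.
    for key in sorted(firing_now - tracked):
        send_fired(key)
        updated.add(key)

    # Tracked but no longer firing: send the "resolved" email and stop tracking.
    for key in sorted(tracked - firing_now):
        send_resolved(key)
        updated.discard(key)

    return updated
```

Running it twice with the same view output sends nothing the second time, which is the convergence property described above.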

Questions

Can any admin user adjust the notification configuration for a task?

I think "yes". Any user with admin to the live spec's catalog name of data_processing_notifications is able to mutate it.

How should the notification_templates table account for internationalization? Is the standard approach in this codebase sufficient for now? Is there an existing or preferred approach for processing templated messages?

I think this falls out of the question on Strapi.

What next?

Please poke holes or push back on what I've written where there's disagreement.

Also, while a lot of this feedback is broadly compatible with the model you've proposed, some of it is probably a significant departure from current implementation. I think it's acceptable to take measured short-cuts to get this feature out the door, but I do want there to be a consensus design outcome from this discussion that we're moving towards.

5 replies

kiahna-tucker Oct 24, 2023
Maintainer Author

`notification_subscriptions`

Are we unable to send an email to custom-trigger-123@pagerduty.com without verifying it? Or we risk reputation harm / issues if it bounces repeatedly over time or is marked as spam? Put differently, do we require automated processes in place from day one, or is there opportunity to Mechanical Turk it for a period of time?

Email verification is an important handshake between the system intending to communicate with the external email address and the owner of that address. While the expectation is that a customer would provide an external email address owned by themselves, a trusted party, or their organization, it is possible for an unrelated yet existing address to be entered, either through error or malicious intent. Notifications can contain sensitive information about the customer (e.g., the name of the tenant) and their system/data (e.g., catalog names can expose information about a customer's tech stack). Requiring the customer to verify an external email address (via explicit authentication or interaction with a verification email) can instill confidence that notifications will reach the intended destination as well as help a customer identify and correct unintended email addresses. Additionally, it presents the company in a professional light, for lack of a better term, in the eyes of the owner of an external email address accidentally entered by a customer -- not subjecting them to email notifications from a potentially unknown product for an individual or company with whom they are not affiliated.

With that in mind, I advocate that some degree of email verification be in place the moment a customer can enter an external email address. To address your first question, however, it is possible to send an email to an external email address without verifying it. I will leave you and @dyaffe to determine what you are comfortable with here.

Given that we do intend to allow for arbitrary emails, why not just store the desired email directly, and avoid the coupling to our auth.users table?

The intent behind this coupling was to be sensitive to changes to a customer's email address, ultimately to ensure the organization alerts table in the UI reflects the current email addresses on file and that notifications are delivered there. That said, this does raise a UX question: should notification subscriptions automatically be transferred to an updated email address in the auth.users table if that is how they originally subscribed? If not, would it be advantageous to provide a means of doing so in the UI? And, in general, is there any benefit to distinguishing between emails associated with accounts in our system and external emails in the UI?

For example, say a customer created an account with us using their personal email to test us out on behalf of their organization. After their organization approves of the adoption of Flow, this user updates their email on record to be their work email. If they subscribed to notifications using their personal email, should their subscription be automatically altered so that notifications are delivered to their work email? If not, is there an expectation that we communicate that their personal email is now considered external in the UI so they can take action if they would like notifications to be delivered to the email associated with their account?

Regarding the trigger that you mentioned, is the proposal that a trigger on BEFORE INSERT/UPDATE would be defined for the notification_subscriptions table where a verified_email would be evaluated against the emails in the auth.users table as a temporary constraint? My response to this suggestion is dependent on the scenario proposed above. The original shape the notification_subscriptions table took was as follows: id, created_at, updated_at, detail, catalog_prefix, verified_email, unverified_email. Is this closer to the data model you are envisioning here (minus the unverified_email column, I would expect)?

Would this also avoid the extra notification_subscriptions_ext view?

I do not believe so in the long run, but that depends on how we would like to accommodate Slack channels being eligible recipients of notifications. From a data modeling perspective, I would be interested in hearing any thoughts you may have about accounting for this new notification method. Happy to share my initial vision here as well.

`notification_templates`

This section was an interesting read as I got the impression that we are gearing this notification system toward email notifications only, from your perspective. It is my understanding that the intent is to devise a system that can accommodate email and in-app notifications. My largest concern when considering how notification content should be managed is that I want there to be a single source of truth for content for each notification classification (e.g., 'data_not_processed_in_interval'). Otherwise, we would need to be fairly meticulous when updating notification content, ensuring that the source of email notification content is in accord with the source of in-app notification content. Originally, I was curious if some form of template parser could be defined to aid in the preparation of templated, notification content. I will look into Strapi, thanks for the recommendation.

Given the current state of the feature, I would advocate for the abolishment of the notification_templates table to punt this CMS decision and to move the content into the edge function for the time being. I will leave a comment behind in the edge function to note that this is a temporary solution.

`data_processing_notifications`

There's a key question we're circling around here that I want to get at: what is an alert, exactly?

The definition you provided for a notification seems solid. Given the use of the term catalog name, I would argue that it is the definition of a task-based notification; but that term can easily be replaced with catalog prefix to make it applicable for higher-order, catalog prefix-based notifications as well.

The natural "key" of an alert is (alert-type, catalog-name).

Yes, with the note about the use of the term catalog name applying here as well. Given the proposed removal of the notification_templates table, the data_processing_notifications table would be altered so that it is keyed off the notification classification and catalog prefix; the live_spec_id column would be removed entirely. Now, given it is technically a task-based notification, would you prefer for the catalog prefix to be the full catalog name in this case?

Originally, all notifications were housed in a single table and I get the impression that you are leaning towards returning to that implementation. Is that correct? If so, do we envision a future where notifications whose behavior can be altered would vary by customer (e.g., the evaluation_interval of the data processing notification could be customer-specific)? I broke away from housing all notifications in a single table because I worried about the growth potential for this table if notifications had this level of customization.

Side note: individual views are likely to get slow, depending on the business logic, but we can accelerate them as-needed by having select alert types use materialized views which are refreshed using a type-specific pgcron schedule.

I am still processing this idea, but I do not want to delay this lengthy response.

NOTE: Not to be pedantic but an alert is a specific type of notification from a UX perspective. That said, I understand that the two terms tend to be used interchangeably by many. I tend to use the terms selectively.

Additional Questions

What is the desired UX when there is no longer a single subscription to a defined notification? Should the entry in the corresponding notification table (e.g., data_processing_notifications) be removed?

dyaffe Oct 24, 2023
Maintainer

I'm fine with us not taking on email verification now and putting it on the list to follow up with when it becomes important.

psFried Oct 25, 2023
Maintainer

The intent behind this coupling was to be sensitive to changes to a customer's email address, ultimately to ensure the organization alerts table in the UI reflects the current email addresses on file and that notifications are delivered there. That said, this does raise a UX question: should notification subscriptions automatically be transferred to an updated email address in the auth.users table if that is how they originally subscribed? If not, would it be advantageous to provide a means of doing so in the UI? And, in general, is there any benefit to distinguishing between emails associated with accounts in our system and external emails in the UI?

These are good questions and considerations. It makes me wonder what would even happen if you did change the email associated with an OAuth provider. IIRC someone may have tested changing their GH email and logging in. Does it update the address in auth.users? Given the requirement to support arbitrary emails, I think I'd probably lean toward just having an email column that we manually populate with the address from auth.users, just since it feels a little easier to read and comprehend (at least, to me right now).

TBH, though, I think if we wanted we could probably write a migration that converts from using a reference to auth.users to using a denormalized email on notification_subscriptions. Given a relatively easy way to change our minds in the future, I'm less inclined to try to arrive at an ideal modeling right now.

psFried Oct 25, 2023
Maintainer

Not to be pedantic but an alert is a specific type of notification from a UX perspective. That said, I understand that the two terms tend to be used interchangeably by many. I tend to use the terms selectively.

Here's how I'm understanding it: The backend has a distinction between an alert, a condition that may be firing or not, and a notification, an attempt to tell some subset of users about it. There is a joint in between those, that I predict will become more apparent once we start accounting for there being many types of alerts, and a small handful of types of notifications. I may like the term monitor or condition even better than alert, though, since they're a little more distinct from notification.

If that distinction makes sense to y'all, then I would suggest renaming data_processing_notifications to data_processing_conditions|monitors|alerts and using the same suffix for others moving forward.

Apropos, I kind of tripped over this comment, actually:

Originally, all notifications were housed in a single table...

But in response, I'll ask @jgraettinger to clarify, because that is not how I interpreted his comment. My interpretation, which is also an opinion I hold, was not that it suggests changing the existing table-per-alert design, but that it's more about just wanting a common facility for tracking whether notifications have been dispatched for specific alert conditions or not.

I think that "have I sent an email yet?" is separate state that should be tracked internally by the mechanism that's responsible for sending emails.

I strongly agree, and would like to try putting it into slightly different terms:

If we consider there being multiple notification channels that may apply to a single alert condition, a more complete modeling might be to have a table that stores sent_notifications per notification_subscriptions group. I'm thinking of what happens when we have multiple notification types that want to send for a given alert, and sending one of them succeeds and the other fails. We wouldn't want to repeatedly send out emails for the same alert conditions in cases where sending the Slack notification fails. But we also wouldn't want to skip re-trying the Slack notification on the next go-around. Ideally we'd be able to handle failures of different notification channels separately. Another way of asking the question is, "what is the unit of work for notifications?" I think ideally we want it to be pretty granular (e.g. keyed on (notification_subscription_id, catalog_name, alert_type)) so that we can avoid re-sending duplicate notifications.
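
A sketch of that granular unit of work, assuming each subscription maps to exactly one delivery channel. `dispatch`, `WorkKey`, and the `senders` mapping are hypothetical, and the `sent` set stands in for a sent_notifications table.

```python
from typing import Callable, Dict, Iterable, List, Set, Tuple

# (notification_subscription_id, catalog_name, alert_type)
WorkKey = Tuple[int, str, str]
# A channel-specific sender called with (catalog_name, alert_type); raises on failure.
Sender = Callable[[str, str], None]


def dispatch(
    pending: Iterable[WorkKey],
    sent: Set[WorkKey],
    senders: Dict[int, Sender],
) -> List[WorkKey]:
    """Attempt each unsent unit of work once; return the units that failed."""
    failed: List[WorkKey] = []
    for key in pending:
        if key in sent:
            continue  # already delivered: never re-send a duplicate
        subscription_id, catalog_name, alert_type = key
        try:
            senders[subscription_id](catalog_name, alert_type)
        except Exception:
            failed.append(key)  # left out of `sent`, so it is retried next run
            continue
        sent.add(key)
    return failed
```

Because success is recorded per key, a failing Slack delivery is retried on the next run without the email for the same alert being sent again.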

kiahna-tucker Oct 25, 2023
Maintainer Author

Here's how I'm understanding it: The backend has a distinction between an alert, a condition that may be firing or not, and a notification, an attempt to tell some subset of users about it.

I included the terminology aside to share that notification-related language differs depending on domain, not that I was having difficulty following any applications of terms found in both domains. I would defer to those, like yourself and Johnny, to determine how terms should be employed.

My interpretation, which is also an opinion I hold, was not that it suggests changing the existing table-per-alert design, but that it's more about just wanting a common facility for tracking whether notifications have been dispatched for specific alert conditions or not.

I understood that this was the heart of the conversation on the subject, but I wanted to make sure there wasn’t an implicit suggestion in the text. If “alerts” take a common shape, have a predefined set of flags associated with them, and will not be customized by the user for the foreseeable future (see original reply for context), what is the benefit of having a table per “alert?” At that point, all “alerts” can be thrown into a single table (keyed as discussed) and the views can contain logic to select the “alert” relevant to them.

I think that "have I sent an email yet?" is separate state that should be tracked internally by the mechanism that's responsible for sending emails.

I agree.

for alerts not returned by the view which I'm currently tracking, I send a "resolved" email and stop tracking the alert.

That said, I do not see how you could rely on the results of the view and the table to determine whether an email is needed because a row will exist in the table as long as a configuration exists. In the case of this data processing alert, there will be a row in the table for each task that has an evaluation interval that is not null. Sure, the view can be altered so that it only contains “alerts” that are “firing” but absence from the view does not indicate the need for a confirmation email. We would need a dedicated table for tracking that which the edge function would be tasked with managing.

jgraettinger · 2023-10-25T16:43:39Z

jgraettinger
Oct 25, 2023
Maintainer

On verification & authorization

In some contexts, like our dashboard, the ability to demonstrate ownership of an email address bestows the bearer with a capability to act as that account and do things. Verification is the proof that "you" (as a browser) are the owner of the email (equivalent: account), and your browser can thus act as an agent for that account with all of its capabilities.

This is not the situation we have here. An email to which we send alerts is given no capabilities to an account within our system. The only capability is an implicit one, to read the contents of alerts we're sending out. But, that flow has already been authorized by an account in our system having admin over the alerted entities, which set up the email in the first place.

From an AuthN / AuthZ perspective, I see this as a non-issue. The only role of verification in this context is to prevent spamming of email that doesn't want it (or equivalently: to prevent problems with email reputation). I definitely think we can punt on this.

On terminology

This discussion is mixing up concepts of "alert", "notification", and "subscription".

My mental picture has been that "alert" and "notification" are the exact same thing -- a system condition that can be firing or resolved, that we're telling the user about.

I honestly don't know where / how "notification" entered the scene -- it doesn't appear in the PRD and isn't a separable product concept in any discussion I've been part of -- and I've been rolling with it as a synonym of "alert" to date, but it's getting confusing and we need to pick one and be consistent (some of this discussion now seems to mix a "notification" and "subscriber/subscription" concept).

Proposal: "alert" and "alert subscriber/subscription" are the technical terms we use. "notification" is not.
If this doesn't work, I'd like to hear a concise argument for why "alert", "notification", and "subscriber/subscription" must stand on their own as independent technical concepts.

Another one is "classification" vs "type" for the taxonomy of specific kinds of alerts. IMO classification is a long word to type and easily misspelled, and it also emotes a notion of triaging of firing alerts that isn't actually intended. For example, I might expect a classification of alerts to be a higher-order taxonomy applied to a set of concrete types (akin to log levels applying to the concrete places where logs are emitted within a codebase).

On coupling to the `auth.users` table vs a column having just the email

Accounts in our system are 1:1 with email addresses. AFAIK, it's not even possible to change an email address -- that is de facto a separate account.

Still, if it were possible, I don't think it's required or even necessarily desired that that should automatically update an alert subscriber list. That's not a common expectation in comparable systems I've used. For this reason I don't see value in creating a coupling between auth.users and alert recipients.

should notification subscriptions automatically be transferred to an updated email address in the auth.users table if that is how they originally subscribed?

No.

If not, would it be advantageous to provide a means of doing so in the UI?

No.

And, in general, is there any benefit to distinguishing between emails associated with accounts in our system and external emails in the UI?

None I'm aware of.

For example, say a customer created an account with us using their personal email to test us out on behalf of their organization. After their organization approves of the adoption of Flow, this user updates their email on record to be their work email.

The premise is mistaken, because the work email would be a distinct account in our system which would need explicit authorization setup to the original tenant. I think it's 💯 desirable that any alert subscriptions act as simple email lists and must be updated accordingly.

If not, is there an expectation that we communicate that their personal email is now considered external in the UI so they can take action if they would like notifications to be delivered to the email associated with their account?

No.

Is this closer to the data model you are envisioning here (minus the unverified_email column, I would expect)?

Yep, that's right. My read of the situation is that the discussed trigger is truly optional here and we should probably not even do that, but we could if necessary.

Would this also avoid the extra notification_subscriptions_ext view?

I do not believe so in the longrun, but that depends on how we would like to accommodate Slack channels being eligible recipients of notifications.

An alert subscription could be represented as a one-of of an email or $config to write into a slack channel (e.g. we could alter the table to make the email column nullable, and to add nullable columns for slack). I'm confident we can slot this in, though, and it's not a problem for today.
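
One way to picture the one-of constraint is a small sketch in which exactly one delivery target must be set. `AlertSubscription` and `slack_channel` are hypothetical; `slack_channel` is only a placeholder for whatever "$config" ends up being.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertSubscription:
    catalog_prefix: str
    email: Optional[str] = None          # nullable once Slack is supported
    slack_channel: Optional[str] = None  # hypothetical stand-in for "$config"

    def __post_init__(self) -> None:
        # Exactly one delivery target may be set, mirroring a one-of constraint.
        if (self.email is None) == (self.slack_channel is None):
            raise ValueError("exactly one of email or slack_channel must be set")
```

In the database, the equivalent would presumably be nullable columns plus a check that exactly one of them is non-null.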

On Templates

This section was an interesting read as I got the impression that we are gearing this notification system toward email notifications only, from your perspective.

Not quite, but we're not trying to provide messaging outside of email just yet, and a message delivered in-browser is probably going to have different content than email anyway. So, the important joint for right now is just to have a system-interpretable alert type (data_not_processed_in_interval), that can be used in the future to predicate CMS lookups for customized and context-specific messaging.

I do think that alerts -- as a set of enumerable "firing" system entities -- should be decoupled from the specifics of how they're messaged to users.

My largest concern when considering how notification content should be managed is that I want there to be a single source of truth for content for each notification classification

Maybe. A reason to be cautious is that we don't understand how this will evolve just yet, and loose coupling gives flexibility. Tactically, my suggestion here is to use a DB enum to extensibly represent alert type variants and to otherwise not worry too much about this.
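A minimal sketch of what the enum suggestion could look like in PostgreSQL. Only `data_not_processed_in_interval` comes from this discussion; the second variant is a placeholder to show how the type would be extended.

```sql
-- The one alert type discussed in this thread.
create type alert_type as enum (
  'data_not_processed_in_interval'
);

-- Adding a variant later does not rewrite existing rows.
-- 'example_future_alert' is a placeholder, not a planned alert.
alter type alert_type add value if not exists 'example_future_alert';
```

Enum values can be appended but not dropped, which fits an alert taxonomy that is expected to grow.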

Given the current state of the feature, I would advocate for the abolishment of the notification_templates table to punt this CMS decision and to move the content into the edge function for the time being

👍

On implementation

There's a key question we're circling around here that I want to get at: what is an alert, exactly?

The definition you provided for a notification seems solid. Given the use of the term catalog name, I would argue that it is the definition of a task-based notification; but that term can easily be replaced with catalog prefix to make it applicable for higher-order, catalog prefix-based notifications as well.

I don't think that's necessary. If you want to create a "foo" billing alert for a tenant -- which is-a catalog prefix -- you can trivially turn it into a catalog name as $tenant/foo-billing-alert. The role of catalog names in this context is to 1) identify a specific entity that's alerting, and 2) to play well with our catalog authorization system. It's not required that these be specification names or explicitly modeled entities that are interpretable in other contexts.

One other update: I no longer think that alerts have flags. The original motivation was the question "how do we know whether to send an email or not when an alert is resolved?", but that's better done in the CMS system.

So, just (alert-type, catalog-name, bag-of-arguments).

Originally, all notifications were housed in a single table and I get the impression that you are leaning towards returning to that implementation. Is that correct?

I think tables are not the important joint here: Views are.

For a specific alert type, like this "no data is processing" one, tables are helpful implementation detail for the view to use, to identify tasks it should look at and how. But the table has no role to play outside of the implementation of the view for that specific alert type.

do we envision a future where notifications whose behavior can be altered would vary by customer (e.g., the evaluation_interval of the data processing notification could be customer-specific)?

I think it's catalog task specific, as an implementation detail of a "no data is processing" alert view.

NOTE: Not to be pedantic but an alert is a specific type of notification from a UX perspective. That said, I understand that the two terms tend to be used interchangeably by many. I tend to use the terms selectively.

I think you're alluding to "alert" as being a level that's applicable to a user facing notification, alongside "info" or "warn". Yes, we're explicitly inverting this taxonomy in this context.

In a UI context, a notification is a message conveyed to a user which has an associated level. In this context, an alert is a (possibly configured) check of a condition which may or may not be firing, and its state transitions can implicate messaging sent to a user.

Perhaps "alarm" might be a better term instead of "alert"? But it's not worth up-turning very well established and trodden terminology ground in the broader industry.

What is the desired UX when there is no longer a single subscription to a defined notification? Should the entry in the corresponding notification table (e.g., data_processing_notifications) be removed?

I think it's fine to leave these loosely coupled for now. An alert can exist without any subscriber who might see it. We, as system operators, probably still care about these.

On alert transitions

There's some discussion that's quickly getting into "how to trigger emails?", which is fine, but some meta comments:

We can model alert transitions as events created in our system.
An event can be created in our system by periodically comparing "all current firing alerts" with "previously firing alerts", and generating events over alerts which are newly-firing or newly-resolved.
There's going to have to be state somewhere, that holds the set of "previously firing alerts" to facilitate detecting these transitions.

So far, none of this ☝️ is especially coupled to sending emails: the actual act of sending an email is conceptually downstream of the logic that's generating these events over alert transitions.

I'm completely cool with bundling them for now, though, but I did simply want to call out the architectural separability of "identifying transitions in alert states and turning those into events" vs sending email. If you model the problem as one of reacting to events, that modeling can be used to power any subscription mechanism (future Slack, etc).

(This is just occurring to me: if we must have state for understanding transitions somewhere anyway, then we could represent alert transitions as a concrete table of append-only events, and "previously firing alerts" would be a query over that table. What that buys is a) there's now a single concrete table for use of event processors that do things like send emails, and b) it can power UI/UX presenting the history of fired alerts).

7 replies

kiahna-tucker Oct 26, 2023
Maintainer Author

Drop the data_processing_alerts.alert_type column? alert_type doesn't need to be part of this implementation table, since it MUST always be the data_not_processed_in_interval variant, and the corresponding view can instead just bake that variant into its SELECT implementation.

Yes, I expected this to be recommended. There was a question posed in an earlier reply of mine that was misunderstood, so I intentionally left this column in place to get an answer to that question.

Drop the data_processing_alerts.resolved column? This may hint at an ongoing mis-understanding.

We are on two different pages here. I understand the solution described above and in your original comment, but I do not see how the number of alert and confirmation emails sent, respectively, could be restricted to just one here. This piece of state cannot live in the edge function, and I would believe the view could not be hooked into in any capacity. Could you expand on how this could be achieved in your proposed solution?

jgraettinger Oct 26, 2023
Maintainer

We are on two different pages here.

Yep, this is what the On alert transitions section of the parent is getting at.

Views tell you (only) about the set of alerts that are firing right now.
But, user messaging is something that needs to happen at the transition between when an alert is not firing => firing, and also firing => not firing.

This is a generalized problem which is invariant to any specific alert type: we have to detect these transitions.

A proposal:

We have a view alert_all_firing which is a UNION ALL of various specific alert views.
We have a table alert_history with columns:

Column	Type	Description
`alert_type`	`alert_type`	The alert type of the alert.
`catalog_name`	`catalog_name`	The catalog name of the alert.
`fired_at`	`timestamptz not null`	The time at which the alert started firing.
`resolved_at`	`timestamptz`	The time at which the alert was resolved.
`arguments`	`json`	The bag-of-arguments of the alert, as-of when it first fired.

We have a regularly-scheduled pg_cron function which detects differences between alert_all_firing and alert_history.

alert_history lets us query for the set of known-to-be-firing alerts.
alert_all_firing gives you the current set. So, a function would essentially run an update followed by an insert:

-- Resolve alerts that have transitioned from firing => !firing
-- (compare open history rows against the set that is firing right now)
with open_alerts as (
  select alert_type, catalog_name from alert_all_firing
)
update alert_history set resolved_at = now()
    where resolved_at is null and (alert_type, catalog_name) not in (select * from open_alerts);

-- Create alerts which have transitioned from !firing => firing
with open_alerts as (
  select alert_type, catalog_name from alert_history
  where resolved_at is null
)
insert into alert_history (alert_type, catalog_name, fired_at, arguments)
select alert_type, catalog_name, now(), arguments from alert_all_firing
  where (alert_type, catalog_name) not in (select * from open_alerts);

Messaging for alert_subscriptions, like sending email, can then be implemented in terms of a cursor timestamp. Suppose I'm figuring out alerts to email at current time $NOW, and I know I last ran at timestamp $LAST.

-- Identify emails to send from newly-firing alerts.
select * from alert_history where fired_at > $LAST and resolved_at is null;

-- Identify emails to send for resolved alerts.
select * from alert_history where resolved_at > $LAST;

I'm less knowledgeable about edge functions and exactly how they're triggered (is it via their own cron, or would they be kicked off via a DB process?) but presumably we can find a way to store the timestamp of the last time it ran.

kiahna-tucker Oct 26, 2023
Maintainer Author

Yep, this is what the On alert transitions section of the parent is getting at.

Okay, when I initially read that section I thought you were suggesting that as an idea to consider in the future. This is where we diverged. Got it.

jgraettinger Oct 26, 2023
Maintainer

I'm open to other options or an incremental glide path, but if I'm being asked "how do you think this can work?" in the abstract, then ☝️

jgraettinger Oct 31, 2023
Maintainer

updates LGTM

Would we like to track errors encountered when attempting to send an alert? Or are we comfortable relying on the error handling capabilities built into the marketing (and transactional) email provider?

To start with, can we use the provider for this? I imagine they have this built out.

How long should data in the alert_history table be retained?

IMO we can start with "indefinite" and then trim its history later if/when it gets big.

psFried · 2023-11-22T20:38:39Z

psFried
Nov 22, 2023
Maintainer

The backend of the alerts functionality has been deployed and is working more or less as expected, with one caveat: alert emails are being sent in duplicate.

How it works today

As implemented, the edge function just receives a request, and then queries the database for the alerts for which it will send emails. The edge function does this by just querying for all alerts that have either started firing or resolved within the last 5 minutes (a hard-coded constant). The edge function is called by a trigger on the alert_history table, which executes FOR EACH STATEMENT. So every time there's a statement that inserts or updates rows in alert_history, the edge function is called and sends out emails for every alert that fired. At present, this results in two emails for every alert, because of the way that we update alert_history:

-- Create alerts which have transitioned from !firing => firing
with open_alerts as (
  select alert_type, catalog_name from alert_history
  where resolved_at is null
)
insert into alert_history (alert_type, catalog_name, fired_at, arguments)
  select alert_type, catalog_name, now(), arguments from alert_all_firing
  where (alert_type, catalog_name) not in (select * from open_alerts);


-- Resolve alerts that have transitioned from firing => !firing
with open_alerts as (
  select alert_type, catalog_name from alert_all_firing
)
update alert_history set resolved_at = now()
    where resolved_at is null and (alert_type, catalog_name) not in (select * from open_alerts);

Because there are two separate statements for insert and update, the edge function gets called twice and duplicates the email.

How to solve

A possible short term solution is to just update the above code to be a single statement, with separate CTEs for the inserts and updates. This can probably be good enough for a while.
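One way the single-statement version might look, as a sketch rather than the deployed code. The two tables are minimal stand-ins so the example runs on its own (in the real schema `alert_all_firing` is a view), the sample rows and the `acmeCo/...` names are made up, and `not exists` replaces `not in` purely as a stylistic choice. Whether the statement-level trigger then fires only once depends on how that trigger is defined, which this thread does not show, so that would need to be verified.

```sql
-- Minimal stand-ins; the real alert_all_firing is a view.
create table alert_all_firing (
  alert_type   text not null,
  catalog_name text not null,
  arguments    json
);
create table alert_history (
  alert_type   text not null,
  catalog_name text not null,
  fired_at     timestamptz not null,
  resolved_at  timestamptz,
  arguments    json
);

-- Made-up sample state: one newly firing alert, one open alert that stopped.
insert into alert_all_firing values
  ('data_not_processed_in_interval', 'acmeCo/new-task', '{}');
insert into alert_history values
  ('data_not_processed_in_interval', 'acmeCo/old-task',
   now() - interval '1 day', null, '{}');

-- One statement: the UPDATE is a data-modifying CTE, the INSERT is the outer
-- statement. Both read the same snapshot of alert_history, so the INSERT
-- does not see the rows that the UPDATE resolves.
with firing as (
  select alert_type, catalog_name, arguments from alert_all_firing
),
resolved as (
  -- Resolve alerts that have transitioned from firing => !firing
  update alert_history h
     set resolved_at = now()
   where h.resolved_at is null
     and not exists (
       select 1 from firing f
        where f.alert_type = h.alert_type
          and f.catalog_name = h.catalog_name
     )
  returning h.alert_type, h.catalog_name
)
-- Create alerts which have transitioned from !firing => firing
insert into alert_history (alert_type, catalog_name, fired_at, arguments)
select f.alert_type, f.catalog_name, now(), f.arguments
  from firing f
 where not exists (
   select 1 from alert_history h
    where h.resolved_at is null
      and h.alert_type = f.alert_type
      and h.catalog_name = f.catalog_name
 );
```

The `resolved` CTE is never read by the outer query, but PostgreSQL still runs data-modifying CTEs to completion regardless.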

But I think we'll eventually need to make this more robust. Can we always count on being able to modify alert_history using a single statement at most every 5 minutes? Probably not. So I'd like to discuss alternatives. Here's a few to get the ball rolling:

Add a cursor (timestamp column in the DB) that the edge function would update as it makes progress sending alerts. The edge function would select for update on the cursor at the start of its execution, and then update it to the most recent timestamp that it's processed through in alert_history, keeping in mind that it'd need to process alerts strictly in order.
Re-frame the edge function to accept a request per alert (or resolution of an alert). The arguments, catalog_name, etc (and potentially even the Resend API key) could be provided in the request body, so all the edge function is doing is just formatting the message and sending out the emails.
Something else?
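For the cursor option, here is a hedged sketch of the database side. The `alert_email_cursor` table, its columns, and the `pending` temp table are all hypothetical names, `alert_history` is reduced to the columns used, and the sample row is made up. The idea is to lock the cursor, collect the transitions that are newer than it, and advance the cursor only as far as what was actually collected.

```sql
-- Stand-in for alert_history with only the columns used here.
create table alert_history (
  alert_type   text not null,
  catalog_name text not null,
  fired_at     timestamptz not null,
  resolved_at  timestamptz
);
insert into alert_history values
  ('data_not_processed_in_interval', 'acmeCo/some-task',
   '2023-11-22 20:00:00+00', null);

-- Hypothetical single-row cursor table (name and columns are assumptions).
create table alert_email_cursor (
  only_row          boolean primary key default true check (only_row),
  processed_through timestamptz not null
);
insert into alert_email_cursor (processed_through) values ('epoch');

begin;

-- Lock the cursor row so concurrent invocations wait their turn.
select processed_through from alert_email_cursor for update;

-- Collect fired and resolved transitions that are newer than the cursor.
create temp table pending on commit drop as
select h.alert_type, h.catalog_name, t.kind, t.transitioned_at
  from alert_history h
 cross join lateral (
   values ('fired', h.fired_at), ('resolved', h.resolved_at)
 ) as t(kind, transitioned_at)
 where t.transitioned_at > (select processed_through from alert_email_cursor);

-- The edge function would send one email per row here, oldest first.
select * from pending order by transitioned_at;

-- Advance the cursor only as far as what was actually processed.
update alert_email_cursor
   set processed_through = coalesce(
     (select max(transitioned_at) from pending), processed_through);

commit;
```

One caveat with any timestamp cursor: `now()` returns the transaction start time, so a row written by a slow transaction can commit with a timestamp that the cursor has already passed. A sequence-based cursor, or the per-alert request option, would sidestep that.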

0 replies
