[nexus] webhooks #7277
Conversation
I think I've come around a bit to @andrewjstone's proposal that the event classes be a DB enum, so I'm planning to change that. I'd like to have a way to include a couple "test" variants in there that aren't exposed in the public API, so I'll be giving some thought to how to deal with that.
Glob subscription entries in
As far as GCing old events from the event table, dispatching an event should probably add a count of the number of receivers it was dispatched to, and then when we successfully deliver the event, we increment a count of successes. That way, we would not consider an event entry eligible to be deleted unless the two counts are equal; we want to hang onto events that weren't successfully delivered so any failed deliveries can be re-triggered. GCing an event would also clean up any child delivery attempt records.
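A minimal sketch of that bookkeeping, assuming hypothetical counter names on the event record (nothing here is actual schema or code from this PR):

```rust
/// Hypothetical counters on a webhook event record; names are illustrative.
struct WebhookEventCounts {
    /// Incremented once per receiver when the event is dispatched.
    num_dispatched: u64,
    /// Incremented each time a delivery of this event succeeds.
    num_delivered: u64,
}

impl WebhookEventCounts {
    /// The event (and its child delivery-attempt records) would only be
    /// eligible for GC once every dispatched delivery has succeeded, so that
    /// failed deliveries can still be re-triggered.
    fn gc_eligible(&self) -> bool {
        self.num_dispatched == self.num_delivered
    }
}
```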
This commit adds (unimplemented) public API endpoints for managing Nexus webhooks, as described in [RFD 364][1]. [1]: https://rfd.shared.oxide.computer/rfd/364#_external_api
Co-authored-by: Augustus Mayo <augustus@oxidecomputer.com>
Only reviewed the schema so far, still need to go through db-queries, the background tasks, the tests, and the API
```rust
#[serde(
    default = "WebhookDeliveratorConfig::default_second_retry_backoff"
)]
pub second_retry_backoff_secs: u64,
```
Is this supposed to be a time "after the first retry", or relative to the start of the delivery time?
like, if we try to send a delivery, and fail, and then:
first = 5 seconds
second = 10 seconds
Are we expecting:
@ 0 seconds -> send and fail first delivery
@ 5 seconds -> send first retry delivery
@ 10 seconds -> send second retry delivery? Or is this actually at 15 seconds?
It's supposed to be the time since the previous attempt. So, in your example, it would be 0, 5, and 15 seconds.
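In other words (a small illustrative sketch, not the actual config or deliverator code): each backoff is measured from the previous attempt, so the attempts land at 0, 5, and 15 seconds.

```rust
use std::time::Duration;

/// Illustrative only: when each attempt happens, relative to the first
/// (failed) delivery, given per-retry backoffs measured from the previous
/// attempt rather than from the start of delivery.
fn attempt_offsets(first_retry_backoff: Duration, second_retry_backoff: Duration) -> [Duration; 3] {
    let first_attempt = Duration::ZERO;
    let first_retry = first_attempt + first_retry_backoff;
    let second_retry = first_retry + second_retry_backoff;
    [first_attempt, first_retry, second_retry]
}

// attempt_offsets(Duration::from_secs(5), Duration::from_secs(10))
// => [0s, 5s, 15s]
```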
```rust
use serde::ser::{Serialize, Serializer};
use std::fmt;

impl_enum_type!(
```
How critical is it that these webhook event classes are strongly-typed, as opposed to being raw strings present as "data" rather than "schema" in the database?
Renaming or removing these variants will not be trivial for "cockroachdb is silly sometimes" reasons --- see: https://github.com/oxidecomputer/omicron/tree/main/schema/crdb#212-changing-enum-variants
This is a great question, and it's something I've thought about a bit. My initial plan was, in fact, to just store these as strings in the database. @andrewjstone suggested that we represent them as an enum, instead, to avoid storing a whole bunch of duplicate copies of the same relatively small set of strings. This also has the advantage of tying the set of event classes quite closely to the database schema version, which is used to determine when glob subscriptions need to be reprocessed.
Another option, which occupies a middle ground between using an enum and just storing the string representation of the class for every event, would be to instead have a table of event class strings along with some numeric identifier, so we could represent them more concisely on disk using the numeric identifier. That way, they could be inserted or removed by queries rather than by a schema update.
Though that flexibility seems appealing, I'm not actually sure if it's a good thing. On updates that add new event classes, we'd probably have the new Nexus version go and run some queries to add the new classes to the database. This is a bit ad-hoc, and it means that it's at least theoretically possible to change the set of event classes without changing the schema version, so it's much harder to reason about whether the glob subscriptions are correct based on what's presently in the database. With an enum, any time we add new classes, we must add a migration to do so, which ensures that the schema version accurately indicates what classes exist.
Finally, I'm not actually that convinced that being unable to easily remove event classes is that big of a problem. For enums that represent operational states (e.g. `instance_state` or similar), or that represent a policy or mode of behavior, it's actually very important to not be able to represent defunct variants: code that consumes this data needs to handle every possible state or policy or whatever, and it's very unfortunate to have to have match arms or other cases for stuff that's no longer used that just panics or logs an error. In this case, however, these are basically just strings that we do regex matching on, and the enum is mostly used to reduce the size of the database record (and to tie the set of strings to the schema version, as I mentioned). So, if we leave behind some event classes that we've stopped using, but they're still there in the enum...honestly, who cares? They don't actually cause that big of a problem just by being there, and we can filter them out of the "list event classes" API response so that users don't get the idea that they still exist. It's a bit ugly to leave behind stuff that's no longer used, but the consequences are less bad than for something like `VmmState` or `SledPolicy`...
Hopefully that all makes some amount of sense. I'm certainly not that attached to this approach, but that was the rationale I was operating under.
I think one area where I'd give caution is coupling "database schema" with "API schema".
Storing an `enum` in the database does keep the set of event classes tightly coupled with the schema. But don't we actually care about the event classes we've promised through the API?
E.g., suppose a customer calls the "GET list of all EventClass objects" API. They get some result as a response. Then we update Nexus, and the set changes immediately.
With this structure, we're giving the guarantee that "all variants of this enum" are valid EventClass targets, unless we explicitly filter them. I think this is roughly equivalent to what you said in:
> With an enum, any time we add new classes, we must add a migration to do so, which ensures that the schema version accurately indicates what classes exist.
Since there is some glob processing which must happen to process new subscriptions, this also means that our "duration of time to perform schema updates" is going to increase proportionally with the amount of webhooks usage.
Imagine an alternative world:
- Event classes are stored as their own table in the database. They contain `TEXT`, but also some fields indicating if they're "activated" or not (basically - have we turned them on/off). (A rough sketch of such a table follows this list.)
- Enabling/disabling event classes can now be the responsibility of a background task, rather than a schema upgrade. We can have a big list of "event classes we want to use".
- Do you have an event class in this list which doesn't have a record? Add it. Re-process all globs. Then mark it as visible.
- Do you have an event class in the database, which isn't in this list? You can mark it as deprecated, and remove all subscriptions directly to it. Once all subscriptions are gone, it can be fully deleted. (Alternatively, we could have an explicit list of "removed" event classes, if we want to be really cautious about re-use.)
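A rough sketch of what a row in such a table might look like (the field names here are hypothetical, not actual schema or code from this PR); a background task could then reconcile these rows against the in-code list of wanted classes:

```rust
/// Hypothetical row in an event-class table, sketching the "TEXT plus
/// activation state" idea described above.
struct EventClassRow {
    /// The class string itself, e.g. "instance.start".
    name: String,
    /// Set once globs have been reprocessed for this class and it can be
    /// shown in the external API.
    active: bool,
    /// Set when the class is removed from the wanted list; subscriptions are
    /// drained before the row is deleted outright.
    time_deprecated: Option<std::time::SystemTime>,
}
```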
This seems like it might be nicer-to-use than having such a tight coupling with the database:
- It lets developers add and remove event classes really easily. They're text identifiers, basically!
- It avoids hard-coding the test case event classes.
- It makes schema changes execute faster -- globs can be processed out-of-band, by the background task subsystem.
okay, I'm still reviewing the glob reprocessing code, perhaps a chunk of this is already the case? Certainly seems that the `webhook_dispatcher` is invoking this glob-reprocessing code. So we presumably already won't be blocking schema updates proportional to webhook usage.
Now that I've actually read through more of this (and really, why bother with that before commenting), I see that:
- The glob re-processing should guard against this, and the ordering of background tasks means that it'll basically be the first webhooks-related operation to happen after an update, even though it won't block schema changes themselves. This is great!
- Because of that, the choice of "enum" vs "TEXT" matters a lot less. There might be some developer-experience gains from going to a TEXT-based list instead of an enum-based list (we can reconcile a list of event classes, just like we can reconcile an enum, and I still think it might be nice to decouple this from the database), but it seems possible to punt this. Even if we did go to an enum-less event-class world, we'd probably end up doing re-processing of globs in a similar spot to this PR.
```rust
impl std::str::FromStr for WebhookEventClass {
    type Err = EventClassParseError;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        for &class in Self::ALL_CLASSES {
```
How many event classes do you think we're gonna end up with, if you had to place a bet, 2-3 years from now?
I don't think this is a problem now, but I expect this enum will grow quite large as it expands to include at least "all possible faults" and "all possible invocations of the interface", right?
Any operation (outside of tests) that requires iteration over all variants seems like it's worth a little scrutiny IMO
Yeah, this definitely won't scale to large numbers of event classes. As this list grows, I would definitely consider a parsing strategy that doesn't require iterating over every possible class --- since event classes form a tree-like hierarchy, we could eventually write a parser that parses them segment-by-segment: i.e., if you have the string "instance.start", you would see "instance" and know that the next segment has to be one of `["start", "stop", "create", "delete", ...]`, and only look for those tokens. I didn't do that here because we don't currently have that many event classes --- maybe there should be a comment here proposing we do that in the future.
With that said, I'm not really sure if there are going to be quite as many of these as you imagine: for FMA, I don't think every possible fault class emitted by illumos FMA is necessarily going in here, so I dunno if we'll see `"ereport.cpu.generic-x86.cache"` or similar. We might just have broader categories in the webhook event class, like "sled.fault", "sled.disk.fault", etc., and put more detailed fault class strings in the JSON payload. I do think it's worth thinking about the cardinality of this though!
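A rough sketch of what segment-by-segment parsing could look like (the class names and return type here are illustrative, not the actual `WebhookEventClass` implementation):

```rust
/// Illustrative only: parse an event class one dot-separated segment at a
/// time, so we never iterate over the full set of classes.
fn parse_event_class(s: &str) -> Option<&'static str> {
    let mut segments = s.split('.');
    let class = match segments.next()? {
        "instance" => match segments.next()? {
            "start" => "instance.start",
            "stop" => "instance.stop",
            _ => return None,
        },
        // Other top-level segments ("sled", "probe", ...) would be matched here.
        _ => return None,
    };
    // Reject trailing segments, e.g. "instance.start.bogus".
    if segments.next().is_none() { Some(class) } else { None }
}
```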
```rust
impl WebhookDeliveryAttempt {
    fn response_view(&self) -> Option<views::WebhookDeliveryResponse> {
        Some(views::WebhookDeliveryResponse {
            status: self.response_status? as u16, // i hate that this has to be signed in the database...
```
You might be interested in https://github.com/oxidecomputer/omicron/blob/main/nexus/db-model/src/unsigned.rs - we have used different types here to make unexpectedly-signed values a serialization error when we try reading them from the database.
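For illustration, here's the general shape of that pattern as a standalone newtype (this is not the actual type from `unsigned.rs`, just the idea of turning a negative database value into an explicit error rather than an `as u16` cast):

```rust
use std::convert::TryFrom;

/// Standalone sketch of the "unsigned wrapper" pattern: converting from the
/// signed database representation fails loudly instead of silently wrapping.
#[derive(Debug, Clone, Copy)]
pub struct U16FromDb(pub u16);

impl TryFrom<i32> for U16FromDb {
    type Error = String;

    fn try_from(value: i32) -> Result<Self, Self::Error> {
        u16::try_from(value)
            .map(U16FromDb)
            .map_err(|_| format!("expected an unsigned 16-bit value, got {value}"))
    }
}
```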
```rust
mod test {
    use super::*;

    #[test]
```
Could we test some of the error cases too?
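For instance, something along these lines (a hedged sketch; the invalid strings and error expectations here are assumptions about the `FromStr` impl above):

```rust
#[test]
fn test_event_class_parse_rejects_unknown_strings() {
    // Strings that don't correspond to any event class should fail to parse.
    assert!("".parse::<WebhookEventClass>().is_err());
    assert!("not.a.real.class".parse::<WebhookEventClass>().is_err());
}
```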
Co-authored-by: Sean Klein <sean@oxide.computer>
Co-authored-by: Sean Klein <sean@oxide.computer>
Co-authored-by: Sean Klein <sean@oxide.computer>
Conflicts: nexus/db-model/src/schema_versions.rs
```rust
    )),
)
.set((
    dsl::time_delivery_started.eq(now.nullable()),
```
(I think this is intentional, but to confirm) so the "time delivery started" column represents when the latest deliverator has started attempts to send a webhook delivery, not when the first deliverator started, right?
This seems only relevant in cases where lease timeouts have occurred
Yeah, that's correct. I think perhaps it ought to be renamed to something like `time_leased` to make its purpose clearer.
```rust
}

Err(Error::internal_error(
    "couldn't start delivery attempt for some secret third reason???",
```
Maybe we could update this to:
> Found an incomplete webhook delivery which has not been claimed by another Nexus, but which our Nexus cannot claim for an unknown reason
Is there any additional debug info about the event we'd want to include in this case? Maybe the state?
This could happen in any of the cases where the `id` exists, but the `filter` clauses don't match. So, as one example, this could happen when the state is `failed`, but `time_completed` is not set?
```rust
/// Send liveness probe to webhook receiver
///
/// This endpoint synchronously sends a liveness probe request to the
```
What's the motivation behind synchronously sending the liveness probe (e.g., blocking the caller here) instead of treating it like any other event, through the background task system?
The primary intended use of this endpoint is for external monitoring systems to determine if the webhook receiver endpoint is both alive and reachable by the control plane (see this section in RFD 538). So, when an external health-checking system attempts to send a probe to a webhook receiver, it is doing so because it would actually like to know whether the probe completes successfully or not, which we return in the response.
Alternatively, we could do something where this endpoint just enqueues a probe that the deliverator task will send "eventually", returning a delivery ID, and then the caller can poll the delivery-list endpoint to see if that delivery ID has succeeded. But, that seems substantially more complex from the perspective of a consumer of this API that just wants to ask Nexus "hey, are you currently able to get through to the receiver endpoint?". And, a lot of health checking systems may not even be capable of doing a stateful, multi-step process of "trigger probe, remember its UUID, and then check if that UUID made it through" --- usually, I feel like these systems are just configured with a URL to hit and some simple configurations for how to interpret responses from that URL.
I was mostly wondering, "from the caller perspective, why not just check the receiver?". Basically, "trigger probe from API", and "outside the API, check the receiver to see if the probe arrived".
If no probe arrives at the receiver, it's implied that there isn't connectivity, but I suppose there's a bit of asynchrony / implicit assumptions on this pathway. I see the justification for making this endpoint synchronous, and delivering an explicit response.
So, with that - the synchronous ordering of this endpoint makes sense. I do have the minor fear that "if this code-path is disjoint from regular webhook event notifications, is it possible that the probes can get sent successfully, when some other aspect of webhook event notifications is broken?"
If you're not concerned about that, definitely fine to keep this as-is, but wanted to raise that flag -- having this dual pathway makes it plausible that "probes can be dispatched" XOR "real events can be dispatched" would be true, which would be sad.
I agree that this doesn't exercise the entire pathway from event creation to webhook delivery, but I think I feel okay with that, as it's really intended for monitoring the health of the receiver, not the control plane. I have tried to make the code that this executes as similar as possible to what runs in the deliverator task --- since it tries to create `webhook_delivery` and `webhook_delivery_attempt` records, it will fail if there's something preventing Nexus from touching those tables. And if the background tasks themselves are totally wedged, we do have other mechanisms for knowing about that.
```rust
use std::sync::Arc;
use tokio::task::JoinSet;

// The Deliverator belongs to an elite order, a hallowed sub-category. He's got
```
So, I'm cool if we keep this comment in, but can we also add a doc comment describing what this does? I read this file before `nexus/src/app/webhook.rs`, where this is actually described, and a snippet like:

```rust
//! The `webhook_deliverator` task reads these delivery records and sends
//! HTTP requests to the receiver endpoint for each delivery that is
//! currently in flight. The deliverator is responsible for recording the
//! status of each *delivery attempt*. Retries and retry backoff are
//! the responsibility of the deliverator.
```
woulda been nice to have here
```rust
) -> Result<(), Error> {
    const MAX_ATTEMPTS: u8 = 3;
    let conn = self.pool_connection_authorized(opctx).await?;
    diesel::insert_into(attempt_dsl::webhook_delivery_attempt)
```
I think this will result in us re-sending the event (which is fine), but I wanna make sure it doesn't leave the database in an otherwise inconsistent state.
```rust
if current_version != SCHEMA_VERSION {
    return Err(Error::InternalError {
        internal_message: format!(
            "cannot reprocess webhook globs, as our schema version \
             ({SCHEMA_VERSION}) does not match the current version \
             ({current_version})",
        ),
    });
}
```
I think this is defensive code, but to be clear, I don't think this is possible in the codebase today.
Creating a Datastore involves ensuring the schema is up-to-date:
omicron/nexus/db-queries/src/db/datastore/mod.rs
Lines 232 to 256 in 118a2da
```rust
// Keep looping until we find that the schema matches our expectation.
retry_notify(
    retry_policy_internal_service(),
    || async {
        if let Some(try_for) = try_for {
            if std::time::Instant::now() > start + try_for {
                return Err(BackoffError::permanent(()));
            }
        }

        match datastore
            .ensure_schema(&log, EXPECTED_VERSION, config)
            .await
        {
            Ok(()) => return Ok(()),
            Err(e) => {
                warn!(log, "Failed to ensure schema version"; "error" => #%e);
            }
        };
        return Err(BackoffError::transient(()));
    },
    |_, _| {},
)
.await
.map_err(|_| "Failed to read valid DB schema".to_string())?;
```
This was defensive code intended to protect against weird behavior mid-update.
While a `DataStore` can't be created unless the schema is up to date, a `Nexus` can start at a given schema version, and then --- depending on how online updates of Nexus are actually performed once we get to that --- newer Nexii could be started with a later schema version. This was intended to protect against a situation where older Nexii remain active mid-upgrade, in order to prevent outdated Nexii from seeing a newer schema version and downgrading globs that had already been upgraded.
This may not be necessary, especially as we don't currently do online upgrades of Nexus, and I'm not sure how this will work once we start doing it (perhaps we won't update the schema until all outdated Nexus processes have stopped?). But, I figured it was worth including now just in case. I can remove it if you think it's not worth having.
Can we at a minimum add a comment identifying that this is a bit speculatively defensive?
I want to make sure that if we need to refactor this in the future, we understand that this code is not currently load-bearing, while we do this schema-change-on-reboot.
Sure, will do --- I am also fine with just removing it entirely; it just felt like something that was worth checking.
```rust
// TODO(eliza): validate endpoint URI; reject underlay network IPs for
// SSRF prevention...
```
Ah, this is a great point. Should this be blocking merging this PR?
(Also, can we stop this from using the Nexus external API as a receiver destination?)
Yes, my intent is not to merge this PR until this bit is complete. The reason it's not done yet is that we've determined the best way to do it is to use the socket option `IP_BOUND_IF`, which allows us to only bind a socket on specific interfaces (in this case, the `opte` interface), which is the only way to reliably prevent this --- rejecting underlay-like DNS names isn't sufficient as the DNS server might be malicious. However, using this socket option with `reqwest` requires upstream changes in various libraries, which I've done several PRs for but which haven't all been released, so we're waiting on this.
I left a comment discussing this and pointing at the upstream changes we're waiting on, but...GitHub has wisely decided it's not interesting and hid it by default 🙃: #7277 (comment)
```rust
//! Re-delivery of an event can be requested either via the event resend API
//! endpoint, or by a *liveness probe* succeeding. Liveness probes are
//! synthetic delivery requests sent to a webhook receiver to check whether it's
//! actually able to receive an event. They are triggered via the
//! [`Nexus::webhook_receiver_probe`] API endpoint. A probe may optionally
//! request that any events for which all past deliveries have failed be resent
//! if it succeeds. Delivery records are also created to represent the outcome
//! of a probe.
```
Since liveness probes appear to be bucketed in the "Event class" enum as the string `probe`, what happens if someone tries to make a receiver for the probe event directly?
Do we (/should we) have an integration test for this?
Good call, I'll add a test for that!
```sql
);

-- Singleton probe event
INSERT INTO omicron.public.webhook_event (
```
Our schema-change tests (like, the ones to validate that new deployments match old deployments) will definitely help you (and hopefully did help you??) with ensuring that data-definition language statements (new indices, tables, etc) match between old/new formats.
However, they won't actually help verify that data manipulation language changes (e.g., "insert into this table") are consistent between deployments.
Since you're adding some of this data as part of the schema change, I recommend checking out `nexus/tests/integration_tests/schema.rs`'s `validate_data_migration` test.
You should be able to query for this row from the `webhook_event` table, and confirm it's there. The `after_125_0_0` migration is doing something very similar.
This is intended to prevent conflicts with the SQL `trigger` keyword (as suggested by @smklein [here]). [here]: #7277 (comment)
This is intended to represent specifically the time the current lease was acquired, not when "delivery started" broadly (that's the "time_created" field). This naming should be a bit clearer --- see @smklein's comment: #7277 (comment)
Co-authored-by: Sean Klein <sean@oxide.computer>
as @smklein suggested in #7277 (comment)
This branch adds an MVP implementation of the internal machinery for delivering webhooks from Nexus. This includes:
delivery status
The user-facing interface for webhooks is described in greater detail in RFD 538. The code change in this branch includes a "Big Theory Statement" comment that describes most of the implementation details, so reviewers are encouraged to refer to that for more information on the implementation.
Future Work
Immediate follow-up work (i.e. stuff I'd like to do shortly but would prefer to land in separate PRs):
- `webhook_delivery`, `webhook_delivery_attempt`, and `webhook_event` CRDB tables (need to figure out a good retention policy for events)
- `omdb db webhooks` commands for actually looking at the webhook database tables

Not currently planned, but possible future work:

- `/v1/webhooks/event-classes` endpoints, or both)
- `fleet.viewer` (see RFD 538 Appendix B.3); probably requires service accounts