[nexus] webhooks #7277
Conversation
I think I've come around a bit to @andrewjstone's proposal that the event classes be a DB enum, so I'm planning to change that. I'd like to have a way to include a couple "test" variants in there that aren't exposed in the public API, so I'll be giving some thought to how to deal with that.
Glob subscription entries in
As far as GCing old events from the event table, dispatching an event should probably add a count of the number of receivers it was dispatched to, and then when we successfully deliver the event, we increment a count of successes. That way, we would not consider an event entry eligible to be deleted unless the two counts are equal; we want to hang onto events that weren't successfully delivered so any failed deliveries can be re-triggered. GCing an event would also clean up any child delivery attempt records.
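A minimal sketch of that bookkeeping, assuming hypothetical counter names on the event record (nothing here is actual schema or code from this PR):

```rust
/// Hypothetical counters on a webhook event record; names are illustrative.
struct WebhookEventCounts {
    /// Incremented once per receiver when the event is dispatched.
    num_dispatched: u64,
    /// Incremented each time a delivery of this event succeeds.
    num_delivered: u64,
}

impl WebhookEventCounts {
    /// The event (and its child delivery-attempt records) would only be
    /// eligible for GC once every dispatched delivery has succeeded, so that
    /// failed deliveries can still be re-triggered.
    fn gc_eligible(&self) -> bool {
        self.num_dispatched == self.num_delivered
    }
}
```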
This commit adds (unimplemented) public API endpoints for managing Nexus webhooks, as described in [RFD 364][1]. [1]: https://rfd.shared.oxide.computer/rfd/364#_external_api
Co-authored-by: Augustus Mayo <augustus@oxidecomputer.com>
Only reviewed the schema so far, still need to go through db-queries, the background tasks, the tests, and the API
```rust
#[serde(
    default = "WebhookDeliveratorConfig::default_second_retry_backoff"
)]
pub second_retry_backoff_secs: u64,
```
Is this supposed to be a time "after the first retry", or relative to the start of the delivery time?
like, if we try to send a delivery, and fail, and then:
first = 5 seconds
second = 10 seconds
Are we expecting:
@ 0 seconds -> send and fail first delivery
@ 5 seconds -> send first retry delivery
@ 10 seconds -> send second retry delivery? Or is this actually at 15 seconds?
It's supposed to be the time since the previous attempt. So, in your example, it would be 0, 5, and 15 seconds.
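In other words (a small illustrative sketch, not the actual config or deliverator code): each backoff is measured from the previous attempt, so the attempts land at 0, 5, and 15 seconds.

```rust
use std::time::Duration;

/// Illustrative only: when each attempt happens, relative to the first
/// (failed) delivery, given per-retry backoffs measured from the previous
/// attempt rather than from the start of delivery.
fn attempt_offsets(first_retry_backoff: Duration, second_retry_backoff: Duration) -> [Duration; 3] {
    let first_attempt = Duration::ZERO;
    let first_retry = first_attempt + first_retry_backoff;
    let second_retry = first_retry + second_retry_backoff;
    [first_attempt, first_retry, second_retry]
}

// attempt_offsets(Duration::from_secs(5), Duration::from_secs(10))
// => [0s, 5s, 15s]
```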
```rust
use serde::ser::{Serialize, Serializer};
use std::fmt;

impl_enum_type!(
```
How critical is it that these webhook event classes are strongly-typed, as opposed to being raw strings present as "data" rather than "schema" in the database?
Renaming or removing these variants will not be trivial for "cockroachdb is silly sometimes" reasons --- see: https://github.com/oxidecomputer/omicron/tree/main/schema/crdb#212-changing-enum-variants
This is a great question, and it's something I've thought about a bit. My initial plan was, in fact, to just store these as strings in the database. @andrewjstone suggested that we represent them as an enum, instead, to avoid storing a whole bunch of duplicate copies of the same relatively small set of strings. This also has the advantage of tying the set of event classes quite closely to the database schema version, which is used to determine when glob subscriptions need to be reprocessed.
Another option, which occupies a middle ground between using an enum and just storing the string representation of the class for every event, would be to instead have a table of event class strings along with some numeric identifier, so we could represent them more concisely on disk using the numeric identifier. That way, they could be inserted or removed by queries rather than by a schema update.
Though that flexibility seems appealing, I'm not actually sure if it's a good thing. On updates that add new event classes, we'd probably have the new Nexus version go and run some queries to add the new classes to the database. This is a bit ad-hoc, and it means that it's at least theoretically possible to change the set of event classes without changing the schema version, so it's much harder to reason about whether the glob subscriptions are correct based on what's presently in the database. With an enum, any time we add new classes, we must add a migration to do so, which ensures that the schema version accurately indicates what classes exist.
Finally, I'm not actually that convinced that being unable to easily remove event classes is that big of a problem. For enums that represent operational states (e.g. `instance_state` or similar), or that represent a policy or mode of behavior, it's actually very important to not be able to represent defunct variants: code that consumes this data needs to handle every possible state or policy or whatever, and it's very unfortunate to have to have match arms or other cases for stuff that's no longer used that just panics or logs an error. In this case, however, these are basically just strings that we do regex matching on, and the enum is mostly used to reduce the size of the database record (and to tie the set of strings to the schema version, as I mentioned). So, if we leave behind some event classes that we've stopped using, but they're still there in the enum...honestly, who cares? They don't actually cause that big of a problem just by being there, and we can filter them out of the "list event classes" API response so that users don't get the idea that they still exist. It's a bit ugly to leave behind stuff that's no longer used, but the consequences are less bad than for something like `VmmState` or `SledPolicy`...
Hopefully that all makes some amount of sense. I'm certainly not that attached to this approach, but that was the rationale I was operating under.
I think one area where I'd give caution is coupling "database schema" with "API schema".
Storing an `enum` in the database does keep the set of event classes tightly coupled with the schema. But don't we actually care about the event classes we've promised through the API?
E.g., suppose a customer calls the "GET list of all EventClass objects" API. They get some result as a response. Then we update Nexus, and the set changes immediately.
With this structure, we're giving the guarantee that "all variants of this enum" are valid EventClass targets, unless we explicitly filter them. I think this is roughly equivalent to what you said in:
> With an enum, any time we add new classes, we must add a migration to do so, which ensures that the schema version accurately indicates what classes exist.
Since there is some glob processing which must happen to process new subscriptions, this also means that our "duration of time to perform schema updates" is going to increase proportionally with the amount of webhooks usage.
Imagine an alternative world:
- Event classes are stored as their own table in the database. They contain `TEXT`, but also some fields indicating if they're "activated" or not (basically - have we turned them on/off). (A rough sketch of such a table follows this list.)
- Enabling/disabling event classes can now be the responsibility of a background task, rather than a schema upgrade. We can have a big list of "event classes we want to use".
- Do you have an event class in this list which doesn't have a record? Add it. Re-process all globs. Then mark it as visible.
- Do you have an event class in the database, which isn't in this list? You can mark it as deprecated, and remove all subscriptions directly to it. Once all subscriptions are gone, it can be fully deleted. (Alternatively, we could have an explicit list of "removed" event classes, if we want to be really cautious about re-use.)
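A rough sketch of what a row in such a table might look like (the field names here are hypothetical, not actual schema or code from this PR); a background task could then reconcile these rows against the in-code list of wanted classes:

```rust
/// Hypothetical row in an event-class table, sketching the "TEXT plus
/// activation state" idea described above.
struct EventClassRow {
    /// The class string itself, e.g. "instance.start".
    name: String,
    /// Set once globs have been reprocessed for this class and it can be
    /// shown in the external API.
    active: bool,
    /// Set when the class is removed from the wanted list; subscriptions are
    /// drained before the row is deleted outright.
    time_deprecated: Option<std::time::SystemTime>,
}
```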
This seems like it might be nicer-to-use than having such a tight coupling with the database:
- It lets developers add and remove event classes really easily. They're text identifiers, basically!
- It avoids hard-coding the test case event classes.
- It makes schema changes execute faster -- globs can be processed out-of-band, by the background task subsystem.
okay, I'm still reviewing the glob reprocessing code, perhaps a chunk of this is already the case? Certainly seems that the `webhook_dispatcher` is invoking this glob-reprocessing code. So we presumably already won't be blocking schema updates proportional to webhook usage.
Now that I've actually read through more of this (and really, why bother with that before commenting), I see that:
- The glob re-processing should guard against this, and the ordering of background tasks means that it'll basically be the first webhooks-related operation to happen after an update, even though it won't block schema changes themselves. This is great!
- Because of that, the choice of "enum" vs "TEXT" matters a lot less. There might be some developer-experience gains from going to a TEXT-based list instead of an enum-based list (we can reconcile a list of event classes, just like we can reconcile an enum, and I still think it might be nice to decouple this from the database), but it seems possible to punt this. Even if we did go to an enum-less event-class world, we'd probably end up doing re-processing of globs in a similar spot to this PR.
```rust
impl std::str::FromStr for WebhookEventClass {
    type Err = EventClassParseError;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        for &class in Self::ALL_CLASSES {
```
How many event classes do you think we're gonna end up with, if you had to place a bet, 2-3 years from now?
I don't think this is a problem now, but I expect this enum will grow quite large as it expands to include at least "all possible faults" and "all possible invocations of the interface", right?
Any operation (outside of tests) that requires iteration over all variants seems like it's worth a little scrutiny IMO
Yeah, this definitely won't scale to large numbers of event classes. As this list grows, I would definitely consider a parsing strategy that doesn't require iterating over every possible class --- since event classes form a tree-like hierarchy, we could eventually write a parser that parses them segment-by-segment: i.e., if you have the string "instance.start", you would see "instance" and know that the next segment has to be one of `["start", "stop", "create", "delete", ...]`, and only look for those tokens. I didn't do that here because we don't currently have that many event classes --- maybe there should be a comment here proposing we do that in the future.
With that said, I'm not really sure if there are going to be quite as many of these as you imagine: for FMA, I don't think every possible fault class emitted by illumos FMA is necessarily going in here, so I dunno if we'll see `"ereport.cpu.generic-x86.cache"` or similar. We might just have broader categories in the webhook event class, like "sled.fault", "sled.disk.fault", etc., and put more detailed fault class strings in the JSON payload. I do think it's worth thinking about the cardinality of this though!
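A rough sketch of what segment-by-segment parsing could look like (the class names and return type here are illustrative, not the actual `WebhookEventClass` implementation):

```rust
/// Illustrative only: parse an event class one dot-separated segment at a
/// time, so we never iterate over the full set of classes.
fn parse_event_class(s: &str) -> Option<&'static str> {
    let mut segments = s.split('.');
    let class = match segments.next()? {
        "instance" => match segments.next()? {
            "start" => "instance.start",
            "stop" => "instance.stop",
            _ => return None,
        },
        // Other top-level segments ("sled", "probe", ...) would be matched here.
        _ => return None,
    };
    // Reject trailing segments, e.g. "instance.start.bogus".
    if segments.next().is_none() { Some(class) } else { None }
}
```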
```rust
impl WebhookDeliveryAttempt {
    fn response_view(&self) -> Option<views::WebhookDeliveryResponse> {
        Some(views::WebhookDeliveryResponse {
            status: self.response_status? as u16, // i hate that this has to be signed in the database...
```
You might be interested in https://github.com/oxidecomputer/omicron/blob/main/nexus/db-model/src/unsigned.rs - we have used different types here to make unexpectedly-signed values a serialization error when we try reading them from the database.
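For illustration, here's the general shape of that pattern as a standalone newtype (this is not the actual type from `unsigned.rs`, just the idea of turning a negative database value into an explicit error rather than an `as u16` cast):

```rust
use std::convert::TryFrom;

/// Standalone sketch of the "unsigned wrapper" pattern: converting from the
/// signed database representation fails loudly instead of silently wrapping.
#[derive(Debug, Clone, Copy)]
pub struct U16FromDb(pub u16);

impl TryFrom<i32> for U16FromDb {
    type Error = String;

    fn try_from(value: i32) -> Result<Self, Self::Error> {
        u16::try_from(value)
            .map(U16FromDb)
            .map_err(|_| format!("expected an unsigned 16-bit value, got {value}"))
    }
}
```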
```rust
mod test {
    use super::*;

    #[test]
```
Could we test some of the error cases too?
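For instance, something along these lines (a hedged sketch; the invalid strings and error expectations here are assumptions about the `FromStr` impl above):

```rust
#[test]
fn test_event_class_parse_rejects_unknown_strings() {
    // Strings that don't correspond to any event class should fail to parse.
    assert!("".parse::<WebhookEventClass>().is_err());
    assert!("not.a.real.class".parse::<WebhookEventClass>().is_err());
}
```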
Co-authored-by: Sean Klein <sean@oxide.computer>
Co-authored-by: Sean Klein <sean@oxide.computer>
Co-authored-by: Sean Klein <sean@oxide.computer>
Conflicts: nexus/db-model/src/schema_versions.rs
```rust
    )),
)
.set((
    dsl::time_delivery_started.eq(now.nullable()),
```
(I think this is intentional, but to confirm) so the "time delivery started" column represents when the latest deliverator has started attempts to send a webhook delivery, not when the first deliverator started, right?
This seems only relevant in cases where lease timeouts have occurred
Yeah, that's correct. I think perhaps it ought to be renamed to something like `time_leased` to make its purpose clearer.
```rust
}

Err(Error::internal_error(
    "couldn't start delivery attempt for some secret third reason???",
```
Maybe we could update this to:
> Found an incomplete webhook delivery which has not been claimed by another Nexus, but which our Nexus cannot claim for an unknown reason
Is there any additional debug info about the event we'd want to include in this case? Maybe the state?
This could happen in any of the cases where the `id` exists, but the `filter` clauses don't match. So, as one example, this could happen when the state is `failed`, but `time_completed` is not set?
```rust
/// Send liveness probe to webhook receiver
///
/// This endpoint synchronously sends a liveness probe request to the
```
What's the motivation behind synchronously sending the liveness probe (e.g., blocking the caller here) instead of treating it like any other event, through the background task system?
The primary intended use of this endpoint is for external monitoring systems to determine if the webhook receiver endpoint is both alive and reachable by the control plane (see this section in RFD 538). So, when an external health-checking system attempts to send a probe to a webhook receiver, it is doing so because it would actually like to know whether the probe completes successfully or not, which we return in the response.
Alternatively, we could do something where this endpoint just enqueues a probe that the deliverator task will send "eventually", returning a delivery ID, and then the caller can poll the delivery-list endpoint to see if that delivery ID has succeeded. But, that seems substantially more complex from the perspective of a consumer of this API that just wants to ask Nexus "hey, are you currently able to get through to the receiver endpoint?". And, a lot of health checking systems may not even be capable of doing a stateful, multi-step process of "trigger probe, remember its UUID, and then check if that UUID made it through" --- usually, I feel like these systems are just configured with a URL to hit and some simple configurations for how to interpret responses from that URL.
I was mostly wondering, "from the caller perspective, why not just check the receiver?". Basically, "trigger probe from API", and "outside the API, check the receiver to see if the probe arrived".
If no probe arrives at the receiver, it's implied that there isn't connectivity, but I suppose there's a bit of asynchrony / implicit assumptions on this pathway. I see the justification for making this endpoint synchronous, and delivering an explicit response.
So, with that - the synchronous ordering of this endpoint makes sense. I do have the minor fear that "if this code-path is disjoint from regular webhook event notifications, is it possible that the probes can get sent successfully, when some other aspect of webhook event notifications is broken?"
If you're not concerned about that, definitely fine to keep this as-is, but wanted to raise that flag -- having this dual pathway makes it plausible that "probes can be dispatched" XOR "real events can be dispatched" would be true, which would be sad.
I agree that this doesn't exercise the entire pathway from event creation to webhook delivery, but I think I feel okay with that, as it's really intended for monitoring the health of the receiver, not the control plane. I have tried to make the code that this executes as similar as possible to what runs in the deliverator task --- since it tries to create `webhook_delivery` and `webhook_delivery_attempt` records, it will fail if there's something preventing Nexus from touching those tables. And if the background tasks themselves are totally wedged, we do have other mechanisms for knowing about that.
```rust
use std::sync::Arc;
use tokio::task::JoinSet;

// The Deliverator belongs to an elite order, a hallowed sub-category. He's got
```
So, I'm cool if we keep this comment in, but can we also add a doc comment describing what this does? I read this file before `nexus/src/app/webhook.rs`, where this is actually described, and a snippet like:

```rust
//! The `webhook_deliverator` task reads these delivery records and sends
//! HTTP requests to the receiver endpoint for each delivery that is
//! currently in flight. The deliverator is responsible for recording the
//! status of each *delivery attempt*. Retries and retry backoff are
//! the responsibility of the deliverator.
```
woulda been nice to have here
```rust
) -> Result<(), Error> {
    const MAX_ATTEMPTS: u8 = 3;
    let conn = self.pool_connection_authorized(opctx).await?;
    diesel::insert_into(attempt_dsl::webhook_delivery_attempt)
```
I think this will result in us re-sending the event (which is fine), but I wanna make sure it doesn't leave the database in an otherwise inconsistent state.
```rust
if current_version != SCHEMA_VERSION {
    return Err(Error::InternalError {
        internal_message: format!(
            "cannot reprocess webhook globs, as our schema version \
             ({SCHEMA_VERSION}) does not match the current version \
             ({current_version})",
        ),
    });
}
```
I think this is defensive code, but to be clear, I don't think this is possible in the codebase today.
Creating a Datastore involves ensuring the schema is up-to-date:
omicron/nexus/db-queries/src/db/datastore/mod.rs
Lines 232 to 256 in 118a2da
```rust
// Keep looping until we find that the schema matches our expectation.
retry_notify(
    retry_policy_internal_service(),
    || async {
        if let Some(try_for) = try_for {
            if std::time::Instant::now() > start + try_for {
                return Err(BackoffError::permanent(()));
            }
        }

        match datastore
            .ensure_schema(&log, EXPECTED_VERSION, config)
            .await
        {
            Ok(()) => return Ok(()),
            Err(e) => {
                warn!(log, "Failed to ensure schema version"; "error" => #%e);
            }
        };
        return Err(BackoffError::transient(()));
    },
    |_, _| {},
)
.await
.map_err(|_| "Failed to read valid DB schema".to_string())?;
```
This was defensive code intended to protect against weird behavior mid-update.
While a `DataStore` can't be created unless the schema is up to date, a `Nexus` can start at a given schema version, and then --- depending on how online updates of Nexus are actually performed once we get to that --- newer Nexii could be started with a later schema version. This was intended to protect against a situation where older Nexii remain active mid-upgrade, in order to prevent outdated Nexii from seeing a newer schema version and downgrading globs that had already been upgraded.
This may not be necessary, especially as we don't currently do online upgrades of Nexus, and I'm not sure how this will work once we start doing it (perhaps we won't update the schema until all outdated Nexus processes have stopped?). But, I figured it was worth including now just in case. I can remove it if you think it's not worth having.
Can we at a minimum add a comment identifying that this is a bit speculatively defensive?
I want to make sure that if we need to refactor this in the future, we understand that this code is not currently load-bearing, while we do this schema-change-on-reboot.
Sure, will do --- I am also fine with just removing it entirely; it just felt like something that was worth checking.
```rust
// TODO(eliza): validate endpoint URI; reject underlay network IPs for
// SSRF prevention...
```
Ah, this is a great point. Should this be blocking merging this PR?
(Also, can we stop this from using the Nexus external API as a receiver destination?)
Yes, my intent is not to merge this PR until this bit is complete. The reason it's not done yet is that we've determined the best way to do it is to use the socket option `IP_BOUND_IF`, which allows us to only bind a socket on specific interfaces (in this case, the `opte` interface), which is the only way to reliably prevent this --- rejecting underlay-like DNS names isn't sufficient as the DNS server might be malicious. However, using this socket option with `reqwest` requires upstream changes in various libraries, which I've done several PRs for but which haven't all been released, so we're waiting on this.
I left a comment discussing this and pointing at the upstream changes we're waiting on, but...GitHub has wisely decided it's not interesting and hid it by default 🙃: #7277 (comment)
```rust
//! Re-delivery of an event can be requested either via the event resend API
//! endpoint, or by a *liveness probe* succeeding. Liveness probes are
//! synthetic delivery requests sent to a webhook receiver to check whether it's
//! actually able to receive an event. They are triggered via the
//! [`Nexus::webhook_receiver_probe`] API endpoint. A probe may optionally
//! request that any events for which all past deliveries have failed be resent
//! if it succeeds. Delivery records are also created to represent the outcome
//! of a probe.
```
Since liveness probes appear to be bucketed in the "Event class" enum as the string `probe`, what happens if someone tries to make a receiver for the probe event directly?
Do we (/should we) have an integration test for this?
Good call, I'll add a test for that!
```sql
);

-- Singleton probe event
INSERT INTO omicron.public.webhook_event (
```
Our schema-change tests (like, the ones to validate that new deployments match old deployments) will definitely help you (and hopefully did help you??) with ensuring that data-definition language statements (new indices, tables, etc) match between old/new formats.
However, they won't actually help verify that data manipulation language changes (e.g., "insert into this table") are consistent between deployments.
Since you're adding some of this data as part of the schema change, I recommend checking out `nexus/tests/integration_tests/schema.rs`'s `validate_data_migration` test.
You should be able to query for this row from the `webhook_event` table, and confirm it's there. The `after_125_0_0` migration is doing something very similar.
This is intended to prevent conflicts with the SQL `trigger` keyword (as suggested by @smklein [here]). [here]: #7277 (comment)
This is intended to represent specifically the time the current lease was acquired, not when "delivery started" broadly (that's the "time_created" field). This naming should be a bit clearer --- see @smklein's comment: #7277 (comment)
Co-authored-by: Sean Klein <sean@oxide.computer>
as @smklein suggested in #7277 (comment)
This branch adds an MVP implementation of the internal machinery for delivering webhooks from Nexus. This includes:
delivery status
The user-facing interface for webhooks is described in greater detail in RFD 538. The code change in this branch includes a "Big Theory Statement" comment that describes most of the implementation details, so reviewers are encouraged to refer to that for more information on the implementation.
Future Work
Immediate follow-up work (i.e. stuff I'd like to do shortly but would prefer to land in separate PRs):
- `webhook_delivery`, `webhook_delivery_attempt`, and `webhook_event` CRDB tables (need to figure out a good retention policy for events)
- `omdb db webhooks` commands for actually looking at the webhook database tables

Not currently planned, but possible future work:

- `/v1/webhooks/event-classes` endpoints, or both)
- `fleet.viewer` (see RFD 538 Appendix B.3); probably requires service accounts