
Audit log prototype #7339


Draft · david-crespo wants to merge 2 commits into main from crespo/audit-log

Conversation

@david-crespo (Contributor) commented Jan 14, 2025

Initial implementation of RFD 523. Lots of open questions still. I will fill this description in as I go. I tried to concentrate the questions about what to store in models/audit-log.

The biggest high-level design decision here is that this log will go into CockroachDB and not, for example, ClickHouse, because we need an immediate guarantee at write time that the audit log write happened before we proceed with the API operation. We will need to think about how fast the log will grow and how long we want to retain it (discussed in the RFD, but it might need more detail) and decide whether we need to do something like periodically dump the log out of CockroachDB into a format that takes up less space.

The retrieval API is not finalized. So far I have a required start time (inclusive) and an optional end time (exclusive) in addition to the usual pagination params of page size and cursor token. I think I have a good start on how to set up the indexes so we can get the log lines in timestamp order with good performance and no table scans (see Cockroach docs on hash-sharded indexes).
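
As a rough sketch of that retrieval shape (names and types here are illustrative, not the final API), the list parameters might look something like this:

```rust
use chrono::{DateTime, Utc};
use serde::Deserialize;

/// Illustrative query parameters for listing audit log entries
/// (assumes chrono's "serde" feature; the page-size and cursor
/// params are omitted since they're the standard pagination ones).
#[derive(Deserialize)]
pub struct AuditLogListParams {
    /// Inclusive lower bound on entry time (required).
    pub start_time: DateTime<Utc>,
    /// Exclusive upper bound on entry time (optional).
    pub end_time: Option<DateTime<Utc>>,
}
```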

Tasks

  • Basic log entry init function manually called in one endpoint
  • Retrieve log with a start time (and optional end time)
  • Sort response by (timestamp, id) to ensure a stable sort for identical timestamps (see the sketch after this list)
  • Log auth type (session vs. API token)
  • Change retrieval logic to filter for completed entries and sort by time_completed
  • Log HTTP status code from responses
  • Log user agent from request
  • Log actor external_id for SiloUser variant
    • Tried adding external_id to authn Actor::SiloUser since that's where we pull user ID and silo ID from, but it was miserable. Adding a string external_id field means Actor can no longer derive Copy, so we end up needing a ton of clones everywhere.
  • Log responses
    • Feels like a shame to serialize the response to JSON twice (once at logging time and once again when Dropshot processes the handler result into a response)
  • More structure on view struct (not flat with a bunch of Option like the DB model)
  • Figure out auto-completing old entries in some kind of job
  • Figure out how to call this on every endpoint (`instrument_dropshot_handler` is a model, or maybe we can use it directly)
    • Exclude all GETs to start with? Those responses would be 100x the size of everything else.
  • Allow filtering by silo on system-level endpoint and/or add silo-scoped endpoint for use by silo admins
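
A minimal sketch of the (timestamp, id) sort mentioned in the task list above, with an illustrative entry shape rather than the real model; the tuple key is what keeps the order deterministic when two entries share a timestamp:

```rust
use chrono::{DateTime, Utc};
use uuid::Uuid;

// Illustrative entry shape; the real model has many more fields.
struct Entry {
    id: Uuid,
    time_started: DateTime<Utc>,
}

/// Sorting by (timestamp, id) gives ties on timestamp a deterministic order,
/// which is what makes the pagination cursor stable.
fn sort_entries(entries: &mut [Entry]) {
    entries.sort_by_key(|e| (e.time_started, e.id));
}
```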

Comment on lines 44 to 69
```rust
// TODO: RFD 523 says: "Additionally, the response (or error) data should be
// included in the same log entry as the original request data. Separating
// the response from the request into two different log entries is extremely
// expensive for customers to identify which requests correspond to which
// responses." I guess the typical thing is to include a duration of the
// request rather than a second timestamp.
```
Contributor

This was me making a case for including something like the `pub success_response: Option<Value>` you have in a log entry. I can clarify that in the RFD.

@david-crespo david-crespo force-pushed the crespo/audit-log branch 5 times, most recently from 9258b89 to 1c4e5bf Compare January 15, 2025 15:57
```rust
let project =
    nexus.project_create(&opctx, &new_project.into_inner()).await?;

let _ = nexus.audit_log_entry_complete(&opctx).await?;
```
Contributor Author

I started with project create because it's easy to work with in tests, but I know it's not in the short list of things we want to start with. We might end up simply logging every endpoint.

Contributor

Yeah, we'll want (at least eventually) to include all API methods, or at least all authenticated ones. If we only make a subset of the methods available at first, I think we should prioritize those that make changes (vs. GET operations), but with the intention of eventually covering the whole API.

On a related note, while not a requirement for this initial version, I spoke to @sunshowers about strategies for how we might be able to enforce that new methods implement the audit log. It's a place I think we'd like to get to.

@david-crespo (Contributor Author) commented Jan 15, 2025

It's related to dropshot lacking middleware — notice we manually call this instrument_dropshot_handler thing in every endpoint. I wonder if we could build that in elsewhere, make it automatic, and add the audit log call to it.
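
A sketch of what that wrapping could look like, purely to illustrate the idea; the init/complete functions and types here are stand-ins, not the real Nexus or Dropshot API:

```rust
use std::future::Future;

// Stand-in types; the real code would use Nexus's own context and error types.
struct AuditEntry;
#[derive(Debug)]
struct ApiError;

async fn audit_log_entry_init() -> Result<AuditEntry, ApiError> {
    Ok(AuditEntry)
}

async fn audit_log_entry_complete(_entry: AuditEntry, _succeeded: bool) -> Result<(), ApiError> {
    Ok(())
}

/// Open an audit log entry, run the handler body, then complete the entry,
/// so individual endpoints don't have to remember to do both halves themselves.
async fn with_audit_log<F, T>(op: F) -> Result<T, ApiError>
where
    F: Future<Output = Result<T, ApiError>>,
{
    let entry = audit_log_entry_init().await?;
    let result = op.await;
    audit_log_entry_complete(entry, result.is_ok()).await?;
    result
}
```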

@inickles (Contributor) left a comment

Some initial thoughts on the fields in AuditLogEntry.

Comment on lines 19 to 20
```rust
// TODO: this isn't in the RFD but it seems nice to have
pub request_uri: String,
```
Contributor

Yeah, this looks like it might be the closest thing we'd have to something like a rack and/or fleet ID, which is something I think we'd want: a way for customers to filter which audit logs came from which rack / fleet.

This may suffice for now, but maybe just until we get multi-rack implemented?

Comment on lines 55 to 57
```rust
// Fields that are optional because they get filled in after the action completes
/// Time in milliseconds between receiving request and responding
pub duration: Option<TimeDelta>,
```
Contributor

While fine to include, I don't think this is required, in case that makes it easier. I'm not following how the earlier note about this relates to including the response in the audit log entry.

Contributor Author

I just meant the response and the duration are both things we only know at the end of the operation.

Comment on lines +62 to +86
```rust
// TODO: including a real response complicates things
// Response data on success (if applicable)
// pub success_response: Option<Value>,
```
Contributor

While this indeed complicates things, it is critical IMO. For example, if someone were to create a new instance, the audit log entry should say what the resulting instance ID is.
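
For example (a hypothetical payload just to illustrate the point; the values are made up), the entry for an instance-create call might carry something like:

```rust
// Hypothetical success payload for an instance-create entry; values are made up.
fn example_success_response() -> serde_json::Value {
    serde_json::json!({
        "id": "a9e4c3d0-1b2c-4d5e-8f00-0000000000aa", // ID of the created instance
        "name": "web-01"
    })
}
```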

Comment on lines +59 to +83
```rust
// Error information if the action failed
pub error_code: Option<String>,
pub error_message: Option<String>,
```
Contributor

Have you considered using an enum for the response elements instead of optionals for errors and successful responses?

Something like:

```rust
use serde::{Deserialize, Serialize};
use serde_json::Value;

#[derive(Serialize, Deserialize)]
#[serde(untagged)]
enum Response {
    Success { success: Value },
    Error { error_code: String, error_message: String },
}
```

Not sure I have a preference for one way over the other in terms of output structure, though I kind of like the enum from a code perspective to help ensure you get either one or the other.

@david-crespo (Contributor Author) commented Jan 15, 2025

Yes — most likely what I'll do is have views::AuditLogEntry look nice and structured but keep the model struct flat like the DB entry, and then the conversion from model to view is what mediates between those two formats.
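
A minimal sketch of that mediation, with illustrative local types rather than the actual model and view structs in the PR:

```rust
use serde::Serialize;

// Flat shape mirroring the DB row (illustrative subset of fields).
struct AuditLogEntryModel {
    http_status_code: Option<i32>,
    error_code: Option<String>,
    error_message: Option<String>,
}

// Structured shape for the API view: exactly one of success or error.
#[derive(Serialize)]
#[serde(untagged)]
enum AuditLogResult {
    Success { http_status_code: i32 },
    Error { error_code: Option<String>, error_message: String },
}

// The model-to-view conversion is where the flat columns become structured.
fn result_view(model: &AuditLogEntryModel) -> Option<AuditLogResult> {
    match (&model.error_message, model.http_status_code) {
        (Some(message), _) => Some(AuditLogResult::Error {
            error_code: model.error_code.clone(),
            error_message: message.clone(),
        }),
        (None, Some(status)) => Some(AuditLogResult::Success { http_status_code: status }),
        // Entry not completed yet: nothing to report.
        (None, None) => None,
    }
}
```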


```rust
#[derive(Queryable, Insertable, Selectable, Clone, Debug)]
#[diesel(table_name = audit_log)]
pub struct AuditLogEntry {
```
Contributor

I'm thinking it might make more sense to put operation-specific things like `resource_type`, `resource_id`, and maybe `action` into something like a `request_elements: Value`, where the operation can decide what makes sense to include.
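
As a rough illustration of that idea (the shape here is made up), an operation might record something like:

```rust
// Hypothetical operation-specific request elements: each endpoint decides what
// is worth recording rather than the table having one column per field.
fn example_request_elements() -> serde_json::Value {
    serde_json::json!({
        "resource_type": "project",
        "action": "create",
        "resource_name": "my-project"
    })
}
```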


```rust
#[derive(Queryable, Insertable, Selectable, Clone, Debug)]
#[diesel(table_name = audit_log)]
pub struct AuditLogEntry {
```
Contributor

I'd like for us to include a versioned format, where we stick to major/minor semver, and include an `event_version` field in this struct. I'm not sure how we'd want to manage that, and for all I know it might be a little more difficult for fields with a `Value` type (the request and response bits), but I think it's important for us to not silently break user parsers.

Contributor Author

I was thinking we could use the release version, but I see you mean the abstract shape of the log entry, and we'd want the version to stay the same across releases when applicable to indicate that log parsing logic does not have to change. So we should probably include both a log format version and the release version. Semver might be overkill — maybe we can get away with integers and not worry about distinguishing between breaking, semi-breaking, and non-breaking changes.

Contributor

> I was thinking we could use the release version, but I see you mean the abstract shape of the log entry, and we'd want the version to stay the same across releases when applicable to indicate that log parsing logic does not have to change. So we should probably include both a log format version and the release version. Semver might be overkill — maybe we can get away with integers and not worry about distinguishing between breaking, semi-breaking, and non-breaking changes.

The patch number of SemVer might be overkill, but I think following similar rules for major and minor versions to differentiate between changes that would break parsers and those that shouldn't (e.g. new fields added) could still fit into SemVer rules and be a natural means of indicating when parser logic has to change.
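
A tiny sketch of carrying both versions on each entry, per the discussion above; the field names are hypothetical, not what the PR implements:

```rust
use serde::Serialize;

/// Illustrative header fields for a versioned audit log entry.
#[derive(Serialize)]
struct VersionedEntryHeader {
    /// Format version: a major bump breaks parsers, a minor bump only adds fields.
    event_version: String, // e.g. "1.1"
    /// The software release that produced the entry.
    release_version: String,
}
```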

david-crespo added a commit that referenced this pull request Jan 22, 2025
Pulling these refactors out of #7339 because they're mechanical and just
add noise. The point is to make it a cleaner diff when we add the
function calls or wrapper code that creates audit log entries, as well
as to clean up the `device_auth` (eliminated) and `console_api`
(shrunken substantially) files, which have always been a little out of
place.

### Refactors

With the change to a trait-based Dropshot API, the already weird
`console_api` and `device_auth` modules became even weirder, because the
actual endpoint definitions were moved out of those files and into
`http_entrypoints.rs`, but they still called functions that lived in the
other files. These functions were redundant and had signatures more or
less identical to the endpoint handlers. That's the main reason we lose
90 lines here.

Before we had

```
http_entrypoints.rs -> console_api/device_auth -> nexus/src/app functions
```

Now we (mostly) cut out the middleman:

```
http_entrypoints.rs -> nexus/src/app functions
```

Some of what was in the middle moved up into the endpoint handlers, some
moved "down" into the nexus "service layer" functions.

### One (1) functional change

The one functional change is that the console endpoints are all
instrumented now.
@david-crespo david-crespo force-pushed the crespo/audit-log branch 4 times, most recently from 8dae6b3 to 9d70d86 Compare January 30, 2025 21:56
@david-crespo david-crespo added this to the 13 milestone Jan 31, 2025
@david-crespo david-crespo self-assigned this Jan 31, 2025
@morlandi7 morlandi7 modified the milestones: 13, 14 Feb 11, 2025
@david-crespo david-crespo force-pushed the crespo/audit-log branch 5 times, most recently from f9d36c0 to f95cb8a Compare March 6, 2025 19:48
@david-crespo david-crespo changed the base branch from main to pag-time-id March 20, 2025 15:33
@david-crespo david-crespo force-pushed the pag-time-id branch 2 times, most recently from 3df422c to 8481a67 Compare March 20, 2025 16:36
david-crespo added a commit that referenced this pull request Mar 20, 2025
Extracted from #7339 for use in #7277. This PR does not use the pagination helper in any endpoints. There are proper integration tests like `test_audit_log_list` in #7339 demonstrating that the ordering and cursor work as expected.
Base automatically changed from pag-time-id to main March 20, 2025 18:03
david-crespo added a commit that referenced this pull request Mar 21, 2025
There should be no functional changes here, though the internal error messages are now slightly different between SAML login and local login, where before they were the same. I ran into this while working on #7339. This logic (which I wrote originally and shuffled around in #7374) never really made sense, and I figured it's good prep for #7818 and friends to clean it up.

The core of the change here is updating two existing functions that returned `Result<Option<User>, Error>` to return `Result<User, Error>` instead, moving the logic about what to do when the user was `None` inside each function. In both cases, when the user was `None` we ended up with an `Error::Unauthenticated` anyway, so we can just do that a moment earlier and eliminate a lot of misdirection.
@askfongjojo askfongjojo modified the milestones: 14, 15 May 1, 2025
@morlandi7 morlandi7 modified the milestones: 15, 16 May 20, 2025
@benjaminleonard (Contributor)

Are there plans to squeeze in an operation filter (one that takes an array of `operation_id`)? Not sure how much work it would be, but it'd massively increase the utility of this.

@david-crespo (Contributor Author)

I will track it as a follow-up, but so far I have assumed customers would primarily consume the log by hitting this endpoint on an interval, putting the entries somewhere else, and doing the search and filtering there.

@david-crespo david-crespo force-pushed the crespo/audit-log branch 2 times, most recently from ef64ac3 to b610bb2 Compare July 3, 2025 21:54
@david-crespo david-crespo force-pushed the crespo/audit-log branch 3 times, most recently from 037ac9b to 9484b3b Compare July 10, 2025 22:46