Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Audit log prototype #7339

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from
Draft

Audit log prototype #7339

wants to merge 5 commits into from

Conversation

david-crespo
Copy link
Contributor

@david-crespo david-crespo commented Jan 14, 2025

Initial implementation of RFD 523. Lots of open questions still. I will fill this description in as I go. I tried to concentrate the questions about what to store in models/audit-log.

The biggest high-level design decision here is that this log will go into CockroachDB and not, for example, Clickhouse because we need an immediate guarantee at write time that the audit log write happened before we proceed with the API operation. We will need to think about how fast the log will grow and how long we want to retain it (discussed in the RFD, but might need more detail) and decide if we need to do something like periodically dump the log out of CockroachDB into a format that takes up less space.

The retrieval API is not finalized. So far I have a required start time (inclusive) and an optional end time (exclusive) in addition to the usual pagination params of page size and cursor token. I think I have a good start on how to set up the indexes so we can get the log lines in timestamp order with good performance and no table scans (see Cockroach docs on hash-sharded indexes).

Tasks

  • Basic log entry init function manually called in one endpoint
  • Retrieve log with a start time (and optional end time)
  • Sort response by (timestamp, id) to ensure stable sort for identical timestamps
  • Log auth type (session vs. API token)
  • Change retrieval logic to filter for completed entries and sort by time_completed
  • Log HTTP status code from responses
  • Log responses
  • More structure on view struct (not flat with a bunch of Option like the DB model)
  • Figure out auto-completing old entries in some kind of job
  • Figure out how to call on every endpoint (instrument_dropshot_handler is a model, or maybe we can use it directly)
    • Exclude all GETs to start with? Those responses would be 100x the size of everything else.

Comment on lines +44 to +66
// TODO: RFD 523 says: "Additionally, the response (or error) data should be
// included in the same log entry as the original request data. Separating
// the response from the request into two different log entries is extremely
// expensive for customers to identify which requests correspond to which
// responses." I guess the typical thing is to include a duration of the
// request rather than a second timestamp.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was me making a case for including something like the pub success_response: Option<Value> you have in a log entry. I can clarify that in the RFD.

@david-crespo david-crespo force-pushed the crespo/audit-log branch 5 times, most recently from 9258b89 to 1c4e5bf Compare January 15, 2025 15:57
let project =
nexus.project_create(&opctx, &new_project.into_inner()).await?;

let _ = nexus.audit_log_entry_complete(&opctx).await?;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I started with project create because it's easy to work with in tests, but I know it's not in the short list of things we want to start with. We might end up simply logging every endpoint.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, we'll want (at least eventually) to include all (at least all authenticated) API methods. I think if we want to just have a subset of the methods available then we should prioritize those that make changes (vs GET operations), but with the intention of getting coverage of the API.

Related note, while not a requirement for this initial version, I spoke to @sunshowers about strategies for how we might be able to enforce new methods must implement the audit log. It's a place I think we'd like to get to.

Copy link
Contributor Author

@david-crespo david-crespo Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's related to dropshot lacking middleware — notice we manually call this instrument_dropshot_handler thing in every endpoint. I wonder if we could build that in elsewhere, make it automatic, and add the audit log call to it.

Copy link

@inickles inickles left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial thoughts on the fields in AuditLogEntry.

Comment on lines 19 to 20
// TODO: this isn't in the RFD but it seems nice to have
pub request_uri: String,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this looks like it might be the closest thing that we'd have to something like a rack and/or fleet ID, which is something I think we'd want - something for customer to be able to filter which audit logs came from which rack / fleet.

This may suffice for now, but maybe just until we get multi-rack implemented?

Comment on lines 55 to 57
// Fields that are optional because they get filled in after the action completes
/// Time in milliseconds between receiving request and responding
pub duration: Option<TimeDelta>,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While fine to include, I don't think this is required, in case that makes it easier. I'm not following the earlier note about this relates to including the response in the audit log entry.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just meant the response and the duration are both things we only know at the end of the operation.

Comment on lines +62 to +82
// TODO: including a real response complicates things
// Response data on success (if applicable)
// pub success_response: Option<Value>,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this indeed complicates things, it is critical IMO. For example, if someone were to create a new instance this audit log should say what that new instance ID is as a result.

Comment on lines +59 to +79
// Error information if the action failed
pub error_code: Option<String>,
pub error_message: Option<String>,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered using a enum for response elements instead of optionals for errors and successful responses?

Something like:

#[derive(Serialize, Deserialize)]
#[serde(untagged)]
enum Response {
    Success { success: Value },
    Error { error_code: String, error_message: String }
}

Not sure I have a preference over one way over the other in terms of output structure, though I kind of like the enum from a code perspective to help ensure you either get one or the other.

Copy link
Contributor Author

@david-crespo david-crespo Jan 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes — most likely what I'll do is have views::AuditLogEntry look nice and structured but keep the model struct flat like the DB entry, and then the conversion from model to view is what mediates between those two formats.


#[derive(Queryable, Insertable, Selectable, Clone, Debug)]
#[diesel(table_name = audit_log)]
pub struct AuditLogEntry {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking it might make more sense to put operation-specific things like resource_type, resource_id and maybe action into a something like a request_elements: Value, where the operation can decide what makes to include.


#[derive(Queryable, Insertable, Selectable, Clone, Debug)]
#[diesel(table_name = audit_log)]
pub struct AuditLogEntry {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like for us to include a version format, where we stick to major/minor semver, and include a event_version in this struct. I'm not sure how we'd want to manage that, and for all I know it might be a little more difficult for fields with Value type (request and response bits), but I think it's important for us to not silently break user parsers.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we could use the release version, but I see you mean the abstract shape of the log entry, and we'd want the version to stay the same across releases when applicable to indicate that log parsing logic does not have to change. So we should probably include both a log format version and the release version. Semver might be overkill — maybe we can get away with integers and not worry about distinguishing between breaking, semi-breaking, and non-breaking changes.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking we could use the release version, but I see you mean the abstract shape of the log entry, and we'd want the version to stay the same across releases when applicable to indicate that log parsing logic does not have to change. So we should probably include both a log format version and the release version. Semver might be overkill — maybe we can get away with integers and not worry about distinguishing between breaking, semi-breaking, and non-breaking changes.

The patch number of SemVer might be overkill, but I think following similar rules for Major and Minor versions to differentiate between changes that'd break parsers vs those that shouldn't (e.g. new fields added) could still fit into SemVer rules and be a natural means indicating when parser logic has to change.

david-crespo added a commit that referenced this pull request Jan 22, 2025
Pulling these refactors out of #7339 because they're mechanical and just
add noise. The point is to make it a cleaner diff when we add the
function calls or wrapper code that creates audit log entries, as well
as to clean up the `device_auth` (eliminated) and `console_api`
(shrunken substantially) files, which have always been a little out of
place.

### Refactors

With the change to a trait-based Dropshot API, the already weird
`console_api` and `device_auth` modules became even weirder, because the
actual endpoint definitions were moved out of those files and into
`http_entrypoints.rs`, but they still called functions that lived in the
other files. These functions were redundant and had signatures more or
less identical to the endpoint handlers. That's the main reason we lose
90 lines here.

Before we had

```
http_entrypoints.rs -> console_api/device_auth -> nexus/src/app functions
```

Now we (mostly) cut out the middleman:

```
http_entrypoints.rs -> nexus/src/app functions
```

Some of what was in the middle moved up into the endpoint handlers, some
moved "down" into the nexus "service layer" functions.

### One (1) functional change

The one functional change is that the console endpoints are all
instrumented now.
@david-crespo david-crespo force-pushed the crespo/audit-log branch 2 times, most recently from 287045d to e1951ba Compare January 28, 2025 19:01
@david-crespo david-crespo added this to the 13 milestone Jan 31, 2025
@david-crespo david-crespo self-assigned this Jan 31, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants