server: allow users to run conformance reports for their schemas #100004

Open
arulajmani opened this issue Mar 30, 2023 · 4 comments
Labels
A-cluster-observability Related to cluster observability A-multitenancy Related to multi-tenancy C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability

Comments

@arulajmani
Collaborator

arulajmani commented Mar 30, 2023

Is your feature request related to a problem? Please describe.

Background

Users can configure various properties, such as quorum size, number of non-voters, data placement etc. on schema objects. They can either do so directly, via zone configurations, or indirectly using multi-region abstractions (which are internally translated to zone configurations). However, the effects of such changes are asynchronous.

The zone configurations table lives in a tenant's keyspace. At a high level, once a zone configuration is committed, it must be converted to a SpanConfig and reconciled to KV (where it lives in the system.span_configurations table). Doing so entails hydrating the zone configuration (by walking up its inheritance chain) to convert it to a SpanConfig. This SpanConfig is then linked with the keyspan associated with the schema object, and persisted in KV using an RPC.
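To make the hydration step concrete, here is a minimal sketch in Go. The `ZoneConfig` and `SpanConfig` types, their fields, and the `Hydrate` function are simplified, hypothetical stand-ins invented for illustration; they are not the actual CockroachDB types or translation code. The sketch assumes an unset field is inherited from the nearest ancestor that sets it.

```go
package main

import "fmt"

// ZoneConfig is a hypothetical, simplified zone configuration.
// A nil field means "unset here, inherit from the parent".
type ZoneConfig struct {
	NumReplicas *int32
	NumVoters   *int32
	Constraints []string
	Parent      *ZoneConfig
}

// SpanConfig is a hypothetical, simplified fully hydrated configuration.
type SpanConfig struct {
	NumReplicas int32
	NumVoters   int32
	Constraints []string
}

// Hydrate walks up the inheritance chain and fills each field from the
// nearest zone configuration (starting at z) that sets it.
func Hydrate(z *ZoneConfig) SpanConfig {
	var out SpanConfig
	var haveReplicas, haveVoters, haveConstraints bool
	for cur := z; cur != nil; cur = cur.Parent {
		if !haveReplicas && cur.NumReplicas != nil {
			out.NumReplicas, haveReplicas = *cur.NumReplicas, true
		}
		if !haveVoters && cur.NumVoters != nil {
			out.NumVoters, haveVoters = *cur.NumVoters, true
		}
		if !haveConstraints && cur.Constraints != nil {
			out.Constraints, haveConstraints = cur.Constraints, true
		}
	}
	return out
}

// i32 is a small helper for building optional int32 fields.
func i32(v int32) *int32 { return &v }

func main() {
	// default zone -> database zone -> table zone
	def := &ZoneConfig{NumReplicas: i32(3), NumVoters: i32(3)}
	db := &ZoneConfig{NumReplicas: i32(5), Parent: def}
	table := &ZoneConfig{Constraints: []string{"+region=us-east1"}, Parent: db}
	fmt.Printf("%+v\n", Hydrate(table))
}
```

In this sketch the table inherits its replica count from the database zone and its voter count from the default zone, while keeping its own constraints.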

All KV nodes maintain an in-memory, incremental view over system.span_configurations. Once a KV node receives a SpanConfig update, ranges that overlap with the update's keyspans are pushed through various queues (e.g. Split, Merge, Replicate). It's these queues that are responsible for taking action to fulfill the user's intention.

SpanConfigBounds

Until very recently, the async application of user-specified configurations was only a matter of time. This changed with the introduction of SpanConfigBounds. SpanConfigBounds were motivated by a desire to deny secondary tenants unfettered access to multi-region features (or zone configurations) in deployments where operators desire such control (read: serverless).

SpanConfigBounds let operators declare bounds on (almost) all SpanConfig fields at a per-tenant level. These only work for secondary tenants. Operators can use SpanConfigBounds to override tenant-reconciled span configurations by "clamping" any or all fields. For example, operators are able to do things like constrain a tenant to specific region(s) regardless of what the tenant requested. They can also do so retroactively, after a tenant has successfully committed and reconciled such configurations.
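As a rough illustration of what "clamping" means, here is a small Go sketch. The `Bounds` and `SpanConfig` types, their fields, and the fallback to the first allowed region are all assumptions made for this example; the real SpanConfigBounds cover more fields and may behave differently.

```go
package main

import "fmt"

// SpanConfig is a hypothetical, simplified span configuration.
type SpanConfig struct {
	NumReplicas int32
	Regions     []string
}

// Bounds is a hypothetical per-tenant bound on SpanConfig fields.
type Bounds struct {
	MinReplicas, MaxReplicas int32
	AllowedRegions           []string
}

// Clamp returns a copy of c with every out-of-bounds field forced back
// into bounds, plus a flag reporting whether anything had to change.
func (b Bounds) Clamp(c SpanConfig) (SpanConfig, bool) {
	out := SpanConfig{NumReplicas: c.NumReplicas}
	changed := false
	if out.NumReplicas < b.MinReplicas {
		out.NumReplicas, changed = b.MinReplicas, true
	}
	if out.NumReplicas > b.MaxReplicas {
		out.NumReplicas, changed = b.MaxReplicas, true
	}
	allowed := make(map[string]bool, len(b.AllowedRegions))
	for _, r := range b.AllowedRegions {
		allowed[r] = true
	}
	// Keep only the requested regions that the operator allows.
	for _, r := range c.Regions {
		if allowed[r] {
			out.Regions = append(out.Regions, r)
		} else {
			changed = true
		}
	}
	// Assumption: if none of the requested regions is allowed, fall back
	// to the first allowed region.
	if len(c.Regions) > 0 && len(out.Regions) == 0 && len(b.AllowedRegions) > 0 {
		out.Regions = []string{b.AllowedRegions[0]}
	}
	return out, changed
}

func main() {
	b := Bounds{MinReplicas: 3, MaxReplicas: 5, AllowedRegions: []string{"us-east1"}}
	requested := SpanConfig{NumReplicas: 7, Regions: []string{"eu-west1", "us-east1"}}
	clamped, changed := b.Clamp(requested)
	fmt.Printf("clamped=%+v changed=%t\n", clamped, changed)
}
```

A configuration for which `Clamp` reports a change is one the tenant's data can never conform to as requested, no matter how long the tenant waits.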

Describe the solution you'd like

Arguably, users care more about when their data is in conformance, as opposed to a promise that it eventually will be. With the introduction of SpanConfigBounds, tenants no longer have the latter either. This elevates the need to make point-in-time conformance easily observable.

Conveniently, we already have a lot of the pieces needed to provide such conformance reports. We just need to possibly enhance them, stitch things together, and provide a mechanism to consume such information. This issue asks to do exactly that.

Specifically, users should be able to run conformance reports that give them information about which table(s)/index(es) are in violation of their zone configurations. There should also be 2 variations -- one that takes SpanConfigBounds into account and another that doesn't. This will allow users to distinguish between cases where all they need to do is "just wait" and cases that will never be satisfied.

High level sketch

We already have a conformance reporter. However, it doesn't give the caller a point-in-time snapshot -- this may need to change if we want stronger guarantees when stitching the report back with SQL state.

Note: this Reporter is not to be confused with the other Reporter in kvserver/reports/reporter.go. The latter is older and deprecated, and we don't see much of a future for it. It should be removed. (#100180)

The Reporter also doesn't know about SpanConfigBounds yet. We should extend it to return a list of SpanConfigs that fail bound checks in its response. This might simply be about giving the reporter a handle to the BoundsReader so it can access a tenant's Bounds and call Check() on them.
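A minimal sketch of that extension might look like the following. The names `BoundsReader` and `Check` come from the text above, but every signature, type, and field here is an assumption made for illustration and should not be read as the actual CockroachDB API.

```go
package main

import "fmt"

// SpanConfig is a hypothetical, simplified span configuration tagged with
// the keyspan it applies to.
type SpanConfig struct {
	Span        string
	NumReplicas int32
}

// Bounds is a hypothetical per-tenant bound.
type Bounds struct {
	MinReplicas, MaxReplicas int32
}

// Check reports whether c lies within the bounds (assumed signature).
func (b Bounds) Check(c SpanConfig) bool {
	return c.NumReplicas >= b.MinReplicas && c.NumReplicas <= b.MaxReplicas
}

// BoundsReader gives access to a tenant's Bounds (assumed shape).
type BoundsReader interface {
	Bounds(tenantID uint64) (Bounds, bool)
}

// mapBoundsReader is an in-memory BoundsReader used only for this example.
type mapBoundsReader map[uint64]Bounds

func (m mapBoundsReader) Bounds(tenantID uint64) (Bounds, bool) {
	b, ok := m[tenantID]
	return b, ok
}

// BoundsViolations returns the tenant's span configurations that fail the
// bounds check. A reporter could include this list in its response.
func BoundsViolations(r BoundsReader, tenantID uint64, configs []SpanConfig) []SpanConfig {
	b, ok := r.Bounds(tenantID)
	if !ok {
		return nil // no bounds declared for this tenant
	}
	var out []SpanConfig
	for _, c := range configs {
		if !b.Check(c) {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	reader := mapBoundsReader{10: {MinReplicas: 3, MaxReplicas: 5}}
	configs := []SpanConfig{
		{Span: "[a, b)", NumReplicas: 3},
		{Span: "[b, c)", NumReplicas: 7},
	}
	for _, v := range BoundsViolations(reader, 10, configs) {
		fmt.Printf("span %s violates tenant bounds: %+v\n", v.Span, v)
	}
}
```

Spans returned by `BoundsViolations` would fall into the "will never be satisfied" category; other non-conforming spans are the ones where waiting may be enough.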

The tenant (SQL) is the only thing that has access to both:

  1. What timestamp it's reconciled up to.
  2. How keyspans map back to schema objects (a reverse translation of sorts)

As such, the tenant would be responsible for taking the contents of a SpanConfigConformanceReport (which only associates raw keys to a conformance status) and mapping it back to which tables/indexes are in violation (if any).
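The reverse translation could be sketched roughly as below. The `Span` and `SchemaObject` types, the string-encoded keys, and the `ViolatingObjects` function are hypothetical and exist only to illustrate the overlap-based mapping; real keys are encoded bytes and the real report format may differ.

```go
package main

import "fmt"

// Span is a half-open key interval [Start, End). Keys are plain strings
// compared lexicographically, purely for illustration.
type Span struct {
	Start, End string
}

// Overlaps reports whether the two half-open spans share any key.
func (s Span) Overlaps(o Span) bool {
	return s.Start < o.End && o.Start < s.End
}

// SchemaObject ties a table/index to the keyspan it occupies. The tenant
// (SQL) is assumed to be able to build this list from its descriptors.
type SchemaObject struct {
	Table, Index string
	Span         Span
}

// ViolatingObjects maps raw violating spans from a conformance report back
// to "table@index" names, in schema order and without duplicates.
func ViolatingObjects(objects []SchemaObject, violating []Span) []string {
	var out []string
	for _, obj := range objects {
		for _, v := range violating {
			if obj.Span.Overlaps(v) {
				out = append(out, obj.Table+"@"+obj.Index)
				break // one match is enough for this object
			}
		}
	}
	return out
}

func main() {
	objects := []SchemaObject{
		{Table: "users", Index: "primary", Span: Span{"t100/i1", "t100/i2"}},
		{Table: "users", Index: "by_email", Span: Span{"t100/i2", "t100/i3"}},
		{Table: "orders", Index: "primary", Span: Span{"t101/i1", "t101/i2"}},
	}
	violating := []Span{{"t100/i2/k5", "t100/i2/k9"}}
	fmt.Println(ViolatingObjects(objects, violating))
}
```

A fuller version would also need to compare the report's timestamp against the timestamp the tenant has reconciled up to, so that the schema used for the mapping matches the state the report describes.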

I'm not sure what the best way to consume such information is -- maybe a new endpoint users can query? Or, better yet, we could build some sort of DB console page using it? Alternatively, we could run such a thing periodically and maybe increment some metrics.

cc @ajwerner @knz

Jira issue: CRDB-26176

Epic CRDB-26686

As part of addressing this, we should make sure to delete FIXMEIDONTKNOWWHICHCODECTOUSE usage (introduced as part of #48123).

@arulajmani arulajmani added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-multitenancy Related to multi-tenancy A-cluster-observability Related to cluster observability labels Mar 30, 2023
@arulajmani
Collaborator Author

Thanks @ajwerner and @knz for brainstorming some of this stuff with me earlier this week. Feel free to add more thoughts I may have missed or not represented here.

@knz
Contributor

knz commented Mar 30, 2023

@zachlite for your interest. I believe that Andrew, Irfan and I would be delighted to brainstorm with you on this.

@arulajmani
Collaborator Author

@knz I'm pulling off one of your edits into its own comment here:

There's also a bug by which the Reporter is unable to reason about logical keyspaces for secondary tenants; it's simply unaware of their schema and produces incoherent results for those ranges (this is tracked in #48123).

This wasn't part of what I had in mind when writing up the issue -- I don't think we need to push tenant schema information into the Reporter, which runs in KV, right?

@knz
Contributor

knz commented Mar 30, 2023

let me correct my edit
