Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-23.1: ccl/sqlproxyccl: add CIDR Ranges and Private Endpoints ACL support #103751

Conversation

jaylim-crl
Copy link
Collaborator

@jaylim-crl jaylim-crl commented May 22, 2023

Backport 14/14 commits from #101864, #103070, #103479, #103487, #103550, and #103625.

/cc @cockroachdb/release


See individual commits for more information. On a high-level overview, this backport introduces two new ACL mechanisms to the proxy: CIDR Ranges and Private Endpoints. For that to work, we reworked the tenant directory to watch tenant metadata changes, which were previously constant.

Release note: None

Epic: none

Release justification: SQL proxy ACL change that is needed for CC.

Fixes cockroachdb#76839 and cockroachdb#71365.

This commit rewrites the test to use the static directory server instead of
the dynamic one. The old one is susceptible to GRPC port reuse and deadlock
issues. With this change, only directory_cache_test.go is relying on the
dynamic directory server. Once we rewrite that, we can remove the entire test
directory implementation.

Release note: None
The Load field has been deprecated for months, and is no longer being used
in production. The tenant directory implementation is no longer emitting pod
loads. This commit addresses an old TODO.

Release note: None
This commit adds a new WatchTenants API to the directory server interface. The
goal is for this API to stream tenant metadata change notifications. At the
same time, this commit adds a new Tenant proto message to represent tenant
fields instead. This would allow us to add more fields to the tenant object
(e.g. IP allowlist and private endpoint entries).

Release note: None
This commits adds a new tenant metadata watcher mechanism to the directory
cache. This will be used to receive notifications about tenants being created,
modified, or deleted. Previously, we were only calling GetTenant once
throughout the lifetime of the tenant entry. In this new model, GetTenant is
only called if the tenant entry isn't valid.

Release note: None
This commit includes two new fields in the tenant proto: ConnectivityType and
PrivateEndpoints. The first will be used to indicate how the cluster can be
accessed (i.e. public only, private only, or both), whereas the second will
be used to indicate which endpoints can access the cluster, assuming that the
cluster allows private connectivity.

A subsequent commit will implement the necessary ACLs.

Release note: None
Previously, the CheckConnection API in the AccessController interface took in
a TimeSource object, and that seemed a little awkward. Given that the
timeSource field is only used by the denylist ACL, we will plumb this into
the denylis ACL. At the same time, we plumb context.Context into the
CheckConnection API. Once we add a new PrivateEndpoints ACL, that will have
to use a context.Context object for methods on the directory cache interface.

Release note: None
This commit adds a new PrivateEndpoints ACL support to the access controller.
This ACL will only apply if the cluster allows private connectivity. If the
connection does not have an EndpointID, it is assumed to be a public
connection, and the ACL is a no-op, which is the current behavior. A
subsequent commit will extract the private endpoints from the PROXY header,
and populate the EndpointID for the connection.

Release note: None
@blathers-crl
Copy link

blathers-crl bot commented May 22, 2023

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@jaylim-crl
Copy link
Collaborator Author

Note that this backport adds new functionalities to the proxy. That said, all these will be rolled out to CC through our automation.

@jaylim-crl jaylim-crl marked this pull request as ready for review May 22, 2023 22:42
@jaylim-crl jaylim-crl requested review from a team as code owners May 22, 2023 22:42
This commit parses the private endpoint IDs within the PROXY header. Doing
this would also mean that the ACL for private endpoints will be enforced under
the right conditions. The parsing logic assumes that there can only be one
type of endpoint ID within the PROXY header (i.e. AWS, GCP, or Azure).

Release note: None
There is a requirement of automatically disconnecting existing connections if
the ACL rules no longer match the connection's state. For the file-based ACLs,
we poll every 1 minute, read the file, and check on every single connection.

Here, we implement a similar mechanism for private endpoint updates. This is
reasonable for now as:
  1. Calls to LookupTenant are cached most of the time.
  2. This is only applied to existing connections, and a polling interval of
     1 minute isn't too bad.
  3. We are already iterating all the connections for the other types ACLs

That said, this approach isn't efficient and we're limited by the existing
design of AccessController. Checking all the connections each time the watch
ticks isn't ideal. In fact, we're checking three times here, one for each ACL.
The directory cache already knows which tenants were updated through
WatchTenants, and we can definitely do better here. A follow up TODO has been
added to refactor AccessController in a way that allows updates to be batched.
We should also only check connections only for tenants that have been updated.

Release note: None
Previously, we were only validating cluster names when LookupTenantPods get
called within the connector, which happens after the ACL check. The rationale
behind that is that we didn't want a malicious actor who's iterating through
all tenant IDs spinning up pods for them. The cluster name check ensures that
the incoming connection knows something about the tenant.

Now that we have introduced LookupTenant within the ACL logic, it is possible
for a malicious actor to iterate through all the tenant IDs, and figure out
which tenant IDs are in use (since it returns "connection refused" if the
tenant exists). To address that, we will start validating cluster names before
running the ACL check (i.e. at the start of the proxy handler) before
proceeding. This ensures that we will return a NotFound error if the tenant
doesn't exist, or there's a cluster name mismatch. At the same time, the
clusterName parameter has been removed from LookupTenantPods since that is
no longer needed.

Release note: None

Epic: none
…ities

Previously, we had introduced 3 states (ALLOW_NONE, ALLOW_PUBLIC, and
ALLOW_PRIVATE) with the intention that ALLOW_NONE would be useful. Thinking
about it further, it does not make sense for a cluster to support ALLOW_NONE
(i.e. what would that actually mean?). This commit updates the connectivity
types to support ALLOW_ALL, ALLOW_PUBLIC_ONLY, and ALLOW_PRIVATE_ONLY. By
default, ConnectivityType=0, which maps to ALLOW_ALL.

Release note: None

Epic: none
@jaylim-crl jaylim-crl force-pushed the backport23.1-101864-103070-103479-103487-103550-103625 branch from 1b28e58 to b5a711e Compare May 22, 2023 23:04
Previously, the proxy only supported file based IP allowlist entries. This
commit adds a new ACL for allowed CIDR ranges, and will only apply if the
allowlist file isn't supplied to the proxy. Unlike the file-based ACL, this
makes use of the per-tenant metadata to store allowed CIDR ranges (i.e. in
the new AllowedCIDRRanges field). By default, if there are no CIDR ranges, it
is expected to block all incoming connections.

Release note: None

Epic: none
This commit renames PrivateEndpoints to AllowedPrivateEndpoints in the tenant
proto message to be consistent with AllowedCIDRRanges. In theory, we could have
done AllowedEndpoints, but we'll leave "Private" in to be more explicit.

Release note: None

Epic: none
Fixes cockroachdb#103494.

This commit fixes a flake on TestWatchTenants. There's a possibility where
the cache invalidation logic races with the watcher termination logic in the
test. This commit updates the test to wait for the cache invalidation before
proceeding.

Release note: None

Epic: none
@jaylim-crl jaylim-crl force-pushed the backport23.1-101864-103070-103479-103487-103550-103625 branch from b5a711e to 768d496 Compare May 22, 2023 23:05
Copy link
Contributor

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All changes are in the SQL Proxy, which is only used in Serverless clusters. This patch will have no effect on Dedicated or self-host clusters.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jeffswenson)

@jaylim-crl
Copy link
Collaborator Author

TFTRs!

@jaylim-crl jaylim-crl merged commit 65c2fd9 into cockroachdb:release-23.1 May 23, 2023
@jaylim-crl jaylim-crl deleted the backport23.1-101864-103070-103479-103487-103550-103625 branch May 23, 2023 00:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants