-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ccl/sqlproxyccl: add private endpoints ACL support #103070
ccl/sqlproxyccl: add private endpoints ACL support #103070
Conversation
d73cdac
to
561fc19
Compare
561fc19
to
ea2652f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @jaylim-crl, @jeffswenson, and @pjtatlow)
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 33 at r3 (raw file):
// ID. If the tenant cannot be found, this will return a GRPC NotFound // error. LookupTenant(ctx context.Context, tenantID roachpb.TenantID) (*Tenant, error)
I thought the plan was to keep Tenant and Pod separate in tenant.Directory
, but to combine them in the cache? What if we had a Pods
field in Tenant
(or maybe a separate []*Pod
return value), so that callers don't need to call twice? Or do we expect callers to always care only about one or the other?
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 70 at r3 (raw file):
// SkipTenantWatcher indicates that the directory cache should not run the // tenant watcher. func SkipTenantWatcher() func(opts *dirOptions) {
What caller will use this option?
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 499 at r3 (raw file):
// tenant's entry to reflect that change. func (d *directoryCache) updateTenantPodEntry(ctx context.Context, pod *Pod) { if pod == nil || pod.Addr == "" || pod.TenantID == 0 {
When can this be nil? Are you sure it doesn't indicate a bug in the caller if it's ever nil?
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 574 at r3 (raw file):
// them as stale allows LookupTenant to fetch a new right away // if needed. d.markAllEntriesInvalid()
Shouldn't we also do this in watchPods
? Or is the idea that we'll just rely on reportFailure
to invalidate pod metadata?
pkg/ccl/sqlproxyccl/tenant/entry.go
line 43 at r3 (raw file):
syncutil.Mutex // pods represents the tenant's SQL pods.
NIT: can you add a similar comment here that the pods slice is copy-on-write?
pkg/ccl/sqlproxyccl/tenant/entry.go
line 251 at r3 (raw file):
// as nil). When that happens, the tenant watcher may race to update a // tenant, so we'd have to perform a nil check here. if (e.mu.valid && e.mu.tenant != nil && tenant.Version >= e.mu.tenant.Version) ||
NIT: Just a suggestion, feel free to disregard: the "if" logic here is a bit tricky, and also forces you to indent the "meat" of this method. I'd typically choose this instead:
// I try not to combine && and || in the same logic statement, b/c it's trickier to understand.
// It also makes comments harder to write, b/c it's often ambiguous which part(s) of the "if" they apply to.
// So I break apart the && into multiple lines like this:
if e.mu.valid {
if e.mu.tenant == nil || tenant.Version < e.mu.tenant.Version {
return
}
}
// And now for the main logic of method, which does not need to be nested b/c we've checked the short-circuit conditions above.
pkg/ccl/sqlproxyccl/tenantdirsvr/test_static_directory_svr.go
line 74 at r2 (raw file):
// eventListeners stores a list of listeners that are watching changes // to pods through WatchPods. eventListeners *list.List
NIT: I'd rename to podEventListeners
now that we two different lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jeffswenson, and @pjtatlow)
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 33 at r3 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
I thought the plan was to keep Tenant and Pod separate in
tenant.Directory
, but to combine them in the cache? What if we had aPods
field inTenant
(or maybe a separate[]*Pod
return value), so that callers don't need to call twice? Or do we expect callers to always care only about one or the other?
If we wanted to combine them together, I think tenant.Directory
should have combined APIs as well. We might need to look into what the interface would look like. At the moment, callers only care about one or the other. The ACL stuff only needs LookupTenant
, and the connector needs LookupTenantPods
. It is possible to refactor such that callers only use LookupTenant
and ReportFailure
if you feel strongly otherwise, though I would put it in a follow-up PR. I see two issues:
LookupTenantPods
blocks until there is at least one RUNNING pod, which means that the new API will need to support both sync and async.- If the tenant gets invalidated during the cold start path, we may need to perform another Get, even if the connector doesn't need the data.
Due to the above, I'm leaning towards keeping the APIs as-is.
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 70 at r3 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
What caller will use this option?
I removed it as it's no longer being used. I had it originally for a test.
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 499 at r3 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
When can this be nil? Are you sure it doesn't indicate a bug in the caller if it's ever nil?
I was somewhat defensive here to avoid panics. You're right that it's a bug in the caller (i.e. the tenant directory) if it's ever nil. I've removed the nil
checks for both the tenant and pod watchers. That way, if there's a bug, the proxy will panic, and we should know about it.
pkg/ccl/sqlproxyccl/tenant/directory_cache.go
line 574 at r3 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
Shouldn't we also do this in
watchPods
? Or is the idea that we'll just rely onreportFailure
to invalidate pod metadata?
The idea is correct: since we're relying on the reportFailure
API, there's no need to explicitly invalidate all pods. To add on, it's likely that some of the pods metadata are valid, and we can continue routing to them without fetching a whole new list of pods. For example, if a tenant has 3 pods, and 1 gets deleted, there's still a 2/3 chance for a cache hit. The pods metadata should only be invalidated if necessary.
pkg/ccl/sqlproxyccl/tenant/entry.go
line 43 at r3 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
NIT: can you add a similar comment here that the pods slice is copy-on-write?
Done.
pkg/ccl/sqlproxyccl/tenant/entry.go
line 251 at r3 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
NIT: Just a suggestion, feel free to disregard: the "if" logic here is a bit tricky, and also forces you to indent the "meat" of this method. I'd typically choose this instead:
// I try not to combine && and || in the same logic statement, b/c it's trickier to understand. // It also makes comments harder to write, b/c it's often ambiguous which part(s) of the "if" they apply to. // So I break apart the && into multiple lines like this: if e.mu.valid { if e.mu.tenant == nil || tenant.Version < e.mu.tenant.Version { return } } // And now for the main logic of method, which does not need to be nested b/c we've checked the short-circuit conditions above.
Good point. Done.
pkg/ccl/sqlproxyccl/tenantdirsvr/test_static_directory_svr.go
line 74 at r2 (raw file):
Previously, andy-kimball (Andy Kimball) wrote…
NIT: I'd rename to
podEventListeners
now that we two different lists.
Done.
ea2652f
to
ed3e1c2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jaylim-crl, and @jeffswenson)
pkg/ccl/sqlproxyccl/proxy_handler.go
line 326 at r15 (raw file):
if handler.RequireProxyProtocol { var err error endpointID, err = acl.FindPrivateEndpointID(incomingConn)
nit: it feels a little strange to have the FindPrivateEndpointID
function in the acl
package. I know the ACL is the only one who really cares about the private endpoint stuff, but it's been abstracted away from directly dealing with networking code until now.
pkg/ccl/sqlproxyccl/acl/watcher.go
line 319 at r15 (raw file):
// pollAndUpdateChan sends the same access controller object into the returned // channel every pollingInterval.
We should clarify exactly why we're doing this.
pkg/ccl/sqlproxyccl/acl/watcher.go
line 328 at r15 (raw file):
result := make(chan AccessController) go func() { // TODO(ye): use notification via SIGHUP instead.
These comments don't apply to this code since it's not watching files.
pkg/ccl/sqlproxyccl/tenant/entry.go
line 247 at r15 (raw file):
// tenant, so we'd have to perform a nil check here. if e.mu.valid { if e.mu.tenant == nil || tenant.Version < e.mu.tenant.Version {
I am probably misunderstanding this, but if the tenantEntry had a nil tenant, why would we return here rather than using the incoming *Tenant
?
ed3e1c2
to
5913539
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball, @jeffswenson, and @pjtatlow)
pkg/ccl/sqlproxyccl/proxy_handler.go
line 326 at r15 (raw file):
Previously, pjtatlow (PJ Tatlow) wrote…
nit: it feels a little strange to have the
FindPrivateEndpointID
function in theacl
package. I know the ACL is the only one who really cares about the private endpoint stuff, but it's been abstracted away from directly dealing with networking code until now.
I originally thought the same way as well, but I don't see a better place to put that. Putting it into proxy_handler.go
would mean that it needs to know how to parse endpoint IDs (which isn't the concern of the proxy handler), which is why I put it in private_endpoints.go
(though I'd imagine acl/utils.go
could work too). How strongly do you feel about this, or do you have any suggestions?
pkg/ccl/sqlproxyccl/acl/watcher.go
line 319 at r15 (raw file):
Previously, pjtatlow (PJ Tatlow) wrote…
We should clarify exactly why we're doing this.
Done.
pkg/ccl/sqlproxyccl/acl/watcher.go
line 328 at r15 (raw file):
Previously, pjtatlow (PJ Tatlow) wrote…
These comments don't apply to this code since it's not watching files.
Done.
pkg/ccl/sqlproxyccl/tenant/entry.go
line 247 at r15 (raw file):
Previously, pjtatlow (PJ Tatlow) wrote…
I am probably misunderstanding this, but if the tenantEntry had a nil tenant, why would we return here rather than using the incoming
*Tenant
?
Good point. Done. If e.mu.valid=true
, and e.mu.tenant=nil
, it must be the case that the entry was populated when the tenant wasn't created, or the API server is down. In that case, we would want to replace the tenant with a newer version.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Reviewed 2 of 5 files at r2, 1 of 8 files at r10, 1 of 5 files at r17, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andy-kimball and @jeffswenson)
pkg/ccl/sqlproxyccl/proxy_handler.go
line 326 at r15 (raw file):
Previously, jaylim-crl (Jay Lim) wrote…
I originally thought the same way as well, but I don't see a better place to put that. Putting it into
proxy_handler.go
would mean that it needs to know how to parse endpoint IDs (which isn't the concern of the proxy handler), which is why I put it inprivate_endpoints.go
(though I'd imagineacl/utils.go
could work too). How strongly do you feel about this, or do you have any suggestions?
I don't have a strong opinion, this is probably a good place for it. Just seemed weird at first haha.
TFTR! I'll follow up with a change to include bors r=pjtatlow |
Build failed (retrying...): |
Build failed (retrying...): |
Hm, looks like bors r- |
Canceled. |
The Load field has been deprecated for months, and is no longer being used in production. The tenant directory implementation is no longer emitting pod loads. This commit addresses an old TODO. Release note: None
This commit adds a new WatchTenants API to the directory server interface. The goal is for this API to stream tenant metadata change notifications. At the same time, this commit adds a new Tenant proto message to represent tenant fields instead. This would allow us to add more fields to the tenant object (e.g. IP allowlist and private endpoint entries). Release note: None
5913539
to
cdebd49
Compare
This commits adds a new tenant metadata watcher mechanism to the directory cache. This will be used to receive notifications about tenants being created, modified, or deleted. Previously, we were only calling GetTenant once throughout the lifetime of the tenant entry. In this new model, GetTenant is only called if the tenant entry isn't valid. Release note: None
This commit includes two new fields in the tenant proto: ConnectivityType and PrivateEndpoints. The first will be used to indicate how the cluster can be accessed (i.e. public only, private only, or both), whereas the second will be used to indicate which endpoints can access the cluster, assuming that the cluster allows private connectivity. A subsequent commit will implement the necessary ACLs. Release note: None
Previously, the CheckConnection API in the AccessController interface took in a TimeSource object, and that seemed a little awkward. Given that the timeSource field is only used by the denylist ACL, we will plumb this into the denylis ACL. At the same time, we plumb context.Context into the CheckConnection API. Once we add a new PrivateEndpoints ACL, that will have to use a context.Context object for methods on the directory cache interface. Release note: None
This commit adds a new PrivateEndpoints ACL support to the access controller. This ACL will only apply if the cluster allows private connectivity. If the connection does not have an EndpointID, it is assumed to be a public connection, and the ACL is a no-op, which is the current behavior. A subsequent commit will extract the private endpoints from the PROXY header, and populate the EndpointID for the connection. Release note: None
This commit parses the private endpoint IDs within the PROXY header. Doing this would also mean that the ACL for private endpoints will be enforced under the right conditions. The parsing logic assumes that there can only be one type of endpoint ID within the PROXY header (i.e. AWS, GCP, or Azure). Release note: None
There is a requirement of automatically disconnecting existing connections if the ACL rules no longer match the connection's state. For the file-based ACLs, we poll every 1 minute, read the file, and check on every single connection. Here, we implement a similar mechanism for private endpoint updates. This is reasonable for now as: 1. Calls to LookupTenant are cached most of the time. 2. This is only applied to existing connections, and a polling interval of 1 minute isn't too bad. 3. We are already iterating all the connections for the other types ACLs That said, this approach isn't efficient and we're limited by the existing design of AccessController. Checking all the connections each time the watch ticks isn't ideal. In fact, we're checking three times here, one for each ACL. The directory cache already knows which tenants were updated through WatchTenants, and we can definitely do better here. A follow up TODO has been added to refactor AccessController in a way that allows updates to be batched. We should also only check connections only for tenants that have been updated. Release note: None
cdebd49
to
0655127
Compare
bors r=pjtatlow |
Build succeeded: |
ccl/sqlproxyccl: remove the Load field from the Pod proto message
The Load field has been deprecated for months, and is no longer being used
in production. The tenant directory implementation is no longer emitting pod
loads. This commit addresses an old TODO.
ccl/sqlproxyccl: add WatchTenants API to the directory server
This commit adds a new WatchTenants API to the directory server interface. The
goal is for this API to stream tenant metadata change notifications. At the
same time, this commit adds a new Tenant proto message to represent tenant
fields instead. This would allow us to add more fields to the tenant object
(e.g. IP allowlist and private endpoint entries).
ccl/sqlproxyccl: add tenant metadata watcher to the directory cache
This commits adds a new tenant metadata watcher mechanism to the directory
cache. This will be used to receive notifications about tenants being created,
modified, or deleted. Previously, we were only calling GetTenant once
throughout the lifetime of the tenant entry. In this new model, GetTenant is
only called if the tenant entry isn't valid.
ccl/sqlproxyccl: include PrivateEndpoints in the tenant proto message
This commit includes two new fields in the tenant proto: ConnectivityType and
PrivateEndpoints. The first will be used to indicate how the cluster can be
accessed (i.e. public only, private only, or both), whereas the second will
be used to indicate which endpoints can access the cluster, assuming that the
cluster allows private connectivity.
A subsequent commit will implement the necessary ACLs.
ccl/sqlproxyccl: plumb context.Context into the CheckConnection API
Previously, the CheckConnection API in the AccessController interface took in
a TimeSource object, and that seemed a little awkward. Given that the
timeSource field is only used by the denylist ACL, we will plumb this into
the denylis ACL. At the same time, we plumb context.Context into the
CheckConnection API. Once we add a new PrivateEndpoints ACL, that will have
to use a context.Context object for methods on the directory cache interface.
ccl/sqlproxyccl: add private endpoints ACL support
This commit adds a new PrivateEndpoints ACL support to the access controller.
This ACL will only apply if the cluster allows private connectivity. If the
connection does not have an EndpointID, it is assumed to be a public
connection, and the ACL is a no-op, which is the current behavior. A
subsequent commit will extract the private endpoints from the PROXY header,
and populate the EndpointID for the connection.
ccl/sqlproxyccl: parse private endpoint IDs in the PROXY header
This commit parses the private endpoint IDs within the PROXY header. Doing
this would also mean that the ACL for private endpoints will be enforced under
the right conditions. The parsing logic assumes that there can only be one
type of endpoint ID within the PROXY header (i.e. AWS, GCP, or Azure).
ccl/sqlproxyccl: add an interval poller for private endpoint updates
There is a requirement of automatically disconnecting existing connections if
the ACL rules no longer match the connection's state. For the file-based ACLs,
we poll every 1 minute, read the file, and check on every single connection.
Here, we implement a similar mechanism for private endpoint updates. This is
reasonable for now as:
1 minute isn't too bad.
That said, this approach isn't efficient and we're limited by the existing
design of AccessController. Checking all the connections each time the watch
ticks isn't ideal. In fact, we're checking three times here, one for each ACL.
The directory cache already knows which tenants were updated through
WatchTenants, and we can definitely do better here. A follow up TODO has been
added to refactor AccessController in a way that allows updates to be batched.
We should also only check connections only for tenants that have been updated.
Release note: None
Epic: none