[vtadmin] custom discovery resolver #9977
Conversation
okay @notfelineit @doeg @setassociative I'm marking this ready for review!! I still need to write Actual Release Notes, but given that things may change pending PR feedback, I'm going to wait on that until things seem to have solidified around the design details (but I will do it in this branch before merging, I promise!)
This is great, both design and implementation. Thanks for the thorough write-up + commentary.
For posterity, it's worth mentioning we've had this running in our Slack environment for a couple of days, and it's been resilient to both vtctlds and vtgates going away. 💯
The new GRPC Discovery looks good! I'm not sure about the actual vtadmin integration; that part of Vitess is foreign to me. :)
we're currently seeing issues with keepalives closing connections repeatedly. vtadmin is still functional because we just redial, but it muddies logs and could cause intermittent failures for users. i've narrowed the problem down to the fact that our default …
okayyyyyy!!!!!!!!!!!! rebasing on top of #10024 makes these logs go away. i believe grpc/grpc-go#4579 fixes the culprit of our problem. after @vmg's change is merged, i'll rebase and update the code to use the newer fields, namely:
For breadcrumbs, we tested this (well, this + #10024) again in Slack's dev environment this morning and it works even better than before. :) The logs are very quiet now.
I realize you don't need another +1 to merge post-rebase, so I'll just leave one now. 🏆 Thanks again. This is really nice.
there's one more change i'll need to make and re-verify, unfortunately (that `Target.Url` bit, i think), because the field i'm currently using (it's the only one available) was deprecated between grpc 1.37 and 1.45
i've updated the rebase branch with the fix if you want to test in your environment. i can't push it to this branch (yet) because it literally won't compile 😭 😆
1. Register `VtctldServer`, not `VtctlServer`. 2. Abuse `GetKeyspace` to allow the client to inspect the listen address of a given server to use in assertions. 3. Building on (2), stop asserting on proxy.host, which will be going away.
Signed-off-by: Andrew Mason <andrew@planetscale.com>
Still need to actually add flags for these, but this will make it easier. Also make some things unexported. Signed-off-by: Andrew Mason <andrew@planetscale.com>
…th empty list, and tests Signed-off-by: Andrew Mason <andrew@planetscale.com>
…ted) Signed-off-by: Andrew Mason <andrew@planetscale.com>
Description
This PR introduces a custom grpc resolver based on our `discovery.Discovery` interface, for use in both vtctld and vtgate grpc communication in vtadmin.

Motivation and Rationale
There are two* main drawbacks to the current discovery/dialing approach in both vtctldclient and vtsql for vtadmin:
This solves both of these problems by using the `resolver` API, letting our discovery (1) provide multiple hosts for grpc to multiplex the ClientConn over and (2) let grpc request a re-resolve (a re-discovery, in our case) to update the address list when it's getting connection failures.

Preparatory changes
- `Discover{Vtctld,VTGate}Addrs`, with the semantics of `Discover*Addr` (in that we run the addr template) but with N addresses returned instead; see the sketch below.
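A rough sketch of the shape, for illustration only (the `context`/tags parameters are assumptions based on the `Discover*Addr` semantics described above, not verbatim from the vtadmin codebase):

```go
package discovery

import "context"

// Sketch only: parameter lists are assumed, not copied from the actual
// vtadmin Discovery interface.
type Discovery interface {
	// DiscoverVtctldAddr returns a single vtctld address, running the addr template.
	DiscoverVtctldAddr(ctx context.Context, tags []string) (string, error)
	// DiscoverVtctldAddrs has the same semantics, but returns N addresses.
	DiscoverVtctldAddrs(ctx context.Context, tags []string) ([]string, error)
	// The VTGate variants behave analogously.
	DiscoverVTGateAddr(ctx context.Context, tags []string) (string, error)
	DiscoverVTGateAddrs(ctx context.Context, tags []string) ([]string, error)
}
```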
Overview

When you call `grpc.Dial(addr, opts...)`, grpc will parse `addr` as a target of the form `scheme://authority/endpoint`. So, for example, the addr `dns://some_authority/foo.bar` will be parsed into `&Target{Scheme: "dns", Authority: "some_authority", Endpoint: "foo.bar"}`.

If there is a resolver registered for the given scheme (there is both a global and a local registry; more on this in a bit), then grpc will use that resolver to resolve the target into one or more addresses to connect to. Otherwise (or if the target has no scheme, in the case of "/some_authority/foo.bar" or just "//foo.bar"), the default scheme (dns) is used. The `Dial` call then returns a single ClientConn, which has a SubConn for each address returned by the resolver, over which grpc can choose (or not, depending on the balancer configuration, connection errors, etc.) to multiplex the RPCs. After enough transient failures, or periodically (opaquely, "when grpc damn well feels like it"), grpc will ask the resolver to `ResolveNow` again, and the resolver should call `UpdateState` with the new set of addresses that grpc should use for that `ClientConn`.
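To make the `ResolveNow`/`UpdateState` handshake concrete, here is a minimal sketch against grpc-go's `resolver` package. This is not the PR's actual implementation: the `discoverAddrs` func is a stand-in for our discovery lookup, and the struct names are hypothetical.

```go
package vtadminresolver

import "google.golang.org/grpc/resolver"

// builder implements resolver.Builder; one would exist per cluster/component.
type builder struct {
	scheme        string          // e.g. the cluster ID
	discoverAddrs func() []string // stand-in for the discovery lookup
}

func (b *builder) Build(_ resolver.Target, cc resolver.ClientConn, _ resolver.BuildOptions) (resolver.Resolver, error) {
	r := &discoveryResolver{cc: cc, discoverAddrs: b.discoverAddrs}
	// Do an initial resolution so the ClientConn has addresses to dial.
	r.ResolveNow(resolver.ResolveNowOptions{})
	return r, nil
}

func (b *builder) Scheme() string { return b.scheme }

// discoveryResolver implements resolver.Resolver.
type discoveryResolver struct {
	cc            resolver.ClientConn
	discoverAddrs func() []string
}

// ResolveNow is invoked by grpc whenever it wants a fresh address list,
// e.g. after enough transient connection failures.
func (r *discoveryResolver) ResolveNow(resolver.ResolveNowOptions) {
	var addrs []resolver.Address
	for _, a := range r.discoverAddrs() {
		addrs = append(addrs, resolver.Address{Addr: a})
	}
	// Hand the set to the ClientConn; grpc manages SubConns from here.
	r.cc.UpdateState(resolver.State{Addresses: addrs})
}

func (r *discoveryResolver) Close() {}
```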
grpc.WithResolvers(...)

As mentioned above, when `Dial`-ing, grpc has two resolver registries to check: the global one (via `resolver.Register`) and a ClientConn-local one. The latter takes precedence over the former, and is the one we're interested in. To add a resolver to the local registry, grpc provides a DialOption called `WithResolvers`, which works exactly the same as `resolver.Register`, except that it's local to the ClientConn being dialed.

This allows us to have resolvers for each cluster, with their own configuration options, without risking (or having to reason about) them stepping over each other.
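A usage sketch, assuming the `builder` type from the previous snippet; the `"c1"` cluster ID, the helper name, and the insecure credentials are all illustrative, not from the PR:

```go
package vtadminresolver

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// dialVtctld shows the shape of the call; "c1" stands in for a cluster ID.
func dialVtctld(b *builder) (*grpc.ClientConn, error) {
	return grpc.Dial(
		"c1://vtctld/",        // not a real host:port; the scheme selects our resolver
		grpc.WithResolvers(b), // registered only for this ClientConn, not globally
		grpc.WithTransportCredentials(insecure.NewCredentials()), // illustration only
	)
}
```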
Within VTAdmin
With all that prologue out of the way, let's talk about how this ties together within the context of vtadmin. Each cluster gets two resolvers: one for vtctldclient proxies, and one for vtgate proxies. Each of these uses the `cluster.Id` as the "scheme" (N.B. because we only ever dial with `WithResolvers`, things are local to the cluster, so we could probably get away with using the same scheme (e.g. "vtadmin://") across all clusters, but there's no harm (as far as I can tell) in doing things the way I currently have them either), and then either "vtctld" or "vtgate" as the "authority". In order to instruct grpc to use our resolvers, then, we do not dial an actual address, but instead either the "address" "{clusterId}://vtctld/" or "{clusterId}://vtgate/".

So, when creating a `VtctldClientProxy` or `VTGateProxy` from the cluster CLI options (via the Parse methods), we new up a `resolver.Options` (our type), which takes a Discovery implementation, and then parses any `DiscoveryTags`, `DiscoveryTimeout`, and `BalancerPolicy` (more on this one below) from the cluster flags. We then use the options to instantiate a resolver.Builder, which gets passed into `WithResolvers` when either the vtctld or vtgate client makes its call to Dial.

Within the resolver itself, whenever grpc asks us to `ResolveNow`, we use the `discoverFunc` (which is either `DiscoverVtctldAddrs` or `DiscoverVTGateAddrs`, depending on the "authority" (component)) to look up the appropriate set of addresses, which we send back to the ClientConn via `UpdateState`. We also log information about the resolution, and update the resolver to remember what address list it most recently sent to the ClientConn, for debug purposes. Information about the resolver state is bubbled up from resolver instance => resolver builder => vtgate or vtctld proxy => cluster debug endpoint(s). For example, running locally the other day, I could show this (I've changed some of the field names since then, but you get the idea).
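For illustration, the authority-to-`discoverFunc` dispatch could look roughly like this sketch (the function and interface names here are assumptions; note also that `Target.Authority` was the only field available at grpc 1.37 but is deprecated by 1.45 in favor of `Target.URL`, per the comment thread above):

```go
package vtadminresolver

import (
	"context"
	"fmt"

	"google.golang.org/grpc/resolver"
)

// discovery is the subset of the Discovery sketch above that the resolver needs.
type discovery interface {
	DiscoverVtctldAddrs(ctx context.Context, tags []string) ([]string, error)
	DiscoverVTGateAddrs(ctx context.Context, tags []string) ([]string, error)
}

// pickDiscoverFunc maps the parsed target's "authority" (component) to the
// matching discovery lookup.
func pickDiscoverFunc(target resolver.Target, disco discovery) (func(context.Context, []string) ([]string, error), error) {
	switch target.Authority { // deprecated in newer grpc-go in favor of Target.URL.Host
	case "vtctld":
		return disco.DiscoverVtctldAddrs, nil
	case "vtgate":
		return disco.DiscoverVTGateAddrs, nil
	default:
		return nil, fmt.Errorf("unsupported authority %q in target", target.Authority)
	}
}
```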
BalancerPolicy

The default balancer policy when none is specified is called `pick_first`. This is effectively grpc doing `net.Dial(state.Addresses[0])`, which still gives us the problem of "all the load goes to one host". To get around this, without making the resolver itself aware of shuffling the host list, we expose new flag(s) (`--{vtctld,vtgate}-grpc-balancer-policy`) to allow users to specify the empty string (the default policy), `pick_first` explicitly, or `round_robin`.
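For reference, grpc-go's stock way to request a balancing policy on a ClientConn is via the service config; presumably the new flag feeds into something along these lines, though that mapping is my assumption:

```go
package vtadminresolver

import "google.golang.org/grpc"

// roundRobinOpt asks grpc to spread RPCs across all resolved addresses,
// instead of pinning everything to the first one (pick_first).
var roundRobinOpt = grpc.WithDefaultServiceConfig(
	`{"loadBalancingConfig": [{"round_robin":{}}]}`,
)
```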
To simulate RR balancing, I added a non-existent vtctld to my static discovery file and enabled RR.
After removing the RR policy flag (sending us back to `pick_first`, meaning we would pick the legit vtctld), things went back to normal.

Appendix
Some notable changes:

- Implemented `debug.Debuggable` for the resolvers, so the address list and balancer policy show up in the /debug endpoint(s) for vtadmin and for a given cluster; a rough sketch follows.
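A rough sketch of what that could look like (the `Debuggable` signature and the field names are my assumptions, not verbatim from vtadmin):

```go
package vtadminresolver

// Debuggable mirrors the rough shape of vtadmin's debug.Debuggable;
// the exact signature is assumed, not copied from the codebase.
type Debuggable interface {
	Debug() map[string]interface{}
}

// resolverDebugState is a hypothetical holder for what a resolver remembers
// about its most recent resolution, for display in the /debug endpoints.
type resolverDebugState struct {
	BalancerPolicy string
	LastAddrs      []string // last address list sent via UpdateState
}

func (s *resolverDebugState) Debug() map[string]interface{} {
	return map[string]interface{}{
		"balancer_policy": s.BalancerPolicy,
		"addr_list":       s.LastAddrs,
	}
}
```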
Testing

In addition to the unit tests I updated and added, which all pass, this has been tested both against the local example and against development vtadmin deployments.
Related Issue(s)
Checklist
Deployment Notes