[vtadmin] Update vtctld dialer to validate connectivity #9915
Conversation
// Even if the client connection does not shut down cleanly, we don't want to block
// Dial from discovering a new vtctld. This makes VTAdmin's dialer more resilient,
// but, as a caveat, it _can_ potentially leak improperly-closed gRPC connections.
log.Errorf("error closing possibly-stale connection before re-dialing: %v", err)
I added more context about the leaked connections thing in the PR description.
looking good, a couple things to address (you'll also need to add labels to the PR)
go/vt/vtadmin/vtctldclient/proxy.go
Outdated
	}
}

log.Infof("Discovering vtctld to dial...\n")
Do we want to keep these, or were they just to help test/debug?
I also added a bunch more logging in the Dial function... I've found the added verbosity to be useful, but let me know if I took it too far (lol).
Up to you!
I removed most of the additional log statements.
It looks like there was an actual regression in a couple of the unit tests. Will fix.
Two things to fix the regressions:
All tests are passing now. @ajm188 no rush, but would you mind taking another quick look at 0ae0c1a...afe15d9 before this is merged? I'd like to confirm there aren't any weird gotchas with using a `var` for the default connectivity timeout.

As an aside + mostly a note to self, I noticed a bunch of gRPC reconnect logs in the failed test output that I thought may be related to (or worsened by) this change, but it happens on the main branch too:
I think this is another example of the thing I mentioned in the PR description, given gRPC's internal retry mechanism + our current dial options... I can take a stab at fixing that over the next few weeks in a separate branch by making that dial option configurable. (Open to other suggestions, of course.)
go/vt/vtadmin/vtctldclient/config.go
Outdated
}

var defaultConnectivityTimeout = 2 * time.Second
For your question around potential gotchas, this should be fine. You can also make this a `const`, since I can't think of a reason we would ever modify the default at runtime.
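Spelled out, the suggested tweak is just the following (illustrative, not a patch from this PR):

```go
// 2 * time.Second is a constant expression, so this compiles fine as a const
// and the default can no longer be reassigned at runtime.
const defaultConnectivityTimeout = 2 * time.Second
```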
Sure, I can change that. What's another 4 hours of CI runs between friends?
	return listener, server, err
}

func TestDial(t *testing.T) {
what you're seeing running just this test in isolation is super interesting, do you mind filing an issue and i can dig into it? i'm not really sure what's going on, but it could be a "macs suck at local networking" issue, or something not completely correct in the code, but it's hard to say (and i don't think we should block this PR, which I still maintain is strictly an improvement over the current, uh, Situation)
Done: #9943
Thanks for taking a look. I'm super interested in what you find! I spent way too long looking at gRPC internals for this PR 😭
Description
This fixes #9422: prior to this fix, vtadmin-api will hang on to a cached gRPC connection to a vtctld even after the gRPC channel is shut down, and any subsequent vtadmin-api request that queries a vtctld (e.g., `/api/schemas`) will fail.

This change updates VTAdmin's vtctld proxy to be "self healing" when its gRPC connection is lost. The `Dial` function, which is called prior to any vtctld request, will now wait until its cached connection is `READY`. If this check fails (for example: the connection is in a `SHUTDOWN` state as mentioned above, or we exceed `grpc-connectivity-timeout`'s worth of waiting), then the proxy will discover a different vtctld in that cluster and attempt to establish (and cache) a new gRPC connection.

I also added a bunch more logging in the `Dial` function... I've found the added verbosity to be useful, but let me know if I took it too far (lol).

FWIW, we've had this branch running across Slack's Vitess deployments for the past few months and it's worked well. And thank you @ajm188 for writing most of this with me back in... uh... last July. :')
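Roughly, the "wait until READY" idea looks something like the sketch below, using grpc-go's connectivity API. This is a simplified illustration rather than the exact code in `proxy.go`; the `waitForReady` helper name and the timeout handling are assumptions.

```go
package vtctldclient

import (
	"context"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/connectivity"
)

// waitForReady blocks until the cached connection is READY, the channel reaches
// SHUTDOWN, or the timeout (e.g. grpc-connectivity-timeout) elapses. It returns
// true only if the connection became READY.
func waitForReady(conn *grpc.ClientConn, timeout time.Duration) bool {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	for {
		switch state := conn.GetState(); state {
		case connectivity.Ready:
			return true
		case connectivity.Shutdown:
			// A SHUTDOWN channel never recovers; the caller must redial.
			return false
		default:
			// Block until the state changes or the timeout expires.
			if !conn.WaitForStateChange(ctx, state) {
				return false
			}
		}
	}
}
```

When a check like this fails, the proxy falls through to the discover-then-dial path described above.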
Reproduction steps
Some of this is noted in #9422, but I'll note it here for posterity anyway. (It's an interesting example of doing mildly nontrivial stuff with VTAdmin + the local example. 🤷)
Parameterize the vtctld-up.sh script and VTAdmin's discovery.json file to make it a lil easier to run a second vtctld
Start up a local cluster as usual, which will start up a single vtctld on http://localhost:15999: `./101_initial_cluster.sh`

Start a second vtctld on http://localhost:16999: `VTCTLD_GRPC_PORT=16999 VTCTLD_WEB_PORT=16000 ./scripts/vtctld-up.sh`

Start up VTAdmin. (The usual way is `./scripts/vtadmin-up.sh`, which will also start vtadmin-web.)

At this point, we can double check that VTAdmin can "discover" both vtctlds. (Scare quotes since "discovery", in this case, is simply reading from that discovery.json file.)

Now, since VTAdmin lazy-initializes its vtctld connections, we need to trigger a request that traverses the "discover -> dial -> cache" codepath: `curl "http://localhost:14200/api/schemas"`

Examine VTAdmin's `proxy.go` logs to see which of the two local vtctlds it discovered + dialed; in this case, it's the vtctld on http://localhost:16999.

Now, we are going to kill this vtctld. 😈
For the sake of illustration, let's take a brief diversion into 🐛 bug territory 🐛 and see what happens on the `main` branch after we kill the vtctld (without the WaitForReady fix, but keeping all the logging): the curl command will eventually time out, since the request never completes. `curl "http://localhost:14200/api/schemas"`

And now for the fix! We can observe that VTAdmin is able to detect the `SHUTDOWN` connection and negotiate a new one:

The above logs are especially interesting because they point out a shortcoming of (and a later enhancement for) this change. On VTAdmin's first attempt to discover a vtctld, it rediscovers the one we just killed on http://localhost:16999. This is because we're using static file discovery, and so the vtctld is never removed from that discovery.json file after we `kill` it, so VTAdmin has a 50% chance of rediscovering it again. And, as mentioned below, `Dial` does not retry in this case (although one could imagine it doing so with an exponential backoff or similar), so the request fails:

The rest of the logs result from a subsequent `curl "http://localhost:14200/api/schemas"` which (you guessed it!) redials, rediscovers, and re-establishes a gRPC connection to the remaining, healthy vtctld on http://localhost:15999.

A note on rejected alternatives
Using the connectivity API to introspect our gRPC connections is a little cumbersome and possibly error prone. (I have been known to write bugs and... gestures at next section on leaked connections.)
Ideally, the grpc-go library would handle this for us, and theoretically it can; however, my understanding is that we'd use the Resolver interface as a service discovery integration point and then run something like a lookaside load balancer. This has its advantages (round-robin discovery, being "officially supported")... but it would also be a Whole Thing to rewrite our service discovery layer.
Another approach that was shared with me is initializing healthchecks on the connection you get back from Dial: https://github.com/grpc/grpc/blob/master/doc/health-checking.md. I haven't investigated this one (to be candid, since this branch works) but I'll note it here for posterity!
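For anyone curious about that alternative, here's a hedged sketch of what enabling grpc-go's client-side health checking looks like, based on the linked doc rather than anything in this PR (the address and empty service name are placeholders):

```go
package main

import (
	"log"

	"google.golang.org/grpc"
	_ "google.golang.org/grpc/health" // registers the client-side health checking function
)

func main() {
	// healthCheckConfig tells grpc-go to treat a subchannel as READY only once the
	// server's Health service reports SERVING. The default pick_first balancer
	// skips health checking, hence the round_robin loadBalancingConfig.
	conn, err := grpc.Dial(
		"localhost:15999", // placeholder vtctld address
		grpc.WithInsecure(),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}], "healthCheckConfig": {"serviceName": ""}}`),
	)
	if err != nil {
		log.Fatalf("grpc.Dial failed: %v", err)
	}
	defer conn.Close()
}
```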
A note on leaked connections
This PR updates proxy.go to functionally ignore errors from closing the gRPC connection. There are definitely some... undesirable interactions between this retry logic and gRPC's internal retry logic.
When the gRPC connection is lost, _even if we call `Close`_, gRPC's internal mechanisms will continue to retry for ~4 seconds. During this time, as far as I can tell, the connectivity API will show the connection flapping between `CONNECTING` and `TRANSIENT_FAILURE`.

During this period, any VTAdmin request that traverses the `Dial` codepath will first fail to transition, and then all subsequent calls will fail, since `Dial` (prior to this branch) will return early on that "error closing possibly stale connection" error, until gRPC's internal retry times out:

We can also see evidence of these retries in the logs as soon as the vtctld is killed:
I did a bunch of digging into this a few months ago (which is part of the reason this branch has taken me forever 😭) and realized that we can likely disable gRPC's internal retries with some incantation of `grpc.DialOption`s. I remember from a past conversation with @ajm188 that configuring dial opts was more complicated than I anticipated, and... well, I'd propose addressing that in a separate PR. :')

My understanding of the "worst case" scenario, as noted, is that VTAdmin can possibly leak gRPC connections that are improperly closed. In most cases, I think (hope?) these connections would terminate themselves once their retry timeout is up. There is a chance, though, that the once-dead vtctld comes back while gRPC is internally retrying, even if VTAdmin's proxy has since established a connection to a different vtctld.
FWIW, we've been running this change in our environment for several months without any issues. And this change is an enhancement given that the current behaviour is to simply fail forever until the vtadmin process is restarted. :')
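For reference, the `grpc.DialOption` "incantation" mentioned above would most likely involve tuning the connection backoff. A hedged sketch follows; the durations are made up and none of this is part of this PR:

```go
package main

import (
	"log"
	"time"

	"google.golang.org/grpc"
	"google.golang.org/grpc/backoff"
)

func main() {
	// WithConnectParams controls how aggressively gRPC retries the underlying
	// connection after it drops, which (as far as I understand) is the mechanism
	// behind the internal reconnect attempts described above.
	conn, err := grpc.Dial(
		"localhost:15999", // placeholder vtctld address
		grpc.WithInsecure(),
		grpc.WithConnectParams(grpc.ConnectParams{
			Backoff: backoff.Config{
				BaseDelay:  100 * time.Millisecond,
				Multiplier: 1.6,
				Jitter:     0.2,
				MaxDelay:   time.Second, // cap reconnect backoff well below the 120s default
			},
			MinConnectTimeout: 2 * time.Second,
		}),
	)
	if err != nil {
		log.Fatalf("grpc.Dial failed: %v", err)
	}
	defer conn.Close()
}
```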
Related Issue(s)
Closes #9422
Checklist
Deployment Notes
This PR introduces `grpc-connectivity-timeout`, a new per-cluster config option that sets the maximum wait time to establish a gRPC connection between VTAdmin and the vtctld it queries in that region. The default value is 2 seconds.