[vtctl|vtctldserver] List/Get Tablets timeouts #7715

Merged Mar 22, 2021 (12 commits)

Contributor

@ajm188 ajm188 commented Mar 19, 2021

Description

This PR tackles several things at once (sorry!)

  1. Add context timeouts to VtctldServer.GetTablets, similar to what [vtctld] Migrate ShardReplicationPositions #7690 did for that method.
  2. Add a Strict field to GetTablets, to dictate whether the RPC should treat partial results from the topo as fatal.
  3. Reimplement the legacy vtctl ListShardTablets / ListAllTablets to invoke the VtctldServer.GetTablets method under the hood, so that they get the timeout fix for free. This also let me delete some helper methods that were duplicated between go/vt/vtctl and go/cmd/vtctldclient/cli.
  4. It was at this point that I realized that, in order to also apply the timeout fix to ListTablets, I needed to add a TabletAliases field to the GetTabletsRequest proto message, so I chose to make that the highest-precedence filter.
  5. ⚠️ Then, in order to wrap errors coming back from the topo while still being able to check whether the error was a topo.PartialResult, I updated topo.IsErrType to use errors.As to recursively Unwrap any wrapped error chain, before falling back to the single-depth type cast check.
  6. Reimplement the legacy vtctl ListTablets to use GetTablets under the hood, and delete the rest of the duplicated cli formatting helper code that was now dead.
  7. Update the vtctldclient CLI and the tests ✨

Examples!

vtctldclient - healthy => bad topo => strict
❯ time vtctldclient --server "localhost:15999" GetTablets
zone1-0000000100 commerce 0 master SFO-M-AMASON02:15100 SFO-M-AMASON02:17100 [] 2021-03-15T20:30:34Z
zone1-0000000101 commerce 0 replica SFO-M-AMASON02:15101 SFO-M-AMASON02:17101 [] <null>
zone1-0000000102 commerce 0 rdonly SFO-M-AMASON02:15102 SFO-M-AMASON02:17102 [] <null>
vtctldclient --server "localhost:15999" GetTablets  0.01s user 0.01s system 67% cpu 0.023 total
❯ vtctlclient -server "localhost:15999" AddCellInfo -root /vitess/zone0 -server_address bogus:1234 zone0
❯ time vtctldclient --server "localhost:15999" GetTablets --cell zone0
^C
vtctldclient --server "localhost:15999" GetTablets --cell zone0  0.01s user 0.01s system 0% cpu 4.695 total
❯ time vtctldclient --server "localhost:15999" GetTablets --cell zone1
zone1-0000000100 commerce 0 master SFO-M-AMASON02:15100 SFO-M-AMASON02:17100 [] 2021-03-15T20:30:34Z
zone1-0000000101 commerce 0 replica SFO-M-AMASON02:15101 SFO-M-AMASON02:17101 [] <null>
zone1-0000000102 commerce 0 rdonly SFO-M-AMASON02:15102 SFO-M-AMASON02:17102 [] <null>
vtctldclient --server "localhost:15999" GetTablets --cell zone1  0.01s user 0.01s system 75% cpu 0.029 total
❯ time vtctldclient --server "localhost:15999" GetTablets --cell zone1 --cell zone0
zone1-0000000100 commerce 0 master SFO-M-AMASON02:15100 SFO-M-AMASON02:17100 [] 2021-03-15T20:30:34Z
zone1-0000000101 commerce 0 replica SFO-M-AMASON02:15101 SFO-M-AMASON02:17101 [] <null>
zone1-0000000102 commerce 0 rdonly SFO-M-AMASON02:15102 SFO-M-AMASON02:17102 [] <null>
vtctldclient --server "localhost:15999" GetTablets --cell zone1 --cell zone0  0.01s user 0.01s system 0% cpu 5.029 total
❯ time vtctldclient --server "localhost:15999" GetTablets --cell zone1 --cell zone0 --strict
E0315 16:31:31.388673   21647 main.go:42] rpc error: code = Unknown desc = GetAllTablets(cell = zone0) failed: dial tcp: lookup bogus: no such host
failed to create topo connection to bogus:1234, /vitess/zone0
vtctldclient --server "localhost:15999" GetTablets --cell zone1 --cell zone0   0.01s user 0.02s system 0% cpu 5.035 total
legacy vtctlclient, pre-fix
❯ time vtctlclient --server "localhost:15999" ListAllTablets
zone1-0000000100 commerce 0 master SFO-M-AMASON02:15100 SFO-M-AMASON02:17100 [] 2021-03-18T12:43:59Z
zone1-0000000101 commerce 0 replica SFO-M-AMASON02:15101 SFO-M-AMASON02:17101 [] <null>
zone1-0000000102 commerce 0 rdonly SFO-M-AMASON02:15102 SFO-M-AMASON02:17102 [] <null>
vtctlclient --server "localhost:15999" ListAllTablets  0.01s user 0.01s system 0% cpu 5.029 total
❯ time vtctlclient --server "localhost:15999" ListTablets zone1-0000000100
zone1-0000000100 commerce 0 master SFO-M-AMASON02:15100 SFO-M-AMASON02:17100 [] 2021-03-18T12:43:59Z
vtctlclient --server "localhost:15999" ListTablets zone1-0000000100  0.01s user 0.02s system 23% cpu 0.103 total
❯ time vtctlclient --server "localhost:15999" ListShardTablets commerce/0
ListShardTablets Error: rpc error: code = Unknown desc = partial result: 0
E0318 10:02:42.527874   69634 main.go:72] remote error: rpc error: code = Unknown desc = partial result: 0
vtctlclient --server "localhost:15999" ListShardTablets commerce/0  0.01s user 0.01s system 0% cpu 5.040 total
legacy vtctlclient, post-fix
❯ time vtctlclient -v 1000 -server "localhost:15999" ListTablets zone1-0000000100
zone1-0000000100 commerce 0 master SFO-M-AMASON02:15100 SFO-M-AMASON02:17100 [] 2021-03-19T00:08:14Z
vtctlclient -v 1000 -server "localhost:15999" ListTablets zone1-0000000100  0.01s user 0.01s system 77% cpu 0.029 total
❯ time vtctlclient -v 1000 -server "localhost:15999" ListTablets zone1-0000000100 zone0-0000000900
zone1-0000000100 commerce 0 master SFO-M-AMASON02:15100 SFO-M-AMASON02:17100 [] 2021-03-19T00:08:14Z
vtctlclient -v 1000 -server "localhost:15999" ListTablets zone1-0000000100   0.01s user 0.01s system 77% cpu 0.027 total
❯ vtctlclient -server "localhost:15999" AddCellInfo -root /vitess/zone0 -server_address bogus:1234 zone0
❯ time vtctlclient -v 1000 -server "localhost:15999" ListTablets zone1-0000000100 zone0-0000000900
zone1-0000000100 commerce 0 master SFO-M-AMASON02:15100 SFO-M-AMASON02:17100 [] 2021-03-19T00:08:14Z
vtctlclient -v 1000 -server "localhost:15999" ListTablets zone1-0000000100   0.01s user 0.01s system 0% cpu 5.033 total
vtctldclient, with new tablet alias filtering
❯ time vtctldclient --server "localhost:15999" GetTablets -tzone1-0000000100
zone1-0000000100 commerce 0 master SFO-M-AMASON02:15100 SFO-M-AMASON02:17100 [] 2021-03-19T00:53:27Z
vtctldclient --server "localhost:15999" GetTablets -tzone1-0000000100  0.01s user 0.01s system 78% cpu 0.028 total

Related Issue(s)

Checklist

  • Should this PR be backported? no
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

Impacted Areas in Vitess

Components that this PR will affect:

  • Query Serving
  • VReplication
  • Cluster Management
  • Build/CI
  • VTAdmin

Signed-off-by: Andrew Mason <amason@slack-corp.com>
Member

@deepthi deepthi left a comment

LGTM

bool strict = 4;
// TabletAliases is an optional list of tablet aliases to fetch Tablet objects
// for. If specified, Keyspace, Shard, and Cells are ignored, and tablets are
// looked up by their respective aliase's Cells directly.
Member

tiny typo - you can choose to fix it or not 😁

Contributor Author

😬 let me push up a fix! lol

Contributor

@doeg doeg left a comment

Looks good to meeee

GetTablets.Flags().StringVar(&getTabletsOptions.Format, "format", "awk", "Output format to use; valid choices are (json, awk)")
GetTablets.Flags().BoolVar(&getTabletsOptions.Strict, "strict", false, "Require all cells to return successful tablet data. Without --strict, tablet listings may be partial.")
Contributor

chef kiss

err error
)

switch {
Contributor

Not sure to what extent this is a golang thing or a you thing, but I really like this (and other) uses of switch (instead of if/else, presumably). Going to osmose this pattern into our TypeScript more often. >:)

Contributor Author

Probably a mix of both haha
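
For anyone curious about the pattern discussed in this thread: a Go switch with no tag evaluates its cases top to bottom and runs the first one that is true, which makes precedence rules easy to read. Below is a rough sketch applied to filter precedence. The request type is a hypothetical, trimmed-down stand-in for GetTabletsRequest, and only "TabletAliases wins over Keyspace, Shard, and Cells" comes from this PR; the ordering of the remaining cases is an assumption.

```go
package main

import "fmt"

// request is a hypothetical, trimmed-down stand-in for GetTabletsRequest.
type request struct {
	TabletAliases []string
	Keyspace      string
	Shard         string
	Cells         []string
}

// lookupMode names which filter takes effect for a request. Only the
// first case's precedence is taken from the PR; the rest is assumed.
func lookupMode(req request) string {
	switch {
	case len(req.TabletAliases) > 0:
		return "by alias"
	case req.Keyspace != "" && req.Shard != "":
		return "by shard"
	case len(req.Cells) > 0:
		return "by cell"
	default:
		return "all cells"
	}
}

func main() {
	// Aliases win even when other filters are also set.
	fmt.Println(lookupMode(request{
		TabletAliases: []string{"zone1-0000000100"},
		Keyspace:      "commerce",
		Shard:         "0",
	}))
	fmt.Println(lookupMode(request{Keyspace: "commerce", Shard: "0"}))
	fmt.Println(lookupMode(request{Cells: []string{"zone1"}}))
	fmt.Println(lookupMode(request{}))
}
```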

wg.Wait()

if rec.HasErrors() {
if req.Strict {
Contributor

Super minor point that you are welcome to ignore -- my $0.02 is that if req.Strict || len(rec.Errors) == len(cells) is slightly more readable (or at least, what I'd expect, since it took me a sec to realize the two ifs are doing the same thing).

Contributor Author

Not exactly :) (I ran into this issue when a test broke)

If len(cells) == 0, then len(rec.Errors) == len(cells) but rec.HasErrors() == false, which is why we have to be a little more verbose/redundant here

Contributor Author

I misunderstood! I thought you were suggesting to replace the outer if rec.HasErrors() with the composite boolean, not to replace the two branches. You're totally right, and it's way easier to understand. Pushing up a fix! 😊

Contributor

@doeg doeg left a comment

30min of rerunning flaky tests and everything finally passes. Haha. Here's an extra approval for good measure.

Contributor Author

ajm188 commented Mar 22, 2021

clicky clicky on the retry button! thanks ❤️

@deepthi deepthi merged commit 683d5c8 into vitessio:master Mar 22, 2021
@askdba askdba added this to the v10.0 milestone Mar 23, 2021
rafael pushed a commit to tinyspeck/vitess that referenced this pull request Apr 6, 2021
[vtctl|vtctldserver] List/Get Tablets timeouts

Signed-off-by: Rafael Chacon <rafael@slack-corp.com>
@ajm188 ajm188 added the Type: Enhancement Logical improvement (somewhere between a bug and feature) label May 23, 2021
Labels
Component: Cluster management Type: Enhancement Logical improvement (somewhere between a bug and feature)
Development

Successfully merging this pull request may close these issues.

ListAllTablets exits with error if a cell's topo server is unreachable
4 participants