tablet picker cell alias fallback with local cell preference #11771

Closed
wants to merge 9 commits into from

Conversation

pbibra
Contributor

@pbibra pbibra commented Nov 18, 2022

Signed-off-by: Priya Bibra pbibra@slack-corp.com

Description

Allow for cell alias fallback during tablet selection for VStreams when the client does not specify a list of cells. In addition, add an option for local cell preference during tablet selection.

Overall, this PR is trying to solve 2 problems:

Problem 1: We're trying to avoid passing in the list of cells/cell alias that TabletPicker can choose from in the gRPC request itself. Instead, we will now select a candidate from a list of tablets in the VTGate's local cell + any cell within a defined cell alias for the local cell, if one exists.

e.g. We have a cell alias called region-1 which includes the following cells {region-1a, region-1b, region-1c}
A VTGate in region-1b sets up a VStream. The gRPC VStreamRequest object does not have any cells sent from the client, but a cell alias is defined.

As a result TabletPicker.PickForStreaming() will choose from candidates not just in the VTGate's local cell, but in all cells within the region-1 alias without the client having to explicitly specify this.

Problem 2: Of all the candidates selected by TabletPicker, we need a way to give priority to the local cell. This is where the localPreferenceHint comes in. It can be specified in two ways:

a. If no cells are specified by the client, the VTGate sends the list of cells to the TabletPicker as local:region-1b, region-1. The first item in the list is the local cell we want to prioritize and the second is the cell alias that region-1b belongs to.

b. The client can also specify this itself if it chooses to pass the cells in the gRPC request rather than use the new flag. It would send local:,region-1 or, if there is no cell alias, something like local:,region-1a, region-1c within the VStreamRequest object.

In both cases, TabletPicker can then order candidates like:

tablet-1-region-1b - PRIMARY
tablet-2-region-1b - REPLICA
tablet-1-region-1c - REPLICA
tablet-1-region-1a - REPLICA

where tablet-1-region-1b would be prioritized since it is in the local cell. The in_order hint is then applied on top of this, where we order by tablet type within each "group" (local cells first, then all others). For example, if the tablet type ordering is inOrder:REPLICA,PRIMARY, then we get the following ordering of candidates:

tablet-2-region-1b - REPLICA
tablet-1-region-1b - PRIMARY
tablet-1-region-1c - REPLICA
tablet-1-region-1a - REPLICA
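
The two-level ordering described above (local cell first, then tablet type within each group) could be sketched as follows. The `candidate` type and `orderCandidates` function are hypothetical simplifications, not the TabletPicker's actual code; tablet types missing from the order list are assumed to rank first.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical, simplified stand-in for a tablet record.
type candidate struct {
	Name string
	Cell string
	Type string // e.g. "PRIMARY" or "REPLICA"
}

// orderCandidates puts local-cell tablets first, then applies the in_order
// tablet type ranking inside each group (local, then all others).
func orderCandidates(cands []candidate, localCell string, typeOrder []string) []candidate {
	rank := make(map[string]int, len(typeOrder))
	for i, t := range typeOrder {
		rank[t] = i
	}
	out := append([]candidate(nil), cands...) // leave the input untouched
	sort.SliceStable(out, func(i, j int) bool {
		li, lj := out[i].Cell == localCell, out[j].Cell == localCell
		if li != lj {
			return li // the local cell wins over any other cell
		}
		return rank[out[i].Type] < rank[out[j].Type]
	})
	return out
}

func main() {
	cands := []candidate{
		{"tablet-1-region-1b", "region-1b", "PRIMARY"},
		{"tablet-2-region-1b", "region-1b", "REPLICA"},
		{"tablet-1-region-1c", "region-1c", "REPLICA"},
		{"tablet-1-region-1a", "region-1a", "REPLICA"},
	}
	for _, c := range orderCandidates(cands, "region-1b", []string{"REPLICA", "PRIMARY"}) {
		fmt.Println(c.Name, "-", c.Type)
	}
}
```

A stable sort is used so that tablets with equal priority keep their incoming order.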

Testing

We did some testing on the Slack side, and it looks like cell alias selection and local cell preference are working as expected:

I1129 13:40:14.716001    4703 vstream_manager.go:445] [VSTREAM MANAGER] no cells specified by client, falling back to alias...
I1129 13:40:14.716020    4703 vstream_manager.go:461] [VSTREAM MANAGER] cells to pick from [local:us_east_1e us_east_1]
I1129 13:40:14.717516    4703 vstream_manager.go:445] [VSTREAM MANAGER] no cells specified by client, falling back to alias...
I1129 13:40:14.717536    4703 vstream_manager.go:461] [VSTREAM MANAGER] cells to pick from [local:us_east_1e us_east_1]

For a vtgate in us-east-1b, we got the following tablet selection logs:

I1128 16:48:05.447026    4703 vstream_manager.go:534] Starting to vstream from cell:"us_east_1b" ...
I1128 16:48:05.762611    4703 vstream_manager.go:534] Starting to vstream from cell:"us_east_1d" ...
I1128 16:48:05.762815    4703 vstream_manager.go:534] Starting to vstream from cell:"us_east_1e" ...
I1128 16:48:05.763515    4703 vstream_manager.go:534] Starting to vstream from cell:"us_east_1b" ...
I1128 16:48:05.765803    4703 vstream_manager.go:534] Starting to vstream from cell:"us_east_1e" ...

Related Issue(s)

More discussion here: https://vitess.slack.com/archives/C0PQY0PTK/p1668557424770139

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Documentation was added or is not required

Deployment Notes

@vitess-bot
Contributor

vitess-bot bot commented Nov 18, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.

If a new flag is being introduced:

  • Is it really necessary to add this flag?
  • Flag names should be clear and intuitive (as far as possible)
  • Help text should be descriptive.
  • Flag names should use dashes (-) as word separators rather than underscores (_).

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow should be required, the maintainer team should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should include a link to an issue that describes the bug.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from VTop, if used there.

@pbibra pbibra changed the title add cell alias fallback option with local cell preference to tablet p… tablet picker cell alias fallback with local cell preference Nov 18, 2022
@deepthi
Member

deepthi commented Nov 18, 2022

We should add @rohit-nayak-ps @mattlord as code owners for tablet_picker.go and vstream_manager.go.

@mattlord mattlord self-assigned this Nov 21, 2022
@mattlord mattlord added Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature) labels Nov 21, 2022
Contributor

@mattlord mattlord left a comment

Thank you for working on this! ❤️ The core of it seems good but the user interface is unclear to me.

When you say “a vtgate’s cell alias” what do you mean by that? Do you just mean the vtgate process's cell?

I was imagining that the user would specify a vtgate flag like --vstream-prefer-local-cell, and then if the vtgate’s local cell was one of those specified in the vstream gRPC call, we’d prefer any of the tablets in that cell.

So then if you start a vtgate this way: vtgate --tablet_types="in_order:replica,primary" --vstream-prefer-local-cell we should be able to stream from tablets in the same cell as the vtgate. It seems that instead you were thinking that in the vstream gRPC the user would add a prefix to the cells of localPreference:<local cell, which can be an alias> but that feels unintuitive to me. Maybe I'm missing some things though.

I’m not sure where Cell alias comes into play, which e.g. could be zone1,zone2,zone3 (a common alias in the past was all cells). Maybe you have N cells per data center, and an alias is for all of those local cells? Or maybe alias is errantly thrown in there?

go/vt/vtgate/vtgate.go Outdated Show resolved Hide resolved
…icker

Signed-off-by: Priya Bibra <pbibra@slack-corp.com>
Signed-off-by: Priya Bibra <pbibra@slack-corp.com>
Signed-off-by: Priya Bibra <pbibra@slack-corp.com>
Signed-off-by: Priya Bibra <pbibra@slack-corp.com>
@rohit-nayak-ps
Contributor

Thanks for the detailed use case and the PR. I wanted to bring up another aspect: the tablet picker is also used by the standard VReplication flows.

Preferring the local cell when multiple cells are provided will be useful for these flows as well. The current implementation forces the user to choose between:

  • specifying the local cell only, to avoid cross-AZ traffic. This can lead to starvation if there is no tablet in the local cell at some point
  • specifying multiple cells to always find a candidate, in which case we could incur significant cross-AZ traffic

The main issue is that the picker doesn't know the default cell of the caller today. One option is to update the tablet picker api, so that we can pass that in and add picker strategy options. The options can be used instead of the "local:" hint (and we will also deprecate the "in_order:" hint for tablet types replacing it with an option).

For cell selection the options would be:

  • Default // prefer local cell, then specified cells
  • ExtendedDefault // prefer local cell, then cell alias of local cell, then specified cells
  • Specified // prefer specified cells, no fallback

I am ok with combining the two defaults and providing the functionality of the ExtendedDefault as the Default. But it will be a breaking change and I will need to validate internally if that choice is acceptable from a backward compatibility viewpoint.

Since I am joining this discussion late, I hope I have understood your requirement correctly. I know you have already had a call with @mattlord, @pbibra, but if it will help we can have another one. I was thinking of including the requirements of #11579 as well in the refactor @HenryCaiHaiying.

Regarding implementation, since it impacts other vreplication workflows as well, if you prefer, I can work on the tablet picker refactor this week. Your PR can then use that version of the tablet picker: by specifying vtgate's cell as the local cell and any strategy override. The VStream API's VStreamFlags can be used to specify any strategy override. Users with older Debezium will end up using the default behavior in their Vitess cluster.

Let us know what you think.

@pbibra pbibra marked this pull request as ready for review November 29, 2022 22:19
@pbibra
Contributor Author

pbibra commented Nov 29, 2022

Hi @rohit-nayak-ps! Thanks for the additional context. Looks like modifying the tablet picker is the better long-term change, but I think @HenryCaiHaiying might have some concerns because we'd like this update to be in our v13 release. Modifying the tablet picker might make that backport tougher. Let's do a huddle tomorrow and I will start a chat to discuss this in Slack.

As for the options, I think having Default and ExtendedDefault be separate is completely fine as long as we can specify these server side in the VStream use case and not have to modify the client side gRPC request.

Thank you!

@HenryCaiHaiying

I think @pbibra's fix already satisfies my original request in #11579. It's better to close in on the current PR rather than doing another complex refactoring (which might also be harder to backport to v13).

@pbibra
Contributor Author

pbibra commented Nov 29, 2022

Started a discussion here: https://vitess.slack.com/archives/C04DCRJS6QZ/p1669761282103889

go/vt/discovery/tablet_picker_test.go Outdated Show resolved Hide resolved
go/vt/vtgate/vstream_manager.go Outdated Show resolved Hide resolved
Signed-off-by: Priya Bibra <pbibra@slack-corp.com>
Signed-off-by: Priya Bibra <pbibra@slack-corp.com>
@rohit-nayak-ps
Contributor

@pbibra, code changes look good. Sorry, in my previous review, I missed a doc change that needs to be part of this PR too. Similar to this update in another PR: https://github.com/vitessio/vitess/pull/11874/files#diff-ccb9bf989b7450df22db2bf4bc687668cb4046e1974dd7d5e5e12b2b54a29de3.

We need to update summary.md to add the enhancement to the picker. I would have done it except that I can't push commits to your branch. Just add a section to it summarizing your changes and I will approve.

Signed-off-by: Priya Bibra <pbibra@slack-corp.com>
@pbibra pbibra requested a review from rsajwani as a code owner December 8, 2022 17:29
Signed-off-by: pbibra <pbibra@slack-corp.com>
Signed-off-by: Priya Bibra <pbibra@slack-corp.com>
Contributor

@rohit-nayak-ps rohit-nayak-ps left a comment

nice work, lgtm!

Contributor

@mattlord mattlord left a comment

I mostly had some minor comments/nits/suggestions, but I also have some larger concerns about tablet selection and the tablet picker behavior. I'm happy to chat about them via Slack or Zoom this week. Thanks! ❤️

// 2 scenarios:
//
// 1. No cells specified by the client via the gRPC request.
// Tablets from the local cell of the VTGate AND the cell alias that this cell belongs to will be selected by default
Contributor

@mattlord mattlord Dec 12, 2022

Think we're missing a period at the end of this line (we can also split it so that it aligns with the other newlines/line lengths).

// Local cell will take precedence.
//
// 2. Cells are specified by the client via the gRPC request
// These cels will take precendence over the default local cell and its alias
Contributor

cels->cells

Comment on lines +20 to +23
In [PR 11771](https://github.com/vitessio/vitess/pull/11771) we allow for default cell alias fallback during tablet selection for VStreams when client
does not specify list of cells. In addition, we add the option for local cell preference during tablet selection.
The local cell preference takes precedence over tablet type.See PR description for examples. If a client wants to specify local cell preference in the gRPC request,
they can pass in a new "local:" tag with the rest of the cells under VStreamFlags. e.g. "local:,cella,cellb".
Contributor

@mattlord mattlord Dec 12, 2022

Suggested tweaks:

In [PR 11771](https://github.com/vitessio/vitess/pull/11771) we modify the default [TabletPicker](https://vitess.io/docs/16.0/reference/vreplication/tablet_selection/)
behavior during tablet selection for [`VStreams`](https://vitess.io/docs/16.0/concepts/vstream/):
  - OLD: look for candidate tablets in the local cell
  - NEW: look for candidate tablets in the local cell, if none are found, use the local cell's cell alias — if it has one — as a fallback

In addition, we add support for the `local` notation when the client *does* specify a list of cells, e.g.: `--cells="local:zone1a,zone1b,zone1c"`
with `vtctldclient` commands and `VStreamFlags.Cells = "local:zone1a,zone1b,zone1c"` in the
[VStreamFlags](https://pkg.go.dev/vitess.io/vitess/go/vt/proto/vtgate#VStreamFlags) with the vtgate VStream RPC.
The local cell will then always be searched first and takes precedence over any others specified.
See [the PR](https://github.com/vitessio/vitess/pull/11771) description for examples.

tcases := []testCase{
{
"local preference",
Contributor

IMO this would be much easier to read if we had the struct field names, e.g.:

name: "local preference",
...
wantTablets: []uint32{102, 103},

Otherwise the reader needs to keep the struct's field indexes in their head.

@@ -161,7 +161,7 @@ func (tw *TopologyWatcher) loadTablets() {
return
default:
}
-log.Errorf("cannot get tablets for cell: %v: %v", tw.cell, err)
+log.Errorf("cannot get tablets for cell:%v: %v", tw.cell, err)
Contributor

IMO this would be slightly better:

log.Errorf("cannot get tablets for cell %q: %v", tw.cell, err)

-for _, cell := range strings.Split(strings.TrimSpace(vs.optCells), ",") {
+for i, cell := range strings.Split(strings.TrimSpace(vs.optCells), ",") {
// if the local tag is passed in, we must give local cell priority
// during tablet selection. Append the VTGate's local cell to the list of cells
Contributor

More accurately, I think we're prepending the local cell.

Comment on lines +446 to +454
log.Info("No cells provided by client, falling back to local cell and alias...\n")
// append the alias this cell belongs to, otherwise appends the vtgate's cell
alias := topo.GetAliasByCell(ctx, vs.ts, vs.vsm.cell)
// an alias was actually found
if alias != vs.vsm.cell {
// send in the vtgate's cell for local cell preference
cells = append(cells, fmt.Sprintf("local:%s", vs.vsm.cell))
}
cells = append(cells, alias)
Contributor

@mattlord mattlord Dec 12, 2022

Hmm, so this new behavior is also specific to vtgate vstreams at this point. I thought we wanted to change the default behavior of the tablet picker itself so that when no cells are specified then we use the cell alias as a fallback when we find no candidate tablets in the local cell. That's what the docs seemed to suggest and I think that we should do that.

Keep in mind that we need to document this new behavior here: https://vitess.io/docs/16.0/reference/vreplication/tablet_selection/
We should also modify it to note that it applies to vtgate vstreams as well.

P.S. I have updated/corrected that page here in a docs PR: https://deploy-preview-1267--vitess.netlify.app/docs/16.0/reference/vreplication/tablet_selection/

Comment on lines +286 to +288
{"default-local", "", false, []string{"aa"}},
{"default-local-cell-alias", "", true, []string{"local:aa", "region1"}},
{"with-opt-cells", "local:,bb,cc", true, []string{"local:aa", "bb", "cc"}},
Contributor

Same comment here about using the struct field names to make it more readable.

ctx, cancel := context.WithCancel(context.Background())
defer cancel()

cell := "aa"
Contributor

@mattlord mattlord Dec 12, 2022

We should make the local cell a test case variable too. That would have likely highlighted the (what I think is) broken local cell correction code and the "lost cell" issue too.

Comment on lines +323 to +325
cellsAlias := &topodatapb.CellsAlias{
Cells: []string{"aa", "bb"},
}
Contributor

@mattlord mattlord Dec 12, 2022

I think there's value in using different cell alias definitions too. i.e. making this a test case string slice variable rather than a bool with a single alias definition.

cells = append(cells, strings.TrimSpace(cell))
}
}

// if no override provided in gRPC request, perform cell alias fallback
if len(cells) == 0 {
log.Info("No cells provided by client, falling back to local cell and alias...\n")
Contributor

No need for an extra newline here as log messages always end with one.

@pbibra
Contributor Author

pbibra commented Dec 21, 2022

closing in favor of solution here: #11999

@pbibra pbibra closed this Dec 21, 2022
@pbibra pbibra deleted the pbibra-vstreams-cell-alias branch December 21, 2022 23:22
Labels
Component: VReplication Type: Enhancement Logical improvement (somewhere between a bug and feature)
Development

Successfully merging this pull request may close these issues.

Feature Request: Add fallback to VTGate cell alias for VStreams with local cell preference
5 participants