Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

domain: use with kv timeout feature schema load kv timeout #48017

Conversation

cfzjywxk
Copy link
Contributor

@cfzjywxk cfzjywxk commented Oct 26, 2023

What problem does this PR solve?

Issue Number: ref #48124

Problem Summary:
Use kv timeout feature for schema reload.

What is changed and how it works?

When the leader peer of meta region is slow, use kv timeout could help alleviated by enabling the KV read timeout, avoid errors like "schema lease expire". For example when injecting slowness into the meta region TiKV node,

img_v2_bdffa267-615e-43ef-8ff3-2afc80a499fg

Check List

Tests

  • Unit test

Side effects

Documentation

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Oct 26, 2023
@tiprow
Copy link

tiprow bot commented Oct 26, 2023

Hi @cfzjywxk. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -184,6 +184,7 @@ func (do *Domain) loadInfoSchema(startTS uint64) (infoschema.InfoSchema, bool, i
loadSchemaDurationTotal.Observe(time.Since(beginTime).Seconds())
}()
snapshot := do.store.GetSnapshot(kv.NewVersion(startTS))
snapshot.SetOption(kv.TiKVClientReadTimeout, uint64(3000)) // 3000ms.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the default value if not set?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have snapshot.SetOption(kv.ReplicaRead, kv.ReplicaReadMixed) set as well, otherwise it won't retry on replica, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zimulala
The default value of Get is 30s and BatchGet is 60s.

Copy link
Contributor Author

@cfzjywxk cfzjywxk Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tema
If snapshot.SetOption(kv.ReplicaRead, kv.ReplicaReadMixed) is not set, the leader peer would be tried first. If timeout happens the error is catched, the replica selector would be transferfred to TryFollower automatically , then follower read would be used on the other peers.

So it's not necessary to involve follower read in the first try I think.

Copy link
Contributor

@Tema Tema Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default value of Get is 30s and BatchGet is 60s.

If that is true, then retry does not go to replica. I've tested this branch without changes in this PR by setting lease=300s and it didn't help. If the default timeout is 30s then lease would be able to renew by falling back to heathy replica, which I didn't observe in the experiment. So the default timeout is much bigger, does not exists at all or retry does not go to replica for some reason.

Copy link
Contributor Author

@cfzjywxk cfzjywxk Oct 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Tema

The fast automatic fallback happens only when the timeout is configured to a value smaller than the default value, here is the check logic.
In the default path, the retry may happen at most maxReplicaAttempt times.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cfzjywxk! I've tested this patch on our test bed and it works pretty well!

@ti-chi-bot
Copy link

ti-chi-bot bot commented Oct 27, 2023

@Tema: adding LGTM is restricted to approvers and reviewers in OWNERS files.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Oct 30, 2023
@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Oct 31, 2023
@ti-chi-bot
Copy link

ti-chi-bot bot commented Oct 31, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: crazycs520, Tema, you06

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot
Copy link

ti-chi-bot bot commented Oct 31, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-10-30 23:32:17.616574252 +0000 UTC m=+2909535.203684396: ☑️ agreed by crazycs520.
  • 2023-10-31 01:23:59.756233691 +0000 UTC m=+2916237.343343904: ☑️ agreed by you06.

@cfzjywxk cfzjywxk merged commit da06a17 into pingcap:tidb-6.5-with-kv-timeout-feature Oct 31, 2023
3 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants