puller: fix retry logic when check store version failed #11903

Open
wants to merge 4 commits into
base: master

Conversation

lidezhu
Collaborator

@lidezhu lidezhu commented Dec 17, 2024

What problem does this PR solve?

Issue Number: close #11766

What is changed and how it works?

Change the retry logic to reload the region when client.GetStore fails.

Check List

Tests

  • Unit test

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Fix an issue where a changefeed may get stuck after scaling out new TiKV nodes.

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Dec 17, 2024
Contributor

ti-chi-bot bot commented Dec 17, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from lidezhu, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. release-note-none Denotes a PR that doesn't merit a release note. and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Dec 17, 2024

codecov bot commented Dec 17, 2024

Codecov Report

Attention: Patch coverage is 90.32258% with 3 lines in your changes missing coverage. Please review.

Project coverage is 55.1700%. Comparing base (0bb4977) to head (bacfafc).
Report is 3 commits behind head on master.

✅ All tests successful. No failed tests found.

Additional details and impacted files
Components Coverage Δ
cdc 59.6017% <90.3225%> (+0.0059%) ⬆️
dm 50.0519% <ø> (-0.0126%) ⬇️
engine 53.2336% <ø> (+0.0225%) ⬆️
Flag Coverage Δ
unit 55.1700% <90.3225%> (+0.0013%) ⬆️

Flags with carried forward coverage won't be shown.

@@               Coverage Diff                @@
##             master     #11903        +/-   ##
================================================
+ Coverage   55.1686%   55.1700%   +0.0013%     
================================================
  Files          1003       1003                
  Lines        137493     137504        +11     
================================================
+ Hits          75853      75861         +8     
- Misses        56092      56094         +2     
- Partials       5548       5549         +1     

@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Dec 17, 2024
@lidezhu
Collaborator Author

lidezhu commented Dec 18, 2024

/retest

1 similar comment
@lidezhu
Collaborator Author

lidezhu commented Dec 18, 2024

/retest

@lidezhu lidezhu changed the title from "fix retry logic when check store version failed" to "puller: fix retry logic when check store version failed" Dec 18, 2024
@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. and removed release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/needs-linked-issue labels Dec 18, 2024
@lidezhu lidezhu added needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. labels Dec 18, 2024
@lidezhu lidezhu requested review from hicqu and asddongmen December 18, 2024 03:04
}
}

func (s *requestedStream) run(ctx context.Context, c *SharedClient, rs *requestedStore) error {
	if err := version.CheckStoreVersion(ctx, c.pd, rs.storeID); err != nil {
Contributor

It seems we just need to move CheckStoreVersion out of run.

Collaborator Author

Changing requestedStream.run to return an error instead of a bool enables us to handle more kinds of errors in the future.
And what is the benefit of moving CheckStoreVersion out of run?
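
To illustrate the point about returning an error rather than a bool, here is a minimal sketch, not the code in this PR. The sentinel errors, the `action` type, and `decide` are hypothetical names introduced only for illustration; the idea is that a typed error lets the caller pick a different follow-up per failure kind, which a single bool cannot express.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors. The actual error values used by the PR
// are not shown in this thread.
var (
	errStoreUnavailable = errors.New("store unavailable")
	errStreamCanceled   = errors.New("stream canceled")
)

// action is what the caller decides to do after a run-like function returns.
type action string

const (
	actionNone         action = "none"
	actionReloadRegion action = "reload-region"
	actionStop         action = "stop"
	actionRetry        action = "retry"
)

// decide maps the error returned by a run-like function to a follow-up action.
// errors.Is also matches errors wrapped with %w.
func decide(err error) action {
	switch {
	case err == nil:
		return actionNone
	case errors.Is(err, errStoreUnavailable):
		return actionReloadRegion
	case errors.Is(err, errStreamCanceled):
		return actionStop
	default:
		return actionRetry
	}
}

func main() {
	wrapped := fmt.Errorf("check store version: %w", errStoreUnavailable)
	fmt.Println(decide(wrapped))                      // reload-region
	fmt.Println(decide(errStreamCanceled))            // stop
	fmt.Println(decide(errors.New("unexpected EOF"))) // retry
}
```

With a bool return, all three failing cases above would collapse into the same `false`.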

@asddongmen
Contributor

Below is my understanding of how this PR works. Please correct me if I am wrong.

Initial Problem:

  1. When the region fails to connect to the current store, the system attempts to switch to the next store address.

  2. After switching, the system enters the newStream function and calls stream.run.

  3. Since stream.run fails, the system retrieves the next store address from the region cache.

  4. However, because the information in the region cache is incorrect, the retrieved store is always unreachable, so the system falls into an infinite loop, repeating steps 1 to 3.

Fix:

  1. When the region fails to connect to the current store, log the error and attempt to switch to the next store address.

  2. After switching to the next store address, proceed to the newStream function.

  3. Within the newStream function, call stream.run. If stream.run fails, handle the error based on its type.

  4. If the error type indicates that the information in the region cache might be incorrect, reset the region cache to ensure that the next retrieved store address is accurate.
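
The fix described above can be sketched roughly as follows. This is a toy model under stated assumptions, not the PR's implementation: `regionCache`, `connect`, `errStaleCache`, and the `run`/`fetch` callbacks are hypothetical stand-ins for the real region cache, stream.run, and a lookup of fresh region information.

```go
package main

import (
	"errors"
	"fmt"
)

// errStaleCache is a hypothetical marker for failures that suggest the
// cached region information may be wrong.
var errStaleCache = errors.New("region cache may be stale")

// regionCache is a toy stand-in for the real region cache.
type regionCache struct {
	stores []string // cached store addresses for one region
	next   int      // index of the next store to try
}

// nextStore returns the next cached store address, cycling through the list.
func (c *regionCache) nextStore() string {
	s := c.stores[c.next%len(c.stores)]
	c.next++
	return s
}

// reload replaces the cached addresses with freshly fetched ones.
func (c *regionCache) reload(fresh []string) {
	c.stores = fresh
	c.next = 0
}

// connect tries up to maxAttempts stores. run stands in for stream.run and
// fetch stands in for retrieving up-to-date region information.
func connect(c *regionCache, run func(addr string) error, fetch func() []string, maxAttempts int) (string, error) {
	var lastErr error
	for i := 0; i < maxAttempts; i++ {
		if len(c.stores) == 0 {
			return "", errors.New("no stores known for region")
		}
		addr := c.nextStore()
		err := run(addr)
		if err == nil {
			return addr, nil
		}
		lastErr = err
		// Only reset the cache when the error type hints that it is stale;
		// other errors simply move on to the next cached store.
		if errors.Is(err, errStaleCache) {
			c.reload(fetch())
		}
	}
	return "", fmt.Errorf("gave up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	cache := &regionCache{stores: []string{"old-1", "old-2"}}
	run := func(addr string) error {
		if addr == "new-1" {
			return nil
		}
		return fmt.Errorf("dial %s: %w", addr, errStaleCache)
	}
	fetch := func() []string { return []string{"new-1"} }

	addr, err := connect(cache, run, fetch, 3)
	fmt.Println(addr, err) // new-1 <nil>
}
```

Without the reload step, the loop would keep cycling through `old-1` and `old-2`, which mirrors the infinite loop described under "Initial Problem".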

Labels
  • needs-cherry-pick-release-6.5: Should cherry pick this PR to release-6.5 branch.
  • needs-cherry-pick-release-7.1: Should cherry pick this PR to release-7.1 branch.
  • needs-cherry-pick-release-7.5: Should cherry pick this PR to release-7.5 branch.
  • needs-cherry-pick-release-8.1: Should cherry pick this PR to release-8.1 branch.
  • needs-cherry-pick-release-8.5: Should cherry pick this PR to release-8.5 branch.
  • release-note: Denotes a PR that will be considered when it comes time to generate release notes.
  • size/L: Denotes a PR that changes 100-499 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

cdc: fix usage of tikv go-client
3 participants