This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

lightning: check and restore pd scheduler even if our task failed #1336

Merged
merged 10 commits into from
Aug 6, 2021

Conversation

glorv
Collaborator

@glorv glorv commented Jul 12, 2021

What problem does this PR solve?

Fix a bug in concurrent import mode: if one or more Lightning instances fail before all tables have been imported, the PD schedulers are not restored.

What is changed and how it works?

This PR adds a new status to mark a task that exited before finishing. Each Lightning instance sets its status to one of (checksum_skipping, checksumming, unfinished); other Lightning instances check these statuses to determine whether they are the last running instance. The last instance should restore the PD scheduler configs.
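
The last-instance check described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: `taskStatus`, its constants, `doneImporting`, and `isLastRunning` are hypothetical names, and the statuses are held in a map here, whereas the real implementation reads them from the task meta table.

```go
package main

import "fmt"

// taskStatus is a hypothetical stand-in for the per-instance status
// that each Lightning instance records in the task meta table.
type taskStatus int

const (
	statusImporting        taskStatus = iota // still importing tables
	statusChecksumSkipping                   // import done, checksum skipped
	statusChecksumming                       // import done, checksum running
	statusUnfinished                         // exited before finishing
)

// doneImporting reports whether an instance no longer needs the
// PD schedulers to stay paused.
func (s taskStatus) doneImporting() bool {
	return s != statusImporting
}

// isLastRunning reports whether instance self is the last one still
// importing: every other instance has either moved on to the checksum
// phase or exited early.
func isLastRunning(self int64, tasks map[int64]taskStatus) bool {
	for id, st := range tasks {
		if id == self {
			continue
		}
		if !st.doneImporting() {
			return false
		}
	}
	return true
}

func main() {
	tasks := map[int64]taskStatus{
		1: statusChecksumming,
		2: statusUnfinished,
		3: statusImporting,
	}
	fmt.Println(isLastRunning(3, tasks)) // true: 1 and 2 are done or gone
	fmt.Println(isLastRunning(1, tasks)) // false: 3 is still importing
}
```

Without the `unfinished` status, instance 2 in this example would look like it is still importing, and no instance would ever conclude that it is the last one.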

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

Side effects

  • Increased code complexity

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation

Release note

  • Fix the bug that in concurrent mode, PD schedulers may not be restored if one or more Lightning instances fail.


err := exec.Transact(ctx, "check and init task status", func(ctx context.Context, tx *sql.Tx) error {
// avoid override existing metadata if the meta is already inserted.
stmt := fmt.Sprintf(`INSERT IGNORE INTO %s (task_id, status) values (?, ?)`, m.tableName)
Collaborator

Why insert before select......

Collaborator Author

To ensure there is exactly one row representing the current Lightning instance. We also need to reset the task status if the current task's status is exit_unfinished.

@@ -1435,6 +1439,7 @@ func (rc *Controller) restoreTables(ctx context.Context) error {
// finishSchedulers()
// cancelFunc(switchBack)
// finishFuncCalled = true
taskFinished = true
Collaborator

What is it used for?

Collaborator Author

@glorv glorv Jul 13, 2021

If the current task exits before Lightning finishes (because of an error or because the user terminated it), we should not clean up the task/table meta tables even if all other Lightning instances have finished.

Collaborator

Oh, I see. finishSchedulers will be called before this method ends...... Emmmmmm, it seems difficult to understand

)

func (m taskMetaStatus) realStatus() taskMetaStatus {
return m & (taskMetaStatusExitUnfinished - 1)
Collaborator

Why define a realStatus method?
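
As background for this question, here is a minimal sketch of what the mask in `realStatus` computes. It rests on assumptions that are not visible in the quoted diff: that `taskMetaStatusExitUnfinished` is a single high bit OR-ed onto an ordinary status, and that the ordinary statuses are small integers. The constant names other than `taskMetaStatusExitUnfinished`, their values, and the `exitedUnfinished` helper are all hypothetical.

```go
package main

import "fmt"

type taskMetaStatus uint32

const (
	// Ordinary statuses (names and values assumed for illustration).
	taskMetaStatusInitial taskMetaStatus = iota
	taskMetaStatusScheduleSet
	taskMetaStatusSwitchSkipped
	taskMetaStatusSwitchBack

	// Assumed to be a single bit above all ordinary statuses, so it can
	// be combined with any of them without losing information.
	taskMetaStatusExitUnfinished taskMetaStatus = 1 << 4
)

// realStatus strips the exit flag and returns the underlying status.
// Subtracting 1 from a power of two yields a mask of all lower bits.
func (m taskMetaStatus) realStatus() taskMetaStatus {
	return m & (taskMetaStatusExitUnfinished - 1)
}

// exitedUnfinished is a hypothetical helper that reads the flag bit.
func (m taskMetaStatus) exitedUnfinished() bool {
	return m&taskMetaStatusExitUnfinished != 0
}

func main() {
	s := taskMetaStatusSwitchSkipped | taskMetaStatusExitUnfinished
	fmt.Println(s.realStatus() == taskMetaStatusSwitchSkipped) // true
	fmt.Println(s.exitedUnfinished())                          // true
}
```

Under these assumptions, one stored value carries both how far the task progressed and whether it exited early, and `realStatus` recovers the former. The `state` column that appears later in this conversation stores the flag separately instead.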

Collaborator

@Little-Wallace Little-Wallace left a comment

LGTM

@ti-chi-bot
Member

ti-chi-bot commented Jul 13, 2021

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • Little-Wallace
  • gozssky

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added the status/LGT1 LGTM1 label Jul 13, 2021
Comment on lines 1440 to 1442
// finishSchedulers()
// cancelFunc(switchBack)
// finishFuncCalled = true
Member

Need clean up?

Collaborator Author

We originally wanted to restore the PD schedulers and switch TiKV back to normal mode as soon as the data import finished, so that the cluster could rebalance during checksum and analyze. But in our tests, this rebalancing had a non-trivial impact on checksum and analyze, so we need to investigate further before deciding whether we can still do this. I think we can keep these lines until we reach a clear conclusion.

Collaborator

@kennytm kennytm left a comment

rest LGTM


err := exec.Transact(ctx, "check and init task status", func(ctx context.Context, tx *sql.Tx) error {
// avoid override existing metadata if the meta is already inserted.
stmt := fmt.Sprintf(`INSERT INTO %s (task_id, status) values (?, ?) ON DUPLICATE KEY UPDATE state = ?`, m.tableName)
Collaborator

the comment feels outdated.

@@ -105,6 +105,7 @@ const (
task_id BIGINT(20) UNSIGNED NOT NULL,
pd_cfgs VARCHAR(2048) NOT NULL DEFAULT '',
status VARCHAR(32) NOT NULL,
state TINYINT(1) NOT NULL DEFAULT 0 COMMENT '0: normal, 1: exited before finish',
Collaborator

Should taskMetaTableName be changed?

Otherwise, if Lightning was used before, this CREATE TABLE IF NOT EXISTS means the existing task_meta table with the old structure will be used.

Collaborator Author

Not sure. But this feature is not GA yet, and if Lightning exits before finishing, the current logic may still be unable to recover unless the meta schema is dropped manually. We also don't recommend changing the Lightning binary during an import task.

Contributor

@sleepymole sleepymole left a comment

LGTM after we have a consensus on #1336 (comment)

@sleepymole
Contributor

/run-integration_test

@sleepymole
Contributor

Unstable configlist test will be fixed by #1393.

@sleepymole
Contributor

/run-integration_test

@ti-chi-bot ti-chi-bot added status/LGT2 LGTM2 and removed status/LGT1 LGTM1 labels Jul 28, 2021
@glorv
Collaborator Author

glorv commented Aug 6, 2021

/merge

@ti-chi-bot
Member

This pull request has been accepted and is ready to merge.

Commit hash: 799f4e0

@glorv
Collaborator Author

glorv commented Aug 20, 2021

/cherry-pick release-5.1

ti-chi-bot pushed a commit to ti-chi-bot/br that referenced this pull request Aug 20, 2021
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created: #1422.

@glorv glorv deleted the check-restore-schedule branch August 20, 2021 08:50
@ti-chi-bot
Member

@glorv: new pull request could not be created: failed to create pull request against pingcap/br#release-5.1 from head ti-chi-bot:cherry-pick-1336-to-release-5.1: status code 422 not one of [201], body: {"message":"Validation Failed","errors":[{"resource":"PullRequest","code":"custom","message":"A pull request already exists for ti-chi-bot:cherry-pick-1336-to-release-5.1."}],"documentation_url":"https://docs.github.com/rest/reference/pulls#create-a-pull-request"}

In response to this:

/cherry-pick release-5.1

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.
