disttask: maintain managed nodes separately #49623

Merged: 12 commits merged into pingcap:master on Dec 22, 2023


@D3Hunter (Contributor) commented Dec 20, 2023

What problem does this PR solve?

Issue Number: ref #49008

Problem Summary:

What changed and how does it work?

  • GetEligibleInstances returns the target nodes the task can run on; if it returns no nodes, all managed nodes are used.
  • Add a nodeManager to maintain the managed nodes.
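The node-selection rule in the first bullet can be sketched as follows. This is a minimal illustration, not the PR's code: `selectExecNodes` is a hypothetical helper, and it assumes eligible and managed nodes are both identified by `host:port` strings, like the `":4000"` values used in the tests below.

```go
package main

import "fmt"

// selectExecNodes is a hypothetical helper illustrating the rule from the PR
// description: if the task reports eligible instances, run only on those;
// an empty list means "no restriction", so fall back to all managed nodes.
func selectExecNodes(eligible, managed []string) []string {
	if len(eligible) > 0 {
		return eligible
	}
	// Return a copy so callers cannot mutate the cached managed-node slice.
	out := make([]string, len(managed))
	copy(out, managed)
	return out
}

func main() {
	managed := []string{":4000", ":4001", ":4002"}
	fmt.Println(selectExecNodes(nil, managed))               // [:4000 :4001 :4002]
	fmt.Println(selectExecNodes([]string{":4001"}, managed)) // [:4001]
}
```

Note that the sketch does not intersect the eligible list with the managed nodes; as the review discussion below points out, an instance used for server-disk import might not be managed by the framework.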

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Dec 20, 2023

codecov bot commented Dec 20, 2023

Codecov Report

Merging #49623 (c18fddc) into master (da460f1) will increase coverage by 0.4362%.
Report is 36 commits behind head on master.
The diff coverage is 41.1042%.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #49623        +/-   ##
================================================
+ Coverage   70.9801%   71.4163%   +0.4362%     
================================================
  Files          1368       1427        +59     
  Lines        398761     423088     +24327     
================================================
+ Hits         283041     302154     +19113     
- Misses        95945     102010      +6065     
+ Partials      19775      18924       -851     
Flag Coverage Δ
integration 44.0873% <41.1042%> (?)
unit 70.9803% <ø> (+0.0002%) ⬆️

Flags with carried forward coverage won't be shown.

Components Coverage Δ
dumpling 53.9663% <ø> (ø)
parser ∅ <ø> (∅)
br 46.4459% <ø> (-6.4367%) ⬇️

pkg/ddl/backfilling_dist_scheduler.go (outdated, resolved)
return nil, err
}
logutil.Logger(s.logCtx).Debug("eligible instances", zap.Int("num", len(serverNodes)))
if len(serverNodes) == 0 {
Contributor

Then IMPORT INTO can't scale out to new nodes during execution?

Contributor Author

This part is the same as before: if the task can only run on some nodes (len > 0), we use those nodes; otherwise we can use all managed nodes.

Contributor

I get it: the eligible instances only matter for local mode.

@@ -173,8 +172,7 @@ func TestBackfillingSchedulerGlobalSortMode(t *testing.T) {
 taskID, err := mgr.CreateTask(ctx, task.Key, proto.Backfill, 1, task.Meta)
 require.NoError(t, err)
 task.ID = taskID
-serverInfos, _, err := sch.GetEligibleInstances(context.Background(), task)
-require.NoError(t, err)
+serverInfos := []string{":4000"}
Contributor

Suggested change:
- serverInfos := []string{":4000"}
+ execIDs := []string{":4000"}

// if returned instances is empty, it means all instances are eligible.
// TODO: run import from server disk using framework makes this logic complicated,
// the instance might not be managed by framework.
GetEligibleInstances(ctx context.Context, task *proto.Task) ([]string, error)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can remove this interface?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not now, server disk import requires this

@@ -120,9 +121,18 @@ func (sm *Manager) Start() {
failpoint.Inject("disableSchedulerManager", func() {
failpoint.Return()
})
// init cached managed nodes
sm.nodeMgr.refreshManagedNodes(sm.ctx, sm.taskMgr)
Contributor

What if the refresh fails?

Contributor Author

It's called periodically, once per second; the call here just initializes the cache so the tests pass.


@ywqzzy (Contributor) commented Dec 21, 2023

/cc @tangenta

@ti-chi-bot ti-chi-bot bot requested a review from tangenta December 21, 2023 06:18
type NodeManager struct {
// prevLiveNodes is used to record the live nodes in last checking.
prevLiveNodes map[string]struct{}
managedNodes atomic.Pointer[[]string]
Contributor

Add some comments for managedNodes.
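One way to address this comment is sketched below. It assumes, as the excerpt suggests, that `managedNodes` is written by the periodic refresh and read concurrently by schedulers; storing a fresh slice through `atomic.Pointer` lets readers load a consistent snapshot without a lock. `nodeManager`, `setManagedNodes`, and `getManagedNodes` are hypothetical names, and `atomic.Pointer` needs Go 1.19 or later.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// nodeManager is a hypothetical, simplified stand-in for NodeManager.
type nodeManager struct {
	// managedNodes caches the nodes the framework may schedule subtasks on.
	// In this sketch it is written only by the periodic refresh and read
	// concurrently by schedulers. Every write stores a brand-new slice and a
	// published slice is never mutated, so readers can use it without locking.
	managedNodes atomic.Pointer[[]string]
}

// setManagedNodes publishes a private copy of nodes as the new snapshot.
func (nm *nodeManager) setManagedNodes(nodes []string) {
	snapshot := append([]string(nil), nodes...)
	nm.managedNodes.Store(&snapshot)
}

// getManagedNodes returns the latest snapshot; it is nil before the first refresh.
func (nm *nodeManager) getManagedNodes() []string {
	p := nm.managedNodes.Load()
	if p == nil {
		return nil
	}
	return *p
}

func main() {
	nm := &nodeManager{}
	fmt.Println(nm.getManagedNodes() == nil) // true
	nm.setManagedNodes([]string{":4000", ":4001"})
	fmt.Println(nm.getManagedNodes()) // [:4000 :4001]
}
```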

@ywqzzy (Contributor) left a comment

LGTM

pkg/disttask/framework/scheduler/nodes.go (outdated, resolved)
@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Dec 21, 2023
Co-authored-by: EasonBall <592838129@qq.com>
@okJiang (Member) left a comment

17/28

@okJiang (Member) left a comment

Rest LGTM.

pkg/disttask/framework/scheduler/scheduler.go (resolved)

ti-chi-bot bot commented Dec 22, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: okJiang, ywqzzy


Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added approved lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Dec 22, 2023

ti-chi-bot bot commented Dec 22, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-12-21 06:27:43.580311029 +0000 UTC: ☑️ agreed by ywqzzy.
  • 2023-12-22 07:51:42.747433725 +0000 UTC: ☑️ agreed by okJiang.

@ti-chi-bot ti-chi-bot bot merged commit 3e9bd47 into pingcap:master Dec 22, 2023
17 of 18 checks passed
@D3Hunter D3Hunter deleted the extr-global-mgr branch December 22, 2023 09:16