scheduler: refine scheduler error message #1373

weekface · 2019-12-19T08:39:54Z

What problem does this PR solve?

This PR fixes #1353 and refines the tidb-scheduler error messages.

Return an empty node list, not an error, when no node can be scheduled to (Improve error messages for "Failed filter with extender" #1346)
Add more error messages to K8s Events
Refine the error messages

Before this PR:

Now:

What is changed and how does it work?

Check List

Tests

Unit test

Code changes

Has Go code change
Has documents change

Side effects

Related changes

Need to cherry-pick to the release branch
Need to update the documentation

Does this PR introduce a user-facing change?:

scheduler: refine scheduler error message

weekface · 2019-12-19T08:53:39Z

pkg/scheduler/predicates/ha.go

 	h := &ha{
-		kubeCli:  kubeCli,
-		cli:      cli,
-		recorder: recorder,


The Event will be emitted in the scheduler package, not here.

weekface · 2019-12-19T08:56:18Z

pkg/scheduler/predicates/ha.go

-		if component == label.PDLabelVal {
+	maxPodsPerNode := 0
+
+	if component == label.PDLabelVal {


This if ... else ... block calculates the maxPodsPerNode variable, which depends only on the replicas variable, so I moved it out of the for ... range nodeMap block.

We also want to use maxPodsPerNode in the error message.

@cofyc
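
The refactor described in this comment amounts to hoisting a loop-invariant calculation. Below is a minimal sketch of that idea, not the code in ha.go: the helper name `maxPodsPerNodeFor`, the formulas inside it, and the `filterNodes` signature are all assumptions made for illustration, since the real calculation is not shown in this thread.

```go
package main

import "fmt"

// maxPodsPerNodeFor is a hypothetical stand-in for the if/else block in ha.go.
// The formulas are placeholders for illustration only.
func maxPodsPerNodeFor(component string, replicas int) int {
	if component == "pd" {
		// Placeholder: keep fewer than half of the replicas on one node.
		n := (replicas+1)/2 - 1
		if n <= 0 {
			n = 1
		}
		return n
	}
	// Placeholder for other components: spread over roughly three nodes.
	return (replicas + 2) / 3
}

// filterNodes keeps the nodes that hold fewer pods than the per-node limit.
// podsOnNode maps a node name to the number of pods already placed there.
func filterNodes(component string, replicas int, nodeNames []string, podsOnNode map[string]int) ([]string, error) {
	// Computed once: the limit depends only on component and replicas,
	// not on the node currently being inspected.
	maxPodsPerNode := maxPodsPerNodeFor(component, replicas)

	var candidates []string
	for _, name := range nodeNames {
		if podsOnNode[name] < maxPodsPerNode {
			candidates = append(candidates, name)
		}
	}
	if len(candidates) == 0 {
		// Because the value lives outside the loop, it can be reported here.
		return nil, fmt.Errorf("no node has fewer than %d %s pods (replicas: %d)",
			maxPodsPerNode, component, replicas)
	}
	return candidates, nil
}

func main() {
	// Sample data; the node names are made up.
	nodes, err := filterNodes("pd", 3,
		[]string{"node-1", "node-2"},
		map[string]int{"node-1": 1, "node-2": 0})
	fmt.Println(nodes, err)
}
```

Keeping the value in a variable outside the loop is what makes it available when the error message is built after the loop.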

weekface · 2019-12-19T08:57:46Z

pkg/scheduler/predicates/stable_scheduling.go

-		p.recorder.Event(pod, apiv1.EventTypeWarning, UnableToRunOnPreviousNodeReason, msg)
-	} else {
-		glog.V(2).Infof("no previous node exists for pod %q in TiDB cluster %s/%q", podName, ns, tcName)
+		return nodes, fmt.Errorf("cannot run on its previous node %q", nodeName)


return an error here to let scheduler package emit an Event. @cofyc

weekface · 2019-12-19T08:58:19Z

pkg/scheduler/scheduler.go

@@ -115,7 +119,10 @@ func (s *scheduler) Filter(args *schedulerapiv1.ExtenderArgs) (*schedulerapiv1.E
 		glog.Infof("entering predicate: %s, nodes: %v", predicate.Name(), predicates.GetNodeNames(kubeNodes))
 		kubeNodes, err = predicate.Filter(instanceName, pod, kubeNodes)
 		if err != nil {
-			return nil, err
+			s.recorder.Event(pod, apiv1.EventTypeWarning, predicate.Name(), err.Error())


emit an Event if the error is not nil.

cofyc

LGTM

onlymellb · 2019-12-19T10:07:56Z

pkg/scheduler/scheduler.go

@@ -115,7 +119,10 @@ func (s *scheduler) Filter(args *schedulerapiv1.ExtenderArgs) (*schedulerapiv1.E
 		glog.Infof("entering predicate: %s, nodes: %v", predicate.Name(), predicates.GetNodeNames(kubeNodes))
 		kubeNodes, err = predicate.Filter(instanceName, pod, kubeNodes)
 		if err != nil {
-			return nil, err
+			s.recorder.Event(pod, apiv1.EventTypeWarning, predicate.Name(), err.Error())
+			return &schedulerapiv1.ExtenderFilterResult{


When an error occurs, we need to check whether kubeNodes is empty. If it is empty, the error should be returned as usual. If it is not empty, scheduling should continue.

fixed @onlymellb PTAL

onlymellb · 2019-12-19T10:08:59Z

pkg/scheduler/predicates/stable_scheduling.go

-		p.recorder.Event(pod, apiv1.EventTypeWarning, UnableToRunOnPreviousNodeReason, msg)
-	} else {
-		glog.V(2).Infof("no previous node exists for pod %q in TiDB cluster %s/%q", podName, ns, tcName)
+		return nodes, fmt.Errorf("cannot run on its previous node %q", nodeName)


Suggested change

-		return nodes, fmt.Errorf("cannot run on its previous node %q", nodeName)
+		return nodes, fmt.Errorf("cannot run %s/%s on its previous node %q", ns, podName, nodeName)
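
To make the difference between the two messages concrete, here is a small sketch that formats both. The function names `oldMessage` and `newMessage` and the sample namespace, pod, and node names are assumptions for illustration; only the two format strings come from the suggestion.

```go
package main

import "fmt"

// oldMessage mirrors the original error text, which names only the node.
func oldMessage(nodeName string) error {
	return fmt.Errorf("cannot run on its previous node %q", nodeName)
}

// newMessage mirrors the suggested text, which also identifies the pod
// as namespace/name.
func newMessage(ns, podName, nodeName string) error {
	return fmt.Errorf("cannot run %s/%s on its previous node %q", ns, podName, nodeName)
}

func main() {
	// The namespace, pod, and node names below are made-up sample values.
	fmt.Println(oldMessage("node-1"))
	fmt.Println(newMessage("demo-ns", "demo-tikv-0", "node-1"))
}
```

Since the message ends up in an Event, naming the pod lets a reader tell which pod is affected without looking anything else up.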

cofyc

LGTM

tennix · 2019-12-23T07:06:46Z

pkg/scheduler/predicates/stable_scheduling.go

 	}
+	glog.V(2).Infof("no previous node exists for pod %q in TiDB cluster %s/%q", podName, ns, tcName)


According to the log message, there is no node for scheduling, but the code that follows returns the nodes with a nil error.

This happens when we can't find the previous node of this TiDB member, e.g. when the tidb pod is new, so we can run it on any node.

Yes, I got it. But we should log at warning level and also emit this event to users.

After the change in this PR, the error returned here will be propagated to users. @weekface can you help change the log level to warn and return the log message as an error too?

comments addressed, @tennix @cofyc PTAL

cofyc · 2019-12-23T07:40:39Z

pkg/scheduler/scheduler.go

+			if len(kubeNodes) == 0 {
+				// do not return error to k8s: https://github.com/pingcap/tidb-operator/issues/1353
+				return nil, nil
+			}


Is nil semantically equal to a filter result with empty nodes? If so, these three lines seem unnecessary because, at the end of this function, it will return a filter result with an empty node list.

Is nil semantically equal to a filter result with empty nodes?

Yes.

If so, these three lines seem unnecessary because, at the end of this function, it will return a filter result with an empty node list.

This return is inside a for ... range loop. There is no need to run the next predicate when kubeNodes is empty.

What about changing return nil, nil to break?

if len(kubeNodes) == 0 { break }

I think it's good
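
The control flow agreed on in this thread can be sketched as follows. This is a simplified model, not the code in scheduler.go: the `Predicate` interface, the `eventRecorder` type, and `runPredicates` are hypothetical stand-ins that use plain strings instead of Kubernetes node and pod objects.

```go
package main

import (
	"errors"
	"fmt"
)

// Predicate is a simplified, hypothetical version of the predicate interface.
type Predicate interface {
	Name() string
	Filter(nodes []string) ([]string, error)
}

// eventRecorder is a hypothetical stand-in for the Kubernetes event recorder.
type eventRecorder struct{ events []string }

func (r *eventRecorder) Event(reason, message string) {
	r.events = append(r.events, reason+": "+message)
}

// funcPredicate adapts a plain function to the Predicate interface.
type funcPredicate struct {
	name string
	fn   func([]string) ([]string, error)
}

func (p funcPredicate) Name() string                            { return p.name }
func (p funcPredicate) Filter(n []string) ([]string, error)     { return p.fn(n) }

// runPredicates records every predicate error as an event. It stops early
// only when an error leaves no candidate nodes; otherwise it keeps going.
// It always returns a node list (possibly empty) instead of an error.
func runPredicates(rec *eventRecorder, predicates []Predicate, nodes []string) []string {
	for _, p := range predicates {
		var err error
		nodes, err = p.Filter(nodes)
		if err != nil {
			rec.Event(p.Name(), err.Error())
			if len(nodes) == 0 {
				break // no need to run the remaining predicates
			}
		}
	}
	return nodes
}

// demo runs three sample predicates and reports what happened.
func demo() (remaining []string, events []string, thirdCalled bool) {
	rec := &eventRecorder{}
	predicates := []Predicate{
		// Returns an error but still leaves candidates: scheduling continues.
		funcPredicate{"First", func(n []string) ([]string, error) {
			return n, errors.New("cannot run on its previous node")
		}},
		// Returns an error and no candidates: the loop stops here.
		funcPredicate{"Second", func(n []string) ([]string, error) {
			return nil, errors.New("no suitable node")
		}},
		funcPredicate{"Third", func(n []string) ([]string, error) {
			thirdCalled = true
			return n, nil
		}},
	}
	remaining = runPredicates(rec, predicates, []string{"node-1", "node-2"})
	return remaining, rec.events, thirdCalled
}

func main() {
	remaining, events, thirdCalled := demo()
	fmt.Println(len(remaining), len(events), thirdCalled)
}
```

Using break rather than an early return keeps a single exit point, so the empty node list is built in the same place as the non-empty one.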

cofyc

LGTM

cofyc

LGTM

weekface · 2019-12-23T09:31:16Z

/run-e2e-tests

tennix

LGTM

tennix · 2019-12-23T09:51:40Z

/merge

sre-bot · 2019-12-23T09:51:44Z

Your auto merge job has been accepted, waiting for 1385, 1383

sre-bot · 2019-12-23T10:13:08Z

/run-all-tests

sre-bot · 2019-12-23T10:50:42Z

/run-all-tests

tennix · 2019-12-23T11:27:42Z

/merge

sre-bot · 2019-12-23T11:28:08Z

/run-all-tests

scheduler: refine scheduler error message

2087cd6

weekface requested review from cofyc and tennix December 19, 2019 08:40

weekface added the status/PTAL PR needs to be reviewed label Dec 19, 2019

fix make check error

9f59411

weekface commented Dec 19, 2019

cofyc previously approved these changes Dec 19, 2019

weekface requested a review from onlymellb December 19, 2019 09:22

onlymellb reviewed Dec 19, 2019

Update pkg/scheduler/predicates/ha.go

f365ee7

Co-Authored-By: onlymellb <luolibin@pingcap.com>

weekface dismissed cofyc’s stale review via f365ee7 December 19, 2019 09:35

fix typo

98a626e

onlymellb reviewed Dec 19, 2019

weekface added 2 commits December 19, 2019 19:47

address comment

c7c3f2f

address comment

7840d22

cofyc previously approved these changes Dec 19, 2019

weekface added the status/LGTM1 label Dec 20, 2019

tennix reviewed Dec 23, 2019

cofyc reviewed Dec 23, 2019

address comment

182874e

weekface dismissed cofyc’s stale review via 182874e December 23, 2019 08:18

adress comment

bc7f6aa

cofyc previously approved these changes Dec 23, 2019

remove comment

2dfc38f

weekface dismissed cofyc’s stale review via 2dfc38f December 23, 2019 08:56

cofyc approved these changes Dec 23, 2019

Merge branch 'master' into refine-scheduler

8e91ce2

tennix approved these changes Dec 23, 2019

sre-bot added the status/can-merge label Dec 23, 2019

Merge branch 'master' into refine-scheduler

9a3e4c6

weekface mentioned this pull request Dec 23, 2019

we need to emit important messages as k8s events #165

Closed

2 tasks

sre-bot merged commit 72f4453 into pingcap:master Dec 23, 2019

weekface mentioned this pull request Dec 24, 2019

Automated cherry pick of #1373: scheduler: refine scheduler error message #1400

Merged

weekface added a commit to weekface/tidb-operator that referenced this pull request Dec 24, 2019

Merge branch 'release-1.0' into automated-cherry-pick-of-pingcap#1373-…

f79776e

…release-1.0

weekface added a commit to weekface/tidb-operator that referenced this pull request Dec 25, 2019

Merge branch 'release-1.0' into automated-cherry-pick-of-pingcap#1373-…

7aa6fd6

…release-1.0

weekface added a commit that referenced this pull request Dec 25, 2019

Automated cherry pick of #1373: scheduler: refine scheduler error mes…

f10cdb8

…sage (#1400)

weekface deleted the refine-scheduler branch January 3, 2020 02:53

weekface mentioned this pull request Jan 6, 2020

exposing the tidb cluster states/changes as k8s events to users #1375

Closed

7 tasks

cofyc mentioned this pull request Jan 17, 2020

Improve error messages for "Failed filter with extender" #1346

Closed
