
Improve e2e test flakiness #1825

Closed (wanted to merge 3 commits)
Conversation

@chmouel (Member) commented Jan 8, 2020

When we run a lot of e2e tests on reasonably sized VMs we get some
OOMKilled pods, or a race between the Task/Pipeline objects and their Runs.

For the OOMKills, we reduce the number of TaskRuns running concurrently from 25 to 5;
I am not sure what the advantage of a greater number is, and this eases the load
on the nodes.

For the race, we wait until the objects have been created before creating
the *Run objects, so the TaskRun doesn't get created before its Task (or
Pipeline) exists.
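In test terms, the ordering change described above looks roughly like this (a sketch that uses the WaitForTaskCreated helper added in this PR; echoTask, echoTaskRun and the client field names are illustrative assumptions, not code from the PR):

// Before this PR, the TaskRun could be created while its Task was still
// being persisted, which is the race described above.
if _, err := c.TaskClient.Create(echoTask); err != nil {
	t.Fatalf("Failed to create echo Task: %s", err)
}

// With this PR, we block until the Task is actually visible before
// creating the TaskRun that references it.
if err := WaitForTaskCreated(c, "echo-task", "TaskCreated"); err != nil {
	t.Fatalf("Task echo-task has not been created: %s", err)
}
if _, err := c.TaskRunClient.Create(echoTaskRun); err != nil {
	t.Fatalf("Failed to create TaskRun: %s", err)
}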

Fixes #1820
Fixes #1819
Fixes #1815


Make sure we wait until the Task and Pipeline have been created before creating the
PipelineRun or TaskRun, or we end up with a race where the Task has not
had time to be created (due to some load?) before the TaskRun gets executed.

Signed-off-by: Chmouel Boudjnah <chmouel@redhat.com>
Reduce the number of running tests from 25 to 5.

Signed-off-by: Chmouel Boudjnah <chmouel@redhat.com>
@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Jan 8, 2020
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 8, 2020
@tekton-robot (Collaborator)

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
test/wait.go 50.0% 40.9% -9.1

@chmouel (Member Author) commented Jan 8, 2020

/test pull-tekton-pipeline-integration-tests

@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 8, 2020
Let's make sure to test the test helpers!

Signed-off-by: Chmouel Boudjnah <chmouel@redhat.com>
@tekton-robot (Collaborator)

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
test/wait.go 50.0% 56.8% 6.8

@@ -29,7 +29,7 @@ import (
knativetest "knative.dev/pkg/test"
)

// TestDuplicatePodTaskRun creates 10 builds and checks that each of them has only one build pod.
// TestDuplicatePodTaskRun creates 5 builds and checks that each of them has only one build pod.
Member

nit: This should say "TaskRuns", and not "builds"

Member

it shouldn't say "build pod" either

@@ -38,7 +38,7 @@ func TestDuplicatePodTaskRun(t *testing.T) {
defer tearDown(t, c, namespace)

var wg sync.WaitGroup
for i := 0; i < 25; i++ {
for i := 0; i < 5; i++ {
Member

It's not your code, but this gives us an opportunity to clean up: wg.Add should go as close to defer wg.Done as possible. Can you move this to line ~55?
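A standalone sketch of the pattern being suggested, with the TaskRun creation replaced by a placeholder (names here are illustrative, not the actual test code):

package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1) // Add sits right next to the goroutine whose defer calls Done
		go func(i int) {
			defer wg.Done() // pairs visibly with the Add above
			fmt.Println("would create and check TaskRun", i)
		}(i)
	}
	wg.Wait()
}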

@@ -29,7 +29,7 @@ import (
knativetest "knative.dev/pkg/test"
)

// TestDuplicatePodTaskRun creates 10 builds and checks that each of them has only one build pod.
// TestDuplicatePodTaskRun creates 5 builds and checks that each of them has only one build pod.
Member

Can we remove the specific number from the comment, in case we update it in the future (and probably forget to update the comment?)

@@ -38,7 +38,7 @@ func TestDuplicatePodTaskRun(t *testing.T) {
defer tearDown(t, c, namespace)

var wg sync.WaitGroup
for i := 0; i < 25; i++ {
for i := 0; i < 5; i++ {
Member

Reducing this number reduces the test's effectiveness -- now it needs duplicate pods to be created 1 in every 5 times to be able to detect it, as opposed to just 1 in every 25.

If the real underlying problem is OOMs, can we solve this by setting resource requests/limits on the TaskRuns, so they don't get OOM-killed?
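As a rough sketch of what that suggestion could look like for a single step container (the concrete values, and the idea of wiring this into the e2e fixtures, are assumptions rather than anything this PR does):

package e2e

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// stepResources returns explicit requests/limits for a test step so the
// scheduler accounts for its memory up front instead of overcommitting the
// node; the 64Mi/128Mi/100m values are illustrative only.
func stepResources() corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("64Mi"),
			corev1.ResourceCPU:    resource.MustParse("100m"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("128Mi"),
		},
	}
}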

Member

I wonder if there is a better way to test this now. Also, this was introduced when Tekton was backed by knative/build, which is not the case anymore.

We may want to try setting resource limits to see if it fixes things 👼

Member Author

Maybe I don't understand how resource limits work, but isn't it the case that if we don't specify any limit, the pod will by default keep consuming memory until the node's limit is reached?

Member

@chmouel right, but it should also take that into account while scheduling, not allowing a pod to be scheduled on a node that is already starving 👼

@chmouel (Member Author) Jan 9, 2020

Am I understanding right that either way we would still be failing? Just failing earlier?

Member

Well, it would wait for memory to be available again before scheduling a new one if all nodes are starving, reducing the number of pods running in parallel at the same time 😅 … which may reduce the effectiveness of the test too 😓 🤔

Member Author

Are we really sure it detects duplication though? It sounds like a lot of hoops to jump through just to make sure k8s behaves right (i.e. that there are no races between task creation and start).

Member

@chmouel it did back in the day for sure (i.e. this test run against the unfixed code would fail for the right reason).

@@ -62,6 +62,12 @@ func TestDAGPipelineRun(t *testing.T) {
t.Fatalf("Failed to create echo Task: %s", err)
}

// Make sure the Task has been created (wait for it)
if err := WaitForTaskCreated(c, "echo-task", "TaskCreated"); err != nil {
Member

Would it be worth adding a helper method that creates a resource and then waits for it to exist? I'm worried we'll forget to wait for existence every time we create a Task/Pipeline/etc., and have to chase flakes forever as a result.

Do we know why Create returns before the resource is actually reliably created? That seems like a problem for lots of tests. Is there somewhere we're using a lister in the reconciler to retrieve a Task/Pipeline, where we should be using a direct client to get them, bypassing caching? We've seen listers being slow to notice new objects in the past, and have (mostly?) replaced them with regular k8s clients as a result.
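A minimal sketch of such a create-and-wait-for-existence helper, polling the API server directly rather than an informer cache (the clients struct, the TaskClient field, and the client signatures are assumptions based on the test helpers discussed in this PR):

package test

import (
	"fmt"
	"time"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1alpha1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// CreateTaskAndWait creates the Task, then polls until the API server reports
// it, so a TaskRun created afterwards cannot race ahead of its Task.
func CreateTaskAndWait(c *clients, task *v1alpha1.Task, interval, timeout time.Duration) error {
	if _, err := c.TaskClient.Create(task); err != nil {
		return fmt.Errorf("failed to create Task %q: %v", task.Name, err)
	}
	return wait.PollImmediate(interval, timeout, func() (bool, error) {
		_, err := c.TaskClient.Get(task.Name, metav1.GetOptions{})
		if errors.IsNotFound(err) {
			return false, nil // not visible yet, keep polling
		}
		if err != nil {
			return false, err // unexpected error, stop polling
		}
		return true, nil
	})
}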

Member Author

I am not so sure, and the more I read your comment the more I think my patch may not address the fundamental problem: it doesn't matter whether we know the object has been created if the real issue is the reconciler being slow to get notified.

Is there somewhere we're using a lister in the reconciler to retrieve a

I am not so sure about that (and will probably start pinging you about it if I have to figure it out :-)

Member

c := &Reconciler{
	Base:              reconciler.NewBase(opt, pipelineRunAgentName, images),
	pipelineRunLister: pipelineRunInformer.Lister(),
	pipelineLister:    pipelineInformer.Lister(),
	taskLister:        taskInformer.Lister(),
	clusterTaskLister: clusterTaskInformer.Lister(),
	taskRunLister:     taskRunInformer.Lister(),
	resourceLister:    resourceInformer.Lister(),
	conditionLister:   conditionInformer.Lister(),
	timeoutHandler:    timeoutHandler,
	metrics:           metrics,
?

// Make sure the Task has been created (wait for it)
if err := WaitForTaskCreated(c, "echo-task", "TaskCreated"); err != nil {
t.Errorf("Error waiting for Task echo-task to be created: %s", err)
t.Fatal("Pipeline execution failed, Task echo-task has not been created")
Member

nit: Don't need a t.Error and a t.Fatal, you could just combine these into one descriptive t.Fatal.
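That combined call would look roughly like this (same wording, a single t.Fatalf):

if err := WaitForTaskCreated(c, "echo-task", "TaskCreated"); err != nil {
	t.Fatalf("Pipeline execution failed, Task echo-task has not been created: %s", err)
}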

@afrittoli (Member) left a comment

Thanks for looking into this!
I think it would be best to get to the bottom of the "lister" issue.


return wait.PollImmediate(interval, timeout, func() (bool, error) {
	pc, err := c.PipelineClient.Get(name, metav1.GetOptions{})
	if pc.GetName() == name {
Member

I'm not sure why we need this check. Is it a way to check that no result was returned?
Wouldn't we then get an err != nil?
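A sketch of how that poll could lean on the error instead of comparing names (the errors.IsNotFound handling is an assumption about the intended behaviour, not code from the PR):

return wait.PollImmediate(interval, timeout, func() (bool, error) {
	if _, err := c.PipelineClient.Get(name, metav1.GetOptions{}); err != nil {
		if errors.IsNotFound(err) {
			return false, nil // not created yet, keep polling
		}
		return false, err // any other error ends the poll
	}
	return true, nil // a nil error means the Pipeline was returned
})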

@@ -68,6 +68,36 @@ type TaskRunStateFn func(r *v1alpha1.TaskRun) (bool, error)
// PipelineRunStateFn is a condition function on TaskRun used polling functions
type PipelineRunStateFn func(pr *v1alpha1.PipelineRun) (bool, error)

// WaitForPipelineCreated waits until a Pipeline has been created
func WaitForPipelineCreated(c *clients, name, desc string) error {
Member

Nice <3


@chmouel (Member Author) commented Jan 10, 2020

/hold

I think we need to get down to why the lister is slow to react first...

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 10, 2020
@bobcatfish (Collaborator)

@chmouel what do you want to do next with this? maybe make an issue to investigate the lister responsiveness and close this in the meantime?

@chmouel (Member Author) commented Feb 6, 2020

@bobcatfish Yep, sounds good. To be honest I wasn't able to recreate the OOMs, which I used to get in about 1 of every 3 runs before, so something has changed that made things better and the controller more reactive.

But I think there is an issue we can file about performance in general, to see how the controller scales on small clusters.

/close

@tekton-robot (Collaborator)

@chmouel: Closed this PR.

In response to this:

@bobcatfish Yep, sounds good. To be honest I wasn't able to recreate the OOMs, which I used to get in about 1 of every 3 runs before, so something has changed that made things better and the controller more reactive.

But I think there is an issue we can file about performance in general, to see how the controller scales on small clusters.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
7 participants