
Improve e2e test flakiness #1825

Closed (wanted to merge 3 commits)
Conversation

@chmouel (Member) commented Jan 8, 2020

When we run a lot of e2e tests on reasonably sized VMs we get some
OOMKilled pods, or a race between the Task/Pipeline objects and their Runs.

For the OOMKills, we reduce the number of TaskRuns running concurrently from 25 to 5;
I am not sure what the advantage of a greater number is, and this eases the load
on the nodes.

For the race, we wait until the objects have been created before creating
the *Run objects, so the TaskRun doesn't get created before its Task (or
Pipeline) exists.
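In test terms, the ordering change described above looks roughly like this (a sketch that uses the WaitForTaskCreated helper added in this PR; echoTask, echoTaskRun and the client field names are illustrative assumptions, not code from the PR):

// Before this PR, the TaskRun could be created while its Task was still
// being persisted, which is the race described above.
if _, err := c.TaskClient.Create(echoTask); err != nil {
	t.Fatalf("Failed to create echo Task: %s", err)
}

// With this PR, we block until the Task is actually visible before
// creating the TaskRun that references it.
if err := WaitForTaskCreated(c, "echo-task", "TaskCreated"); err != nil {
	t.Fatalf("Task echo-task has not been created: %s", err)
}
if _, err := c.TaskRunClient.Create(echoTaskRun); err != nil {
	t.Fatalf("Failed to create TaskRun: %s", err)
}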

Fixes #1820
Fixes #1819
Fixes #1815


Make sure we wait until the Task and Pipeline have been created before creating the
PipelineRun or TaskRun, or we end up with a race where the Task has not
had time to be created (due to some load?) before the TaskRun gets executed.

Signed-off-by: Chmouel Boudjnah <chmouel@redhat.com>
Reduce the number of running tests from 25 to 5.

Signed-off-by: Chmouel Boudjnah <chmouel@redhat.com>
@googlebot googlebot added the cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit label Jan 8, 2020
@tekton-robot tekton-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jan 8, 2020
@tekton-robot (Collaborator)

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
test/wait.go 50.0% 40.9% -9.1

@chmouel (Member Author) commented Jan 8, 2020

/test pull-tekton-pipeline-integration-tests

@tekton-robot (Collaborator)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: vdemeester

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@tekton-robot tekton-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 8, 2020
Let's make sure to test the test helpers!

Signed-off-by: Chmouel Boudjnah <chmouel@redhat.com>
@tekton-robot (Collaborator)

The following is the coverage report on pkg/.
Say /test pull-tekton-pipeline-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
test/wait.go 50.0% 56.8% 6.8

@@ -29,7 +29,7 @@ import (
knativetest "knative.dev/pkg/test"
)

// TestDuplicatePodTaskRun creates 10 builds and checks that each of them has only one build pod.
// TestDuplicatePodTaskRun creates 5 builds and checks that each of them has only one build pod.
Member

nit: This should say "TaskRuns", and not "builds"

Member

it shouldn't say "build pod" either

@@ -38,7 +38,7 @@ func TestDuplicatePodTaskRun(t *testing.T) {
defer tearDown(t, c, namespace)

var wg sync.WaitGroup
for i := 0; i < 25; i++ {
for i := 0; i < 5; i++ {
Member

It's not your code, but this gives us an opportunity to clean up: wg.Add should go as close to defer wg.Done as possible. Can you move this to line ~55?
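A standalone sketch of the pattern being suggested, with the TaskRun creation replaced by a placeholder (names here are illustrative, not the actual test code):

package main

import (
	"fmt"
	"sync"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 5; i++ {
		wg.Add(1) // Add sits right next to the goroutine whose defer calls Done
		go func(i int) {
			defer wg.Done() // pairs visibly with the Add above
			fmt.Println("would create and check TaskRun", i)
		}(i)
	}
	wg.Wait()
}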

@@ -29,7 +29,7 @@ import (
knativetest "knative.dev/pkg/test"
)

// TestDuplicatePodTaskRun creates 10 builds and checks that each of them has only one build pod.
// TestDuplicatePodTaskRun creates 5 builds and checks that each of them has only one build pod.
Member

Can we remove the specific number from the comment, in case we update it in the future (and probably forget to update the comment?)

@@ -38,7 +38,7 @@ func TestDuplicatePodTaskRun(t *testing.T) {
defer tearDown(t, c, namespace)

var wg sync.WaitGroup
for i := 0; i < 25; i++ {
for i := 0; i < 5; i++ {
Member

Reducing this number reduces the test's effectiveness -- now it needs duplicate pods to be created 1 in every 5 times to be able to detect it, as opposed to just 1 in every 25.

If the real underlying problem is OOMs, can we solve this by setting resource requests/limits on the TaskRuns, so they don't get OOM-killed?
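As a rough sketch of what that suggestion could look like for a single step container (the concrete values, and the idea of wiring this into the e2e fixtures, are assumptions rather than anything this PR does):

package e2e

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// stepResources returns explicit requests/limits for a test step so the
// scheduler accounts for its memory up front instead of overcommitting the
// node; the 64Mi/128Mi/100m values are illustrative only.
func stepResources() corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("64Mi"),
			corev1.ResourceCPU:    resource.MustParse("100m"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("128Mi"),
		},
	}
}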

Member

I wonder if there is a better way to test this now. Also, this was introduced when Tekton was backed by knative/build, which is not the case anymore.

We may want to try setting resource limits to see if it fixes things 👼

Member Author

Maybe I don't understand how resource limits work, but isn't it the case that if we don't specify any limit, the pod will by default keep consuming memory until the node's limit is reached?

Member

@chmouel right, but it should also take that into account while scheduling, not allowing a pod to be scheduled on a node that is already starving 👼

@chmouel (Member Author) Jan 9, 2020

Am I understanding right that either way we would still be failing? Just failing earlier?

Member

Well, it would wait for memory to be available again before scheduling a new one if all nodes are starving, reducing the number of pods running in parallel at the same time 😅 … which may reduce the effectiveness of the test too 😓 🤔

Member Author

Are we really sure it detects duplication though? It sounds like a lot of hoops to jump through just to make sure k8s behaves right (i.e. that there are no races between task creation and start).

Member

@chmouel it did back in the day for sure (i.e. this test run against the unfixed code would fail for the right reason).

@@ -62,6 +62,12 @@ func TestDAGPipelineRun(t *testing.T) {
t.Fatalf("Failed to create echo Task: %s", err)
}

// Make sure the Task has been created (wait for it)
if err := WaitForTaskCreated(c, "echo-task", "TaskCreated"); err != nil {
Member

Would it be worth adding a helper method that creates a resource and then waits for it to exist? I'm worried we'll forget to wait for existence every time we create a Task/Pipeline/etc., and have to chase flakes forever as a result.

Do we know why Create returns before the resource is actually reliably created? That seems like a problem for lots of tests. Is there somewhere we're using a lister in the reconciler to retrieve a Task/Pipeline, where we should be using a direct client to get them, bypassing caching? We've seen listers being slow to notice new objects in the past, and have (mostly?) replaced them with regular k8s clients as a result.
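A minimal sketch of such a create-and-wait-for-existence helper, polling the API server directly rather than an informer cache (the clients struct, the TaskClient field, and the client signatures are assumptions based on the test helpers discussed in this PR):

package test

import (
	"fmt"
	"time"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1alpha1"
	"k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
)

// CreateTaskAndWait creates the Task, then polls until the API server reports
// it, so a TaskRun created afterwards cannot race ahead of its Task.
func CreateTaskAndWait(c *clients, task *v1alpha1.Task, interval, timeout time.Duration) error {
	if _, err := c.TaskClient.Create(task); err != nil {
		return fmt.Errorf("failed to create Task %q: %v", task.Name, err)
	}
	return wait.PollImmediate(interval, timeout, func() (bool, error) {
		_, err := c.TaskClient.Get(task.Name, metav1.GetOptions{})
		if errors.IsNotFound(err) {
			return false, nil // not visible yet, keep polling
		}
		if err != nil {
			return false, err // unexpected error, stop polling
		}
		return true, nil
	})
}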

Member Author

I am not so sure, and the more I read your comment the more I think my patch may not address the fundamental problem: it doesn't matter whether we know the object has been created if the real issue is the reconciler being slow to get notified.

Is there somewhere we're using a lister in the reconciler to retrieve a

I am not so sure about that (and will probably start pinging you about it if I have to figure it out :-)

Member

c := &Reconciler{
	Base:              reconciler.NewBase(opt, pipelineRunAgentName, images),
	pipelineRunLister: pipelineRunInformer.Lister(),
	pipelineLister:    pipelineInformer.Lister(),
	taskLister:        taskInformer.Lister(),
	clusterTaskLister: clusterTaskInformer.Lister(),
	taskRunLister:     taskRunInformer.Lister(),
	resourceLister:    resourceInformer.Lister(),
	conditionLister:   conditionInformer.Lister(),
	timeoutHandler:    timeoutHandler,
	metrics:           metrics,
?

// Make sure the Task has been created (wait for it)
if err := WaitForTaskCreated(c, "echo-task", "TaskCreated"); err != nil {
t.Errorf("Error waiting for Task echo-task to be created: %s", err)
t.Fatal("Pipeline execution failed, Task echo-task has not been created")
Member

nit: Don't need a t.Error and a t.Fatal, you could just combine these into one descriptive t.Fatal.
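That combined call would look roughly like this (same wording, a single t.Fatalf):

if err := WaitForTaskCreated(c, "echo-task", "TaskCreated"); err != nil {
	t.Fatalf("Pipeline execution failed, Task echo-task has not been created: %s", err)
}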

@afrittoli (Member) left a comment

Thanks for looking into this!
I think it would be best to get to the bottom of the "lister" issue.


return wait.PollImmediate(interval, timeout, func() (bool, error) {
	pc, err := c.PipelineClient.Get(name, metav1.GetOptions{})
	if pc.GetName() == name {
Member

I'm not sure why we need this check. Is it a way to check that no result was returned?
Wouldn't we then get an err != nil?
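A sketch of how that poll could lean on the error instead of comparing names (the errors.IsNotFound handling is an assumption about the intended behaviour, not code from the PR):

return wait.PollImmediate(interval, timeout, func() (bool, error) {
	if _, err := c.PipelineClient.Get(name, metav1.GetOptions{}); err != nil {
		if errors.IsNotFound(err) {
			return false, nil // not created yet, keep polling
		}
		return false, err // any other error ends the poll
	}
	return true, nil // a nil error means the Pipeline was returned
})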

@@ -68,6 +68,36 @@ type TaskRunStateFn func(r *v1alpha1.TaskRun) (bool, error)
// PipelineRunStateFn is a condition function on TaskRun used polling functions
type PipelineRunStateFn func(pr *v1alpha1.PipelineRun) (bool, error)

// WaitForPipelineCreated waits until a Pipeline has been created
func WaitForPipelineCreated(c *clients, name, desc string) error {
Member

Nice <3


@chmouel (Member Author) commented Jan 10, 2020

/hold

I think we need to get down to why the lister is slow to react first...

@tekton-robot tekton-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 10, 2020
@bobcatfish (Collaborator)

@chmouel what do you want to do next with this? maybe make an issue to investigate the lister responsiveness and close this in the meantime?

@chmouel (Member Author) commented Feb 6, 2020

@bobcatfish Yep, sounds good. To be honest I wasn't able to recreate the OOMs, which I used to get in about 1 of every 3 runs before, so something has changed that made things better and the controller more reactive.

But I think there is an issue we can file about performance in general, to see how the controller scales on small clusters.

/close

@tekton-robot (Collaborator)

@chmouel: Closed this PR.

In response to this:

@bobcatfish Yep, sounds good. To be honest I wasn't able to recreate the OOMs, which I used to get in about 1 of every 3 runs before, so something has changed that made things better and the controller more reactive.

But I think there is an issue we can file about performance in general, to see how the controller scales on small clusters.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files.
cla: yes Trying to make the CLA bot happy with ppl from different companies work on one commit
do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
7 participants