Improve e2e test flakyness #1825
Conversation
Make sure we wait until the task and pipeline have been created before creating the pipelinerun or taskruns, or we end up with a race where the task has not had time to be created (due to load?) before the taskrun gets executed. Signed-off-by: Chmouel Boudjnah <chmouel@redhat.com>
Reduce the number of running tests from 25 to 5. Signed-off-by: Chmouel Boudjnah <chmouel@redhat.com>
The following is the coverage report on pkg/.
/test pull-tekton-pipeline-integration-tests
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: vdemeester. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Let's make sure to test the test helpers! Signed-off-by: Chmouel Boudjnah <chmouel@redhat.com>
The following is the coverage report on pkg/.
@@ -29,7 +29,7 @@ import (
 	knativetest "knative.dev/pkg/test"
 )
 
-// TestDuplicatePodTaskRun creates 10 builds and checks that each of them has only one build pod.
+// TestDuplicatePodTaskRun creates 5 builds and checks that each of them has only one build pod.
nit: This should say "TaskRuns", and not "builds"
It shouldn't say "build pod" either.
@@ -38,7 +38,7 @@ func TestDuplicatePodTaskRun(t *testing.T) {
 	defer tearDown(t, c, namespace)
 
 	var wg sync.WaitGroup
-	for i := 0; i < 25; i++ {
+	for i := 0; i < 5; i++ {
It's not your code, but this gives us an opportunity to clean up: wg.Add should go as close to defer wg.Done as possible. Can you move this to line ~55?
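For reference, a minimal sketch of the suggested placement, assuming the loop body spawns one goroutine per TaskRun as the existing test does:

```go
var wg sync.WaitGroup
for i := 0; i < 5; i++ {
	wg.Add(1) // increment right before spawning the goroutine that will call Done
	go func(i int) {
		defer wg.Done()
		// ... create the TaskRun and check that it produced a single pod ...
	}(i)
}
wg.Wait()
```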
@@ -29,7 +29,7 @@ import (
 	knativetest "knative.dev/pkg/test"
 )
 
-// TestDuplicatePodTaskRun creates 10 builds and checks that each of them has only one build pod.
+// TestDuplicatePodTaskRun creates 5 builds and checks that each of them has only one build pod.
Can we remove the specific number from the comment, in case we update it in the future (and probably forget to update the comment)?
@@ -38,7 +38,7 @@ func TestDuplicatePodTaskRun(t *testing.T) {
 	defer tearDown(t, c, namespace)
 
 	var wg sync.WaitGroup
-	for i := 0; i < 25; i++ {
+	for i := 0; i < 5; i++ {
Reducing this number reduces the test's effectiveness -- now it needs duplicate pods to be created 1 in every 5 times to be able to detect it, as opposed to just 1 in every 25.
If the real underlying problem is OOMs, can we solve this by setting resource requests/limits on the TaskRuns, so they don't get OOM-killed?
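If we tried the requests/limits route, a hedged sketch of what explicit resources on a test step could look like; the values are illustrative guesses rather than tuned numbers from this PR, and wiring them in assumes the Tekton Step ultimately embeds a corev1.Container:

```go
import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// testStepResources returns explicit requests/limits for a test step so the
// scheduler accounts for its memory and the container is not OOM-killed
// unexpectedly. Values are placeholders, not tuned numbers.
func testStepResources() corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("100m"),
			corev1.ResourceMemory: resource.MustParse("128Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceMemory: resource.MustParse("256Mi"),
		},
	}
}
```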
I wonder if there is a better way to test this now. Also, this was introduced when Tekton was backed by knative/build, which is not the case anymore.
We may want to try setting resource limits to see if it fixes things 👼
Maybe I don't understand how resource limits work, but isn't it the case that if we don't specify any limit, the pod will by default keep consuming memory until the node's limit is reached?
@chmouel right, but it should also take that into account while scheduling, not allowing a pod to be scheduled on a node that is already starving 👼
Am I understanding right that either way we will still be failing, just failing earlier?
Well, it would wait for memory to free up before scheduling new ones if all nodes are starving, reducing the number of pods running in parallel at the same time 😅 … which may reduce the effectiveness of the test too 😓 🤔
Are we really sure it detects duplication though? It sounds like a lot of hoops to jump through to make sure Kubernetes behaves right (i.e. that there are no races between task creation and start).
@chmouel it did back in the day for sure (i.e. running this test against the unfixed code would fail for the right reason).
@@ -62,6 +62,12 @@ func TestDAGPipelineRun(t *testing.T) {
 		t.Fatalf("Failed to create echo Task: %s", err)
 	}
 
+	// Make sure the Pipeline has been created (wait for it)
+	if err := WaitForTaskCreated(c, "echo-task", "TaskCreated"); err != nil {
Would it be worth adding a helper method to create-and-wait-for-existence a resource? I'm worried we'll forget to wait for existence every time we create a Task/Pipeline/etc., and have to chase flakes forever as a result.
Do we know why Create returns before the resource is actually reliably created? That seems like a problem for lots of tests. Is there somewhere we're using a lister in the reconciler to retrieve a Task/Pipeline, where we should be using a direct client to get them, bypassing caching? We've seen listers being slow to notice new objects in the past, and have (mostly?) replaced them with regular k8s clients as a result.
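A rough sketch of what such a create-and-wait helper could look like, assuming it lives in the same test package as the clients struct and the WaitForTaskCreated helper added in this PR; the helper name and the TaskClient field are illustrative, not existing code:

```go
import (
	"fmt"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1alpha1"
)

// CreateTaskAndWait is a hypothetical helper combining creation with the wait
// added in this PR, so callers cannot forget to wait for existence.
func CreateTaskAndWait(c *clients, task *v1alpha1.Task) error {
	if _, err := c.TaskClient.Create(task); err != nil {
		return fmt.Errorf("failed to create Task %q: %v", task.Name, err)
	}
	return WaitForTaskCreated(c, task.Name, "TaskCreated")
}
```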
I am not so sure, and the more I read your comment the more I think that my patch may not address the fundamental problem: it doesn't matter if we know it has been created if it's actually the reconciler being slow to be notified.

> Is there somewhere we're using a lister in the reconciler to retrieve a Task/Pipeline?

I am not so sure about that (and will probably start to ping you about it if I have to figure it out :-)
pipeline/pkg/reconciler/pipelinerun/controller.go, lines 72 to 82 in f1bfda6:

	c := &Reconciler{
		Base:              reconciler.NewBase(opt, pipelineRunAgentName, images),
		pipelineRunLister: pipelineRunInformer.Lister(),
		pipelineLister:    pipelineInformer.Lister(),
		taskLister:        taskInformer.Lister(),
		clusterTaskLister: clusterTaskInformer.Lister(),
		taskRunLister:     taskRunInformer.Lister(),
		resourceLister:    resourceInformer.Lister(),
		conditionLister:   conditionInformer.Lister(),
		timeoutHandler:    timeoutHandler,
		metrics:           metrics,
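To make the lister-vs-client distinction concrete, a sketch under the assumption that the generated Tekton clientset and listers of this era are used; pipelineLister comes from the snippet above, while the import paths, parameter types, and the fallback shape are assumptions for illustration:

```go
import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	"github.com/tektoncd/pipeline/pkg/apis/pipeline/v1alpha1"
	clientset "github.com/tektoncd/pipeline/pkg/client/clientset/versioned"
	listers "github.com/tektoncd/pipeline/pkg/client/listers/pipeline/v1alpha1"
)

// getPipeline sketches the two lookup paths being discussed.
func getPipeline(cs clientset.Interface, lister listers.PipelineLister, namespace, name string) (*v1alpha1.Pipeline, error) {
	// Lister lookup: served from the informer's local cache, which can lag
	// briefly behind the API server right after an object is created.
	if p, err := lister.Pipelines(namespace).Get(name); err == nil {
		return p, nil
	}
	// Direct client lookup: hits the API server, so it sees the Pipeline as
	// soon as the Create call has been persisted, at the cost of an extra
	// API request.
	return cs.TektonV1alpha1().Pipelines(namespace).Get(name, metav1.GetOptions{})
}
```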
	// Make sure the Pipeline has been created (wait for it)
	if err := WaitForTaskCreated(c, "echo-task", "TaskCreated"); err != nil {
		t.Errorf("Error waiting for Task echo-task to be created: %s", err)
		t.Fatal("Pipeline execution failed, Task echo-task has not been created")
nit: You don't need both a t.Error and a t.Fatal; you could just combine these into one descriptive t.Fatal.
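Roughly, combining them into something like:

```go
if err := WaitForTaskCreated(c, "echo-task", "TaskCreated"); err != nil {
	t.Fatalf("Pipeline execution failed, Task echo-task has not been created: %s", err)
}
```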
Thanks for looking into this!
I think it would be best to get to the bottom of the "lister" issue.
	return wait.PollImmediate(interval, timeout, func() (bool, error) {
		pc, err := c.PipelineClient.Get(name, metav1.GetOptions{})
		if pc.GetName() == name {
I'm not sure why we need this check. Is it a way to check that no result was returned? Wouldn't we then get an err != nil?
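For comparison, a sketch of the same poll with the error handled explicitly, assuming the intent is simply to keep polling until the Pipeline exists (apierrors here is k8s.io/apimachinery/pkg/api/errors):

```go
return wait.PollImmediate(interval, timeout, func() (bool, error) {
	if _, err := c.PipelineClient.Get(name, metav1.GetOptions{}); err != nil {
		if apierrors.IsNotFound(err) {
			return false, nil // not created yet, keep polling
		}
		return false, err // unexpected error, stop polling
	}
	return true, nil // Get succeeded, so the Pipeline exists
})
```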
@@ -68,6 +68,36 @@ type TaskRunStateFn func(r *v1alpha1.TaskRun) (bool, error)
 // PipelineRunStateFn is a condition function on TaskRun used polling functions
 type PipelineRunStateFn func(pr *v1alpha1.PipelineRun) (bool, error)
 
+// WaitForPipelineCreated wait until a pipeline has been created
+func WaitForPipelineCreated(c *clients, name, desc string) error {
Nice <3
@@ -62,6 +62,12 @@ func TestDAGPipelineRun(t *testing.T) {
 		t.Fatalf("Failed to create echo Task: %s", err)
 	}
 
+	// Make sure the Pipeline has been created (wait for it)
+	if err := WaitForTaskCreated(c, "echo-task", "TaskCreated"); err != nil {
pipeline/pkg/reconciler/pipelinerun/controller.go, lines 72 to 82 in f1bfda6:

	c := &Reconciler{
		Base:              reconciler.NewBase(opt, pipelineRunAgentName, images),
		pipelineRunLister: pipelineRunInformer.Lister(),
		pipelineLister:    pipelineInformer.Lister(),
		taskLister:        taskInformer.Lister(),
		clusterTaskLister: clusterTaskInformer.Lister(),
		taskRunLister:     taskRunInformer.Lister(),
		resourceLister:    resourceInformer.Lister(),
		conditionLister:   conditionInformer.Lister(),
		timeoutHandler:    timeoutHandler,
		metrics:           metrics,
/hold I think we need to get down to why the lister is slow to react first...
@chmouel what do you want to do next with this? Maybe make an issue to investigate the lister responsiveness and close this in the meantime?
@bobcatfish Yep, sounds good. To be honest I wasn't able to recreate the OOMs, which I used to hit in about 1 in 3 runs before. So something has changed which made things better and the controller more reactive. But I think there is an issue we can file about performance in general and how the controller scales on small clusters. /close
@chmouel: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
When we run a lot of e2e tests on reasonably sized VMs we get some OOMKilled pods, or a race between the Task/Pipeline and their runs.
For the OOMKills, we reduce the number of TaskRuns running concurrently from 25 to 5; I am not sure what the advantages of a greater number are, and this eases the load on the nodes.
For the race, we wait until the objects have been created before creating the *Run objects, so the TaskRun doesn't get created before the Task (or Pipeline) exists.
Fixes #1820
Fixes #1819
Fixes #1815
Changes
Submitter Checklist
These are the criteria that every PR should meet, please check them off as you review them:
See the contribution guide for more details.
Double check this list of stuff that's easy to miss:
If you've added a new binary/image to the cmd dir, please update the release Task to build and release this image.
Reviewer Notes
Release Notes