add `resume-ci` label #43929

MoLow · 2022-07-21T12:13:50Z

fixes #40817
awaiting merge of nodejs/node-core-utils#642

nodejs-github-bot · 2022-07-21T12:13:54Z

Review requested:

@nodejs/actions
@nodejs/tsc

targos · 2022-07-21T12:24:01Z

I'm very reluctant about adding this because I think it's a mistake to resume CI without looking at the failures, and if you are looking at the failures you can just click on the Jenkins button.

aduh95 · 2022-07-21T12:26:10Z

> I'm very reluctant about adding this because I think it's a mistake to resume CI without looking at the failures, and if you are looking at the failures you can just click on the Jenkins button.

Unless you are not a collaborator (e.g. a triager), in which case you cannot resume using Jenkins CI. I agree with the sentiment though.

MoLow · 2022-07-21T12:31:51Z

I agree, CI should only be resumed if a human has made sure the specific flakiness existed before that PR. This is a tool that can help triagers do something they currently cannot do.

this was already discussed in the original issue #40817 (comment)

MoLow · 2022-07-21T12:47:44Z

Also, as far as I understand, the TSC has addressed (part of) this discussion as well? #42125 (comment)
https://github.com/mhdawson/TSC/blob/a52f5bb892d986e470661c15635d79d384302dd1/meetings/2022-03-10.md#nodejsnode

targos · 2022-07-21T12:53:43Z

Was it considered to allow triagers to access the Jenkins feature?

tniessen · 2022-07-21T13:07:39Z

> Was it considered to allow triagers to access the Jenkins feature?

IIRC that is the intention here: not to make resuming CIs simpler in general, but as a workaround to give this particular permission to triagers.

targos · 2022-07-21T13:24:27Z

I mean, do we really need a workaround, rather than giving them the permission inside Jenkins?

MoLow · 2022-07-21T13:26:18Z

> I mean, do we really need a workaround, rather than giving them the permission inside Jenkins?

If that is possible that sounds like a better solution to me

mcollina

wow, amazing

GeoffreyBooth · 2022-07-21T15:56:14Z

> I'm very reluctant about adding this because I think it's a mistake to resume CI without looking at the failures, and if you are looking at the failures you can just click on the Jenkins button.

If I didn't need to resume CI repeatedly for every PR, I would sympathize with this. But CI is much too flaky for looking at errors to be worth my time until after I've resumed at least three, maybe five times. It should resume by default and stop after 3-5 attempts.

benjamingr

Not ideal that we have to do this but I do think this is the correct thing to do given the circumstances.

tniessen · 2022-07-23T10:47:22Z

> But CI is much too flaky for looking at errors to be worth my time until after I've resumed at least three, maybe five times. It should resume by default and stop after 3-5 attempts.

That is exactly how I assume #43522 was merged, and it has made CI much, much worse for all collaborators, to the point where it was virtually unusable for days.

> It should resume by default and stop after 3-5 attempts.

Resuming CI without looking at errors makes it more likely to miss related failures and thus to introduce new flaky tests, which makes the situation worse for everyone, beyond that PR.

Let's say a PR introduces a test that flakes 50 % of the time. Running one CI and checking for errors gives you a 50 % chance of catching it. Resuming CI without checking what errors occurred reduces the chances of catching the flaky test exponentially!
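The arithmetic behind this can be sketched as follows. This is only an illustration under assumptions that are not stated in the thread: every attempt re-runs the new test, attempts are independent, and nobody reads the failure logs, so the flaky test is noticed only if every attempt fails. The function names are made up for the example.

```python
def chance_flake_is_noticed(flake_rate: float, attempts: int) -> float:
    """Probability that all `attempts` runs fail, i.e. the flake cannot be ignored.

    Assumes independent runs and that failures are never inspected.
    """
    return flake_rate ** attempts


def chance_of_green_ci(flake_rate: float, attempts: int) -> float:
    """Probability that at least one of `attempts` runs passes."""
    return 1.0 - chance_flake_is_noticed(flake_rate, attempts)


if __name__ == "__main__":
    # A test that flakes 50 % of the time, resumed blindly up to five times.
    for attempts in (1, 2, 3, 5):
        print(attempts, chance_flake_is_noticed(0.5, attempts))
```

With a 50 % flake rate the chance of being forced to notice the test halves with every blind resume, which is the exponential decay described above.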

I can't find it right now, but there was a PR a while ago that implemented something like this (i.e., automated resuming or something similar), which I was against for the same reason.

There are other approaches that might be worth investigating. For example, the test runner could, when a test fails, re-run it n times and count how many times it fails. This can be used to estimate whether a test flaked or whether it failed deterministically, but CI should still fail. If a test is marked as flaky, regardless of whether it passes, the test runner could run it n times and see if at least one run passes, to make sure it really is flaky and not failing every time. (But this would need some experimentation.)
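A minimal sketch of that idea, assuming a hypothetical `run_test` callable that returns `True` when the test passes. It is not how the Node.js test runner works today; the names, the default of 10 re-runs, and the two verdict strings are placeholders.

```python
from typing import Callable, Dict, Union


def classify_failure(run_test: Callable[[], bool],
                     reruns: int = 10) -> Dict[str, Union[int, str, bool]]:
    """Re-run a test that has just failed and summarise the outcomes.

    The result is only an estimate, and CI is still reported as failed.
    """
    if reruns < 1:
        raise ValueError("reruns must be at least 1")
    failures = sum(0 if run_test() else 1 for _ in range(reruns))
    verdict = "likely deterministic" if failures == reruns else "likely flaky"
    return {
        "reruns": reruns,
        "failures": failures,
        "verdict": verdict,
        "ci_should_fail": True,  # the original failure still counts
    }


def flaky_marking_is_plausible(run_test: Callable[[], bool],
                               runs: int = 10) -> bool:
    """For a test marked flaky: True if at least one run passes.

    any() stops at the first passing run, so fewer than `runs` runs may happen.
    """
    return any(run_test() for _ in range(runs))
```

A test that fails on every re-run is reported as likely deterministic; a test marked flaky that never passes would fail the second check and deserve a closer look.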

mcollina · 2022-07-23T11:20:03Z

The problem is more profound: it's virtually impossible to get a green CI without resuming. I currently have 7 PRs on which I'm restarting CI every day.

Something that would be extremely helpful is to get the list of failed tests as a PR comment.

We could have a different approach:

- mark all tests that fail at least once a day as flaky
- only run a reduced set of very reliable tests on the platforms that are more likely to fail (windows, arm, smartos, aix). This can be based on the support tier

tniessen · 2022-07-23T11:38:21Z

I completely agree @mcollina, we are in a tough spot right now. All I am saying is that resuming without properly checking for errors only makes things worse.

> mark all tests that fail at least once a day as flaky

Big +1 as long as we treat it as an urgent TODO list. (Essentially what I wrote in #43754 (comment).)

Refs: nodejs#43929 (comment)

Refs: #43929 (comment) PR-URL: #43954 Reviewed-By: Matteo Collina <matteo.collina@gmail.com> Reviewed-By: Tobias Nießen <tniessen@tnie.de> Reviewed-By: Feng Yu <F3n67u@outlook.com>

kvakil · 2022-08-05T03:33:24Z

I think that this PR would be a great change. But I also think that request-ci should always do the right thing: if there are no commits since the last CI, it should retry the CI. If there is an intervening commit, it should start a new build. People shouldn't need to think about which is correct. #44130 clears up some of the documentation here but I think it's hard to solve UX issues with documentation.

(signed, an idiot who didn't realize request and retry were different, and so requested too many builds)
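The decision described in this comment can be sketched as a single comparison. The function and the commit identifiers are hypothetical; the real tooling would have to look up which commit the last CI run was built from.

```python
from typing import Optional


def choose_ci_action(head_sha: str, last_ci_sha: Optional[str]) -> str:
    """Decide what a single request label should do.

    Returns "resume" when the last CI run was built from the current head
    commit, and "start" when there was no earlier run or new commits landed.
    """
    if last_ci_sha is not None and last_ci_sha == head_sha:
        return "resume"
    return "start"
```

For example, `choose_ci_action("abc123", "abc123")` returns `"resume"`, while `choose_ci_action("def456", "abc123")` returns `"start"`.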

MoLow · 2022-08-05T05:28:51Z

@kvakil that sounds like a great improvement! But:

1. I feel this PR is currently too controversial to land without addressing some of the feedback. Perhaps if we add your suggestion with a limit that after 3 times, it will comment on the PR - stating that "resume must be done manually after three failures." WDYT?
2. The PR in ncu needs an approval first. It seems to me that it is OK to merge it, since a CLI command won't be abused as easily as a label. Can someone approve and merge it? CC @nodejs/node-core-utils

kvakil · 2022-08-05T05:44:03Z

In my mind resume wouldn't happen automatically, the author would still need to go back and retag the PR with `request-ci` in order to resume it. There just wouldn't be a separate `resume-ci` label: ideally the tooling can detect if the CI needs to be rebuilt entirely (if there were additional commits since the last CI) or if it can just be resumed (if there have been no additional commits). & to be clear this is just a wish, I definitely don't think it should stop us from using this PR. I am not sure how hard the implementation would be. I just think the user experience would be better.

MoLow · 2022-08-05T06:05:45Z

@kvakil adding a request-ci label is still too automatic, in the sense that CI failures should not be ignored and need at least minimal inspection.
See #43929 (comment)

aduh95 · 2022-08-05T08:34:43Z

> I just think the user experience would be better.

Currently the user experience is to go to the Jenkins web UI to check what the failures are, and if it turns out the failures are indeed unrelated to the PR under test, the "Resume build" button is right there in the Jenkins UI; there's no reason to go back to GitHub to add a label. What I'm trying to say is this label is not meant to improve the UX (it won't), it's to enable triagers to resume CIs, which they currently can't.

> People shouldn't need to think about which is correct.

That's exactly what's controversial about this PR: some are concerned that adding this label would make collaborators/triagers less likely to think about the CI failures and instead re-apply the label without checking the failures until they get a passing CI. In particular, it would enable the landing of PRs that introduce flaky tests, making the CI even less reliable than it currently is.

> Perhaps if we add your suggestion with a limit that after 3 times, it will comment on the PR - stating that "resume must be done manually after three failures." WDYT?

That'd be a very nice feature indeed! I would go further: only accept 1 request-ci and one resume-ci when no new commits have landed; with clear error messages it could help educate people on the specificities of our CI system, and it addresses the concern that folks could use the system to land flaky PRs.
(Lately you rarely need more than one resume to get a passing CI anyway).
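A rough sketch of such a gate, assuming one `request-ci` and one `resume-ci` per head commit. The class name, the message wording, and the in-memory counters are all hypothetical; a real implementation would need to persist state between workflow runs.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class CiLabelGate:
    """Accept each CI label a limited number of times per head commit."""
    limits: Dict[str, int] = field(
        default_factory=lambda: {"request-ci": 1, "resume-ci": 1})
    head_sha: str = ""
    used: Dict[str, int] = field(default_factory=dict)

    def handle(self, label: str, head_sha: str) -> Tuple[bool, str]:
        if label not in self.limits:
            return False, f"unknown label: {label}"
        if head_sha != self.head_sha:
            # New commits landed, so the counters start over.
            self.head_sha = head_sha
            self.used = {}
        count = self.used.get(label, 0)
        if count >= self.limits[label]:
            return False, (
                f"{label} was already used for {head_sha}; "
                "please inspect the failures and resume manually in Jenkins")
        self.used[label] = count + 1
        return True, f"{label} accepted for {head_sha}"
```

The rejection message is where the educational error text mentioned above would go, so that a second blind retry turns into a prompt to look at the failures.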

MoLow · 2022-08-05T08:40:20Z

> That'd be a very nice feature indeed! I would go further: only accept 1 request-ci and one resume-ci when no new commits have landed; with clear error messages it could help educate people on the specificities of our CI system, and it addresses the concern that folks could use the system to land flaky PRs. (Lately you rarely need more than one resume to get a passing CI anyway).

Yes, I will probably wait for my nomination to complete so my Jenkins token will actually work :)

kvakil · 2022-08-05T09:29:30Z

> I just think the user experience would be better.

> Currently the user experience is to go to the Jenkins web UI to check what the failures are, and if it turns out the failures are indeed unrelated to the PR under test, the "Resume build" button is right there in the Jenkins UI; there's no reason to go back to GitHub to add a label. What I'm trying to say is this label is not meant to improve the UX (it won't), it's to enable triagers to resume CIs, which they currently can't.

The particular UX I dislike here is having resume-ci and request-ci which to me sound very similar. I think I understand the concerns around having one label better now, thanks for elaborating.

Refs: nodejs#43929 (comment) PR-URL: nodejs#43954 Reviewed-By: Matteo Collina <matteo.collina@gmail.com> Reviewed-By: Tobias Nießen <tniessen@tnie.de> Reviewed-By: Feng Yu <F3n67u@outlook.com>

Refs: #43929 (comment) PR-URL: #43954 Backport-PR-URL: #45126 Reviewed-By: Matteo Collina <matteo.collina@gmail.com> Reviewed-By: Tobias Nießen <tniessen@tnie.de> Reviewed-By: Feng Yu <F3n67u@outlook.com>

nodejs-github-bot added the `meta` and `tools` labels Jul 21, 2022

MoLow requested a review from benjamingr July 21, 2022 12:14

meta: add resume-ci label

3705060

MoLow force-pushed the add-resume-ci-label branch from 2932675 to 3705060 July 21, 2022 12:15

mcollina approved these changes Jul 21, 2022

benjamingr approved these changes Jul 21, 2022

aduh95 added a commit to aduh95/node that referenced this pull request Jul 23, 2022

tools: add more options to track flaky tests

895db5a

Refs: nodejs#43929 (comment)

aduh95 mentioned this pull request Jul 23, 2022

tools: add more options to track flaky tests #43954

Merged

tniessen mentioned this pull request Jul 23, 2022

Ruthlessly mark tests that fail frequently as flaky #43955

Closed

MoLow closed this Sep 11, 2022

MoLow deleted the add-resume-ci-label branch May 24, 2024 09:02

9 participants
