
ACTION REQUIRED: workflow change for merging changes to nodejs/node #2598

Closed
orangemocha opened this issue Aug 28, 2015 · 76 comments
Labels
build Issues and PRs related to build files or the CI. meta Issues and PRs related to the general management of the project.

Comments

@orangemocha
Contributor

Based on previous discussion, feedback, and the approval by the TSC (#2434), we are adopting a new workflow for merging pull requests in nodejs/node, and for making any changes to the code branches at nodejs/node in general. The new workflow needs to be adopted by all @nodejs/collaborators for all changes merged to nodejs/node, starting this coming Monday at 2pm UTC: http://www.timeanddate.com/worldclock/fixedtime.html?msg=Start+merging+changes+to+nodejs%2Fnode+with+Jenkins&iso=20150831T14&p1=1440

From that time on, please refrain from pushing changes to nodejs/node manually, and instead use the workflow documented at https://github.com/nodejs/node/wiki/Merging-pull-requests-with-Jenkins. I also added a wiki page specifically on how to deal with flaky tests.

The workflow for testing (but not landing) pull requests with Jenkins is unchanged, and is now documented here: https://github.com/nodejs/node/wiki/Testing-pull-requests-with-Jenkins

I'll be monitoring Jenkins to make sure things keep working smoothly. Let me and @nodejs/jenkins-admins know if you encounter any issues with the CI infrastructure or to share your feedback. Thank you!

@Fishrock123
Contributor

👍

Fishrock123 added the build and meta labels Aug 28, 2015
@silverwind
Contributor

Do these merges add the committer line of whoever performed the merge?

@Fishrock123
Contributor

Do these merges add the committer line of whoever performed the merge?

I was told they will, and they need to for us to use it.

@orangemocha
Contributor Author

Do these merges add the committer line of whoever performed the merge?

Yes, the information is taken from your full name and email address as set in Jenkins. I will update the wiki to reflect that.
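
If you want to double-check after a merge, the author and committer fields can be inspected directly in the Git metadata; the author should remain the original contributor, while the committer should match the name and e-mail from the Jenkins account. For example:

    git log -1 --format='author:    %an <%ae>%ncommitter: %cn <%ce>'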

@ChALkeR
Member

ChALkeR commented Aug 28, 2015

What about documentation-only changes? I believe we don't yet have documentation-related tests, do we?

@orangemocha
Contributor Author

To ensure your committer info is set correctly, go to https://jenkins-iojs.nodesource.com/, log in, click on your username at the top-right, then on the left click on 'Configure'. Double-check 'Full Name' and 'E-mail address'.

@orangemocha
Contributor Author

What about documentation-only changes? I believe we don't yet have documentation-related tests, do we?

Use node-accept-pull-request, and you can set NODES_SUBSET to pure_docs_changes to skip the tests. If we ever invent a way to test documentation, those tests will run as part of that configuration 😄

@ChALkeR
Member

ChALkeR commented Aug 28, 2015

On the left hand side, you should see "Build with parameters". If not, it probably means that you're not logged in. You need to be logged in to start this job.

Doesn't work for me. I am logged in and there is nothing like «Build with parameters». Can I have a screenshot?

I remember it being OK in iojs+any-pr+multi.

@orangemocha
Contributor Author

[screenshot]

@ChALkeR
Member

ChALkeR commented Aug 28, 2015

No, it's definitely not there.

@orangemocha
Contributor Author

@ChALkeR
Member

ChALkeR commented Aug 28, 2015

@orangemocha Yes, of course.

@Fishrock123
Contributor

[screenshot, 2015-08-28 10:52 AM]

Weren't we going to have dropdowns or something for names? If not, that's OK.

@Fishrock123
Contributor

@orangemocha also, it still points by default to the joyent org, not to nodejs. :)

@orangemocha
Contributor Author

@Fishrock123 there are dropdowns, in node-accept-pull-request. Maybe you are looking at node-merge-commit?

[screenshot of the node-accept-pull-request parameter form]

@orangemocha
Contributor Author

@Fishrock123 I will update the defaults from joyent/node in the 2-hour window between the move of the joyent/node repo and this workflow change 😄

@orangemocha
Contributor Author

@ChALkeR: do you see "Build with Parameters" for other jobs, like node-test-pull-request?

@Fishrock123
Contributor

@orangemocha oops, you're right. :)

@silverwind
Contributor

The steps required for small corrections like modifying the commit message or squashing commits seem unwieldy. I feel like this is taking quite a bit of flexibility away. How about introducing an alternative, CLI-only workflow like this:

  • Pull commits to local repo
  • Rebase and modify the commits
  • Push the changes to a special CI branch, which kicks off the tests; if the run is green, the changes get committed with all original information intact.

One could have multiple target branches for different subsets of tests to run. Feedback to the user could come from a bot that posts CI results on the issue through the GitHub API.

As a die-hard CLI user, I sure would appreciate something like this.
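
A rough sketch of how that proposed flow might look from the command line (the PR number, the branch names, the ci/ prefix, and the results bot are all hypothetical, not existing infrastructure):

    # grab the PR's commits locally (1234 is a made-up PR number)
    git fetch origin pull/1234/head
    git checkout -b fix-something FETCH_HEAD

    # rebase and modify the commits as usual
    git rebase -i master

    # pushing to a special prefix (here "ci/") would kick off the tests;
    # on green, the commits would land with the original author info intact
    git push origin HEAD:ci/fix-something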

@orangemocha
Contributor Author

@ChALkeR : if you are still having problems with the missing "Build with Parameters", I would suggest trying the following in order:

  1. Try a different browser.
  2. We can create another account in Jenkins for you.

@orangemocha
Contributor Author

@silverwind : I am definitely not against additional tooling to make things even easier. We can work together on this if you want, or we could even let you mess with Jenkins if you are interested.

Having said that, are the steps for modifying the commit message or squashing commits really more complicated under this workflow? To do that, you'd still have to pull the changes locally and make your edits (I simply described a possible procedure in the wiki), and finally push a branch to GitHub (which in the pre-Jenkins way could be the final branch). The only additional steps now are a) filling in the form to kick off node-accept-pull-request, and b) deleting the branch after the fact. The time it takes to fill in the form should be offset by the ease of adding reviewers with the drop-downs. And we could add a checkbox to node-merge-commit to have it delete the temporary source branch after the merge.
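
For comparison, a minimal sketch of the procedure described above, with a made-up PR number and branch name:

    git fetch origin pull/1234/head          # 1234 is a placeholder PR number
    git checkout -b land-1234 FETCH_HEAD
    git rebase -i master                     # squash commits / reword messages
    git push origin land-1234                # temporary source branch for the job

    # ...fill in the node-accept-pull-request form in Jenkins, pointing it at land-1234...

    git push origin --delete land-1234       # clean up once the job has merged it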

@Fishrock123
Contributor

@orangemocha so just to be clear, I can already use this for nodejs/node, right now?

@orangemocha
Contributor Author

@Fishrock123 : technically yes, but if anybody lands another pull request manually in the middle of your run, it will break it.

@Fishrock123
Contributor

technically yes, but if anybody lands another pull request manually in the middle of your run, it will break it.

But nothing fatal that another run can't fix?

@silverwind
Contributor

@orangemocha I'm a bit confused. Do you still have to start off node-accept-pull-request when doing node-merge-commit as described here?

What I'd like to see is an alternative entry point to node-merge-commit through a simple git push. From what I understand, you're already interfacing with the branches on GitHub. Would it be possible to start off node-merge-commit when a new branch, maybe with a special prefix, is created?

(Also, I'm not really motivated to learn Jenkins internals, so I think I'll decline your offer 😉)

@orangemocha
Contributor Author

@Fishrock123 : correct. You can try it if you want.

@Fishrock123
Contributor

@orangemocha https://jenkins-iojs.nodesource.com/job/node-test-commit-arm/380/ I think the Pi1s aren't depending on the other task correctly?

@trevnorris
Contributor

@orangemocha Last Friday I landed two commits. You can find each of those original commits listed in their respective PRs. Later that day @Fishrock123 landed another patch using Jenkins. For some reason the two patches I had landed earlier were left stranded. This led me to assume that Jenkins was responsible.

@orangemocha
Contributor Author

Folks, thank you for your patience with this. The last couple of days definitely didn't go as well as I had hoped. There were a lot of random test failures all over the place, preventing people from landing PRs. There have been so many unexpected ones that it makes me wonder whether those failures could be caused by an underlying bug in node itself.

We have been discussing this today within the build WG, and the sane thing to do seems to be to halt the automated merges temporarily, until we can get things to a stable place. Hence I am re-opening this issue to elicit people's feedback.

Right now we are operating under this plan:

  • Gather all the test failures from the last runs (and more runs), and mark them all as flaky
  • Re-run all the accept jobs that have failed because of flaky tests
  • Only if everything looks stably green/yellow after that, and if we reach that state before 9am PDT, continue with the automated merges. Otherwise, suspend the trial until we can reach a more stable state.

/cc @nodejs/collaborators @nodejs/tsc

orangemocha reopened this Sep 3, 2015
@Fishrock123
Contributor

Gather all the test failures from the last runs (and more runs), and mark them all as flaky

I am worried that we run a high risk of starting to ignore legitimate failures due to current hardware issues.

A while ago (1-2 months?) we were having all green runs. What happened to that?

@targos
Member

targos commented Sep 3, 2015

In order to save some time, would it be possible to automatically cancel a merge job as soon as one of the test jobs fails?
There is no point in waiting for the RPi to compile/test if something went wrong on faster hardware.

@orangemocha
Contributor Author

@targos : yes, no harm in doing that if you don't need complete results from the run.

@orangemocha
Contributor Author

A while ago (1-2 months?) we were having all green runs. What happened to that?

That would be a tough question to answer precisely. The high-level guess is that, due to the fast pace of development, quality is slowly drifting down (no offense to anyone, I hope). It does highlight the importance of having a system like this in place. The other possibility is that some of these failures are due to faulty hardware or bad configuration of the slaves. We do check for that, and while it is true in some cases, the majority of flaky failures still cannot be clearly attributed to issues with the slaves.

@orangemocha
Contributor Author

I am worried that we run a high risk of starting to ignore legitimate failures due to current hardware issues.

We are already making those calls when we say "not related to this change" and try again (or land it manually). Once we make that decision and move forward with the PR, the flakiness is now in master. There is little harm in marking those tests as flaky, and at least we are opening GitHub issues to track the problems and follow up. This might be an argument against nodejs/build#182 though.

@Fishrock123
Contributor

The high-level guess is that, due to the fast pace of development, quality is slowly drifting down (no offense to anyone, I hope).

OK, this is just incorrect.

They were failing BEFORE THAT, and we cleaned that up apart from what was still in our issue tracker. Some of these failures have only manifested in the last few days. I've been keeping an eye on the CI since the beginning of io.js.

the majority of flaky failures still cannot be clearly attributed to issues with the slaves.

Anything with failures relating to reset connections or port binding is a) relatively new, and b) appeared around some CI changes about 2 or so months ago.

Also, the iojs+any+pr+multi job and all of its history are now gone. Wonderful.

We are already making those calls when we say "not related to this change" and try again (or land it manually). Once we make that decision and move forward with the PR, the flakiness is now in master.

We can also parse the output to tell whether a failure actually is flaky.

@Fishrock123
Contributor

They were failing BEFORE THAT, and we cleaned that up apart from what was still in our issue tracker.

Also, for clarification, I am quite sure we have had green runs for over a week, at a couple of intervals, on master. (This was after the point where we were accidentally ignoring Windows failures.)

@orangemocha
Contributor Author

I doubt that the failures are related to the shift from iojs+any+pr+multi to node-test-pull-request. They differ in the arguments they take and the way they fetch things, but internally they are just calling the test runner in the same way.

I have to agree that this started in the last few days. Before opening this issue, we had several test runs with node-accept-pull-request where things were stably yellow. That means there might be a real bug in node. If someone has time to start investigating those flaky tests locally, it would be helpful.

@orangemocha
Contributor Author

OK, after chasing failures for the last few days, I am convinced that we need to suspend this experiment. 😞
Even though we have tried aggressively to mark tests as flaky - this PR is marking 29 new tests as flaky! - we are not even able to land that PR, because new tests are failing every time. Given the current state of things, most attempts to land PRs via CI would fail. So please refrain from using node-accept-pull-request / node-merge-commit, and instead merge changes manually.
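
For reference, marking a test as flaky amounts to an entry in the test status files read by the test runner; a representative sketch (the test name and platform condition are placeholders):

    # e.g. in test/parallel/parallel.status; the test name below is made up
    prefix parallel

    [$system==linux]
    test-some-flaky-candidate : PASS,FLAKY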

This level of flakiness in the tests is unprecedented, and I believe it started very recently (this week). While there are occasional failures due to misconfigured machines, those are easy to spot. The vast majority of failures seem due to reasons outside of the Jenkins/CI realm, and seem to indicate a real problem with the state of the master branch. The recent libuv upgrade (a161594) is high on the list of suspects. What we can do is run some test jobs on recent commits and try to bisect it that way. If you have a chance to investigate some of those failures locally, that would be helpful too.
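
A minimal sketch of bisecting this locally, assuming we can identify a recent commit that was still green in CI (the placeholder below stands for such a commit):

    git bisect start
    git bisect bad master
    git bisect good <last-known-good-commit>   # placeholder: a commit that was still green
    # at each step, build, run the suspect tests, and report the result:
    make run-ci                                # or a targeted subset of the failing tests
    git bisect good                            # or `git bisect bad`, depending on the outcome
    git bisect reset                           # when finished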

We can resume this experiment after we have found the underlying cause of all this instability. I guess that at this point, we'll also take the time to fix a few other high priority issues with the CI itself (like making the build faster), and restart only when it looks really solid.

Sorry for the headaches that this has caused.

/cc @nodejs/collaborators @nodejs/tsc

@silverwind
Contributor

These flaky tests have been plaguing us for a lot longer than a week, and the failures were almost never reproducible locally, so I think we should indeed take a close look at the CI itself. Are other projects using Jenkins also experiencing this? Maybe we're doing something fundamentally wrong?

@orangemocha
Contributor Author

Jenkins is just running a bunch of scripts on a variety of machines. Sure, it has its own quirks, but I don't see how it could be causing some of the issues that we have seen lately.

Maybe to get a better shot at reproducing locally, we should try running make run-ci. The top failing platforms seem to be armv7-wheezy, the RPis, and centos5.
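
A minimal sketch for hammering a suspect test locally, assuming the usual test-runner invocation and a made-up test name:

    ./configure && make -j8                    # build once
    for i in $(seq 1 50); do
      python tools/test.py parallel/test-some-flaky-candidate || echo "failed on run $i"
    done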

@Trott
Member

Trott commented Sep 3, 2015

Wild guess, probably wrong, but hey, that's what the Internet is for, so here you go:

Didn't we move a test or three from sequential to parallel recently? Maybe something in one of those tests slipped by that may be making other tests somehow unstable?

@orangemocha
Contributor Author

That can't be excluded, @Trott. I have already started a few runs to try to bisect commits in CI, so that if there is one offending commit, we can find it.

@orangemocha
Contributor Author

Also, I just checked PRs that landed in v0.12 recently. About 10 PRs landed in the last couple of weeks using node-accept-pull-request from the new Jenkins, using the exact same Jenkins infra, except they don't run on ARM. There are no records of people having to retry runs. This also seems to confirm that the problem is in the current master branch.

@Fishrock123
Contributor

The vast majority of failures seem due to reasons outside of the Jenkins/CI realm, and seem to indicate a real problem with the state of the master branch.

Anything that deals with ECONNRESET or unavailable ports has in the past been attributed to configuration; that stuff was already there before the libuv upgrade. Those were the largest CI issues for a while, IIRC.

Related: I saw two unavailable ports after about 50 test suite runs on my OS X 10.10.5 machine. None on my remote Ubuntu 15 testing box. (It did have a weird fs EACCES failure, but I am quite certain this is config as well.)

@indutny
Member

indutny commented Sep 3, 2015

@orangemocha what do you think about delaying the accept-pull-request project? It looks like it is blocking us from making progress at the moment, and considering the upcoming release, it is very troublesome. We can always return to running this experiment at some later point, when things are more stable.

@orangemocha
Contributor Author

@indutny yes see above #2598 (comment)

@indutny
Member

indutny commented Sep 3, 2015

@orangemocha argh, right! sorry, I didn't get it.

@orangemocha
Contributor Author

Closing for now. I will open a new issue when we have addressed the stability and performance concerns.
