
Adding new AIX machines #459

Closed

mhdawson opened this issue Aug 3, 2016 · 33 comments
@mhdawson
Member

mhdawson commented Aug 3, 2016

I now have access to one of the new AIX machines from osuosl; the other 2 are still being configured for networking.

Opening this issue for awareness and to capture any info while doing the install.

@mhdawson mhdawson self-assigned this Aug 3, 2016
@mhdawson
Member Author

mhdawson commented Aug 4, 2016

First machine added (https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-1/). A few tweaks to the instructions were required; PR for those here: #461

Built and ran the tests; there are 2 new failures, covered by this issue: nodejs/node#7973

@mhdawson
Member Author

mhdawson commented Aug 4, 2016

Build/test time seems to be ~30 mins on the new machine, which is about the same as the slowest platforms, so once we get the second machine online I plan to add AIX to the regular regression runs.

@mhdawson mhdawson added the build label Aug 4, 2016
@jbergstroem
Member

@mhdawson is that with ccache? 30 mins sounds awfully slow.

@mhdawson
Member Author

mhdawson commented Aug 4, 2016

ccache is installed and I think it should be in use, but I was planning to double-check.

@mhdawson
Member Author

mhdawson commented Aug 4, 2016

Good news is that ccache was not actually in the path, so we should be able to do better once it is. Will fix that up.
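
For the record, the fix amounts to something like the following (a sketch; /opt/freeware/bin is an assumption about where ccache landed on these AIX hosts):

# put ccache ahead of the compilers on the PATH (install prefix assumed)
export PATH=/opt/freeware/bin:$PATH
# or wrap the compilers explicitly so every build picks it up
export CC='ccache gcc'
export CXX='ccache g++'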

@mhdawson
Member Author

mhdawson commented Aug 4, 2016

OK, ccache is working now and the time is down to 19 mins. The node-test-commit-linux job typically runs about 20 mins, as do a few of the sub-jobs I looked at, so that seems to be in the right ballpark. https://ci.nodejs.org/job/node-test-commit-aix/288/nodes=aix61-ppc64/
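
(A quick sanity check, assuming ccache 3.x's stats output; non-zero hit counts confirm it is actually being used:)

ccache -s | grep -i 'cache hit'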

@jbergstroem
Member

Are you using JOBS?

@mhdawson
Member Author

mhdawson commented Aug 4, 2016

Second machine is almost ready; it just needs to be added to the firewall:
https://ci.nodejs.org/computer/test-osuosl-aix61-ppc64_be-2/

@mhdawson
Member Author

mhdawson commented Aug 4, 2016

@jbergstroem the parallelism was previously set to 1, but I changed it to 5 as part of the change, since that looked like how many CPUs are available on the new machines. So right now it has -j 5.
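
For reference, a sketch of the equivalent local invocation (assumed; on the CI the same value presumably comes in via the JOBS setting @jbergstroem mentions):

# five-way parallelism to match the processors on these machines
export JOBS=5
make -j "$JOBS"        # parallel compile
make -j "$JOBS" test   # build anything missing, then run the test suite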

@mhdawson
Member Author

mhdawson commented Aug 4, 2016

I think this means we have 5 procs:

bash-4.3# /usr/sbin/lsdev -C -c processor
proc0 Available 00-00 Processor
proc4 Available 00-04 Processor
proc8 Available 00-08 Processor
proc12 Available 00-12 Processor
proc16 Available 00-16 Processor
bash-4.3#
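
(The same count as a one-liner, using the lsdev flags above:)

/usr/sbin/lsdev -C -c processor | wc -l    # prints 5 on this machine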

@jbergstroem
Member

On the phone, will add shortly!

@jbergstroem
Member

(added)

@mhdawson
Member Author

mhdawson commented Aug 5, 2016

Second machine is in and working now.

Will plan to stitch AIX into the regular regression runs on Monday. There are still some failures (AIX was green, but there are malloc-related issues that @bnoorhuis is working on, plus one new issue seen only on the new machines), but I think it's still good to be able to tell if a commit introduces new failures.

@mhdawson
Member Author

mhdawson commented Aug 8, 2016

OK, the 2 test machines are up and running. They have been running OK for the last few days and take about 20 mins per run, which is consistent with the other Linux jobs.

I plan to change AIX to run as part of the standard job that tests PRs (as opposed to nightly) some time later today.

@mhdawson
Member Author

mhdawson commented Aug 8, 2016

The next step will then be to set up the release machine and add it to the release CI.

@mhdawson
Member Author

mhdawson commented Aug 9, 2016

Added AIX to node-test-commit. Run here to validate: https://ci.nodejs.org/job/node-test-commit/4475/

@Trott
Member

Trott commented Aug 9, 2016

node-test-commit-aix is red every run. So now all of our CI runs are red. Any chance we can remove it until that is sorted out?

@joaocgreis
Member

joaocgreis commented Aug 10, 2016

@mhdawson I disabled the job in node-test-commit for now, as per @Trott's request. Your test run was red, so I'm not sure what you intended; of course we can discuss this further. Our CI infra is full of known issues that we have to dismiss (or, when in doubt, re-run), but a permanently red job is not an advantage for collaborators, so I opted to disable it for now.

EDIT: CI back to green: https://ci.nodejs.org/job/node-test-commit/4489/

@Trott
Member

Trott commented Aug 10, 2016

Thanks @joaocgreis!

If this needs to be re-added quickly for some reason, and if the same tests are failing each time, I suppose they could be marked flaky in the status file, although obviously that's not as good as fixing whatever the issue is.

@mhdawson
Member Author

AIX was green until a little while back, when it was broken by some new changes, which then covered up other changes that caused more problems. This is bound to happen when it's not run as part of the regular job.

Now that we have adequate hardware to run AIX on every run, I was hoping we could add it even though it was red, as that would help us find regressions more quickly (even if submitters don't notice because it was already red) and would help with triage if one did sneak in.

The failures are consistent, and there are 2 issues open to cover the existing failures. There has been enough red in the past that I did not think this was going to be a major issue, but maybe that's changed now.

I can look at marking them as flaky for just AIX, but I'm not sure it's the best thing to do.

@Trott
Member

Trott commented Aug 10, 2016

There has been enough red in the past that I did not think this was going to be a major issue, but maybe that's changed now.

Definitely. We've been mostly green since December, and I'd hate to go back to a world where red CI was shrugged off and code landed anyway.

There are two ways to mark the tests as flaky, I think. I believe one way results in green and the other in yellow. I'd be OK with a yellow CI indicating there are tests that need addressing, but nothing that should hold up code landing. /cc @orangemocha in case I'm wrong about that. I'm pretty sure he set that stuff up, and I don't actually know how it works; I'm just a happy user.

@joaocgreis
Member

I see there are already two tests in https://github.com/nodejs/node/blob/ab3306ad/test/parallel/parallel.status marked for AIX; it would be great to add the missing ones as well and re-enable the job. Please test v4 as well (if AIX is to run for v4).

@mhdawson if you expect the tests to be corrected soon, you can add them as PASS,FLAKY; that will make every job yellow until the tests are fixed. If it might take long, add them as PASS,FAIL (or just FAIL if they fail reliably), making the job green.
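
For anyone following along, the status-file entries look roughly like this (illustrative test names, not the real ones):

prefix parallel

[$system==aix]
# PASS,FLAKY turns the run yellow: expected to pass, failures tolerated
test-example-flaky : PASS,FLAKY
# FAIL keeps the run green: the test is expected to fail reliably
test-example-broken : FAIL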

@mhdawson
Member Author

They should be fixed relatively soon. I will add them as PASS,FLAKY so that they show up as yellow and are still visible.

@mhdawson
Member Author

OK, PR here to mark them as flaky: nodejs/node#8065

@mhdawson
Member Author

mhdawson commented Aug 11, 2016

There are 2 failures on 4.x:
not ok 659 parallel/test-regress-GH-1899
not ok 730 parallel/test-stdio-closed

The second is one we are actively investigating on master; it was only seen after we moved to the new AIX machines. The first seems to have been fixed in libuv as per nodejs/node#3676, which likely has not made it back to 4.x.

Given that people can get 4.x AIX builds from the IBM developerworks site with any required fixes for these issues, and that 6.x LTS is not that far away, I'm thinking the goal should be having AIX in the community releases for 6.x and just leaving 4.x as it is. We can revisit/confirm this once we have AIX downloads for 6.x (as stable) available on the community download page.

@mhdawson
Member Author

One additional comment: if we think the failures will be an issue in the CI when people do runs against 4.x, I'm happy to submit a PR to mark those 2 tests as expected to fail in 4.x. @Trott, @joaocgreis, what are your views on that?

@joaocgreis
Member

If v4 does not support AIX, then CI should not run it for v4. Currently, CI detects node version v0.x, and ARM is not run on those; I'll have to extend this mechanism soon for v4 (easy, but I haven't had the time yet).

@mhdawson my view is that for now you're welcome to mark those as flaky. When we have the mechanism to run only on v6, you can keep supporting it or not, your call I guess.
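
A hypothetical sketch of what that version gate might look like (not the actual Jenkins logic; it assumes the job has a node checkout, where src/node_version.h defines NODE_MAJOR_VERSION):

# hypothetical gate: skip the AIX run for anything older than v6
NODE_MAJOR=$(awk '/define NODE_MAJOR_VERSION/ {print $3}' src/node_version.h)
if [ "$NODE_MAJOR" -lt 6 ]; then
  echo "skipping AIX run for v$NODE_MAJOR"
  exit 0
fi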

@mhdawson
Member Author

It would be nice for runs against 4.x to catch any new regressions, since we do ship 4.x binaries even if the community does not. I'll submit a PR to mark those 2 as flaky.
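
For reference, the v4.x-staging entries would look something like this (a sketch using the two test names above; the section header follows the existing AIX entries in parallel.status):

[$system==aix]
test-regress-GH-1899 : PASS,FLAKY
test-stdio-closed : PASS,FLAKY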

@mhdawson
Member Author

Build to validate my branch before creating the PR for v4.x-staging: https://ci.nodejs.org/job/node-test-commit-aix/346

@mhdawson
Member Author

A couple of failures on tests already marked as flaky for Linux, so marking them flaky on AIX as well. New run: https://ci.nodejs.org/job/node-test-commit-aix/351/nodes=aix61-ppc64/console

@mhdawson
Member Author

PR for 4.x nodejs/node#8076

@mhdawson
Member Author

mhdawson commented Sep 9, 2016

OK, the machines have been in the regression runs for a while now. There are still some tests marked as flaky, but I'm getting those prioritized so that people at IBM will work to burn down that list. Going to close this issue for now; I have opened a separate one for getting the release machine added.

@mhdawson mhdawson closed this as completed Sep 9, 2016