WeeklyTelcon_20180814
Geoffrey Paulsen edited this page Jan 15, 2019 · 1 revision
- Dialup Info: (Do not post to public mailing list or public wiki)
- Jeff Squyres
- Brian
- David Bernholdt
- Howard Pritchard
- Geoffroy Vallee
- George
- Josh Hursey
- Peter Gottesman (Cisco)
- Ralph Castain
- Thomas Naughton
- Geoff Paulsen
- Todd Kordenbrock
- Xin Zhao
- Nathan Hjelm
- akshay
- Matthew Dosanjh
- Joshua Ladd
- Matias Cabral
- Edgar Gabriel
- Akvenkatesh (nVidia)
- Dan Topa (LANL)
Silent Wrong Issue(s)
- Branching issue
- The head of the v2.1.x branch is essentially the same as the last tag.
- For v3.0.x and v3.1.x, the branch has a lot of commits AFTER the last tag.
- The Nightly tarballs would test the v3.0.x and v3.1.x, but not the special branch.
- Let's not panic, and fix it like we'd normally fix it.
- This issue is NOT on v3.0.x. It was fixed by Mark Allen on March 26; see:
- https://github.com/open-mpi/ompi/issues/4937
- https://github.com/open-mpi/ompi/pull/4955
- Didn't think it was an issue on master.
- New maxsoak test is failing at Cisco - so possibly this issue affects x86 platforms as well, which makes it higher priority than just arm and ppc
- MAY flip the dates of the v3.0.x and v3.1.x milestones, since we just put out a v3.0.x release, and it'd be easy to cherry-pick a few changes and roll that release, then pick up the larger v3.1.x release after v4.0.0 goes out.
Nathan is requesting comments on
- C11 integration into master: PR 5445
- Eliminates all of our own atomics in favor of C11 atomics.
- ACTION: Please review and comment on code.
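For reviewers less familiar with the C11 interface, here is a minimal sketch of the kind of `<stdatomic.h>` calls that could stand in for a project-specific atomic wrapper. The names `ref_count_t`, `ref_init`, `ref_retain`, `ref_release`, and `cas_int` are hypothetical. They are not taken from PR 5445 or from the Open MPI source.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference counter; not an Open MPI type. */
typedef struct {
    atomic_int count;
} ref_count_t;

/* Set the starting value before the counter is shared between threads. */
static inline void ref_init(ref_count_t *r, int initial)
{
    atomic_init(&r->count, initial);
}

/* Atomically add one and return the new value. */
static inline int ref_retain(ref_count_t *r)
{
    return atomic_fetch_add_explicit(&r->count, 1, memory_order_relaxed) + 1;
}

/* Atomically subtract one; returns true when the count reaches zero. */
static inline bool ref_release(ref_count_t *r)
{
    return atomic_fetch_sub_explicit(&r->count, 1, memory_order_acq_rel) == 1;
}

/* Compare-and-swap: store desired only if *v still equals expected. */
static inline bool cas_int(atomic_int *v, int expected, int desired)
{
    return atomic_compare_exchange_strong(v, &expected, desired);
}
```

The explicit memory-order arguments are the part most worth reviewing in a conversion like this, since a hand-written wrapper may have implied stronger or weaker ordering than the C11 call chosen to replace it.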
ORTE discussion went well, Geoffroy Vallee wrote up summary and posted to devel-core on Jul 24th.
- ACTION: Everyone please read and reply to devel-core with your thoughts.
Review All Open Blockers
Review v2.x Milestones v2.1.4
- v2.1.4 - Released v2.1.4 ON TIME.
- Now we need v2.1.5
- A serious issue on VADER came in. Bad memory barrier in fast box.
- Introduced last Dec. on all supported release streams.
- Potential Silent Data Corruption.
- George issue -
- If users use an overlapping datatype (data overlaps itself), Open MPI sends wrong data.
- Potential Silent Data Corruption.
- Affects all releases.
- George has a patch
- Always used to have a src RPM as part of RC.
- Jeff had some problems using the Python script to upload 2.1.4 tarballs built on AWS to S3.
- Typo fix for PMIx (MB prefix), but not upgrading because 2.1.4 is end of 2.x stream
- Peter filed an Issue 5520
- Thread Multiple warnings when exit on an error. Doesn't block.
- Aug 10th is release date.
- Test RC, get feedback back.
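The vader fast-box item above concerns a bad memory barrier. As a generic illustration only, the sketch below shows the usual publish pattern in C11 for a single writer and a single reader. The writer fills the payload and then stores a header with release ordering. The reader loads the header with acquire ordering before touching the payload. This is not the vader code, and the names `fbox_slot_t`, `fbox_send`, `fbox_poll`, and `FBOX_PAYLOAD_MAX` are hypothetical. Without the release/acquire pairing, weakly ordered hardware may let the reader observe the header before the payload is visible, which is one way silent data corruption can arise.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define FBOX_PAYLOAD_MAX 64  /* hypothetical slot size */

/* Hypothetical single-writer, single-reader mailbox slot. */
typedef struct {
    atomic_size_t len;                       /* 0 means "slot empty" */
    unsigned char payload[FBOX_PAYLOAD_MAX];
} fbox_slot_t;

/* Writer: copy the payload first, then publish its length with release
 * ordering. Returns 0 on success, -1 if the message does not fit or the
 * slot is still occupied. */
static inline int fbox_send(fbox_slot_t *slot, const void *buf, size_t len)
{
    if (len == 0 || len > FBOX_PAYLOAD_MAX) {
        return -1;
    }
    if (atomic_load_explicit(&slot->len, memory_order_acquire) != 0) {
        return -1;
    }
    memcpy(slot->payload, buf, len);
    atomic_store_explicit(&slot->len, len, memory_order_release);
    return 0;
}

/* Reader: acquire-load the length before reading the payload. Returns the
 * number of bytes copied into out, or 0 if the slot is empty. */
static inline size_t fbox_poll(fbox_slot_t *slot, void *out)
{
    size_t len = atomic_load_explicit(&slot->len, memory_order_acquire);
    if (len == 0) {
        return 0;
    }
    memcpy(out, slot->payload, len);
    /* Mark the slot empty only after the payload has been copied out. */
    atomic_store_explicit(&slot->len, 0, memory_order_release);
    return len;
}
```

The slot's `len` field is assumed to be initialized to 0 (for example with `atomic_init`) before either side uses it.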
Review v3.0.x Milestones v3.0.3
- Schedule:
- v3.0.x will try to do an RC today.
- Probably won't have George's patch in it.
- Want to have Nathan's patch (which isn't in master yet).
- Is there a reason it's not in master? Jeff will followup.
- PR 5484 - want into RC1, but Giles on vacation. - Nathan can test
- need
- v3.0.3 - targeting Sept 1st (start RCs when 2.1 wraps up).
- Anticipate RC1 after the Aug 10th release of v2.1.4.
- Got good progress in reviews.
Review v3.1.x Milestones v3.1.0
- v3.1.2 release process, starts after Sept 1st release of v3.0.3
- Lots of PRs multiple 5485
- ucx segfault
- 5083 - we just need some update. Xin Zhao will update issue.
- Schedule: branch: July 18. release: Sept 17
- Date for first RC - Aug 13 (after sunset of 2.1.4)
- Cuda support:
- Does NVIDIA want openib included by default when --with-cuda is given?
- Yes, because at this moment UCX is not on par, but still want to migrate to ucx cuda.
- Warning message will mention deprecated openib vs ucx
- Has this work been done???
- NEWS - Deprecate MPIR message for NEWS - Ralph can help with this.
- PR 5497 - ROMIO wait for Giles to review. Later this week.
- PR 5472 - joint effort of 4 commits - Jeff to review
- status update: Good enough at the moment, Not exactly the scheme we outlined in prior issue. It does satisfy external hwloc or external libevent. Since it broke aws.
- New OMPI-IO components - PR 5539 - DDN added support for Infinite Memory Engine.
- Can we Pull this into v4.0.0?
- Sorry, No. This is new functionality and we've already branched for v4.0.x
- We can consider this for v4.0.1, but it might not get it until v4.1.x
- Who has a filesystem that can test this?
- Very well isolated component. Can it be considered?
- PR 5504 - Please ensure bug fixes only, and separate commits to allow us to consider them separately.
- Geoff and Howard will build test suites with v3.1.x and run with master/v4.0 to see if anything breaks.
- ORTE/PRTE - Geoffroy Vallee sent out a document with a summary to devel-core. Everyone please read and reply.
- Just asked everyone to please read this, and will discuss next week.
- Want to make sure that there are very good alternatives to whatever orte is turning into that will use PMIx.
- Replacing framework and calling PMIx directly is a really good idea.
- Will mess up if there is no native support for PMIx.
- in Open MPI v5.0.x timeframe.
- A couple of PMIx release branches getting closer to released.
- Some updates that might be worth getting into Open MPI, but don't hold up release for.
- From two weeks ago:
- MTT License discussion - MTT needs to be de-GPL-ified.
- All go try the python. - All the GPL is in the perl modules (using python works around that).
- Ralph started a PR, and now in limbo. Need to get this done by end of 2018
- Main concern is python is in a repo with no GPL code.
- Could delete Perl altogether, but may need to just move Perl to a different repo for a period of time, until everyone can move to Python.
- Has Cisco found an alternative to Perl funclets?
- Python ini execution is different than Perl's.
- Cisco has one Perl ini for each branch, and under that, 20-30 MPI installs.
- Probably will go with a template and stamp out 20-30 times
Review Master Pull Requests
- PR for setting VERSION on master. Have we broken any VERSIONs?
- Issue 5529 - George and Jeff discussing a flag that doesn't work. Not sure how to fix it yet.
Review Master MTT testing
Hope to have better Cisco MTT in a week or two
- Peter is going through, and he found a few failures, some of which have been posted.
- one-sided - Nathan's looking at.
- some more coming.
- osc_pt2pt will exclude itself in an MT run.
- One of Cisco's MTTs runs with an env setting that turns every MPI_Init into MPI_Init_thread (even though it is a single-threaded run).
- Now that osc_pt2pt is ineligible, many tests fail.
- on Master, this will fix itself 'soon'
- BLOCKER for v4.0 for this work so we'll have vader and something for osc_pt2pt.
- Probably an issue on v3.x also.
- Did this for release branches, Nathan's not sure if on Master. - v4.0.x has RMA capable vader. Once
Next Face to Face
- When? Discuss results of doodle. Settle: Oct 16-18 week
- Where? Settle:
- San Jose - Cisco
- Brian may be able to come if it's San Jose for a day-trip
- Albuquerque - Sandia (believe it's okay, but need to verify)
- May have problems with foreign nationals (90 days), so too late.
- Mellanox, Sandia, Intel
- LANL, Houston, IBM, Fujitsu
- Amazon
- Cisco, ORNL, UTK, NVIDIA