WeeklyTelcon_20190611
Geoffrey Paulsen edited this page Jul 2, 2019
·
1 revision
- Dialup Info: (Do not post to public mailing list or public wiki)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov
- Brian Barrett
- Dan Topa
- David Bernholdt
- Geoff Paulsen
- Howard Pritchard
- Jeff Squyres
- Josh Hursey
- Ralph Castain
- Thomas Naughton
- Todd Kordenbrock
- Aravind Gopalakrishnan (Intel)
- Arm (UTK)
- Brandon Yates (Intel)
- Brendan Cunningham (Intel)
- Edgar Gabriel
- Geoffroy Vallee
- George Bosilca
- Jake Hemstad
- Joshua Ladd
- Matias Cabral
- Matthew Dosanjh
- Michael Heinz (Intel) - Introducing Brandon
- Nathan Hjelm
- Noah Evans (Sandia)
- Peter Gottesman (Cisco)
- Xin Zhao
- mohan
- Version checks vs functionality.
- UCX has committed to stability
- We don't want to be caught flat-footed again.
- It sounds like, starting with UCX v1.7, they'll make UCT backward compatible.
- Other than people who explicitly set the routed component to debruijn, who will this impact?
- No one, never used unless specifically asked for.
- Ralph's not convinced that the debruijn component can even WORK with the launch tree mechanism. That's not how it's designed.
- Thought: is there a way to make debruijn just alias the 'other' routed component?
- No way to do this easily.
- Is it better to say "you asked for X and we gave you Y", or "you asked for X and we no longer support it"?
- Component routed initialization phase is not stateless, but fixed on master.
- The immediate fix for release branches is to just remove the problem component.
- If it never worked in a .0 release, then no one could have been using it.
- Worked when developed because only one component at a time.
- When we transitioned to all components being active (for another reason), we weren't testing this component at scale.
- Don't want to do aliasing. Rather just git rm this broken component.
- ACTION: Ralph will repush a PR to remove the broken component that is also messing up the OTHER components.
Blockers All Open Blockers
Review v3.0.x Milestones v3.0.4
Review v3.1.x Milestones v3.1.4
- issue 6655.
- hostname work isn't advancing
- Waiting for vader and atomic fixes from v4.0.x (Geoff)
Review v4.0.x Milestones v4.0.2
- Moved release date to June 21st.
- Shouldn't move up releases quickly like this.
- Others were planning on a Sept release
- Should we do a 4.0.3 in Sept? (latest we can practically do in 2019)
- Drivers: UCX compile issue
- Vader / ob1 - more serious.
- Other issues in vader, some attempts to fix (double free using xpmem)
- Proposed fix, but never actually fixed.
- If you compile with gcc 5 or 6, IMB hangs on x86_64.
- Yes, we have a fix in hand (on master, PR 6711).
- Workaround is to use an older GCC.
- What testing gaps led to us not hitting this?
- ACTION: we should discuss how to address testing holes.
- ACTION: we should run IMB or OSU as part of CI.
- Perhaps we should have a testing czar.
- Now we look at it as an afterthought.
- So how bad are our vader issues?
- If we don't have fixes we need, June is unrealistic
- Could use a GitHub Project Kanban board for this.
- Howard's making a new label and applying it to all issues needed for v4.0.2
- We will review that next week and discuss schedule.
Review Master Master Pull Requests
- Schedule
- Suggesting Sept 16
- Jeff to re-update the Doodle poll for availability. We'll pick next week.
- Location
- TBD
- IBM's not submitting after cluster update
- Brian working on 512 nodes ssh.
- Not much MTT development going on.
- Mellanox, Sandia, Intel
- LANL, Houston, IBM, Fujitsu
- Amazon
- Cisco, ORNL, UTK, NVIDIA