WeeklyTelcon_20210615

Geoffrey Paulsen edited this page Jul 5, 2021 · 1 revision

Open MPI Weekly Telecon

Attendees (on Webex)

  • Austen Lauria (IBM)
  • Brendan Cunningham (Cornelis Networks)
  • Brian Barrett (AWS)
  • David Bernholdt (ORNL)
  • Geoffrey Paulsen (IBM)
  • Jeff Squyres (Cisco)
  • Josh Hursey (IBM)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Cornelis Networks)
  • Raghu Raja
  • Sam Gutierrez (LANL)
  • Tomislav Janjusic (NVIDIA)
  • William Zhang (AWS)

not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Artem Polyakov (NVIDIA)
  • Aurelien Bouteiller (UTK)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Christoph Niethammer (HLRS)
  • Edgar Gabriel (UH)
  • Erik Zeiske (HPE)
  • Geoffroy Vallee (ARM)
  • George Bosilca (UTK)
  • Harumi Kuno (HPE)
  • Hessam Mirsadeghi (NVIDIA)
  • Howard Pritchard (LANL)
  • Joseph Schuchart (HLRS)
  • Joshua Ladd (NVIDIA)
  • Marisa Roman (Cornelis Networks)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Thomas Naughton III (ORNL)
  • Noah Evans (Sandia)
  • Ralph Castain (Intel)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Todd Kordenbrock (Sandia)
  • Xin Zhao (NVIDIA)

New Items

v4.0.x

  • Will ship v4.0.6 today.

v4.1.x

  • No driver for a rushed release, so the branch is now in a bugfix-only phase.

v5.0.x

  • PMIx / PRRTE plan to release in the next few weeks.

  • Need to do a v5.0 rc as soon as PRRTE v2 ships.

    • Need feedback if we've missed an important one.
  • PMIx Tools support is still not functional. Opened tickets in PRRTE.

    • Not a common case for most users.
    • This also impacts the MPIR shim.
      • PRRTE v2 will probably ship with broken tool support.
  • Is the driving force for PRRTE v2.0 OMPI?

    • So we'd be indirectly/directly responsible for PRRTE shipping with broken tool support?
    • Ralph would like to retire, and really wants to finish PRRTE v2.0 before he retires.
    • Or just fix it in PRRTE v2.0?
    • Is broken tool support a blocker for PRRTE v2.0?
      • Don't ship OMPI v5.0 with broken Tools support.
  • Are there any objections to delaying

    • Either we resource this
  • https://github.com/openpmix/pmix-tests/issues/88#issuecomment-861006665

    • Current state of PMIx tool support.
    • We'd like to get Tool support in CI, but need it to be working to enable the CI.
  • https://github.com/openpmix/prrte/issues/978#issuecomment-856205950

    • Blocking issue for Open-MPI
    • Brian
  • PR 9014 - new blocker.

    • The fix should be just a couple of lines of code, but it is hard to decide what behavior we want.
    • Ralph, Jeff and Brian started talking.
  • Need some configury changes in before we RC.

  • Issue 8850, 8990 and more

  • Brian will file 3-ish issues

    • One is configure pmix
  • Dynamic Windows fix in for UCX.

  • Any update on debugger support?

  • Need some documentation that Open MPI v5.0 supports PMIx based debuggers, and that if

  • MPIR Shim - pushed up fixes, and enabled CI.

    • Could add it to some more CI, to ensure that PMIx doesn't break
    • IBM is working on some CI testing with MPIR (typically very brittle)
    • Need some guidance on pmix version.
    • Right now this is probably not a big deal, but in perhaps 2 years, when we have 3 release branches each with a different pmix version, it might make sense to do open-mpi CI testing.
      • Shouldn't be too much work to do.
  • The UCC coll component is being updated so that it is the default when UCX is selected (PR 8969).

    • Intent is that this will eventually replace hcoll.

Documentation

  • Solid progress is happening on Read the Docs.
  • Would these docs be hosted on the readthedocs.io site, or on our own site?
    • Haven't thought either way yet.
    • No strong opinion yet.

Master

MPI 4.0 API

  • Now released.

  • We don't KNOW whether OMPI v6.0 will be an ABI break.

    • So nice to get MPIX_ rename into v5.0
  • Would be NICE to get MPIX symbols into a separate library.

    • What's left in MPIX after persistent collectives?
      • Short Float,
      • Pcall_req - persistent collective
      • Affinity
    • If they're NOT built by default, it's not too high of a priority.
  • Should just be some code-shuffling.

    • On the surface shouldn't be too much.
    • If they use wrapper compilers, or official mechanism
    • Top level library, since app -> MPI and app -> MPIX lib.
    • libmpi_x library can then be versioned differently.
  • Don't change to building MPIX by default.

  • Open an issue to track all of our MPI 4.0 items

    • The MPI Forum will want this, certainly before Supercomputing.
  • Do we want an MPI 4.0 Design meeting in place of a Tuesday meeting?

    • An in-person meeting is off the table for many of us. We might want an out-of-sequence meeting.
    • Let's Doodle something a couple of weeks out.
    • Create the Doodle and send it out.
    • Create a trivial wiki page in the style of the other in-person meeting wiki pages.

MTT

  • Mellanox hasn't been reporting for a while. Tommi will follow up.
  • Jeff did some work on Cisco MTT.
    • There are a bunch of one-sided issues across nodes.
    • Austen and Jeff are looking into it.
    • Narrowed it down to strange results from MPI_Comm_split
      • Local Peers value appears to be set wrong under PRRTE
  • Joseph saw an hwloc installed in the installation path, which leads to warnings when another hwloc is used.
    • We changed how all of this worked a few weeks ago.
    • We shouldn't be installing one unless we can't find an external one.
    • Problem is if you link the application to a different hwloc, it now complains.
    • This has always been a problem; we just warn about it now. Don't do this.
  • Austen filed a couple of issues from MTT.

PMIx

  • No discussion

PRRTE v2.0

  • No update

Longer Term discussions

  • No discussion.