WeeklyTelcon_20210125

Geoffrey Paulsen edited this page Jan 27, 2021 · 1 revision

Open MPI Weekly Telecon

  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees (on Web-ex)

  • Aurelien Bouteiller (UTK)
  • Brendan Cunningham (Cornelis Networks)
  • Brian Barrett (AWS)
  • Christoph Niethammer (HLRS)
  • David Bernholdt (ORNL)
  • Edgar Gabriel (UH)
  • Geoffrey Paulsen (IBM)
  • George Bosilca (UTK)
  • Harumi Kuno (HPE)
  • Hessam Mirsadeghi (UCX/nVidia)
  • Howard Pritchard (LANL)
  • Jeff Squyres (Cisco)
  • Joseph Schuchart
  • Josh Hursey (IBM)
  • Joshua Ladd (nVidia/Mellanox)
  • Matthew Dosanjh (Sandia)
  • Michael Heinz (Cornelis Networks)
  • Raghu Raja (AWS)
  • Ralph Castain (Intel)
  • Todd Kordenbrock (Sandia)
  • William Zhang (AWS)

not there today (I keep this for easy cut-n-paste for future notes)

  • Akshay Venkatesh (NVIDIA)
  • Austen Lauria (IBM)
  • Naughton III, Thomas (ORNL)
  • Artem Polyakov (nVidia/Mellanox)
  • Brandon Yates (Intel)
  • Charles Shereda (LLNL)
  • Erik Zeiske
  • Geoffroy Vallee (ARM)
  • Mark Allen (IBM)
  • Matias Cabral (Intel)
  • Nathan Hjelm (Google)
  • Noah Evans (Sandia)
  • Scott Breyer (Sandia?)
  • Shintaro Iwasaki
  • Tomislav Janjusic
  • Xin Zhao (nVidia/Mellanox)

4.0.x

  • v4.0 release: would like to take this ROMIO one-off fix instead of a whole new ROMIO.
    • https://github.com/open-mpi/ompi/pull/8370 - Fixes HDF5 on Lustre
    • Proposing to take this one-off for v4.0.6, as a whole new ROMIO is a big change.
    • Waiting on v4.0.6rc2 until we get an answer.
    • Everyone seems okay with taking this into the release branch and waiting for the ROMIO update on master.
    • Just needs a review

v4.1

  • Issue 8334 - a performance regression with AVX512 on Skylake. Still digging into it.
  • Issue 8410 - Build Failure on Apple Silicon.
    • Do we just need a new updated string, or is that just one of the issues?
    • Code changes we need in v4.1.
    • Will have exact same problem in PMIx and PRRTE
    • Performance with Atomic FIFO is another issue, might not need to backport to v4.1
  • Issue 8367 - will take to UCX community
    • Not yet brought up to the UCX community. Josh will take it up.
  • Issue 8379 - UCT appears to be default and not UCX
    • Jeff repinged for request

Open-MPI v5.0

What's the state of ULFM (PR 7740) for v5.0?

  • Does the community want this ULFM PR 7740 for OMPI v5.0? If so, we need a PRRTE v3.0
    • Howard gave it a spin; it worked, with a few issues.
      • -no-orte CI tests will fail
        • This PR detects if you're using an external PRRTE, and if RT is not FT, then it errors out in the configure.
        • -no-orte is just an alias to -no-prrte, so if this is causing issues, may
        • We're pushing to external PRRTE.
        • This build should only abort if it's requested, but not found.
        • Aurelien will fix, which will fix CI.
        • Compiler fix is just duplicate typedefs.
    • Aurelien will make a PR today to add some tests, but unsure how to add to MTT.
      • Put into the ibm suite; most will pick it up by default.
  • Are we Feature Complete?
    • PRRTE should be ready end of Q1.
    • Based on v5.0 tracker, there is a bunch of stuff not in.
    • GPU Direct support for OFI MTL
      • AWS working on now. Need to rebase, and upstream.
    • OFI BTL changes need to get upstreamed.
    • Weeks for MTL
  • Edgar: atomicity issue for OMPIO. Not sure if it's a full feature, but need to have it on the radar.
    • ETA: a few days after Edgar finds time. 2-3 weeks.
  • Any other big features?
  • Will discuss the branch date next week.

New Topics

  • How to implement so that ./configure --help presents all configure options to users?
  • Didn't get to 1/25
  • Process returns wrong result unless pml is ^ucx.
  • Looks like the user is trying to use UCX with TCP inside of a container.
    • Not sure how well tested UCX+TCP is.
    • If users are using TCP sockets, why is selection picking UCX instead of OB1/TCP_BTL?
  • Should be straightforward to chase down, and
    • Possibly an issue with collective and UCX in this runmode.

Set up GitHub Teams

  • Jeff can set this up so we have a single point of contact in GitHub that many members of organizations can watch.
  • Don't go crazy to start, just set up a few.

Longer Term discussions

ROMIO Long Term (12/8)

  • What do we want to do about ROMIO in general?
    • OMPIO is the default everywhere.
    • Gilles is saying the changes we made are integration changes.
      • There have been some OMPI-specific changes put into ROMIO, meaning upstream maintainers refuse to help us with it.
      • We may be able to work with upstream to make a clear API between the two.
    • As a 3rd party package, should we move it up to the 3rd party packaging area, to be clear that we shouldn't make changes to this area?
  • Need to look at this treematch thing. Upstream package that is now inside of Open-MPI.
  • Might want a CI bot to watch a set of files, and flag PRs that violate principles like this.
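
As a rough illustration of the CI bot idea above, a check could compare the files changed in a PR against a list of vendored areas and flag any overlap. This is only a minimal sketch: the path patterns and the function name are hypothetical, and how the list of changed files is obtained from the CI system is left out.

```python
from fnmatch import fnmatch

# Hypothetical glob patterns for vendored (3rd party) code. The real
# paths would have to be chosen by the maintainers.
PROTECTED_PATTERNS = [
    "3rd-party/*",
    "ompi/mca/io/romio/*",
]


def flag_protected_changes(changed_files, patterns=PROTECTED_PATTERNS):
    """Return the changed files that fall under any protected pattern.

    fnmatch's '*' also matches '/', so 'dir/*' covers nested files.
    """
    return [
        path
        for path in changed_files
        if any(fnmatch(path, pattern) for pattern in patterns)
    ]


if __name__ == "__main__":
    # Example input; a real bot would get this list from the PR diff.
    changed = ["3rd-party/treematch/tm.c", "ompi/mpi/c/send.c"]
    for path in flag_protected_changes(changed):
        print(f"WARNING: {path} is in a vendored area")
```

A bot built along these lines would only flag the PR for a human to look at, rather than block it, since some integration changes may be legitimate.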

Doc update

  • PR 8329 - convert README, HACKING, and possibly man pages to reStructuredText.
    • Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
    • Has a build from this PR, so we can see what it looks like.
    • Have a look. It's a different approach to have one document that's the whole thing.
      • FAQ, README, HACKING.
  • Do people even use manpages anymore? Do we need/want them in our tarballs?
  • Putting new tests there
  • Very little there so far, but working on adding some more.
  • Should have some new Sessions tests

What's going to be the state of the SM Cuda BTL and CUDA support in v5.0?

  • What's the general state? Any known issues?

  • AWS would like to get.

  • Josh Ladd - Will take internally to see what they have to say.

  • From nVidia/Mellanox, Cuda Support is through UCX, SM Cuda isn't tested that much.

  • Hessam Mirsadeghi - All CUDA awareness is through UCX

  • May ask George Bosilca about this.

  • Don't want to remove a BTL if someone is interested in it.

  • UCX also supports TCP via CUDA

  • PRRTE CLI on v5.0 will have some GPU functionality that Ralph is working on

  • Update 11/17/2020

    • UTK is interested in this BTL, and maybe others.
    • Still gap in the MTL use-case.
    • nVidia is not maintaining SMCuda anymore. All CUDA support will be through UCX
    • What's the state of the shared memory in the BTL?
      • This is the really old generation Shared Memory. Older than Vader.
    • Was told after a certain point, no more development in SM Cuda.
    • One option might be to
    • Another option might be to bring that SM in SMCuda to Vader (now SM)
  • Discussion on:

    • Didn't get to this week. :(
    • Draft Request Make default static https://github.com/open-mpi/ompi/pull/8132
    • One con is that many providers hard link against libraries, which would then make libmpi dependent on this.
    • Non-homogeneous clusters (GPUs on some nodes, and no GPUs on others)

Video Presentation

  • ECP Community Days (March 30-April 1)
    • David Bernholdt and/or George Bosilca
    • 90-minute time slots each day.
    • Get proposal in by this Friday.