WeeklyTelcon_20201215
- Dialup Info: (Do not post to public mailing list or public wiki)
- NOT-YET-UPDATED
- Link has changed for 2021. Please see the email from Jeff Squyres to devel-core@lists.open-mpi.org on 12/15/2020 for the new link.
- v4.0.6rc1 - built, please test.
- SLURM fix
- Talk about Edgar
- Shooting for a release THIS week (12/15)
- Want the SLURM fix and possibly the OMPIO default change?
- Does the community want this ULFM PR 7740 for OMPI v5.0? If so, we need a PRRTE v3.0
- On or off by default?
- In PRRTE, think it can be disabled by default?
- PRRTE had a bunch of issues turning this off.
- Is the bar for bringing this into master that if it's off, it's REALLY off?
- Runtime or configure-time enablement? (see the enablement sketch below this list)
- Some folks want one release of PRRTE without this, but others think it's production ready.
- LARGE PR, quite disruptive. Would want it in as soon as possible, so we can shake out bugs.
- Might want some test cases for this as well, with different applications.
- Think they have some tests in the other ULFM branch, not sure about this branch.
- Still need to coordinate on this. He'd like this this week.
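To ground the runtime-vs-configure-time question above, here is a minimal sketch of what the two styles of enablement could look like. The `--with-ft=ulfm` option follows the naming already used in the ULFM work; the runtime MCA toggle is purely hypothetical, since that interface is exactly what has not been decided.

```sh
# Sketch only - not a decided interface.

# Configure-time enablement: compile the fault-tolerance code in (or omit it entirely):
./configure --with-ft=ulfm --prefix=$HOME/ompi-ulfm
make -j8 install

# Runtime enablement (hypothetical MCA switch, if the code is built in but off by default):
mpirun --mca ompi_ft_enable 1 -np 4 ./ulfm_test
```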
- PMIx v4.0 working on Tools, hopefully done soon.
- PMIx going through Python bindings.
- A new shmem component to replace the existing one.
- Still working on it.
- Dave Wooten pushed up some PRRTE patches, and is making some progress there.
- Slow but steady progress.
- Once tool work is more stabilized on PMIx v4.0, will add some tool tests to CI.
- Probably won't start until first of the year.
- How are the submodule reference updates on Open-MPI master?
- Josh was still looking to see about adding some cross checking CI
- When making a PRTE PR, could add some comment to the PR and it'll trigger Open-MPI CI with that PR.
- New web-ex for January
- Slurm is now always using a Cgroup, and always setting default number of cores in cgroup to 1.
- So when using mpirun with orted/prrted under Slurm, orted/prrted can't use more than that one core (see the workaround sketch below).
- Ralph is working on a PR from a user comment (PR 8288).
- Will note this in the Issue, and possibly in the README (it will catch a lot of people).
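A rough sketch of the kind of workaround being discussed for the cgroup restriction, assuming the daemons only need their internal srun step to request more CPUs; the exact srun arguments a site needs depend on its cgroup configuration, so the value below is illustrative.

```sh
# Sketch only: mpirun launches one orted/prrted per node through an internal srun,
# and newer Slurm confines that step to a 1-core cgroup by default.
# Passing extra arguments to that internal srun gives the daemon step more CPUs:
mpirun --mca plm_slurm_args "--cpus-per-task=4" -np 128 ./a.out
```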
- Took the latest ROMIO and it failed on both
- But then he took LAST week's 3.4 BETA ROMIO and it passed. But it's a little too new.
- He gave a bit more info about the stuff he integrates, and stuff he moves forward.
- ROMIO modernization (don't use MPI-1 based things)
- ROMIO integration items.
- We're hesitant to put this into 4.1.0 because it's NOT yet released from MPICH
- Hesitant to even update ROMIO in v4.0.6 since it's a big change.
- If we delay and pick up a newer ROMIO in the next minor, would there be backwards compatibility issues?
- Need to ask about compatibility between ROMIO 3.2.2 and 3.4
- If fully compatible, then we only need one ROMIO
- We could ship multiple ROMIOs, but that has a lot of problems.
- Just got resources to test, and root caused the issue in OMPIO
- So, given some more time Edgar will get a fix, and OMPIO can be the default (see the component-selection sketch below).
- What do we want to do about ROMIO in general?
- There have been some OMPI specific changes put into ROMIO, meaning upstream maintainers refuse to help us with it.
- Long Term we need to figure out what to do about this.
- We may be able to work with upstream to make a clear API between the two.
- Need to look at this treematch thing. Upstream package that is now inside of Open-MPI.
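For context on the OMPIO/ROMIO default discussion above, a minimal sketch of how the MPI-IO implementation is selected at run time; the component names shown (ompio, romio321) are the ones in the current 4.0.x series and differ between releases.

```sh
# Sketch only: pick the MPI-IO implementation via the io framework.
ompi_info | grep " io"                          # list the io components actually built

mpirun --mca io ompio    -np 16 ./my_io_test    # force OMPIO
mpirun --mca io romio321 -np 16 ./my_io_test    # force the bundled ROMIO
```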
How's the state of https://github.com/open-mpi/ompi-tests-public/
- Putting new tests there
- Very little there so far, but working on adding some more.
- Should have some new Sessions tests
- What's going to be the state of the SM Cuda BTL and CUDA support in v5.0?
- What's the general state? Any known issues?
- AWS would like to get.
- Josh Ladd - Will take internally to see what they have to say.
- From nVidia/Mellanox, Cuda Support is through UCX, SM Cuda isn't tested that much.
- Hessam Mirsadeghi - All Cuda awareness through UCX
- May ask George Bosilca about this.
- Don't want to remove a BTL if someone is interested in it.
- UCX also supports TCP via CUDA
- PRRTE CLI on v5.0 will have some GPU functionality that Ralph is working on
- Update 11/17/2020
- UTK is interested in this BTL, and maybe others.
- Still gap in the MTL use-case.
- nVidia is not maintaining SMCuda anymore. All CUDA support will be through UCX (see the selection sketch after this list).
- What's the state of the shared memory in the BTL?
- This is the really old generation Shared Memory. Older than Vader.
- Was told after a certain point, no more development in SM Cuda.
- One option might be to
- Another option might be to bring that SM in SMCuda to Vader(now SM)
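For reference on the SM Cuda discussion above, a minimal sketch of how CUDA traffic is commonly steered today; the PML/BTL names are the usual ones, but whether smcuda remains available in v5.0 is exactly the open question.

```sh
# Sketch only: two common ways to run a CUDA-aware job today.

# 1) Through UCX (the path nVidia/Mellanox maintain):
mpirun --mca pml ucx -np 8 ./cuda_app

# 2) Through the ob1 PML with the older smcuda shared-memory BTL:
mpirun --mca pml ob1 --mca btl smcuda,self,tcp -np 8 ./cuda_app
```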
- Restructured Text docs (more features than Markdown, including cross-references)
- Jeff had a first stab at this, but take a look. Sent it out to devel-list.
- All work for master / v5.0
- Might just be useful to do README for v4.1.? (don't block v4.1.0 for this)
- Sphinx is the tool to generate docs from reStructuredText (see the build sketch after this list).
- can handle current markdown manpages together with new docs.
- readthedocs.io encourages "restructured text" format over markdown.
- They also support a hybrid for projects that have both.
- Thomas Naughton has done the restructured text, and it allows
- LICENSE question - what license would the docs be available under? Open-MPI BSD license, or
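A minimal sketch of what building such docs looks like, assuming a standard Sphinx layout; the directory names and the use of myst-parser for the Markdown/reST hybrid are assumptions, not project decisions.

```sh
# Sketch only: a conventional Sphinx workflow for reStructuredText docs.
pip install sphinx myst-parser          # myst-parser lets Markdown and reST coexist
sphinx-quickstart docs                  # generates docs/conf.py and docs/index.rst
sphinx-build -b html docs docs/_build/html
```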
- Ralph tried the Instant-On at scale:
- 10,000 nodes x 32PPN
- Ralph verified Open-MPI could do all of that in < 5 seconds, Instant-On.
- Through MPI_Init() (if using Instant-On)
- TCP and Slingshot (OFI provider private now)
- PRRTE with PMIx v4.0 support
- SLURM has some of the integration, but hasn't taken this patch yet.
- Discussion on:
- Draft Request Make default static https://github.com/open-mpi/ompi/pull/8132
- One con is that many providers hard link against libraries, which would then make libmpi dependent on this.
- Talking about amending the proposal so each MCA component can declare whether it should be slurped in.
- (if the component hard links or dlopens their libraries)
- Roadrunner experiments... The bottleneck in launching was I/O in loading all the .so files.
- Spindle and burst buffers reduce this, but still...
- Still going through function pointers, no additional inlining.
- can do this today.
- Still different than STATIC (sharing this image across process), just not calling dlopen that many times.
- New proposal is to have a 3rd option where the component decides its default is to be slurped into libmpi (see the configure sketch at the end of these notes).
- It's nice to have fabric providers not bring their dependencies into libmpi, so that the main libmpi can be run on nodes that may not have the provider's dependencies installed.
- Low priority thing anyway, if we get it in for v5.0 it'd be nice, but not critical.
- George and Jeff are leading
- No new updates this week (see last week)
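For reference on the static-default discussion, a minimal sketch of the configure knobs that already exist for deciding what gets built into libmpi; whatever form a "3rd option" takes is not decided, and the component names below are illustrative.

```sh
# Sketch only: existing controls for building components into the library vs. as DSOs.
./configure --disable-dlopen                       # build everything into libmpi, never dlopen
./configure --enable-mca-dso=btl-smcuda,io-ompio   # build only the listed components as DSOs
```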