WeeklyTelcon_20210209
- Dialup Info: (Do not post to public mailing list or public wiki)
- Jeff Squyres (Cisco)
- Howard Pritchard (LANL)
- Ralph Castain (Intel)
- Geoffrey Paulsen (IBM)
- Austen Lauria (IBM)
- Joseph Schuchart
- Hessam Mirsadeghi (UCX/nVidia)
- Edgar Gabriel (UH)
- Brendan Cunningham (Cornelis Networks)
- Josh Hursey (IBM)
- Matthew Dosanjh (Sandia)
- Naughton III, Thomas (ORNL)
- Raghu Raja (AWS)
- Todd Kordenbrock (Sandia)
- William Zhang (AWS)
- George Bosilca (UTK)
- Aurelien Bouteiller (UTK)
- Christoph Niethammer (HLRS)
- Harumi Kuno (HPE)
- Brian Barrett (AWS)
- David Bernholdt (ORNL)
- Howard Pritchard
- Marisa Roman (Cornelis Networks)
- Joshua Ladd (nVidia/Mellanox)
- Michael Heinz (Cornelis Networks)
- Akshay Venkatesh (NVIDIA)
- Artem Polyakov (nVidia/Mellanox)
- Brandon Yates (Intel)
- Charles Shereda (LLNL)
- Erik Zeiske
- Geoffroy Vallee (ARM)
- Mark Allen (IBM)
- Matias Cabral (Intel)
- Nathan Hjelm (Google)
- Noah Evans (Sandia)
- Scott Breyer (Sandia?)
- Shintaro Iwasaki
- Tomislav Janjusic
- Xin Zhao (nVidia/Mellanox)
- PR 8435 - https://github.com/open-mpi/ompi/pull/8435/files#r570096876
- Question as to what George was saying.
- George just saying that MPI already has that info and we don't need to ask PMIx again.
- Need it in HAN, and if we need it elsewhere, just move to base
- That being said, George doesn't want it in Tuned at all.
- It was a mistake that this was targeting v4.1 instead of master.
- UCX Issue 8321
- We do need to understand what's going on: there were comments saying we should not support anything older than 1.9.0, but then there was a comment that it's reproducible in 1.9 also.
- Is this a UCX problem, or a PML problem?
- We don't know if it's PML or UCX
- UCX 1.9.0 + OMPI 4.0.4 - Issue 8442
- may not be related to Issue 8321
- We're ready to cut an RC for both 4.1.1 and 4.0.6, these two are blocking.
- UCX meeting is on Wednesdays
- Howard may go tomorrow.
- The UCX community didn't like us configuring it out; they're looking into it.
- It'd be nice to link this to an issue tomorrow.
- George will look at 8466 (
- Schedule - If we could get something for Issue 8321, we can do an RC soon.
- We'll put out 4.0.6rc2 this week, but we'll know more about UCX and maybe 8466.
- A few changes to PMIx v3.2 series, that we might want in v4.0.6rc2
- Setting of hostname to NULL to protect against multiple PMIx init calls.
- Some AVX problems: email exchange with Jeff and Brian; the user will email the distribution list.
- Will take PMIx stuff from Ralph.
- Will do a v4.1.1 RC.
- Issue 8379 - UCT appears to be default and not UCX
- Jeff repinged for request
- Does UCT BTL even get built?
- Still in discussion in Issue 8102.
- Common misconception that people can install over an existing install.
- Might be an older mca component from
- We had a PR to have a unique signature for each build.
- If we had this, we could embed the signature in the modules themselves and avoid this issue at runtime by only opening MCA components that come from the same build.
- We currently have something for mca VERSION, but we never update the mca version.
- So maybe we want to add the OMPI version into this mca version check.
- But this might not be enough, as recompiles might use different configure options.
- We need something that identifies the configure itself.
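The build-signature idea above could look roughly like the following. This is only a sketch of the logic, not the PR's implementation: the function names, the version strings, and the choice of hashing the version together with the configure arguments are all assumptions made for illustration.

```python
import hashlib
from typing import Sequence


def build_signature(version: str, configure_args: Sequence[str]) -> str:
    """Hypothetical signature: hash of the version plus the configure arguments."""
    h = hashlib.sha256()
    h.update(version.encode())
    for arg in configure_args:
        h.update(b"\0")  # separator so ["ab", "c"] and ["a", "bc"] differ
        h.update(arg.encode())
    return h.hexdigest()[:16]


def can_open_component(core_sig: str, component_sig: str) -> bool:
    """Only open a component whose signature matches the core library's."""
    return core_sig == component_sig


# Illustrative values only (assumed, not taken from a real build).
core = build_signature("5.0.0a1", ["--prefix=/opt/ompi", "--with-ucx"])
stale = build_signature("4.0.5", ["--prefix=/opt/ompi"])
rebuilt = build_signature("5.0.0a1", ["--prefix=/opt/ompi"])
```

With a check like this, a component left behind by an older install (`stale`) and one from a recompile with different configure options (`rebuilt`) would both be rejected, which a version-only check would not catch in the second case.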
- 8431 - Git commit checks as a GitHub Action.
- hwloc: are we tracking the usage of the hwloc topology loads?
- George wants to take a stab at it. Using it in HAN and Treematch
- Setting a Goal to branch for v5.0 on Last working day of February.
- Geoff will send email to devel list
- No comments other than good to set a goal, and try to make it.
- Austen created a tab in the google spreadsheets
- Excluded Mellanox due to MTT issues, Mellanox is passing 0 tests in their MTT.
- Cisco turned off its MTT because the testing harness creates an MPI Window, which is failing.
- One-sided: this is a blocker for release.
- Jeff's talking to Nathan every day.
- Thinks there is an issue
- Possibly because pt2pt got removed, so possibly just master only.
- Christoph thinks it's master only, and that v4.0.x and v4.1.x is okay for one-sided issue
- Might be
- May be another issue, in that it should be an MPI_Abort, so shouldn't drop core.
- MTT is showing that the master branch is pretty good. We don't need to wait for PRRTE to be complete to branch v5.0.x in OMPI.
- Raghu added an entry for libfabric.
- One-sided tests are still busted. Do we keep running these if they're failing?
- Nathan is actively working on, so hopeful we'll get this.
- XL sheet needs to be updated, as most of the stuff for ompio
- PR on ompi-tests public - Aurelien
- need to configure with ULFM to run the test.
- If run inside of a slurm job, it should just figure it out.
- Edgar: atomicity issue for OMPIO. Not sure if it's a full feature, but need to have it on the radar.
- Not yet resolved.
- ETA: a few days after Edgar finds time. 2-3 weeks.
- GitHub 8431 - Git commit checks as GitHub Actions
- Check for bogus emails.
- Branches will have a flag in a file at a special path: 0 on master, 1 on release branches.
- Jeff detailed out the cases for the checker.
- When it's enabled and a commit is not exempted for any reason.
- Jeff will move from draft to
- Josh summarized discussion from last week in issue.
- Anything else Josh needs to implement?
- No, Josh will get to it before end of month, before v5.0 branches.
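As a rough illustration of the commit-checker cases from 8431 above (bogus emails, a per-branch 0/1 flag, and exemptions), a minimal sketch follows. The email patterns, the function names, and in particular the rule that the flag gates (here, requiring a "cherry picked from commit" line) are assumptions for illustration; the notes do not say what the flag enables.

```python
import re
from typing import List

# Hypothetical patterns for "bogus" author emails.
BOGUS_EMAIL_PATTERNS = [r"^root@", r"localhost", r"localdomain", r"\.local$"]


def is_bogus_email(email: str) -> bool:
    email = email.strip().lower()
    if "@" not in email:
        return True
    return any(re.search(p, email) for p in BOGUS_EMAIL_PATTERNS)


def flag_enabled(flag_file_text: str) -> bool:
    """The flag file holds 0 (master) or 1 (release branches)."""
    return flag_file_text.strip() == "1"


def check_commit(author_email: str, message: str,
                 flag_file_text: str, exempt: bool = False) -> List[str]:
    """Return a list of problems found; an empty list means the commit passes."""
    errors = []
    if is_bogus_email(author_email):
        errors.append("bogus author email: " + author_email)
    # Assumed rule: when the flag is on and the commit is not exempted,
    # require a cherry-pick marker in the commit message.
    if flag_enabled(flag_file_text) and not exempt:
        if "(cherry picked from commit" not in message:
            errors.append("missing cherry-pick marker")
    return errors
```

For example, `check_commit("root@localhost", "Fix typo", "0")` reports only the email problem, since the flag-gated rule is off when the flag file contains 0.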
- PR 8329 - convert README, HACKING, and possibly man pages to reStructuredText.
- Uses https://www.sphinx-doc.org/en/master/ (Python tool, can pip install)
- The intent is for this to land in v5.0.
- mpirun / prrterun - we had quite a bit of details in orte, but are updating as much as possible.
- Ralph has asked about this for PMIx/PRRTE since this is turning out to work
- for v5.0 both of these will need to be fixed.
- Lustre configure option: Edgar sees it, but has no idea how to fix it.
- Not sure if he should open an issue. Ralph thinks Gilles fixed it. Edgar will give it a try.
- SharedFP component: Edgar opened an issue this morning.
- Blocker for v5.0
- What do we want to do about ROMIO in general.
- OMPIO is the default everywhere.
- Gilles is saying the changes we made are integration changes.
- There have been some OMPI specific changes put into ROMIO, meaning upstream maintainers refuse to help us with it.
- We may be able to work with upstream to make a clear API between the two.
- As a 3rd party package, should we move it up to the 3rd party packaging area, to be clear that we shouldn't make changes to this area?
- Need to look at this treematch thing. It's an upstream package that is now inside of Open MPI.
- Might want a CI bot to watch a set of files, and flag PRs that violate principles like this.
How's the state of https://github.com/open-mpi/ompi-tests-public/
- Putting new tests there
- Very little there so far, but working on adding some more.
- Should have some new Sessions tests
- What's the general state? Any known issues?
- AWS would like to get.
- Josh Ladd - Will take internally to see what they have to say.
- From nVidia/Mellanox, CUDA support is through UCX; SM Cuda isn't tested that much.
- Hessam Mirsadeghi - All CUDA awareness is through UCX.
- May ask George Bosilca about this.
- Don't want to remove a BTL if someone is interested in it.
- UCX also supports TCP via CUDA
- PRRTE CLI on v5.0 will have some GPU functionality that Ralph is working on
- Update 11/17/2020
- UTK is interested in this BTL, and maybe others.
- Still gap in the MTL use-case.
- nVidia is not maintaining SMCuda anymore. All CUDA support will be through UCX
- What's the state of the shared memory in the BTL?
- This is the really old generation Shared Memory. Older than Vader.
- Was told after a certain point, no more development in SM Cuda.
- One option might be to
- Another option might be to bring that SM in SMCuda to Vader (now SM)
- Discussion on:
- Didn't get to this week. :(
- Draft Request Make default static https://github.com/open-mpi/ompi/pull/8132
- One con is that many providers hard link against libraries, which would then make libmpi dependent on this.
- Non-homogeneous clusters (GPUs on some nodes, none on others)
- ECP Community Days (March 30 - April 1)
- David Bernholdt and/or George Bosilca
- 90-minute time slots each day.
- Get proposal in by this Friday.