
WeeklyTelcon_20160105

Jeff Squyres edited this page Nov 18, 2016 · 1 revision

Open MPI Weekly Telcon


  • Dialup Info: (Do not post to public mailing list or public wiki)

Attendees

  • Jeff Squyres
  • Brad Benton
  • Edgar Gabriel
  • Geoffrey Vallee
  • Geoff Paulsen
  • Nathan Hjelm
  • Sylvain Jeaugey
  • Todd Kordenbrock
  • Ralph C.

Agenda

Review 1.10

  • Milestones: https://github.com/open-mpi/ompi-release/milestones/v1.10.2
  • Looks good in MTT world.
    • Jeff's cluster had some timeouts on the 10.10 network ("no route to host"), possibly a cluster configuration problem. Ignore for now. It is odd, though: a no-route error should FAIL, not time out. This is in the TCP BTL and may be a multi-rail issue, with an infinite loop rather than a failure.
    • Don't block 1.10 release since probably not common use case.
  • Paul found some issues.
  • NAG Fortran support in configure isn't right.
    • NOT a regression (also present in 1.8). Do we care if this is a blocker?
  • mpirun is hanging after a good run. Only on SLES. Also on Cray (which uses SLES).
    • The proc is defunct / zombied.
    • IO forwarding file descriptors may not be getting a HANGUP.
    • Not a regression: seen in 1.10, 2.0, and master. ORTED is in the event library. Pretty serious, but hoping it's just a SLES issue.
      • Set state machine verbosity to 5.
      • Nathan will look at SLES 11 and SLES 12 (different kernels, very different).
      • Try to find out whether it's SIGCHLD or a file descriptor.
    • Hold 1.10.2 release until Nathan runs tests today.
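
The two suspects named above (child reaping via SIGCHLD versus a hangup/EOF on the forwarded file descriptors) are separate events, and either one can go missing on its own. The following is a generic POSIX sketch in Python, not Open MPI code; the function name `run_child_and_observe` is hypothetical. It forks a child that exits at once and reports, separately, whether the parent saw EOF on the pipe and whether the child could be reaped.

```python
import os
import select


def run_child_and_observe():
    """Fork a child that exits at once; report what the parent observes."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: drop the read end, then exit without writing anything.
        # Exiting closes the child's copy of the write end.
        os.close(read_fd)
        os._exit(0)

    # Parent: close its own copy of the write end. If any copy stays open,
    # the reader never sees EOF, which looks like a missing hangup.
    os.close(write_fd)

    # Wait up to 5 seconds for the read end to become readable (EOF counts).
    readable, _, _ = select.select([read_fd], [], [], 5.0)
    saw_eof = bool(readable) and os.read(read_fd, 1) == b""
    os.close(read_fd)

    # Reap the child. Until this call returns, the child is a zombie.
    reaped_pid, status = os.waitpid(pid, 0)
    return saw_eof, reaped_pid == pid, os.WEXITSTATUS(status)


if __name__ == "__main__":
    print(run_child_and_observe())
```

If the EOF flag were false while the reap succeeded, that would point at a file descriptor still held open somewhere; if the reap were the part that never happened, the child would stay defunct even though the pipe closed.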

Review 2.0.x

Review Master

General Discussion

  • Debian/Ubuntu package support
    • Ubuntu doesn't have a maintainer for Open MPI anymore. The package is not officially "orphaned".
    • Then when it gets adopted, we could adopt it. Nathan has been
    • The old maintainer has a repo and a bunch of patches, which no one in the community has ever looked at.
    • Directions were sent to Ralph, but they are quite complex.
    • Send a request that the package be correctly orphaned.
    • Geoffrey Vallee is willing to take over as official maintainer.

MTT status:

Status Updates:

  • Cisco - nothing OMPI specific to report. Please sign up for the face-to-face meeting on the wiki.
  • ORNL - MTT is running. Announced today that they'd be picking up Debian package maintenance of Open MPI.
  • NVIDIA - got MTT back to normal or close to it. A couple of things fail when enabling GPU Direct RDMA. It has something to do with atomic operations.
    • Can turn off atomic operations via an MCA parameter. Look at the bit flags in the ompi_info output for the openib BTL.
    • Turn off the fetching ops and atomic ops (find the bit values, calculate new flags without those bits, and reset).
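
The "calculate new flags without bits" step is plain bit masking. Below is a minimal sketch in Python. The bit values and the starting flags value are placeholders invented for illustration, and `clear_flags` is a hypothetical helper; the real values have to be read from ompi_info for the installed build.

```python
# Placeholder bit values, for illustration only. These are NOT taken from
# Open MPI; look up the real ones in the ompi_info output.
ATOMIC_OPS_BIT = 0x0400   # assumed placeholder for the atomic ops flag
ATOMIC_FOPS_BIT = 0x0800  # assumed placeholder for the fetching ops flag


def clear_flags(current: int, *bits: int) -> int:
    """Return `current` with every bit in `bits` cleared."""
    mask = 0
    for bit in bits:
        mask |= bit
    return current & ~mask


if __name__ == "__main__":
    current_flags = 0x0F16  # hypothetical value as reported by ompi_info
    new_flags = clear_flags(current_flags, ATOMIC_OPS_BIT, ATOMIC_FOPS_BIT)
    print(f"old flags: {current_flags:#06x}, new flags: {new_flags:#06x}")
```

The resulting value would then be set through the BTL's flags MCA parameter. Clearing a bit that is already clear leaves the value unchanged, so the calculation is safe to repeat.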

Status Update Rotation

  1. Cisco, ORNL, UTK, NVIDIA
  2. Mellanox, Sandia, Intel
  3. LANL, Houston, HLRS, IBM

Back to 2015 WeeklyTelcon-2015
