WeeklyTelcon_20161122
- Dialup Info: (Do not post to public mailing list or public wiki)
- All issues and pull requests for v1.10.x
- 1.10.5
- Nathan sees a segv and will be submitting a bug. May be a driver for 1.10.5.
- Some PRs still here, waiting for reviews (Jeff).
- Did we ever fix the signal handler?
- All issues and pull requests for v2.0.2
- Desired / must-haves for v2.0.x series
- Known / ongoing issues to discuss
- #2234 COMM_SPAWN broken:
- Nathan just filed a PR on this yesterday.
- v2.0.2 schedule:
- IBM Jenkins is down due to a lost filesystem.
- Josh found an issue with idup under multi-threading (MT).
- Nathan's comm_spawn fix should fix this too.
- New issue from yesterday: neighborhood collectives
- https://github.com/open-mpi/ompi/issues/2324
- Neighbor gatherv and neighbor igatherv. Gilles asked for a test case this morning.
- Non-uniform datatypes in base and tuned.
- Same issue in tuned, but it can be shut off.
- The code assumes the same count at all ranks. The "fix" is to turn off the count-based logic, with an MCA parameter to turn it back on if you know your app sends the same counts at all ranks (see the sketch after this list).
- The ibcast stuff is just a workaround.
- libnbc also has this same issue, so whatever fix is made must cover both the blocking and non-blocking collectives.
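
To illustrate the non-uniform-count case being discussed (this is not a reproducer from the meeting, and it uses plain MPI_Allgatherv rather than the neighborhood variants for brevity): every rank contributes a different number of elements, so any algorithm selection based only on the local count can disagree across ranks. All counts and buffer contents below are made up for illustration.

```c
/* Minimal sketch: non-uniform counts in an allgatherv-style collective.
 * Each rank sends (rank + 1) ints; rank 0 prints the total gathered. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int sendcount = rank + 1;              /* a different count at every rank */
    int *sendbuf = malloc(sendcount * sizeof(int));
    for (int i = 0; i < sendcount; ++i) sendbuf[i] = rank;

    int *recvcounts = malloc(size * sizeof(int));
    int *displs = malloc(size * sizeof(int));
    int total = 0;
    for (int r = 0; r < size; ++r) {       /* build the per-rank recv layout */
        recvcounts[r] = r + 1;
        displs[r] = total;
        total += recvcounts[r];
    }
    int *recvbuf = malloc(total * sizeof(int));

    MPI_Allgatherv(sendbuf, sendcount, MPI_INT,
                   recvbuf, recvcounts, displs, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0) printf("gathered %d ints total\n", total);

    free(sendbuf); free(recvbuf); free(recvcounts); free(displs);
    MPI_Finalize();
    return 0;
}
```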
- Desired / must-haves for v2.1.x series
- Reviewed this today (Nov 1st)
- MPI-IO: there is no good multi-threaded test for MPI I/O. Edgar would like one.
- Possible we'll see something for new coll_tuned component.
- Known / ongoing issues to discuss
- PMIx 1.2.0: status?
- PR 2286: will update to PMIx v1.2.0rc1. Testing is looking good. Two outstanding issues -- both should be done this week:
- Update to Get
- One thing Boris is working on.
- Estimated release of PMIx v1.2.0 is this time next week.
- People please try it out!
- Performance issue #1831:
- PSM2 and also libfabric.
- Don't know if it's the PSM component or the CM component.
- Performance on v2.x and master is good; v2.0.x is worst, v1.10.x is best.
- Perhaps something from the request refactor didn't get back-ported correctly?
- Fixed the BTLs, but not the MTLs. Something in that code path is not right.
- Is this a blocker for a 2.0.2 release? From Intel's perspective, they would like it to be a blocker.
- It's weird that it only affects larger (64KB+) messages.
- Only the single-threaded build is affected.
- Hard to tell if it's in the PSM library, or in the Open MPI code-path.
- If we see the gni provider is impacted, then we know it's probably Open MPI.
- Should get data by tomorrow.
- If it also affects uGNI, then it's probably Open MPI, and it should be an Open MPI 2.0.x blocker.
- Where are we on PMIx?
- Performance difference between two different types of machines, especially at high core counts.
- Not going to happen before Supercomputing.
- Job info in PMIx v1.2: Artem is working on it. Should go into PMIx master in the next day or so.
- Will cut another RC after the datastore work.
- Nathan can run a launch scaling test, but not a data scaling test.
- Any data would help. Without the datastore, it will probably die at around 512 nodes @ 272 ppn.
- A few other PRs are open:
- PR #2354 - Can Artem explain what this is for?
- PR #2365 - SLURM bindings: Open MPI doesn't recognize that SLURM is binding processes, and does its own thing. Had fixed this in the schizo framework. In the ESS we can make a more intelligent decision, but this meant we had to bring over the SLURM ESS and add one new schizo framework call. There is also a change in orte_init to open up schizo. This allows us to detect SLURM bindings, and it also allowed us to fix singletons in a SLURM environment, since the singleton ESS doesn't recognize it's in a SLURM environment (see the sketch below).
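
As a rough illustration only (this is not the actual ORTE schizo/ESS code from the PR), detection of the sort described above could look like the sketch below, assuming SLURM advertises its presence and any CPU binding it applied through environment variables such as SLURM_JOB_ID and SLURM_CPU_BIND:

```c
/* Hedged sketch only -- not the real ORTE schizo/ESS implementation.
 * Assumes SLURM exposes its presence and any binding it applied via
 * environment variables (SLURM_JOB_ID / SLURM_JOBID, SLURM_CPU_BIND). */
#include <stdbool.h>
#include <stdlib.h>

/* Are we (even as a singleton) running inside a SLURM allocation? */
static bool running_under_slurm(void)
{
    return getenv("SLURM_JOB_ID") != NULL || getenv("SLURM_JOBID") != NULL;
}

/* Did SLURM already bind this process?  If so, Open MPI should respect
 * that binding instead of applying its own. */
static bool slurm_applied_binding(void)
{
    const char *bind = getenv("SLURM_CPU_BIND");
    return bind != NULL && bind[0] != '\0';
}
```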
- This question was posed to the user community at the BOF.
- We were requested to make two lists:
- Features that we anticipate we could port to a v2.2.x release
- Features that we anticipate would be too difficult to port to a v2.2.x release
- Here's Jeff Sq's first cut at these two lists -- please expand / fill out during the call:
- Features that we anticipate we could port to a v2.2.x release
- Improved collective performance (new “tuned” module)
- Enable Linux CMA shared memory support by default
- Features that we anticipate would be too difficult to port to a v2.2.x release
- THREAD_MULTIPLE improvements for MTLs
- Revamped CUDA support
- PMIx 3.0 integration
- MPI_ALLOC_MEM integration with memkind
- OpenMP affinity / placement integration
- PR #2285: enabling ORTE to use libfabric
- Please go test it!
- Uses RDM messaging
- @hppritcha would like to test, but will not be able to test until next week
SPI - http://www.spi-inc.org
- We have been officially invited to SPI
- Ralph has new information about the two organizations.
Review Master MTT testing (https://mtt.open-mpi.org/)
- Not seeing morning MTT reports, tarball generation emails, or Coverity emails.
- 2.0.x series is still having some failures. Cisco has 2041 failures (~1800 are OSHMEM).
- Not getting morning MTT result emails. Jeff looked into that last week, and went back and forth with Brian.
- Mail to Gmail gets there; mail to Cisco doesn't.
- Ralph thinks there is a newer security mechanism called SPF: if the sending server isn't set up correctly, some sites (Google is not one of them) will reject the email outright (it doesn't even bounce back). There is a server-side setting that declares the server is authorized to send for the domain name, so that receiving systems will accept the mail.
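- For reference (background, not from the meeting discussion): SPF is published as a DNS TXT record on the sending domain; a hypothetical record such as `example.org. IN TXT "v=spf1 mx include:_spf.example.com ~all"` lists which servers are authorized to send mail for that domain.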
- Week of Jan 23, 2017
- Meeting-2017-01
- LANL, Houston, IBM
- Cisco, ORNL, UTK, NVIDIA
- Mellanox, Sandia, Intel