Skip to content

Lpf collectives enhancement #54

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 191 commits into
base: master
Choose a base branch
from
Open

Conversation

anyzelman
Copy link
Member

Split off from #52 , to be reviewed & applied after #52

…quests in a queue pair), not just relying on device information about max_qp_wr, but actually trying to create QPs via ibv_create_qp with different max_send_wr until we find the largest still working number (via binary search). This becomes the updated m_maxSrs. Independently, the 100K element test manyPuts needs to be downgraded to 5K for our cluster, as our count is just over 10K, but actually 10K does not work as well (not sure why?)
…e in manyPuts -- the device does not support having too many messages in the send WR QP
…it is very complicated to fix these tests - they seem all over the place, not working, but commiting it
… (only the minimum) but still have many failing tests without explanation, and not tested at all properly
…in the execute command. Also, reduce some example message count as it does not work with IB Verbs with very large tests on the ARM machine
…ent logic if the bloody Gtest wants to just list the tests or run them
…. Also, using Setup and TearDown for entire test suite now, pretty neat, no code duplication each time
…d remove c99 requirement and turn them into C++ tests
… I need to make MPI engines in debug layer call MPI_Abort, and pthread engine in debug layer call std::abort
…y, while it works, this didn't solve the problem. Mpirun still is used with pthreads, so it changes the std::abort signal to 134. This is why now I changed the launcher. Still having issues with some hybrid tests though.
…ap script from pre-existing googletest messages
… which internally is non-portably converted to 134. This also simplifies the launcher script. Also fix some incorrect delete's for arrays in the collectives
…y, we get all tests at the moment via gtest_add_tests. It would be good to replace gtest_add_tests with gtest_discover_tests in the future though, because the current one takes 60-90 seconds to configure. Also, there is a horrible bug now where if I specify a high CMake version (e.g. the needed 3.29.0), the GoogleTests would simply not compile at all
KADichev and others added 26 commits October 23, 2024 11:46
… not expose lpf_debug in the API, but only internally expose lpf_debug_abort in each engine
…ter some debugging, it seems to happen in a free call with an error message about corrupted chunks. It seems like it happens at destruction of some threads. I believe me replacing abort with exit has lead to that. Based on online documentation, exit calls destructors, and abort does not. As a workaround, I now use quick_exit - I can still pass the exit code, but no destructors are called, and the erros seems to disappear
This MR makes all LPF functional tests use GoogleTest.

The main changes for CMake are:
- reorganize CMake to use GTest CMake package for compilation, and gtest_add_tests clause to define Gtests;
- remove all use of the run.sh script;
- make explicit the list of tests and do not use file(GLOB, ...) clauses to look for all files in test directories, as recommended by CMakesource files:

The main changes for source files are:
- every single test file needed to be modified to not include internal Test.h but gtest.h, and to use the assert/equal clauses of GoogleTest
- all death tests are slightly modified now to expect FAIL() after the expected fail condition
- some modernization of C to C++ was needed for a number of files, incl. use of new/delete instead of malloc/free, or string manipulation instead of C char * manipulation

A new Python wrapper script test_launcher.py is being employed now for more flexible checking of return codes, which is needed for death tests and normal tests alike. The expected return codes are still parsed from scanning the source files as before.

For the complete history, see GitHub PR #26 .

This MR also resolves bugs that have appeared on more modern cluster deployments compared to the previous main tag of LPF. This is hence a priority MR to base all further extensions and evolutions of LPF on. This MR was tested successfully for:
 - x86_64, CentOS Stream 9, GCC 11.5.0, MPICH 4.1.1 (caveat 2)
 - x86_64, CentOS Stream 9, GCC 11.5.0, OpenMPI 4.1.1 (caveats 1, 2, and 5)
 - x86_64, Fedora, GCC 9.3.1, MPICH 3.3.2 (caveat 2)
 - x86_64, Ubuntu 23.04, GCC 13.1.0, OpenMPI 4.1.6 (caveat 5)
 - x86_64, openEuler 20.03 LTS, GCC 7.3, MPICH 3.2.1
 - x86_64, Ubuntu 22.04 LTS, GCC 11.4.0, MPICH 4.3.0b1 (caveat 3, transient)
 - ARM, Ubuntu 22.04 LTS, GCC 11.4.0, MPICH 4.3.0b1 (caveats 3 & 4)

These are under different variations of calls to `bootstrap.sh` to test all the above-described changes.

The following caveats apply:
1. with OpenMPI and default arguments to its `mpirun`, oversubscription to larger process counts may be limited, causing some tests to fail. Issue #37 has been raised to pass the appropriate flags to the OpenMPI `mpirun` during testing;
2. `make install` post-install checks fail (as they should) when: 1) tests are disabled, 2) ibverbs dev libraries are present, but  3) no IB card is present. On the machines with IB we tested on, all post-install checks succeed.
3. The hybrid backend on MPICH 4.3.0b1 does not pass the debug layer tests due to a segfault in MPICH's MPI_Abort. Issue #41 has been raised to track this. This error presently only is consistently reproducible on our cluster's ARM nodes and transient on our cluster's x86_64 nodes. They have not been observed outside our cluster.
4. With MPICH 4.3.0b1, we furthermore hit issues #43 and #44. Also these issues are only reproducible on ARM nodes.
5. With OpenMPI 4.1.6 and OpenMPI 4.1.1 on x86_64 we hit issue #45
6. At present, we have no CI flow that can run all build variations relevant to this MR. Issues #40 has been raised to (partially) address as well as to track this issue.
…ingSyncPerSlot, which fails correctly to account for all received messages. The key is that a message can be received either via a put from a remote active process, or via a get from the local active process. This was uncovered during tests in the HiCR project. Other minor improvements: 1) The MemoryRegistration is now a separate class outside of IBVerbs, which helps in future efforts if we serialize its information. 2) Improve logging inside IBVerbs communication. 3) Follow the pattern to pass on a memslot_t in all methods declared in mesgqueue.hpp, and then convert the slot ID via getVerbID. I suspect this is important to avoid issues.
…s.c test is wrong, it calls lpf_put / lpf_get after lpf_register_global, not complying to spec. For some reason, only the zero engine exposes this issue. Second, a BSP test is used, which is untested with zero engine, and we disable it for zero engine.
@anyzelman
Copy link
Member Author

It is TBD whether this MR should be reviewed & applied after (or before) #55

anyzelman added a commit that referenced this pull request Feb 12, 2025
anyzelman added a commit that referenced this pull request Feb 12, 2025
anyzelman added a commit that referenced this pull request Feb 12, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants