Noc changes on top of zero engine #50

KADichev · 2024-12-15T14:06:01Z

In comparison with zero engine branch:

add the latest from LPF master (the new Gtest functionality)
add No-coherence protocol

This MR makes all LPF functional tests use GoogleTest. The main changes for CMake are: - reorganize CMake to use GTest CMake package for compilation, and gtest_add_tests clause to define Gtests; - remove all use of the run.sh script; - make explicit the list of tests and do not use file(GLOB, ...) clauses to look for all files in test directories, as recommended by CMakesource files: The main changes for source files are: - every single test file needed to be modified to not include internal Test.h but gtest.h, and to use the assert/equal clauses of GoogleTest - all death tests are slightly modified now to expect FAIL() after the expected fail condition - some modernization of C to C++ was needed for a number of files, incl. use of new/delete instead of malloc/free, or string manipulation instead of C char * manipulation A new Python wrapper script test_launcher.py is being employed now for more flexible checking of return codes, which is needed for death tests and normal tests alike. The expected return codes are still parsed from scanning the source files as before. For the complete history, see GitHub PR #26 . This MR also resolves bugs that have appeared on more modern cluster deployments compared to the previous main tag of LPF. This is hence a priority MR to base all further extensions and evolutions of LPF on. This MR was tested successfully for: - x86_64, CentOS Stream 9, GCC 11.5.0, MPICH 4.1.1 (caveat 2) - x86_64, CentOS Stream 9, GCC 11.5.0, OpenMPI 4.1.1 (caveats 1, 2, and 5) - x86_64, Fedora, GCC 9.3.1, MPICH 3.3.2 (caveat 2) - x86_64, Ubuntu 23.04, GCC 13.1.0, OpenMPI 4.1.6 (caveat 5) - x86_64, openEuler 20.03 LTS, GCC 7.3, MPICH 3.2.1 - x86_64, Ubuntu 22.04 LTS, GCC 11.4.0, MPICH 4.3.0b1 (caveat 3, transient) - ARM, Ubuntu 22.04 LTS, GCC 11.4.0, MPICH 4.3.0b1 (caveats 3 & 4) These are under different variations of calls to `bootstrap.sh` to test all the above-described changes. The following caveats apply: 1. with OpenMPI and default arguments to its `mpirun`, oversubscription to larger process counts may be limited, causing some tests to fail. Issue #37 has been raised to pass the appropriate flags to the OpenMPI `mpirun` during testing; 2. `make install` post-install checks fail (as they should) when: 1) tests are disabled, 2) ibverbs dev libraries are present, but 3) no IB card is present. On the machines with IB we tested on, all post-install checks succeed. 3. The hybrid backend on MPICH 4.3.0b1 does not pass the debug layer tests due to a segfault in MPICH's MPI_Abort. Issue #41 has been raised to track this. This error presently only is consistently reproducible on our cluster's ARM nodes and transient on our cluster's x86_64 nodes. They have not been observed outside our cluster. 4. With MPICH 4.3.0b1, we furthermore hit issues #43 and #44. Also these issues are only reproducible on ARM nodes. 5. With OpenMPI 4.1.6 and OpenMPI 4.1.1 on x86_64 we hit issue #45 6. At present, we have no CI flow that can run all build variations relevant to this MR. Issues #40 has been raised to (partially) address as well as to track this issue.

…ing for the reason behind the SIGILL I am getting with the example

…bs-based example similar to ibverbs.t.cpp. Second, revert a change in ibverbs.hpp, which created a but -- I changed the order of declaration of attributes, which leads to bugs when using the constructor initializer list.

…hey are not equal

…rtunately, the implementation is currently not working for binary streams. Investigating still...

… to binary now

…ler warnings

…t improved the code quality. On the way, I resolved an error in the implementation of getting sent and received messages - with and without slot parameters. Still need to validate the NOC example

…eueNoc structure, because I don't need it so far. Also, disable debug layer for NOC, as it is currently not correctly validating (future work).

…on about the process ID owning the NOC slot was not set correctly (always the local PID), which is of course wrong for n > 1. It seems to work now.

…cal slots.

…of zero engine. Also, fix the lpf API test to 2 processes.

…ting files

…ingSyncPerSlot, which fails correctly to account for all received messages. The key is that a message can be received either via a put from a remote active process, or via a get from the local active process. This was uncovered during tests in the HiCR project. Other minor improvements: 1) The MemoryRegistration is now a separate class outside of IBVerbs, which helps in future efforts if we serialize its information. 2) Improve logging inside IBVerbs communication. 3) Follow the pattern to pass on a memslot_t in all methods declared in mesgqueue.hpp, and then convert the slot ID via getVerbID. I suspect this is important to avoid issues.

…s.c test is wrong, it calls lpf_put / lpf_get after lpf_register_global, not complying to spec. For some reason, only the zero engine exposes this issue. Second, a BSP test is used, which is untested with zero engine, and we disable it for zero engine.

…rograms. However, we have ported all tests to use Gtest. In order to decouple the need for Gtest from this simple test, we separate it from the existing test in tests in a simpler form

…ith size 0 are not needed. In particular, my template for addNoc is addLocal, which needs no such assertions.

KADichev and others added 29 commits November 29, 2024 17:06

start working on a NOC extension and example

9520f4d

Start using NOC notation in example

d1055f7

WIP

0aa7363

WIP, now it compiles, but logic incomplete

3c2c337

Use more inheritance instead of separate NOC methods, and fix a few bugs

69608f0

Revert the use of shared pointers. Nothing really changes, still look…

de50310

…ing for the reason behind the SIGILL I am getting with the example

Remove use of shared_ptr, suspecting SIGILL is due to that

b37b1eb

Fix string comparison, it seems I need to convert to std::string or t…

43a6544

…hey are not equal

First working version! ToDo: implement serializable objects

b63f5b0

Implement serialize/deserialize methods for memory region class. Unfo…

d2d6ce9

…rtunately, the implementation is currently not working for binary streams. Investigating still...

Working version using strings for serialize/deserialize. Will improve…

bfed254

… to binary now

Replace string and string stream with direct binary copying via memcpy

d121723

Implemented Noc within LPF layer (not only IBVerbs). Now fixing compi…

9d3a1d9

…ler warnings

A lot of refactoring treating every GCC warning as error. I think tha…

0e3a1c2

…t improved the code quality. On the way, I resolved an error in the implementation of getting sent and received messages - with and without slot parameters. Still need to validate the NOC example

Forgot to add the new LPF-based NOC functionality test. Remove mesgqu…

ac3161d

…eueNoc structure, because I don't need it so far. Also, disable debug layer for NOC, as it is currently not correctly validating (future work).

A fully working example now. Resolved a very nasty bug: The informati…

0ad1ec7

…on about the process ID owning the NOC slot was not set correctly (always the local PID), which is of course wrong for n > 1. It seems to work now.

Update test name, name it 'lpf_test_noc_ring'

bf68a0c

Refactoring towards registering NOC slots as NOC slots, and not as lo…

a9dadeb

…cal slots.

Slightly refactoring verbs API and lpf API test for NOC functionaliy …

bb65de3

…of zero engine. Also, fix the lpf API test to 2 processes.

rename serialize/deserialize to be named lpf_noc_*

52ea900

Merge branch 'master' into noc_extension

07ac180

Complete merging manually, LPF_HAS_ABORT was missing for some conflic…

7078fe6

…ting files

During the merge, forgot to re-introduce zero engine, back on

141e55b

A file from tests is used to validate we can compile LPF MPI simple p…

974f3df

…rograms. However, we have ported all tests to use Gtest. In order to decouple the need for Gtest from this simple test, we separate it from the existing test in tests in a simpler form

I think the assertions that a mem slot is not a null poiner and not w…

a455246

…ith size 0 are not needed. In particular, my template for addNoc is addLocal, which needs no such assertions.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Noc changes on top of zero engine #50

Noc changes on top of zero engine #50

Uh oh!

KADichev commented Dec 15, 2024

Uh oh!

Uh oh!

Noc changes on top of zero engine #50

Are you sure you want to change the base?

Noc changes on top of zero engine #50

Uh oh!

Conversation

KADichev commented Dec 15, 2024

Uh oh!

Uh oh!