RPC client-server does not work between macos and linux #616

Open
jspanchu opened this issue Aug 24, 2022 · 1 comment

Describe the bug
The Thallium hello-world RPC example doesn't work in a heterogeneous environment (macOS + Linux). See hello-world. I modified the source to use the 'sockets' provider instead of TCP. I am posting this here because the error messages come from Mercury and possibly libfabric.

Run the server on mac:

~/hello-thallium  $ ./server
Server running at address ofi+sockets://10.50.58.248:39517
# [80739.928023] mercury->addr: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/na/na_ofi.c:2431
 # na_ofi_addr_map_insert(): fi_av_insert() failed, inserted: 0
# [80739.928109] mercury->addr: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/na/na_ofi.c:2320
 # na_ofi_addr_key_lookup(): Could not insert new address
# [80739.928120] mercury->addr: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/na/na_ofi.c:4756
 # na_ofi_cq_process_recv_unexpected_event(): Could not lookup address
# [80739.928128] mercury->msg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/na/na_ofi.c:4680
 # na_ofi_cq_process_event(): Could not process unexpected recv event
# [80739.928156] mercury->hg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/mercury_core.c:3917
 # hg_core_progress_na(): Could not make progress on NA (NA_PROTOCOL_ERROR)
# [80739.928167] mercury->hg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/mercury_core.c:3809
 # hg_core_poll_wait(): hg_core_progress_na() failed
# [80739.928173] mercury->hg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/mercury_core.c:3708
 # hg_core_progress(): Could not make blocking progress on context
# [80739.928180] mercury->hg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/mercury_core.c:5077
 # HG_Core_progress(): Could not make progress
# [80739.928208] mercury->hg: [error] /var/folders/9p/hrppv1m97xs53_jcddyysq4x6t32tv/T/jaswant.panchumarti/spack-stage/spack-stage-mercury-master-nocsov6z3xrlrmlvlisfupujclikc2hu/spack-src/src/mercury.c:2074
 # HG_Progress(): Could not make progress on context (HG_PROTOCOL_ERROR)
[critical] unexpected return code (12: HG_PROTOCOL_ERROR) from HG_Progress()
Assertion failed: (0), function __margo_hg_progress_fn, file margo-core.c, line 1659.
zsh: abort      ./server

and the client on Linux:

$ ./client ofi+sockets://10.50.58.248:39517

I get the same output with the client on macOS and the server on Linux.

To Reproduce
Steps to reproduce the behavior:
On macOS, spack installs argobots@1.1 by default, which crashes the server with a segmentation fault, so install argobots@main on both Linux and macOS with this command.

$ spack install mochi-thallium@develop^libfabric fabrics=tcp,rxm,sockets ^argobots@main
$ spack load mochi-thallium@develop

Compile

  1. server.cpp
// c++ --std=c++14 -o server server.cpp `pkg-config --cflags --libs thallium`
#include <iostream>
#include <thallium.hpp>

namespace tl = thallium;

void hello(const tl::request& req) {
    std::cout << "Hello World!" << std::endl;
}

int main(int argc, char** argv) {
    HG_Set_log_level("debug");
    tl::engine myEngine("sockets", THALLIUM_SERVER_MODE);
    myEngine.define("hello", hello).disable_response();
    std::cout << "Server running at address " << myEngine.self() << std::endl;

    return 0;
}
  2. client.cpp
// c++ --std=c++14 -o client client.cpp `pkg-config --cflags --libs thallium`
#include <iostream>
#include <thallium.hpp>

namespace tl = thallium;

int main(int argc, char** argv) {

    if(argc != 2) {
        std::cerr << "Usage: " << argv[0] << " <address>" << std::endl;
        return 1; // non-zero exit status on a usage error
    }

    tl::engine myEngine("sockets", THALLIUM_CLIENT_MODE);
    // "hello" sends no response, so the call below does not wait for one
    tl::remote_procedure hello = myEngine.define("hello").disable_response();
    tl::endpoint server = myEngine.lookup(argv[1]);
    hello.on(server)();

    return 0;
}

Platforms:
MacOS: Monterey 12.5.1 on M1 with clang-13.1.6
Linux: Ubuntu 22.04 with GCC 11.2.0

Here's the output of spack spec mochi-thallium on each platform.

# macOS
$ spack spec mochi-thallium 
Input spec
--------------------------------
mochi-thallium

Concretized
--------------------------------
mochi-thallium@develop%apple-clang@13.1.6+cereal~ipo build_type=RelWithDebInfo arch=darwin-monterey-m1
    ^cereal@1.3.2%apple-clang@13.1.6~ipo build_type=RelWithDebInfo patches=2dfa0bf arch=darwin-monterey-m1
        ^cmake@3.23.3%apple-clang@13.1.6~doc+ncurses+ownlibs~qt build_type=Release arch=darwin-monterey-m1
            ^ncurses@6.2%apple-clang@13.1.6~symlinks+termlib abi=none arch=darwin-monterey-m1
                ^gnuconfig@2021-08-14%apple-clang@13.1.6 arch=darwin-monterey-m1
                ^pkgconf@1.8.0%apple-clang@13.1.6 arch=darwin-monterey-m1
            ^openssl@1.1.1q%apple-clang@13.1.6~docs~shared certs=mozilla patches=3fdcf2d arch=darwin-monterey-m1
                ^ca-certificates-mozilla@2022-07-19%apple-clang@13.1.6 arch=darwin-monterey-m1
                ^perl@5.34.1%apple-clang@13.1.6+cpanm+shared+threads arch=darwin-monterey-m1
                    ^berkeley-db@18.1.40%apple-clang@13.1.6+cxx~docs+stl patches=b231fcc arch=darwin-monterey-m1
                    ^bzip2@1.0.8%apple-clang@13.1.6~debug~pic+shared arch=darwin-monterey-m1
                        ^diffutils@3.8%apple-clang@13.1.6 arch=darwin-monterey-m1
                            ^libiconv@1.16%apple-clang@13.1.6 libs=shared,static arch=darwin-monterey-m1
                    ^gdbm@1.19%apple-clang@13.1.6 arch=darwin-monterey-m1
                        ^readline@8.1.2%apple-clang@13.1.6 arch=darwin-monterey-m1
                    ^zlib@1.2.12%apple-clang@13.1.6+optimize+pic+shared patches=0d38234 arch=darwin-monterey-m1
    ^mochi-margo@develop%apple-clang@13.1.6~debug~pvar arch=darwin-monterey-m1
        ^argobots@main%apple-clang@13.1.6~affinity~debug~lazy_stack_alloc+perf~stackunwind~tool~valgrind stackguard=none arch=darwin-monterey-m1
            ^autoconf@2.69%apple-clang@13.1.6 patches=35c4492,7793209,a49dd5b arch=darwin-monterey-m1
                ^m4@1.4.19%apple-clang@13.1.6+sigsegv patches=9dc5fbd,bfdffa7 arch=darwin-monterey-m1
                    ^libsigsegv@2.13%apple-clang@13.1.6 arch=darwin-monterey-m1
            ^automake@1.16.5%apple-clang@13.1.6 arch=darwin-monterey-m1
            ^libtool@2.4.7%apple-clang@13.1.6 arch=darwin-monterey-m1
        ^json-c@0.16%apple-clang@13.1.6~ipo build_type=RelWithDebInfo arch=darwin-monterey-m1
        ^mercury@master%apple-clang@13.1.6~bmi+boostsys+checksum~debug~hwloc~ipo~mpi+ofi~psm~psm2+shared+sm~ucx~udreg build_type=RelWithDebInfo arch=darwin-monterey-m1
            ^boost@1.79.0%apple-clang@13.1.6+atomic+chrono~clanglibcpp~container~context~contract~coroutine+date_time~debug+exception~fiber+filesystem+graph~graph_parallel~icu+iostreams~json+locale+log+math~mpi+multithreaded~nowide~numpy~pic+program_options~python+random+regex+serialization+shared+signals~singlethreaded~stacktrace+system~taggedlayout+test+thread+timer~type_erasure~versionedlayout+wave cxxstd=98 patches=a440f96 visibility=hidden arch=darwin-monterey-m1
            ^libfabric@1.15.1%apple-clang@13.1.6~debug~disable-spinlocks~kdreg fabrics=rxm,sockets,tcp arch=darwin-monterey-m1
# Linux
$ spack spec mochi-thallium
Input spec
--------------------------------
mochi-thallium

Concretized
--------------------------------
mochi-thallium@develop%gcc@11.2.0+cereal~ipo build_type=RelWithDebInfo arch=linux-ubuntu22.04-icelake
    ^cereal@1.3.2%gcc@11.2.0~ipo build_type=RelWithDebInfo patches=2dfa0bf arch=linux-ubuntu22.04-icelake
        ^cmake@3.23.2%gcc@11.2.0~doc+ncurses+ownlibs~qt build_type=Release arch=linux-ubuntu22.04-icelake
            ^ncurses@6.2%gcc@11.2.0~symlinks+termlib abi=none arch=linux-ubuntu22.04-icelake
                ^pkgconf@1.8.0%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
            ^openssl@1.1.1q%gcc@11.2.0~docs~shared certs=mozilla patches=3fdcf2d arch=linux-ubuntu22.04-icelake
                ^ca-certificates-mozilla@2022-03-29%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
                ^perl@5.34.1%gcc@11.2.0+cpanm+shared+threads arch=linux-ubuntu22.04-icelake
                    ^berkeley-db@18.1.40%gcc@11.2.0+cxx~docs+stl patches=b231fcc arch=linux-ubuntu22.04-icelake
                    ^bzip2@1.0.8%gcc@11.2.0~debug~pic+shared arch=linux-ubuntu22.04-icelake
                        ^diffutils@3.8%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
                            ^libiconv@1.16%gcc@11.2.0 libs=shared,static arch=linux-ubuntu22.04-icelake
                    ^gdbm@1.19%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
                        ^readline@8.1.2%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
                    ^zlib@1.2.12%gcc@11.2.0+optimize+pic+shared patches=0d38234 arch=linux-ubuntu22.04-icelake
    ^mochi-margo@develop%gcc@11.2.0~pvar arch=linux-ubuntu22.04-icelake
        ^argobots@main%gcc@11.2.0~affinity~debug~lazy_stack_alloc+perf~stackunwind~tool~valgrind stackguard=none arch=linux-ubuntu22.04-icelake
            ^autoconf@2.69%gcc@11.2.0 patches=35c4492,7793209,a49dd5b arch=linux-ubuntu22.04-icelake
                ^m4@1.4.19%gcc@11.2.0+sigsegv patches=9dc5fbd,bfdffa7 arch=linux-ubuntu22.04-icelake
                    ^libsigsegv@2.13%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
            ^automake@1.16.5%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
            ^libtool@2.4.7%gcc@11.2.0 arch=linux-ubuntu22.04-icelake
        ^json-c@0.15%gcc@11.2.0~ipo build_type=RelWithDebInfo arch=linux-ubuntu22.04-icelake
        ^mercury@master%gcc@11.2.0~bmi+boostsys+checksum~debug~hwloc~ipo~mpi+ofi~psm~psm2+shared+sm~ucx~udreg build_type=RelWithDebInfo arch=linux-ubuntu22.04-icelake
            ^boost@1.79.0%gcc@11.2.0+atomic+chrono~clanglibcpp~container~context~contract~coroutine+date_time~debug+exception~fiber+filesystem+graph~graph_parallel~icu+iostreams~json+locale+log+math~mpi+multithreaded~nowide~numpy~pic+program_options~python+random+regex+serialization+shared+signals~singlethreaded~stacktrace+system~taggedlayout+test+thread+timer~type_erasure~versionedlayout+wave cxxstd=98 patches=a440f96 visibility=hidden arch=linux-ubuntu22.04-icelake
            ^libfabric@1.15.1%gcc@11.2.0~debug~disable-spinlocks~kdreg fabrics=rxm,sockets,tcp arch=linux-ubuntu22.04-icelake

soumagne (Member) commented Dec 9, 2022

We should investigate what the right solution is for this now, since anything that uses OFI's sockets provider will be unsupported.

soumagne added this to the future milestone Dec 9, 2022