AI models base pytorch run failed #948

Closed
sampleyang opened this issue Sep 30, 2022 · 21 comments

@sampleyang

sampleyang commented Sep 30, 2022

Description of the problem

I have a machine translation model that uses torch.distributed.init_process_group to fork several processes. It fails with an error when I run it on Gramine v1.3.

Debug info

The 2nd enclave fails to start. The trace output is as follows:

[P1:T1:python3] debug: allocating checkpoint store (size = 67108864, reserve = 33554432)
[P1:T1:python3] error: Internal memory fault at 0x00000000 (0x3fa02ae44, VMID = 1, TID = 1)
debug: PalProcessExit: Returning exit code 1
warning: PalVirtualMemoryProtect is unimplemented in Linux-SGX PAL
debug: Gramine was built from commit: 00e91a0
debug: Host: Linux-SGX
debug: LibOS xsave_enabled 1, xsave_size 0xa80(2688), xsave_features 0xe7
debug: Initial VMA region 0x3fa000000-0x3fa1a3000 (LibOS) bookkeeped
debug: Initial VMA region 0x3ffffc000-0x400000000 (manifest) bookkeeped
debug: ASLR top address adjusted to 0x136997000
debug: host is Linux-SGX and remote attestation type is 'dcap', adding SGX-specific /dev/attestation/ files: report, quote, etc.
debug: LibOS loaded at 0x3fa000000, ready to initialize
error: libos_init: failed to read the whole checkpoint header: -61
debug: PalProcessExit: Returning exit code 1
Run application failed: run cmd error, exit status 1

My template

loader.entrypoint = "file:{{ gramine.libos }}"
loader.log_level = "all"

loader.env.LD_LIBRARY_PATH = "/lib:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu"
loader.env.PATH = "/usr/bin"

loader.insecure__use_cmdline_argv = true

fs.root.type = "chroot"
fs.root.path = "/"
fs.root.uri = "file:/"

fs.mounts = [
{ path = "/lib", uri = "file:{{ gramine.runtimedir() }}" },
]

sgx.nonpie_binary = true
sgx.enclave_size = "16G"
sgx.thread_num = 512
sgx.remote_attestation = "dcap"

sgx.trusted_files = [
"file:{{ gramine.runtimedir() }}/",
]

sgx.allowed_files = [
"file:/",
]

@dimakuv
Contributor

dimakuv commented Sep 30, 2022

[P1:T1:python3] debug: allocating checkpoint store (size = 67108864, reserve = 33554432)

Based on this, looks like you have a huge checkpoint (that is prepared by the parent enclave and then sent to the child enclave). Have you tried increasing sgx.enclave_size even more, to 32G?

@boryspoplawski
Contributor

@dimakuv no, these are normal numbers. Try running fork_and_exec from libos/test/regression and you will get the same.
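
For reference, the two numbers in the quoted debug line are byte counts. The conversion below is only an illustrative sketch (the helper name `to_mib` is made up here); it shows that they correspond to 64 MiB and 32 MiB:

```python
def to_mib(n_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return n_bytes / (1024 * 1024)


# Values copied from the "allocating checkpoint store" debug line.
checkpoint_size = 67108864
checkpoint_reserve = 33554432

print(to_mib(checkpoint_size))     # 64.0
print(to_mib(checkpoint_reserve))  # 32.0
```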

@boryspoplawski
Contributor

@Simon-aikuier we will need more details on how to reproduce the problem. Alternatively you can try debugging it or at least bisect at which commit you start to see the issue (since I see you build from source anyway).

@sampleyang
Author

[P1:T1:python3] debug: allocating checkpoint store (size = 67108864, reserve = 33554432)

Based on this, looks like you have a huge checkpoint (that is prepared by the parent enclave and then sent to the child enclave). Have you tried increasing sgx.enclave_size even more, to 32G?

The same problem occurs with 32G as well.

@sampleyang
Author

sampleyang commented Oct 1, 2022

@Simon-aikuier we will need more details on how to reproduce the problem. Alternatively you can try debugging it or at least bisect at which commit you start to see the issue (since I see you build from source anyway).

commit: 00e91a0

I tried enabling trace logging, but only the parent enclave prints trace info; the child enclave does not (it prints only a little debug info).

I also tried Gramine v1.1. It gets past this point, but then hits another problem: the program hangs on a futex.
Gramine v1.1 log:
[image]
Hang log:
[image]

@boryspoplawski
Contributor

@Simon-aikuier the 2nd log is definitely neither from 00e91a0 nor v1.3.

@sampleyang
Author

sampleyang commented Oct 1, 2022

the 2nd log is definitely neither from 00e91a0 nor v1.3.

The 2nd log is from Gramine v1.1. I tried it on a different Gramine version.

@sampleyang
Author

sampleyang commented Oct 8, 2022

@Simon-aikuier we will need more details on how to reproduce the problem. Alternatively you can try debugging it or at least bisect at which commit you start to see the issue (since I see you build from source anyway).

@boryspoplawski
I tried Gramine v1.3.1. The internal memory fault no longer seems to occur, but a new error appears:

=> New Error:
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/google/protobuf/pyext/_message.cpython-38-x86_64-linux-gnu.so loaded at 0x12ff30000
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/termios.cpython-38-x86_64-linux-gnu.so loaded at 0x1690ce000
[P1:T1:python3] warning: libos_syscall_socket: unsupported socket domain 16
[P1:T1:python3] debug: Creating pipe: pipe.srv:e924eb0954ddf3874f12e6cf370c07918d2be989e154a5e2d63b1d0659b72771
[P1:T1:python3] warning: Unsupported system call clone3
[P1:T1:python3] debug: Creating pipe: pipe.srv:9ea56ac046f16173d0da37af637a436f6f6bad67f70288ae4cb51e8f7e34b529
[P1:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P1:T1:python3] warning: Non-private futexes are not supported, assuming implicit FUTEX_PRIVATE_FLAG
terminate called after throwing an instance of 'std::system_error'
what(): Unexpected poll revent on the control pipe's reading fd: 24: Software caused connection abort
[P1:T23:python3] debug: killed by signal 6
[P1:T1:python3] debug: Installed async event at 1665199760214444

=> Trace Info:
[P1:T1:python3] warning: libos_syscall_socket: unsupported socket domain 16
[P1:T1:python3] trace: ---- socket(NETLINK, SOCK_CLOEXEC|RAW, 0) = -97
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/etc/gai.conf", O_RDONLY|0x80000, 0000) = 0x3
[P1:T1:python3] trace: ---- newfstatat(3, "", 0x8480dca0, 4096) = 0x0
[P1:T1:python3] trace: ---- newfstatat(3, "", 0x8480daa0, 4096) = 0x0
[P1:T1:python3] trace: ---- read(3, 0x6f038b50, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0xa18
[P1:T1:python3] trace: ---- read(3, 0x6f038b50, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x0
[P1:T1:python3] trace: ---- close(3) = 0x0
[P1:T1:python3] trace: ---- futex(0xabddce84, FUTEX_PRIVATE|FUTEX_WAKE, 2147483647, 0, 0xf59048, 7) ...
[P1:T1:python3] trace: ---- return from futex(...) = 0x0
[P1:T1:python3] trace: ---- socket(INET6, SOCK_CLOEXEC|DGRAM, 0) = 0x3
[P1:T1:python3] trace: ---- connect(3, {family=IPv6,ip=[0:0:0:0:0:0:0:0],port=40379}, 28) ...
[P1:T1:python3] trace: ---- return from connect(...) = 0x0
[P1:T1:python3] trace: ---- getsockname(3, 0x8480dd58, 0x8480de70) = 0x0
[P1:T1:python3] trace: ---- connect(3, UNKNOWN, 16) ...
[P1:T1:python3] trace: ---- return from connect(...) = 0x0
[P1:T1:python3] trace: ---- connect(3, {family=IPv4,ip=0.0.0.0,port=40379}, 16) ...
[P1:T1:python3] trace: ---- return from connect(...) = -22
[P1:T1:python3] trace: ---- close(3) = 0x0
[P1:T1:python3] trace: ---- socket(INET6, STREAM, 6) = 0x3
[P1:T1:python3] trace: ---- setsockopt(3, 1, 2, 0x8480e420, 4) = 0x0
[P1:T1:python3] trace: ---- bind(3, {family=IPv6,ip=[0:0:0:0:0:0:0:0],port=40379}, 28) = 0x0
[P1:T1:python3] trace: ---- listen(3, 2048) = 0x0
[P1:T1:python3] trace: ---- getsockname(3, 0x8480e420, 0x8480e3c4) = 0x0
[P1:T1:python3] debug: Creating pipe: pipe.srv:91b61487213987a6ecae039250272dcd7b083e7fac834042be0d1b239511c7f7
[P1:T1:python3] trace: ---- pipe2(0x6f1395e0, 0) = 0x0
[P1:T1:python3] trace: ---- mmap(0, 0x401000, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0x0) ...
[P1:T1:python3] trace: ---- return from mmap(...) = 0x6dc4e000
[P1:T1:python3] trace: ---- mprotect(0x6dc4f000, 0x400000, PROT_READ|PROT_WRITE) ...
[P1:T1:python3] trace: ---- return from mprotect(...) = 0x0
[P1:T1:python3] trace: ---- rt_sigprocmask(BLOCK, [SIGHUP,SIGINT,SIGQUIT,SIGILL,SIGTRAP,SIGABRT,SIGBUS,SIGFPE,SIGKILL,SIGUSR1,SIGSEGV,SIGUSR2,SIGPIPE,SIGALRM,SIGTERM,SIGSTKFLT,SIGCHLD,SIGCONT,SIGSTOP,SIGTSTP,SIGTTIN,SIGTTOU,SIGURG,SIGXCPU,SIGXFSZ,SIGVTALRM,SIGPROF,SIGWINCH,SIGIO
[P1:T1:python3] trace: ,SIGPWR,SIGSYS,SIG32,SIG33,SIG34,SIG35,SIG36,SIG37,SIG38,SIG39,SIG40,SIG41,SIG42,SIG43,SIG44,SIG45,SIG46,SIG47,SIG48,SIG49,SIG50,SIG51,SIG52,SIG53,SIG54,SIG55,SIG56,SIG57,SIG58,SIG59,SIG60,SIG61,SIG62,SIG63,SIG64,], [], 0x8) = 0x0
[P1:T1:python3] warning: Unsupported system call clone3
[P1:T1:python3] trace: ---- clone(CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, 0x6e04def0, 0x6e04e910, 0x6e04e910, 0x6e04e640) ...
[P1:T1:python3] debug: Creating pipe: pipe.srv:e5efe11f5bd73f6a0d34c8e5ed5a14ae4b18d0ab54939d9f3404627f4015b491
[P1:T1:python3] trace: ---- return from clone(...) = 0x17
[P1:T23:python3] trace: ---- set_robust_list(0x6e04e920, 0x18) = 0x0
[P1:T1:python3] trace: ---- rt_sigprocmask(SETMASK, [], NULL, 0x8) = 0x0
[P1:T23:python3] trace: ---- rt_sigprocmask(SETMASK, [], NULL, 0x8) = 0x0
[P1:T23:python3] trace: ---- mmap(0, 0x8000000, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x0) ...
[P1:T1:python3] trace: ---- newfstatat(AT_FDCWD, "/etc/nsswitch.conf", 0x8480d930, 0) = 0x0
[P1:T1:python3] trace: ---- newfstatat(AT_FDCWD, "/etc/resolv.conf", 0x8480da70, 0) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/etc/hosts", O_RDONLY|0x80000, 0000) = 0x6
[P1:T1:python3] trace: ---- newfstatat(6, "", 0x8480d900, 4096) = 0x0
[P1:T1:python3] trace: ---- lseek(6, 0x0, SEEK_SET) = 0x0
[P1:T1:python3] trace: ---- read(6, 0x6f038b50, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x24e
[P1:T1:python3] trace: ---- read(6, 0x6f038b50, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x0
[P1:T1:python3] trace: ---- close(6) = 0x0
[P1:T1:python3] trace: ---- clock_gettime(0, 0x8480e3a0) = 0x0
[P1:T1:python3] trace: ---- socket(INET, STREAM, 6) = 0x6
[P1:T1:python3] trace: ---- fcntl(6, F_SETFL, 0x800) = 0x0
[P1:T1:python3] trace: ---- connect(6, {family=IPv4,ip=127.0.0.1,port=40379}, 16) ...
[P1:T1:python3] trace: ---- return from connect(...) = 0x0
[P1:T1:python3] trace: ---- clock_gettime(0, 0x8480e3a0) = 0x0
[P1:T1:python3] trace: ---- poll(0x8480e428, 1, 1800000) ...
[P1:T1:python3] trace: ---- return from poll(...) = 0x1
[P1:T1:python3] trace: ---- getsockopt(6, 1, 4, 0xabaaab00, 0x8480e3f4) = 0x0
[P1:T1:python3] trace: ---- fcntl(6, F_GETFL, 0x4) = 0x802
[P1:T1:python3] trace: ---- fcntl(6, F_SETFL, 0x2) = 0x0
[P1:T1:python3] trace: ---- setsockopt(6, 6, 1, 0x8480e394, 4) = 0x0
[P1:T1:python3] trace: ---- sendto(6, 0x8480e3f8, 0x1, 0, 0, 0) = 0x1
[P1:T1:python3] trace: ---- sendto(6, 0x8480e3f8, 0x8, 32768, 0, 0) = -95
[P1:T1:python3] trace: ---- futex(0xab28b1e0, FUTEX_PRIVATE|FUTEX_WAKE, 2147483647, 0, 0x1, 202) ...
[P1:T1:python3] trace: ---- return from futex(...) = 0x0
[P1:T1:python3] trace: ---- close(5) = 0x0
[P1:T1:python3] trace: ---- futex(0x6e04e910, FUTEX_CLOCK_REALTIME|FUTEX_WAIT_BITSET, 23, 0, 0, -1) ...
[P1:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P1:T1:python3] warning: Non-private futexes are not supported, assuming implicit FUTEX_PRIVATE_FLAG
[P1:T23:python3] trace: ---- return from mmap(...) = 0x65c4e000
[P1:T23:python3] trace: ---- munmap(0x65c4e000, 0x23b2000) ...
[P1:T23:python3] trace: ---- return from munmap(...) = 0x0
[P1:T23:python3] trace: ---- munmap(0x6c000000, 0x1c4e000) ...
[P1:T23:python3] trace: ---- return from munmap(...) = 0x0
[P1:T23:python3] trace: ---- mprotect(0x68000000, 0x21000, PROT_READ|PROT_WRITE) ...
[P1:T23:python3] trace: ---- return from mprotect(...) = 0x0
[P1:T23:python3] trace: ---- poll(0x68000b90, 2, -1) ...
[P1:T23:python3] trace: ---- return from poll(...) = 0x2
[P1:T23:python3] trace: ---- poll(0x68000b70, 1, -1) ...
[P1:T23:python3] trace: ---- return from poll(...) = 0x1
[P1:T23:python3] trace: ---- accept(3, 0, 0) ...
[P1:T23:python3] trace: ---- return from accept(...) = 0x5
[P1:T23:python3] trace: ---- getpeername(5, 0x6e04dce0, 0x6e04dcbc) = 0x0
[P1:T23:python3] trace: ---- setsockopt(5, 6, 1, 0x6e04dc74, 4) = 0x0
[P1:T23:python3] trace: ---- write(2, 0xaa7a57d8, 0x30) ...
terminate called after throwing an instance of '[P1:T23:python3] trace: ---- return from write(...) = 0x30
[P1:T23:python3] trace: ---- write(2, 0x68000bf0, 0x11) ...
std::system_error[P1:T23:python3] trace: ---- return from write(...) = 0x11
[P1:T23:python3] trace: ---- write(2, 0xaa7a57c4, 0x2) ...
'
[P1:T23:python3] trace: ---- return from write(...) = 0x2
[P1:T23:python3] trace: ---- write(2, 0xaa7a57c7, 0xb) ...
what(): [P1:T23:python3] trace: ---- return from write(...) = 0xb
[P1:T23:python3] trace: ---- write(2, 0x68000dc8, 0x5d) ...
Unexpected poll revent on the control pipe's reading fd: 24: Software caused connection abort[P1:T23:python3] trace: ---- return from write(...) = 0x5d
[P1:T23:python3] trace: ---- write(2, 0xabdd5723, 0x1) ...

[P1:T23:python3] trace: ---- return from write(...) = 0x1
[P1:T23:python3] trace: ---- rt_sigprocmask(UNBLOCK, [SIGABRT,], NULL, 0x8) = 0x0
[P1:T23:python3] trace: ---- gettid() = 0x17
[P1:T23:python3] trace: ---- getpid() = 0x1
[P1:T23:python3] trace: ---- tgkill(1, 23, [SIGABRT]) = 0x0
[P1:T23:python3] debug: killed by signal 6
[P1:T1:python3] trace: ---- return from futex(...) = -512
[P1:T8:python3] trace: ---- return from futex(...) = -512
[P1:T9:python3] trace: ---- return from futex(...) = -512
[P1:T10:python3] trace: ---- return from futex(...) = -512
[P1:T11:python3] trace: ---- return from futex(...) = -512
[P1:T12:python3] trace: ---- return from futex(...) = -512
[P1:T13:python3] trace: ---- return from futex(...) = -512
[P1:T14:python3] trace: ---- return from futex(...) = -512
[P1:T15:python3] trace: ---- return from futex(...) = -512
[P1:T16:python3] trace: ---- return from futex(...) = -512
[P1:T17:python3] trace: ---- return from futex(...) = -512
[P1:T18:python3] trace: ---- return from futex(...) = -512
[P1:T19:python3] trace: ---- return from futex(...) = -512
[P1:T20:python3] trace: ---- return from futex(...) = -512
[P1:T21:python3] trace: ---- return from futex(...) = -512
[P1:T20:python3] debug: Installed async event at 1665197427656149
[P1:T22:python3] trace: ---- return from futex(...) = -512
[P1:T21:python3] debug: Installed async event at 1665197427664681
[P1:T1:python3] debug: Installed async event at 1665197427666847
[P1:T8:python3] debug: Installed async event at 1665197427673831
[P1:T9:python3] debug: Installed async event at 1665197427682326
[P1:T22:python3] debug: Installed async event at 1665197427683469
[P1:T10:python3] debug: Installed async event at 1665197427685189
[P1:T11:python3] debug: Installed async event at 1665197427689259
[P1:T12:python3] debug: Installed async event at 1665197427692952
[P1:T13:python3] debug: Installed async event at 1665197427696282

@llly
Contributor

llly commented Oct 10, 2022

[P1:T1:python3] trace: ---- sendto(6, 0x8480e3f8, 0x8, 32768, 0, 0) = -95

I'll try to fix this issue in #936

@sampleyang
Author

sampleyang commented Oct 10, 2022

[P1:T1:python3] trace: ---- sendto(6, 0x8480e3f8, 0x8, 32768, 0, 0) = -95

I'll try to fix this issue in #936

@llly thanks.
I am not sure whether they are the same problem. I just see some abnormal output, such as:

[P1:T1:python3] warning: libos_syscall_socket: unsupported socket domain 16
[P1:T1:python3] trace: ---- socket(NETLINK, SOCK_CLOEXEC|RAW, 0) = -97

[P1:T1:python3] trace: ---- sendto(6, 0x8480e3f8, 0x8, 32768, 0, 0) = -95
[P1:T1:python3] warning: Non-private futexes are not supported, assuming implicit FUTEX_PRIVATE_FLAG

terminate called after throwing an instance of 'std::system_error'
what(): Unexpected poll revent on the control pipe's reading fd: 24: Software caused connection abort
[P1:T23:python3] trace: ---- tgkill(1, 23, [SIGABRT]) = 0x0
[P1:T23:python3] debug: killed by signal 6

It also seems that my model app never reaches its main function; it is killed while python3 is still loading (LibOS is loading the related libraries). Hope that helps.
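
The negative return values in the trace lines above are negated Linux errno codes. As a rough aid for reading them, here is a minimal sketch; the table is hand-copied from the Linux x86-64 errno values that show up in this thread, and `decode_ret` is a hypothetical helper, not part of Gramine:

```python
# Linux errno values that appear in the traces in this thread (x86-64).
LINUX_ERRNO = {
    2: "ENOENT",
    13: "EACCES",
    22: "EINVAL",
    38: "ENOSYS",
    95: "EOPNOTSUPP",
    97: "EAFNOSUPPORT",
    512: "ERESTARTSYS",  # kernel-internal restart code, not a userspace errno
}


def decode_ret(ret: int) -> str:
    """Map a traced syscall return value to a readable description."""
    if ret >= 0:
        return "success (%#x)" % ret
    return LINUX_ERRNO.get(-ret, "unknown errno %d" % -ret)


print(decode_ret(-97))  # EAFNOSUPPORT
print(decode_ret(-95))  # EOPNOTSUPP
```

Read this way, `socket(NETLINK, ...) = -97` is "address family not supported" and `sendto(...) = -95` is "operation not supported".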

@boryspoplawski
Contributor

[P1:T1:python3] trace: ---- socket(NETLINK, SOCK_CLOEXEC|RAW, 0) = -97

Gramine does not support netlink sockets, so if it's a hard requirement of your app, then it won't work. But I don't see why PyTorch would need those.

Unexpected poll revent on the control pipe's reading fd: 24

This error is weird, but it might be the original issue here. Unfortunately, it's hard to say anything more without any details. You can try debugging it yourself (gdb would be handy); for that I would recommend a debug build and trying gramine-direct first.
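
One quick way to check whether a socket family is usable in a given environment is to try creating such a socket and inspect the errno. The following is a rough sketch, not Gramine tooling: `probe_socket_family` is a hypothetical helper, and the fallback value 16 for `AF_NETLINK` is the Linux constant that the "unsupported socket domain 16" warning in the logs refers to.

```python
import errno
import socket

# Assumption: fall back to the Linux value if the module lacks the constant.
AF_NETLINK = getattr(socket, "AF_NETLINK", 16)


def probe_socket_family(family: int, sock_type: int = socket.SOCK_RAW):
    """Return None if socket() succeeds, otherwise the errno of the failure."""
    try:
        sock = socket.socket(family, sock_type, 0)
    except OSError as exc:
        return exc.errno
    sock.close()
    return None


result = probe_socket_family(AF_NETLINK)
if result is None:
    print("AF_NETLINK socket created")
else:
    print("AF_NETLINK failed:", errno.errorcode.get(result, result))
```

Judging by the trace in this thread, running this inside the affected Gramine versions would be expected to report errno 97 (EAFNOSUPPORT), while on a regular Linux host the socket is normally created.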

@llly
Contributor

llly commented Oct 11, 2022

But I don't see why pytorch would need those.

PyTorch uses netlink to get the local IP address. It's not a hard requirement.

Unexpected poll revent on the control pipe's reading fd: 24

It's caused by sendto(MSG_MORE), I have investigated this failure.

@dimakuv
Contributor

dimakuv commented Oct 11, 2022

@llly So your #966 fixes this issue as well? If yes, can you add Fixes #948 to the PR description?

@boryspoplawski
Contributor

Unexpected poll revent on the control pipe's reading fd: 24

It's caused by sendto(MSG_MORE), I have investigated this failure.

Could you elaborate more? [P1:T1:python3] trace: ---- sendto(6, 0x8480e3f8, 0x8, 32768, 0, 0) = -95 is caused by that, but I don't see how that's related to poll on a different fd.
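
To make the flag value concrete: the `32768` in that trace line is `0x8000`, which is `MSG_MORE` on Linux. Below is a minimal sketch that decodes a `sendto` flags integer using hand-copied Linux constants (the `decode_msg_flags` helper is hypothetical and covers only a few common flags):

```python
# A few Linux MSG_* flag values, copied by hand for illustration.
MSG_FLAGS = {
    0x1: "MSG_OOB",
    0x2: "MSG_PEEK",
    0x4: "MSG_DONTROUTE",
    0x40: "MSG_DONTWAIT",
    0x4000: "MSG_NOSIGNAL",
    0x8000: "MSG_MORE",
}


def decode_msg_flags(flags: int) -> list:
    """Return the names of the known flag bits set in `flags`."""
    names = [name for bit, name in sorted(MSG_FLAGS.items()) if flags & bit]
    known = sum(bit for bit in MSG_FLAGS if flags & bit)
    leftover = flags & ~known
    if leftover:
        names.append(hex(leftover))  # bits not in the table above
    return names


print(decode_msg_flags(32768))  # ['MSG_MORE']
print(decode_msg_flags(0))      # []
```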

@sampleyang
Author

sampleyang commented Oct 12, 2022

But I don't see why pytorch would need those.

PyTorch uses netlink to get the local IP address. It's not a hard requirement.

Unexpected poll revent on the control pipe's reading fd: 24

It's caused by sendto(MSG_MORE), I have investigated this failure.

@llly @boryspoplawski
I just tried #966, and the sendto error is resolved. The AF_NETLINK error still exists. Gramine can now reach my app's main function, but a new error about the socket address family occurs.

[P1:T1:python3] warning: Unsupported system call clone3
[P1:T1:python3] debug: Creating pipe: pipe.srv:7eebbace09c80cace771ad22fbb4133d01ecbd9318a895770ede199bd20eb16d
[P1:T1:python3] warning: libos_syscall_socket: unsupported socket domain 16
[P1:T1:python3] warning: [ai-debug] cmd = [1]

[P1:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P1:T1:python3] warning: Non-private futexes are not supported, assuming implicit FUTEX_PRIVATE_FLAG
[P1:T1:python3] warning: [ai-debug] cmd = [9]

[P1:T24:python3] debug: ---- exit (returning 0)
[P1:T24:python3] debug: Installed async event at 1665575334066389
[P1:libos] debug: Thread exited, cleaning up
[P1:T1:python3] warning: [ai-debug] ret_tmp = [0]

=>[thumt-debug] call main
=>[thumt-debug] call cli_main
=>[thumt-debug] load configs

=>[thumt-debug] init_method = tcp://localhost:54485, local_rank = 0

[P1:T1:python3] debug: glibc register library /usr/lib/python3/dist-packages/apt_pkg.cpython-38-x86_64-linux-gnu.so loaded at 0x177902000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libapt-pkg.so.6.0 loaded at 0x177725000
[P1:T1:python3] debug: glibc register library /lib/libresolv.so.2 loaded at 0x18941c000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/liblz4.so.1 loaded at 0x177704000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libzstd.so.1 loaded at 0x17765b000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libudev.so.1 loaded at 0x17762e000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libsystemd.so.0 loaded at 0x17757f000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libgcrypt.so.20 loaded at 0x177461000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libgpg-error.so.0 loaded at 0x17743e000
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/_decimal.cpython-38-x86_64-linux-gnu.so loaded at 0x1772d1000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libmpdec.so.2 loaded at 0x177299000
[P1:T1:python3] debug: glibc register library /usr/lib/python3/dist-packages/simplejson/_speedups.cpython-38-x86_64-linux-gnu.so loaded at 0x19dcf3000
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/_multibytecodec.cpython-38-x86_64-linux-gnu.so loaded at 0x19dc8e000
Traceback (most recent call last):
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 356, in <module>
    cli_main()
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 351, in cli_main
    process_fn(0, parsed_args)
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 325, in process_fn
    main(local_args)
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 169, in main
    dist.init_process_group("gloo",
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 503, in init_process_group
    _update_default_pg(_new_process_group_helper(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 588, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/transport/tcp/device.cc:179] rv != -1. -1 vs -1. Address family not supported by protocol
[P1:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P1:T1:python3] warning: Non-private futexes are not supported, assuming implicit FUTEX_PRIVATE_FLAG
[P1:T1:python3] warning: [ai-debug] cmd = [9]

terminate called after throwing an instance of 'std::system_error'
  what():  Unexpected poll revent on the control pipe's reading fd: 24: Software caused connection abort
[P1:T23:python3] debug: killed by signal 6

The error in this Python traceback is also errno 97 (EAFNOSUPPORT). The relevant PyTorch (Gloo) code is as follows:

gloo/gloo/transport/tcp/device.cc

const std::string sockaddrToInterfaceName(const struct attr& attr) {
  struct ifaddrs* ifap;
  std::string iface;
  auto rv = getifaddrs(&ifap);
  GLOO_ENFORCE_NE(rv, -1, strerror(errno));
  auto addrIsLocalhost = isLocalhostAddr((struct sockaddr*)&attr.ai_addr);

gloo/gloo/transport/tcp/device.cc

static void lookupAddrForIface(struct attr& attr) {
  struct ifaddrs* ifap;
  auto rv = getifaddrs(&ifap);
  GLOO_ENFORCE_NE(rv, -1, strerror(errno));

I tested the "getifaddrs" function with and without Gramine; it looks like neither AF_NETLINK nor AF_PACKET is supported in Gramine.

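#include <errno.h>
#include <ifaddrs.h>
#include <stdio.h>
#include <sys/socket.h>

int main(int argc, char** argv)
{
  struct ifaddrs *addrs, *ent;

  if (getifaddrs(&addrs))
  {
      /* Save errno: printf() may overwrite it before perror() runs */
      int saved_errno = errno;
      printf("errno = %d\n", saved_errno);
      errno = saved_errno;
      perror("getifaddrs()");
      return 1;
  }

  /* Print the address family of every interface entry */
  for (ent = addrs; ent; ent = ent->ifa_next)
  {
    /* ifa_addr may be NULL for an interface without an address */
    if (ent->ifa_addr)
      printf("\"%s\" af_family = %d\n", ent->ifa_name, ent->ifa_addr->sa_family);
  }
  freeifaddrs(addrs);
  return 0;
}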

"lo" af_family = 17     // AF_PACKET
"eth0" af_family = 17
"docker0" af_family = 17
"lo" af_family = 2    // AF_INET
"eth0" af_family = 2
"docker0" af_family = 2
"lo" af_family = 10    // AF_INET6
"eth0" af_family = 10
"docker0" af_family = 10

Running the above code in Gramine reproduces the same error:

[P1:T1:if] warning: libos_syscall_socket: unsupported socket domain 16
[P1:T1:if] trace: ---- socket(NETLINK, SOCK_CLOEXEC|RAW, 0) = -97         // NETLINK Error
[P1:T1:if] trace: ---- newfstatat(1, "", 0xdf7e490, 4096) = 0x0
[P1:T1:if] trace: ---- ioctl(1, TCGETS, 0xdf7e400) ...
[P1:T1:if] trace: ---- return from ioctl(...) = -38
[P1:T1:if] trace: ---- getrandom(0x19eda198, 0x8, GRND_NONBLOCK) = 0x8
[P1:T1:if] trace: ---- brk(0) = 0x1b1c9000
[P1:T1:if] trace: ---- brk(0x1b1ea000) = 0x1b1ea000
[P1:T1:if] trace: ---- dup(2) = 0x3
[P1:T1:if] trace: ---- fcntl(3, F_GETFL, 0x19e9459c) = 0x401
[P1:T1:if] trace: ---- close(3) = 0x0
[P1:T1:if] trace: ---- write(2, 0xdf7c0d0, 0x37) ...
getifaddrs(): Address family not supported by protocol
[P1:T1:if] trace: ---- return from write(...) = 0x37
[P1:T1:if] trace: ---- write(1, 0x1b1c92a0, 0xb) ...
errno = 97
[P1:T1:if] trace: ---- return from write(...) = 0xb
[P1:T1:if] debug: ---- exit_group (returning 1)
[P1:T1:if] debug: clearing POSIX locks for pid 1
[P1:T1:if] debug: sync client shutdown: closing handles
[P1:T1:if] debug: sync client shutdown: waiting for confirmation
[P1:T1:if] debug: sync client shutdown: finished
[P1:libos] debug: Async worker thread terminated
[P1:libos] debug: IPC worker: exiting worker thread
[P1:T1:if] debug: process 1 exited with status 1
debug: PalProcessExit: Returning exit code 1
Run application failed: run cmd error, exit status 1

My application trace info:

[P1:T1:python3] trace: ---- socket(NETLINK, SOCK_CLOEXEC|RAW, 0) = -97
[P1:T1:python3] trace: ---- futex(0xcf6ab1e0, FUTEX_PRIVATE|FUTEX_WAKE, 2147483647, 0, 0x1, 202) ...
[P1:T1:python3] warning: [ai-debug] cmd = [1]

[P1:T1:python3] trace: ---- return from futex(...) = 0x0
[P1:T1:python3] trace: ---- write(1, 0xcfdfacb0, 0x9b) ...
=>[thumt-debug] call main
=>[thumt-debug] call cli_main
=>[thumt-debug] load configs

=>[thumt-debug] init_method = tcp://localhost:54071, local_rank = 0
......
[P1:T1:python3] trace: ---- stat("/etc/apt/apt.conf", 0xa33f7520) = -2
[P1:T1:python3] trace: ---- stat("/var/lib/dpkg/status", 0xa33f7500) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/bin/dpkg", 0xa33f7500) = 0x0
[P1:T1:python3] trace: ---- stat("/etc/debian_version", 0xa33f7500) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/share/dpkg/cputable", O_RDONLY, 0000) = 0x8
[P1:T1:python3] trace: ---- read(8, 0x93126f50, 0x1fff) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x7b3
[P1:T1:python3] trace: ---- read(8, 0x93126f50, 0x1fff) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/share/dpkg/tupletable", 0xa33f6f00) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/share/dpkg/tupletable", O_RDONLY, 0000) = 0x9
[P1:T1:python3] trace: ---- read(9, 0x93129770, 0x1fff) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x8c9
[P1:T1:python3] trace: ---- read(9, 0x93129770, 0x1fff) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x0
[P1:T1:python3] trace: ---- close(9) = 0x0
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- stat("/root/ailab/THUMT/thumt/bin", 0xa33f7260) = 0x0
[P1:T1:python3] trace: ---- stat("/root/ailab/THUMT", 0xa33f7260) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3.8", 0xa33f7260) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3.8/lib-dynload", 0xa33f7260) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/local/lib/python3.8/dist-packages", 0xa33f7260) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3/dist-packages", 0xa33f7260) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3/dist-packages/apport", 0xa33f7260) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3/dist-packages/apport/packaging.py", 0xa33f6f40) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3/dist-packages/apport/packaging.py", 0xa33f7740) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/lib/python3/dist-packages/apport/__pycache__/packaging.cpython-38.pyc", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33f7490) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33f74f0) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fstat(8, 0xa33f7840) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x931283e0, 0x2dcf) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2dce
[P1:T1:python3] trace: ---- read(8, 0x9312b1ae, 0x1) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x0
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3/dist-packages/apport", 0xa33f8ea0) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3/dist-packages/apport/hookutils.py", 0xa33f8b80) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3/dist-packages/apport/hookutils.py", 0xa33f9380) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/lib/python3/dist-packages/apport/__pycache__/hookutils.cpython-38.pyc", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33f90d0) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33f9130) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fstat(8, 0xa33f9480) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x93142d40, 0x6df2) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x6df1
[P1:T1:python3] trace: ---- read(8, 0x93149b31, 0x1) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x0
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- getcwd(0x932b10c0, 0x400) = 0x2
[P1:T1:python3] trace: ---- lstat("/root", 0xa33fc0b0) = 0x0
[P1:T1:python3] trace: ---- lstat("/root/ailab", 0xa33fc0b0) = 0x0
[P1:T1:python3] trace: ---- lstat("/root/ailab/THUMT", 0xa33fc0b0) = 0x0
[P1:T1:python3] trace: ---- lstat("/root/ailab/THUMT/thumt", 0xa33fc0b0) = 0x0
[P1:T1:python3] trace: ---- lstat("/root/ailab/THUMT/thumt/bin", 0xa33fc0b0) = 0x0
[P1:T1:python3] trace: ---- lstat("/root/ailab/THUMT/thumt/bin/translator.py", 0xa33fc0b0) = 0x0
[P1:T1:python3] trace: ---- access("/root/ailab/THUMT/thumt/bin/translator.py", F_OK|R_OK) = -13
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/root/ailab/THUMT/thumt/bin/translator.py", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33faf90) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33faff0) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fcntl(8, OP 1030, 0) = 0x9
[P1:T1:python3] trace: ---- fcntl(9, F_GETFL, 0x802001) = 0x80000
[P1:T1:python3] trace: ---- newfstatat(9, "", 0xa33fa9e0, 4096) = 0x0
[P1:T1:python3] trace: ---- read(9, 0x93146530, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x1000
[P1:T1:python3] trace: ---- close(9) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_SET) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- read(8, 0x93145d80, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x907
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/root/ailab/THUMT/thumt/bin/translator.py", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33faf90) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33faff0) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fcntl(8, OP 1030, 0) = 0x9
[P1:T1:python3] trace: ---- fcntl(9, F_GETFL, 0x802001) = 0x80000
[P1:T1:python3] trace: ---- newfstatat(9, "", 0xa33fa9e0, 4096) = 0x0
[P1:T1:python3] trace: ---- read(9, 0x93146530, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x1000
[P1:T1:python3] trace: ---- close(9) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_SET) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- read(8, 0x93145d80, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x907
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/root/ailab/THUMT/thumt/bin/translator.py", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33faf90) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33faff0) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fcntl(8, OP 1030, 0) = 0x9
[P1:T1:python3] trace: ---- fcntl(9, F_GETFL, 0x802001) = 0x80000
[P1:T1:python3] trace: ---- newfstatat(9, "", 0xa33fa9e0, 4096) = 0x0
[P1:T1:python3] trace: ---- read(9, 0x93146530, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x1000
[P1:T1:python3] trace: ---- close(9) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_SET) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- read(8, 0x93145d80, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x907
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/root/ailab/THUMT/thumt/bin/translator.py", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33faf90) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33faff0) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fcntl(8, OP 1030, 0) = 0x9
[P1:T1:python3] trace: ---- fcntl(9, F_GETFL, 0x802001) = 0x80000
[P1:T1:python3] trace: ---- newfstatat(9, "", 0xa33fa9e0, 4096) = 0x0
[P1:T1:python3] trace: ---- read(9, 0x93146530, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x1000
[P1:T1:python3] trace: ---- close(9) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_SET) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33faf90) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33faff0) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fcntl(8, OP 1030, 0) = 0x9
[P1:T1:python3] trace: ---- fcntl(9, F_GETFL, 0x802001) = 0x80000
[P1:T1:python3] trace: ---- newfstatat(9, "", 0xa33fa9e0, 4096) = 0x0
[P1:T1:python3] trace: ---- read(9, 0x93146530, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x1000
[P1:T1:python3] trace: ---- close(9) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_SET) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- read(8, 0x93145d80, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33faf90) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33faff0) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fcntl(8, OP 1030, 0) = 0x9
[P1:T1:python3] trace: ---- fcntl(9, F_GETFL, 0x802001) = 0x80000
[P1:T1:python3] trace: ---- newfstatat(9, "", 0xa33fa9e0, 4096) = 0x0
[P1:T1:python3] trace: ---- read(9, 0x93146530, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x1000
[P1:T1:python3] trace: ---- close(9) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_SET) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- read(8, 0x93145d80, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- write(2, 0xcfdfbcc0, 0x389) ...
Traceback (most recent call last):
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 353, in <module>
    cli_main()
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 348, in cli_main
    process_fn(0, parsed_args)
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 325, in process_fn
    main(local_args)
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 169, in main
    dist.init_process_group("gloo",
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 503, in init_process_group
    _update_default_pg(_new_process_group_helper(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 588, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/transport/tcp/device.cc:29] rv != -1. -1 vs -1. Address family not supported by protocol
[P1:T1:python3] trace: ---- return from write(...) = 0x389
[P1:T1:python3] trace: ---- rt_sigaction([SIGINT], 0xa33fc5f0, 0xa33fc690, 0x8) = 0x0
[P1:T1:python3] trace: ---- close(6) = 0x0
[P1:T23:python3] trace: ---- return from poll(...) = 0x1
[P1:T1:python3] trace: ---- close(5) = 0x0
[P1:T23:python3] trace: ---- recvfrom(7, 0x924c3d06, 0x1, 0, 0, 0) = 0x0
[P1:T1:python3] trace: ---- futex(0x924c4910, FUTEX_CLOCK_REALTIME|FUTEX_WAIT_BITSET, 23, 0, 0, -1) ...
[P1:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P1:T1:python3] warning: Non-private futexes are not supported, assuming implicit FUTEX_PRIVATE_FLAG
[P1:T1:python3] warning: [ai-debug] cmd = [9]

[P1:T23:python3] trace: ---- close(7) = 0x0
[P1:T23:python3] trace: ---- poll(0x8c000bf0, 2, -1) ...
[P1:T23:python3] trace: ---- return from poll(...) = 0x1
[P1:T23:python3] trace: ---- write(2, 0xceba57d8, 0x30) ...
terminate called after throwing an instance of '[P1:T23:python3] trace: ---- return from write(...) = 0x30
[P1:T23:python3] trace: ---- write(2, 0x8c000bf0, 0x11) ...
std::system_error[P1:T23:python3] trace: ---- return from write(...) = 0x11
[P1:T23:python3] trace: ---- write(2, 0xceba57c4, 0x2) ...
'
[P1:T23:python3] trace: ---- return from write(...) = 0x2
[P1:T23:python3] trace: ---- write(2, 0xceba57c7, 0xb) ...
  what():  [P1:T23:python3] trace: ---- return from write(...) = 0xb
[P1:T23:python3] trace: ---- write(2, 0x8c000f18, 0x5d) ...
Unexpected poll revent on the control pipe's reading fd: 24: Software caused connection abort[P1:T23:python3] trace: ---- return from write(...) = 0x5d
[P1:T23:python3] trace: ---- write(2, 0xd022f723, 0x1) ...

[P1:T23:python3] trace: ---- return from write(...) = 0x1
[P1:T23:python3] trace: ---- rt_sigprocmask(UNBLOCK, [SIGABRT,], NULL, 0x8) = 0x0
[P1:T23:python3] trace: ---- gettid() = 0x17
[P1:T23:python3] trace: ---- getpid() = 0x1
[P1:T23:python3] trace: ---- tgkill(1, 23, [SIGABRT]) = 0x0
[P1:T23:python3] debug: killed by signal 6
[P1:T1:python3] warning: [ai-debug] ret = [0]

@boryspoplawski
Contributor

Yes, getifaddrs uses netlink sockets and as such is not supported in Gramine
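A quick way to confirm this from inside the enclave is to try opening a netlink socket directly. The sketch below is not from the thread. It assumes a Linux Python build that exposes `socket.AF_NETLINK`, and the helper name `netlink_supported` is hypothetical. An `EAFNOSUPPORT` failure here corresponds to the "Address family not supported by protocol" message in the traceback above.

```python
import errno
import socket


def netlink_supported() -> bool:
    """Return True if an AF_NETLINK routing socket can be created here."""
    family = getattr(socket, "AF_NETLINK", None)
    if family is None:
        # This Python build has no netlink support at all (e.g. non-Linux).
        return False
    # NETLINK_ROUTE is protocol 0 on Linux; fall back to 0 if the constant is missing.
    protocol = getattr(socket, "NETLINK_ROUTE", 0)
    try:
        sock = socket.socket(family, socket.SOCK_RAW, protocol)
    except OSError as exc:
        if exc.errno == errno.EAFNOSUPPORT:
            print("AF_NETLINK rejected: address family not supported")
        else:
            print(f"AF_NETLINK socket failed: {exc}")
        return False
    sock.close()
    return True


if __name__ == "__main__":
    print("netlink available:", netlink_supported())
```

Running this under Gramine and natively on the same host should show whether the failure is specific to the enclave environment.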

@boryspoplawski
Contributor

Also, please don't paste screenshots; use proper markdown (like you did below).

@sampleyang
Author

sampleyang commented Oct 13, 2022

> Yes, getifaddrs uses netlink sockets and as such is not supported in Gramine

@boryspoplawski
Thanks. So will Gramine support this socket option, or will #966 solve this problem?
For PyTorch, I think distributed processing is an important and common feature. 'gloo' is a collective communications library and the backend PyTorch uses in CPU mode. Using PyTorch's distributed feature with gloo as the backend calls the getifaddrs function during initialization, so this should be a common scenario.

@boryspoplawski
Contributor

boryspoplawski commented Oct 13, 2022

This (netlink) is not a socket option; it's an entirely different socket type (no, #966 is unrelated to this).

Please note that using TCP on localhost without any additional encryption is insecure in the SGX threat model (because a malicious host can modify these packets), so using gloo might not be a good idea. I'm not familiar with it and have no idea what exactly it does with that TCP, but consider yourself warned.

@dimakuv
Contributor

dimakuv commented Oct 13, 2022

@Simon-aikuier I agree with everything @boryspoplawski said. Gramine still has higher-priority TODOs than analyzing support for AF_NETLINK (and even if we look into this, we may consider it out of scope for Gramine for security reasons; no promises here).

However, it feels like you can set up gloo to use TLS -- please check the envvar GLOO_DEVICE_TRANSPORT and these links that I googled and found relevant:

I highly encourage you to try GLOO_DEVICE_TRANSPORT=TCP_TLS and see if this circumvents your Gramine problem with AF_NETLINK. It also circumvents the problem of using unprotected TCP connections that Borys mentioned (but for production, you'll have to figure out how to create and distribute TLS keys and certs securely).
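As a rough sketch of what trying this could look like from Python: only `GLOO_DEVICE_TRANSPORT=TCP_TLS` comes from this thread. The three `GLOO_DEVICE_TRANSPORT_TCP_TLS_*` variable names are assumptions to verify against your PyTorch build, the file paths are placeholders, and the helper `gloo_tls_env` is hypothetical. Whether this path avoids the getifaddrs call at all is untested here.

```python
import os
from typing import Dict


def gloo_tls_env(pkey: str, cert: str, ca_file: str) -> Dict[str, str]:
    """Build the environment variables that select gloo's TLS transport."""
    return {
        # From the thread: selects the TLS-wrapped TCP transport.
        "GLOO_DEVICE_TRANSPORT": "TCP_TLS",
        # Assumed names for the key material; check them for your PyTorch version.
        "GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY": pkey,
        "GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT": cert,
        "GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE": ca_file,
    }


if __name__ == "__main__":
    # Placeholder paths; in Gramine these files must also be listed in the manifest.
    env = gloo_tls_env("/certs/key.pem", "/certs/cert.pem", "/certs/ca.pem")
    # Must be set before torch.distributed.init_process_group("gloo", ...) runs.
    os.environ.update(env)
    for name, value in env.items():
        print(f"{name}={value}")
```

The same variables could instead be set through `loader.env.*` entries in the manifest template, which keeps them out of the application code.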

@sampleyang
Author

> @Simon-aikuier I agree with everything @boryspoplawski said. Gramine still has higher-priority TODOs than analyzing the support for AF_NETLINK (even if we'll look into this, maybe we'll consider this out of scope for Gramine for security reasons! no promises here).
>
> However, it feels like you can set up gloo to use TLS -- please check the envvar GLOO_DEVICE_TRANSPORT and these links that I googled and found relevant:
>
> I highly encourage to try GLOO_DEVICE_TRANSPORT=TCP_TLS and seeing if this circumvents your Gramine problem of AF_NETLINK. It also circumvents the problem of using unprotected TCP connections as Borys mentioned (but for production, you'll have to figure out how to create and distribute TLS keys and certs securely).

@dimakuv @boryspoplawski @llly
Thanks all for helping. I have solved the netlink problem by modifying the model framework code and using a single node instead of distributed nodes; it works in Gramine now. I think this issue can be closed. Thanks.
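The exact change to the framework code is not shown in this thread. One way such a single-node workaround could look is a guard that skips process-group setup when only one process is used. This is a minimal sketch under that assumption; the helper `maybe_init_distributed` and its parameters are hypothetical.

```python
from typing import Callable


def maybe_init_distributed(world_size: int, init_fn: Callable[[], None]) -> bool:
    """Call init_fn only when more than one process takes part.

    Returns True if the distributed backend was initialized, False if skipped.
    """
    if world_size <= 1:
        # Single process: no process group, so gloo (and getifaddrs) is never touched.
        return False
    init_fn()
    return True


if __name__ == "__main__":
    # In real code init_fn would wrap something like
    # torch.distributed.init_process_group("gloo", ...); a stub is used here.
    started = maybe_init_distributed(1, lambda: print("init_process_group called"))
    print("distributed initialized:", started)
```

Any later collective calls in the framework would need the same guard, since they fail without an initialized process group.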
