AI model based on PyTorch fails to run #948

sampleyang · 2022-09-30T07:57:47Z

Description of the problem

I have a machine translation model that uses torch.distributed.init_process_group to fork several processes. It reports an error when I run it on Gramine v1.3.

Debug info

The 2nd (child) enclave fails to start. The trace output is as follows:

[P1:T1:python3] debug: allocating checkpoint store (size = 67108864, reserve = 33554432)
[P1:T1:python3] error: Internal memory fault at 0x00000000 (0x3fa02ae44, VMID = 1, TID = 1)
debug: PalProcessExit: Returning exit code 1
warning: PalVirtualMemoryProtect is unimplemented in Linux-SGX PAL
debug: Gramine was built from commit: 00e91a0
debug: Host: Linux-SGX
debug: LibOS xsave_enabled 1, xsave_size 0xa80(2688), xsave_features 0xe7
debug: Initial VMA region 0x3fa000000-0x3fa1a3000 (LibOS) bookkeeped
debug: Initial VMA region 0x3ffffc000-0x400000000 (manifest) bookkeeped
debug: ASLR top address adjusted to 0x136997000
debug: host is Linux-SGX and remote attestation type is 'dcap', adding SGX-specific /dev/attestation/ files: report, quote, etc.
debug: LibOS loaded at 0x3fa000000, ready to initialize
error: libos_init: failed to read the whole checkpoint header: -61
debug: PalProcessExit: Returning exit code 1
Run application failed: run cmd error, exit status 1

My template

loader.entrypoint = "file:{{ gramine.libos }}"
loader.log_level = "all"

loader.env.LD_LIBRARY_PATH = "/lib:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu"
loader.env.PATH = "/usr/bin"

loader.insecure__use_cmdline_argv = true

fs.root.type = "chroot"
fs.root.path = "/"
fs.root.uri = "file:/"

fs.mounts = [
{ path = "/lib", uri = "file:{{ gramine.runtimedir() }}" },
]

sgx.nonpie_binary = true
sgx.enclave_size = "16G"
sgx.thread_num = 512
sgx.remote_attestation = "dcap"

sgx.trusted_files = [
"file:{{ gramine.runtimedir() }}/",
]

sgx.allowed_files = [
"file:/",
]

dimakuv · 2022-09-30T09:08:46Z

[P1:T1:python3] debug: allocating checkpoint store (size = 67108864, reserve = 33554432)

Based on this, looks like you have a huge checkpoint (that is prepared by the parent enclave and then sent to the child enclave). Have you tried increasing sgx.enclave_size even more, to 32G?

boryspoplawski · 2022-09-30T18:24:55Z

@dimakuv no, these are normal numbers. Try running fork_and_exec from libos/test/regression and you will get the same.
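
For scale, the two numbers in that debug line can be converted to binary units and compared with the configured enclave size. A minimal sketch: the helper name `to_mib` is made up for illustration, and reading the manifest value "16G" as 16 GiB is an assumption.

```python
MIB = 1024 * 1024
GIB = 1024 * MIB

def to_mib(num_bytes):
    """Convert a byte count to MiB."""
    return num_bytes / MIB

checkpoint_size = 67108864     # "size" from the debug line
checkpoint_reserve = 33554432  # "reserve" from the debug line
enclave_size = 16 * GIB        # sgx.enclave_size = "16G", read as GiB (assumption)

print(to_mib(checkpoint_size))     # 64.0
print(to_mib(checkpoint_reserve))  # 32.0
# Share of the enclave taken by the checkpoint store
print(f"{checkpoint_size / enclave_size:.2%}")
```

Under that assumption the checkpoint store is 1/256 of the enclave, which fits the observation that these sizes are not unusual.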

boryspoplawski · 2022-09-30T18:27:02Z

@Simon-aikuier we will need more details on how to reproduce the problem. Alternatively you can try debugging it or at least bisect at which commit you start to see the issue (since I see you build from source anyway).

sampleyang · 2022-10-01T01:07:50Z

[P1:T1:python3] debug: allocating checkpoint store (size = 67108864, reserve = 33554432)

Based on this, looks like you have a huge checkpoint (that is prepared by the parent enclave and then sent to the child enclave). Have you tried increasing sgx.enclave_size even more, to 32G?

32G has the same problem.

sampleyang · 2022-10-01T01:16:50Z

@Simon-aikuier we will need more details on how to reproduce the problem. Alternatively you can try debugging it or at least bisect at which commit you start to see the issue (since I see you build from source anyway).

commit: 00e91a0

I tried enabling trace logging, but only the parent enclave prints trace output; the child enclave does not (only a little debug output).

I also tried Gramine v1.1. It gets past this point, but then hits another problem: the program hangs on a futex.
Gramine v1.1 log:

Hang log:

boryspoplawski · 2022-10-01T02:26:39Z

@Simon-aikuier the 2nd log is definitely neither from 00e91a0 nor v1.3.

sampleyang · 2022-10-01T05:12:43Z

the 2nd log is definitely neither from 00e91a0 nor v1.3.

The 2nd log is from Gramine v1.1. I tried different Gramine versions.

sampleyang · 2022-10-08T03:41:33Z

@Simon-aikuier we will need more details on how to reproduce the problem. Alternatively you can try debugging it or at least bisect at which commit you start to see the issue (since I see you build from source anyway).

@boryspoplawski
I tried Gramine v1.3.1. The internal memory fault no longer occurs, but a new error appears:

=> New Error:
[P1:T1:python3] debug: glibc register library /usr/local/lib/python3.8/dist-packages/google/protobuf/pyext/_message.cpython-38-x86_64-linux-gnu.so loaded at 0x12ff30000
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/termios.cpython-38-x86_64-linux-gnu.so loaded at 0x1690ce000
[P1:T1:python3] warning: libos_syscall_socket: unsupported socket domain 16
[P1:T1:python3] debug: Creating pipe: pipe.srv:e924eb0954ddf3874f12e6cf370c07918d2be989e154a5e2d63b1d0659b72771
[P1:T1:python3] warning: Unsupported system call clone3
[P1:T1:python3] debug: Creating pipe: pipe.srv:9ea56ac046f16173d0da37af637a436f6f6bad67f70288ae4cb51e8f7e34b529
[P1:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P1:T1:python3] warning: Non-private futexes are not supported, assuming implicit FUTEX_PRIVATE_FLAG
terminate called after throwing an instance of 'std::system_error'
what(): Unexpected poll revent on the control pipe's reading fd: 24: Software caused connection abort
[P1:T23:python3] debug: killed by signal 6
[P1:T1:python3] debug: Installed async event at 1665199760214444

=> Trace Info:
[P1:T1:python3] warning: libos_syscall_socket: unsupported socket domain 16
[P1:T1:python3] trace: ---- socket(NETLINK, SOCK_CLOEXEC|RAW, 0) = -97
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/etc/gai.conf", O_RDONLY|0x80000, 0000) = 0x3
[P1:T1:python3] trace: ---- newfstatat(3, "", 0x8480dca0, 4096) = 0x0
[P1:T1:python3] trace: ---- newfstatat(3, "", 0x8480daa0, 4096) = 0x0
[P1:T1:python3] trace: ---- read(3, 0x6f038b50, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0xa18
[P1:T1:python3] trace: ---- read(3, 0x6f038b50, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x0
[P1:T1:python3] trace: ---- close(3) = 0x0
[P1:T1:python3] trace: ---- futex(0xabddce84, FUTEX_PRIVATE|FUTEX_WAKE, 2147483647, 0, 0xf59048, 7) ...
[P1:T1:python3] trace: ---- return from futex(...) = 0x0
[P1:T1:python3] trace: ---- socket(INET6, SOCK_CLOEXEC|DGRAM, 0) = 0x3
[P1:T1:python3] trace: ---- connect(3, {family=IPv6,ip=[0:0:0:0:0:0:0:0],port=40379}, 28) ...
[P1:T1:python3] trace: ---- return from connect(...) = 0x0
[P1:T1:python3] trace: ---- getsockname(3, 0x8480dd58, 0x8480de70) = 0x0
[P1:T1:python3] trace: ---- connect(3, UNKNOWN, 16) ...
[P1:T1:python3] trace: ---- return from connect(...) = 0x0
[P1:T1:python3] trace: ---- connect(3, {family=IPv4,ip=0.0.0.0,port=40379}, 16) ...
[P1:T1:python3] trace: ---- return from connect(...) = -22
[P1:T1:python3] trace: ---- close(3) = 0x0
[P1:T1:python3] trace: ---- socket(INET6, STREAM, 6) = 0x3
[P1:T1:python3] trace: ---- setsockopt(3, 1, 2, 0x8480e420, 4) = 0x0
[P1:T1:python3] trace: ---- bind(3, {family=IPv6,ip=[0:0:0:0:0:0:0:0],port=40379}, 28) = 0x0
[P1:T1:python3] trace: ---- listen(3, 2048) = 0x0
[P1:T1:python3] trace: ---- getsockname(3, 0x8480e420, 0x8480e3c4) = 0x0
[P1:T1:python3] debug: Creating pipe: pipe.srv:91b61487213987a6ecae039250272dcd7b083e7fac834042be0d1b239511c7f7
[P1:T1:python3] trace: ---- pipe2(0x6f1395e0, 0) = 0x0
[P1:T1:python3] trace: ---- mmap(0, 0x401000, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0x0) ...
[P1:T1:python3] trace: ---- return from mmap(...) = 0x6dc4e000
[P1:T1:python3] trace: ---- mprotect(0x6dc4f000, 0x400000, PROT_READ|PROT_WRITE) ...
[P1:T1:python3] trace: ---- return from mprotect(...) = 0x0
[P1:T1:python3] trace: ---- rt_sigprocmask(BLOCK, [SIGHUP,SIGINT,SIGQUIT,SIGILL,SIGTRAP,SIGABRT,SIGBUS,SIGFPE,SIGKILL,SIGUSR1,SIGSEGV,SIGUSR2,SIGPIPE,SIGALRM,SIGTERM,SIGSTKFLT,SIGCHLD,SIGCONT,SIGSTOP,SIGTSTP,SIGTTIN,SIGTTOU,SIGURG,SIGXCPU,SIGXFSZ,SIGVTALRM,SIGPROF,SIGWINCH,SIGIO
[P1:T1:python3] trace: ,SIGPWR,SIGSYS,SIG32,SIG33,SIG34,SIG35,SIG36,SIG37,SIG38,SIG39,SIG40,SIG41,SIG42,SIG43,SIG44,SIG45,SIG46,SIG47,SIG48,SIG49,SIG50,SIG51,SIG52,SIG53,SIG54,SIG55,SIG56,SIG57,SIG58,SIG59,SIG60,SIG61,SIG62,SIG63,SIG64,], [], 0x8) = 0x0
[P1:T1:python3] warning: Unsupported system call clone3
[P1:T1:python3] trace: ---- clone(CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, 0x6e04def0, 0x6e04e910, 0x6e04e910, 0x6e04e640) ...
[P1:T1:python3] debug: Creating pipe: pipe.srv:e5efe11f5bd73f6a0d34c8e5ed5a14ae4b18d0ab54939d9f3404627f4015b491
[P1:T1:python3] trace: ---- return from clone(...) = 0x17
[P1:T23:python3] trace: ---- set_robust_list(0x6e04e920, 0x18) = 0x0
[P1:T1:python3] trace: ---- rt_sigprocmask(SETMASK, [], NULL, 0x8) = 0x0
[P1:T23:python3] trace: ---- rt_sigprocmask(SETMASK, [], NULL, 0x8) = 0x0
[P1:T23:python3] trace: ---- mmap(0, 0x8000000, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0x0) ...
[P1:T1:python3] trace: ---- newfstatat(AT_FDCWD, "/etc/nsswitch.conf", 0x8480d930, 0) = 0x0
[P1:T1:python3] trace: ---- newfstatat(AT_FDCWD, "/etc/resolv.conf", 0x8480da70, 0) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/etc/hosts", O_RDONLY|0x80000, 0000) = 0x6
[P1:T1:python3] trace: ---- newfstatat(6, "", 0x8480d900, 4096) = 0x0
[P1:T1:python3] trace: ---- lseek(6, 0x0, SEEK_SET) = 0x0
[P1:T1:python3] trace: ---- read(6, 0x6f038b50, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x24e
[P1:T1:python3] trace: ---- read(6, 0x6f038b50, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x0
[P1:T1:python3] trace: ---- close(6) = 0x0
[P1:T1:python3] trace: ---- clock_gettime(0, 0x8480e3a0) = 0x0
[P1:T1:python3] trace: ---- socket(INET, STREAM, 6) = 0x6
[P1:T1:python3] trace: ---- fcntl(6, F_SETFL, 0x800) = 0x0
[P1:T1:python3] trace: ---- connect(6, {family=IPv4,ip=127.0.0.1,port=40379}, 16) ...
[P1:T1:python3] trace: ---- return from connect(...) = 0x0
[P1:T1:python3] trace: ---- clock_gettime(0, 0x8480e3a0) = 0x0
[P1:T1:python3] trace: ---- poll(0x8480e428, 1, 1800000) ...
[P1:T1:python3] trace: ---- return from poll(...) = 0x1
[P1:T1:python3] trace: ---- getsockopt(6, 1, 4, 0xabaaab00, 0x8480e3f4) = 0x0
[P1:T1:python3] trace: ---- fcntl(6, F_GETFL, 0x4) = 0x802
[P1:T1:python3] trace: ---- fcntl(6, F_SETFL, 0x2) = 0x0
[P1:T1:python3] trace: ---- setsockopt(6, 6, 1, 0x8480e394, 4) = 0x0
[P1:T1:python3] trace: ---- sendto(6, 0x8480e3f8, 0x1, 0, 0, 0) = 0x1
[P1:T1:python3] trace: ---- sendto(6, 0x8480e3f8, 0x8, 32768, 0, 0) = -95
[P1:T1:python3] trace: ---- futex(0xab28b1e0, FUTEX_PRIVATE|FUTEX_WAKE, 2147483647, 0, 0x1, 202) ...
[P1:T1:python3] trace: ---- return from futex(...) = 0x0
[P1:T1:python3] trace: ---- close(5) = 0x0
[P1:T1:python3] trace: ---- futex(0x6e04e910, FUTEX_CLOCK_REALTIME|FUTEX_WAIT_BITSET, 23, 0, 0, -1) ...
[P1:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P1:T1:python3] warning: Non-private futexes are not supported, assuming implicit FUTEX_PRIVATE_FLAG
[P1:T23:python3] trace: ---- return from mmap(...) = 0x65c4e000
[P1:T23:python3] trace: ---- munmap(0x65c4e000, 0x23b2000) ...
[P1:T23:python3] trace: ---- return from munmap(...) = 0x0
[P1:T23:python3] trace: ---- munmap(0x6c000000, 0x1c4e000) ...
[P1:T23:python3] trace: ---- return from munmap(...) = 0x0
[P1:T23:python3] trace: ---- mprotect(0x68000000, 0x21000, PROT_READ|PROT_WRITE) ...
[P1:T23:python3] trace: ---- return from mprotect(...) = 0x0
[P1:T23:python3] trace: ---- poll(0x68000b90, 2, -1) ...
[P1:T23:python3] trace: ---- return from poll(...) = 0x2
[P1:T23:python3] trace: ---- poll(0x68000b70, 1, -1) ...
[P1:T23:python3] trace: ---- return from poll(...) = 0x1
[P1:T23:python3] trace: ---- accept(3, 0, 0) ...
[P1:T23:python3] trace: ---- return from accept(...) = 0x5
[P1:T23:python3] trace: ---- getpeername(5, 0x6e04dce0, 0x6e04dcbc) = 0x0
[P1:T23:python3] trace: ---- setsockopt(5, 6, 1, 0x6e04dc74, 4) = 0x0
[P1:T23:python3] trace: ---- write(2, 0xaa7a57d8, 0x30) ...
terminate called after throwing an instance of '[P1:T23:python3] trace: ---- return from write(...) = 0x30
[P1:T23:python3] trace: ---- write(2, 0x68000bf0, 0x11) ...
std::system_error[P1:T23:python3] trace: ---- return from write(...) = 0x11
[P1:T23:python3] trace: ---- write(2, 0xaa7a57c4, 0x2) ...
'
[P1:T23:python3] trace: ---- return from write(...) = 0x2
[P1:T23:python3] trace: ---- write(2, 0xaa7a57c7, 0xb) ...
what(): [P1:T23:python3] trace: ---- return from write(...) = 0xb
[P1:T23:python3] trace: ---- write(2, 0x68000dc8, 0x5d) ...
Unexpected poll revent on the control pipe's reading fd: 24: Software caused connection abort[P1:T23:python3] trace: ---- return from write(...) = 0x5d
[P1:T23:python3] trace: ---- write(2, 0xabdd5723, 0x1) ...

[P1:T23:python3] trace: ---- return from write(...) = 0x1
[P1:T23:python3] trace: ---- rt_sigprocmask(UNBLOCK, [SIGABRT,], NULL, 0x8) = 0x0
[P1:T23:python3] trace: ---- gettid() = 0x17
[P1:T23:python3] trace: ---- getpid() = 0x1
[P1:T23:python3] trace: ---- tgkill(1, 23, [SIGABRT]) = 0x0
[P1:T23:python3] debug: killed by signal 6
[P1:T1:python3] trace: ---- return from futex(...) = -512
[P1:T8:python3] trace: ---- return from futex(...) = -512
[P1:T9:python3] trace: ---- return from futex(...) = -512
[P1:T10:python3] trace: ---- return from futex(...) = -512
[P1:T11:python3] trace: ---- return from futex(...) = -512
[P1:T12:python3] trace: ---- return from futex(...) = -512
[P1:T13:python3] trace: ---- return from futex(...) = -512
[P1:T14:python3] trace: ---- return from futex(...) = -512
[P1:T15:python3] trace: ---- return from futex(...) = -512
[P1:T16:python3] trace: ---- return from futex(...) = -512
[P1:T17:python3] trace: ---- return from futex(...) = -512
[P1:T18:python3] trace: ---- return from futex(...) = -512
[P1:T19:python3] trace: ---- return from futex(...) = -512
[P1:T20:python3] trace: ---- return from futex(...) = -512
[P1:T21:python3] trace: ---- return from futex(...) = -512
[P1:T20:python3] debug: Installed async event at 1665197427656149
[P1:T22:python3] trace: ---- return from futex(...) = -512
[P1:T21:python3] debug: Installed async event at 1665197427664681
[P1:T1:python3] debug: Installed async event at 1665197427666847
[P1:T8:python3] debug: Installed async event at 1665197427673831
[P1:T9:python3] debug: Installed async event at 1665197427682326
[P1:T22:python3] debug: Installed async event at 1665197427683469
[P1:T10:python3] debug: Installed async event at 1665197427685189
[P1:T11:python3] debug: Installed async event at 1665197427689259
[P1:T12:python3] debug: Installed async event at 1665197427692952
[P1:T13:python3] debug: Installed async event at 1665197427696282

llly · 2022-10-10T02:02:04Z

[P1:T1:python3] trace: ---- sendto(6, 0x8480e3f8, 0x8, 32768, 0, 0) = -95

I'm trying to fix this issue in #936

sampleyang · 2022-10-10T03:45:59Z

[P1:T1:python3] trace: ---- sendto(6, 0x8480e3f8, 0x8, 32768, 0, 0) = -95

I'm trying to fix this issue in #936

@llly thanks.
I am not sure whether they are the same problem. I just see some abnormal output, such as:

[P1:T1:python3] warning: libos_syscall_socket: unsupported socket domain 16
[P1:T1:python3] trace: ---- socket(NETLINK, SOCK_CLOEXEC|RAW, 0) = -97

[P1:T1:python3] trace: ---- sendto(6, 0x8480e3f8, 0x8, 32768, 0, 0) = -95
[P1:T1:python3] warning: Non-private futexes are not supported, assuming implicit FUTEX_PRIVATE_FLAG

terminate called after throwing an instance of 'std::system_error'
what(): Unexpected poll revent on the control pipe's reading fd: 24: Software caused connection abort
[P1:T23:python3] trace: ---- tgkill(1, 23, [SIGABRT]) = 0x0
[P1:T23:python3] debug: killed by signal 6

It also seems that my application never reaches its main function; it is killed while python3 is still loading (while the LibOS loads the related libraries). Hope that helps.
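
One way to pull such abnormal lines out of a long trace is to filter for syscalls that return a negative value. A minimal sketch, assuming the trace line format shown in this thread; `failed_syscalls` and `ERRNO_NAMES` are names made up for illustration, and the errno table is hardcoded with Linux x86-64 values rather than taken from Gramine.

```python
import re

# Linux x86-64 errno values that appear in the traces in this thread.
# 512 (ERESTARTSYS) is a kernel-internal restart code, not a regular errno.
ERRNO_NAMES = {
    2: "ENOENT",
    13: "EACCES",
    22: "EINVAL",
    38: "ENOSYS",
    95: "EOPNOTSUPP",
    97: "EAFNOSUPPORT",
    512: "ERESTARTSYS",
}

# Matches e.g. "[P1:T1:python3] trace: ---- socket(NETLINK, ...) = -97"
# and "[P1:T1:python3] trace: ---- return from connect(...) = -22".
TRACE_RE = re.compile(
    r"^\[(?P<who>[^\]]+)\] trace: ---- (?:return from )?"
    r"(?P<syscall>\w+)\(.*\) = -(?P<err>\d+)\s*$"
)

def failed_syscalls(lines):
    """Yield (thread, syscall, errno value, errno name) for failing trace lines."""
    for line in lines:
        m = TRACE_RE.match(line)
        if m:
            err = int(m.group("err"))
            yield m.group("who"), m.group("syscall"), err, ERRNO_NAMES.get(err, "?")

sample = [
    "[P1:T1:python3] trace: ---- socket(NETLINK, SOCK_CLOEXEC|RAW, 0) = -97",
    "[P1:T1:python3] trace: ---- sendto(6, 0x8480e3f8, 0x8, 32768, 0, 0) = -95",
    "[P1:T1:python3] trace: ---- return from connect(...) = -22",
    "[P1:T1:python3] trace: ---- close(3) = 0x0",
]
for who, syscall, err, name in failed_syscalls(sample):
    print(f"{who}: {syscall} failed with {err} ({name})")
```

Not every negative return is a real problem (for example, ioctl(TCGETS) returning -38 on a regular file is harmless), so the output is a starting point for reading the trace, not a diagnosis.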

boryspoplawski · 2022-10-10T15:23:53Z

[P1:T1:python3] trace: ---- socket(NETLINK, SOCK_CLOEXEC|RAW, 0) = -97

Gramine does not support netlink sockets, so if it's a hard requirement of your app, then it won't work. But I don't see why pytorch would need those.

Unexpected poll revent on the control pipe's reading fd: 24

This error is weird, but it might be the original issue here. Unfortunately, it's hard to say anything more without any details. You can try debugging it yourself (gdb would be handy); for that I would recommend a debug build and trying gramine-direct first.

llly · 2022-10-11T09:06:52Z

But I don't see why pytorch would need those.

PyTorch uses netlink to get the local IP address. It's not a hard requirement.

Unexpected poll revent on the control pipe's reading fd: 24

It's caused by sendto(MSG_MORE), I have investigated this failure.

dimakuv · 2022-10-11T09:21:09Z

@llly So your #966 fixes this issue as well? If yes, can you add Fixes #948 to the PR description?

boryspoplawski · 2022-10-11T15:25:45Z

Unexpected poll revent on the control pipe's reading fd: 24

It's caused by sendto(MSG_MORE), I have investigated this failure.

Could you elaborate more? [P1:T1:python3] trace: ---- sendto(6, 0x8480e3f8, 0x8, 32768, 0, 0) = -95 is caused by that, but I don't see how that's related to poll on a different fd.
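
For reference, the fourth argument of the traced sendto call (32768) is the flags word. A small sketch that decodes it, using a few Linux flag values hardcoded from the system headers; the helper name `decode_msg_flags` is made up for illustration and the table is not exhaustive.

```python
# A few Linux send/recv flag values; hardcoded so the sketch does not
# depend on which constants the local Python build exposes.
MSG_FLAGS = {
    0x1: "MSG_OOB",
    0x4: "MSG_DONTROUTE",
    0x40: "MSG_DONTWAIT",
    0x80: "MSG_EOR",
    0x800: "MSG_CONFIRM",
    0x4000: "MSG_NOSIGNAL",
    0x8000: "MSG_MORE",
}

def decode_msg_flags(flags):
    """Return the names of the known flag bits set in `flags`."""
    names = [name for bit, name in MSG_FLAGS.items() if flags & bit]
    known = sum(bit for bit in MSG_FLAGS if flags & bit)
    unknown = flags & ~known
    if unknown:
        names.append(hex(unknown))  # bits not covered by the table
    return names

print(decode_msg_flags(32768))  # ['MSG_MORE']
```

So the call that returned -95 (EOPNOTSUPP) is a sendto with only MSG_MORE set.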

sampleyang · 2022-10-12T09:04:12Z

But I don't see why pytorch would need those.

PyTorch uses netlink to get the local IP address. It's not a hard requirement.

Unexpected poll revent on the control pipe's reading fd: 24

It's caused by sendto(MSG_MORE), I have investigated this failure.

@llly @boryspoplawski
I just tried #966, and the sendto error is resolved. The AF_NETLINK error still exists. Gramine now reaches my app's main function, but a new error about the socket address family occurs.

[P1:T1:python3] warning: Unsupported system call clone3
[P1:T1:python3] debug: Creating pipe: pipe.srv:7eebbace09c80cace771ad22fbb4133d01ecbd9318a895770ede199bd20eb16d
[P1:T1:python3] warning: libos_syscall_socket: unsupported socket domain 16
[P1:T1:python3] warning: [ai-debug] cmd = [1]

[P1:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P1:T1:python3] warning: Non-private futexes are not supported, assuming implicit FUTEX_PRIVATE_FLAG
[P1:T1:python3] warning: [ai-debug] cmd = [9]

[P1:T24:python3] debug: ---- exit (returning 0)
[P1:T24:python3] debug: Installed async event at 1665575334066389
[P1:libos] debug: Thread exited, cleaning up
[P1:T1:python3] warning: [ai-debug] ret_tmp = [0]

=>[thumt-debug] call main
=>[thumt-debug] call cli_main
=>[thumt-debug] load configs

=>[thumt-debug] init_method = tcp://localhost:54485, local_rank = 0

[P1:T1:python3] debug: glibc register library /usr/lib/python3/dist-packages/apt_pkg.cpython-38-x86_64-linux-gnu.so loaded at 0x177902000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libapt-pkg.so.6.0 loaded at 0x177725000
[P1:T1:python3] debug: glibc register library /lib/libresolv.so.2 loaded at 0x18941c000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/liblz4.so.1 loaded at 0x177704000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libzstd.so.1 loaded at 0x17765b000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libudev.so.1 loaded at 0x17762e000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libsystemd.so.0 loaded at 0x17757f000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libgcrypt.so.20 loaded at 0x177461000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libgpg-error.so.0 loaded at 0x17743e000
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/_decimal.cpython-38-x86_64-linux-gnu.so loaded at 0x1772d1000
[P1:T1:python3] debug: glibc register library /usr/lib/x86_64-linux-gnu/libmpdec.so.2 loaded at 0x177299000
[P1:T1:python3] debug: glibc register library /usr/lib/python3/dist-packages/simplejson/_speedups.cpython-38-x86_64-linux-gnu.so loaded at 0x19dcf3000
[P1:T1:python3] debug: glibc register library /usr/lib/python3.8/lib-dynload/_multibytecodec.cpython-38-x86_64-linux-gnu.so loaded at 0x19dc8e000
Traceback (most recent call last):
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 356, in <module>
    cli_main()
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 351, in cli_main
    process_fn(0, parsed_args)
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 325, in process_fn
    main(local_args)
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 169, in main
    dist.init_process_group("gloo",
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 503, in init_process_group
    _update_default_pg(_new_process_group_helper(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 588, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/transport/tcp/device.cc:179] rv != -1. -1 vs -1. Address family not supported by protocol
[P1:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P1:T1:python3] warning: Non-private futexes are not supported, assuming implicit FUTEX_PRIVATE_FLAG
[P1:T1:python3] warning: [ai-debug] cmd = [9]

terminate called after throwing an instance of 'std::system_error'
  what():  Unexpected poll revent on the control pipe's reading fd: 24: Software caused connection abort
[P1:T23:python3] debug: killed by signal 6

This Python traceback error is also errno 97 (EAFNOSUPPORT). The relevant PyTorch (gloo) code is as follows:

gloo/gloo/transport/tcp/device.cc

const std::string sockaddrToInterfaceName(const struct attr& attr) {
  struct ifaddrs* ifap;
  std::string iface;
  auto rv = getifaddrs(&ifap);
  GLOO_ENFORCE_NE(rv, -1, strerror(errno));
  auto addrIsLocalhost = isLocalhostAddr((struct sockaddr*)&attr.ai_addr);

gloo/gloo/transport/tcp/device.cc

static void lookupAddrForIface(struct attr& attr) {
  struct ifaddrs* ifap;
  auto rv = getifaddrs(&ifap);
  GLOO_ENFORCE_NE(rv, -1, strerror(errno));

I tested the getifaddrs function with and without Gramine; it seems that neither AF_NETLINK nor AF_PACKET is supported in Gramine.

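#include <errno.h>
#include <stdio.h>
#include <ifaddrs.h>
#include <sys/socket.h>

int main(void)
{
  struct ifaddrs *addrs, *ent;

  /* On Linux, glibc implements getifaddrs() on top of an AF_NETLINK socket */
  if (getifaddrs(&addrs))
  {
      int err = errno;  /* save errno before printf can change it */
      printf("errno = %d\n", err);
      errno = err;
      perror("getifaddrs()");
      return 1;
  }

  int count = 0;

  /* Count the interfaces and print the address family of each entry */
  for (ent = addrs; ent; ent = ent->ifa_next)
  {
    count++;
    /* ifa_addr may be NULL for an interface without an address */
    if (!ent->ifa_addr)
      continue;
    printf("\"%s\" af_family = %d\n", ent->ifa_name, ent->ifa_addr->sa_family);
  }
  freeifaddrs(addrs);
  return 0;
}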

"lo" af_family = 17     // AF_PACKET
"eth0" af_family = 17
"docker0" af_family = 17
"lo" af_family = 2    // AF_INET
"eth0" af_family = 2
"docker0" af_family = 2
"lo" af_family = 10    // AF_INET6
"eth0" af_family = 10
"docker0" af_family = 10

Running the above code in Gramine reproduces the same error:

[P1:T1:if] warning: libos_syscall_socket: unsupported socket domain 16
[P1:T1:if] trace: ---- socket(NETLINK, SOCK_CLOEXEC|RAW, 0) = -97         // NETLINK Error
[P1:T1:if] trace: ---- newfstatat(1, "", 0xdf7e490, 4096) = 0x0
[P1:T1:if] trace: ---- ioctl(1, TCGETS, 0xdf7e400) ...
[P1:T1:if] trace: ---- return from ioctl(...) = -38
[P1:T1:if] trace: ---- getrandom(0x19eda198, 0x8, GRND_NONBLOCK) = 0x8
[P1:T1:if] trace: ---- brk(0) = 0x1b1c9000
[P1:T1:if] trace: ---- brk(0x1b1ea000) = 0x1b1ea000
[P1:T1:if] trace: ---- dup(2) = 0x3
[P1:T1:if] trace: ---- fcntl(3, F_GETFL, 0x19e9459c) = 0x401
[P1:T1:if] trace: ---- close(3) = 0x0
[P1:T1:if] trace: ---- write(2, 0xdf7c0d0, 0x37) ...
getifaddrs(): Address family not supported by protocol
[P1:T1:if] trace: ---- return from write(...) = 0x37
[P1:T1:if] trace: ---- write(1, 0x1b1c92a0, 0xb) ...
errno = 97
[P1:T1:if] trace: ---- return from write(...) = 0xb
[P1:T1:if] debug: ---- exit_group (returning 1)
[P1:T1:if] debug: clearing POSIX locks for pid 1
[P1:T1:if] debug: sync client shutdown: closing handles
[P1:T1:if] debug: sync client shutdown: waiting for confirmation
[P1:T1:if] debug: sync client shutdown: finished
[P1:libos] debug: Async worker thread terminated
[P1:libos] debug: IPC worker: exiting worker thread
[P1:T1:if] debug: process 1 exited with status 1
debug: PalProcessExit: Returning exit code 1
Run application failed: run cmd error, exit status 1
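
The same check can be made from Python without going through getifaddrs, by opening the netlink socket directly. A minimal sketch; `probe_netlink` is a name made up for illustration. Judging by the trace above, under this Gramine version it would presumably report EAFNOSUPPORT, but that is an expectation, not something verified here.

```python
import errno
import socket

def probe_netlink():
    """Try to open a netlink socket; return (ok, error_name_or_None)."""
    family = getattr(socket, "AF_NETLINK", None)  # only defined on Linux
    if family is None:
        return False, "AF_NETLINK not available on this platform"
    try:
        # Mirrors the traced call socket(NETLINK, SOCK_CLOEXEC|RAW, 0).
        # Python sets close-on-exec by default; protocol 0 is NETLINK_ROUTE.
        sock = socket.socket(family, socket.SOCK_RAW, 0)
    except OSError as exc:
        return False, errno.errorcode.get(exc.errno, str(exc.errno))
    sock.close()
    return True, None

ok, err = probe_netlink()
print("netlink sockets supported" if ok else f"netlink socket failed: {err}")
```

Running it natively and then under Gramine with the same manifest gives a quick side-by-side comparison without involving PyTorch.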

Trace output from my application:

[P1:T1:python3] trace: ---- socket(NETLINK, SOCK_CLOEXEC|RAW, 0) = -97
[P1:T1:python3] trace: ---- futex(0xcf6ab1e0, FUTEX_PRIVATE|FUTEX_WAKE, 2147483647, 0, 0x1, 202) ...
[P1:T1:python3] warning: [ai-debug] cmd = [1]

[P1:T1:python3] trace: ---- return from futex(...) = 0x0
[P1:T1:python3] trace: ---- write(1, 0xcfdfacb0, 0x9b) ...
=>[thumt-debug] call main
=>[thumt-debug] call cli_main
=>[thumt-debug] load configs

=>[thumt-debug] init_method = tcp://localhost:54071, local_rank = 0
......
[P1:T1:python3] trace: ---- stat("/etc/apt/apt.conf", 0xa33f7520) = -2
[P1:T1:python3] trace: ---- stat("/var/lib/dpkg/status", 0xa33f7500) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/bin/dpkg", 0xa33f7500) = 0x0
[P1:T1:python3] trace: ---- stat("/etc/debian_version", 0xa33f7500) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/share/dpkg/cputable", O_RDONLY, 0000) = 0x8
[P1:T1:python3] trace: ---- read(8, 0x93126f50, 0x1fff) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x7b3
[P1:T1:python3] trace: ---- read(8, 0x93126f50, 0x1fff) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/share/dpkg/tupletable", 0xa33f6f00) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/share/dpkg/tupletable", O_RDONLY, 0000) = 0x9
[P1:T1:python3] trace: ---- read(9, 0x93129770, 0x1fff) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x8c9
[P1:T1:python3] trace: ---- read(9, 0x93129770, 0x1fff) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x0
[P1:T1:python3] trace: ---- close(9) = 0x0
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- stat("/root/ailab/THUMT/thumt/bin", 0xa33f7260) = 0x0
[P1:T1:python3] trace: ---- stat("/root/ailab/THUMT", 0xa33f7260) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3.8", 0xa33f7260) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3.8/lib-dynload", 0xa33f7260) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/local/lib/python3.8/dist-packages", 0xa33f7260) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3/dist-packages", 0xa33f7260) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3/dist-packages/apport", 0xa33f7260) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3/dist-packages/apport/packaging.py", 0xa33f6f40) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3/dist-packages/apport/packaging.py", 0xa33f7740) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/lib/python3/dist-packages/apport/__pycache__/packaging.cpython-38.pyc", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33f7490) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33f74f0) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fstat(8, 0xa33f7840) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x931283e0, 0x2dcf) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2dce
[P1:T1:python3] trace: ---- read(8, 0x9312b1ae, 0x1) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x0
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3/dist-packages/apport", 0xa33f8ea0) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3/dist-packages/apport/hookutils.py", 0xa33f8b80) = 0x0
[P1:T1:python3] trace: ---- stat("/usr/lib/python3/dist-packages/apport/hookutils.py", 0xa33f9380) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/lib/python3/dist-packages/apport/__pycache__/hookutils.cpython-38.pyc", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33f90d0) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33f9130) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fstat(8, 0xa33f9480) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x93142d40, 0x6df2) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x6df1
[P1:T1:python3] trace: ---- read(8, 0x93149b31, 0x1) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x0
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- getcwd(0x932b10c0, 0x400) = 0x2
[P1:T1:python3] trace: ---- lstat("/root", 0xa33fc0b0) = 0x0
[P1:T1:python3] trace: ---- lstat("/root/ailab", 0xa33fc0b0) = 0x0
[P1:T1:python3] trace: ---- lstat("/root/ailab/THUMT", 0xa33fc0b0) = 0x0
[P1:T1:python3] trace: ---- lstat("/root/ailab/THUMT/thumt", 0xa33fc0b0) = 0x0
[P1:T1:python3] trace: ---- lstat("/root/ailab/THUMT/thumt/bin", 0xa33fc0b0) = 0x0
[P1:T1:python3] trace: ---- lstat("/root/ailab/THUMT/thumt/bin/translator.py", 0xa33fc0b0) = 0x0
[P1:T1:python3] trace: ---- access("/root/ailab/THUMT/thumt/bin/translator.py", F_OK|R_OK) = -13
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/root/ailab/THUMT/thumt/bin/translator.py", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33faf90) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33faff0) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fcntl(8, OP 1030, 0) = 0x9
[P1:T1:python3] trace: ---- fcntl(9, F_GETFL, 0x802001) = 0x80000
[P1:T1:python3] trace: ---- newfstatat(9, "", 0xa33fa9e0, 4096) = 0x0
[P1:T1:python3] trace: ---- read(9, 0x93146530, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x1000
[P1:T1:python3] trace: ---- close(9) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_SET) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- read(8, 0x93145d80, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x907
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/root/ailab/THUMT/thumt/bin/translator.py", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33faf90) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33faff0) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fcntl(8, OP 1030, 0) = 0x9
[P1:T1:python3] trace: ---- fcntl(9, F_GETFL, 0x802001) = 0x80000
[P1:T1:python3] trace: ---- newfstatat(9, "", 0xa33fa9e0, 4096) = 0x0
[P1:T1:python3] trace: ---- read(9, 0x93146530, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x1000
[P1:T1:python3] trace: ---- close(9) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_SET) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- read(8, 0x93145d80, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x907
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/root/ailab/THUMT/thumt/bin/translator.py", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33faf90) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33faff0) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fcntl(8, OP 1030, 0) = 0x9
[P1:T1:python3] trace: ---- fcntl(9, F_GETFL, 0x802001) = 0x80000
[P1:T1:python3] trace: ---- newfstatat(9, "", 0xa33fa9e0, 4096) = 0x0
[P1:T1:python3] trace: ---- read(9, 0x93146530, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x1000
[P1:T1:python3] trace: ---- close(9) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_SET) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- read(8, 0x93145d80, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x907
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/root/ailab/THUMT/thumt/bin/translator.py", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33faf90) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33faff0) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fcntl(8, OP 1030, 0) = 0x9
[P1:T1:python3] trace: ---- fcntl(9, F_GETFL, 0x802001) = 0x80000
[P1:T1:python3] trace: ---- newfstatat(9, "", 0xa33fa9e0, 4096) = 0x0
[P1:T1:python3] trace: ---- read(9, 0x93146530, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x1000
[P1:T1:python3] trace: ---- close(9) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_SET) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33faf90) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33faff0) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fcntl(8, OP 1030, 0) = 0x9
[P1:T1:python3] trace: ---- fcntl(9, F_GETFL, 0x802001) = 0x80000
[P1:T1:python3] trace: ---- newfstatat(9, "", 0xa33fa9e0, 4096) = 0x0
[P1:T1:python3] trace: ---- read(9, 0x93146530, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x1000
[P1:T1:python3] trace: ---- close(9) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_SET) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- read(8, 0x93145d80, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- openat(AT_FDCWD, "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", O_RDONLY|0x80000, 0000) = 0x8
[P1:T1:python3] trace: ---- fstat(8, 0xa33faf90) = 0x0
[P1:T1:python3] trace: ---- ioctl(8, TCGETS, 0xa33faff0) ...
[P1:T1:python3] trace: ---- return from ioctl(...) = -38
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_CUR) = 0x0
[P1:T1:python3] trace: ---- fcntl(8, OP 1030, 0) = 0x9
[P1:T1:python3] trace: ---- fcntl(9, F_GETFL, 0x802001) = 0x80000
[P1:T1:python3] trace: ---- newfstatat(9, "", 0xa33fa9e0, 4096) = 0x0
[P1:T1:python3] trace: ---- read(9, 0x93146530, 0x1000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x1000
[P1:T1:python3] trace: ---- close(9) = 0x0
[P1:T1:python3] trace: ---- lseek(8, 0x0, SEEK_SET) = 0x0
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- read(8, 0x93145d80, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- read(8, 0x93143d50, 0x2000) ...
[P1:T1:python3] trace: ---- return from read(...) = 0x2000
[P1:T1:python3] trace: ---- close(8) = 0x0
[P1:T1:python3] trace: ---- write(2, 0xcfdfbcc0, 0x389) ...
Traceback (most recent call last):
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 353, in <module>
    cli_main()
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 348, in cli_main
    process_fn(0, parsed_args)
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 325, in process_fn
    main(local_args)
  File "/root/ailab/THUMT/thumt/bin/translator.py", line 169, in main
    dist.init_process_group("gloo",
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 503, in init_process_group
    _update_default_pg(_new_process_group_helper(
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 588, in _new_process_group_helper
    pg = ProcessGroupGloo(
RuntimeError: [enforce fail at /pytorch/third_party/gloo/gloo/transport/tcp/device.cc:29] rv != -1. -1 vs -1. Address family not supported by protocol
[P1:T1:python3] trace: ---- return from write(...) = 0x389
[P1:T1:python3] trace: ---- rt_sigaction([SIGINT], 0xa33fc5f0, 0xa33fc690, 0x8) = 0x0
[P1:T1:python3] trace: ---- close(6) = 0x0
[P1:T23:python3] trace: ---- return from poll(...) = 0x1
[P1:T1:python3] trace: ---- close(5) = 0x0
[P1:T23:python3] trace: ---- recvfrom(7, 0x924c3d06, 0x1, 0, 0, 0) = 0x0
[P1:T1:python3] trace: ---- futex(0x924c4910, FUTEX_CLOCK_REALTIME|FUTEX_WAIT_BITSET, 23, 0, 0, -1) ...
[P1:T1:python3] warning: Ignoring FUTEX_CLOCK_REALTIME flag
[P1:T1:python3] warning: Non-private futexes are not supported, assuming implicit FUTEX_PRIVATE_FLAG
[P1:T1:python3] warning: [ai-debug] cmd = [9]

[P1:T23:python3] trace: ---- close(7) = 0x0
[P1:T23:python3] trace: ---- poll(0x8c000bf0, 2, -1) ...
[P1:T23:python3] trace: ---- return from poll(...) = 0x1
[P1:T23:python3] trace: ---- write(2, 0xceba57d8, 0x30) ...
terminate called after throwing an instance of '[P1:T23:python3] trace: ---- return from write(...) = 0x30
[P1:T23:python3] trace: ---- write(2, 0x8c000bf0, 0x11) ...
std::system_error[P1:T23:python3] trace: ---- return from write(...) = 0x11
[P1:T23:python3] trace: ---- write(2, 0xceba57c4, 0x2) ...
'
[P1:T23:python3] trace: ---- return from write(...) = 0x2
[P1:T23:python3] trace: ---- write(2, 0xceba57c7, 0xb) ...
  what():  [P1:T23:python3] trace: ---- return from write(...) = 0xb
[P1:T23:python3] trace: ---- write(2, 0x8c000f18, 0x5d) ...
Unexpected poll revent on the control pipe's reading fd: 24: Software caused connection abort[P1:T23:python3] trace: ---- return from write(...) = 0x5d
[P1:T23:python3] trace: ---- write(2, 0xd022f723, 0x1) ...

[P1:T23:python3] trace: ---- return from write(...) = 0x1
[P1:T23:python3] trace: ---- rt_sigprocmask(UNBLOCK, [SIGABRT,], NULL, 0x8) = 0x0
[P1:T23:python3] trace: ---- gettid() = 0x17
[P1:T23:python3] trace: ---- getpid() = 0x1
[P1:T23:python3] trace: ---- tgkill(1, 23, [SIGABRT]) = 0x0
[P1:T23:python3] debug: killed by signal 6
[P1:T1:python3] warning: [ai-debug] ret = [0]

boryspoplawski · 2022-10-12T15:57:17Z

Yes, getifaddrs uses netlink sockets and as such is not supported in Gramine
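
To check this outside of PyTorch, a minimal sketch like the one below calls `getifaddrs` directly through `ctypes`. It assumes a Linux/glibc environment where the C library symbols are reachable via `ctypes.CDLL(None)`. The helper name `probe_getifaddrs` is hypothetical and not part of Gramine or gloo.

```python
import ctypes
import os


def probe_getifaddrs():
    """Call libc getifaddrs() once and return (ok, errno_value)."""
    # CDLL(None) exposes the symbols already loaded into the process (libc on Linux).
    libc = ctypes.CDLL(None, use_errno=True)
    libc.getifaddrs.argtypes = [ctypes.POINTER(ctypes.c_void_p)]
    libc.getifaddrs.restype = ctypes.c_int
    libc.freeifaddrs.argtypes = [ctypes.c_void_p]
    libc.freeifaddrs.restype = None

    ifap = ctypes.c_void_p()
    rv = libc.getifaddrs(ctypes.byref(ifap))
    if rv == -1:
        # This is the same rv == -1 condition that gloo enforces against.
        return False, ctypes.get_errno()
    libc.freeifaddrs(ifap)
    return True, 0


if __name__ == "__main__":
    ok, err = probe_getifaddrs()
    print("getifaddrs ok" if ok else f"getifaddrs failed: {os.strerror(err)}")
```

If the explanation above holds, running this inside the enclave should report a failure, while on the plain host it should succeed.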

boryspoplawski · 2022-10-12T16:00:31Z

Also, please don't paste screenshots; use proper markdown (like you did below).

sampleyang · 2022-10-13T01:34:37Z

> Yes, getifaddrs uses netlink sockets and as such is not supported in Gramine

@boryspoplawski
Thanks. So will Gramine support this socket option, or will #966 solve this problem?
For PyTorch, I think distributed processing is an important and common feature. 'gloo' is a collective communications library and the backend for PyTorch in CPU mode. Using the PyTorch distributed feature with gloo as the backend calls the getifaddrs function during initialization, so this should be a common scenario.

boryspoplawski · 2022-10-13T01:44:08Z

This (netlink) is not a socket option; it's an entirely different socket type (and no, #966 is unrelated to this).
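
To illustrate the distinction: a netlink socket is selected through the address family passed to `socket()`, not through `setsockopt()`. The sketch below, assuming a Linux Python build that exposes `socket.AF_NETLINK`, tries to open one. The function name `netlink_supported` is made up for this example.

```python
import socket


def netlink_supported() -> bool:
    """Return True if an AF_NETLINK/NETLINK_ROUTE socket can be created."""
    family = getattr(socket, "AF_NETLINK", None)
    if family is None:
        # This Python build or platform has no netlink support at all.
        return False
    protocol = getattr(socket, "NETLINK_ROUTE", 0)
    try:
        sock = socket.socket(family, socket.SOCK_RAW, protocol)
    except OSError:
        # An unsupported address family is reported here.
        return False
    sock.close()
    return True


if __name__ == "__main__":
    print("AF_NETLINK available:", netlink_supported())
```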

Please note that, in the SGX threat model, using TCP on localhost without any additional encryption is insecure, because a malicious host can modify the packets. So using gloo might not be a good idea. I'm not familiar with it and have no idea what exactly it does with that TCP, but consider yourself warned.

dimakuv · 2022-10-13T06:50:10Z

@Simon-aikuier I agree with everything @boryspoplawski said. Gramine still has higher-priority TODOs than analyzing support for AF_NETLINK (and even if we look into this, we may consider it out of scope for Gramine for security reasons; no promises here).

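However, it feels like you can set up gloo to use TLS -- please check the envvar GLOO_DEVICE_TRANSPORT and these links that I googled and found relevant:

https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/run_glootls_test.sh

https://codebrowser.bddppq.com/pytorch/pytorch/torch/lib/c10d/GlooDeviceFactory.cpp.html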

I highly encourage you to try GLOO_DEVICE_TRANSPORT=TCP_TLS and see if this circumvents your Gramine problem with AF_NETLINK. It would also circumvent the problem of using unprotected TCP connections that Borys mentioned (but for production, you'll have to figure out how to create and distribute TLS keys and certs securely).
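
A rough sketch of that suggestion, in Python: set the variables before `dist.init_process_group("gloo", ...)` runs. Only `GLOO_DEVICE_TRANSPORT=TCP_TLS` comes from this thread. The three `GLOO_DEVICE_TRANSPORT_TCP_TLS_*` names are an assumption based on the linked test script and should be checked against your PyTorch version. The file paths and the helper `gloo_tls_env` are hypothetical, and the gloo build is assumed to include TLS support.

```python
import os
from typing import Dict


def gloo_tls_env(pkey: str, cert: str, ca_file: str) -> Dict[str, str]:
    """Build the environment variables that select gloo's TLS transport."""
    return {
        "GLOO_DEVICE_TRANSPORT": "TCP_TLS",
        # Assumed variable names; verify them for your PyTorch/gloo build.
        "GLOO_DEVICE_TRANSPORT_TCP_TLS_PKEY": pkey,
        "GLOO_DEVICE_TRANSPORT_TCP_TLS_CERT": cert,
        "GLOO_DEVICE_TRANSPORT_TCP_TLS_CA_FILE": ca_file,
    }


# Placeholder paths; apply this before the process group is initialized.
os.environ.update(
    gloo_tls_env("/path/to/pkey.key", "/path/to/cert.pem", "/path/to/ca.pem")
)
```

Inside Gramine, the same variables could presumably be passed through `loader.env` entries in the manifest, the way the template above already sets `LD_LIBRARY_PATH`. Whether the TLS transport actually avoids the `getifaddrs` call is not confirmed in this thread.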

sampleyang · 2022-10-13T09:23:24Z

> @Simon-aikuier I agree with everything @boryspoplawski said. Gramine still has higher-priority TODOs than analyzing support for AF_NETLINK (and even if we look into this, we may consider it out of scope for Gramine for security reasons; no promises here).
>
> However, it feels like you can set up gloo to use TLS -- please check the envvar GLOO_DEVICE_TRANSPORT and these links that I googled and found relevant:
>
> https://github.com/pytorch/pytorch/blob/master/.jenkins/pytorch/run_glootls_test.sh
>
> https://codebrowser.bddppq.com/pytorch/pytorch/torch/lib/c10d/GlooDeviceFactory.cpp.html
>
> I highly encourage you to try GLOO_DEVICE_TRANSPORT=TCP_TLS and see if this circumvents your Gramine problem with AF_NETLINK. It would also circumvent the problem of using unprotected TCP connections that Borys mentioned (but for production, you'll have to figure out how to create and distribute TLS keys and certs securely).

@dimakuv @boryspoplawski @llly
Thanks all for helping. I solved the netlink problem by modifying the model framework code and using a single node instead of distributed nodes; it works in Gramine now. I think this issue can be closed. Thanks.
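
The actual framework change is not shown in this thread. One way such a workaround could look, purely as an assumption, is a guard that skips process-group initialization when only one process takes part. The helper name `init_distributed_if_needed` and its `init_fn` parameter are invented for this sketch.

```python
from typing import Callable


def init_distributed_if_needed(world_size: int, init_fn: Callable[[], None]) -> bool:
    """Run init_fn only when more than one process participates.

    Returns True if the process group was initialized, False if it was skipped.
    """
    if world_size <= 1:
        # Single-process run: no gloo device is created, so getifaddrs is never hit.
        return False
    init_fn()
    return True


# Hypothetical usage with PyTorch (torch is deliberately not imported here):
#   init_distributed_if_needed(
#       world_size,
#       lambda: dist.init_process_group("gloo", rank=rank, world_size=world_size),
#   )
```

Any code that later calls collective operations would need a matching single-process path.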

llly mentioned this issue Oct 11, 2022

[LibOS] Allow and ignore MSG_MORE flag for TCP socket in sendto #966

Merged

dimakuv closed this as completed Oct 13, 2022

marchukv mentioned this issue May 8, 2023

Service Broker can't start in Intel SGX Enclave with Gramine moleculerjs/moleculer#1207

Open
