AI models base pytorch run failed #948
Comments
Based on this, looks like you have a huge checkpoint (that is prepared by the parent enclave and then sent to the child enclave). Have you tried increasing |
@dimakuv no, these are normal numbers. Try running |
@Simon-aikuier we will need more details on how to reproduce the problem. Alternatively you can try debugging it or at least bisect at which commit you start to see the issue (since I see you build from source anyway). |
With 32G the same problem occurs.
Commit: 00e91a0. I tried enabling trace logging, but only the parent enclave prints trace info; the child enclave does not (only a little debug info). I also tried Gramine v1.1: it gets past this point, but then hits another problem where the program hangs on a futex.
@Simon-aikuier the 2nd log is definitely neither from 00e91a0 nor v1.3. |
The 2nd log is from Gramine v1.1. I tried it on different Gramine versions.
@boryspoplawski => New Error: => Trace Info: [P1:T23:python3] trace: ---- return from write(...) = 0x1 |
I'm trying to fix this issue in #936
@llly thanks.
And it seems that my model app never reaches its main function; it is killed while python3 is loading (when the LibOS loads the related libraries). Hope that helps.
Gramine does not support netlink sockets, so if it's a hard requirement of your app, then it won't work. But I don't see why PyTorch would need those.
This error is weird, but it might be the original issue here. Unfortunately it's hard to say anything more without any details. You can try debugging it yourself (gdb would be handy); for that I would recommend a |
PyTorch uses netlink to get the local IP address. It's not a hard requirement.
It's caused by |
Could you elaborate more? |
@llly @boryspoplawski
This Python call stack error is also errno 97, and the relevant PyTorch code is in gloo/gloo/transport/tcp/device.cc
I just tested the function "getifaddrs" with and without Gramine, and it may be that neither AF_NETLINK nor AF_PACKET is supported in Gramine.
Running the above code in Gramine reproduces the same error:
My application trace info:
|
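For readers who want to check which address families are available, here is a minimal Python sketch. It is not the original test program from this thread, which is not preserved here. It assumes that trying to create a socket of each family is a reasonable stand-in for what `getifaddrs` needs. The helper names `probe_family` and `probe_all` are made up for illustration. On a plain Linux host, `AF_PACKET` typically reports `EPERM` without extra privileges. An `EAFNOSUPPORT` result (errno 97 on Linux) would match the error discussed above.

```python
import errno
import socket


def probe_family(name):
    """Try to open a datagram socket of the given family; return a short status string."""
    family = getattr(socket, name, None)
    if family is None:
        # The constant does not exist on this platform (e.g. non-Linux).
        return "not defined on this platform"
    try:
        sock = socket.socket(family, socket.SOCK_DGRAM, 0)
    except OSError as exc:
        # Report the symbolic errno name, e.g. EAFNOSUPPORT or EPERM.
        return errno.errorcode.get(exc.errno, repr(exc.errno))
    sock.close()
    return "ok"


def probe_all():
    """Probe the families mentioned in this thread, plus AF_INET as a baseline."""
    return {name: probe_family(name) for name in ("AF_INET", "AF_NETLINK", "AF_PACKET")}


if __name__ == "__main__":
    for family_name, status in probe_all().items():
        print(f"{family_name}: {status}")
```

Running the script once natively and once under Gramine, then comparing the two outputs, should show whether the difference lies in socket creation itself.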
Yes, |
Also, please don't paste screenshots, use proper markdown (like you did below). |
@boryspoplawski |
This (netlink) is not a socket option; it's an entirely different socket type (no, #966 is unrelated to this). Please note that using TCP on localhost is insecure in the SGX threat model (because a malicious host can modify these packets) without any additional encryption, so using |
@Simon-aikuier I agree with everything @boryspoplawski said. Gramine still has higher-priority TODOs than analyzing the support for However, it feels like you can set up
I highly encourage to try |
@dimakuv @boryspoplawski @llly |
Description of the problem
I have a model for a machine translation system; it uses torch.distributed.init_process_group to fork some processes. It reports an error when I run it on Gramine v1.3.
Debug info
The 2nd enclave fails to start. The trace output is as follows:
[P1:T1:python3] debug: allocating checkpoint store (size = 67108864, reserve = 33554432)
[P1:T1:python3] error: Internal memory fault at 0x00000000 (0x3fa02ae44, VMID = 1, TID = 1)
debug: PalProcessExit: Returning exit code 1
warning: PalVirtualMemoryProtect is unimplemented in Linux-SGX PAL
debug: Gramine was built from commit: 00e91a0
debug: Host: Linux-SGX
debug: LibOS xsave_enabled 1, xsave_size 0xa80(2688), xsave_features 0xe7
debug: Initial VMA region 0x3fa000000-0x3fa1a3000 (LibOS) bookkeeped
debug: Initial VMA region 0x3ffffc000-0x400000000 (manifest) bookkeeped
debug: ASLR top address adjusted to 0x136997000
debug: host is Linux-SGX and remote attestation type is 'dcap', adding SGX-specific /dev/attestation/ files: report, quote, etc.
debug: LibOS loaded at 0x3fa000000, ready to initialize
error: libos_init: failed to read the whole checkpoint header: -61
debug: PalProcessExit: Returning exit code 1
Run application failed: run cmd error, exit status 1
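As a reading aid for the log above, the following Python sketch converts the checkpoint sizes to MiB and names the numeric error codes seen in this thread. It assumes that `-61` in the `libos_init` message and `97` in the Python call stack are Linux errno values (negated in the first case). The small lookup table and the helper names are illustrative only and are not part of Gramine.

```python
import re

# Linux errno values that appear in this thread (assumed to be what the logs report).
LINUX_ERRNO = {61: "ENODATA", 97: "EAFNOSUPPORT"}

CHECKPOINT_RE = re.compile(r"size = (\d+), reserve = (\d+)")


def to_mib(n_bytes):
    """Convert a byte count to mebibytes."""
    return n_bytes / (1024 * 1024)


def describe_checkpoint(line):
    """Extract checkpoint store size and reserve (in MiB) from a Gramine debug line."""
    match = CHECKPOINT_RE.search(line)
    if match is None:
        return None
    size, reserve = (int(group) for group in match.groups())
    return {"size_mib": to_mib(size), "reserve_mib": to_mib(reserve)}


def errno_name(code):
    """Map a (possibly negated) errno value to its symbolic name, if known here."""
    return LINUX_ERRNO.get(abs(code), "unknown")


line = "[P1:T1:python3] debug: allocating checkpoint store (size = 67108864, reserve = 33554432)"
print(describe_checkpoint(line))  # {'size_mib': 64.0, 'reserve_mib': 32.0}
print(errno_name(-61))            # ENODATA
print(errno_name(97))             # EAFNOSUPPORT
```

Under these assumptions, the checkpoint store is 64 MiB with a 32 MiB reserve. `ENODATA` in the child would be consistent with the parent exiting, after its internal memory fault, before it sent the checkpoint header.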
My template
loader.entrypoint = "file:{{ gramine.libos }}"
loader.log_level = "all"
loader.env.LD_LIBRARY_PATH = "/lib:/lib/x86_64-linux-gnu:/usr/lib/x86_64-linux-gnu"
loader.env.PATH = "/usr/bin"
loader.insecure__use_cmdline_argv = true
fs.root.type = "chroot"
fs.root.path = "/"
fs.root.uri = "file:/"
fs.mounts = [
{ path = "/lib", uri = "file:{{ gramine.runtimedir() }}" },
]
sgx.nonpie_binary = true
sgx.enclave_size = "16G"
sgx.thread_num = 512
sgx.remote_attestation = "dcap"
sgx.trusted_files = [
"file:{{ gramine.runtimedir() }}/",
]
sgx.allowed_files = [
"file:/",
]