-
Notifications
You must be signed in to change notification settings - Fork 203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PAL/Linux-SGX] Read exitless-OCALL result before resetting the stack #1955
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)
pal/src/host/linux-sgx/enclave_ocalls.c
line 143 at r1 (raw file):
long result = COPY_UNTRUSTED_VALUE(&req->result); sgx_reset_ustack(old_ustack);
Hmm, have you seen the comment at enclave_framework.c:69? How does it relate to this case?
https://github.com/gramineproject/gramine/blob/master/pal/src/host/linux-sgx/enclave_framework.c#L69-L89
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)
pal/src/host/linux-sgx/enclave_ocalls.c
line 143 at r1 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Hmm, have you seen the comment at enclave_framework.c:69? How does it relate to this case?
https://github.com/gramineproject/gramine/blob/master/pal/src/host/linux-sgx/enclave_framework.c#L69-L89
Fixed link:
gramine/pal/src/host/linux-sgx/enclave_framework.c
Lines 69 to 89 in ca715b9
/* | |
* When DEBUG is enabled, we run sgx_profile_sample() during asynchronous enclave exit (AEX), which | |
* uses the stack. Make sure to update URSP so that the AEX handler does not overwrite the part of | |
* the stack that we just allocated. | |
* | |
* (Recall that URSP is an outside stack pointer, saved by EENTER and restored on AEX by the SGX | |
* hardware itself.) | |
*/ | |
#ifdef DEBUG | |
#define UPDATE_USTACK(_ustack) \ | |
do { \ | |
SET_ENCLAVE_TCB(ustack, _ustack); \ | |
GET_ENCLAVE_TCB(gpr)->ursp = (uint64_t)_ustack; \ | |
} while(0) | |
#else | |
#define UPDATE_USTACK(_ustack) SET_ENCLAVE_TCB(ustack, _ustack) | |
#endif |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (2 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @mkow)
pal/src/host/linux-sgx/enclave_ocalls.c
line 143 at r1 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Fixed link:
gramine/pal/src/host/linux-sgx/enclave_framework.c
Lines 69 to 89 in ca715b9
/* * When DEBUG is enabled, we run sgx_profile_sample() during asynchronous enclave exit (AEX), which * uses the stack. Make sure to update URSP so that the AEX handler does not overwrite the part of * the stack that we just allocated. * * (Recall that URSP is an outside stack pointer, saved by EENTER and restored on AEX by the SGX * hardware itself.) */ #ifdef DEBUG #define UPDATE_USTACK(_ustack) \ do { \ SET_ENCLAVE_TCB(ustack, _ustack); \ GET_ENCLAVE_TCB(gpr)->ursp = (uint64_t)_ustack; \ } while(0) #else #define UPDATE_USTACK(_ustack) SET_ENCLAVE_TCB(ustack, _ustack) #endif
Yes, I've seen that comment. If you look at my other PR (#1944), I modify those lines in that PR.
To answer your question, that code indeed relates to this bug fix here. The linked code modifies the Untrusted RSP value (ursp) that is restored during the AEX flow. So the bug is like this (in debug mode, as the linked code is only currently relevant in debug mode, but my PR #1944 changes that):
sgx_exitless_ocall()
is done with its logic and performssgx_reset_ustack()
.- The
UPDATE_STACK()
macro modifiesursp
as part of resetting the stack. - The enclave thread is interrupted, and the AEX flow is triggered.
- The AEX flow restores
ursp
, which points to the top of the stack (because of reset on step 2). - The AEX landing pad pushes stuff on this stack, overwriting the
req
object from step 1. - Enclave thread is ERESUME'd and performs the last line in this func:
return COPY_UNTRUSTED_VALUE(&req->result)
. Unfortunately,req->result
was overwritten in step 5, and the thread reads some garbage.
Hope this clarifies the problem. You can also notice that to trigger this bug, you need to be really unlucky:
a. Needs Exitless feature enabled (in the manifest)
b. Needs Gramine to be built in DEBUG mode
c. For faster reproducibility, needs SGX profiling enabled (in the manifest), otherwise not enough "stack push overwrite" events
d. For faster reproducibility, needs a lot of AEXes happening
Note that our CI tests a) only minimally, and doesn't test c) at all I believe. At the same time, that's a truly rare event, I don't think we want a regression test for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from maintainers (1 more required), not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @mkow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @dimakuv)
pal/src/host/linux-sgx/enclave_ocalls.c
line 143 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
Yes, I've seen that comment. If you look at my other PR (#1944), I modify those lines in that PR.
To answer your question, that code indeed relates to this bug fix here. The linked code modifies the Untrusted RSP value (ursp) that is restored during the AEX flow. So the bug is like this (in debug mode, as the linked code is only currently relevant in debug mode, but my PR #1944 changes that):
sgx_exitless_ocall()
is done with its logic and performssgx_reset_ustack()
.- The
UPDATE_STACK()
macro modifiesursp
as part of resetting the stack.- The enclave thread is interrupted, and the AEX flow is triggered.
- The AEX flow restores
ursp
, which points to the top of the stack (because of reset on step 2).- The AEX landing pad pushes stuff on this stack, overwriting the
req
object from step 1.- Enclave thread is ERESUME'd and performs the last line in this func:
return COPY_UNTRUSTED_VALUE(&req->result)
. Unfortunately,req->result
was overwritten in step 5, and the thread reads some garbage.Hope this clarifies the problem. You can also notice that to trigger this bug, you need to be really unlucky:
a. Needs Exitless feature enabled (in the manifest)
b. Needs Gramine to be built in DEBUG mode
c. For faster reproducibility, needs SGX profiling enabled (in the manifest), otherwise not enough "stack push overwrite" events
d. For faster reproducibility, needs a lot of AEXes happeningNote that our CI tests a) only minimally, and doesn't test c) at all I believe. At the same time, that's a truly rare event, I don't think we want a regression test for this.
Yeah, I understand the race here, but I don't understand the comment I linked to. It seems to describe exactly this scenario which you're fixing in this PR, but saying that the codepath in DEBUG
handles it correctly. If that would be the case, then you wouldn't need a change here, just remove the ifdef
.
So, I'm not getting what scenario is that comment describing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 1 unresolved discussion, not enough approvals from different teams (1 more required, approved so far: Intel) (waiting on @mkow)
pal/src/host/linux-sgx/enclave_ocalls.c
line 143 at r1 (raw file):
Previously, mkow (Michał Kowalczyk) wrote…
Yeah, I understand the race here, but I don't understand the comment I linked to. It seems to describe exactly this scenario which you're fixing in this PR, but saying that the codepath in
DEBUG
handles it correctly. If that would be the case, then you wouldn't need a change here, just remove theifdef
.So, I'm not getting what scenario is that comment describing.
No, the linked comment describes a general fix, not the exact (buggy) scenario that I described.
The linked comment explains that there are two actors that use uRSP (untrusted stack of the thread): the AEX event and the OCALL sgx_alloc_on_ustack()
family of functions.
As a side note, currently in Gramine, only the DEBUG mode cares about the untrusted stack during the AEX events, because currently the RELEASE mode does not use the stack (it executes very simple assembly).
So, the linked comment explains that in DEBUG mode, the uRSP value must be modified in OCALL ustack-allocating functions, otherwise AEX events would overwrite the prepared-for-OCALL objects and arguments. That's all that the linked comment does.
What this comment and corresponding code don't account for is the case when AEX happens while OCALL objects/arguments are still used. That's just a bug in the code, and the comment talks about a related scenario, but doesn't cover this bug (and it shouldn't cover this bug, because well it's a bug).
Also note that we can trigger this bug in DEBUG mode, it's just that we need to enable Exitless, have a lot of AEXes, and need SGX profiling enabled, and run it for several hours in a loop.
Previously, there was a data race that the exitless-OCALL logic read the result of the OCALL from the stack *after* the code reset the stack. However, this stack is shared with the AEX flows. As of now, the AEX flows use the stack only in debug mode, so this data race flew under the radar for a long time. What can happen is that right-before reading the result of the exitless OCALL, the enclave thread is interrupted, an AEX logic is executed and modifies the values on the stack, then the exitless-OCALL logic is resumed, and the enclave thread reads an AEX-modified OCALL result value. Future commits (e.g., AEX-Notify) will introduce more AEX flows and expose this data race. So let's fix it now. Signed-off-by: Dmitrii Kuvaiskii <dmitrii.kuvaiskii@intel.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! all files reviewed, all discussions resolved
pal/src/host/linux-sgx/enclave_ocalls.c
line 143 at r1 (raw file):
Previously, dimakuv (Dmitrii Kuvaiskii) wrote…
No, the linked comment describes a general fix, not the exact (buggy) scenario that I described.
The linked comment explains that there are two actors that use uRSP (untrusted stack of the thread): the AEX event and the OCALL
sgx_alloc_on_ustack()
family of functions.As a side note, currently in Gramine, only the DEBUG mode cares about the untrusted stack during the AEX events, because currently the RELEASE mode does not use the stack (it executes very simple assembly).
So, the linked comment explains that in DEBUG mode, the uRSP value must be modified in OCALL ustack-allocating functions, otherwise AEX events would overwrite the prepared-for-OCALL objects and arguments. That's all that the linked comment does.
What this comment and corresponding code don't account for is the case when AEX happens while OCALL objects/arguments are still used. That's just a bug in the code, and the comment talks about a related scenario, but doesn't cover this bug (and it shouldn't cover this bug, because well it's a bug).
Also note that we can trigger this bug in DEBUG mode, it's just that we need to enable Exitless, have a lot of AEXes, and need SGX profiling enabled, and run it for several hours in a loop.
Ok, I see, thanks!
2ddaae0
to
79ce80f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! all files reviewed, all discussions resolved
Description of the changes
Previously, there was a data race that the exitless-OCALL logic read the result of the OCALL from the stack after the code reset the stack. However, this stack is shared with the AEX flows. As of now, the AEX flows use the stack only in debug mode, so this data race flew under the radar for a long time. What can happen is that right-before reading the result of the exitless OCALL, the enclave thread is interrupted, an AEX logic is executed and modifies the values on the stack, then the exitless-OCALL logic is resumed, and the enclave thread reads an AEX-modified OCALL result value.
Future commits (e.g., AEX-Notify) will introduce more AEX flows and expose this data race. So let's fix it now.
This was detected while working on #1857 and #1944. The same problem is exhibited also in AEX-Notify PR: #1531.
How to test this PR?
Run PAL and LibOS regression tests in a loop on e.g. PR #1944 (commenting out this fix first). They have Exitless tests, so they fail periodically.
This change is