Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

c18n: Rework implementation to be interrupt-safe #2090

Merged
merged 3 commits into from
Jun 1, 2024
Merged

c18n: Rework implementation to be interrupt-safe #2090

merged 3 commits into from
Jun 1, 2024

Conversation

dpgao
Copy link
Contributor

@dpgao dpgao commented Apr 24, 2024

Note: This is a cleaned-up version of #2079 minus the bits about c18n statistics.

This PR completely refactors the trampoline and how stack switching works. The purecap and benchmark ABI implementations now both use a dedicated register to store the trusted stack (ddc and rddc respectively). This makes the trampolines look identical (modulo register names) on both ABIs. No metadata recording the current top of the stack is stored at the bottom of each compartment's stack. Instead, the stack lookup table now stores that information.

The signal handling mechanism has been rewritten to handle (rare) cases where c18n code, in particular trampolines, is interrupted. All c18n code paths that could be interrupted have been audited and it is believed that they can all be handled correctly, although testing for that is hard.

@jrtc27
Copy link
Member

jrtc27 commented May 7, 2024

Why is the second commit in here?

@jrtc27
Copy link
Member

jrtc27 commented May 7, 2024

For the first commit, 1384 insertions(+), 1259 deletions(-) is a big pile of diff to review in one chunk. Is there really nothing you can do to split it out into multiple commits that clearly explain what they're actually doing? I don't really know how anyone can be expected to review it by just staring at literally hundreds of lines of assembly diffs mixed in with other diffs all over the place.

@dpgao
Copy link
Contributor Author

dpgao commented May 7, 2024

Why is the second commit in here?

It is not related to the first commit but bundled here because it is a very minor change.

@jrtc27
Copy link
Member

jrtc27 commented May 7, 2024

For the first commit, 1384 insertions(+), 1259 deletions(-) is a big pile of diff to review in one chunk. Is there really nothing you can do to split it out into multiple commits that clearly explain what they're actually doing? I don't really know how anyone can be expected to review it by just staring at literally hundreds of lines of assembly diffs mixed in with other diffs all over the place.

(Generally speaking, commits should either be large and boringly repetitive, or small and complex, not both)

@jrtc27
Copy link
Member

jrtc27 commented May 7, 2024

Why is the second commit in here?

It is not related to the first commit but bundled here because it is a very minor change.

Well, don't? As far as I can tell it can be easily rebased onto dev, and if it's a good idea it could have gone in weeks ago with little review effort needed.

@dpgao
Copy link
Contributor Author

dpgao commented May 7, 2024

For the first commit, 1384 insertions(+), 1259 deletions(-) is a big pile of diff to review in one chunk. Is there really nothing you can do to split it out into multiple commits that clearly explain what they're actually doing? I don't really know how anyone can be expected to review it by just staring at literally hundreds of lines of assembly diffs mixed in with other diffs all over the place.

I'm afraid that this change is more of a rewrite of entire parts of the implementation than a collection of incremental improvements. The change in the stack look-up mechanism propagates all over the place, including

  • Stack unwinding
  • Signal handling and delivery
  • The trusted frame layout
  • Trampolines and miscellaneous assembly code
  • Allocation and de-allocation of compartment stacks
  • Compartment transition tracing (due to changes in trampolines and the trusted frame layout)

The unchanged parts are tramp_intern and policy enforcement. Everything in the 'back-end' has been affected. It is perhaps better to treat them as new code rather than a diff upon the original.

I have added extensive comments to the assembly code and signal handling code, which are a bit tricky. Please let me know if there's anything unclear.

@dpgao
Copy link
Contributor Author

dpgao commented May 7, 2024

Why is the second commit in here?

It is not related to the first commit but bundled here because it is a very minor change.

Well, don't? As far as I can tell it can be easily rebased onto dev, and if it's a good idea it could have gone in weeks ago with little review effort needed.

There is actually some non-essential dependency between this commit and the previous one (c18n_init2() did not exist before the refactor). Rebasing it on dev will just further enlarge the refactoring commit.

@bsdjhb
Copy link
Collaborator

bsdjhb commented May 8, 2024

I think the superpages commit can certainly wait for the refactoring to land. The only bit I think can maybe be easily extracted as a separate commit perhaps is the utrace refactoring.

libexec/rtld-elf/rtld_c18n.c Outdated Show resolved Hide resolved
libexec/rtld-elf/rtld_c18n.c Outdated Show resolved Hide resolved
libexec/rtld-elf/rtld_c18n.c Outdated Show resolved Hide resolved
libexec/rtld-elf/rtld_c18n.c Show resolved Hide resolved
@dpgao dpgao force-pushed the c18n-ng branch 2 times, most recently from 714b3d0 to 5f8abe8 Compare May 15, 2024 16:19
libexec/rtld-elf/aarch64/rtld_c18n_asm.S Outdated Show resolved Hide resolved
@gvnn3 gvnn3 requested a review from markjdb May 15, 2024 16:30
Copy link
Contributor

@gvnn3 gvnn3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple of nits.

libexec/rtld-elf/rtld_c18n.c Outdated Show resolved Hide resolved
libexec/rtld-elf/rtld_c18n.c Outdated Show resolved Hide resolved
libexec/rtld-elf/rtld_c18n.c Outdated Show resolved Hide resolved
/*
* Add 1 to offset due to capmode.
*/
PATCH_ADR(cookie_off, size + 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This deserves a better comment to explain why the + 1 is there even for capmode (and please call it C64; capmode is the RISC-V name). If I understand correctly, it's because you use LR == cookie to detect a tail call, and since the benchmark ABI only clears the LSB of LR when it returns you need the cookie to match that, even though that's not the actual address it would use for a return.

There's probably a better name for this than cookie, too, that conveys what it's for. Otherwise it sounds like it's just an arbitrary bit of identifying data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed it to landing to denote the address in the trampoline that the called function should 'land' back to.

libexec/rtld-elf/rtld_c18n.c Show resolved Hide resolved
libexec/rtld-elf/rtld_c18n.c Show resolved Hide resolved
libexec/rtld-elf/rtld_c18n.c Outdated Show resolved Hide resolved
libexec/rtld-elf/rtld_c18n.c Outdated Show resolved Hide resolved
libexec/rtld-elf/rtld_c18n.h Outdated Show resolved Hide resolved
libexec/rtld-elf/rtld_c18n.h Outdated Show resolved Hide resolved
libexec/rtld-elf/rtld_c18n.h Outdated Show resolved Hide resolved
@dpgao dpgao requested a review from jrtc27 May 21, 2024 21:15
.sig = (struct func_sig) {
.valid = true,
.reg_args = 3, .mem_args = false, .ret_args = NONE
for (int n = npagesizes - 1; n >= 0; --n)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My suggestion included curly braces for a reason... if the inner block uses curly braces so should the outer

_Pragma("clang diagnostic pop"); \
_tf; \
})
struct stk_table_sizes_data {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This name still doesn't make sense to me given what it is

@dpgao dpgao requested a review from jrtc27 May 21, 2024 22:49
size_t capacity;
struct tcb_wrapper *wrap;
struct stk_table_stk_info trusted_stk;
struct stk_table_stk_info data[];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Surely you can do better than "data" here?

struct stk_table_data {
void *stack;
void *reserved;
} data[];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto

@dpgao dpgao requested a review from jrtc27 May 25, 2024 14:44
Comment on lines 71 to 72
struct stk_table_stk_info trusted;
struct stk_table_stk_info compart[];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, just name them what they are and put something like stack in the name of the field. stk_table_metadata->compart is not accurate, that sounds like it's getting the compartment itself, not its stack.

Though I have to say I really do not understand why we have stacks here but also a stk_table_entry per compartment in stk_table itself. Which says to me that this needs some non-empty subset of better naming, comments and a different design.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed the names and added some comments in the struct definitions.

@brooksdavis brooksdavis linked an issue May 28, 2024 that may be closed by this pull request
@dpgao dpgao requested a review from jrtc27 May 29, 2024 16:18
Copy link
Member

@jrtc27 jrtc27 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit and one comment to think about for the future. I can't say I've extensively reviewed the whole design, but whilst it remains off-by-default I'm happy enough for it to land once the nit has been addressed.

libexec/rtld-elf/rtld_c18n.c Show resolved Hide resolved
@@ -832,8 +878,7 @@ _rtld_unw_getcontext_unsealed(uintptr_t ret, void **buf)
struct jmp_args { uintptr_t ret1; uintptr_t ret2; };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI this is specific to Morello's calling convention, so doesn't really belong in an MI file, but that's a pre-existing problem

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I'll sort this out while porting to RISC-V.

dpgao added 3 commits May 31, 2024 20:33
The trampoline and other parts of RTLD are refactored to be
interrupt-safe. The trusted frame is redesigned to allow trampolines to
perform tail-calls that do not push a trusted frame.

The new design also no longer relies on a region of metadata at the
bottom of each compartment's stack.
@dpgao dpgao merged commit 05c9a04 into dev Jun 1, 2024
27 checks passed
@dpgao dpgao deleted the c18n-ng branch June 1, 2024 09:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

c18n: Data corruption when trampolines are interrupted
4 participants