Switch stack on SYSCALL instruction #808

nyh · 2016-11-03T10:05:45Z

When an application calls a function (including the C library's "system call" functions), it needs to make sure that there is a reasonable amount of stack available for this function to work. However, the x86 SYSCALL instruction is different: an application which calls it thinks it is a single X86 instruction, and it doesn't expect it to use the stack stack at all - and indeed in Linux it doesn't (the system call switches to kernel space, where the kernel's stack is used).

In our existing syscall implementation (see entry.S's syscall_entry), we realized the caller didn't prepare the stack for us, and for this reason we had to skip the 128-byte "red zone" (an area of the stack that the caller is allowed to use without moving the stack pointer) and to align the stack (a requirement when we call other functions). This is all great IF the application's stack is big enough. But it may not be...

Some applications have tiny stacks. Notably, the Go language (see #522) uses a separate tiny stack for each "go-routine", and only grows this stack when needed. But at the point a SYSCALL instruction is used, the remaining available stack can be small, and it will not further grow during the SYSCALL call. So the OSv code can overflow the application's stack and potentially write to unrelated objects in memory.

So instead of adjusting the caller's stack (the red-zone and alignment fixes we do in syscall_entry), we should switch to a completely different stack for the duration of the system call.

@benoit-canet reports crashes in golang tests, which may be caused by this issue, although we are not sure that they actually are. Go's tiny stacks do not have a stack guard in the Posix sense (an unmapped page to cause a segv when used, to clearly detect such stack overflows), but looking at the Go source code (with which I'm not familiar) it seems it does have its own notion of a stack guard, and code to check whether a canary value in this stack-guard area has been overwritten - so Go should have been able to detect such an overflow had it occurred, so if it wasn't detected maybe we have a different problem. Alternatively, it is possible we did have a stack overflow but it overwrote something belonging to a different thread, causing a crash before this thread got around to checking the stack canary.

nyh · 2016-11-08T12:11:05Z

The biggest problem in switching the stack is where to save the user's stack pointer, so we can restore it later. The issue is that we don't have any register to save it in, because the SYSCALL instruction is allowed to overwrite only the RAX, RCX and R11 registers, and at the point of the stack switch, we still need the value of all three (RAX holds the system call number, RCX holds the return address, and R11 the saved rflags).

One hack we could use to gain a register is to take the value of RAX, which is supposed to be a small integer, and save it in unused bits of RCX or R11. We could then (hopefully, I didn't work out the details yet) use just this single register to calculate the position of the system call stack (the per-thread system call stack can be stored in a TLS variable), and then save in the beginning of that stack the position of the previous (user) stack.

Linux actually does a somewhat similar thing ("similar" only in the sense it uses TLS), but without a hack for gaining a register: Its kernel/arch/x86/syscall.S starts with the following instructions:

    swapgs
    mov     %rsp, %gs:PERCPU_SAVED_USER_SP_OFFSET
    mov     %gs:PERCPU_KERNEL_SP_OFFSET, %rsp

Let's try to explain what this does: The Linux kernel uses %gs to store per-cpu variables in a similar way to how we use %fs to store thread-local variables; This code first switches to the kernel's %gs, and then uses various fixed offsets into the per-cpu storage area %gs to first save the old user stack pointer in one of these offsets, and then load the kernel stack from a second offset.

We can actually do exactly the same thing, using %fs (for per-thread, not per-cpu, storage): Note that unlike the Linux kernel which doesn't control the user's %gs (and therefore needs to swap it in the first "swapgs" instruction), OSv does control the application thread's %fs. This %fs always points to the thread's TCB ("thread control block") which contains, in negative offsets, the application and the kernel TLS areas, and in positive offset, the thread_control_block structure itself. We could put the syscall stack pointer and/or the saved user stack pointer in either of these (as a normal "__thread" variable ending up with a negative but unknown offset, or as a field in the thread_control_block and therefore a positive and known offset).

We could use two entries in the TLS/TCB as Linux does (for the syscall stack and the application stack) or just one (for the syscall stack - and save the application's stack pointer inside the syscall stack), I don't know which is better.

If we can use TLS and not TCB, it will be cleaner (it will be the same as all other "__thread" variables in the code - so far we did not put anything in the TCB itself), but it will not be trivial to do this in practice (it will require getting the correct offset using the linker).

HawxChen · 2017-07-28T04:13:27Z

Hi @nyh,
I prefer using only the TLS and leaving _tcb nothing. It might finish this work sooner because it follows the existing OSv's _tcb management. I am preparing to allocate two 4KB stacks, app stack and syscall stack, in TLS and then switching them before and after syscalls. About these two stacks' pointers, I will store them in the thread_state struct defined in arch/x64/arch-thread-state.hh so that it is possible to avoid offset calculation.

May I ask to you to explain the following more:

... but it will not be trivial to do this in practice (it will require getting the correct offset using the linker).

In arch/x86/loader.ld, Should I define two stacks, .tappstack and .tsyscallstack, as well as enlarge .tls_template_size?
What kinds of other essential definitions should I add in the x86's loader.ld?

Thank you in advance for the discussion.

nyh · 2017-07-30T14:09:50Z

@HawxChen I didn't think all the details through or test any of them, so please don't take everything I say as being scripture, but I'll try my best to make suggestions:

I don't think the loader.ld is relevant here. When I mentioned "the linker" I meant something else. OSv (like any other application) has many thread-local variables. Each of the object files (*.o) may have one or several of these variables. The linker takes all of these variables and puts them next to each other in one segment of the resulting executable file, and modifies the executable to know at which offset each of the variables were saved. So the question becomes, say that we add a new thread-local variable "syscall_stack". When C code uses "syscall_stack", the compiler and linker cooperate so that if this variable is stored in position 112 of the TLS area, the generated code will look at %fs - 112.

So now the question becomes, how do we do the same in assembly code. You can't hard code %fs - 112 for any constant number "112", because this number only becomes a constant during compilation. The question is how to figure it out. I think what you need to do is exactly the same thing that the C compiler does when it generates references to syscall_stack - you need to cooperate with the linker to ensure that the right offset is found and inserted into the code. You also need to do it in a way that avoids run-time calculation (which we don't have registers to do!). The more I think about this, I think this is doable, but it will not be easy. You'll probably need to read https://www.akkadia.org/drepper/tls.pdf on how TLS works, although not all the details there will be relevant - If I remember correctly, the TLS model which will be relevant here is the "local exec" model (where OSv accesses its own TLS variables).

The reason I suggested to put something in _tcb, is that the area there (positive offsets into %fs) is a structure explicitly under our control and not reorganized by the linker, so we don't need to know any super-complicated TLS techniques to know that we put the syscall stack in offset 16 in this structure, so that's where it is. But you're write that the TLS approach - if we get it to work - will look much nicer than the tcb approach, because it's kind of ugly to add stuff to the TCB instead of making them ordinary thread-local variables. However, if we do give up and decide to put it in the TCB, we wouldn't be the first do such a thing (putting stuff in TCB) - glibc also puts stuff in the TCB instead of ordinary TLS - as noted in issue #589.

HawxChen · 2017-08-01T03:21:15Z

@nyh, I appreciate your suggestions.

I just submitted the patch to put something in _tcb as the solution. Although it is not nicer than the other solution, which uses only TLS, you provided and I tried to develope originally, at least, in this patch verified by "make check" successfully, I tried to minimize the size of variables used in thread_control_block and provided equivalent function. Although it is not the most elegant solution, I hope that it is acceptable.

Thank you for your advices.

@HawxChen

This patch implements separate syscall call stack needed when runtimes like Golang use SYSCALL instruction to execute system calls. More specifically in case of Golang tiny stacks used by coroutines are not deep enough to execute system calls on OSv which handles them as regular function calls. In case of OS like Linux tiny Golang stacks are not a problem as the system calls are executed on separate kernel stack. More specifically each application thread pre-alllocates "tiny" (1024-bytes deep) syscall call stack. When SYSCALL instruction is called first time for given thread the call stack is switched to the tiny syscall stack in order to setup the "large" stack and switch to. Eventually execution of syscall continues on large syscall stack. From this point, second and subsequent executions of SYSCALL instruction are handled on large syscall stack. Please note that because the stack is switched from original caller stack to the syscall stack that is allocated lazily the stack pointers "jumps" to higher addresses. From gdb perspective this results in "Backtrace stopped: previous frame inner to this frame (corrupt stack?)" message. Gdb assumes that stack pointer should decrease in value as the stack "grows" which is not the case when stack is switched to syscall stack. This patch is based on the original patch provided by @HawxChen. Fixes cloudius-systems#808 Message-Id: <1521168974-17372-1-git-send-email-jwkozaczuk@gmail.com>

This was referenced Feb 6, 2017

apps/golang-example hangs after printing Go version #850

Closed

sigaltstack() and SA_ONSTACK are stubbed #809

Closed

wkozaczuk added the patch submitted label May 3, 2018

nyh mentioned this issue May 15, 2018

Idea for alternative implementation of syscall stack #972

Open

nyh closed this as completed in 3f2ca0c May 16, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Switch stack on SYSCALL instruction #808

Switch stack on SYSCALL instruction #808

nyh commented Nov 3, 2016

nyh commented Nov 8, 2016 •

edited

Loading

HawxChen commented Jul 28, 2017

nyh commented Jul 30, 2017

HawxChen commented Aug 1, 2017 •

edited

Loading

Switch stack on SYSCALL instruction #808

Switch stack on SYSCALL instruction #808

Comments

nyh commented Nov 3, 2016

nyh commented Nov 8, 2016 • edited Loading

HawxChen commented Jul 28, 2017

nyh commented Jul 30, 2017

HawxChen commented Aug 1, 2017 • edited Loading

nyh commented Nov 8, 2016 •

edited

Loading

HawxChen commented Aug 1, 2017 •

edited

Loading