lkl: automatic syscall threads creation #97

ghost · 2016-02-26T15:38:20Z

Move the host thread creation in lkl_create_syscall_thread so that the
host thread is now properly synchronized with the main thread.

This avoid a potential deadlock where the new host thread will issue a
system call to create a new kernel syscall thread while the main host
has already issued a blocking system call which can only be unblock by
a system call issued from the new host thread, e.g.:

main thread:
write(pipe[1], ...) /* fill pipe /
create new thread
write(pipe[1], ...) / blocked due to full pipe */

new thread:
lkl_create_syscall_thread

/* we potentially dealock here because the main syscall thread is
  executing the write system call from main thread */

read(pipe[0])

lkl_stop_syscall_thread() has been removed while
lkl_create_syscall_thread() has been changed to receive a function and
argument to run in the new host thread. It now must be called from the
main thread.

Signed-off-by: Octavian Purdila octavian.purdila@intel.com

ghost · 2016-02-26T15:41:10Z

@pscollins @phhusson it implements what we discussed in #92, can you please take a look?

tommie-lie · 2016-02-27T09:26:54Z

I'm not too fond of the lkl_create_syscall_thread(void (*fn)(void*), void *arg) signatue. It will cause trouble for people that don't use pthread-like APIs directly but use some higher level abstraction like glib, boost::thread or (like I do) C++ std::thread. Calling a function at the beginning and end of a thread function is easy and should be fairly portable across tread abstractions. Creating threads through lkl_create_syscall_thread would force you to mix APIs and idioms or hack the abstraction.

My deadline just disappeared, so I will have time to have a more thorough look at this this weekend.

pscollins · 2016-02-28T01:02:57Z

@tommie-lie That makes sense.

I suppose the start_syscall_thread() interface might be easiest to use to hijack clone(2), where we could just do it as:

HOST_CALL(clone);
int clone(/* args */)
{
    start_syscall_thread();
    host_clone(/* args */);
}

What I'd like to do is keep the data stuff in a threadlocal variable so the lkl_host_ops->{get,set}_tls calls are no longer needed, and start up a new syscall thread on demand if a host thread tries to make a syscall and discovers that its __thread struct lkl_syscall_thread_data *data is NULL, so the process would be transparent to both hijacked and non-hijacked code.

I think it can be done nicely but I'm not totally certain how to deal with the syscall(__NR_syscall_thread_create) call that needs to be done in the original thread, rather than in the newly-created thread. I suppose we could have one thread that exists only to call syscall(__NR_syscall_thread_create) when asked to by newly-created threads, which I think is similar to what @phhusson suggested earlier up in this thread. There probably wouldn't be too much overhead as long as it slept on a condition variable until needed, or something like that. I'll try to put together an implementation this week sometime.

tommie-lie · 2016-02-28T19:20:03Z

Pardon my ignorance on the threading model of LKL, I spent this afternoon tracing the calls and finding my way through LKL's interrupts handlers, but there are probably still things I'm missing. So please correct me if I'm wrong in my musings below.

I don't really understand why it is necessary to explicitly create host-side threads to run interrupts in parallel.
As far as I understand the current syscall mechanism (in master, not in this PR), there is one IRQ line per host thread. lkl_syscall is executed directly in the host application's context, i.e. the thread that called the syscall, and puts a syscall struct into a queue, triggers an IRQ and waits for the syscall to complete.
In the meantime, the syscall thread dequeues the syscall and calls run_syscall which blocks as long as the syscall itself is still running (even if it waits for I/O). As there can be many different syscall threads (as of #86), the individual syscall handlers have to be reentrant (which is already a kernel feature).
The IRQ handler itself syscall_irq_handler just notifies the queue of a new item and returns immediately.

The only thing that really blocks for a longer period of time is the actuall call into the syscall table in run_syscall. The function itself can currently be called from different threads (the syscall threads) and blocks.
Why can't we simply spawn a new thread at syscalls.c:107 and pass s and the completion semaphore as arg to the thread_fn? The thread_fn would then consist of

run_syscall(s);
lkl_ops->sem_up(data->completion);

This would cause the while loop in syscalls.c:107 to immediately continue after the thread has been started. The syscall itself runs in a seperate thread. As the caller (on the host-side) has a down() on the completeion semaphore, the host is blocked as long as the syscall has not completed. Other host threads can call lkl_syscall again to add items to the wqh, trigger the interrupt and spawn yet another thread. As these actual syscall threads are bound to the host-side threads, there can be no more syscall threads than there are host threads. As they vanish as soon as they are completed, no resources are wasted. The syscall metadata (current data variable, syscall no. and parameters) is accessed only from lkl_syscall and the syscall demultiplexer (the function currently called syscall_thread) and is immediately passed to the thread. We should pthread_detach/CloseHandle the spawned threads or keep a global list of them to join them on on lkl_halt. We'd have to put a write lock on the wqh (or is it already thread-safe?) to serialize simultaneous calls to lkl_syscalls, but the duration of these mutexes would be very short in terms of execution time.

This would also get rid of the lkl_create_syscall_thread/lkl_stop_syscall_thread functions and some amount of other management infrastructure introduced by #86, i.e. reduce code size and complexity.

Is there a big issue I am not seeing here? (There probably is)

tommie-lie · 2016-02-28T19:38:45Z

This is imprecise language:

and puts a syscall struct into a queue, triggers an IRQ and waits for the syscall to complete.

There is no real queue data structure that the syscall is added to but a queue of (in our case only one) waiters that are woken up and that then process the syscall which has (almost) the same semantics. This can as easily be converted into a real queue of syscalls.

ghost · 2016-02-28T22:16:37Z

Hi @tommie-lie ,

Spawning new kernel threads works OK if we want to run all syscalls in the same process. However, if we want to support multiple processes then we do need long lived kernel threads / tasks. The cost of creating and terminating a thread needs to be taken in consideration as well.

I am slowly working on creating a uml implementation based on LKL and I want to keep supporting multiple processing.

Also, I don't think that spawning a new thread for each syscall is really required. I am working on a new PR based on @phhusson 's idea: reserve the default syscall thread to spawn new syscall thread and spawn a new syscall thread at the first syscall executed from a thread - we can use TLS to check that. I hope to have it ready in a couple of days.

ghost · 2016-02-29T01:27:58Z

Hi @tommie-lie , @pscollins , @phhusson ,

I have pushed a new patch based on @phhusson 's idea.

phhusson · 2016-02-29T11:02:44Z

Sounds good to me.
I'll check it works with my code

pscollins · 2016-02-29T19:23:42Z

@tommie-lie

The syscall metadata (current data variable, syscall no. and parameters) is accessed only from lkl_syscall and the syscall demultiplexer (the function currently called syscall_thread) and is immediately passed to the thread.

The data pointer is associated with the IRQ handler, so in order to make it from lkl_syscall in a host thread to run_syscall in a kernel thread, it needs to bounce through a lot of machinery and nothing else can overwrite it until run_syscall lets lkl_syscall finish and lkl_syscall returns (which is why there is one IRQ line per host thread, I think), so it's tricky to set things up in a way where you know that data is no longer being accessed by anyone else --- in your suggested implementation, wouldn't data->s be clobbered by the next host thread that makes a syscall?

This patch adds support for creating kernel syscall threads dynamically for each host thread that issues LKL system calls. This behavior is controled by the lkl_auto_syscall_threads kernel command line parameter and it is enabled by default. The application can choose to disable automatic kernel syscall threads creation and manually use lkl_create_syscall_thread / lkl_stop_sycall_thread to optimize the used resources. When automatic mode is used the default syscall kernel thread is reserved for creating new syscall threads to avoid the following deadlock scenario: the main thread is issusing a blocking system call that depends on another thread to unblock before the other thread is getting to issue the system call that creates the associated kernel system thread. Note that automatically created kernel system call threads are stopped only when the kernel is shutdown. If the application is using a large number of short lived host threads it should call lkl_stop_syscall_thread before the host thread is terminated. Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>

ghost · 2016-03-01T13:02:27Z

@pscollins @phhusson @tommie-lie new version (v3) which fixes the race condition @pscollins found. We now access the threads_list only from kernel context.

I have also removed the waitid calls since from commit v2 we don't create full processes but threads (see the CLONE_THREAD in kernel_thread call) and thus we don't need to wait for children before we exit the main syscall thread.

pscollins · 2016-03-05T00:11:18Z

@opurdila LGTM. I benchmarked this version with a netperf-like tool and it is very good, within an order of magnitude of host performance over /dev/lo.

lkl: automatic syscall threads creation

In pppoe_sendmsg(), reserving dev->hard_header_len bytes of headroom was probably fine before the introduction of ->needed_headroom in commit f5184d2 ("net: Allow netdevices to specify needed head/tailroom"). But now, virtual devices typically advertise the size of their overhead in dev->needed_headroom, so we must also take it into account in skb_reserve(). Allocation size of skb is also updated to take dev->needed_tailroom into account and replace the arbitrary 32 bytes with the real size of a PPPoE header. This issue was discovered by syzbot, who connected a pppoe socket to a gre device which had dev->header_ops->create == ipgre_header and dev->hard_header_len == 0. Therefore, PPPoE didn't reserve any headroom, and dev_hard_header() crashed when ipgre_header() tried to prepend its header to skb->data. skbuff: skb_under_panic: text:000000001d390b3a len:31 put:24 head:00000000d8ed776f data:000000008150e823 tail:0x7 end:0xc0 dev:gre0 ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:104! invalid opcode: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 1 PID: 3670 Comm: syzkaller801466 Not tainted 4.15.0-rc7-next-20180115+ lkl#97 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:skb_panic+0x162/0x1f0 net/core/skbuff.c:100 RSP: 0018:ffff8801d9bd7840 EFLAGS: 00010282 RAX: 0000000000000083 RBX: ffff8801d4f083c0 RCX: 0000000000000000 RDX: 0000000000000083 RSI: 1ffff1003b37ae92 RDI: ffffed003b37aefc RBP: ffff8801d9bd78a8 R08: 1ffff1003b37ae8a R09: 0000000000000000 R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff86200de0 R13: ffffffff84a981ad R14: 0000000000000018 R15: ffff8801d2d34180 FS: 00000000019c4880(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000208bc000 CR3: 00000001d9111001 CR4: 00000000001606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: skb_under_panic net/core/skbuff.c:114 [inline] skb_push+0xce/0xf0 net/core/skbuff.c:1714 ipgre_header+0x6d/0x4e0 net/ipv4/ip_gre.c:879 dev_hard_header include/linux/netdevice.h:2723 [inline] pppoe_sendmsg+0x58e/0x8b0 drivers/net/ppp/pppoe.c:890 sock_sendmsg_nosec net/socket.c:630 [inline] sock_sendmsg+0xca/0x110 net/socket.c:640 sock_write_iter+0x31a/0x5d0 net/socket.c:909 call_write_iter include/linux/fs.h:1775 [inline] do_iter_readv_writev+0x525/0x7f0 fs/read_write.c:653 do_iter_write+0x154/0x540 fs/read_write.c:932 vfs_writev+0x18a/0x340 fs/read_write.c:977 do_writev+0xfc/0x2a0 fs/read_write.c:1012 SYSC_writev fs/read_write.c:1085 [inline] SyS_writev+0x27/0x30 fs/read_write.c:1082 entry_SYSCALL_64_fastpath+0x29/0xa0 Admittedly PPPoE shouldn't be allowed to run on non Ethernet-like interfaces, but reserving space for ->needed_headroom is a more fundamental issue that needs to be addressed first. Same problem exists for __pppoe_xmit(), which also needs to take dev->needed_headroom into account in skb_cow_head(). Fixes: f5184d2 ("net: Allow netdevices to specify needed head/tailroom") Reported-by: syzbot+ed0838d0fa4c4f2b528e20286e6dc63effc7c14d@syzkaller.appspotmail.com Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Reviewed-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>

The kernel may be built with multiple LSMs, but only a subset may be enabled on the boot command line by specifying "lsm=". Not including "integrity" on the ordered LSM list may result in a NULL deref. As reported by Dmitry Vyukov: in qemu: qemu-system-x86_64 -enable-kvm -machine q35,nvdimm -cpu max,migratable=off -smp 4 -m 4G,slots=4,maxmem=16G -hda wheezy.img -kernel arch/x86/boot/bzImage -nographic -vga std -soundhw all -usb -usbdevice tablet -bt hci -bt device:keyboard -net user,host=10.0.2.10,hostfwd=tcp::10022-:22 -net nic,model=virtio-net-pci -object memory-backend-file,id=pmem1,share=off,mem-path=/dev/zero,size=64M -device nvdimm,id=nvdimm1,memdev=pmem1 -append "console=ttyS0 root=/dev/sda earlyprintk=serial rodata=n oops=panic panic_on_warn=1 panic=86400 lsm=smack numa=fake=2 nopcid dummy_hcd.num=8" -pidfile vm_pid -m 2G -cpu host But it crashes on NULL deref in integrity_inode_get during boot: Run /sbin/init as init process BUG: kernel NULL pointer dereference, address: 000000000000001c PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP KASAN CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.12.0-rc2+ lkl#97 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-44-g88ab0c15525c-prebuilt.qemu.org 04/01/2014 RIP: 0010:kmem_cache_alloc+0x2b/0x370 mm/slub.c:2920 Code: 57 41 56 41 55 41 54 41 89 f4 55 48 89 fd 53 48 83 ec 10 44 8b 3d d9 1f 90 0b 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 <8b> 5f 1c 4cf RSP: 0000:ffffc9000032f9d8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff888017fc4f00 RCX: 0000000000000000 RDX: ffff888040220000 RSI: 0000000000000c40 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: ffff888019263627 R10: ffffffff83937cd1 R11: 0000000000000000 R12: 0000000000000c40 R13: ffff888019263538 R14: 0000000000000000 R15: 0000000000ffffff FS: 0000000000000000(0000) GS:ffff88802d180000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000000001c CR3: 000000000b48e000 CR4: 0000000000750ee0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: integrity_inode_get+0x47/0x260 security/integrity/iint.c:105 process_measurement+0x33d/0x17e0 security/integrity/ima/ima_main.c:237 ima_bprm_check+0xde/0x210 security/integrity/ima/ima_main.c:474 security_bprm_check+0x7d/0xa0 security/security.c:845 search_binary_handler fs/exec.c:1708 [inline] exec_binprm fs/exec.c:1761 [inline] bprm_execve fs/exec.c:1830 [inline] bprm_execve+0x764/0x19a0 fs/exec.c:1792 kernel_execve+0x370/0x460 fs/exec.c:1973 try_to_run_init_process+0x14/0x4e init/main.c:1366 kernel_init+0x11d/0x1b8 init/main.c:1477 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294 Modules linked in: CR2: 000000000000001c ---[ end trace 22d601a500de7d79 ]--- Since LSMs and IMA may be configured at build time, but not enabled at run time, panic the system if "integrity" was not initialized before use. Reported-by: Dmitry Vyukov <dvyukov@google.com> Fixes: 79f7865 ("LSM: Introduce "lsm=" for boottime LSM selection") Cc: stable@vger.kernel.org Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>

ghost changed the title ~~lkl: change syscall thread API~~ lkl: automatic syscall threads creation Feb 29, 2016

ghost mentioned this pull request Mar 1, 2016

lkl: Fix double free when calling lkl_sys_halt(). #110

Closed

ghost pushed a commit that referenced this pull request Mar 5, 2016

Merge pull request #97 from opurdila/syscall_threads

520c1f2

lkl: automatic syscall threads creation

ghost merged commit 520c1f2 into lkl:master Mar 5, 2016

This pull request was closed.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

lkl: automatic syscall threads creation #97

lkl: automatic syscall threads creation #97

ghost commented Feb 26, 2016

ghost commented Feb 26, 2016

tommie-lie commented Feb 27, 2016

pscollins commented Feb 28, 2016

tommie-lie commented Feb 28, 2016

tommie-lie commented Feb 28, 2016

ghost commented Feb 28, 2016

ghost commented Feb 29, 2016

phhusson commented Feb 29, 2016

pscollins commented Feb 29, 2016

ghost commented Mar 1, 2016

pscollins commented Mar 5, 2016

lkl: automatic syscall threads creation #97

lkl: automatic syscall threads creation #97

Conversation

ghost commented Feb 26, 2016

ghost commented Feb 26, 2016

tommie-lie commented Feb 27, 2016

pscollins commented Feb 28, 2016

tommie-lie commented Feb 28, 2016

tommie-lie commented Feb 28, 2016

ghost commented Feb 28, 2016

ghost commented Feb 29, 2016

phhusson commented Feb 29, 2016

pscollins commented Feb 29, 2016

ghost commented Mar 1, 2016

pscollins commented Mar 5, 2016