Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

lkl: automatic syscall threads creation #97

Merged
1 commit merged into from Mar 5, 2016
Merged

lkl: automatic syscall threads creation #97

1 commit merged into from Mar 5, 2016

Conversation

ghost
Copy link

@ghost ghost commented Feb 26, 2016

Move the host thread creation in lkl_create_syscall_thread so that the
host thread is now properly synchronized with the main thread.

This avoid a potential deadlock where the new host thread will issue a
system call to create a new kernel syscall thread while the main host
has already issued a blocking system call which can only be unblock by
a system call issued from the new host thread, e.g.:

main thread:
write(pipe[1], ...) /* fill pipe /
create new thread
write(pipe[1], ...) /
blocked due to full pipe */

new thread:
lkl_create_syscall_thread

/* we potentially dealock here because the main syscall thread is
  executing the write system call from main thread */

read(pipe[0])

lkl_stop_syscall_thread() has been removed while
lkl_create_syscall_thread() has been changed to receive a function and
argument to run in the new host thread. It now must be called from the
main thread.

Signed-off-by: Octavian Purdila octavian.purdila@intel.com

Review on Reviewable

@ghost
Copy link
Author

ghost commented Feb 26, 2016

@pscollins @phhusson it implements what we discussed in #92, can you please take a look?

@tommie-lie
Copy link

I'm not too fond of the lkl_create_syscall_thread(void (*fn)(void*), void *arg) signatue. It will cause trouble for people that don't use pthread-like APIs directly but use some higher level abstraction like glib, boost::thread or (like I do) C++ std::thread. Calling a function at the beginning and end of a thread function is easy and should be fairly portable across tread abstractions. Creating threads through lkl_create_syscall_thread would force you to mix APIs and idioms or hack the abstraction.

My deadline just disappeared, so I will have time to have a more thorough look at this this weekend.

@pscollins
Copy link
Member

@tommie-lie That makes sense.

I suppose the start_syscall_thread() interface might be easiest to use to hijack clone(2), where we could just do it as:

HOST_CALL(clone);
int clone(/* args */)
{
    start_syscall_thread();
    host_clone(/* args */);
}

What I'd like to do is keep the data stuff in a threadlocal variable so the lkl_host_ops->{get,set}_tls calls are no longer needed, and start up a new syscall thread on demand if a host thread tries to make a syscall and discovers that its __thread struct lkl_syscall_thread_data *data is NULL, so the process would be transparent to both hijacked and non-hijacked code.

I think it can be done nicely but I'm not totally certain how to deal with the syscall(__NR_syscall_thread_create) call that needs to be done in the original thread, rather than in the newly-created thread. I suppose we could have one thread that exists only to call syscall(__NR_syscall_thread_create) when asked to by newly-created threads, which I think is similar to what @phhusson suggested earlier up in this thread. There probably wouldn't be too much overhead as long as it slept on a condition variable until needed, or something like that. I'll try to put together an implementation this week sometime.

@tommie-lie
Copy link

Pardon my ignorance on the threading model of LKL, I spent this afternoon tracing the calls and finding my way through LKL's interrupts handlers, but there are probably still things I'm missing. So please correct me if I'm wrong in my musings below.

I don't really understand why it is necessary to explicitly create host-side threads to run interrupts in parallel.
As far as I understand the current syscall mechanism (in master, not in this PR), there is one IRQ line per host thread. lkl_syscall is executed directly in the host application's context, i.e. the thread that called the syscall, and puts a syscall struct into a queue, triggers an IRQ and waits for the syscall to complete.
In the meantime, the syscall thread dequeues the syscall and calls run_syscall which blocks as long as the syscall itself is still running (even if it waits for I/O). As there can be many different syscall threads (as of #86), the individual syscall handlers have to be reentrant (which is already a kernel feature).
The IRQ handler itself syscall_irq_handler just notifies the queue of a new item and returns immediately.

The only thing that really blocks for a longer period of time is the actuall call into the syscall table in run_syscall. The function itself can currently be called from different threads (the syscall threads) and blocks.
Why can't we simply spawn a new thread at syscalls.c:107 and pass s and the completion semaphore as arg to the thread_fn? The thread_fn would then consist of

run_syscall(s);
lkl_ops->sem_up(data->completion);

This would cause the while loop in syscalls.c:107 to immediately continue after the thread has been started. The syscall itself runs in a seperate thread. As the caller (on the host-side) has a down() on the completeion semaphore, the host is blocked as long as the syscall has not completed. Other host threads can call lkl_syscall again to add items to the wqh, trigger the interrupt and spawn yet another thread. As these actual syscall threads are bound to the host-side threads, there can be no more syscall threads than there are host threads. As they vanish as soon as they are completed, no resources are wasted. The syscall metadata (current data variable, syscall no. and parameters) is accessed only from lkl_syscall and the syscall demultiplexer (the function currently called syscall_thread) and is immediately passed to the thread. We should pthread_detach/CloseHandle the spawned threads or keep a global list of them to join them on on lkl_halt. We'd have to put a write lock on the wqh (or is it already thread-safe?) to serialize simultaneous calls to lkl_syscalls, but the duration of these mutexes would be very short in terms of execution time.

This would also get rid of the lkl_create_syscall_thread/lkl_stop_syscall_thread functions and some amount of other management infrastructure introduced by #86, i.e. reduce code size and complexity.

Is there a big issue I am not seeing here? (There probably is)

@tommie-lie
Copy link

This is imprecise language:

and puts a syscall struct into a queue, triggers an IRQ and waits for the syscall to complete.

There is no real queue data structure that the syscall is added to but a queue of (in our case only one) waiters that are woken up and that then process the syscall which has (almost) the same semantics. This can as easily be converted into a real queue of syscalls.

@ghost
Copy link
Author

ghost commented Feb 28, 2016

Hi @tommie-lie ,

Spawning new kernel threads works OK if we want to run all syscalls in the same process. However, if we want to support multiple processes then we do need long lived kernel threads / tasks. The cost of creating and terminating a thread needs to be taken in consideration as well.

I am slowly working on creating a uml implementation based on LKL and I want to keep supporting multiple processing.

Also, I don't think that spawning a new thread for each syscall is really required. I am working on a new PR based on @phhusson 's idea: reserve the default syscall thread to spawn new syscall thread and spawn a new syscall thread at the first syscall executed from a thread - we can use TLS to check that. I hope to have it ready in a couple of days.

@ghost ghost changed the title lkl: change syscall thread API lkl: automatic syscall threads creation Feb 29, 2016
@ghost
Copy link
Author

ghost commented Feb 29, 2016

Hi @tommie-lie , @pscollins , @phhusson ,

I have pushed a new patch based on @phhusson 's idea.

@phhusson
Copy link

Sounds good to me.
I'll check it works with my code

@pscollins
Copy link
Member

@tommie-lie

The syscall metadata (current data variable, syscall no. and parameters) is accessed only from lkl_syscall and the syscall demultiplexer (the function currently called syscall_thread) and is immediately passed to the thread.

The data pointer is associated with the IRQ handler, so in order to make it from lkl_syscall in a host thread to run_syscall in a kernel thread, it needs to bounce through a lot of machinery and nothing else can overwrite it until run_syscall lets lkl_syscall finish and lkl_syscall returns (which is why there is one IRQ line per host thread, I think), so it's tricky to set things up in a way where you know that data is no longer being accessed by anyone else --- in your suggested implementation, wouldn't data->s be clobbered by the next host thread that makes a syscall?

This patch adds support for creating kernel syscall threads dynamically
for each host thread that issues LKL system calls. This behavior is
controled by the lkl_auto_syscall_threads kernel command line parameter
and it is enabled by default.

The application can choose to disable automatic kernel syscall threads
creation and manually use lkl_create_syscall_thread /
lkl_stop_sycall_thread to optimize the used resources.

When automatic mode is used the default syscall kernel thread is
reserved for creating new syscall threads to avoid the following
deadlock scenario: the main thread is issusing a blocking system call
that depends on another thread to unblock before the other thread is
getting to issue the system call that creates the associated kernel
system thread.

Note that automatically created kernel system call threads are stopped
only when the kernel is shutdown. If the application is using a large
number of short lived host threads it should call
lkl_stop_syscall_thread before the host thread is terminated.

Signed-off-by: Octavian Purdila <octavian.purdila@intel.com>
@ghost
Copy link
Author

ghost commented Mar 1, 2016

@pscollins @phhusson @tommie-lie new version (v3) which fixes the race condition @pscollins found. We now access the threads_list only from kernel context.

I have also removed the waitid calls since from commit v2 we don't create full processes but threads (see the CLONE_THREAD in kernel_thread call) and thus we don't need to wait for children before we exit the main syscall thread.

@pscollins
Copy link
Member

@opurdila LGTM. I benchmarked this version with a netperf-like tool and it is very good, within an order of magnitude of host performance over /dev/lo.

ghost pushed a commit that referenced this pull request Mar 5, 2016
lkl: automatic syscall threads creation
@ghost ghost merged commit 520c1f2 into lkl:master Mar 5, 2016
thehajime pushed a commit to libos-nuse/lkl-linux that referenced this pull request Mar 19, 2018
In pppoe_sendmsg(), reserving dev->hard_header_len bytes of headroom
was probably fine before the introduction of ->needed_headroom in
commit f5184d2 ("net: Allow netdevices to specify needed head/tailroom").

But now, virtual devices typically advertise the size of their overhead
in dev->needed_headroom, so we must also take it into account in
skb_reserve().
Allocation size of skb is also updated to take dev->needed_tailroom
into account and replace the arbitrary 32 bytes with the real size of
a PPPoE header.

This issue was discovered by syzbot, who connected a pppoe socket to a
gre device which had dev->header_ops->create == ipgre_header and
dev->hard_header_len == 0. Therefore, PPPoE didn't reserve any
headroom, and dev_hard_header() crashed when ipgre_header() tried to
prepend its header to skb->data.

skbuff: skb_under_panic: text:000000001d390b3a len:31 put:24
head:00000000d8ed776f data:000000008150e823 tail:0x7 end:0xc0 dev:gre0
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:104!
invalid opcode: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
    (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 3670 Comm: syzkaller801466 Not tainted
4.15.0-rc7-next-20180115+ lkl#97
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:skb_panic+0x162/0x1f0 net/core/skbuff.c:100
RSP: 0018:ffff8801d9bd7840 EFLAGS: 00010282
RAX: 0000000000000083 RBX: ffff8801d4f083c0 RCX: 0000000000000000
RDX: 0000000000000083 RSI: 1ffff1003b37ae92 RDI: ffffed003b37aefc
RBP: ffff8801d9bd78a8 R08: 1ffff1003b37ae8a R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: ffffffff86200de0
R13: ffffffff84a981ad R14: 0000000000000018 R15: ffff8801d2d34180
FS:  00000000019c4880(0000) GS:ffff8801db300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000208bc000 CR3: 00000001d9111001 CR4: 00000000001606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  skb_under_panic net/core/skbuff.c:114 [inline]
  skb_push+0xce/0xf0 net/core/skbuff.c:1714
  ipgre_header+0x6d/0x4e0 net/ipv4/ip_gre.c:879
  dev_hard_header include/linux/netdevice.h:2723 [inline]
  pppoe_sendmsg+0x58e/0x8b0 drivers/net/ppp/pppoe.c:890
  sock_sendmsg_nosec net/socket.c:630 [inline]
  sock_sendmsg+0xca/0x110 net/socket.c:640
  sock_write_iter+0x31a/0x5d0 net/socket.c:909
  call_write_iter include/linux/fs.h:1775 [inline]
  do_iter_readv_writev+0x525/0x7f0 fs/read_write.c:653
  do_iter_write+0x154/0x540 fs/read_write.c:932
  vfs_writev+0x18a/0x340 fs/read_write.c:977
  do_writev+0xfc/0x2a0 fs/read_write.c:1012
  SYSC_writev fs/read_write.c:1085 [inline]
  SyS_writev+0x27/0x30 fs/read_write.c:1082
  entry_SYSCALL_64_fastpath+0x29/0xa0

Admittedly PPPoE shouldn't be allowed to run on non Ethernet-like
interfaces, but reserving space for ->needed_headroom is a more
fundamental issue that needs to be addressed first.

Same problem exists for __pppoe_xmit(), which also needs to take
dev->needed_headroom into account in skb_cow_head().

Fixes: f5184d2 ("net: Allow netdevices to specify needed head/tailroom")
Reported-by: syzbot+ed0838d0fa4c4f2b528e20286e6dc63effc7c14d@syzkaller.appspotmail.com
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
thehajime pushed a commit to thehajime/linux that referenced this pull request Aug 4, 2022
The kernel may be built with multiple LSMs, but only a subset may be
enabled on the boot command line by specifying "lsm=".  Not including
"integrity" on the ordered LSM list may result in a NULL deref.

As reported by Dmitry Vyukov:
in qemu:
qemu-system-x86_64       -enable-kvm     -machine q35,nvdimm -cpu
max,migratable=off -smp 4       -m 4G,slots=4,maxmem=16G        -hda
wheezy.img      -kernel arch/x86/boot/bzImage   -nographic -vga std
 -soundhw all     -usb -usbdevice tablet  -bt hci -bt device:keyboard
   -net user,host=10.0.2.10,hostfwd=tcp::10022-:22 -net
nic,model=virtio-net-pci   -object
memory-backend-file,id=pmem1,share=off,mem-path=/dev/zero,size=64M
  -device nvdimm,id=nvdimm1,memdev=pmem1  -append "console=ttyS0
root=/dev/sda earlyprintk=serial rodata=n oops=panic panic_on_warn=1
panic=86400 lsm=smack numa=fake=2 nopcid dummy_hcd.num=8"   -pidfile
vm_pid -m 2G -cpu host

But it crashes on NULL deref in integrity_inode_get during boot:

Run /sbin/init as init process
BUG: kernel NULL pointer dereference, address: 000000000000001c
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP KASAN
CPU: 3 PID: 1 Comm: swapper/0 Not tainted 5.12.0-rc2+ lkl#97
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
rel-1.13.0-44-g88ab0c15525c-prebuilt.qemu.org 04/01/2014
RIP: 0010:kmem_cache_alloc+0x2b/0x370 mm/slub.c:2920
Code: 57 41 56 41 55 41 54 41 89 f4 55 48 89 fd 53 48 83 ec 10 44 8b
3d d9 1f 90 0b 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 31 c0 <8b> 5f
1c 4cf
RSP: 0000:ffffc9000032f9d8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff888017fc4f00 RCX: 0000000000000000
RDX: ffff888040220000 RSI: 0000000000000c40 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: ffff888019263627
R10: ffffffff83937cd1 R11: 0000000000000000 R12: 0000000000000c40
R13: ffff888019263538 R14: 0000000000000000 R15: 0000000000ffffff
FS:  0000000000000000(0000) GS:ffff88802d180000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000001c CR3: 000000000b48e000 CR4: 0000000000750ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 integrity_inode_get+0x47/0x260 security/integrity/iint.c:105
 process_measurement+0x33d/0x17e0 security/integrity/ima/ima_main.c:237
 ima_bprm_check+0xde/0x210 security/integrity/ima/ima_main.c:474
 security_bprm_check+0x7d/0xa0 security/security.c:845
 search_binary_handler fs/exec.c:1708 [inline]
 exec_binprm fs/exec.c:1761 [inline]
 bprm_execve fs/exec.c:1830 [inline]
 bprm_execve+0x764/0x19a0 fs/exec.c:1792
 kernel_execve+0x370/0x460 fs/exec.c:1973
 try_to_run_init_process+0x14/0x4e init/main.c:1366
 kernel_init+0x11d/0x1b8 init/main.c:1477
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
Modules linked in:
CR2: 000000000000001c
---[ end trace 22d601a500de7d79 ]---

Since LSMs and IMA may be configured at build time, but not enabled at
run time, panic the system if "integrity" was not initialized before use.

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 79f7865 ("LSM: Introduce "lsm=" for boottime LSM selection")
Cc: stable@vger.kernel.org
Signed-off-by: Mimi Zohar <zohar@linux.ibm.com>
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants