scheduler: add support for CPU affinity (thread pinning) #2024
Merged
Conversation
francescolavra force-pushed the feature/cpu-affinity branch 4 times, most recently from 2161052 to 5779988 on May 29, 2024 at 16:06
I added two commits that fix SMP-related issues found in the poll/select implementation and the robust futex handling logic.
eyberg approved these changes on Jun 13, 2024
A given klib may modify the implementation of Unix syscalls, and for this it needs the Unix subsystem to be initialized (an example of this is the sandbox klib). This change moves the initialization of the Unix subsystem from startup() to create_init() in stage3.c (which is called immediately after the root filesystem is loaded). This fixes an assertion failure ("assertion m[n].handler == 0 failed at /usr/src/nanos/src/unix/syscall.c:2828 (IP 0xffffffffa9ca5100) in _register_syscall()") that can happen in SMP instances if the sandbox klib is loaded before the startup() function in stage3.c is called.
These functions allow retrieving and removing elements at a given position in a priority queue. They will be used by the scheduler when adding support for CPU affinity.
If the CPU set supplied to sched_setaffinity() does not include any online CPU in the machine, this syscall should return -EINVAL. In addition, the copy from user memory is being fixed so that it does not overrun the user buffer when cpusetsize is not a multiple of 8 bytes. The affinity bitmap is being moved from struct thread to struct sched_task, in preparation for changing the scheduler so that it honors CPU affinity.
This change enhances the scheduler so that it honors the CPU affinity of user threads (set via the sched_setaffinity syscall). The migrate_to_self() function is being amended so that it does not fail to wake up an idle CPU if the current CPU could not migrate a thread from the idle CPU (because threads may be non-migratable due to their CPU affinity). Since a task dequeued from a scheduling queue may not be the highest priority (lowest runtime) task in that queue, its runtime field may be non-zero; the thread_cputime_update() function is being updated to reflect that.
Notify set interface functions such as notify_dispatch() and notify_remove() (which internally acquire the notify set lock) should never be called while holding the same lock that is acquired in the notify handler (which is invoked with the notify set lock held), because this can lead to a lock inversion and ensuing deadlock if the notify handler is invoked on a given vCPU at the same time as another vCPU calls a notify set function. This change fixes the Unix domain socket connect code to avoid the above issue. Note: in order to prevent notify set functions from being called with a notify entry that has been deallocated (by another vCPU), notify entries should be refcounted.
In the RISC-V architecture, secondary cores are started in parallel (i.e. the next core is started without waiting for the previous core's startup process to complete), and the boot core enters the runloop without waiting for secondary cores to start. This means that the scheduler can run before all cores are online. Since secondary cores can start in an arbitrary order, the cpuinfos array is also populated in an arbitrary order. The scheduler assumes that all vCPUs with id from 0 to (`total_processors` - 1) are online, therefore if each vCPU increments `total_processors` by 1 when starting up, it may happen that the scheduler tries to access a null cpuinfo. This issue is fixed by amending `ap_start_newstack()` so that it updates `total_processors` based on which vCPUs have completed the startup process, so that the above assumption in the scheduler is always valid. In addition, if the cpuinfos vector is initially allocated with length=1 and then extended as secondary cores are brought online, the scheduler may access invalid vector elements, because the backing buffer of the vector may change at any time due to a vector extension. This issue is fixed by amending `init_kernel_contexts()` so that it allocates the vector with a capacity set to `present_processors`: this ensures that the backing buffer never changes. In `kernel_runtime_init()`, `count_cpus_present()` is now called before `init_kernel_contexts()`, so that the number of present vCPUs is known when allocating the cpuinfos vector. In the x86 architecture, the ACPI tables are now initialized in `count_cpus_present()` instead of `init_interrupts()`, because they are needed in order to be able to enumerate the vCPUs.
When poll() is invoked by the user program, the kernel allocates a new epoll_blocked struct and associates it to the thread-specific epoll instance for poll/select; then, the set of supplied file descriptors is processed, and for each file descriptor the relevant event conditions are evaluated; finally, the thread is blocked if no events of interest are pending. This logic presents a race condition when one or more of the supplied file descriptors is already registered in the epoll instance: from the moment the epoll_blocked struct is associated to the epoll instance, the notify entry handler (which can be invoked at any time for all currently registered file descriptors) can modify the user buffers referenced by the epoll_blocked struct, which may still be in the process of being initialized by the thread calling poll(); this can lead to inconsistencies between the contents of these buffers and the return value stored in the epoll_blocked struct. A possible symptom of the above issue is the assertion failure reported in #1963; another possible symptom is a polling thread that remains stuck indefinitely even though the events being polled for have been notified (this has been observed when running the inotify runtime test on a 2-vCPU instance). A similar race condition affects the select() implementation. This change fixes the above issues by moving the allocation of the epoll_blocked struct after the user buffers passed to the poll() or select() syscall have been processed; the check for pre-existing event conditions is now done right after allocating the epoll_blocked struct, via a newly defined function epoll_check_epollfds() which consolidates code shared between poll(), select() and epoll_wait(). In addition, the lock that protects the epollfd list in an epoll instance is no longer taken when poll() or select() is called, because these syscalls operate on a thread-local epoll instance, whose epollfd list is never touched concurrently by other threads.
When a thread terminates while holding a mutex, the kernel signals to any mutex waiters that the mutex holder has terminated, by processing the robust list of the terminated thread and executing a FUTEX_WAKE operation for each futex in the list. When the mutex waiter thread resumes, the NPTL code in glibc frees the relevant userspace mutex data structure, which means that the next pointer in the robust list may not be valid anymore; if the CPU that is handling the thread termination accesses the next pointer after another CPU has started executing the woken up thread, it may retrieve a bogus pointer value, which can lead to a crash in userspace; this issue has been observed when running the futexrobust runtime test on RISC-V instances with multiple vCPUs. This change fixes the above issue by retrieving the pointer to the next entry in the robust list before waking up any waiter on the current entry. In addition, handling of the `list_op_pending` pointer is being fixed so that the mutex in the pending state, if present, is not processed while walking the robust list; this fix prevents the mutex from being processed twice.
This change amends the CircleCI configuration so that automated tests are run on 2-vCPU instances, in order to increase test coverage of SMP-related kernel code. The platform-specific Makefiles now support specifying a vCPU count (via the `VCPUS` variable) in the command line when running a test program; example command line to run the thread_test program on a 2-vCPU instance: `make VCPUS=2 TARGET=thread_test run`
francescolavra force-pushed the feature/cpu-affinity branch from 5779988 to ca4a3cb on June 13, 2024 at 15:31
This change set enhances the scheduler so that it honors the CPU affinity of user threads (set via the sched_setaffinity syscall). In addition, it amends the CircleCI configuration so that automated tests are run on 2-vCPU instances, in order to increase test coverage of SMP-related kernel code. This increased test coverage revealed a few issues in the RISC-V-specific code and in the Unix code (Unix domain sockets, poll/select, and robust futexes), which have been fixed as part of this change set.