scheduler: add support for CPU affinity (thread pinning) #2024
Merged
Conversation
francescolavra force-pushed the feature/cpu-affinity branch 4 times, most recently from 2161052 to 5779988 on May 29, 2024 at 16:06
I added two commits that fix SMP-related issues found in the poll/select implementation and the robust futex handling logic.
eyberg approved these changes on Jun 13, 2024
A given klib may modify the implementation of Unix syscalls, and for this it needs the Unix subsystem to be initialized (an example of this is the sandbox klib). This change moves the initialization of the Unix subsystem from startup() to create_init() in stage3.c (which is called immediately after the root filesystem is loaded). This fixes an assertion failure ("assertion m[n].handler == 0 failed at /usr/src/nanos/src/unix/syscall.c:2828 (IP 0xffffffffa9ca5100) in _register_syscall()") that can happen in SMP instances if the sandbox klib is loaded before the startup() function in stage3.c is called.
These functions allow retrieving and removing elements at a given position in a priority queue. They will be used by the scheduler when adding support for CPU affinity.
If the CPU set supplied to sched_setaffinity() does not include any online CPU in the machine, this syscall should return -EINVAL. In addition, the copy from user memory is being fixed so that it does not overrun the user buffer when cpusetsize is not a multiple of 8 bytes. The affinity bitmap is being moved from struct thread to struct sched_task, in preparation for changing the scheduler so that it honors CPU affinity.
This change enhances the scheduler so that it honors the CPU affinity of user threads (set via the sched_setaffinity syscall). The migrate_to_self() function is being amended so that it does not fail to wake up an idle CPU if the current CPU could not migrate a thread from the idle CPU (because threads may be non-migratable due to their CPU affinity). Since a task dequeued from a scheduling queue may not be the highest priority (lowest runtime) task in that queue, its runtime field may be non-zero; the thread_cputime_update() function is being updated to reflect that.
Notify set interface functions such as notify_dispatch() and notify_remove() (which internally acquire the notify set lock) should never be called while holding the same lock that is acquired in the notify handler (which is invoked with the notify set lock held), because this can lead to a lock inversion and ensuing deadlock if the notify handler is invoked on a given vCPU at the same time as another vCPU calls a notify set function. This change fixes the Unix domain socket connect code to avoid the above issue. Note: in order to prevent notify set functions from being called with a notify entry that has been deallocated (by another vCPU), notify entries should be refcounted.
In the RISC-V architecture, secondary cores are started in parallel (i.e. the next core is started without waiting for the previous core's startup process to complete), and the boot core enters the runloop without waiting for secondary cores to start. This means that the scheduler can run before all cores are online. Since secondary cores can start in an arbitrary order, the cpuinfos array is also populated in an arbitrary order. The scheduler assumes that all vCPUs with id from 0 to (`total_processors` - 1) are online, therefore if each vCPU increments `total_processors` by 1 when starting up, it may happen that the scheduler tries to access a null cpuinfo. This issue is fixed by amending `ap_start_newstack()` so that it updates `total_processors` based on which vCPUs have completed the startup process, so that the above assumption in the scheduler is always valid. In addition, if the cpuinfos vector is initially allocated with length=1 and then extended as secondary cores are brought online, the scheduler may access invalid vector elements, because the backing buffer of the vector may change at any time due to a vector extension. This issue is fixed by amending `init_kernel_contexts()` so that it allocates the vector with a capacity set to `present_processors`: this ensures that the backing buffer never changes. In `kernel_runtime_init()`, `count_cpus_present()` is now called before `init_kernel_contexts()`, so that the number of present vCPUs is known when allocating the cpuinfos vector. In the x86 architecture, the ACPI tables are now initialized in `count_cpus_present()` instead of `init_interrupts()`, because they are needed in order to be able to enumerate the vCPUs.
When poll() is invoked by the user program, the kernel allocates a new epoll_blocked struct and associates it to the thread-specific epoll instance for poll/select; then, the set of supplied file descriptors is processed, and for each file descriptor the relevant event conditions are evaluated; finally, the thread is blocked if no events of interest are pending. This logic presents a race condition when one or more of the supplied file descriptors is already registered in the epoll instance: from the moment the epoll_blocked struct is associated to the epoll instance, the notify entry handler (which can be invoked at any time for all currently registered file descriptors) can modify the user buffers referenced by the epoll_blocked struct, which may still be in the process of being initialized by the thread calling poll(); this can lead to inconsistencies between the contents of these buffers and the return value stored in the epoll_blocked struct. A possible symptom of the above issue is the assertion failure reported in #1963; another possible symptom is a polling thread that remains stuck indefinitely even though the events being polled for have been notified (this has been observed when running the inotify runtime test on a 2-vCPU instance). A similar race condition affects the select() implementation. This change fixes the above issues by moving the allocation of the epoll_blocked struct after the user buffers passed to the poll() or select() syscall have been processed; the check for pre-existing event conditions is now done right after allocating the epoll_blocked struct, via a newly defined function epoll_check_epollfds() which consolidates code shared between poll(), select() and epoll_wait(). In addition, the lock that protects the epollfd list in an epoll instance is no longer taken when poll() or select() is called, because these syscalls operate on a thread-local epoll instance, whose epollfd list is never touched concurrently by other threads.
When a thread terminates while holding a mutex, the kernel signals to any mutex waiters that the mutex holder has terminated, by processing the robust list of the terminated thread and executing a FUTEX_WAKE operation for each futex in the list. When the mutex waiter thread resumes, the NPTL code in glibc frees the relevant userspace mutex data structure, which means that the next pointer in the robust list may not be valid anymore; if the CPU that is handling the thread termination accesses the next pointer after another CPU has started executing the woken up thread, it may retrieve a bogus pointer value, which can lead to a crash in userspace; this issue has been observed when running the futexrobust runtime test on RISC-V instances with multiple vCPUs. This change fixes the above issue by retrieving the pointer to the next entry in the robust list before waking up any waiter on the current entry. In addition, handling of the `list_op_pending` pointer is being fixed so that the mutex in the pending state, if present, is not processed while walking the robust list; this fix prevents the mutex from being processed twice.
This change amends the CircleCI configuration so that automated tests are run on 2-vCPU instances, in order to increase test coverage of SMP-related kernel code. The platform-specific Makefiles now support specifying a vCPU count (via the `VCPUS` variable) in the command line when running a test program; example command line to run the thread_test program on a 2-vCPU instance: `make VCPUS=2 TARGET=thread_test run`
francescolavra force-pushed the feature/cpu-affinity branch from 5779988 to ca4a3cb on June 13, 2024 at 15:31
This change set enhances the scheduler so that it honors the CPU affinity of user threads (set via the sched_setaffinity syscall). In addition, it amends the CircleCI configuration so that automated tests are run on 2-vCPU instances, in order to increase test coverage of SMP-related kernel code. This increased test coverage revealed a few issues in the RISC-V-specific code and in the Unix code (Unix domain sockets, poll/select, and robust futexes), which have been fixed as part of this change set.