Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

scheduler: add support for CPU affinity (thread pinning) #2024

Merged
merged 9 commits into from
Jun 13, 2024

Conversation

francescolavra
Copy link
Member

@francescolavra francescolavra commented May 17, 2024

This change set enhances the scheduler so that it honors the CPU affinity of user threads (set via the sched_setaffinity syscall).
In addition, it amends the CircleCI configuration so that automated tests are run on 2-vCPU instances, in order to increase test
coverage of SMP-related kernel code. This increased test coverage revealed a few issues in the RISC-V-specific code and in the Unix code (Unix domain sockets, poll/select, and robust futexes), which have been fixed as part of this change set.

@francescolavra francescolavra force-pushed the feature/cpu-affinity branch 4 times, most recently from 2161052 to 5779988 Compare May 29, 2024 16:06
@francescolavra
Copy link
Member Author

I added 2 commits that fix SMP-related issues found in the poll/select implementation and the robust futex handling logic.

A given klib may make modifications to the implementation of Unix
syscalls, and for this it needs the Unix subsystem to be
initialized (an example of this is the sandbox klib).
This change moves the initialization of the Unix subsystem from
startup() to create_init() in stage3.c (which is called immediately
after the root filesystem is loaded). This fixes an "assertion
m[n].handler == 0 failed at /usr/src/nanos/src/unix/syscall.c:2828
(IP 0xffffffffa9ca5100) in _register_syscall()" which can happen in
SMP instances if the sandbox klib is loaded before the startup()
function in stage3.c is called.
These functions allow retrieving and removing from a priority queue
elements that are at a given position in the queue. They will be
used by the scheduler when adding support for CPU affinity.
If the CPU set supplied to sched_setaffinity() does not include any
online CPU in the machine, this syscall should return -EINVAL.
In addition, the copy from user memory is being fixed so that it
does not overrun the user buffer when cpusetsize is not a multiple
of 8 bytes.
The affinity bitmap is being moved from struct thread to struct
sched_task, in preparation for changing the scheduler so that it
honors CPU affinity.
This change enhances the scheduler so that it honors the CPU
affinity of user threads (set via the sched_setaffinity syscall).

The migrate_to_self() function is being amended so that it does not
fail to wake up an idle CPU if the current CPU could not migrate a
thread from the idle CPU (because threads may be non-migratable due
to their CPU affinity).
Since a task dequeued from a scheduling queue may not be the
highest priority (lowest runtime) task in that queue, its runtime
field may be non-zero; the thread_cputime_update() function is
being updated to reflect that.
Notify set interface functions such as notify_dispatch() and
notify_remove() (which internally acquire the notify set lock)
should never be called while holding the same lock that is acquired
in the notify handler (which is invoked with the notify set lock
held), because this can lead to a lock inversion and ensuing
deadlock if the notify handler is invoked on a given vCPU at the
same time as another vCPU calls a notify set function.
This change fixes the Unix domain socket connect code to avoid the
above issue.
Note: in order to prevent notify set functions from being called
with a notify entry that has been deallocated (by another vCPU),
notify entries should be refcounted.
In the RISC-V architecture, secondary cores are started in parallel
(i.e. the next core is started without waiting for the previous
core startup process to complete), and the boot core enters the
runloop without waiting for secondary cores to start. This means
that the scheduler can run before all cores are online. Since
secondary cores can start in an arbitrary order, the cpuinfos array
is also populated in an arbitrary order; The scheduler assumes that
all vCPUs with id from 0 to (`total_processors` - 1) are online,
therefore if each vCPU increments `total_processors` by 1 when
starting up it may happen that the scheduler tries to access a null
cpuinfo. This issue is fixed by amending `ap_start_newstack()` so
that it updates `total_processors` based on which vCPUs have
completed the startup process, so that the above assumption in the
scheduler is always valid.
In addition, if the cpuinfos vector is initially allocated with
length=1 and then extended as secondory cores are brought online,
the scheduler may access invalid vector elements, because the
backing buffer of the vector may change at any time due to a vector
extension. This issue is fixed by amending `init_kernel_contexts()`
so that it allocates the vector with a capacity set to
`present_processors`: this ensures that the backing buffer never
changes.
In `kernel_runtime_init()`, `count_cpus_present()` is now called
before `init_kernel_contexts()`, so that the number of present
vCPUs is known when allocating the cpuinfos vector.
In the x86 architecture, the ACPI tables are now initialized in
`count_cpus_present()` instead of `init_interrupts()`, because they
are needed in order to be able to enumerate the vCPUs.
When poll() is invoked by the user program, the kernel allocates a
new epoll_blocked struct and associates it to the thread-specific
epoll instance for poll/select; then, the set of supplied file
descriptors is processed, and for each file descriptor the relevant
event conditions are evaluated; finally, the thread is blocked if
no events of interest are pending. This logic presents a race
condition when one or more of the supplied file descriptors is
already registered in the epoll instance: from the moment the
epoll_blocked struct is associated to the epoll instance, the
notify entry handler (which can be invoked at any time for all
currently registered file descriptors) can modify the user buffers
referenced by the epoll_blocked struct, which may still be in the
process of being initialized by the thread calling poll(); this can
lead to inconsistencies between the contents of these buffers and
the return value stored in the epoll_blocked struct.
A possible symptom of the above issue is the assertion failure
reported in #1963; another possible symptom is a polling thread
that remains stuck indefinitely even though the events being polled
for have been notified (this has been observed when running the
inotify runtime test on a 2-vCPU instance).

A similar race condition affect the select() implementation.

Ths change fixes the above issues by moving the allocation of the
epoll_blocked struct after the user buffers passed to the poll() or
select() syscall have been processed; the check for pre-existing
event conditions is now done right after allocating the
epoll_blocked struct, via a newly defined function
epoll_check_epollfds() which consolidates code shared between
poll(), select() and epoll_wait().

In addition, the lock that protects the epollfd list in an epoll
instance in not taken anymore when poll() or select() is called,
because these syscalls operate on a thread-local epoll instance,
whose epollfd list is never touched concurrently by other threads.
When a thread terminates while holding a mutex, the kernel signals
to any mutex waiters that the mutex holder has terminated, by
processing the robust list of the terminated thread and executing a
FUTEX_WAKE operation for each futex in the list. When the mutex
waiter thread resumes, the NPTL code in glibc frees the relevant
userspace mutex data structure, which means that the next pointer
in the robust list may not be valid anymore; if the CPU that is
handling the thread termination accesses the next pointer after
another CPU has started executing the woken up thread, it may
retrieve a bogus pointer value, which can lead to a crash in
userspace; this issue has been observed when running the
futexrobust runtime test on RISC-V instances with multiple vCPUs.

This change fixes the above issue by retrieving the pointer to the
next entry in the robust list before waking up any waiter on the
current entry.
In addition, handling of the `list_op_pending` pointer is being
fixed so that the mutex in the pending state, if present, is not
processed while walking the robust list; this fix prevents the
mutex from being processed twice.
This change amends the CircleCI configuration so that automated
tests are run on 2-vCPU instances, in order to increase test
coverage of SMP-related kernel code.
The platform-specific Makefiles now support specifying a vCPU
count (via the `VCPUS` variable) in the command line when running a
test program; example command line to run the thread_test program
on a 2-vCPU instance: `make VCPUS=2 TARGET=thread_test run`
@francescolavra francescolavra merged commit ca4a3cb into master Jun 13, 2024
5 checks passed
@francescolavra francescolavra deleted the feature/cpu-affinity branch June 13, 2024 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants