Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

core/sched.c: fix _runqueue_pop() removing wrong thread #20938

Merged
merged 1 commit into from
Nov 14, 2024

Conversation

derMihai
Copy link
Contributor

@derMihai derMihai commented Oct 23, 2024

Contribution description

_runqueue_pop() calls clist_lpop(), which pops the leftmost entry in the thread's priority runqueue, but not necessarily the one passed as argument.

I understand the original logic behind this: if a thread is to be removed, then it is so because it is about to block on something or exit. And a thread can only block or exit if it's currently running. As a result, it must be the leftmost entry on the runqueue. But this isn't always true. Consider sched_change_priority(), which does the following:

    if (thread_is_active(thread)) { 
        _runqueue_pop(thread);
        _runqueue_push(thread, priority);
    }

thread_is_active() returns true if a thread is either running or queued to be running. So the _runqueue_pop() might retrieve any thread. Let's extend _runqueue_pop() with the following assertion:

    ...
    clist_node_t *removed = clist_lpop(&sched_runqueues[thread->priority]);
    assume(container_of(removed, thread_t, rq_entry) == thread);
    ...

Running this will trigger the assertion:

    int thread_ids[WAITERS_CNT];
    for (unsigned i = 0; i < WAITERS_CNT; i++) {
       // set a lower priority than the current thread's to make sure they are all in the runqueue when
       // changing priorities below
        thread_ids[i] = thread_create(stacks[i], sizeof(stacks[0]), THREAD_PRIORITY_MAIN + 1,
                      THREAD_CREATE_STACKTEST, waiter, NULL, "waiter");
    }

    thread_t *thread2 = thread_get(thread_ids[1]);

    irq_disable(); // sched_change_priority() wants interrupts disabled
    sched_change_priority(thread2, THREAD_PRIORITY_MAIN);
    irq_enable();

If we change _runqueue_pop() to search for the specific thread:

    ...
    clist_node_t *removed = clist_remove(&sched_runqueues[thread->priority], &thread->rq_entry);
    assume(container_of(removed, thread_t, rq_entry) == thread);
    ....

Then everything is all right.

The only ugly thing is that, when the thread is indeed running (vast majority of cases), due to how clists are implemented the whole list is walked to remove the thread.

Testing procedure

I briefly tested this code on a SAM0 board.

@derMihai derMihai requested a review from kaspar030 as a code owner October 23, 2024 13:38
@github-actions github-actions bot added the Area: core Area: RIOT kernel. Handle PRs marked with this with care! label Oct 23, 2024
@benpicco benpicco added Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR labels Oct 23, 2024
@benpicco benpicco requested a review from maribu October 23, 2024 14:09
@riot-ci
Copy link

riot-ci commented Oct 23, 2024

Murdock results

✔️ PASSED

ba71ba7 core/sched.c: fix _runqueue_pop() removing wrong thread

Success Failures Total Runtime
10251 0 10251 19m:18s

Artifacts

@benpicco
Copy link
Contributor

The only ugly thing is that, when the thread is indeed running (vast majority of cases), due to how clists are implemented the whole list is walked to remove the thread.

clist_remove() already contains an optimisation there:

static inline clist_node_t *clist_remove(clist_node_t *list, clist_node_t *node)
{
    if (list->next) {
        if (list->next->next == node) {
            return clist_lpop(list);
        }
        else {

So in the common case, this just adds two if conditions.

@derMihai
Copy link
Contributor Author

So in the common case, this just adds two if conditions

If the common case is that there are exactly one or two threads, and we want to remove the first. If there are more, it will start looping from the second. But probably good enough.

@benpicco benpicco added this pull request to the merge queue Oct 23, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 23, 2024
@benpicco
Copy link
Contributor

looks like Makefile.ci needs an update - can you check by how much this increases code size?

It's probably just a few bytes and tests_xbee was already at the very limit for nucleo-l031k6

@derMihai
Copy link
Contributor Author

derMihai commented Nov 7, 2024

looks like Makefile.ci needs an update - can you check by how much this increases code size?

It's probably just a few bytes and tests_xbee was already at the very limit for nucleo-l031k6

seems to be working again.

@benpicco benpicco added this pull request to the merge queue Nov 7, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 7, 2024
@benpicco
Copy link
Contributor

No, CI still failed

@derMihai
Copy link
Contributor Author

make BOARD=nucleo-l031k6 for tests/drivers/xbee fails on master too:

/usr/lib/gcc/arm-none-eabi/10.3.1/../../../arm-none-eabi/bin/ld: region `rom' overflowed by 3596 bytes

on my branch:

/usr/lib/gcc/arm-none-eabi/10.3.1/../../../arm-none-eabi/bin/ld: region `rom' overflowed by 3680 bytes

So a code-size increase by 84 bytes. I suppose this happens because clist_remove() is very rarely used and most probably not in this application (grep returns a few results, but hard to tell for sure as neither master nor my branch produce a binary), but now it's in.

Also, why is All checks have passed active if some build failed?

@maribu
Copy link
Member

maribu commented Nov 14, 2024

With the toolchain used in the CI the binary get's smaller. The increase by 84 B was enough to tip it over the limit, as it is now 8 B larger than flash is available.

That is fine. We want things to be correct first and small/efficient second. So just add the board to the list in the Makefile.ci to record it is no longer expected to be linkable with the CI toolchain, and we can merge this. (Note: You can use make generate-Makefile.ci to build it.)

@derMihai derMihai force-pushed the mir/fix_runqueue_pop branch from 13d9b9c to ba71ba7 Compare November 14, 2024 10:05
@github-actions github-actions bot added the Area: tests Area: tests and testing framework label Nov 14, 2024
@benpicco benpicco enabled auto-merge November 14, 2024 10:49
@benpicco benpicco added this pull request to the merge queue Nov 14, 2024
Merged via the queue into RIOT-OS:master with commit 77c4a24 Nov 14, 2024
25 checks passed
@MrKevinWeiss MrKevinWeiss added this to the Release 2025.01 milestone Jan 20, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: core Area: RIOT kernel. Handle PRs marked with this with care! Area: tests Area: tests and testing framework CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants