
kernel: k_queue_get return NULL before timeout #25904

Closed
joerchan opened this issue Jun 2, 2020 · 10 comments · Fixed by #25906
Labels: area: Kernel, bug (The issue is a bug, or the PR is fixing a bug), priority: high (High impact/importance bug)

Comments

@joerchan
Contributor

joerchan commented Jun 2, 2020

Describe the bug
The API call k_queue_get returns with no data element before the timeout has expired.
Calling either k_queue_get(queue, K_FOREVER); or k_queue_get(queue, K_SECONDS(20)) returns NULL within one second.

This appears to occur because two threads are both waiting for an element in the queue.
Once an element is posted to the queue, both threads are woken; one retrieves the new element, while the other finds an empty queue.

This is a regression from: 7832738
Specifically this change:

-static void *k_queue_poll(struct k_queue *queue, s32_t timeout)
+static void *k_queue_poll(struct k_queue *queue, k_timeout_t timeout)
 {
        struct k_poll_event event;
-       int err, elapsed = 0, done = 0;
+       int err;
        k_spinlock_key_t key;
        void *val;
-       u32_t start;
 
        k_poll_event_init(&event, K_POLL_TYPE_FIFO_DATA_AVAILABLE,
                          K_POLL_MODE_NOTIFY_ONLY, queue);
 
-       if (timeout != K_FOREVER) {
-               start = k_uptime_get_32();
-       }
-
-       do {
-               event.state = K_POLL_STATE_NOT_READY;
-
-               err = k_poll(&event, 1, timeout - elapsed);
-
-               if (err && err != -EAGAIN) {
-                       return NULL;
-               }
+       event.state = K_POLL_STATE_NOT_READY;
+       err = k_poll(&event, 1, timeout);
 
-               key = k_spin_lock(&queue->lock);
-               val = z_queue_node_peek(sys_sflist_get(&queue->data_q), true);
-               k_spin_unlock(&queue->lock, key);
+       if (err && err != -EAGAIN) {
+               return NULL;
+       }
 
-               if ((val == NULL) && (timeout != K_FOREVER)) {
-                       elapsed = k_uptime_get_32() - start;
-                       done = elapsed > timeout;
-               }
-       } while (!val && !done);
+       key = k_spin_lock(&queue->lock);
+       val = z_queue_node_peek(sys_sflist_get(&queue->data_q), true);
+       k_spin_unlock(&queue->lock, key);
 
        return val;

Reverting this change (using the legacy timeout API) fixes the issue.
The commit message says this:

    Also in queue.c, (when POLL was enabled) a similar loop was
    needlessly used to try to retry the k_poll() call after a spurious
    failure.  But k_poll() does not fail spuriously, so the loop was
    removed.

The loop appears to have been removed for the wrong reason.

To Reproduce
The current steps involve 2 nRF52 dev-kits and a few manual steps.
I can try to make a more minimal failing test if needed; otherwise I can verify using my current setup.

Use branch: https://github.com/joerchan/zephyr/tree/bt-recv-deadlock-debug

export BOARD=nrf52840dk
west build test/bluetooth/shell 
west flash --erase 
west build samples/bluetooth/peripheral
west flash --erase
# Attach one terminal to each

# shell terminal:
bt init
bt connect <id addr>
bt gatt exchange-mtu

Console output
peripheral sample:

Advertising successfully started
Connected: 4e:1a:6f:9b:b8:00 (random)
Write: count= 122, len= 2440, rate= 19523 bps.
Write: count= 1266, len= 25320, rate= 202578 bps.
Write: count= 1815, len= 36300, rate= 290408 bps.
Write: count= 1768, len= 35360, rate= 282880 bps.
Write cmd failed (-55).
Write: count= 1222, len= 37964, rate= 303758 bps.
[00:00:13.090,026] <err> bt_conn: k_is_in_isr 0
[00:00:13.090,057] <err> bt_conn: conn_tx_alloc timeout 0 secs.

[00:00:13.090,087] <err> bt_conn: Unable to allocate TX context
Write: count= 603, len= 37386, rate= 299444 bps.
LE conn param updated: int 0x0028 lat 0 to 42
Write: count= 529, len= 32798, rate= 262680 bps.
Write: count= 594, len= 36828, rate= 294957 bps.
Write: count= 654, len= 40548, rate= 324750 bps.
Write: count= 671, len= 41602, rate= 333049 bps.
Disconnected (reason 0x13)
Write cmd failed (-57).

Expected behavior
k_queue_get should not return NULL before timeout has passed.

Impact
This breaks the current flow-control behavior of Bluetooth: the attempted k_queue_get from the BT RX thread gives up on answering the ATT request, which results in a disconnected ATT channel.

Additional context
CONFIG_POLL is enabled
Blocker to fix: #23364

@joerchan joerchan added bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug labels Jun 2, 2020
@carlescufi
Member

carlescufi commented Jun 2, 2020

As a reminder, the loop was introduced by @andyross to fix the exact same symptom we are seeing now:
b173e43

@nashif
Member

nashif commented Jun 2, 2020

As a reminder, the loop was introduced by @andyross to fix the exact same symptom we are seeing now:
b173e43

we need some test coverage for this... "fool me once, ..."

@andyross
Contributor

andyross commented Jun 2, 2020

Heh, oops. FWIW: the right fix, if I have time, is to remove that usage of poll within queue. There's no reason to sleep using poll per se (as discovered, it's racy and subtle); you use it when you need to sleep on more than one thing, and this code doesn't.

@nashif
Member

nashif commented Jun 2, 2020

@andyross apparently this blocks another High bug in the Bluetooth, so we need this fixed for 2.3, so we have time :)

@carlescufi
Member

@andyross would it be possible to fix this first by re-adding the loop so we can continue the investigation of the other high-prio bug, and then defer the clean fix for 2.4?

@joerchan
Contributor Author

joerchan commented Jun 2, 2020

Heh, oops. FWIW: the right fix, if I have time, is to remove that usage of poll within queue. There's no reason to sleep using poll per se (as discovered, it's racy and subtle); you use it when you need to sleep on more than one thing, and this code doesn't.

Is this behaviour documented for users of the k_poll API? Having event.state == K_POLL_STATE_FIFO_DATA_AVAILABLE and still finding an empty queue, that is, or should this be handled by k_poll?
I'm thinking about the other high-prio bug for Bluetooth, where k_poll could be used as a fix.

@andyross
Contributor

andyross commented Jun 2, 2020

It's an internal thing, so it's not documented. Basically CONFIG_POLL means two things inside of k_queue: first, it enables the "DATA_AVAILABLE" poll event so users can block using k_poll() on a mix of queues/fifos/lifos and other objects like semaphores. That's good.

The other thing it does is use k_poll() as the blocking mechanism inside of k_queue_get(). That is, it calls k_poll() on a list of just one event. That's bad, because it's racy and subtle (i.e. I fixed it once, then came by a year later, thought my own code was needless, and removed it). This is the part I'd like to remove. I think I have it working, actually; give me a few more minutes to spot-check and I'll throw a patch up for review.

FWIW: that race is sort of fundamental to k_poll(). Unlike the lower-level wait_q/pend() API, a call to k_poll() can't atomically release a lock, so you can't use it as a condition variable; the events need to be treated as a stream. The queue code here was doing exactly that: check the spot where it releases a spinlock and then calls k_poll(). Another thread or ISR coming in at that moment to modify the queue state can cause a deadlock.

@joerchan
Contributor Author

joerchan commented Jun 2, 2020

@andyross Ok, though I think it would be better to document it for the 2.3 release and postpone the fix to 2.4, putting the loop back for the 2.3 release.

@andyross
Contributor

andyross commented Jun 2, 2020

I suspect the code removal is going to be easier to review than the timeout loop, honestly. But I can do both and folks can pick.

andyross pushed a commit to andyross/zephyr that referenced this issue Jun 2, 2020
The k_queue data structure, when CONFIG_POLL was enabled, would
inexplicably use k_poll() as its blocking mechanism instead of the
original wait_q/pend() code.  This was actually racy, see commit
b173e43.  The code was structured as a condition variable: using
a spinlock around the queue data before deciding to block.  But unlike
pend_current_thread(), k_poll() cannot atomically release a lock.

A workaround had been in place for this, and then accidentally
reverted (both by me!) because the code looked "wrong".

This is just fragile, there's no reason to have two implementations of
k_queue_get().  Remove.

Note that this also removes a test case in the work_queue test where
(when CONFIG_POLL was enabled, but not otherwise) it was checking for
the ability to immediately cancel a delayed work item that was
submitted with a timeout of K_NO_WAIT (i.e. "queue it immediately").
This DOES NOT work with the original/non-poll queue backend, and has
never been a documented behavior of k_delayed_work_submit_to_queue()
under any circumstances.  I don't know why we were testing this.

Fixes zephyrproject-rtos#25904

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
@joerchan
Contributor Author

joerchan commented Jun 2, 2020

I suspect the code removal is going to be easier to review than the timeout loop, honestly. But I can do both and folks can pick.

Ok, looking at the PR that is up, you might be right. I don't understand all parts of it though, so I'll leave it to the other reviewers.

carlescufi pushed a commit that referenced this issue Jun 2, 2020
andyross pushed a commit to andyross/zephyr that referenced this issue Jun 5, 2020
[This is a cherry-pick from upstream Zephyr, needed because k_poll
isn't "coherence-safe" currently.]

lgirdwood pushed a commit to thesofproject/zephyr that referenced this issue Jun 9, 2020
e-rk pushed a commit to e-rk/zephyr that referenced this issue Jun 18, 2020
(cherry picked from commit 99c2d2d)
Signed-off-by: Joakim Andersson <joakim.andersson@nordicsemi.no>
joerchan pushed a commit to joerchan/zephyr that referenced this issue Jun 18, 2020