
kernel: k_queue_get return NULL before timeout #25904

Closed
joerchan opened this issue Jun 2, 2020 · 10 comments · Fixed by #25906
Labels: area: Kernel, bug (The issue is a bug, or the PR is fixing a bug), priority: high (High impact/importance bug)

Comments

@joerchan
Contributor

joerchan commented Jun 2, 2020

Describe the bug
The API call k_queue_get returns with no data element before the timeout has expired.
Calling either k_queue_get(queue, K_FOREVER); or k_queue_get(queue, K_SECONDS(20)) returns NULL within one second.

This appears to occur because two threads are both waiting for an element in the queue.
Once an element is posted to the queue, both threads are woken; one retrieves the new element, while the other finds an empty queue.

This is a regression from: 7832738
Specifically this change:

-static void *k_queue_poll(struct k_queue *queue, s32_t timeout)
+static void *k_queue_poll(struct k_queue *queue, k_timeout_t timeout)
 {
        struct k_poll_event event;
-       int err, elapsed = 0, done = 0;
+       int err;
        k_spinlock_key_t key;
        void *val;
-       u32_t start;
 
        k_poll_event_init(&event, K_POLL_TYPE_FIFO_DATA_AVAILABLE,
                          K_POLL_MODE_NOTIFY_ONLY, queue);
 
-       if (timeout != K_FOREVER) {
-               start = k_uptime_get_32();
-       }
-
-       do {
-               event.state = K_POLL_STATE_NOT_READY;
-
-               err = k_poll(&event, 1, timeout - elapsed);
-
-               if (err && err != -EAGAIN) {
-                       return NULL;
-               }
+       event.state = K_POLL_STATE_NOT_READY;
+       err = k_poll(&event, 1, timeout);
 
-               key = k_spin_lock(&queue->lock);
-               val = z_queue_node_peek(sys_sflist_get(&queue->data_q), true);
-               k_spin_unlock(&queue->lock, key);
+       if (err && err != -EAGAIN) {
+               return NULL;
+       }
 
-               if ((val == NULL) && (timeout != K_FOREVER)) {
-                       elapsed = k_uptime_get_32() - start;
-                       done = elapsed > timeout;
-               }
-       } while (!val && !done);
+       key = k_spin_lock(&queue->lock);
+       val = z_queue_node_peek(sys_sflist_get(&queue->data_q), true);
+       k_spin_unlock(&queue->lock, key);
 
        return val;

Reverting this change (using the legacy timeout API) fixes the issue.
The commit message says this:

    Also in queue.c, (when POLL was enabled) a similar loop was
    needlessly used to try to retry the k_poll() call after a spurious
    failure.  But k_poll() does not fail spuriously, so the loop was
    removed.

The loop appears to have been removed for the wrong reason.

To Reproduce
The current steps involve 2 nRF52 dev-kits and a few manual steps.
I can try to make a more minimal failing test if needed; otherwise I can verify using my current setup.

Use branch: https://github.com/joerchan/zephyr/tree/bt-recv-deadlock-debug

export BOARD=nrf52840dk
west build test/bluetooth/shell 
west flash --erase 
west build samples/bluetooth/peripheral
west flash --erase
# Attach one terminal to each

# shell terminal:
bt init
bt connect <id addr>
bt gatt exchange-mtu

Console output
peripheral sample:

Advertising successfully started
Connected: 4e:1a:6f:9b:b8:00 (random)
Write: count= 122, len= 2440, rate= 19523 bps.
Write: count= 1266, len= 25320, rate= 202578 bps.
Write: count= 1815, len= 36300, rate= 290408 bps.
Write: count= 1768, len= 35360, rate= 282880 bps.
Write cmd failed (-55).
Write: count= 1222, len= 37964, rate= 303758 bps.
[00:00:13.090,026] <err> bt_conn: k_is_in_isr 0
[00:00:13.090,057] <err> bt_conn: conn_tx_alloc timeout 0 secs.

[00:00:13.090,087] <err> bt_conn: Unable to allocate TX context
Write: count= 603, len= 37386, rate= 299444 bps.
LE conn param updated: int 0x0028 lat 0 to 42
Write: count= 529, len= 32798, rate= 262680 bps.
Write: count= 594, len= 36828, rate= 294957 bps.
Write: count= 654, len= 40548, rate= 324750 bps.
Write: count= 671, len= 41602, rate= 333049 bps.
Disconnected (reason 0x13)
Write cmd failed (-57).

Expected behavior
k_queue_get should not return NULL before timeout has passed.

Impact
This breaks the current flow-control behavior of Bluetooth: the attempted k_queue_get from the BT RX thread gives up on answering the ATT request, which results in a disconnected ATT channel.

Additional context
CONFIG_POLL is enabled
Blocker to fix: #23364

@joerchan joerchan added bug The issue is a bug, or the PR is fixing a bug priority: high High impact/importance bug labels Jun 2, 2020
@carlescufi
Member

carlescufi commented Jun 2, 2020

As a reminder, the loop was introduced by @andyross to fix the exact same symptom we are seeing now:
b173e43

@nashif
Member

nashif commented Jun 2, 2020

As a reminder, the loop was introduced by @andyross to fix the exact same symptom we are seeing now:
b173e43

we need some test coverage for this... "fool me once, ..."

@andyross
Contributor

andyross commented Jun 2, 2020

Heh, oops. FWIW: the right fix, if I have time, is to remove that usage of poll within queue. There's no reason to sleep using poll per se (as discovered, it's racy and subtle); you use it when you need to sleep on more than one thing, and this code doesn't.

@nashif
Member

nashif commented Jun 2, 2020

@andyross apparently this blocks another High bug in the Bluetooth, so we need this fixed for 2.3, so we have time :)

@carlescufi
Member

@andyross would it be possible to fix this first by re-adding the loop so we can continue the investigation of the other high-prio bug, and then defer the clean fix for 2.4?

@joerchan
Contributor Author

joerchan commented Jun 2, 2020

Heh, oops. FWIW: the right fix, if I have time, is to remove that usage of poll within queue. There's no reason to sleep using poll per se (as discovered, it's racy and subtle); you use it when you need to sleep on more than one thing, and this code doesn't.

Is this behaviour documented for users of the k_poll API? Having event.state == K_POLL_STATE_FIFO_DATA_AVAILABLE and still finding an empty queue, that is, or should this be handled by k_poll?
I'm thinking about the other high-prio bug for Bluetooth, where k_poll could be used as a fix.

@andyross
Contributor

andyross commented Jun 2, 2020

It's an internal thing, so it's not documented. Basically CONFIG_POLL means two things inside of k_queue: first, it enables the "DATA_AVAILABLE" poll event so users can block using k_poll() on a mix of queues/fifos/lifos and other objects like semaphores. That's good.

The other thing it does is use k_poll() as the blocking mechanism inside of k_queue_get(). That is, it calls k_poll() on a list of just one event. That's bad, because it's racy and subtle (i.e. I fixed it once, then came by a year later, thought my own code was needless, and removed it). This is the part I'd like to remove. I think I have it working, actually; give me a few more minutes to spot-check and I'll throw a patch up for review.

FWIW: that race is sort of fundamental to k_poll(). Unlike the lower-level wait_q/pend() API, a call to k_poll() can't atomically release a lock, so you can't use it as a condition variable; the events need to be treated as a stream. The queue code here was doing exactly that: check the spot where it releases a spinlock and then calls k_poll(). Another thread or ISR coming in at that moment to modify the queue state can cause a deadlock.

@joerchan
Contributor Author

joerchan commented Jun 2, 2020

@andyross Ok, though I think it would be better to document it for the 2.3 release and postpone the fix to 2.4, putting the loop back for the 2.3 release.

@andyross
Contributor

andyross commented Jun 2, 2020

I suspect the code removal is going to be easier to review than the timeout loop, honestly. But I can do both and folks can pick.

andyross pushed a commit to andyross/zephyr that referenced this issue Jun 2, 2020
The k_queue data structure, when CONFIG_POLL was enabled, would
inexplicably use k_poll() as its blocking mechanism instead of the
original wait_q/pend() code.  This was actually racy, see commit
b173e43.  The code was structured as a condition variable: using
a spinlock around the queue data before deciding to block.  But unlike
pend_current_thread(), k_poll() cannot atomically release a lock.

A workaround had been in place for this, and then accidentally
reverted (both by me!) because the code looked "wrong".

This is just fragile, there's no reason to have two implementations of
k_queue_get().  Remove.

Note that this also removes a test case in the work_queue test where
(when CONFIG_POLL was enabled, but not otherwise) it was checking for
the ability to immediately cancel a delayed work item that was
submitted with a timeout of K_NO_WAIT (i.e. "queue it immediately").
This DOES NOT work with the original/non-poll queue backend, and has
never been a documented behavior of k_delayed_work_submit_to_queue()
under any circumstances.  I don't know why we were testing this.

Fixes zephyrproject-rtos#25904

Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
@joerchan
Contributor Author

joerchan commented Jun 2, 2020

I suspect the code removal is going to be easier to review than the timeout loop, honestly. But I can do both and folks can pick.

Ok, looking at the PR that is up, you might be right. I don't understand all parts of it though, so I'll leave it to the other reviewers.

carlescufi pushed a commit that referenced this issue Jun 2, 2020
andyross pushed a commit to andyross/zephyr that referenced this issue Jun 5, 2020
[This is a cherry-pick from upstream Zephyr, needed because k_poll
isn't "coherence-safe" currently.]

lgirdwood pushed a commit to thesofproject/zephyr that referenced this issue Jun 9, 2020
e-rk pushed a commit to e-rk/zephyr that referenced this issue Jun 18, 2020
(cherry picked from commit 99c2d2d)
Signed-off-by: Joakim Andersson <joakim.andersson@nordicsemi.no>
joerchan pushed a commit to joerchan/zephyr that referenced this issue Jun 18, 2020