kernel: k_queue_get return NULL before timeout #25904
Comments
Heh, oops. FWIW: the right fix, if I have time, is to remove that usage of poll within queue. There's no reason to sleep using poll per se (as discovered, it's racy and subtle); you do it when you need to sleep on more than one thing. And this code doesn't.
@andyross apparently this blocks another High bug in Bluetooth, so we need this fixed for 2.3, so we have time :)
@andyross would it be possible to fix this first by re-adding the loop so we can continue the investigation of the other high-prio bug, and then defer the clean fix for 2.4?
Is this behaviour documented for users of the k_poll API, i.e. that event.state == K_POLL_STATE_FIFO_DATA_AVAILABLE can be set while the queue is still empty? Or should this be handled by k_poll?
It's an internal thing, so it's not documented. Basically CONFIG_POLL means two things inside of k_queue: first it enables the use of the "DATA_AVAILABLE" poll event so users can block using k_poll() on a mix of queues/fifos/lifos and other objects like semaphores. That's good. The other thing it does is that it uses k_poll() as the blocking mechanism inside of k_queue_get(). That is, it will call k_poll() on a list of just one event. That's bad, because it's racy and subtle (i.e. I fixed it once then came by a year later and thought my own code was needless and removed it). This is the part I'd like to remove. I think I have it working actually, give me a few more minutes to spot check and I'll throw a patch up for review. FWIW: that race is sorta fundamental to k_poll(). Unlike the lower level wait_q/pend() API, a call to k_poll() can't atomically release a lock. So you can't use it as a condition variable; the events need to be treated as a stream. The code here in poll was doing exactly that: check the spot where it releases a spinlock and then calls k_poll(). Another thread or ISR coming in at that moment to modify the queue state can cause a deadlock.
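To illustrate the condition-variable point, here is a minimal sketch in portable C with POSIX threads. It is not Zephyr code, and the demo_queue type and its functions are hypothetical names made up for this example. pthread_cond_timedwait() stands in for the role described for the wait_q/pend() API: it releases the lock and goes to sleep in one atomic step, and the "is there data?" check is repeated in a loop after every wakeup, so a waiter that loses the race to another consumer goes back to sleep instead of returning early.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical single-slot "queue", for illustration only. */
struct demo_queue {
    pthread_mutex_t lock;
    pthread_cond_t  not_empty;
    void           *item;       /* NULL means the queue is empty */
};

#define DEMO_QUEUE_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL }

void demo_queue_put(struct demo_queue *q, void *item)
{
    assert(q != NULL);
    pthread_mutex_lock(&q->lock);
    q->item = item;
    /* Wake every waiter; only one of them will actually find data. */
    pthread_cond_broadcast(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* Returns the item, or NULL only after timeout_ms has really elapsed. */
void *demo_queue_get(struct demo_queue *q, long timeout_ms)
{
    struct timespec deadline;
    void *item;

    assert(q != NULL);

    /* Absolute deadline, so repeated waits never extend the timeout. */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec  += timeout_ms / 1000;
    deadline.tv_nsec += (timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec++;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&q->lock);
    while (q->item == NULL) {
        /* Atomically unlocks q->lock and sleeps; relocks before returning. */
        int rc = pthread_cond_timedwait(&q->not_empty, &q->lock, &deadline);

        if (rc == ETIMEDOUT) {
            break;
        }
    }
    item = q->item;     /* still NULL if the wait timed out */
    q->item = NULL;
    pthread_mutex_unlock(&q->lock);
    return item;
}
```

The design choice that matters is the while loop around the wait: being woken is treated as a hint to recheck the queue under the lock, not as proof that data is available.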
@andyross Ok, I think it would be better to document it for the 2.3 release and postpone the fix to 2.4 though. And then put the loop back for the 2.3 release.
I suspect the code removal is going to be easier to review than the timeout loop, honestly. But I can do both and folks can pick.
The k_queue data structure, when CONFIG_POLL was enabled, would inexplicably use k_poll() as its blocking mechanism instead of the original wait_q/pend() code. This was actually racy, see commit b173e43. The code was structured as a condition variable: using a spinlock around the queue data before deciding to block. But unlike pend_current_thread(), k_poll() cannot atomically release a lock. A workaround had been in place for this, and then accidentally reverted (both by me!) because the code looked "wrong". This is just fragile, there's no reason to have two implementations of k_queue_get(). Remove. Note that this also removes a test case in the work_queue test where (when CONFIG_POLL was enabled, but not otherwise) it was checking for the ability to immediately cancel a delayed work item that was submitted with a timeout of K_NO_WAIT (i.e. "queue it immediately"). This DOES NOT work with the original/non-poll queue backend, and has never been a documented behavior of k_delayed_work_submit_to_queue() under any circumstances. I don't know why we were testing this. Fixes zephyrproject-rtos#25904 Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
Ok, looking at the PR that is up you might be right. I don't understand all parts of it though, so I'll leave it to the other reviewers.
[This is a cherry pick from upstream Zephyr, needed because k_poll isn't "coherence-safe" currently.] The k_queue data structure, when CONFIG_POLL was enabled, would inexplicably use k_poll() as its blocking mechanism instead of the original wait_q/pend() code. This was actually racy, see commit b173e43. The code was structured as a condition variable: using a spinlock around the queue data before deciding to block. But unlike pend_current_thread(), k_poll() cannot atomically release a lock. A workaround had been in place for this, and then accidentally reverted (both by me!) because the code looked "wrong". This is just fragile, there's no reason to have two implementations of k_queue_get(). Remove. Note that this also removes a test case in the work_queue test where (when CONFIG_POLL was enabled, but not otherwise) it was checking for the ability to immediately cancel a delayed work item that was submitted with a timeout of K_NO_WAIT (i.e. "queue it immediately"). This DOES NOT work with the original/non-poll queue backend, and has never been a documented behavior of k_delayed_work_submit_to_queue() under any circumstances. I don't know why we were testing this. Fixes zephyrproject-rtos#25904 Signed-off-by: Andy Ross <andrew.j.ross@intel.com>
The k_queue data structure, when CONFIG_POLL was enabled, would inexplicably use k_poll() as its blocking mechanism instead of the original wait_q/pend() code. This was actually racy, see commit b173e43. The code was structured as a condition variable: using a spinlock around the queue data before deciding to block. But unlike pend_current_thread(), k_poll() cannot atomically release a lock. A workaround had been in place for this, and then accidentally reverted (both by me!) because the code looked "wrong". This is just fragile, there's no reason to have two implementations of k_queue_get(). Remove. Note that this also removes a test case in the work_queue test where (when CONFIG_POLL was enabled, but not otherwise) it was checking for the ability to immediately cancel a delayed work item that was submitted with a timeout of K_NO_WAIT (i.e. "queue it immediately"). This DOES NOT work with the original/non-poll queue backend, and has never been a documented behavior of k_delayed_work_submit_to_queue() under any circumstances. I don't know why we were testing this. Fixes zephyrproject-rtos#25904 Signed-off-by: Andy Ross <andrew.j.ross@intel.com> (cherry picked from commit 99c2d2d) Signed-off-by: Joakim Andersson <joakim.andersson@nordicsemi.no>
Describe the bug
The API call k_queue_get returns no data element before the timeout has expired.
Calling either k_queue_get(queue, K_FOREVER); or k_queue_get(queue, K_SECONDS(20)) returns NULL within one second.
This appears to occur because two threads are both waiting for an element in the queue.
Once an element is posted to the queue, both threads are woken, one will retrieve the new element, while the other will discover an empty queue.
This is a regression from: 7832738
Specifically this change:
Reverting this change (using legacy timeout API) fixes the issue.
The commit message says this:
The loop appears to have been removed for the wrong reason.
To Reproduce
The current steps involve 2 nRF52 dev-kits and a few manual steps.
I can try to make a more minimal failing test if needed, otherwise I can verify using my current setup.
Use branch: https://github.com/joerchan/zephyr/tree/bt-recv-deadlock-debug
Console output
peripheral sample:
Expected behavior
k_queue_get should not return NULL before timeout has passed.
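As a rough illustration of that contract, the sketch below shows a retry loop that treats an early NULL as a spurious wakeup and keeps waiting until the full timeout has elapsed. It is a host-side model, not the Zephyr implementation: get_with_retry, the get_fn and now_fn hooks, and the fake_* test double are all hypothetical names invented for this example, standing in for a blocking get call and an uptime clock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hooks: a blocking get and a millisecond uptime clock. */
typedef void *(*get_fn)(void *ctx, int64_t timeout_ms);
typedef int64_t (*now_fn)(void *ctx);

/*
 * Call get() until it yields an item or the full timeout has elapsed,
 * so that an early NULL is never reported to the caller as a timeout.
 */
void *get_with_retry(get_fn get, now_fn now, void *ctx, int64_t timeout_ms)
{
    assert(get != NULL && now != NULL);

    const int64_t deadline = now(ctx) + timeout_ms;
    int64_t remaining = timeout_ms;

    for (;;) {
        void *item = get(ctx, remaining);

        if (item != NULL) {
            return item;
        }
        /* Woken without data: recompute how long is left to wait. */
        remaining = deadline - now(ctx);
        if (remaining <= 0) {
            return NULL;        /* genuine timeout */
        }
    }
}

/* Test double: returns NULL early `spurious` times, then the item. */
struct fake {
    int64_t clock_ms;
    int     spurious;
    void   *item;
};

static int64_t fake_now(void *ctx)
{
    return ((struct fake *)ctx)->clock_ms;
}

static void *fake_get(void *ctx, int64_t timeout_ms)
{
    struct fake *f = ctx;

    if (f->spurious > 0) {
        f->spurious--;
        f->clock_ms += 1;           /* woke after 1 ms, far too early */
        return NULL;
    }
    if (f->item == NULL) {
        f->clock_ms += timeout_ms;  /* slept for all the remaining time */
        return NULL;
    }
    return f->item;
}
```

With the fake above, three early wakeups followed by data still produce the item, and an empty queue only yields NULL once the simulated clock reaches the deadline.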
Impact
This breaks the current flow-control behavior of Bluetooth: when the k_queue_get attempted from the BT RX thread returns early, the thread drops its attempt to answer the ATT request, which results in a disconnected ATT channel.
Additional context
CONFIG_POLL is enabled
Blocker to fix: #23364