
Seeming deadlock on select! #51

Open
datdenkikniet opened this issue Mar 6, 2021 · 9 comments

Comments

@datdenkikniet

datdenkikniet commented Mar 6, 2021

Hello,

When using the following snippet in my code (as part of a Wiegand reader program) on a Raspberry Pi 0:

use futures::stream::{Next, StreamExt};
use gpio_cdev::AsyncLineEventHandle;
use tokio::{pin, select};

// events0 / events1 are the AsyncLineEventHandles for the Wiegand D0 / D1 lines.
let monitor_0: Next<'_, AsyncLineEventHandle> = events0.next();
let monitor_1: Next<'_, AsyncLineEventHandle> = events1.next();

pin!(monitor_0, monitor_1);

select! {
    // A pulse on D0 shifts in a 0 bit, a pulse on D1 shifts in a 1 bit.
    res = monitor_0 => { data = data << 1; },
    res = monitor_1 => { data = (data << 1) | 1 as u64; },
}

it seems like the code ends up in a deadlock inside the select! macro after this snippet has run 128 times: the program then uses 100% CPU, but neither of the Nexts ever completes.

events0 and events1 are AsyncLineEventHandles obtained from two Lines of the same Chip.
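
For context, the handles are set up roughly like this (a sketch assuming gpio-cdev's async-tokio feature; the chip path, the line offsets 20/21 and the flags are placeholders):

use gpio_cdev::{AsyncLineEventHandle, Chip, EventRequestFlags, LineRequestFlags};

fn open_handles() -> Result<(AsyncLineEventHandle, AsyncLineEventHandle), Box<dyn std::error::Error>> {
    // Both lines come from the same chip.
    let mut chip = Chip::new("/dev/gpiochip0")?;
    let line_0 = chip.get_line(20)?; // Wiegand D0
    let line_1 = chip.get_line(21)?; // Wiegand D1

    let events0 = AsyncLineEventHandle::new(line_0.events(
        LineRequestFlags::INPUT,
        EventRequestFlags::FALLING_EDGE,
        "wiegand-d0",
    )?)?;
    let events1 = AsyncLineEventHandle::new(line_1.events(
        LineRequestFlags::INPUT,
        EventRequestFlags::FALLING_EDGE,
        "wiegand-d1",
    )?)?;

    Ok((events0, events1))
}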

Introducing a millisecond delay at the very top increases the number of bytes that can be read to 256.

Introducing a larger delay seems to remove the deadlock altogether, but the ordering of the events is lost, which turns the data into garbage since the order of the events determines the final output.

I'm not certain if this is a gpio-cdev problem, a kernel problem, or if I'm simply doing this incorrectly.

Any feedback is highly appreciated.

@ryankurte

huh, interesting failure! i have a bit of async experience but the internals of this are unfamiliar to me... @mgottschlag no obligations, but any ideas?

absolutely grasping at straws but:

  • have you tried reading from two pins on different chips to check whether it could be an interaction from sharing chips?
  • i would be interested to know whether the behaviour changes if you restrict the executor to a single thread (see the sketch below this list). not sure this will catch anything, but just in case it is related to #50 (Is kernel event queueing per-process?) or similar to the classic rust-lang/rust#71072 (rustc should suggest using the async version of Mutex)
  • have you tried to run it under gdb? it can be pretty rough where async is involved but will usually highlight the problem area (ime strace and cargo flamegraph can also be useful for getting a grasp of these things, though raspbian is missing a bunch of kernel tools so this may not be workable for you)
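
e.g. something like this to pin the runtime to a single thread (a rough sketch; the exact attribute names depend on which tokio version you're on):

// tokio 0.2: restrict the runtime via the attribute macro, e.g.
// #[tokio::main(core_threads = 1, max_threads = 1)] or #[tokio::main(basic_scheduler)].
//
// tokio 1.x equivalent, building the runtime explicitly:
fn main() -> std::io::Result<()> {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()?;
    rt.block_on(async {
        // ... the select! loop from the issue goes here ...
    });
    Ok(())
}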

@datdenkikniet
Author

datdenkikniet commented Mar 7, 2021

Thanks for your response :)

I've tried running it with strace, but then I get the same behaviour as with a longer-than-a-millisecond delay: the deadlock seems to disappear, but the ordering gets all messed up (presumably because of the overhead that strace introduces). I'll be sure to try it with GDB, as well as the other two experiments you've mentioned.

A detail I forgot to mention: I'm running this in a Buildroot environment on kernel 4.19.97, in case anyone is trying to reproduce this issue. I've not quite figured out how to build a newer kernel this way, so I hope that this version isn't missing some important fixes to the gpio device driver that could affect this issue.

@datdenkikniet
Author

Additionally: I'd completely missed it while looking for similar issues, but it seems that I'm effectively trying to accomplish what #18 describes. Sadly that issue has no real solution to the problem / how to do it either, so this is just what I came up with.

@datdenkikniet
Author

datdenkikniet commented Mar 8, 2021

When running this with gdb, it seems like the main thread never actually reaches the futex_wait syscall that it reaches in the "working" condition. The 2nd thread does successfully get to the epoll_wait() that it also performs under normal execution. Alternatively, the futex_wait is completing without firing an event or anything else that tokio catches in the select!. (I tested this with both the futures crate's select! + fuse and tokio's select!, with the same result.)

Seems like this futex_wait is actually due to the block_on that I'm using to wait for the reading of the bits to finish, so this information might not be all that relevant.

When switching to a single thread of execution (using #[tokio::main(max_threads = 1, core_threads = 1)] or #[tokio::main(basic_scheduler)]), the program stops working altogether: it reaches the aforementioned futex_wait, but that call never completes.

I'll also be trying this with the gpiomon tool.

@datdenkikniet
Author

datdenkikniet commented Mar 9, 2021

I've tried this with gpiomon now (i.e. gpiomon -lf gpiochip0 20 21), and with that it seems to work just fine. I get the correct output consistently, and it doesn't hang. I'll see if I can somehow figure out what's going on.

@datdenkikniet
Author

datdenkikniet commented Mar 9, 2021

After some more investigation, it seems that it might be an issue in Tokio itself or in the way the Streams are implemented. I made my own fork and updated it to Tokio 1 (master...datdenkikniet:tokio-update), but it exhibits the exact same behaviour. What happens is: the poll_next function implemented for AsyncLineEventHandle is called many times, but seemingly never completes (none of the match arms are matched, nor is an error produced by the ready! call). In the background it must complete somehow, since the memory usage of the program does not change, but somewhere something is not going right.
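
For reference, the general shape of the polling pattern involved is roughly the following (an illustrative sketch built on tokio 1's AsyncFd, not gpio-cdev's actual implementation; EventStream and read_event are made-up names):

use std::io;
use std::os::unix::io::RawFd;
use std::pin::Pin;
use std::task::{Context, Poll};

use futures::ready;
use futures::stream::Stream;
use tokio::io::unix::AsyncFd;

// Hypothetical stream over a readable, non-blocking event file descriptor.
struct EventStream {
    fd: AsyncFd<RawFd>,
}

// Hypothetical: read one event; a real implementation would decode the
// kernel's gpioevent_data struct from the bytes read.
fn read_event(fd: &RawFd) -> io::Result<u64> {
    let mut buf = [0u8; 16];
    let n = unsafe { libc::read(*fd, buf.as_mut_ptr().cast(), buf.len()) };
    if n < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(n as u64)
    }
}

impl Stream for EventStream {
    type Item = io::Result<u64>;

    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<Option<Self::Item>> {
        let this = self.get_mut();
        loop {
            // If the fd is not ready, ready! makes poll_next itself return
            // Poll::Pending, which matches the behaviour observed here.
            let mut guard = match ready!(this.fd.poll_read_ready(cx)) {
                Ok(guard) => guard,
                Err(e) => return Poll::Ready(Some(Err(e))),
            };
            match guard.try_io(|inner| read_event(inner.get_ref())) {
                Ok(result) => return Poll::Ready(Some(result)),
                // Spurious readiness: the guard cleared it, so poll again.
                Err(_would_block) => continue,
            }
        }
    }
}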

Removing the select! in favour of an await on just one of the events doesn't change anything either: the number of events that can be read stays the same.
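
For completeness, this is the single-line version I mean (a sketch assuming gpio-cdev's async-tokio feature; events0 is the handle from the earlier snippet):

use futures::stream::StreamExt;
use gpio_cdev::AsyncLineEventHandle;

// Await events from a single line only, with no select! involved;
// the stall shows up here all the same.
async fn read_d0_only(mut events0: AsyncLineEventHandle) -> u64 {
    let mut data: u64 = 0;
    while let Some(_event) = events0.next().await {
        // Same bookkeeping as the D0 arm of the select! above.
        data <<= 1;
    }
    data
}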

I'm unsure if this issue still belongs in gpio-cdev (possibly an incorrect implementation of poll_event?) or in Tokio/Futures. Any guidance would be appreciated.

@mgottschlag
Contributor

I am sorry I cannot contribute - I actually do not have much low-level tokio experience.

Anyways, so the summary right now is:

  • The problem is not caused by incorrect usage of select!? I had problems with select! in the past - the macro has non-obvious failure modes and places requirements on the futures that trigger those failure modes when not met, and I kept confusing the two different macros from futures and tokio (see the sketch below this list), so it would be nice to rule out select! as a cause of the problems.
  • The problem is not actually caused by waiting on multiple lines, as await on one line instead of select! on two does not fix the problem? This is weird; the code used to work fairly reliably for me for such use cases. I use tokio::select! with a single line without problems.
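
For reference, the difference between the two macros in minimal form (the async blocks are just placeholders):

use futures::{future::FutureExt, pin_mut, select};

// futures::select! requires FusedFuture, hence the .fuse() calls and the pinning.
async fn with_futures_select() {
    let a = async { 1u8 }.fuse();
    let b = async { 2u8 }.fuse();
    pin_mut!(a, b);
    select! {
        x = a => println!("a won: {}", x),
        y = b => println!("b won: {}", y),
    }
}

// tokio::select! has no fuse requirement, but it drops the futures of the
// branches that did not complete.
async fn with_tokio_select() {
    tokio::select! {
        x = async { 1u8 } => println!("a won: {}", x),
        y = async { 2u8 } => println!("b won: {}", y),
    }
}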

@datdenkikniet
Author

datdenkikniet commented Mar 9, 2021

Whoops, pressed the wrong button.

Thanks for the feedback.

Yes, your summary seems to reflect what I've found so far. AFAIK I'm using select! correctly, and I'm definitely using Tokio's select!.

I'll remove all of the references to the 2nd line and re-try; maybe it starts working again in that case. Yeah, even when doing everything with a single line (i.e. only getting 1 line from the Chip, getting an async handle to its events, etc.) the issue still occurs.

@datdenkikniet
Author

datdenkikniet commented Mar 11, 2021

What happens is: the poll_next function implemented for AsyncLineEventHandle is called many times, but seemingly never completes (none of the match arms are matched, nor is an error produced by the ready! call).

So, I wasn't entirely correct:

What happens is that the file.poll_read() in Tokio 1.2 (or poll_read_ready() in the older version) is simply never ready. I'd missed that the ready! macro actually returns Poll::Pending if the Poll that is passed to it as an argument is also Poll::Pending.

Very unsure why it happens.
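
For reference, ready! behaves roughly like this (a sketch of its effect, not the exact macro source):

use std::task::Poll;

// What ready!(some_poll_expr) effectively does inside a poll function:
// it yields the inner value on Ready, and otherwise makes the enclosing
// poll function return Poll::Pending immediately.
macro_rules! ready_sketch {
    ($poll:expr) => {
        match $poll {
            Poll::Ready(value) => value,
            Poll::Pending => return Poll::Pending,
        }
    };
}

// e.g. inside poll_next: if the inner poll_read() is Pending, poll_next
// returns Pending right here and gets re-polled later, when the waker fires.
// let result = ready_sketch!(file_poll_expression);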
