
Timeouts on newer RPyC versions #527

Closed
gazpachoking opened this issue Feb 16, 2023 · 12 comments

@gazpachoking

I'm having an issue on RPyC versions 5.2.1 and above that we didn't have earlier (Flexget/Flexget#3601 for reference). It seems that after this change 94dbe36 we started hitting the timeout here

return self._recv_event.wait(timeout.timeleft())
rather than the program completing successfully. I've managed to fix it by swapping these lines to call release before dispatch here.

rpyc/rpyc/core/protocol.py

Lines 445 to 446 in 18a44e5

self._dispatch(data)
self._recvlock.release()

I didn't open a PR because I have no idea of the intent behind these lines, or whether doing this swap breaks something somewhere else. This could also be an issue with how I'm using RPyC, because it's not quite standard, and I'm having trouble creating a minimal example to reproduce. I'm hoping maybe there's something obvious to someone else about what's going wrong, or that the fix I've done is obviously correct to someone in the know. If not, I'll keep banging away to dig further and try to make a minimal reproduction.
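
To make the swap concrete, here is a minimal sketch (hypothetical names, not the actual rpyc code) of the two orderings; the only difference is whether the receive lock is released before or after the received data is dispatched:

import threading

class ServeSketch:
    def __init__(self, channel, dispatch):
        self._channel = channel           # object with a .recv() returning bytes or None
        self._dispatch = dispatch         # callable that processes one message
        self._recvlock = threading.Lock()

    def serve_dispatch_then_release(self):
        # The 5.2.1+ ordering: the message is dispatched while the lock is still held.
        with self._recvlock:
            data = self._channel.recv()
            if data:
                self._dispatch(data)

    def serve_release_then_dispatch(self):
        # The swap described above: other threads may start receiving as soon
        # as this thread has pulled its message off the wire.
        with self._recvlock:
            data = self._channel.recv()
        if data:
            self._dispatch(data)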

@comrumino
Collaborator

Script used to validate my changes

#!/usr/bin/env bash
function ansi_wrap_msg() { printf "\e[1;${1}m${2}\e[m"; }
function blue_msg() { ansi_wrap_msg "34" "$1"; }
function red_msg() { ansi_wrap_msg "31" "$1"; }
function getpid() {
    printf "%s" "$(ps -aux | grep 'versions/3[^/]*/bin/flexget' | grep -v vim | awk '{print $2}' | tr -d '[:space:]')"
}
function testflex() {
    timeout 5 flexget execute --tasks 'digest 1' && blue_msg "exit code $?: task-passed\n"
}
pid=$(getpid)
[[ "$pid" =~ ^[0-9]+$ ]] && kill -9 "$pid"
printf "" > "/tmp/rpyc.log"
flexget --loglevel DEBUG daemon start -d
testflex || red_msg "exit code $?: task-failed\n"
pid=$(getpid)
[[ "$pid" =~ ^[0-9]+$ ]] && kill -9 $pid

@comrumino comrumino self-assigned this Feb 21, 2023
@comrumino comrumino added the Done The issue discussion is exhausted and is closed w/ comment label Feb 21, 2023
@gazpachoking
Author

Thanks so much!

comrumino added a commit that referenced this issue Feb 21, 2023
@comrumino
Collaborator

Thank you! You gave me a much needed test case. I plan to do a release this week. Maybe tomorrow 👍

@notEvil

notEvil commented Mar 7, 2023

Hi,

since I was experiencing this class of issues a while ago and put some effort into fixing it (#492), I thought it would make sense to shed some extra light on this because the "fix" actually unfixes a fixed issue.

Consider thread A at

data = self._channel.poll(timeout) and self._channel.recv()

waiting for data to arrive, and thread B in

while not (self._is_ready or self.expired):

waiting for a result by checking its own state and calling Connection.serve.

The current "fix" contains the following path

  1. thread B sees that its result hasn't arrived yet and enters Connection.serve
  2. thread A receives data containing the result for thread B and releases the lock
  3. thread B acquires the lock and waits for data to arrive
  4. thread A dispatches the data

If timeout is Timeout(None) and there is no third party sending things, thread B is stuck waiting for data that will never arrive, because thread A already received it.
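
To make that concrete, here is a self-contained model of the path (hypothetical stand-ins such as channel and result_ready, not rpyc itself): thread A pulls B's result off the wire, releases the lock before dispatching, and thread B then sits on the lock waiting for data that will never arrive.

import queue
import threading
import time

channel = queue.Queue()            # stands in for the connection
recvlock = threading.Lock()
result_ready = threading.Event()   # stands in for thread B's AsyncResult state

def thread_a():
    with recvlock:
        data = channel.get()       # step 2: A receives the data meant for B
    time.sleep(0.5)                # lock released; B wins the race (step 3)
    result_ready.set()             # step 4: A finally dispatches B's result

def thread_b():
    if not result_ready.is_set():  # step 1: B's result hasn't arrived yet
        with recvlock:
            channel.get()          # step 3: blocks forever, nothing else is sent

channel.put("result-for-B")
a = threading.Thread(target=thread_a, daemon=True)
b = threading.Thread(target=thread_b, daemon=True)
a.start()
time.sleep(0.1)                    # let A grab the message first
b.start()
b.join(timeout=2)
print("thread B stuck:", b.is_alive())   # True: B never sees the dispatched result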

#492 led to

self._dispatch(data)
(5.2.1 - 5.3.0), where data is dispatched before the lock is released, to kill paths like the one above.

I'm having an issue on RPyC versions 5.2.1 and above that we didn't have earlier (Flexget/Flexget#3601 for reference). It seems that after this change 94dbe36 we started hitting the timeout here

return self._recv_event.wait(timeout.timeleft())

That report, on the other hand, references a commit that isn't on the surface of any of the newer releases. This issue wouldn't exist if the release versions were used instead.*

So please revert the "fix" or tell me that I'm wrong :)

* I was wrong about that, sry! There are still many ways to lock up rpyc in a multithreaded environment because of its default behavior: any thread may process any request. You could try #507 and see if it helps. It is experimental for a few reasons, but not because it is unreliable. With an actual project using/needing it (other than my own) I would go the extra mile and write documentation, tests, benchmarks, ...
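
For reference, turning thread binding on is just a protocol config flag; the option name below is the one used for #507, so treat this as a sketch and check the docs of whichever version you end up on:

import rpyc

# assumed option name from #507; the rest is an ordinary rpyc.connect call
conn = rpyc.connect("localhost", 18812, config={"bind_threads": True})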

@gazpachoking
Author

That report, on the other hand, references a commit that isn't on the surface of any of the newer releases.

I was definitely on 5.2.1 and 5.3.0 release versions when experiencing the issue.

You could try #507 and see if it helps

I actually did try that. On 5.3.0 it crashes like this with bind_threads on, and just locks up with it off.

Traceback (most recent call last):
  File "C:\Users\CSterling\PycharmProjects\Flexget\flexget_vanilla.py", line 11, in <module>
    flexget.main()
  File "C:\Users\CSterling\PycharmProjects\Flexget\flexget\__init__.py", line 55, in main
    manager.start()
  File "C:\Users\CSterling\PycharmProjects\Flexget\flexget\manager.py", line 343, in start
    client = IPCClient(ipc_info['port'], ipc_info['password'])
  File "C:\Users\CSterling\PycharmProjects\Flexget\flexget\ipc.py", line 188, in __init__
    self.conn = rpyc.utils.factory.connect_channel(
  File "C:\Users\CSterling\PycharmProjects\Flexget\.venv\lib\site-packages\rpyc\utils\factory.py", line 48, in connect_channel
    return service._connect(channel, config)
  File "C:\Users\CSterling\PycharmProjects\Flexget\.venv\lib\site-packages\rpyc\core\service.py", line 106, in _connect
    self.on_connect(conn)
  File "C:\Users\CSterling\PycharmProjects\Flexget\flexget\ipc.py", line 109, in on_connect
    daemon_version = self._conn.root.version()
  File "C:\Users\CSterling\PycharmProjects\Flexget\.venv\lib\site-packages\rpyc\core\protocol.py", line 751, in root
    self._remote_root = self.sync_request(consts.HANDLE_GETROOT)
  File "C:\Users\CSterling\PycharmProjects\Flexget\.venv\lib\site-packages\rpyc\core\protocol.py", line 718, in sync_request
    return _async_res.value
  File "C:\Users\CSterling\PycharmProjects\Flexget\.venv\lib\site-packages\rpyc\core\async_.py", line 106, in value
    self.wait()
  File "C:\Users\CSterling\PycharmProjects\Flexget\.venv\lib\site-packages\rpyc\core\async_.py", line 51, in wait
    self._conn.serve(self._ttl)
  File "C:\Users\CSterling\PycharmProjects\Flexget\.venv\lib\site-packages\rpyc\core\protocol.py", line 422, in serve
    return self._serve_bound(timeout, wait_for_lock)
  File "C:\Users\CSterling\PycharmProjects\Flexget\.venv\lib\site-packages\rpyc\core\protocol.py", line 552, in _serve_bound
    remote_thread_id, local_thread_id = brine.I4I4.unpack(message[:16])
struct.error: unpack requires a buffer of 8 bytes

On 5.3.1, everything works fine with bind_threads on or off.

@notEvil

notEvil commented Mar 7, 2023

Yes, I saw the code, didn't find it in the release commits and wrongly deduced that you must have used it. My bad!

5.3.1 with bind_threads off works, until it doesn't... Originally, #492 proposed a more fine-grained release strategy for the lock which might have covered your use case, or not, who knows. It's probably worth figuring out what causes the timeout using 5.3.0 and fixing the issue properly, if possible.

@gazpachoking
Author

5.3.1 with bind_threads off works, until it doesn't...

I'm assuming the not working times would be the same as in 5.1.0? We've been using that version (or older) for many years (and many users) with no issues. Not trying to deny that there are issues your fixes solve, just that there weren't any issues with the old way for our use case, but with the changes introduced by 5.2 (and reverted in 5.3.1) we have a consistent deadlock.

It's probably worth figuring out what causes the timeout using 5.3.0 and fixing the issue properly, if possible.

I agree. It could be that we are using rpyc "wrong," but in a way that never caused issues until now. I was having a hell of a time debugging or simplifying the exact cause though.

@notEvil

notEvil commented Mar 7, 2023

I get the sentiment. It worked for your use case, and now it doesn't. For me it was entirely different though. I ran into the race condition I described soon after I learned about and started using rpyc, and I was able to reliably reproduce it. Also, there are a significant number of issues opened, "fixed" and closed related to this race; just search for timeout, deadlock, race condition and the like, or look at the commit history related to AsyncResult and Connection.serve.

I'm assuming the not working times would be the same as in 5.1.0?

yes

but with the changes introduced by 5.2 (and reverted in 5.3.1) we have a consistent deadlock.

sometimes I wish issues were consistent xD

It could be that we are using rpyc "wrong,"

I doubt it, there is no "wrong". The fix for the race condition just buys correctness at the expense of multithreading-ness, which in general is a weak point of rpyc. Thread binding was implemented to provide both.

@notEvil

notEvil commented Mar 8, 2023

Good news! It is possible to fix the issue by using the release strategy proposed in #492.

The client requests DaemonService.exposed_handle_cli (flexget/ipc.py). In there, the server sends a couple of requests to the client from the handling thread, and at some point a different thread calls DaemonService.client_log_sink. The response for self._conn.root from the client is not processed because the other thread is still inside exposed_handle_cli holding the lock.
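
Roughly, the shape of that interaction (the DaemonService names are from flexget's ipc.py; the method bodies below are simplified assumptions for illustration, not the actual implementation):

import rpyc

class DaemonService(rpyc.Service):
    def on_connect(self, conn):
        self._conn = conn   # keep a handle to reach back to the client

    def exposed_handle_cli(self, args):
        # Server thread 1: invoked by the client.  While handling the call it
        # keeps sending requests back to the client (console output etc.), so
        # it effectively owns the connection's receive lock for the duration.
        self._conn.root.console("working ...")   # hypothetical client callback

    def client_log_sink(self, record):
        # Server thread 2: called concurrently to forward a log record.
        # self._conn.root is a sync request; its reply cannot be dispatched
        # while thread 1 still holds the receive lock, so this call hangs
        # until the timeout fires.
        self._conn.root.log(record)               # hypothetical client method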

It is not necessary to hold the lock for all paths of Connection._dispatch.

Even then, I just realized that the race was never truly fixed because it would require the lock to extend out to AsyncResult.wait as proposed. I remember losing interest arguing about it and instead focused on thread binding which doesn't suffer from this race. Truly sorry!

@notEvil

notEvil commented Mar 9, 2023

@comrumino
Collaborator

On 5.3.1, everything works fine with bind_threads on or off.

@gazpachoking yay!

It could be that we are using rpyc "wrong," but in a way that never caused issues until now. I was having a hell of a time debugging or simplifying the exact cause though.

@gazpachoking, I would rephrase "using rpyc wrong" as "not compensating for RPyC's shortcomings." A good library should be easy to use right and hard to use wrong. To be more specific, RPyC's connections/protocol tend to struggle around exceptions/threading/co-routines/locks, based on GH issues.

sometimes I wish issues were consistent xD

Me too xD. This is why I ask for unit tests to prevent regressions like this.

Sry for the spam!

@notEvil linking to benchmarking is not spam, it is a godsend 💯.

It is possible to fix the issue by using the release strategy proposed in #492

From what @gazpachoking stated, their issue is resolved. If you could be more specific and provide some tests or a client-server example, stable or not, please open a new issue. My biggest hesitation with #492 is my lack of clarity around why some changes were introduced, but I'll give it another look. Namely, changes around the struct are unclear and I could have given the work more time/attention. Given the back and forth around code changes in those areas, I'd like to expand unit tests to be able to prove the fix works and prevent regressions. Pull requests with unit tests are always welcome, but I tend to push back if it looks like there are changes unrelated to the fix when it comes to threading... mostly because threading in RPyC is problematic as is, so the bar for contribution is a bit higher.

It is not necessary to hold the lock for all paths of Connection._dispatch.

Yeah, I agree that the current locking implementation locks too many operations. Protecting the right data/operations is non-trivial, to say the least.

instead focused on thread binding which doesn't suffer from this race.

I think thread binding is the way forward. Thread binding seems like it might make it easier to acquire the lock for fewer operations and protect the right data/operations.

I remember losing interest arguing about it and instead focused on thread binding which doesn't suffer from this race. Truly sorry!

I tend to be more stubborn when there aren't unit tests to prove code works or minimal test cases to demonstrate a bug exists. From my perspective, unit tests and minimal test cases allow me to determine whether the issue is in the protocol or in how people are using RPyC. Of course, RPyC should be easier to use right and harder to use wrong, but it is not there yet.

My frustrations were with RPyC's threading implementation/behavior/bugs, not you @notEvil. The biggest design issues that RPyC currently has are related to threading and coroutines; most open issues are related to this topic. With all that said, @notEvil let's start a new issue around #492, because it seems there are some improvements we can make in that area.

@notEvil

notEvil commented Mar 14, 2023

@comrumino I agree, unit tests are important. Back then I thought the race was obvious and didn't require explanation. And if it wasn't for thread binding, I probably would've tried again after a few weeks.

Namely, changes around the struct are unclear

from PR message:

  • data is split to allow lock release as soon as possible

e.g. no need to hold the lock while unpacking the entire payload. It could be huge ... :) Again, I probably should have been more verbose about this.
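
As a minimal sketch of that idea (hypothetical framing with an assumed 4-byte length prefix, not rpyc's actual wire code): the bytes have to be pulled off the socket while holding the receive lock so messages stay ordered, but deserializing a possibly huge payload does not.

import struct
import threading

HEADER = struct.Struct("!I")   # assumed 4-byte length prefix for the sketch
recvlock = threading.Lock()

def _recv_exact(sock, n):
    # read exactly n bytes or fail; sockets may return short reads
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("connection closed")
        buf += chunk
    return buf

def recv_one(sock, deserialize):
    with recvlock:
        header = _recv_exact(sock, HEADER.size)
        (length,) = HEADER.unpack(header)
        payload = _recv_exact(sock, length)   # raw bytes only, lock held briefly
    # Lock released: the expensive deserialization runs without blocking other receivers.
    return deserialize(payload)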
