Timeouts on newer RPyC versions #527
Script used to validate my changes:

#!/usr/bin/env bash
# Print a message wrapped in ANSI color codes.
function ansi_wrap_msg() { printf "\e[1;${1}m${2}\e[m"; }
function blue_msg() { ansi_wrap_msg "34" "$1"; }
function red_msg() { ansi_wrap_msg "31" "$1"; }

# Find the PID of the running flexget daemon.
function getpid() {
    printf "%s" "$(ps aux | grep 'versions/3[^/]*/bin/flexget' | grep -v vim | awk '{print $2}' | tr -d '[:space:]')"
}

# Run a task against the daemon; fail if it takes longer than 5 seconds.
function testflex() {
    timeout 5 flexget execute --tasks 'digest 1' && blue_msg "exit code $?: task-passed\n"
}

# Kill any leftover daemon, start a fresh one with a clean log, and run the test.
pid=$(getpid)
[[ "$pid" =~ ^[0-9]+$ ]] && kill -9 "$pid"
printf "" > "/tmp/rpyc.log"
flexget --loglevel DEBUG daemon start -d
testflex || red_msg "exit code $?: task-failed\n"
pid=$(getpid)
[[ "$pid" =~ ^[0-9]+$ ]] && kill -9 "$pid"
Thanks so much!
Thank you! You gave me a much needed test case. I plan to do a release this week. Maybe tomorrow 👍
Hi, since I was experiencing this class of issues a while ago and put some effort into fixing it (#492), I thought it would make sense to shed some extra light on this, because the "fix" actually unfixes a fixed issue. Consider thread A at Line 438 in ec5fbe5 and Connection.serve (Line 47 in ec5fbe5). The current "fix" contains the following path: ... If #492 led to Line 445 in 60608d4, ... on the other hand references a commit that isn't on the surface of any of the newer releases. So please revert the "fix" or tell me that I'm wrong :)

* I was wrong about that, sorry! There are still many ways to lock up rpyc in a multithreaded environment because of its default behavior: any thread may process any request. You could try #507 and see if it helps. It is experimental for a few reasons, but not because it is unreliable. With an actual project using/needing it (other than my own) I would go the extra mile and write documentation, tests, benchmarks, ...
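For anyone else landing here, a minimal sketch of what trying #507 looks like. It is not taken from this thread: that thread binding is switched on via a "bind_threads" protocol config key is an assumption based on this conversation (the comments below mention bind_threads on/off), and the host and port are placeholders.

```python
# Minimal sketch, assuming the experimental thread binding is enabled through a
# "bind_threads" protocol config key; host and port are placeholders.
import rpyc

conn = rpyc.connect(
    "localhost",
    18861,
    config={"bind_threads": True},
)
print(conn.root)  # remote root object; requests should now stay on bound thread pairs
conn.close()
```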
I was definitely on 5.2.1 and 5.3.0 release versions when experiencing the issue.
I actually did try that. On 5.3.0 it crashes with bind_threads on, and just locks up with it off.
On 5.3.1, everything works fine with bind_threads on or off.
Yes, I saw the code, didn't find it in the release commits, and wrongly deduced that you must have used it. My bad! 5.3.1 with bind_threads off works, until it doesn't... Originally, #492 proposed a more fine-grained release strategy for the lock, which might have covered your use case, or not, who knows. It's probably worth figuring out what causes the timeout using 5.3.0 and fixing the issue properly, if possible.
I'm assuming the cases where it stops working would be the same as in 5.1.0? We've been using that version (or older) for many years (and many users) with no issues. Not trying to deny that there are issues your fixes solve, just that there weren't any issues with the old behavior for our use case, while with the changes introduced in 5.2 (and reverted in 5.3.1) we have a consistent deadlock.
I agree. It could be that we are using rpyc "wrong," but in a way that never caused issues until now. I was having a hell of a time debugging or simplifying the exact cause though.
I get the sentiment. It worked for your use case, and now it doesn't. For me it was entirely different though. I ran into the race condition I described soon after I learned about and started using rpyc, and I was able to reliably reproduce it. Also, there are a significant number of issues opened, "fixed" and closed related to this race; just search for timeout, deadlock, race condition and the like, or look at the related commit history.
yes
sometimes I wish issues were consistent xD
I doubt it, there is no wrong. The fix for the race condition just buys correctness at the expense of multithreading-ness, which in general is a weak point of rpyc. Thread binding was implemented to provide both.
Good news! It is possible to fix the issue by using the release strategy proposed in #492. The client requests ... It is not necessary to hold the lock for all paths of ... Even then, I just realized that the race was never truly fixed, because it would require the lock to extend out to ...
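To make the release strategy concrete for readers who don't want to dig through #492, here is an illustrative sketch in plain Python. It is not RPyC's actual code and every name in it is invented; the idea is simply to scope the shared receive lock to the socket read and dispatch outside of it.

```python
# Illustrative sketch only, not rpyc's implementation. It shows the shape of a
# "release early" strategy: the lock covers the shared socket read, not dispatch.
import threading

class Channel:
    """Hypothetical message channel shared by several threads."""

    def __init__(self, transport):
        self._transport = transport         # object exposing a blocking recv_message()
        self._recv_lock = threading.Lock()  # serializes reads from the shared socket

    def serve_one(self):
        # Hold the lock only while pulling one message off the wire ...
        with self._recv_lock:
            message = self._transport.recv_message()
        # ... and dispatch outside the lock, so a thread waiting for its own
        # reply is never stuck behind the processing of someone else's request.
        self._dispatch(message)

    def _dispatch(self, message):
        raise NotImplementedError  # route replies / handle requests here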
Sorry for the spam! Just FYI: https://github.com/notEvil/rpyc/tree/benchmark/benchmark
@gazpachoking yay!
@gazpachoking, I would rephrase "using rpyc wrong" as "not compensating for RPyC's shortcomings." A good library should be easy to use right and hard to use wrong. To be more specific, RPyC's connections/protocol tend to struggle around exceptions/threading/coroutines/locks, based on GH issues.
Me too xD. This is why I ask for unit tests to prevent regressions like this.
@notEvil linking to benchmarking is not spam, it is a godsend 💯.
From what @gazpachoking stated, their issue is resolved. If you could be more specific and provide some tests or a client-server example, stable or not, please open a new issue. My biggest hesitation with #492 is my lack of clarity around why some of the changes were introduced, but I'll give it another look. Namely, the changes around the struct are unclear, and I could have given the work more time/attention. Given the back and forth around code changes in those areas, I'd like to expand the unit tests to be able to prove the fix works and prevent regressions. Pull requests with unit tests are always welcome, but I tend to push back if it looks like there are changes unrelated to the fix when it comes to threading... mostly because threading in RPyC is problematic as is, so the bar for contribution is a bit higher.
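For anyone who wants to pick that up, the kind of minimal client-server example being asked for could start from a skeleton like the one below. The service name, port, and exercised call are placeholders, and it is not claimed to reproduce the timeout on its own; it would need to be extended with the threading pattern that triggers it.

```python
# Server-side skeleton for a minimal reproduction attempt.
# Names and the port are placeholders.
import rpyc
from rpyc.utils.server import ThreadedServer

class EchoService(rpyc.Service):
    def exposed_echo(self, value):
        return value

if __name__ == "__main__":
    ThreadedServer(EchoService, port=18861).start()  # blocks until stopped
```

```python
# Client side, run in a separate process.
import rpyc

conn = rpyc.connect("localhost", 18861)
assert conn.root.echo("ping") == "ping"
conn.close()
```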
Yeah, I agree that the current locking implementation locks too many operations. Protecting the right data/operations is non-trivial, to say the least.
I think thread binding is the way forward. It seems like it might make it easier to hold the lock for fewer operations and protect the right data/operations.
I tend to be more stubborn when there aren't unit tests to prove code works or minimal test cases to demonstrate a bug exists. From my perspective, unit tests and minimal test cases allow me to determine whether the issue is in the protocol or in how people are using RPyC. Of course, RPyC should be easier to use right and harder to use wrong, but it is not there yet. My frustrations were with RPyC's threading implementation/behavior/bugs, not you @notEvil. The biggest design issues that RPyC currently has are related to threading and coroutines; most open issues are related to this topic. With all that said, @notEvil, let's start a new issue around #492, because it seems there are some improvements we can make in that area.
@comrumino I agree, unit tests are important. Back then I thought the race was obvious and didn't require explanation. And if it wasn't for thread binding, I probably would've tried again after a few weeks. From the PR message:
"e.g. no need to hold the lock while unpacking the entire payload. It could be huge ..." :) Again, I probably should have been more verbose about this.
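As an illustration of that point, here is a sketch in plain Python. It is not RPyC's code; the 4-byte length-prefixed framing is invented for the example. The lock only needs to cover reading the header and the raw bytes off the shared socket, while deserializing a potentially huge payload can happen after the lock is released.

```python
# Illustration of "no need to hold the lock while unpacking the entire payload".
# The framing (4-byte big-endian length prefix) is an assumption for the example.
import struct
import threading

recv_lock = threading.Lock()

def _recv_exact(sock, n):
    """Read exactly n bytes from the socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("socket closed")
        buf += chunk
    return buf

def recv_frame(sock, deserialize):
    # Only the reads from the shared socket need to be serialized ...
    with recv_lock:
        (length,) = struct.unpack("!I", _recv_exact(sock, 4))
        raw = _recv_exact(sock, length)
    # ... unpacking the payload (which could be huge) happens without the lock.
    return deserialize(raw)
```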
I'm having an issue on RPyC versions 5.2.1 and above which we didn't have earlier (Flexget/Flexget#3601 for reference). It seems that after this change, 94dbe36, we started hitting the timeout here:

rpyc/rpyc/core/protocol.py, Line 387 in 94dbe36

The swap I tried is at:

rpyc/rpyc/core/protocol.py, Lines 445 to 446 in 18a44e5

I didn't open a PR, because I have no idea of the intent of these things, or whether doing this swap breaks something somewhere else. This could also be an issue with how I'm using RPyC, because it's not quite standard, and I'm having trouble creating a minimal example to reproduce. I'm hoping maybe there's something obvious to someone else about what's going wrong, or that the fix I've done is obviously correct to someone in the know. If not, I'll keep banging away to dig further and try to make a minimal reproduction.
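For readers trying to picture the failure mode described above, here is a self-contained illustration in plain Python. It is not RPyC's code, and the sleep durations and timeout are arbitrary stand-ins: one thread holds a shared receive lock while it blocks, and a second thread's timed acquire gives up, surfacing as a timeout.

```python
# Self-contained illustration of how lock contention can surface as a timeout.
# Not rpyc code; the sleep durations and the timeout are arbitrary stand-ins.
import threading
import time

recv_lock = threading.Lock()

def slow_holder():
    with recv_lock:
        time.sleep(5)  # pretend to block on the socket while serving a request

def impatient_caller():
    # A timed acquire gives up before the holder releases the lock.
    if not recv_lock.acquire(timeout=2):
        raise TimeoutError("timed out waiting for the receive lock")
    try:
        pass  # the reply would be read here
    finally:
        recv_lock.release()

holder = threading.Thread(target=slow_holder)
holder.start()
time.sleep(0.1)  # make sure the holder grabs the lock first
try:
    impatient_caller()
except TimeoutError as exc:
    print(f"caller failed: {exc}")  # expected after about 2 seconds
finally:
    holder.join()
```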