Add timeout option on Client #46

Merged: 9 commits merged into master on Mar 1, 2019
Conversation

tmichela (Member)

No description provided.

@cydanil (Member) commented Nov 28, 2018

Thank you! Perhaps there could be a more descriptive error message. Otherwise, it's fine by me.

@codecov-io commented Nov 29, 2018

Codecov Report

Merging #46 into master will increase coverage by 0.6%.
The diff coverage is 100%.


@@            Coverage Diff            @@
##           master      #46     +/-   ##
=========================================
+ Coverage   76.44%   77.04%   +0.6%     
=========================================
  Files           6        6             
  Lines         467      488     +21     
=========================================
+ Hits          357      376     +19     
- Misses        110      112      +2
Impacted Files                 Coverage Δ
karabo_bridge/client.py        89.33% <100%> (+1.27%) ⬆️
karabo_bridge/cli/monitor.py   81.25% <0%> (+2.3%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 80a2361...f978e7d.

@tmichela (Member Author)

@cydanil The error message is in, plus a small unit test.

setup.py (review comments resolved, outdated)
karabo_bridge/client.py (review comments resolved, outdated)
@@ -66,10 +69,15 @@ def __init__(self, endpoint, sock='REQ', ser='msgpack'):
else:
raise NotImplementedError('socket is not supported:', str(sock))

if timeout is not None:
self._socket.setsockopt(zmq.RCVTIMEO, timeout)
self._recv_ready = False
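
For context, a minimal standalone sketch (not the karabo_bridge code itself) of what setting zmq.RCVTIMEO does: recv() raises zmq.error.Again when nothing arrives within the timeout, which zmq interprets in milliseconds. The endpoint and timeout value below are arbitrary placeholders.

import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)
sock.setsockopt(zmq.RCVTIMEO, 500)    # RCVTIMEO is in milliseconds
sock.connect('tcp://localhost:4545')  # nothing needs to be listening here

sock.send(b'next')
try:
    reply = sock.recv_multipart()
except zmq.error.Again:
    # This is the condition the new timeout option surfaces to the caller.
    # Presumably the _recv_ready flag in the diff records that a request is
    # still pending, so the client does not send again on the same REQ socket
    # (which would violate the REQ send/recv alternation).
    print('no data received within 500 ms')

sock.close(linger=0)
ctx.term()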
Member

I'm a bit wary of this extra state variable. I haven't thought it through in detail, but I think it might end up with both the server and the client thinking it's their turn to listen for a message, and it won't be obvious why nothing is getting sent.

I don't have a better idea in the short term, but maybe this is the real reason to use another socket type like PUSH/PULL - it avoids having client & server state which can get out of step.

tmichela (Member Author)

I also don't like it so much... but it has proven reliable, at least in use with karaboFAI.
I should spend more time playing with PUSH/PULL and other patterns, but PUSH/PULL also seems to bring some complications of its own:

  • properly handling message queues: since the messages are very large, they can easily crash client applications
  • PUSH/PULL doesn't allow load balancing

Member

I think we can constrain the memory usage easily enough in our client code by setting an HWM (high-water mark). I'm not sure what you mean about the load balancing. I'm thinking about how the bridge can work most efficiently; maybe this is something we can work on when I'm there next week.
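
For reference, a rough sketch of the HWM idea (not code from this PR or the bridge): cap how many messages a PULL socket will buffer locally. Note that RCVHWM counts messages, not bytes, and the sender can still queue messages on its side up to its own SNDHWM. The endpoint is a placeholder.

import zmq

ctx = zmq.Context()
pull = ctx.socket(zmq.PULL)
pull.setsockopt(zmq.RCVHWM, 1)       # set before connect: at most ~1 queued message
pull.connect('tcp://localhost:4545')

while True:
    parts = pull.recv_multipart()    # one large multipart message per train
    print('# parts:', len(parts))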

tmichela (Member Author)

The PUSH socket does not allow load balancing if you have several clients (PULL sockets): data is distributed in a round-robin manner, so a client that is slow or not ready will still receive data, which then queues up. That is generally bad if a client is only used once in a while to monitor what is in the data; in that case it still consumes one train in n even if it doesn't use anything. I'm also wondering what happens if one client is not consuming data and its input queue is full: does that block the whole interface? Also, for tools like onda, I believe the master process waits for processed trains/pulses to arrive in order, so if a slow worker still queues data at its input it can increase the delay quite a bit. A toy illustration of the round-robin behaviour is sketched below.
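
A self-contained toy sketch of that round-robin point (not code from this PR): one PUSH socket with two PULL sockets connected, where each PULL receives every other message whether or not it is actually processing them. The port and sleeps are arbitrary.

import time
import zmq

ctx = zmq.Context()
push = ctx.socket(zmq.PUSH)
push.bind('tcp://127.0.0.1:5555')

pull_a = ctx.socket(zmq.PULL)
pull_a.connect('tcp://127.0.0.1:5555')
pull_b = ctx.socket(zmq.PULL)
pull_b.connect('tcp://127.0.0.1:5555')
time.sleep(0.2)                       # let both connections establish

for i in range(4):
    push.send_string('train %d' % i)
time.sleep(0.2)

# Each PULL gets every other train (which one gets the even or odd trains
# depends on connection order), even if one of them were slow or idle.
print([pull_a.recv_string() for _ in range(2)])
print([pull_b.recv_string() for _ in range(2)])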

tmichela (Member Author)

Also, I never managed to make use of the HWM; if you have an example of it working, I'm highly interested!

tmichela (Member Author), Dec 7, 2018

Thanks,

I tried it quickly and it does seem to work! However, I ran into something weird: exiting a PULL socket while it was receiving data could crash the PUSH socket:

sent 177
Assertion failed: !more (src/lb.cpp:110)
zsh: abort (core dumped)  python pushpull.py push

The bug is known and fixed in libzmq 4.2.x: zeromq/libzmq#1588

I had to update to pyzmq>=17.1.0 to get a build against libzmq 4.2, but then the weird behaviour is: when closing the PULL socket while receiving data, the PUSH no longer crashes, but the next data received (after restarting a PULL) starts from the last part that was not sent:

 % python pushpull.py pull
RCVHWM 1
RCVBUF -1
# parts: 2
# parts: 2
# parts: 2
^C%
tmichela@machine ~/projects/tmp/pushpull
 % python pushpull.py pull
RCVHWM 1
RCVBUF -1
# parts: 1
# parts: 2
# parts: 2
# parts: 2

You can find the code for the example here.
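
The linked example itself is not included in this thread; below is a hypothetical minimal approximation of such a pushpull.py, matching the command lines and output shown above (a push|pull argument, RCVHWM 1, "# parts:" lines). The port and message size are made up.

import sys
import zmq

ctx = zmq.Context()

if sys.argv[1] == 'push':
    sock = ctx.socket(zmq.PUSH)
    sock.bind('tcp://*:5555')
    i = 0
    while True:
        # one large multipart message, loosely standing in for a train
        sock.send_multipart([str(i).encode(), b'x' * 10 * 1024 * 1024])
        print('sent', i)
        i += 1
else:
    sock = ctx.socket(zmq.PULL)
    sock.setsockopt(zmq.RCVHWM, 1)
    sock.connect('tcp://localhost:5555')
    print('RCVHWM', sock.getsockopt(zmq.RCVHWM))
    print('RCVBUF', sock.getsockopt(zmq.RCVBUF))
    while True:
        parts = sock.recv_multipart()
        print('# parts:', len(parts))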

tmichela (Member Author), Dec 7, 2018

It also seems that closing a PULL socket affects other PULL sockets:

 % python pushpull.py pull
RCVHWM 1
RCVBUF -1
9 - # parts: 2
10 - # parts: 3
11 - # parts: 3
12 - # parts: 3
13 - # parts: 3
14 - # parts: 3
16 - # parts: 3
19 - # parts: 3
20 - # parts: 3
23 - # parts: 3
25 - # parts: 3
26 - # parts: 2
27 - # parts: 3
29 - # parts: 3
31 - # parts: 2
32 - # parts: 3
33 - # parts: 3
34 - # parts: 3
36 - # parts: 3
38 - # parts: 2
39 - # parts: 3

Here I receive partial messages when another PULL socket is terminated.

Member

Shall we move this discussion to a separate issue? I think it's possible that's a bug with ZeroMQ or pyzmq.

Member

Opened an issue on pyzmq for it: zeromq/pyzmq#1244

tmichela (Member Author)

Thanks!

@takluyver (Member)

Do we have an idea how long it takes to transfer a train of stacked detector data? It would be good to document this, because you probably don't want to set the timeout lower than the time to transfer one message.

@tmichela (Member Author) commented Dec 6, 2018

> Do we have an idea how long it takes to transfer a train of stacked detector data? It would be good to document this, because you probably don't want to set the timeout lower than the time to transfer one message.

Yes. This depends on the kind of data; on the 10G network it should take roughly the times below (see the sketch after this list):

  • ~0.1s for 32 pulses
  • ~0.4s for 128 pulses
  • ~1s for 350 pulses
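
A rough usage sketch of how those numbers could inform the timeout. The exact unit and the exception raised on timeout depend on the final API; here the value is passed straight through to zmq.RCVTIMEO as in the diff above, i.e. milliseconds, and the endpoint is a placeholder.

from karabo_bridge import Client

# ~1 s to transfer a 350-pulse train on the 10G network, so leave some headroom
client = Client('tcp://bridge-host:4545', timeout=2000)

try:
    data = client.next()            # one train (return shape depends on the library version)
except Exception as exc:            # the PR adds a more descriptive error on timeout
    print('no train received within the timeout:', exc)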

@takluyver (Member)

👍 let's mention a couple of those times somewhere (docstring?)

@tmichela (Member Author) commented Mar 1, 2019

@takluyver Can I merge this? It will be required for one instrument.

@takluyver (Member)

Yes, go for it! Sorry, I forgot about this PR.

@takluyver (Member)

Closing/reopening to rerun the AppVeyor tests.

@takluyver closed this Mar 1, 2019
@takluyver reopened this Mar 1, 2019
@takluyver (Member)

Sorry about that - too many tabs open with pull requests, I was thinking of a different one. We don't have or need Appveyor tests.

@tmichela (Member Author) commented Mar 1, 2019

No worries, thanks!
I'll probably make a new release soon.

@tmichela merged commit 6889339 into master on Mar 1, 2019