ZeroMQ IPC fails after a while #607

Closed
Kile opened this issue Jan 9, 2024 · 5 comments · Fixed by #624

Kile commented Jan 9, 2024

This has been an issue for a while. In a development environment, zeromq works perfectly fine; however, not long after a restart of the production code, zeromq requests start failing silently. This breaks vote rewards as well as all GET endpoints used by the website, rendering it nearly useless. Several fixes were attempted, but none have worked so far.

This issue occurs in these lines of code:
Server:

async def start(self):
    """Starts the zmq server asynchronously and handles incoming requests"""
    context = Context()
    auth = AsyncioAuthenticator(context)
    auth.start()
    auth.configure_plain(domain="*", passwords={"killua": IPC_TOKEN})
    auth.allow("127.0.0.1")

    socket = context.socket(ROUTER)
    socket.plain_server = True
    socket.bind("tcp://*:5555")

    poller = Poller()
    poller.register(socket, POLLIN)

    while True:
        socks = dict(await poller.poll())

        if socket in socks and socks[socket] == POLLIN:
            message = await socket.recv_multipart()
            try:
                identity, _, request = message  # Sometimes there may be an empty frame in the middle of the message
            except ValueError:
                identity, request = message
            decoded = loads(request.decode())
            res = await getattr(self, decoded["route"])(decoded["data"])
            if res:
                await socket.send_multipart([identity, dumps(res).encode()])
            else:
                await socket.send_multipart([identity, b'{"status":"ok"}'])
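The try/except in the server handles two frame layouts: a ROUTER socket always prepends the sender's identity, and some peers (REQ sockets, for example) also send an empty delimiter frame before the payload, while a DEALER peer does not. A minimal sketch of the same unpacking as an explicit check, where `split_router_frames` is a hypothetical helper name and not part of the project:

```python
from __future__ import annotations


def split_router_frames(frames: list[bytes]) -> tuple[bytes, bytes]:
    """Return (identity, payload) from a multipart message received on a ROUTER socket.

    Accepts [identity, payload] and [identity, b"", payload]; anything else
    is rejected instead of being unpacked into the wrong variables.
    """
    if len(frames) == 2:
        return frames[0], frames[1]
    if len(frames) == 3 and frames[1] == b"":
        # Drop the empty delimiter frame between identity and payload.
        return frames[0], frames[2]
    raise ValueError(f"unexpected frame layout: {len(frames)} frames")
```

Compared with catching `ValueError` from tuple unpacking, this also rejects a three-frame message whose middle frame is not empty.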

Client:

async def make_request(route: str, data: dict) -> dict:
    context = Context.instance()
    socket = context.socket(DEALER)
    socket.identity = uuid.uuid4().hex.encode('utf-8')
    socket.plain_username = b"killua"
    socket.plain_password = IPC_TOKEN.encode("UTF-8")
    socket.connect("tcp://localhost:5555")

    request = json.dumps({"route": route, "data": data}).encode('utf-8')
    socket.send(request)

    poller = Poller()
    poller.register(socket, POLLIN)

    while True:
        events = dict(await poller.poll())
        if socket in events and events[socket] == POLLIN:
            multipart = json.loads((await socket.recv_multipart())[0].decode())
            socket.close()
            context.term()
            return multipart
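Because the client loop polls without a timeout, a reply that never arrives shows up as a hang rather than an error. A minimal sketch of a bounded wait, assuming pyzmq's `zmq.asyncio` API; the function name `request_with_timeout`, the `endpoint` and `timeout_ms` parameters are hypothetical, and PLAIN authentication is left out to keep the example short:

```python
import asyncio
import json

import zmq
import zmq.asyncio


async def request_with_timeout(endpoint: str, route: str, data: dict,
                               timeout_ms: int = 1000) -> dict:
    """Send one request and raise TimeoutError if no reply arrives in time."""
    context = zmq.asyncio.Context.instance()
    socket = context.socket(zmq.DEALER)
    # Do not let an undelivered request block cleanup when the socket is closed.
    socket.setsockopt(zmq.LINGER, 0)
    socket.connect(endpoint)
    try:
        payload = json.dumps({"route": route, "data": data}).encode("utf-8")
        await socket.send(payload)
        # poll() resolves to 0 when the timeout expires without an event.
        if not await socket.poll(timeout_ms, zmq.POLLIN):
            raise TimeoutError(f"no reply from {endpoint} within {timeout_ms} ms")
        frames = await socket.recv_multipart()
        return json.loads(frames[-1].decode())
    finally:
        socket.close()
```

With this shape a stalled server produces a `TimeoutError` that can be logged or retried, for example `asyncio.run(request_with_timeout("tcp://localhost:5555", "ping", {}))`, where the `ping` route is only an illustration.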

I suspected this was because of too many open connections but I am not sure if this is the case and I seem to close all connections. This is the output of an lsof command when this issue occurred in production:

Because this is a long-running issue and is quite important for functionality, I am turning it into an issue to keep track of the progress.

I have also asked this Stack Overflow question in hopes of a fix.

@Kile Kile added the bug Something isn't working label Jan 9, 2024
@Kile Kile added this to the Version 1.0 milestone Jan 9, 2024
@Kile Kile self-assigned this Jan 9, 2024
@Kile Kile moved this to Todo in Killua 1.0 todos Jan 9, 2024

Kile commented Jan 9, 2024

This seems to be an issue with the API, not zeromq. I can still make zeromq requests internally, but requests through the API fail. I remember the API occasionally failing after a while even before I created the website; with the large number of additional requests, it seems to happen much faster. I am not sure why yet and will continue investigating.

Kile commented Jan 13, 2024

A few days ago I changed hypercorn to use 8 workers instead of 1, and this seems to have helped. The API has been running without issue for multiple days now.

Kile commented Jan 31, 2024

Sadly, this issue is not resolved. It is definitely a hypercorn issue: increasing the number of workers only delays the point at which the API starts timing out. I am looking into solutions.

@Kile Kile linked a pull request May 26, 2024 that will close this issue

Kile commented May 27, 2024

This may now be resolved. While rewriting this API in Rust, I believe I found the root cause of this issue with the help of @y21.

The root cause was that zeromq, in its default behaviour (an infinite linger period), can block cleanup at the end of a function. So when my make_request function ends and everything up until that point worked as expected, it tries to drop the variables but is blocked indefinitely.
This means no error is raised but the code freezes at a low level, which is insanely hard to trace.

It turns out this is default zmq behaviour, but thankfully there is a method to change it. So a simple one-liner fixes this:

socket.set_linger(0)

That's it. That is what I have been trying to find for 8 months. Hopefully this actually fixes it. I will keep this issue open for a bit; if I close it, that was it.
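For the Python client shown earlier in this issue, the corresponding setting in pyzmq would be the `LINGER` socket option. A minimal sketch, assuming pyzmq is installed; the helper name `send_and_close` and the endpoint are hypothetical, and the example deliberately connects to a port where nothing is listening so the message can never be delivered:

```python
import time

import zmq


def send_and_close(endpoint: str, payload: bytes, linger_ms: int = 0) -> float:
    """Queue one message, then close the socket and terminate the context.

    Returns the seconds spent in close() plus term(). With linger_ms=0 the
    undelivered message is discarded, so cleanup does not wait for a peer.
    """
    context = zmq.Context()
    socket = context.socket(zmq.DEALER)
    socket.setsockopt(zmq.LINGER, linger_ms)
    socket.connect(endpoint)
    # connect() does not wait for the peer, so the message is only queued here.
    socket.send(payload, zmq.DONTWAIT)
    started = time.monotonic()
    socket.close()
    context.term()
    return time.monotonic() - started


if __name__ == "__main__":
    elapsed = send_and_close("tcp://127.0.0.1:59999", b"ping")
    print(f"cleanup took {elapsed:.3f}s")
```

pyzmq also accepts the value at close time, as in `socket.close(linger=0)` or `context.destroy(linger=0)`, which may be more convenient when the socket is created elsewhere.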

Kile commented May 27, 2024

Looking through the Python implementation, this is a bit harder to see because the linger argument is passed to the underlying C implementation.

@Kile Kile closed this as completed in #624 Jul 19, 2024
@github-project-automation github-project-automation bot moved this from Todo to Done in Killua 1.0 todos Jul 19, 2024