
irrd-whois-worker process doesn't timeout long running open connections / queries #693

Closed
smbambling opened this issue Nov 7, 2022 · 6 comments


@smbambling

Describe the bug
If a connection is opened to the IRRd service and no query data is sent for a period of time, the irrd-whois-worker process handling it is put into a 'hung' state and is no longer available to process queries passed to it from the irrd-whois-server-listener process.

Tracing these 'hung' PIDs with strace shows that they still make some basic system calls, but they never appear to process any data. Once all of the irrd-whois-worker processes are in this state, IRRd can no longer process any requests and the service becomes unavailable.

sudo strace -p ${PID} -c -f -e trace=network -s 10000
sudo strace -fp ${PID}

We are seeing traffic from clients (client unknown) that open a connection but send an empty query, or none at all. These interactions are not logged by IRRd and cause the irrd-whois-worker process to be put into a 'hung' state.

In the process table, irrd-whois-worker processes have the client address and port appended to their name when a connection is opened, and they remain in this state after becoming 'hung':

root     14896  0.0  0.3 2321592 53788 ?       Sl   Oct24   1:00              |   \_ irrd-whois-worker-::ffff:34.141.189.20:60596
root     14898  0.1  0.3 2400476 59396 ?       Sl   Oct24   3:46              |   \_ irrd-whois-worker-::ffff:34.141.189.20:2959
root     14899  0.0  0.3 258268 51772 ?        Sl   Oct24   0:06              |   \_ irrd-whois-worker-::ffff:34.141.189.20:29292
root     14900  0.1  0.3 2399760 58912 ?       Sl   Oct24   4:38              |   \_ irrd-whois-worker-::ffff:34.141.189.20:59075
root     14901  0.1  0.3 2472444 57668 ?       Sl   Oct24   2:39              |   \_ irrd-whois-worker-::ffff:34.141.189.20:28824
root     14905  0.2  0.3 2401256 60084 ?       Sl   Oct24   5:54              |   \_ irrd-whois-worker-::ffff:34.141.189.20:30111
root     14907  1.0  0.4 2481232 66396 ?       Sl   Oct24  23:55              |   \_ irrd-whois-worker-::ffff:34.141.189.20:55332
root     14908  0.1  0.3 2474664 60176 ?       Sl   Oct24   3:32              |   \_ irrd-whois-worker-::ffff:34.141.189.20:29167
root     14910  0.4  0.3 2410128 61652 ?       Sl   Oct24  10:48              |   \_ irrd-whois-worker-::ffff:34.90.66.217:38371
root     14911  0.1  0.3 2323684 56408 ?       Sl   Oct24   3:44              |   \_ irrd-whois-worker-::ffff:34.141.189.20:2758

To Reproduce
The nc or telnet utilities can be used to open a connection to an IRRd instance on port 43 without sending a query. Letting this connection sit for several minutes puts the irrd-whois-worker process into the 'hung' state. The worker can be observed in this state even after the client closes the connection to the IRRd instance, for an undetermined period of time.

nc IRRD.HOST.NAME 43

The exact time a connection needs to stay open to trigger this issue is still undetermined, but it is longer than the 30s timeout configured within the application.
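The reproduction can also be sketched in Python instead of nc (a stand-in listener on localhost makes the sketch self-contained and the timings short; against a real IRRd instance you would connect to port 43 and hold the connection open for several minutes):

```python
import socket
import threading
import time

# Stand-in listener so the sketch runs without a real IRRd instance.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))
listener.listen(1)
host, port = listener.getsockname()

def idle_client():
    # Equivalent of `nc IRRD.HOST.NAME 43`: connect and send nothing.
    with socket.create_connection((host, port)) as sock:
        time.sleep(1)  # idle: no query is ever sent

t = threading.Thread(target=idle_client)
t.start()
conn, _ = listener.accept()
t.join()  # the client has disconnected; a hung worker would still not recover
conn.close()
listener.close()
```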

Expected behaviour
The irrd-whois-worker process should time out idle connections and become available again to process queries.

IRRd version you are running
Version 4.2.5, but the behavior was also observed on version 4.1.8 prior to an upgrade.

Additional context
Add any other context about the problem here.

@smbambling smbambling changed the title irrd-whois-worker process don't timeout long running open connections / queries irrd-whois-worker process doesn't timeout long running open connections / queries Nov 7, 2022
@mxsasha
Collaborator

mxsasha commented Nov 15, 2022

Summary: I think this issue is a variant of #607 - where that looked like just a cleanup at the time, it seems this can actually cause the issue you have been seeing. If this is the same issue, you should be seeing a fair number of connections in FIN-WAIT-2 that stay stuck there - and a similar strace. If that's right, 0b3c7d1 already fixes the problem, and I can release a new 4.2 soon.

Small strace from all workers and the handler (15549) after it got stuck, when I made a new connection:

# strace -p 15554,15555,15549,15552
strace: Process 15554 attached
strace: Process 15555 attached
strace: Process 15549 attached
strace: Process 15552 attached
[pid 15552] recvfrom(17,  <unfinished ...>
[pid 15549] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 15555] recvfrom(19,  <unfinished ...>
[pid 15554] recvfrom(18,  <unfinished ...>
[pid 15549] <... restart_syscall resumed> ) = 0
[pid 15549] poll([{fd=7, events=POLLIN}], 1, 500) = 0 (Timeout)
...
[pid 15549] poll([{fd=7, events=POLLIN}], 1, 500) = 0 (Timeout)
[pid 15549] poll([{fd=7, events=POLLIN}], 1, 500) = 1 ([{fd=7, revents=POLLIN}])
[pid 15549] accept4(7, {sa_family=AF_INET6, sin6_port=htons(46990), inet_pton(AF_INET6, "<client>", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, [28], SOCK_CLOEXEC) = 14
[pid 15549] futex(0x7f11f8001430, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 15549] futex(0x8a5ba4, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 15549] futex(0x8a5ba8, FUTEX_WAKE_PRIVATE, 1) = 1
[pid 15549] poll([{fd=7, events=POLLIN}], 1, 500) = 0 (Timeout)

So the handler is still picking up connections, dropping them off on the queue, but no workers will pick them up - they are all blocked on receiving data from a socket.
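The failure mode can be sketched like this (a hedged stand-in, not IRRd's actual code; socketpairs play the role of accepted whois connections):

```python
import queue
import socket
import threading

# A handler keeps accepting and queueing connections, but every worker
# is blocked in recv() on an earlier idle connection, so the queued
# connections are never picked up.
pending = queue.Queue()
worker_busy = threading.Event()

def worker():
    conn = pending.get()
    worker_busy.set()
    conn.recv(1024)  # blocks: the idle peer never sends anything

# The single worker immediately consumes an idle connection and blocks.
idle_server, idle_client = socket.socketpair()
pending.put(idle_server)
threading.Thread(target=worker, daemon=True).start()
worker_busy.wait()

# The handler keeps dropping new connections on the queue; with all
# workers blocked, they just accumulate.
for _ in range(3):
    server_side, client_side = socket.socketpair()
    pending.put(server_side)

print(pending.qsize())  # → 3: the new connections sit unserviced
```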

For worker 15552, this is the blocked socket:

tcp    FIN-WAIT-2 0      0        ::ffff:167.99.217.97:whois                 ::ffff:<client>:54829                 users:(("irrd-whois-work",pid=15552,fd=17)) uid:1001 ino:247372867 sk:a1 <--

A stack dump reveals:

File: "/home/irrd/irrd-venv/lib/python3.7/site-packages/irrd/server/whois/server.py", line 161, in run
  self.handle_connection()
File: "/home/irrd/irrd-venv/lib/python3.7/site-packages/irrd/server/whois/server.py", line 205, in handle_connection
  data = self.rfile.readline()
File: "/usr/lib/python3.7/socket.py", line 589, in readinto
  return self._sock.recv_into(b)

The readline() call does not appear to return even after the 30 second timer has expired and the connection has been forcibly closed from our end. The only way to break out of that is setting a timeout, as we now do in 0b3c7d1.
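The effect of the fix can be sketched as follows (a socketpair stands in for the accepted whois connection; the 2-second value is illustrative, not the timeout 0b3c7d1 actually sets):

```python
import socket

# Without a timeout, readline() on the socket's file object blocks
# forever when the peer sends nothing; settimeout() makes the
# underlying recv_into() raise socket.timeout instead.
server_side, client_side = socket.socketpair()
server_side.settimeout(2)  # illustrative value
rfile = server_side.makefile("rb")

try:
    data = rfile.readline()  # would block indefinitely without the timeout
    result = "data"
except socket.timeout:
    result = "timed out"

print(result)  # → timed out
server_side.close()
client_side.close()
```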

@dajeffers

netstat -tpn does not reveal any stuck FIN-WAIT-2 connections, nor does the client address appear anywhere in the connection list. This isn't to say that setting a default socket timeout won't resolve the issue, but I'm not sure it's the same behavior you're describing. Instead of a blocked socket, it seems more like a stuck worker that won't release its client. As more of these empty-query connections come in, more and more workers are consumed, until all workers are stuck and no actual queries can be processed. The strace on the worker looks like:

strace -p 27569
strace: Process 27569 attached
recvfrom(45, 

with no movement at all. If there's further data we can provide, let us know.

@mxsasha
Collaborator

mxsasha commented Nov 15, 2022

recvfrom(45,

Can you see what ss -ap says about fd 45 of that process?
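If ss doesn't show it, /proc can also reveal what the descriptor points to. A minimal sketch (assumes Linux /proc; demonstrated against our own pid with a fresh socket — for the stuck worker, substitute its pid and fd 45):

```python
import os
import socket

# On Linux, /proc/<pid>/fd/<n> is a symlink describing the fd. For a
# socket it reads "socket:[<inode>]"; that inode can then be searched
# for in ss or lsof output.
sock = socket.socket()
target = os.readlink(f"/proc/{os.getpid()}/fd/{sock.fileno()}")
print(target)  # e.g. "socket:[247372867]"
sock.close()
```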

@dajeffers

The PID assigned to the stuck worker is nowhere to be found in the ss -ap output, and neither is file descriptor 45. We've looked for both the PID and the client address in the ss/netstat output to no avail; there is no evidence of an existing connection, despite the worker thinking there is one.

@mxsasha
Collaborator

mxsasha commented Nov 16, 2022

That is very peculiar, but as your case definitely stays stuck on the same recvfrom, I do think setting the socket timeout is likely to resolve this. I'll include it in a 4.2.6 release, probably in the next few days.

@mxsasha
Collaborator

mxsasha commented Nov 18, 2022
