conmand takes a long time to connect to a newly created UNIX socket #28

Closed · e4t opened this issue Jul 18, 2018 · 12 comments

e4t commented Jul 18, 2018

If a UNIX socket is created after conmand has started, there may be a long delay before conmand connects to it.
Conmand watches for the appearance of new sockets using inotify. However, inotify triggers (and causes poll() to return) as soon as the side creating the socket has bind()ed to it. At that point it is not yet possible to connect() - that only becomes possible once the other side has called listen(). Thus the connect() triggered by the inotify event will most likely fail, and conmand will return to poll(), waiting for further events or a timeout. The timeouts are increased each time they expire, up to a current maximum of 3 minutes.
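
To make the failure mode concrete, here is a minimal, self-contained sketch (not conman code; the socket path is made up for the example). It binds a UNIX socket and shows that a connect() issued after bind() but before listen() is refused, while the same connect() succeeds once listen() has been called:

/* Minimal sketch (not conman code) showing that a UNIX stream socket
 * refuses connections in the window between bind() and listen(). */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(void)
{
    const char *path = "/tmp/bindrace.sock";   /* arbitrary example path */
    struct sockaddr_un sa = { .sun_family = AF_UNIX };
    strncpy(sa.sun_path, path, sizeof(sa.sun_path) - 1);
    unlink(path);

    int srv = socket(AF_UNIX, SOCK_STREAM, 0);
    bind(srv, (struct sockaddr *)&sa, sizeof(sa));     /* inotify fires now */

    int cli = socket(AF_UNIX, SOCK_STREAM, 0);
    if (connect(cli, (struct sockaddr *)&sa, sizeof(sa)) < 0)
        printf("after bind(), before listen(): %s\n", strerror(errno));
        /* -> Connection refused, exactly what conmand runs into */

    listen(srv, 5);
    close(cli);
    cli = socket(AF_UNIX, SOCK_STREAM, 0);
    if (connect(cli, (struct sockaddr *)&sa, sizeof(sa)) == 0)
        printf("after listen(): connected\n");         /* queued in backlog */

    close(cli);
    close(srv);
    unlink(path);
    return 0;
}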

dun commented Jul 21, 2018

Just a quick ack... I'm hoping to have time to look at this next week.

e4t commented Jul 21, 2018

Thanks, Chris!
For reference, here is a description of how to reproduce this quickly:

  1. Open three root consoles.

  2. Add a configuration for a UNIX socket:
    # echo 'console name="socket1" dev="unix:/tmp/testsock"' >> /etc/conman.conf

  3. On console 1, start conmand:
    # /usr/sbin/conmand -c /etc/conman.conf -F

  4. On console 2, monitor the socket:
    # conman socket1

  5. Wait a minute, then connect to the socket on console 3:
    # netcat -ClU /tmp/testsock

  6. Enter something on console 3 and observe the delay before it appears on console 2.

dun commented Jul 26, 2018

Thanks for taking the time to run this down and write up a detailed analysis. I wasn't able to reproduce this with netcat as you show, so I wrote a little server where I could introduce various delays to trigger it. I restructured things a bit in 138e1e2 (uds-connect-via-inotify-fails). It passes my testing. Can you look at it to make sure it works in your setup, and that you're ok with the changes I made?
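
For anyone else trying to reproduce this deterministically, a small stand-in for such a test server might look like the sketch below (this is not the server dun used; the socket path and default delay are arbitrary). It bind()s the socket, sleeps for a configurable number of seconds before calling listen(), and then echoes stdin to the first client, so it can substitute for netcat in step 5 of the reproduction above:

/* Hypothetical delayed-listen test server; not dun's actual test program.
 * Usage: ./delayed-listen [seconds-between-bind-and-listen]             */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

int main(int argc, char **argv)
{
    const char *path = "/tmp/testsock";
    unsigned delay = (argc > 1) ? (unsigned)atoi(argv[1]) : 5;

    unlink(path);
    int s = socket(AF_UNIX, SOCK_STREAM, 0);
    struct sockaddr_un sa = { .sun_family = AF_UNIX };
    strncpy(sa.sun_path, path, sizeof(sa.sun_path) - 1);

    if (bind(s, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        perror("bind");
        return 1;
    }
    sleep(delay);                   /* window in which connect() is refused */
    listen(s, 5);

    int c = accept(s, NULL, NULL);  /* conmand should connect here */
    char buf[256];
    ssize_t n;
    while ((n = read(STDIN_FILENO, buf, sizeof(buf))) > 0)
        write(c, buf, (size_t)n);   /* type here, watch it on "conman socket1" */

    close(c);
    close(s);
    unlink(path);
    return 0;
}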

BTW, the max timeout for UNIX domain socket connections is 60s. Still, that's a long time to wait when you're triggered by inotify.

e4t commented Jul 27, 2018

Thanks for looking into this. I believe I've found out why you are not seeing this issue with the simple test case above - and I should have thought of this earlier: the VM I've been using for testing had just a single socket/core. After increasing the number of cores, the problem no longer showed up most of the time.

For reference, let me attach the strace -ttt outputs of steps 3 and 5 for a single CPU and for 4 CPUs:

Single CPU

  • Step 3.
...
1532671822.691379 poll([{fd=-1}, {fd=-1}, {fd=-1}, {fd=3, events=POLLIN}, {fd=-1}, {fd=-1}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=9, events=POLLIN}], 10, 16000) = 1 ([{fd=6, revents=POLLIN}])
1532671827.714178 read(6, "\1\0\0\0\0\1\0\0\0\0\0\0\20\0\0\0testsock\0\0\0\0\0\0\0\0", 4096) = 32
1532671827.714216 stat("/tmp/testsock", {st_mode=S_IFSOCK|0755, st_size=0, ...}) = 0
1532671827.714314 socket(PF_LOCAL, SOCK_STREAM, 0) = 10
1532671827.714348 fcntl(10, F_GETFL)    = 0x2 (flags O_RDWR)
1532671827.714426 fcntl(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0 
1532671827.714455 fcntl(10, F_SETFD, FD_CLOEXEC) = 0
1532671827.714480 connect(10, {sa_family=AF_LOCAL,  sun_path="/tmp/testsock"}, 110) = -1 ECONNREFUSED (Connection refused)
1532671827.714587 write(2, "INFO:      Console [socket1] can"..., 90INFO:      Console [socket1] cannot connect to device "/tmp/testsock": Connection refused
) = 90
...
  • Step 5.
...
1532671827.713970 socket(PF_LOCAL, SOCK_STREAM, 0) = 3
1532671827.714004 bind(3, {sa_family=AF_LOCAL, sun_path="/tmp/testsock"}, 15) = 0
1532671827.714942 listen(3, 5)          = 0
1532671827.714968 accept(3,
...

From the timestamps one can see that poll() returns because of the inotify event right after the bind() call. This is consistent with the inotify man page. However, the subsequent connect() fails. It would only succeed after the other process has called listen(), but that process gets scheduled only once conmand goes back into poll() to wait for further events.

Multiple CPUs

  • Step 3.
1532674082.199409 poll([{fd=-1}, {fd=-1}, {fd=-1}, {fd=3, events=POLLIN}, {fd=-1}, {fd=-1}, {fd=6, events=POLLIN}, {fd=7, events=POLLIN}, {fd=-1}, {fd=9, events=POLLIN}], 10, 16000) = 1 ([{fd=6, revents=POLLIN}])
1532674089.944931 read(6, "\1\0\0\0\0\1\0\0\0\0\0\0\20\0\0\0testsock\0\0\0\0\0\0\0\0", 4096) = 32
1532674089.944976 stat("/tmp/testsock", {st_mode=S_IFSOCK|0755, st_size=0, ...}) = 0
1532674089.945021 socket(PF_LOCAL, SOCK_STREAM, 0) = 10
1532674089.945061 fcntl(10, F_GETFL)    = 0x2 (flags O_RDWR)
1532674089.945098 fcntl(10, F_SETFL, O_RDWR|O_NONBLOCK) = 0
1532674089.945131 fcntl(10, F_SETFD, FD_CLOEXEC) = 0
1532674089.945165 connect(10, {sa_family=AF_LOCAL, sun_path="/tmp/testsock"}, 110) = 0
1532674089.945236 write(2, "INFO:      Console [socket1] con"..., 58INFO:      Console [socket1] connected to "/tmp/testsock"
) = 58
  • Step 5.
1532674089.944609 socket(PF_LOCAL, SOCK_STREAM, 0) = 3
1532674089.944749 bind(3, {sa_family=AF_LOCAL, sun_path="/tmp/testsock"}, 15) = 0
1532674089.944945 listen(3, 5)          = 0
1532674089.945072 accept(3, {sa_family=AF_LOCAL, NULL}, [2]) = 4

Here it is pretty obvious that conmand and netcat are running on different cores, so conmand does not have to be scheduled away in order for netcat to run and reach listen().
The more CPUs a machine has, the less likely this issue is to occur; however, it still depends on how busy the machine is at the time.

I will grab your updated fix and test it here.

Thanks for looking into this!

e4t commented Jul 27, 2018

Thanks for the fix - I've tested it: it works - of course!
Your patch only resets auxp->isViaInotify when the connect() fails. This may leave the state data inconsistent. If you prefer to go the route using a state variable, I'd suggest adding e4t/conman@ea0312a to reset this flag as soon as possible in the intended consumer. I don't see much harm in not doing it, but it may make people wonder how long this flag is actually needed.
Potentially, there are other code paths which could render this flag inconsistent, and future changes may add more.
I had considered using the state structure for this purpose as well, but decided against it to avoid such inconsistencies and instead passed the flag along as function arguments - as ugly as that looks.
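
To make that suggestion concrete, here is a minimal sketch of the "consume the flag early" pattern (the struct and function names below are hypothetical, not conman's actual code): the consumer copies the flag into a local variable and clears the stored one immediately, so no later code path can observe a stale value.

/* Hypothetical sketch of the "consume the flag early" pattern; the struct
 * and function names are made up and are not conman's actual code. */
#include <stdbool.h>
#include <stdio.h>

struct sock_state {
    bool is_via_inotify;            /* set by the inotify event handler */
};

static void handle_connect(struct sock_state *sp)
{
    bool via_inotify = sp->is_via_inotify;
    sp->is_via_inotify = false;     /* clear immediately after reading, so no
                                       later path can see a stale value */
    if (via_inotify)
        puts("connect attempt was triggered by an inotify event");
    /* ... the actual connect() logic would follow here ... */
}

int main(void)
{
    struct sock_state st = { .is_via_inotify = true };
    handle_connect(&st);            /* consumes and resets the flag */
    handle_connect(&st);            /* a subsequent call sees it cleared */
    return 0;
}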

dun commented Jul 28, 2018

Whoops... I forgot to reset the flag in disconnect_unixsock_obj(), but clearing it immediately in connect_unixsock_obj() is fine. Thanks for the catch!

I was thinking the state variable approach would be a bit easier to follow.

I'll try to tag a release in the next couple of weeks to get this fix out. There's a bit of cleanup I'd like to do if I can set aside some time.

dun added the bug label on Jul 28, 2018
dun added this to the 0.3.0 milestone on Jul 28, 2018
dun closed this as completed in 8f5bd09 on Jul 28, 2018

e4t commented Aug 23, 2018

@dun Chris, thanks for taking care of this!

dun commented Aug 23, 2018

@e4t I'll get this into the next release. I'm in the process of cleaning out the old autoconf cruft. I'm planning to tag & test once that's done. Thanks again for your PR.

e4t commented Aug 29, 2018

Chris, thanks for letting me know - I've just pushed the patch into a maintenance update for SUSE SLES / openSUSE Leap.
I wonder if I should retract it before it goes to QA and wait for the next release instead.

dun commented Aug 29, 2018

I expect to release by Sep 15. That should still give me ample time, but I've had to deal with other issues this past week that are still ongoing. The only thing going into this release besides your patch is the autotools cleanup.

dun commented Sep 15, 2018

@e4t: My autotools cleanup isn't quite ready, so it'll have to wait 'til next time. I've just released 0.3.0 with your patch. Thanks for your patience.

e4t commented Sep 16, 2018

Thanks for letting me know, Chris! Will update the package ASAP.
