Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPC sockets do not reconnect to active bind #2764

Closed
Kentzo opened this issue Oct 3, 2017 · 12 comments
Closed

IPC sockets do not reconnect to active bind #2764

Kentzo opened this issue Oct 3, 2017 · 12 comments

Comments

@Kentzo
Copy link
Contributor

Kentzo commented Oct 3, 2017

Please use this template for reporting suspected bugs or requests for help.

Issue description

Environment

  • libzmq version (commit hash if unreleased): 4.1.6
  • OS: macOS 10.12.6
  • pyzmq: 16.0.2

Minimal test code / Steps to reproduce the issue

Here are 2 python files:

# pub.py

import zmq
import time

c = zmq.Context.instance()
s = c.socket(zmq.PUB)
s.bind('ipc:///tmp/socket')
time.sleep(10)  # helps to run 2 processes simultaneously
s.send(b'42')
# sub.py

import zmq

c = zmq.Context.instance()
s = c.socket(zmq.SUB)
s.subscribe('')
s.connect('ipc:///tmp/socket')

while True:
    print(s.recv())

To reproduce the issue:

  1. Run sub.py
  2. Run 2 pub.py simultaneously

What's the actual result? (include assertion message & call stack if applicable)

sub.py received only one message (expected), but subsequent calls to pub.py will not cause sub.py to receive anything.

What's the expected result?

sub.py should notice that previously connected socket is closed, reconnect, and be able to receive messages from subsequent pub.py.

@bluca
Copy link
Member

bluca commented Oct 3, 2017

Sounds similar to #2763 can't remember right now, but I think AF_UNIX sockets have the same behaviour as AF_INET w.r.t. dead peer detection.

Could you please try with the ZMQ_HEARTBEAT* options?

@Kentzo
Copy link
Contributor Author

Kentzo commented Oct 3, 2017

@bluca I'll try that. But shouldn't 0-length read from a socket indicate that remote is closed?

While debugging I also came a across that ZeroMQ allows the same process to bind to the same IPC address twice. Not sure that should be allowed.

@bluca
Copy link
Member

bluca commented Oct 3, 2017

Yes 0 length read closes the socket: https://github.com/zeromq/libzmq/blob/master/src/stream_engine.cpp#L316

The problem is the kernel won't give you that in a lot of cases. For AF_UNIX as I mentioned I can't remember what's the exact behaviour right now.

Regarding double binding, it's because as it is customs the library will unlink the socket file before binding. That's because those files are never automatically cleaned up - that's just how AF_UNIX works I'm afraid.

@Kentzo
Copy link
Contributor Author

Kentzo commented Oct 3, 2017

I wonder if that problem also happens due to unlinking. I just tried the same test, but instead of running 2 parallel processes I bound 2 sockets to the same address around a single connect.

Do you think socket should consider "unlink" as close (e.g. can be implemented via inode)?

Regarding double binding, it's because as it is customs the library will unlink the socket file before binding. That's because those files are never automatically cleaned up - that's just how AF_UNIX works I'm afraid.

Perhaps it can be solved at least at the context level by having a repo of in-use sockets?

@Kentzo
Copy link
Contributor Author

Kentzo commented Oct 3, 2017

I wonder if that problem also happens due to unlinking.

Well, at least plain sockets handle this situation properly:

import socket
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.bind('/tmp/socket')
s.listen(1)
client, addr = s.accept()
client.send(b'123)
os.unlink('/tmp/socket')
client.send(b'456')
client.close()
import socket
s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect('/tmp/socket')
s.recv(16)  # b'123'
s.recv(16)  # b'456'
s.recv(16)  # b''

@Kentzo
Copy link
Contributor Author

Kentzo commented Oct 3, 2017

Here is the test that uses C API:
#include <fstream>
#include <sys/stat.h>

#include "testutil.hpp"

static const char* SOCKET_PATH = "/tmp/tester";
static const char* SOCKET_ADDR = "ipc:///tmp/tester";


void test_close_and_unlink()
{
    int timeout = 100;

    void *ctx = zmq_ctx_new ();
    assert (ctx);

    void *sb = zmq_socket (ctx, ZMQ_PUSH);
    assert (sb);
    int rc = zmq_setsockopt (sb, ZMQ_SNDTIMEO, &timeout, sizeof (int));
    assert (rc == 0);
    rc = zmq_bind (sb, SOCKET_ADDR);
    assert (rc == 0);

    void *sc = zmq_socket (ctx, ZMQ_PULL);
    assert (sc);
    rc = zmq_setsockopt (sc, ZMQ_RCVTIMEO, &timeout, sizeof (int));
    assert (rc == 0);
    rc = zmq_connect (sc, SOCKET_ADDR);
    assert (rc == 0);

    rc = zmq_send (sb, "42", 2, 0);
    assert (rc == 2);

    char buffer [2];
    rc = zmq_recv(sc, buffer, 2, 0);
    assert (rc == 2);

    rc = zmq_close (sb);
    assert (rc == 0);

    assert (!std::ifstream(SOCKET_PATH).good());

    sb = zmq_socket (ctx, ZMQ_PUSH);
    assert (sb);
    rc = zmq_setsockopt (sb, ZMQ_SNDTIMEO, &timeout, sizeof (int));
    assert (rc == 0);
    rc = zmq_bind (sb, SOCKET_ADDR);
    assert (rc == 0);

    rc = zmq_send (sb, "42", 2, 0);
    assert (rc == 2);

    rc = zmq_recv(sc, buffer, 2, 0);
    assert (rc == 2);

    rc = zmq_close (sc);
    assert (rc == 0);

    rc = zmq_close (sb);
    assert (rc == 0);

    rc = zmq_ctx_term (ctx);
    assert (rc == 0);
}

void test_unlink_and_close()
{
    int timeout = 100;

    void *ctx = zmq_ctx_new ();
    assert (ctx);

    void *sb = zmq_socket (ctx, ZMQ_PUSH);
    assert (sb);
    int rc = zmq_setsockopt (sb, ZMQ_SNDTIMEO, &timeout, sizeof (int));
    assert (rc == 0);
    rc = zmq_bind (sb, SOCKET_ADDR);
    assert (rc == 0);

    void *sc = zmq_socket (ctx, ZMQ_PULL);
    assert (sc);
    rc = zmq_setsockopt (sc, ZMQ_RCVTIMEO, &timeout, sizeof (int));
    assert (rc == 0);
    rc = zmq_connect (sc, SOCKET_ADDR);
    assert (rc == 0);

    rc = zmq_send (sb, "42", 2, 0);
    assert (rc == 2);

    char buffer [2];
    rc = zmq_recv(sc, buffer, 2, 0);
    assert (rc == 2);

    struct stat stat_before_rebind = {};
    rc = stat(SOCKET_PATH, &stat_before_rebind);
    assert (rc == 0);

    sb = zmq_socket (ctx, ZMQ_PUSH);
    assert (sb);
    rc = zmq_setsockopt (sb, ZMQ_SNDTIMEO, &timeout, sizeof (int));
    assert (rc == 0);
    rc = zmq_bind (sb, SOCKET_ADDR);  // implicitly unlinks old socket
    assert (rc == 0);

    struct stat stat_after_rebind = {};
    rc = stat(SOCKET_PATH, &stat_after_rebind);
    assert (rc == 0);

    assert (stat_before_rebind.st_ino != stat_after_rebind.st_ino);

    rc = zmq_send (sb, "42", 2, 0);
    assert (rc == 2);

    rc = zmq_recv(sc, buffer, 2, 0);
    assert (rc == 2);

    rc = zmq_close (sc);
    assert (rc == 0);

    rc = zmq_close (sb);
    assert (rc == 0);

    rc = zmq_ctx_term (ctx);
    assert (rc == 0);
}



int main (void)
{
    setup_test_environment();

    test_close_and_unlink();
    test_unlink_and_close();

    return 0 ;
}

@Kentzo
Copy link
Contributor Author

Kentzo commented Oct 3, 2017

It appears there is a race condition due to unlink:

zmq_bind [1]
zmq_close [1]
zmq_bind [2]

can be executed as:

ipc_listener_t::set_address [1]
ipc_listener_t::set_address [2]
ipc_listener_t::close [1]

In which case close of the [1] socket will actually unlink file that belongs to [2] (assuming they are bound to the same address).

@Kentzo
Copy link
Contributor Author

Kentzo commented Oct 3, 2017

Why does ZeroMQ unlink on close in the first place? If that can be avoided, many other issues can be easily avoided as well.

Kentzo added a commit to GreatFruitOmsk/libzmq that referenced this issue Oct 4, 2017
Otherwise it's possible to close file that does not belong
to the current socket.

Refs zeromq#2764
@bluca
Copy link
Member

bluca commented Oct 4, 2017

IIRC the problem is that if the socket file is there, binding to it will give an "address in use" error. And if it's not removed before binding, an application that crashes and is restarted (eg: by the init system) will fail with the same error.

In the end I think semantically double binding should not be allowed, as it really doesn't make much sense with SOCK_STREAM sockets.

@Kentzo
Copy link
Contributor Author

Kentzo commented Oct 4, 2017

IIRC the problem is that if the socket file is there, binding to it will give an "address in use" error.

That's correct, but current behavior is to unlink before bind (acceptable) and on close (incorrect).

It's incorrect, because on close it's possible to unlink file that does not belong to the socket anymore.
Consider the flow I posted above. There will be no file in the end.

In the end I think semantically double binding should not be allowed, as it really doesn't make much sense with SOCK_STREAM sockets.

unlink'n'bind looks like a legitimate approach and should stay for backward compatibility. Perhaps an option should be added for users that desire stricter control over this.

@bluca
Copy link
Member

bluca commented Oct 4, 2017

I see, you are right and it makes sense. Having an additional option sounds good as well. Thanks!

@bluca
Copy link
Member

bluca commented Oct 6, 2017

Fixed by #2765

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants