Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PUB crash when SUB exceeded SNDHWM #2942

Closed
sublee opened this issue Feb 12, 2018 · 60 comments
Closed

PUB crash when SUB exceeded SNDHWM #2942

sublee opened this issue Feb 12, 2018 · 60 comments

Comments

@sublee
Copy link

sublee commented Feb 12, 2018

Please use this template for reporting suspected bugs or requests for help.

Issue description

When all of these conditions are satisfied, the assertion failure from mtrie.cpp occurs:

  • A connection between a PUB socket and many SUB sockets.
  • A SUB socket subscribe/unsubscribe many prefixes.
  • Call zmq_getsockopt() with ZMQ_EVENTS for SUB sockets.
Assertion failed: erased == 1 (src/mtrie.cpp:297)
[1]    30266 abort (core dumped)  ./a.out

Environment

  • libzmq version (commit hash if unreleased): 4.2.0 and 4.2.3
  • OS: Ubuntu 16.04 LTS

Minimal test code / Steps to reproduce the issue

To reproduce this crash, we should prepare a PUB socket and many SUB sockets.

We will call this sequence (pseudo-code): pub.connect(sub) or sub.connect(pub); pub.getsockopt(ZMQ_EVENTS); sub.subscribe(prefix); sub.getsockopt(ZMQ_EVENTS); sub.unsubscribe(prefix); sub.getsockopt(ZMQ_EVENTS). There will be many prefixes to subscribe/unsubscribe.

Calling getsockopt(ZMQ_EVENTS) after SUB's SUBSCRIBE/UNSUBSCRIBE, or PUB's zmq_connect() will produce a crash due to the assertion failure in mtrie_t::rm_helper.

You can switch PUB<->SUB connection topology by the pub_to_sub variable.

#include "zmq.h"
#include <stdio.h>

// Set 1 or 0 to switch the PUB<->SUB connection topology.
static int pub_to_sub = 1;

void gen_topic(int n, char* topic)
{
    // Simple hash function to generate a subscription prefix from a number.
    n = (n * 2654435761);
    sprintf(topic, "%08x", n);
}

void getsockopt_events_within_many_subscriptions(void* sub)
{
    char topic[8];
    char opt[256];
    size_t opt_len = 256;

    for (int j = 0; j < 10000; ++j)
    {
        gen_topic(j, topic);
        zmq_setsockopt(sub, ZMQ_SUBSCRIBE, &topic, 8);
        // CRASH: Get ZMQ_EVENTS from a SUB socket.
        zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);
    }
    for (int j = 0; j < 10000; ++j)
    {
        gen_topic(j, topic);
        zmq_setsockopt(sub, ZMQ_UNSUBSCRIBE, &topic, 8);
        // CRASH: Get ZMQ_EVENTS from a SUB socket.
        zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);
    }
}

int main()
{
    printf("%d.%d.%d\n", ZMQ_VERSION_MAJOR, ZMQ_VERSION_MINOR, ZMQ_VERSION_PATCH);

    void *context = zmq_ctx_new();
    void *pub = zmq_socket(context, ZMQ_PUB);
    void *sub;

    char addr[256]; size_t addr_len = 256;
    char opt[256];  size_t opt_len  = 256;

    if (pub_to_sub)
    {
        // PUB->SUB
        for (int i = 0; i < 100; ++i)
        {
            sub = zmq_socket(context, ZMQ_SUB);

            zmq_bind(sub, "tcp://127.0.0.1:*");
            zmq_getsockopt(sub, ZMQ_LAST_ENDPOINT, addr, &addr_len);
            zmq_connect(pub, addr);

            getsockopt_events_within_many_subscriptions(sub);
        }
    }
    else
    {
        // SUB->PUB
        zmq_bind(pub, "tcp://127.0.0.1:*");
        zmq_getsockopt(pub, ZMQ_LAST_ENDPOINT, addr, &addr_len);
        for (int i = 0; i < 100; ++i)
        {
            sub = zmq_socket(context, ZMQ_SUB);

            zmq_connect(sub, addr);

            getsockopt_events_within_many_subscriptions(sub);

            // CRASH: Get ZMQ_EVENTS from the PUB socket.
            zmq_getsockopt(pub, ZMQ_EVENTS, opt, &opt_len);
        }
    }
}

What's the actual result? (include assertion message & call stack if applicable)

$ gcc zmq_events_crash.c -L ~/usr/local/lib -lzmq && ./a.out
4.2.3
Assertion failed: erased == 1 (src/mtrie.cpp:297)
[1]    30266 abort (core dumped)  ./a.out

What's the expected result?

$ gcc zmq_events_crash.c -L ~/usr/local/lib -lzmq && ./a.out
4.2.3
$ echo $?
0

When SUB sockets connect to the PUB socket, this crash doesn't happen.

@bluca
Copy link
Member

bluca commented Feb 12, 2018

I can't seem to reproduce the crash. Did you build the library with or without draft apis?

@sublee
Copy link
Author

sublee commented Feb 12, 2018

@bluca Oh... I built libzmq-4.2.3 without any configuration arguments.

I guess low SNDHWM on SUB sockets reproduce the crash easier. Would you please try it?

zmq_setsockopt(sub, ZMQ_SNDHWM, (void *)10, 0);

@sublee
Copy link
Author

sublee commented Feb 12, 2018

Here's my gdb backtrace:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
4.2.3
[New Thread 0x7ffff6abc700 (LWP 23970)]
[New Thread 0x7ffff62bb700 (LWP 23972)]
Assertion failed: erased == 1 (src/mtrie.cpp:297)

Thread 1 "a.out" received signal SIGABRT, Aborted.
0x00007ffff77b8428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
54      ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x00007ffff77b8428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007ffff77ba02a in __GI_abort () at abort.c:89
#2  0x00007ffff7b73539 in zmq::zmq_abort (errmsg_=errmsg_@entry=0x7ffff7bb75c6 "erased == 1") at src/err.cpp:87
#3  0x00007ffff7b7de6f in zmq::mtrie_t::rm_helper (this=this@entry=0xa34380, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:297
#4  0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0xa34360, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#5  0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0xa34340, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#6  0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0xa34320, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#7  0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0xa34300, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#8  0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0x61c7c0, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#9  0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0x960c50, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#10 0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0x61f020, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#11 0x00007ffff7b7dc7b in zmq::mtrie_t::rm_helper (this=this@entry=0x617ce8, prefix_=<optimized out>, size_=<optimized out>, pipe_=0xade9b0) at src/mtrie.cpp:315
#12 0x00007ffff7b7e235 in zmq::mtrie_t::rm (this=this@entry=0x617ce8, prefix_=<optimized out>, size_=<optimized out>, pipe_=<optimized out>) at src/mtrie.cpp:288
#13 0x00007ffff7baddc9 in zmq::xpub_t::xread_activated (this=0x617740, pipe_=0xade9b0) at src/xpub.cpp:115
#14 0x00007ffff7b92f0f in zmq::socket_base_t::process_commands (this=this@entry=0x617740, timeout_=timeout_@entry=0, throttle_=throttle_@entry=false) at src/socket_base.cpp:1378
#15 0x00007ffff7b95829 in zmq::socket_base_t::connect (this=0x617740, addr_=0x7fffffffe0a0 "tcp://127.0.0.1:40429") at src/socket_base.cpp:709
#16 0x0000000000400b53 in main ()

@bluca
Copy link
Member

bluca commented Feb 12, 2018

No change. The program just spins forever in one of the getsockopt_events_within_many_subscriptions loops

@bluca
Copy link
Member

bluca commented Feb 12, 2018

I am running with libsodium:

$ ldd src/.libs/libzmq.so.5.1.4 | grep sodium
	libsodium.so.18 => /usr/lib/x86_64-linux-gnu/libsodium.so.18 (0x00007f786586e000)

It doesn't make much sense - unless you are using CURVE, and it that example it's not, sodium or tweetnacl will make no difference at all.

Are you sure you didn't have multiple versions of the libraries laying around? And maybe building with one, but running with an old one or viceversa?

@sublee
Copy link
Author

sublee commented Feb 12, 2018

I'm sorry my comment about libsodium was hasty. I removed it quickly but you already read.

Here's the comment:

I've checked my own environment again. There was libsodium in my system. When I uninstalled it, the code didn't crash! Can you run the reproduction script after installing libsodium?

@sublee
Copy link
Author

sublee commented Feb 13, 2018

Are you sure you didn't have multiple versions of the libraries laying around? And maybe building with one, but running with an old one or viceversa?

Yes, there's only 1 libzmq on my system.

I want to explain the environment difference between us. I downloaded zeromq-4.2.3 from zeromq-4.2.3.tar.gz. I built it with ./configure && make && sudo make install. I ran the script with gcc zmq_events_crash.c -L ~/usr/local/lib -lzmq && ./a.out.

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04 LTS
Release:        16.04
Codename:       xenial
$ uname -a
Linux fantine 4.4.0-104-generic #127-Ubuntu SMP Mon Dec 11 12:16:42 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

@bluca
Copy link
Member

bluca commented Feb 13, 2018

You are not running ldconfig - are you sure there are no other libs? What does ldd a.out say?

@sublee
Copy link
Author

sublee commented Feb 13, 2018

Oh, I ran ldconfig also. Sorry for omitting it.

The ldd a.out result is:

linux-vdso.so.1 =>  (0x00007ffca17cf000)
libzmq.so.5 => /usr/local/lib/libzmq.so.5 (0x00007f67cdf02000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f67cdb38000)
librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f67cd930000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f67cd713000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007f67cd391000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f67cd088000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f67cce72000)
/lib64/ld-linux-x86-64.so.2 (0x00007f67ce18d000)

@sublee
Copy link
Author

sublee commented Feb 13, 2018

Anyways, I'm still researching my crash. I found a symptom that PUB socket doesn't receive all of SUBSCRIBE messages by custom logs on zmq::xpub_t::xread_activated(). So some UNSUBSCRIBE message can be processed without it's corresponding SUBSCRIBE. I'll check whether SUB drops outgoing messages or PUB drops incoming messages.

@sublee
Copy link
Author

sublee commented Feb 13, 2018

I found the condition. A SUB socket can drop a SUBSCRIBE message due to the not enough SNDHWM. But it can send an UNSUBSCRIBE message for the dropped subscription. Then it kills the connected PUB socket.

If the dropping is a normal behavior of SUB sockets, PUB socket should prevent the crash itself. If it just crashes, we cannot keep an application safely.

@sublee
Copy link
Author

sublee commented Feb 13, 2018

#2252 what I reported 14 months ago came to my mind. At that time, you pointed dropped subscriptions due to the HWM. But I couldn't reproduce it without PyZMQ so I tried and gave up to fix it in PyZMQ.

So essentially this issue is same as that issue. But this time, I can reproduce it without PyZMQ. @bluca I want to make it work in your environment also.

I'll replace the subject with "PUB crash when SUB exceeded SNDHWM".

@sublee sublee changed the title Crash on ZMQ_EVENTS after SUBSCRIBE/UNSUBSCRIBE PUB crash when SUB exceeded SNDHWM Feb 13, 2018
@sublee
Copy link
Author

sublee commented Feb 13, 2018

How about this code?

Show code
#include "zmq.h"
#include 

void gen_topic(int n, char* topic)
{
// Simple hash function to generate a subscription prefix from a number.
n = (n * 2654435761);
sprintf(topic, "%08x", n);
}

void getsockopt_events_within_many_subscriptions(void* sub, int i)
{
char topic[8];
char opt[256];
size_t opt_len = 256;

int n = 10000;

for (int j = 0; j < n; ++j)
{
    gen_topic(j, topic);
    zmq_setsockopt(sub, ZMQ_SUBSCRIBE, &topic, 8);
    // CRASH: Get ZMQ_EVENTS from a SUB socket.
    zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);
}
for (int j = 0; j < n; ++j)
{
    gen_topic(j, topic);
    zmq_setsockopt(sub, ZMQ_UNSUBSCRIBE, &topic, 8);
    // CRASH: Get ZMQ_EVENTS from a SUB socket.
    zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);
}

}

int main()
{
printf("%d.%d.%d\n", ZMQ_VERSION_MAJOR, ZMQ_VERSION_MINOR, ZMQ_VERSION_PATCH);

void *context = zmq_ctx_new();
void *pub = zmq_socket(context, ZMQ_PUB);
void *sub;

char addr[256]; size_t addr_len = 256;
char opt[256];  size_t opt_len  = 256;
int sub_hwm = 1;

zmq_bind(pub, "tcp://127.0.0.1:*");
zmq_getsockopt(pub, ZMQ_LAST_ENDPOINT, addr, &addr_len);

for (int i = 0; i < 100; ++i)
{
    sub = zmq_socket(context, ZMQ_SUB);
    zmq_setsockopt(sub, ZMQ_SNDHWM, &sub_hwm, sizeof(sub_hwm));

    zmq_connect(sub, addr);
    zmq_getsockopt(pub, ZMQ_EVENTS, opt, &opt_len);

    getsockopt_events_within_many_subscriptions(sub, i);
}

}

Use #2942 (comment) instead.

@sublee
Copy link
Author

sublee commented Feb 13, 2018

Or this? (more duplicated subscriptions, lowest RCVHWM for the PUB socket)

#include "zmq.h"
#include <stdio.h>

void getsockopt_events_within_many_subscriptions(void* sub)
{
    char opt[256];
    size_t opt_len = 256;
    char topic[8];
    int n = 100;

    for (int i = 0; i < n; ++i)
    {
        for (int j = 0; j < n; ++j)
        {
            sprintf(topic, "%08x", j);
            zmq_setsockopt(sub, ZMQ_SUBSCRIBE, &topic, 8);
            zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);
        }
        for (int j = 0; j < n; ++j)
        {
            sprintf(topic, "%08x", j);
            zmq_setsockopt(sub, ZMQ_UNSUBSCRIBE, &topic, 8);
            zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);
        }
    }
}

int main()
{
    printf("%d.%d.%d\n", ZMQ_VERSION_MAJOR, ZMQ_VERSION_MINOR, ZMQ_VERSION_PATCH);

    void *context = zmq_ctx_new();
    void *pub = zmq_socket(context, ZMQ_PUB);
    void *sub;

    char addr[256]; size_t addr_len = 256;
    char opt[256];  size_t opt_len  = 256;
    int lowest_hwm = 1;

    zmq_bind(pub, "tcp://127.0.0.1:*");
    zmq_getsockopt(pub, ZMQ_LAST_ENDPOINT, addr, &addr_len);
    zmq_setsockopt(pub, ZMQ_RCVHWM, &lowest_hwm, sizeof(lowest_hwm));

    for (int i = 0; i < 100; ++i)
    {
        sub = zmq_socket(context, ZMQ_SUB);
        zmq_setsockopt(sub, ZMQ_SNDHWM, &lowest_hwm, sizeof(lowest_hwm));

        zmq_connect(sub, addr);
        zmq_getsockopt(pub, ZMQ_EVENTS, opt, &opt_len);

        getsockopt_events_within_many_subscriptions(sub);
    }
}

@sublee
Copy link
Author

sublee commented Feb 13, 2018

Here's another variance. I should make a code to drop SUBSCRIBE messages but not drop UNSUBSCRIBE messages to reproduce the crash. This variance assumes that many subscriptions before connecting have more change to be dropped.

#include "zmq.h"
#include <stdio.h>

int main()
{
    printf("%d.%d.%d\n", ZMQ_VERSION_MAJOR, ZMQ_VERSION_MINOR, ZMQ_VERSION_PATCH);

    void *context = zmq_ctx_new();
    void *pub = zmq_socket(context, ZMQ_PUB);
    void *sub;

    char addr[256]; size_t addr_len = 256;
    char opt[256];  size_t opt_len  = 256;
    char topic[8];

    int lowest_hwm = 1;
    int n = 10000;

    zmq_setsockopt(pub, ZMQ_RCVHWM, &lowest_hwm, sizeof(lowest_hwm));

    zmq_bind(pub, "tcp://127.0.0.1:*");
    zmq_getsockopt(pub, ZMQ_LAST_ENDPOINT, addr, &addr_len);

    for (int i = 0; i < 100; ++i)
    {
        sub = zmq_socket(context, ZMQ_SUB);
        zmq_setsockopt(sub, ZMQ_SNDHWM, &lowest_hwm, sizeof(lowest_hwm));

        for (int j = 0; j < n; ++j)
        {
            sprintf(topic, "%08x", j);
            zmq_setsockopt(sub, ZMQ_SUBSCRIBE, &topic, 8);
        }

        zmq_connect(sub, addr);

        zmq_getsockopt(pub, ZMQ_EVENTS, opt, &opt_len);
        zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);

        for (int j = 0; j < n; ++j)
        {
            sprintf(topic, "%08x", j);
            zmq_setsockopt(sub, ZMQ_UNSUBSCRIBE, &topic, 8);
            zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);
        }
    }
}

@bluca
Copy link
Member

bluca commented Feb 13, 2018

Both programs work fine for me on both 4.2.2 and latest master - this is on Debian 9 x86_64

@sigiesec
Copy link
Member

I tried to run this (with VS), but get the following error at the end of getsockopt_events_within_many_subscriptions:

Run-Time Check Failure #2 - Stack around the variable 'topic' was corrupted.

It's in my branch https://github.com/sigiesec/libzmq/tree/fix-issue-2943

@bluca
Copy link
Member

bluca commented Feb 13, 2018

"topic" is an array so it shouldn't pass its address I think - gcc is more forgiving

@sigiesec
Copy link
Member

@bluca it's not a compiler, but a run-time error

@sublee
Copy link
Author

sublee commented Feb 13, 2018

Does it also succeed? I've switched the connection topology to SUB binds from PUB binds.

#include "zmq.h"
#include <stdio.h>

int main()
{
    printf("%d.%d.%d\n", ZMQ_VERSION_MAJOR, ZMQ_VERSION_MINOR, ZMQ_VERSION_PATCH);

    int hwm = 1;
    int n = 10000;

    char addr[256]; size_t addr_len = 256;
    char opt[256];  size_t opt_len  = 256;
    char topic[8];

    void *context = zmq_ctx_new();
    void *pub, *sub;

    pub = zmq_socket(context, ZMQ_PUB);
    zmq_setsockopt(pub, ZMQ_RCVHWM, &hwm, sizeof(hwm));

    for (int i = 0; i < 100; ++i)
    {
        sub = zmq_socket(context, ZMQ_SUB);
        zmq_setsockopt(sub, ZMQ_SNDHWM, &hwm, sizeof(hwm));

        zmq_bind(sub, "tcp://127.0.0.1:*");
        zmq_getsockopt(sub, ZMQ_LAST_ENDPOINT, addr, &addr_len);

        for (int j = 0; j < n; ++j)
        {
            sprintf(topic, "%08x", j);
            zmq_setsockopt(sub, ZMQ_SUBSCRIBE, &topic, 8);
        }

        zmq_connect(pub, addr);

        for (int j = 0; j < n; ++j)
        {
            sprintf(topic, "%08x", j);
            zmq_setsockopt(sub, ZMQ_UNSUBSCRIBE, &topic, 8);
            zmq_getsockopt(sub, ZMQ_EVENTS, opt, &opt_len);
        }
    }
}

@bluca
Copy link
Member

bluca commented Feb 13, 2018

Yeah they all succeed, last one included, apart from the very first one which seems to just run forever.

@bluca
Copy link
Member

bluca commented Feb 13, 2018

Is there any more information that you can give? Have you tried with the latest libzmq master? Using any non-standard compiler/compiler flag? Have you tried using just the packaged library from Ubuntu?

@sublee
Copy link
Author

sublee commented Feb 14, 2018

Okay, I should stop to edit the code. I'm trying to reproduce the crash in Travis CI to make clear any environment differences.

I tried it with v4.2.3 tag of libzmq repo. I compiled it with only ./configure and make. And I could reproduce it at my another machine. So I don't think there can be any non-standard flags. I didn't try it with the packaged library.

@sublee
Copy link
Author

sublee commented Feb 14, 2018

It is reproduced at Travis CI! Can you check a crashed result and the script in CI?

$ ./a.out
4.2.3
Assertion failed: erased == 1 (src/mtrie.cpp:297)
/home/travis/.travis/job_stages: line 57: 12542 Aborted                 (core dumped) ./a.out


The command "./a.out" exited with 134.

Done. Your build exited with 1.

@sigiesec
Copy link
Member

I noticed that topic is too small, it must be made size 9 to store an 8 byte string plus 0 termination. When I change this, I get a similar assertion in VS:

Assertion failed: erased == 1 (D:\Dev\libzmq\src\mtrie.cpp:290)

@sublee
Copy link
Author

sublee commented Feb 14, 2018

@sigiesec Oh, thank you for the point! I updated at https://github.com/sublee/zmq-pubsub-crash.

@sigiesec
Copy link
Member

sigiesec commented Feb 14, 2018

@bluca May this be a bug in mtrie? I wanted to add unittests for that anyway, since it is not well covered by the existing tests.

@sublee I have the test also in a branch, where I have migrated the tests to unity: https://github.com/sigiesec/libzmq/tree/fix-issue-2942

@sublee
Copy link
Author

sublee commented Feb 14, 2018

@sigiesec Thank you! There's a little mistake. It is the issue 2942 not 2943. Anyway, I'm happy to meet another person who is interested in this issue.

How do you think how we can fix this crash? I think if SUB's subscriptions can be ignored, PUB must not check erased == 1. It can be 0 or 1 normally.

@sublee
Copy link
Author

sublee commented Feb 14, 2018

@bluca I don't think this is a stack corruption. There are several proofs:

  • I appended logs around xpub.cpp#114 to detect actual subscriptions which PUB received. Some subscriptions were not delivered to PUB.
  • I appended logs around xsub.cpp#244 to detect whether a SUB socket drops subscriptions or not. There were many dropped subscriptions.
  • I appended logs around pipe.cpp#540 to confirm HWM exceeding. Explicitly, the HWM exceeding is the reason to drop subscriptions. Of course, it is a documented and expected behavior.
  • SNDHWM=0 for SUB sockets and RCVHWM=0 for PUB sockets perfectly fix this crash. If it is a stack corruption, the unlimited HWM should not fix the crash.

I didn't have a crash with %x instead of %08x so I thought it weird just like you. But when I increase the loop iterations and call ZMQ_EVENTS after ZMQ_SUBSCRIBE, it also crashes. You can see the crash code differences at sublee/zmq-pubsub-crash@ddc556e. And here's the crash result: https://travis-ci.org/sublee/zmq-pubsub-crash/builds/341495514.

Assertion failed: erased == 1 (src/mtrie.cpp:297)

sublee added a commit to what-studio/pyzmq that referenced this issue Feb 14, 2018
It is libzmq issue.  PyZMQ doesn't have a responsibility to fix it.
The issue is reported at zeromq/libzmq#2942.
@sublee
Copy link
Author

sublee commented Feb 17, 2018

@bluca Can we exclude the topic length from the suspects? Is there any risk if we remove the assertion of erased == 1?

@sigiesec
Copy link
Member

I think at the moment the assertion can be removed. If the mtrie is completely empty, e.g., rm simply returns false.

@bluca
Copy link
Member

bluca commented Feb 17, 2018

Sorry but I'm not a huge fan of removing error checks - at the very least we need to understand what condition it was added to guard against, and why it is wrong and how to relax it. And possibly with tests - how is the unit test for mtrie coming along?

Especially in this case - compiling with address sanitizer shows some stash smashing is going on:

=================================================================
==32713==ERROR: AddressSanitizer: stack-buffer-overflow on address 0x7ffd5af0ead8 at pc 0x7f2f9f2c1592 bp 0x7ffd5af0e8c0 sp 0x7ffd5af0e070
WRITE of size 9 at 0x7ffd5af0ead8 thread T0
    #0 0x7f2f9f2c1591 in __interceptor_vsprintf (/usr/lib/x86_64-linux-gnu/libasan.so.3+0x8b591)
    #1 0x7f2f9f2c17a2 in sprintf (/usr/lib/x86_64-linux-gnu/libasan.so.3+0x8b7a2)
    #2 0x555c170eef08 in main /home/bluca/git/libzmq/test_crash.c:31
    #3 0x7f2f9eb372e0 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x202e0)
    #4 0x555c170eebd9 in _start (/home/bluca/git/libzmq/a.out+0xbd9)

Address 0x7ffd5af0ead8 is located in stack of thread T0 at offset 232 in frame
    #0 0x555c170eecef in main /home/bluca/git/libzmq/test_crash.c:5

  This frame has 6 object(s):
    [32, 36) 'hwm'
    [96, 104) 'addr_len'
    [160, 168) 'opt_len'
    [224, 232) 'topic' <== Memory access at offset 232 overflows this variable
    [288, 544) 'addr'
    [576, 832) 'opt'
HINT: this may be a false positive if your program uses some custom stack unwind mechanism or swapcontext
      (longjmp and C++ exceptions *are* supported)
SUMMARY: AddressSanitizer: stack-buffer-overflow (/usr/lib/x86_64-linux-gnu/libasan.so.3+0x8b591) in __interceptor_vsprintf
Shadow bytes around the buggy address:
  0x10002b5d9d00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10002b5d9d10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10002b5d9d20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10002b5d9d30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1
  0x10002b5d9d40: f1 f1 04 f4 f4 f4 f2 f2 f2 f2 00 f4 f4 f4 f2 f2
=>0x10002b5d9d50: f2 f2 00 f4 f4 f4 f2 f2 f2 f2 00[f4]f4 f4 f2 f2
  0x10002b5d9d60: f2 f2 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10002b5d9d70: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10002b5d9d80: 00 00 f2 f2 f2 f2 00 00 00 00 00 00 00 00 00 00
  0x10002b5d9d90: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
  0x10002b5d9da0: 00 00 00 00 00 00 f3 f3 f3 f3 00 00 00 00 00 00
Shadow byte legend (one shadow byte represents 8 application bytes):
  Addressable:           00
  Partially addressable: 01 02 03 04 05 06 07 
  Heap left redzone:       fa
  Heap right redzone:      fb
  Freed heap region:       fd
  Stack left redzone:      f1
  Stack mid redzone:       f2
  Stack right redzone:     f3
  Stack partial redzone:   f4
  Stack after return:      f5
  Stack use after scope:   f8
  Global redzone:          f9
  Global init order:       f6
  Poisoned by user:        f7
  Container overflow:      fc
  Array cookie:            ac
  Intra object redzone:    bb
  ASan internal:           fe
  Left alloca redzone:     ca
  Right alloca redzone:    cb
==32713==ABORTING

Use -fsanitize=address with a decently recent version of GCC to see it

@sublee
Copy link
Author

sublee commented Feb 17, 2018

@bluca I'm not doubtful of topic[9]. topic[8] in the original code was perfectly wrong. As you said, topic[9] is the right code. Even though topic[8] has the memory issue, the mtrie crash is not related to that.

The mtrie crash is caused by not paired SUBSCRIBE and UNSUBSCRIBE messages. It is a possible normal behavior of SUB or XSUB sockets when they exceed the SNDHWM. PUB or XPUB sockets' RCVHWM also can cause the crash.

So I suggest removing the assertion simply. Or we can check the pipe's subscription existence before processing UNSUBSCRIBE messages if the assertion is still necessary.

@sublee
Copy link
Author

sublee commented Feb 17, 2018

@sigiesec Thank you for the idea! I'll send a PR you soon.

@bluca
Copy link
Member

bluca commented Feb 17, 2018

Please don't - again:

Sorry but I'm not a huge fan of removing error checks - at the very least we need to understand what condition it was added to guard against, and why it is wrong and how to relax it. And possibly with tests - how is the unit test for mtrie coming along?

Error checks must not be removed casually, they are there for a reason

@bluca
Copy link
Member

bluca commented Feb 17, 2018

I'm not doubtful of topic[9]. topic[8] in the original code was perfectly wrong. As you said, topic[9] is the right code. Even though topic[8] has the memory issue, the mtrie crash is not related to that.

I am doubtful - the topic is 8 bytes, so why is your example requiring a larger array? It doesn't make any sense

@sublee
Copy link
Author

sublee commented Feb 17, 2018

@bluca Okay, I won't remove the assertion. But is it okay that checking existence before PUB removes a subscription? There will be no assertion removal.

@sublee
Copy link
Author

sublee commented Feb 17, 2018

@bluca Yes, the topic is 8 bytes. But I've used sprintf(). It appends \0 after the content. So it needs 9 bytes to be stored safely. It makes sense. Doesn't it? (reference)

@sublee
Copy link
Author

sublee commented Feb 17, 2018

I'm sorry for trying to remove the assertion hastily. But I want to move the conversation onto unpaired SUBSCRIBE-UNSUBSCRIBE. It is the most important point to rescue applications which use ZeroMQ PUB/SUB from this crash. I want to hear @bluca your opinion about it. Because you are the ZeroMQ expert. Your opinion will be influential to me.

I think topic[9] is completely out of this issue. Just sprintf() requires 1 more byte to store \0 even though a SUB socket uses the string except \0.

@bjovke
Copy link
Contributor

bjovke commented Feb 17, 2018

I am doubtful - the topic is 8 bytes, so why is your example requiring a larger array? It doesn't make any sense

sprintf(topic, "%08x", j);

I've tried removing the assert zmq_assert (erased == 1);
Now it runs for longer time and sometimes crashes on a totally unrelated place (Windows 64 bit build):

Unhandled exception at 0x00007FFC3305DE0E (ntdll.dll) in test_1.exe: 0xC0000005: Access violation writing location 0x0000000000000024.

It's because of simultaneous write access to family_entries map from multiple threads.

void zmq::select_t::reset_pollout (handle_t handle_)
{
    check_thread ();
#if defined ZMQ_HAVE_WINDOWS
    u_short family = get_fd_family (handle_);
    wsa_assert (family != AF_UNSPEC);
    family_entry_t &family_entry = family_entries[family];   // <- HERE
#endif
    FD_CLR (handle_, &family_entry.fds_set.write);
}

In Windows debug builds there's a mutx inside std::map which is being locked on writes. I'm not sure what is this now...

Regardless of that, this fixes the crash without changing anything in libzmq, adding delays around zmq_connect(pub, addr);:

    std::this_thread::sleep_for(std::chrono::milliseconds(1));
    zmq_connect(pub, addr);
    std::this_thread::sleep_for(std::chrono::milliseconds(1));

It seems that the issue is with ZMQ_SUBSCRIBE / ZMQ_UNSUBSCRIBE messages being lost and/or not being processed in the order they were sent. Possibly two ZMQ_UNSUBSCRIBE messages from the same SUB arriving to PUB one after the another??

During the actual assert the pipe_ is not present in pipes:

pipes_t::size_type erased = pipes->erase (pipe_);

+	pipe_	0x0000001d1aeac0e0 
-	pipes	0x0000001d1b153c00 { size=1 }	std::set<zmq::pipe_t *,std::less<zmq::pipe_t *>,std::allocator<zmq::pipe_t *> > *
+		[0]	0x0000001d1aeacd40

pipes set contains only one entry which is not the one being deleted.
It's maybe best at this point to find out why the assert is there.

@bjovke
Copy link
Contributor

bjovke commented Feb 17, 2018

With 1 millisecond delay on Windows sometimes unhandled exception still happens but with 10 millisecond delay everything works fine.

@sublee
Copy link
Author

sublee commented Feb 17, 2018

@bjovke Thank you to inspect this issue.

Possibly two ZMQ_UNSUBSCRIBE messages from the same SUB arriving to PUB one after the another??

When I appended some logs to debug it, I didn't see duplicated UNSUBSCRIBE messages. Here's what I saw:

  1. SUB drops some outgoing SUBSCRIBE due to SNDHWM, or PUB drops some incoming SUBSCRIBE due to RCVHWM. Both can happen.
  2. So PUB doesn't know that SUB subscribes the topic.
  3. SUB and PUB does NOT drop UNSUBSCRIBE. PUB receives the UNSUBSCRIBE message successfully.
  4. So PUB tries to remove the topic from the subscriptions. But there's nothing.
  5. erased is 0. It crashes.

If you append some logs, you will see the same behaviors.

@bjovke
Copy link
Contributor

bjovke commented Feb 17, 2018

Well I didn't say that ZMQ_UNSUBSCRIBE messages are duplicates.
Since your example is running in a loop the same topics and SUB are repeatedly used hence the PUB can observe two consecutive ZMQ_UNSUBSCRIBEs if messages between them are lost or messages are processed out of order.

Anyway, this should not be happening in any case.

@sublee
Copy link
Author

sublee commented Feb 17, 2018

@bjovke Oh, I misunderstood. Anyways, SNDHWM=0 on SUB sockets and RCVHWM=0 on the PUB socket fixes the crash without any delays. Can you try it in your environment?

int hwm = 0;

my guess is that this issue can be observed only when PUB and SUB are on the same machine or they are communicating on a network with extremely low latency.

I don't agree your guess. Here're counterexamples:

This crash is not only in the example code. My production servers are having it currently(I'll fix it with the unlimited HWM at the next maintenance window.) The servers are distributed on multiple machines.

The crash happens when 2 types of events:

  1. When a process which has SUB sockets has been killed, connected PUB sockets bring the crash.
  2. When a new PUB socket connects to a mature cluster. A mature cluster has many SUB sockets with many many subscriptions at least 10k topics. The SUB sockets actively subscribe and unsubscribe some topics forever.

A fresh cluster has same SUB sockets but has <10 subscriptions. When a new PUB socket connects to a fresh cluster, it doesn't crash. If the network latency is the reason, there should not be the difference between mature and fresh clusters.

@bjovke
Copy link
Contributor

bjovke commented Feb 17, 2018

Yes, you're right about the network latency, I've removed that text from my comment.

Everything works fine with hwm = 0 for some time and I get an already mentioned unhandled exception on Windows. But this is not related to this assert issue.

Either the logic for detecting cases like this needs to be added to the code or it might be enough just to remove the assert. It needs to be investigated.

@sublee
Copy link
Author

sublee commented Feb 17, 2018

I prefer removing the assertion and making mtrie::rm() return false if erased == 0. But I'm not an expert so you guys' opinion is more important.

Before we fix this issue, we must warn about the crash for the PUB/SUB pattern users. The SNDHWM and RCVHWM documentation should suggest 0 for PUB/SUB sockets.

@sigiesec
Copy link
Member

@bluca Maybe I find some time tomorrow to add sufficient tests, so that we can discuss consistency of mtrie behaviour.

At the moment my impression is that the assertion is too strict within mtrie, but it may well be worth an assertion at the call site. I did not dig into the larger picture yet.

@bjovke
Copy link
Contributor

bjovke commented Feb 17, 2018

The assert is probably there in order to make sure that unsubscribe from a SUB actually deleted a corresponding pipe from pipes set and there are no dangling pipes there. Dangling pipes can, in a long-running program, pile up and exhaust the RAM.

But in this case SUB is sending unsubscribe to a subscription which never reached the PUB, thus the subscription is non-existent on PUB side. I guess then there will be no dangling pipes in the PUB?
Except the ones where SUB subscribed and then died (process crash and/or network connection lost).

As far as I've looked at the code, the zmq_assert (erased == 1); can be safely removed since this pipe was never inserted into pipes set at the first place.

There's this code in zmq::xpub_t::xread_activated:

            } else {
                bool unique;
                if (*data == 0)
                    unique = subscriptions.rm (data + 1, size - 1, pipe_);  // <- ASSERT IS HERE
                else
                    unique = subscriptions.add (data + 1, size - 1, pipe_);

                //  If the (un)subscription is not a duplicate store it so that it can be
                //  passed to the user on next recv call unless verbose mode is enabled
                //  which makes to pass always these messages.
                if (options.type == ZMQ_XPUB
                    && (unique || (*data == 1 && verbose_subs)
                        || (*data == 0 && verbose_unsubs && verbose_subs))) {
                    pending_data.push_back (blob_t (data, size));
                    if (metadata)
                        metadata->add_ref ();
                    pending_metadata.push_back (metadata);
                    pending_flags.push_back (0);
                }
            }

If the assert is removed unique might have the proper value:

  • If the subscribe never reached the PUB and unsubscribe arrived, SUB thinks that it is connected and it should be disconnected by PUB ("passed to the user on next recv call").
  • If PUB wants to ignore non-matching unsubscribe then in this case unique should be false but that defeats the meaning of it - it should be false only when there are no more subscriptions from this SUB.

But, then again, this is definitely caused by ZMQ_SNDHWM on SUB, which means that SUB thinks it sent a subscribe to the PUB but it was never actually sent on the wire.
The way to fix it is to check on the SUB if the subscribe message actually entered the send queue and don't transition to "subscribed" state if it did not.

I personally think that this needs to be fixed on both the PUB and the SUB, by removing assert on the PUB and fixing the ZMQ_SUBSCRIBE in case of discarded message due to ZMQ_SNDHWM limit.

@sublee
Copy link
Author

sublee commented Feb 18, 2018

@bjovke Thank you to agree my hypothesis!

I saw PUB side HWM exceeding also, but it doesn't seem to cause the crash. I couldn't reproduce the crash with SUB.SNDHWM=0 and PUB.RCVHWM=1 now. So I think that your idea, the sent subscription checking on SUB sockets, would work well.

@sublee
Copy link
Author

sublee commented Feb 19, 2018

@bluca I'm looking forward to your opinion on my hypothesis: successful UNSUBSCRIBE messages, which are after dropped SUBSCRIBE messages, cause the crash on the PUB side. Please review whether this crash mechanism is possible or not.

@bluca
Copy link
Member

bluca commented Mar 2, 2018

@sublee please try again from latest master

@sublee
Copy link
Author

sublee commented Mar 5, 2018

@bluca Okay, I'll see.

@sublee
Copy link
Author

sublee commented Mar 5, 2018

@bluca There's no more crash. Finally, this issue seems to be fixed! Thank you guys so much for the hard work to fix it.

@sublee sublee closed this as completed Mar 5, 2018
bluca added a commit that referenced this issue Mar 21, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants