
Crash when using multicast function #3218

Closed
answeroo opened this issue Aug 15, 2018 · 6 comments
answeroo commented Aug 15, 2018


Issue description

The system occasionally crashes when using OpenPGM for multicast. In our tests we found that pgm_receiver_t::inpos becomes a null pointer and is then passed to memcpy() as the second argument (the pointer to the source memory), which causes the crash. The call stack is provided below.

We suspect this is a threading issue. Are pgm_receiver_t::in_event() and pgm_receiver_t::restart_input() ever called simultaneously from different threads?

Environment

  • libzmq version (commit hash if unreleased): 4.0.5
  • OS: Centos 6.7

Minimal test code / Steps to reproduce the issue

This does not always happen; it happens more often when the workload is heavy.

What's the actual result? (include assertion message & call stack if applicable)

This call stack is from version 4.0.5. After upgrading to 4.2.3, we still hit the same issue.

(gdb) bt
#0 0x0000003fe348995e in memcpy () from /lib64/libc.so.6
#1 0x00007f942b6b04a9 in zmq::decoder_base_t<zmq::v1_decoder_t>::decode(unsigned char const*, unsigned long, unsigned long&) () at decoder.hpp:119
#2 0x00007f942b68823c in zmq::pgm_receiver_t::process_input(zmq::v1_decoder_t*) () at pgm_receiver.cpp:261
#3 0x00007f942b687a69 in zmq::pgm_receiver_t::restart_input() () at pgm_receiver.cpp:124
#4 0x00007f942b695bb5 in zmq::session_base_t::write_activated(zmq::pipe_t*) () at session_base.cpp:260
#5 0x00007f942b68d9e9 in zmq::pipe_t::process_activate_write(unsigned long) () at pipe.cpp:233
#6 0x00007f942b682d2d in zmq::object_t::process_command(zmq::command_t&) () from /opt/yadev/3rdParty/cpp/ZeroMQ/4.0.5/lib/libzmq.so.4
#7 0x00007f942b6789da in zmq::io_thread_t::in_event() () at io_thread.cpp:73
#8 0x00007f942b676c3d in zmq::epoll_t::loop() () at epoll.cpp:165
#9 0x00007f942b676cfa in zmq::epoll_t::worker_routine(void*) () at epoll.cpp:178
#10 0x00007f942b6a78be in thread_routine () from /opt/yadev/3rdParty/cpp/ZeroMQ/4.0.5/lib/libzmq.so.4
#11 0x0000003fe3c079d1 in start_thread () from /lib64/libpthread.so.0
#12 0x0000003fe34e8b6d in clone () from /lib64/libc.so.6

What's the expected result?

System does not crash.

bluca commented Aug 15, 2018

Can you share a simple example to reproduce? Are you sure that a socket is not being used from multiple threads? This includes create and close.

answeroo commented Aug 17, 2018

Reproducing this situation is not easy. It depends on the data, and even on the CPU frequency. It is undefined behavior that only happens now and then.

We did some testing of the pgm_receiver_t::in_event() function and found that in_event can be called even while a restart_input call is pending. This is what causes the crash: inpos can become a null pointer inside that in_event call, and the system crashes. Even if we are lucky and inpos does not become null, its value can still be changed, so restart_input won't get the intended result.

The reason for this conflict is that although in_event stops polling, it does not stop events that have already been polled from triggering: some events are already queued and are about to invoke in_event again.

Our proposed fix for this problem is the following:

  1. Add a check on active_tsi at the beginning of in_event. If this pointer is not null, a restart_input call is pending, so we simply return. This ensures restart_input runs first; restart_input then calls in_event after processing the pending data.

  2. Set insize = 0 at the beginning of each loop iteration.

  3. Move "inpos = (unsigned char *) tmp;" down next to "insize = static_cast<size_t> (received);" to make the code cleaner.

The function now looks like this. The output at the very beginning is just for our testing purposes, and that message IS printed occasionally.

Of course, our change needs review from the zmq experts.

//////////////////////////////////////////////////////////////////////////////////////////////////////////////////////

void zmq::pgm_receiver_t::in_event ()
{
    //  Debug output, for our testing only: detect calls that arrive
    //  while a restart_input is pending.
    if (active_tsi != NULL) {
        std::cerr << __FILE__ << "::" << __LINE__
                  << " in_event called with active_tsi not null" << std::endl;
        return;
    }

    //  Read data from the underlying pgm_socket.
    const pgm_tsi_t *tsi = NULL;

    if (has_rx_timer) {
        cancel_timer (rx_timer_id);
        has_rx_timer = false;
    }

    //  TODO: This loop can effectively block other engines in the same I/O
    //  thread in the case of high load.
    while (true) {
        //  Get new batch of data.
        //  Note the workaround made not to break strict-aliasing rules.
        insize = 0;
        void *tmp = NULL;
        ssize_t received = pgm_socket.receive (&tmp, &tsi);

        //  No data to process. This may happen if the packet received is
        //  neither ODATA nor RDATA.
        if (received == 0) {
            if (errno == ENOMEM || errno == EBUSY) {
                const long timeout = pgm_socket.get_rx_timeout ();
                add_timer (timeout, rx_timer_id);
                has_rx_timer = true;
            }
            break;
        }

        //  Find the peer based on its TSI.
        peers_t::iterator it = peers.find (*tsi);

        //  Data loss. Delete decoder and mark the peer as disjoint.
        if (received == -1) {
            if (it != peers.end ()) {
                it->second.joined = false;
                if (it->second.decoder != NULL) {
                    LIBZMQ_DELETE (it->second.decoder);
                }
            }
            break;
        }

        //  New peer. Add it to the list of known but unjoined peers.
        if (it == peers.end ()) {
            peer_info_t peer_info = {false, NULL};
            it = peers.ZMQ_MAP_INSERT_OR_EMPLACE (*tsi, peer_info).first;
        }

        insize = static_cast<size_t> (received);
        inpos = (unsigned char *) tmp;

        //  Read the offset of the first message in the current packet.
        zmq_assert (insize >= sizeof (uint16_t));
        uint16_t offset = get_uint16 (inpos);
        inpos += sizeof (uint16_t);
        insize -= sizeof (uint16_t);

        //  Join the stream if needed.
        if (!it->second.joined) {
            //  There is no beginning of the message in current packet.
            //  Ignore the data.
            if (offset == 0xffff)
                continue;

            zmq_assert (offset <= insize);
            zmq_assert (it->second.decoder == NULL);

            //  We have to move data to the beginning of the first message.
            inpos += offset;
            insize -= offset;

            //  Mark the stream as joined.
            it->second.joined = true;

            //  Create and connect decoder for the peer.
            it->second.decoder =
              new (std::nothrow) v1_decoder_t (0, options.maxmsgsize);
            alloc_assert (it->second.decoder);
        }

        int rc = process_input (it->second.decoder);
        if (rc == -1) {
            if (errno == EAGAIN) {
                active_tsi = tsi;

                //  Stop polling.
                reset_pollin (pipe_handle);
                reset_pollin (socket_handle);

                break;
            }

            it->second.joined = false;
            LIBZMQ_DELETE (it->second.decoder);
            insize = 0;
        }
    }

    //  Flush any messages decoder may have produced.
    session->flush ();
}

bluca commented Aug 17, 2018

That looks fine, feel free to send a PR

answeroo commented
I am not quite familiar with GitHub yet. Could you help me create a PR for this? @bluca

bluca commented Aug 17, 2018

In short, you need to fork this repository, clone it, create a branch, commit the changes and push them, and then open a pull request on the website. But there are plenty of tutorials available on Google, better than anything I could write.

bluca commented Aug 19, 2018

Fixed by #3226

@bluca bluca closed this as completed Aug 19, 2018