Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unpredictable hanging on zmq_recv() with timeout on IOS 10.3. ( Xcode 8.3.1 is used) #2537

Closed
cybermag opened this issue Apr 19, 2017 · 10 comments

Comments

@cybermag
Copy link

cybermag commented Apr 19, 2017

Hello! I am using ZeroMQ 4.2.2 for my IOS project and faced with the hanging issue when
code calls zmq_recv() with some defined timeout. To exclude developer errors I have reproduced issue using test_timeio from official tests.

You can find the code of the test below, I have marked the line on which test freezes with
// this hangs comment.

int main (void)
{
    setup_test_environment();
    void *ctx = zmq_ctx_new ();
    assert (ctx);

    void *frontend = zmq_socket (ctx, ZMQ_DEALER);
    assert (frontend);
    int rc = zmq_bind (frontend, "inproc://timeout_test");
    assert (rc == 0);

    //  Receive on disconnected socket returns immediately
    char buffer [32];
    rc = zmq_recv (frontend, buffer, 32, ZMQ_DONTWAIT);
    assert (rc == -1);
    assert (zmq_errno() == EAGAIN);
    
    //  Check whether receive timeout is honored
    int timeout = 250;
    rc = zmq_setsockopt (frontend, ZMQ_RCVTIMEO, &timeout, sizeof (int));
    assert (rc == 0);

    void* stopwatch = zmq_stopwatch_start();
    rc = zmq_recv (frontend, buffer, 32, 0);            // this hangs
    assert (rc == -1);
    assert (zmq_errno () == EAGAIN);
    unsigned int elapsed = zmq_stopwatch_stop(stopwatch) / 1000;
    assert (elapsed > 200 && elapsed < 300);

    //  Check that normal message flow works as expected
    void *backend = zmq_socket (ctx, ZMQ_DEALER);
    assert (backend);
    rc = zmq_connect (backend, "inproc://timeout_test");
    assert (rc == 0);
    rc = zmq_setsockopt (backend, ZMQ_SNDTIMEO, &timeout, sizeof (int));
    assert (rc == 0);

    rc = zmq_send (backend, "Hello", 5, 0);
    assert (rc == 5);
    rc = zmq_recv (frontend, buffer, 32, 0);
    assert (rc == 5);

    //  Clean-up
    rc = zmq_close (backend);
    assert (rc == 0);
    
    rc = zmq_close (frontend);
    assert (rc == 0);
    
    rc = zmq_ctx_term (ctx);
    assert (rc == 0);

    return 0 ;
}

I have debugged the code related to zmq_recv() and I have found that problem in a piece of code inside function int zmq::socket_base_t::recv (msg_t *msg_, int flags_)

//  Compute the time when the timeout should occur.
    //  If the timeout is infinite, don't care.
    int timeout = options.rcvtimeo;
    uint64_t end = timeout < 0 ? 0 : (clock.now_ms () + timeout);

    //  In blocking scenario, commands are processed over and over again until
    //  we are able to fetch a message.
    bool block = (ticks != 0);
    while (true) {
        if (unlikely (process_commands (block ? timeout : 0, false) != 0)) {
            return -1;
        }
        rc = xrecv (msg_);
        if (rc == 0) {
            ticks = 0;
            break;
        }
        if (unlikely (errno != EAGAIN)) {
            return -1;
        }
        block = true;
        if (timeout > 0) {
            timeout = (int) (end - clock.now_ms ());
            if (timeout <= 0) {
                errno = EAGAIN;
                return -1;
            }
        }
    }

In some case the while(true) loop never breaks.
I have logged end, timeout and clock.now_ms() value and have discoverd that value of clock.now_ms() sometimes is not changed, it always returns the same value.
The root of the problem in rdtsc() function.


uint64_t zmq::clock_t::rdtsc ()
{
#if (defined _MSC_VER && (defined _M_IX86 || defined _M_X64))
    return __rdtsc ();
#elif (defined __GNUC__ && (defined __i386__ || defined __x86_64__))
    uint32_t low, high;
    __asm__ volatile ("rdtsc" : "=a" (low), "=d" (high));
    return (uint64_t) high << 32 | low;
#elif (defined __SUNPRO_CC && (__SUNPRO_CC >= 0x5100) && (defined __i386 || \
    defined __amd64 || defined __x86_64))
    union {
        uint64_t u64val;
        uint32_t u32val [2];
    } tsc;
    asm("rdtsc" : "=a" (tsc.u32val [0]), "=d" (tsc.u32val [1]));
    return tsc.u64val;
#elif defined(__s390__)
    uint64_t tsc;
    asm("\tstck\t%0\n" : "=Q" (tsc) : : "cc");
    return(tsc);
#else
    struct timespec ts;
    #if defined ZMQ_HAVE_OSX && __MAC_OS_X_VERSION_MIN_REQUIRED < 101200 // less than macOS 10.12
        alt_clock_gettime (CLOCK_MONOTONIC, &ts);
    #else
        clock_gettime (CLOCK_MONOTONIC, &ts);
    #endif
    return (uint64_t)(ts.tv_sec) * 1000000000 + ts.tv_nsec;
#endif
}
 

as ZMQ_HAVE_OSX define is set, rtdsc() calls function alt_clock_gettime()


int alt_clock_gettime (int clock_id, timespec *ts)
{
    // The clock_id specified is not supported on this system.
    if (clock_id != CLOCK_REALTIME) {
        errno = EINVAL;
        return -1;
    }
    clock_serv_t cclock;
    mach_timespec_t mts;
    host_get_clock_service (mach_host_self (), CALENDAR_CLOCK, &cclock);
    clock_get_time (cclock, &mts);
    mach_port_deallocate (mach_task_self (), cclock);
    ts->tv_sec = mts.tv_sec;
    ts->tv_nsec = mts.tv_nsec;
    return 0;
}

as clock_id == CLOCK_MONOTONIC the function returns -1 with errno = EINVAL.
The return value of alt_clock_gettime() is not checked, so ts struct will be filled by garbage which leads to unpredictable issues on recv.

I have changed the code to use
alt_clock_gettime (CLOCK_REALTIME, &ts);
instead of
alt_clock_gettime (CLOCK_MONOTONIC, &ts);

and this worked for me, but I am not sure that such fix is correct.
Could you please advice if such fix is correct?

@bluca
Copy link
Member

bluca commented Apr 19, 2017

That does seem a bit weird, to have a function that errors out if not using _REALTIME but then calling it with something else, but it was refactored a few times so maybe that's why

Let's start with the simple fix to get past the hang, could you please send a PR? Please also include the unit test

Then, what we really want is the monotonic clock since it's measuring time elapsed, and realtime can jump backwards

@evoskuil @robertcastle @azawawi can we do CLOCK_MONOTONIC on OSX?

That code was added by a8f11b3 and modified by 0dfb32a and bf6d8a5

@azawawi
Copy link
Contributor

azawawi commented Apr 19, 2017

iOS 9 is affected. Please take a look at azawawi/SwiftyZeroMQ#14.

@Asmod4n
Copy link
Contributor

Asmod4n commented Apr 19, 2017

Why is the Wall clock used there? CALENDAR_CLOCK should be changed to SYSTEM_CLOCK or changes from ntpd could have interesting outcomes..

@evoskuil
Copy link
Contributor

evoskuil commented Apr 19, 2017

@Asmod4n The clock choice is parameterized, but I failed to pass on the parameter to host_get_clock_service. Presumably fixing that will resolve the issue.

@bluca
Copy link
Member

bluca commented Apr 19, 2017

Could you please send a PR to fix it?

@Asmod4n
Copy link
Contributor

Asmod4n commented Apr 19, 2017

@bluca done, tests passed locally with the changes.

@bluca bluca closed this as completed in ce602d0 Apr 19, 2017
@bluca
Copy link
Member

bluca commented Apr 19, 2017

@cybermag please try again from latest master

@cybermag
Copy link
Author

I have checked the fixes. All is ok in master branch.
Would such fix be merged into 4.2.2 branch or would it be included on next stable release?

@cybermag
Copy link
Author

Thanks a lot!

@bluca
Copy link
Member

bluca commented Apr 20, 2017

It will be in the next stable release, 4.2.3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants