Condition destroy error shouldn't be a fatal error #2893

@vagetablechicken

When we stop a BE, it always hits a fatal error:

F0213 12:33:53.604131 117681 utils.cpp:1124] fail to destroy cond. err=Device or resource busy

https://github.com/apache/incubator-doris/blob/fd492e3b6fd729e617536842ba4092911f8afae8/be/src/olap/utils.cpp#L133-L139

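For pthread_cond_destroy, EBUSY means an attempt was made to destroy the object referenced by cond while another thread is still referencing it (for example, while it is being used in pthread_cond_wait or pthread_cond_timedwait).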

This is a common situation in multithreaded code, so we shouldn't make it fatal after a single attempt.
How about making it fatal only after several failed attempts? For example:

#define PTHREAD_COND_DESTROY_WITH_LOG(condptr) \
    do { \
        int cond_ret = 0; \
        int try_time = 0; \
        /* Sleep 1s after each of the first 20 failures; the next failure is fatal. */ \
        while (0 != (cond_ret = pthread_cond_destroy(condptr))) { \
            if (try_time++ < 20) { \
                sleep(1); \
            } else { \
                LOG(FATAL) << "fail to destroy cond. err=" << strerror(cond_ret); \
            } \
        } \
    } while (0)
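
The same bounded-retry idea can also be sketched as a function that returns the last error instead of aborting, so the caller chooses whether to log it as fatal. This is only an illustrative sketch: destroy_with_retry and cond_destroy_with_retry are hypothetical names, not existing Doris code, and the 20 attempts with a 1s delay are copied from the macro above rather than being tuned values.

```cpp
#include <pthread.h>

#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical helper (not part of Doris): calls destroy_fn until it returns 0
// or max_attempts calls have failed, sleeping `delay` between attempts.
// Returns 0 on success, otherwise the error code from the last attempt.
template <typename DestroyFn>
int destroy_with_retry(DestroyFn destroy_fn, int max_attempts,
                       std::chrono::milliseconds delay) {
    if (max_attempts < 1) max_attempts = 1;  // always try at least once
    int ret = 0;
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        ret = destroy_fn();
        if (ret == 0) return 0;
        // No point sleeping after the final failed attempt.
        if (attempt < max_attempts) std::this_thread::sleep_for(delay);
    }
    return ret;
}

// Hypothetical convenience wrapper for a real condition variable.
inline int cond_destroy_with_retry(
        pthread_cond_t* cond, int max_attempts = 20,
        std::chrono::milliseconds delay = std::chrono::milliseconds(1000)) {
    assert(cond != nullptr);
    return destroy_with_retry([cond] { return pthread_cond_destroy(cond); },
                              max_attempts, delay);
}
```

A caller could then write something like `int err = cond_destroy_with_retry(&cond);` and log a warning or a fatal error depending on whether err is non-zero. Taking the destroy call as a parameter keeps the retry logic testable without a condition variable that is actually busy.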

My test result:
It waits 10~15s when the BE is idle.
----2020/02/26----
I misunderstood the 10s wait. My stop procedure is:

  1. Send SIGTERM.
  2. Wait 10s; if the process has not exited, send SIGKILL.

So, if only SIGTERM is sent, the BE may take longer to shut itself down.
The root cause is the thread pool management, so this issue should be closed.
