async_work: worker callback is called twice #992

Closed
cspiel1 opened this issue Oct 27, 2023 · 14 comments · Fixed by #995

cspiel1 (Collaborator) commented Oct 27, 2023

Observed this with the baresip selftest in branch https://github.com/cspiel1/baresip/tree/rx_thread_activate, e.g. when executing these two tests in a row:

./test/selftest test_call_rtcp test_call_rtp_timeout

then I often (but not always) get:

==9791== Invalid read of size 8
==9791==    at 0x4894A17: async_work_main (rtprecv.c:741)
==9791==    by 0x48E3CC1: queueh (async.c:150)
==9791==    by 0x490D643: event_handler (mqueue.c:87)
==9791==    by 0x4909ACC: fd_poll (main.c:921)
==9791==    by 0x4909DD5: re_main (main.c:1059)
==9791==    by 0x13CF7F: re_main_timeout (test.c:39)
==9791==    by 0x117270: test_call_rtp_timeout_base (call.c:1001)
==9791==    by 0x117645: test_call_rtp_timeout (call.c:1024)
==9791==    by 0x13D2F8: run_one_test (main.c:86)
==9791==    by 0x13D8CE: main (main.c:273)
==9791==  Address 0x966e3e8 is 200 bytes inside a block of size 440 free'd
==9791==    at 0x484B27F: free (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==9791==    by 0x490C7E4: mem_deref (mem.c:390)
==9791==    by 0x4894AF3: async_work_main (rtprecv.c:763)
==9791==    by 0x48E3CC1: queueh (async.c:150)
==9791==    by 0x490D643: event_handler (mqueue.c:87)
==9791==    by 0x4909ACC: fd_poll (main.c:921)
==9791==    by 0x4909DD5: re_main (main.c:1059)
==9791==    by 0x13CF7F: re_main_timeout (test.c:39)
==9791==    by 0x1296F6: test_call_rtcp_base (call.c:2619)
==9791==    by 0x129A0C: test_call_rtcp (call.c:2648)
==9791==    by 0x13D2F8: run_one_test (main.c:86)
==9791==    by 0x13D8CE: main (main.c:273)

This indicates that test_call_rtcp creates the async work, but the callback (for the result) is invoked during the next test, test_call_rtp_timeout. Are there multiple bugs?

  • The mqueue of re_async is not cleared by re_async_cancel(async, id). A simple clear would not be correct, because the id has to be respected (see the sketch after this list).
  • More general (maybe better suited as a baresip issue): How should selftest use libre_init()/libre_close() and re_thread_async_init()/re_thread_async_close() so that each test is independent of the previous one?
  • Or should we add a function libre_flush(void) that flushes the complete main loop, clears all async workers and clears all file descriptor handlers?
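
A hypothetical helper sketching the first point: a cancel that respects the id has to walk the pending work and clear the handlers of the matching entries, so that a message already sitting in the mqueue pipe becomes a no-op when it is eventually read. The struct layouts below are simplified stand-ins (hence the _sketch suffix), not re's private async.c types; field and list names are assumptions.

/* Hypothetical, NOT re's re_async_cancel(): cancel only entries with a
 * matching id and clear their handlers so a late dispatch does nothing. */
#include <re.h>

struct async_work_sketch {
	struct le le;
	mtx_t *mtx;
	intptr_t id;
	int (*workh)(void *arg);
	void (*cb)(int err, void *arg);
	void *arg;
	int err;
};

struct re_async_sketch {
	mtx_t mtx;
	struct list workl;   /* pending work (assumed name) */
	struct list freel;   /* recycled entries            */
};

static void async_cancel_id(struct re_async_sketch *a, intptr_t id)
{
	struct le *le;

	mtx_lock(&a->mtx);
	LIST_FOREACH(&a->workl, le) {
		struct async_work_sketch *work = le->data;

		if (work->id != id)
			continue;       /* other owners' work must survive */

		mtx_lock(work->mtx);
		work->workh = NULL;     /* worker thread: nothing to run   */
		work->cb    = NULL;     /* queueh: nothing to call back    */
		work->arg   = NULL;
		mtx_unlock(work->mtx);
	}
	mtx_unlock(&a->mtx);
}
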
sreimers (Member) commented Oct 27, 2023

  • The mqueue of re_async is not cleared by re_async_cancel(async, id). A simple clear would not be correct, because the id has to be respected.

I have to take a closer look. But normally work->cb is cleared by re_async_cancel and queueh should not execute.
Can you check in the debugger whether the job is really canceled?

  • More general (maybe better suited as a baresip issue): How should selftest use libre_init()/libre_close() and re_thread_async_init()/re_thread_async_close() so that each test is independent of the previous one?

These functions should only be called once, I think. That is closer to real application usage, and each test should leave a clean state by itself.

  • Or should we add a function libre_flush(void) that flushes the complete main loop, clears all async workers and clears all file descriptor handlers?

I think this would only hide potential problems.
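
A minimal sketch of the "call once" pattern described above, assuming the selftest drives everything from its own main(); the worker count passed to re_thread_async_init() and the run_all_tests() placeholder are illustrative only.

/* Sketch: initialize libre and the async worker pool once, run all tests,
 * tear down at the end. Each test is expected to close its own fds, cancel
 * its own async work and leave the main loop empty before returning. */
#include <re.h>

static int run_all_tests(void)
{
	/* placeholder for the selftest driver */
	return 0;
}

int main(void)
{
	int err = libre_init();
	if (err)
		return err;

	err = re_thread_async_init(4);   /* worker count is arbitrary here */
	if (err)
		goto out;

	err = run_all_tests();

 out:
	re_thread_async_close();
	libre_close();

	return err;
}
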

cspiel1 (Collaborator, Author) commented Oct 27, 2023

Thinking generally: if the last re_main of a test case is canceled and more events are then queued on the file descriptors, they would be handled in the re_main of the next test case. Sure, if everything is correct, all file descriptors should be closed by the first test case.

Regarding mqueue: it uses a pipe which is handled in re_main. If a network event leads to re_cancel(), and during processing of this network event something writes to the mqueue pipe, then the processing is deferred to the next re_main.

Edit: I forgot: the debugging showed that the worker thread processed the workh very quickly (I had none). The mqueue_push() was done during the first test case, while the cb was called in the first re_main of the following test case. The re_async_cancel also worked correctly, but re_async_cancel does not touch the mqueue.

sreimers (Member) commented:

So the job entry is reused after re_async_cancel and then triggered by the old mqueue push, right? But the stack trace shows an access to already freed data; I would normally guess this is the old arg with the old callback.

cspiel1 (Collaborator, Author) commented Oct 27, 2023

So the job entry is reused after re_async_cancel and then triggered by the old mqueue push, right? But the stack trace shows an access to already freed data; I would normally guess this is the old arg with the old callback.

I guess both are true.

Test 1 does:

  • The worker_thread moves the async_work into a->curl and, with mqueue_push(), into the mqueue.
  • Before the pipe of the mqueue is read, re_thread_async_main_cancel() is called, which frees work->arg.
    • Thus the work object of rtprecv.c is freed.
  • re_cancel() is called and re_main() returns. Test 1 ends here.

Test 2:

  • Starts re_main().
  • The pipe is processed and the async_work object is read from it.
  • The function that cb points to is called with the already freed async_work->arg.

This is my theory, derived from the valgrind call stack. I will try to verify it with debugging or maybe with a retest.
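
A stripped-down, self-contained illustration of this timeline (plain POSIX, not re code): a pointer is handed to the "main loop" through a pipe, the cancel path frees the argument before the pipe is drained, and the next loop iteration fires the callback. Clearing the pointers is what keeps that late callback harmless. All names here are invented for the example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct work {
	void (*cb)(int err, void *arg);
	void *arg;
};

static void result_cb(int err, void *arg)
{
	printf("cb: err=%d arg=%s\n", err, arg ? (char *)arg : "(cancelled)");
}

int main(void)
{
	int fds[2];
	if (pipe(fds) != 0)
		return 1;

	/* "Test 1": the worker finishes and pushes the work object */
	struct work *w = malloc(sizeof(*w));
	w->cb  = result_cb;
	w->arg = strdup("rtp receiver state");
	if (write(fds[1], &w, sizeof(w)) != (ssize_t)sizeof(w))
		return 1;

	/* cancel path: the argument is freed before the pipe is read;
	 * clearing the pointers avoids a use-after-free later on */
	free(w->arg);
	w->arg = NULL;

	/* "Test 2": the next main loop drains the pipe and dispatches */
	struct work *rd = NULL;
	if (read(fds[0], &rd, sizeof(rd)) == (ssize_t)sizeof(rd) && rd->cb)
		rd->cb(0, rd->arg);

	free(w);
	close(fds[0]);
	close(fds[1]);
	return 0;
}
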

cspiel1 (Collaborator, Author) commented Oct 28, 2023

The debugging commits on the branch show that the async work callback is called twice. So there is no interference from the previous test case; it seems to be a different problem in async_work or mqueue. I will update the title of the issue.

[ RUN      ] test_call_rtp_timeout
pass_rtcp_work:89 HUUUUUU rx=0x54a91c0
pass_rtcp_work:103 rx=0x54bf680 re_thread_async_main_id arg=0x54ea500
async_work_main:745 HUUUUUU arg=0x54ea500
async_work_main:745 HUUUUUU arg=0x54ea500
==122747== Invalid read of size 8
==122747==    at 0x4894AC9: async_work_main (rtprecv.c:746)
==122747==    by 0x48E3CC1: queueh (async.c:150)
==122747==    by 0x490D643: event_handler (mqueue.c:87)
==122747==    by 0x4909ACC: fd_poll (main.c:921)
==122747==    by 0x4909DD5: re_main (main.c:1059)
==122747==    by 0x13CDFE: re_main_timeout (test.c:39)
==122747==    by 0x117270: test_call_rtp_timeout_base (call.c:1001)
==122747==    by 0x117645: test_call_rtp_timeout (call.c:1027)
==122747==    by 0x13D177: run_one_test (main.c:86)
==122747==    by 0x13D74D: main (main.c:273)
==122747==  Address 0x54ea508 is 200 bytes inside a block of size 440 free'd
==122747==    at 0x484B27F: free (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==122747==    by 0x490C7E4: mem_deref (mem.c:390)
==122747==    by 0x4894BA5: async_work_main (rtprecv.c:768)
==122747==    by 0x48E3CC1: queueh (async.c:150)
==122747==    by 0x490D643: event_handler (mqueue.c:87)
==122747==    by 0x4909ACC: fd_poll (main.c:921)
==122747==    by 0x4909DD5: re_main (main.c:1059)
==122747==    by 0x13CDFE: re_main_timeout (test.c:39)
==122747==    by 0x117270: test_call_rtp_timeout_base (call.c:1001)
==122747==    by 0x117645: test_call_rtp_timeout (call.c:1027)
==122747==    by 0x13D177: run_one_test (main.c:86)
==122747==    by 0x13D74D: main (main.c:273)

cspiel1 changed the title from "async_work: worker callbacks are called after cancel" to "async_work: worker callback is called twice" on Oct 28, 2023
sreimers (Member) commented Oct 28, 2023

Just a quick guess, but can you try this? It should not be needed, but maybe there is a strange race condition:

diff --git a/src/async/async.c b/src/async/async.c
index 7d37ac7f..35fb0814 100644
--- a/src/async/async.c
+++ b/src/async/async.c
@@ -70,6 +70,7 @@ static int worker_thread(void *arg)
                mtx_lock(work->mtx);
                if (work->workh)
                        work->err = work->workh(work->arg);
+               work->workh = NULL;
                mtx_unlock(work->mtx);
 
                mtx_lock(&a->mtx);
@@ -151,6 +152,7 @@ static void queueh(int id, void *data, void *arg)
        mtx_unlock(work->mtx);
 
        mtx_lock(&async->mtx);
+       work->cb = NULL;
        list_move(&work->le, &async->freel);
        mtx_unlock(&async->mtx);
 }

cspiel1 (Collaborator, Author) commented Oct 28, 2023

I tried ~10 times. This solves the crash with the double cb calls. But wouldn't it be better within one lock?

        mtx_lock(work->mtx);
-       if (work->cb)
+       if (work->cb) {
                work->cb(work->err, work->arg);
+               work->cb = NULL;
+       }
        mtx_unlock(work->mtx);
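
(For reference, combining this with sreimers' diff above, queueh would end up roughly as follows. This is a reconstruction from the diffs in this thread, not a copy of async.c, and the struct and variable names are assumed from them.)

/* Sketch of queueh with both changes: the callback is invoked at most once
 * and cleared under the same work->mtx lock, then the entry is recycled. */
static void queueh(int id, void *data, void *arg)
{
	struct async_work *work = data;
	struct re_async *async  = arg;
	(void)id;

	mtx_lock(work->mtx);
	if (work->cb) {
		work->cb(work->err, work->arg);
		work->cb = NULL;   /* guards against a second dispatch */
	}
	mtx_unlock(work->mtx);

	mtx_lock(&async->mtx);
	list_move(&work->le, &async->freel);
	mtx_unlock(&async->mtx);
}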

Do you have an idea why the same object is read twice from the pipe?

sreimers (Member) commented:

I tried ~10 times. This solves the crash with the double cb calls. But wouldn't it be better within one lock?

Yes, looks fine.

Do you have an idea why the same object is read twice from the pipe?

Is mqueue_push called twice for this job?

cspiel1 (Collaborator, Author) commented Oct 29, 2023

One mqueue_push and two calls of queueh.

[ RUN      ] test_call_rtp_timeout
pass_rtcp_work:89 HUUUUUU rx=0x54cedf0
pass_rtcp_work:103 rx=0x54e53f0 re_thread_async_main_id arg=0x5511a90
worker_thread:78 HUUUUUU arg=0x5511a90
queueh:155 HUUUUUU call cb of arg 0x5511a90
async_work_main:745 HUUUUUU arg=0x5511a90
queueh:155 HUUUUUU call cb of arg 0x5511a90
async_work_main:745 HUUUUUU arg=0x5511a90
==199974== Invalid read of size 8

Can be reproduced with branch cspiel1/rx_thread_activate:

valgrind --leak-check=full --show-reachable=yes --error-exitcode=42 ./test/selftest test_call_rtcp test_call_rtp_timeout

cspiel1 added a commit to cspiel1/re that referenced this issue Oct 30, 2023
This solves a strange race condition which results in calling `cb` twice for
one async work object.
cspiel1 (Collaborator, Author) commented Oct 30, 2023

Workaround: #993

The workaround does not solve the root of the problem, which seems to be in mqueue or in how the pipe is used. It only avoids the worst consequence: the access to already freed memory.

sreimers (Member) commented:

I will try to reproduce this today; maybe there is a strange bug within the event loop handling.

cspiel1 (Collaborator, Author) commented Oct 30, 2023

Thanks! Maybe the pipe is handled in two re_main loops.

sreimers (Member) commented Oct 31, 2023

I think I have found the root cause. It looks like both epoll loops are handling the same event. Before the refactoring this was not a big problem, since there was a separate table lookup per event loop. Now epoll provides a pointer to the fhs object instead.

Will try to fix this.

edit: wrong assumption, it is executed by the same main thread. Very strange.

sreimers (Member) commented Oct 31, 2023

Log format: mqueue: <push|execute> [data pointer] [thread id]

mqueue: push 0x612000000580 140364056876736                                                                                                                 
mqueue: push 0x612000000700 140364036912832                                                                                                                 
mqueue: push 0x612000000280 140364073662144                                                                                                                 
mqueue: push 0x612000000400 140364065269440  PUSH 1                                 
mqueue: push 0x612000040180 140364056876736                                   
mqueue: push 0x612000040300 140364036912832                                                                                                                 
mqueue: push 0x612000040480 140364036912832                                   
mqueue: push 0x612000000580 140364065269440                                                                                                                 
mqueue: push 0x612000000700 140364065269440                                                                                                                 
mqueue: execute 0x612000000580 140364156978176                                                                                                              
mqueue: execute 0x612000000700 140364156978176                                                                                                              
mqueue: execute 0x612000000280 140364156978176                                                                                                              
mqueue: push 0x612000000400 140364036912832   PUSH 2                                                                                                              
mqueue: execute 0x612000000400 140364156978176   EXEC 1                                                                                                          
mqueue: push 0x612000040180 140364073662144                                                                                                                 
mqueue: push 0x612000040300 140364065269440                                                                                                                 
mqueue: execute 0x612000040180 140364156978176                                                                                                              
mqueue: execute 0x612000040300 140364156978176                                                                                                              
mqueue: execute 0x612000040480 140364156978176                                                                                                              
mqueue: execute 0x612000000580 140364156978176                                                                                                              
mqueue: execute 0x612000000700 140364156978176                                                                                                              
mqueue: execute 0x612000000400 140364156978176  EXEC 2 crash

Now we are getting closer: there is another mqueue push for the same data pointer from a different thread before the execute, which is hard to see at first glance.
Not sure how this could happen.
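
The trace above looks like it logs the data pointer and the calling thread at each push and execute. A guess at what such instrumentation could look like (this is not the actual debug patch; thrd_current() is C11 <threads.h>, and casting the thread handle to unsigned long for printing is a platform assumption that holds on glibc):

#include <stdio.h>
#include <threads.h>

/* Hypothetical tracing helper: log an mqueue action, the data pointer and
 * the id of the calling thread. */
static void trace_mqueue(const char *action, const void *data)
{
	fprintf(stderr, "mqueue: %s %p %lu\n",
		action, data, (unsigned long)thrd_current());
}

/* usage sketch:
 *   trace_mqueue("push", data);     at the start of mqueue_push()
 *   trace_mqueue("execute", data);  in the read handler, before dispatch
 */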

sreimers pushed a commit that referenced this issue Oct 31, 2023
This solves a strange race condition which results in calling `cb` twice for
one async work object.