async_work: worker callback is called twice #992

Closed
cspiel1 opened this issue Oct 27, 2023 · 14 comments · Fixed by #995

cspiel1 (Collaborator) commented Oct 27, 2023

Observed this with the baresip selftest in branch https://github.com/cspiel1/baresip/tree/rx_thread_activate, e.g. when executing these two tests in a row:

./test/selftest test_call_rtcp test_call_rtp_timeout

then I often (but not always) get:

==9791== Invalid read of size 8
==9791==    at 0x4894A17: async_work_main (rtprecv.c:741)
==9791==    by 0x48E3CC1: queueh (async.c:150)
==9791==    by 0x490D643: event_handler (mqueue.c:87)
==9791==    by 0x4909ACC: fd_poll (main.c:921)
==9791==    by 0x4909DD5: re_main (main.c:1059)
==9791==    by 0x13CF7F: re_main_timeout (test.c:39)
==9791==    by 0x117270: test_call_rtp_timeout_base (call.c:1001)
==9791==    by 0x117645: test_call_rtp_timeout (call.c:1024)
==9791==    by 0x13D2F8: run_one_test (main.c:86)
==9791==    by 0x13D8CE: main (main.c:273)
==9791==  Address 0x966e3e8 is 200 bytes inside a block of size 440 free'd
==9791==    at 0x484B27F: free (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==9791==    by 0x490C7E4: mem_deref (mem.c:390)
==9791==    by 0x4894AF3: async_work_main (rtprecv.c:763)
==9791==    by 0x48E3CC1: queueh (async.c:150)
==9791==    by 0x490D643: event_handler (mqueue.c:87)
==9791==    by 0x4909ACC: fd_poll (main.c:921)
==9791==    by 0x4909DD5: re_main (main.c:1059)
==9791==    by 0x13CF7F: re_main_timeout (test.c:39)
==9791==    by 0x1296F6: test_call_rtcp_base (call.c:2619)
==9791==    by 0x129A0C: test_call_rtcp (call.c:2648)
==9791==    by 0x13D2F8: run_one_test (main.c:86)
==9791==    by 0x13D8CE: main (main.c:273)

This indicates that test_call_rtcp creates the async work, but the callback (for the result) is invoked during the next test, test_call_rtp_timeout. Are there multiple bugs?

  • The mqueue of re_async is not cleared by re_async_cancel(async, id). A simple clear would not be correct, because the id has to be respected (see the sketch after this list).
  • More general (maybe better suited as a baresip issue): How should selftest use libre_init()/libre_close() and re_thread_async_init()/re_thread_async_close() so that each test is independent of the previous one?
  • Or should we add a function libre_flush(void) that flushes the complete main loop, clears all async workers and clears all file descriptor handlers?
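
A hypothetical helper sketching the first point: a cancel that respects the id has to walk the pending work and clear the handlers of the matching entries, so that a message already sitting in the mqueue pipe becomes a no-op when it is eventually read. The struct layouts below are simplified stand-ins (hence the _sketch suffix), not re's private async.c types; field and list names are assumptions.

/* Hypothetical, NOT re's re_async_cancel(): cancel only entries with a
 * matching id and clear their handlers so a late dispatch does nothing. */
#include <re.h>

struct async_work_sketch {
	struct le le;
	mtx_t *mtx;
	intptr_t id;
	int (*workh)(void *arg);
	void (*cb)(int err, void *arg);
	void *arg;
	int err;
};

struct re_async_sketch {
	mtx_t mtx;
	struct list workl;   /* pending work (assumed name) */
	struct list freel;   /* recycled entries            */
};

static void async_cancel_id(struct re_async_sketch *a, intptr_t id)
{
	struct le *le;

	mtx_lock(&a->mtx);
	LIST_FOREACH(&a->workl, le) {
		struct async_work_sketch *work = le->data;

		if (work->id != id)
			continue;       /* other owners' work must survive */

		mtx_lock(work->mtx);
		work->workh = NULL;     /* worker thread: nothing to run   */
		work->cb    = NULL;     /* queueh: nothing to call back    */
		work->arg   = NULL;
		mtx_unlock(work->mtx);
	}
	mtx_unlock(&a->mtx);
}
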
sreimers (Member) commented Oct 27, 2023

  • The mqueue of re_async is not cleared by re_async_cancel(async, id). A simple clear would not be correct, because the id has to be respected.

I have to take a closer look. But normally work->cb is cleared by re_async_cancel and queueh should not execute.
Can you check in the debugger whether the job is really canceled?

  • More general (maybe better suited as a baresip issue): How should selftest use libre_init()/libre_close() and re_thread_async_init()/re_thread_async_close() so that each test is independent of the previous one?

These functions should only be called once, I think. That is closer to real application usage, and each test should leave a clean state by itself.

  • Or should we add a function libre_flush(void) that flushes the complete main loop, clears all async workers and clears all file descriptor handlers?

I think this would only hide potential problems.
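
A minimal sketch of the "call once" pattern described above, assuming the selftest drives everything from its own main(); the worker count passed to re_thread_async_init() and the run_all_tests() placeholder are illustrative only.

/* Sketch: initialize libre and the async worker pool once, run all tests,
 * tear down at the end. Each test is expected to close its own fds, cancel
 * its own async work and leave the main loop empty before returning. */
#include <re.h>

static int run_all_tests(void)
{
	/* placeholder for the selftest driver */
	return 0;
}

int main(void)
{
	int err = libre_init();
	if (err)
		return err;

	err = re_thread_async_init(4);   /* worker count is arbitrary here */
	if (err)
		goto out;

	err = run_all_tests();

 out:
	re_thread_async_close();
	libre_close();

	return err;
}
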

cspiel1 (Collaborator, Author) commented Oct 27, 2023

Thinking generally: if the last re_main of a test case is canceled and more events are then queued on the file descriptors, they would be handled in the re_main of the next test case. Sure, if everything is correct, all file descriptors should be closed by the first test case.

Regarding mqueue: it uses a pipe which is handled in re_main. If a network event leads to re_cancel(), and during processing of this network event something writes to the mqueue pipe, then the processing is deferred to the next re_main.

Edit: I forgot: the debugging showed that the worker thread processed the workh very quickly (I had none). The mqueue_push() was done during the first test case, while the cb was called in the first re_main of the following test case. The re_async_cancel also worked correctly, but re_async_cancel does not touch the mqueue.

sreimers (Member) commented:

So the job entry is reused after re_async_cancel and then triggered by the old mqueue push, right? But the stack trace shows an access to already freed data; I would normally guess this is the old arg with the old callback.

cspiel1 (Collaborator, Author) commented Oct 27, 2023

So the job entry is reused after re_async_cancel and then triggered by the old mqueue push, right? But the stack trace shows an access to already freed data; I would normally guess this is the old arg with the old callback.

I guess both are true.

Test 1 does:

  • The worker_thread moves the async_work into a->curl and, with mqueue_push(), into the mqueue.
  • Before the pipe of the mqueue is read, re_thread_async_main_cancel() is called, which frees work->arg.
    • Thus the work object of rtprecv.c is freed.
  • re_cancel() is called and re_main() returns. Test 1 ends here.

Test 2:

  • Starts re_main().
  • The pipe is processed and the async_work object is read from it.
  • The function that cb points to is called with the already freed async_work->arg.

This is my theory, derived from the valgrind call stack. I will try to verify it with debugging or maybe with a retest.
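
A stripped-down, self-contained illustration of this timeline (plain POSIX, not re code): a pointer is handed to the "main loop" through a pipe, the cancel path frees the argument before the pipe is drained, and the next loop iteration fires the callback. Clearing the pointers is what keeps that late callback harmless. All names here are invented for the example.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

struct work {
	void (*cb)(int err, void *arg);
	void *arg;
};

static void result_cb(int err, void *arg)
{
	printf("cb: err=%d arg=%s\n", err, arg ? (char *)arg : "(cancelled)");
}

int main(void)
{
	int fds[2];
	if (pipe(fds) != 0)
		return 1;

	/* "Test 1": the worker finishes and pushes the work object */
	struct work *w = malloc(sizeof(*w));
	w->cb  = result_cb;
	w->arg = strdup("rtp receiver state");
	if (write(fds[1], &w, sizeof(w)) != (ssize_t)sizeof(w))
		return 1;

	/* cancel path: the argument is freed before the pipe is read;
	 * clearing the pointers avoids a use-after-free later on */
	free(w->arg);
	w->arg = NULL;

	/* "Test 2": the next main loop drains the pipe and dispatches */
	struct work *rd = NULL;
	if (read(fds[0], &rd, sizeof(rd)) == (ssize_t)sizeof(rd) && rd->cb)
		rd->cb(0, rd->arg);

	free(w);
	close(fds[0]);
	close(fds[1]);
	return 0;
}
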

cspiel1 (Collaborator, Author) commented Oct 28, 2023

The debugging commits on the branch show that the async work callback is called twice. So there is no interference from the previous test case; it seems to be a different problem in async_work or mqueue. I will update the title of the issue.

[ RUN      ] test_call_rtp_timeout
pass_rtcp_work:89 HUUUUUU rx=0x54a91c0
pass_rtcp_work:103 rx=0x54bf680 re_thread_async_main_id arg=0x54ea500
async_work_main:745 HUUUUUU arg=0x54ea500
async_work_main:745 HUUUUUU arg=0x54ea500
==122747== Invalid read of size 8
==122747==    at 0x4894AC9: async_work_main (rtprecv.c:746)
==122747==    by 0x48E3CC1: queueh (async.c:150)
==122747==    by 0x490D643: event_handler (mqueue.c:87)
==122747==    by 0x4909ACC: fd_poll (main.c:921)
==122747==    by 0x4909DD5: re_main (main.c:1059)
==122747==    by 0x13CDFE: re_main_timeout (test.c:39)
==122747==    by 0x117270: test_call_rtp_timeout_base (call.c:1001)
==122747==    by 0x117645: test_call_rtp_timeout (call.c:1027)
==122747==    by 0x13D177: run_one_test (main.c:86)
==122747==    by 0x13D74D: main (main.c:273)
==122747==  Address 0x54ea508 is 200 bytes inside a block of size 440 free'd
==122747==    at 0x484B27F: free (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==122747==    by 0x490C7E4: mem_deref (mem.c:390)
==122747==    by 0x4894BA5: async_work_main (rtprecv.c:768)
==122747==    by 0x48E3CC1: queueh (async.c:150)
==122747==    by 0x490D643: event_handler (mqueue.c:87)
==122747==    by 0x4909ACC: fd_poll (main.c:921)
==122747==    by 0x4909DD5: re_main (main.c:1059)
==122747==    by 0x13CDFE: re_main_timeout (test.c:39)
==122747==    by 0x117270: test_call_rtp_timeout_base (call.c:1001)
==122747==    by 0x117645: test_call_rtp_timeout (call.c:1027)
==122747==    by 0x13D177: run_one_test (main.c:86)
==122747==    by 0x13D74D: main (main.c:273)

cspiel1 changed the title from "async_work: worker callbacks are called after cancel" to "async_work: worker callback is called twice" on Oct 28, 2023
sreimers (Member) commented Oct 28, 2023

Just a quick guess, but can you try this? It should not be needed, but maybe there is a strange race condition:

diff --git a/src/async/async.c b/src/async/async.c
index 7d37ac7f..35fb0814 100644
--- a/src/async/async.c
+++ b/src/async/async.c
@@ -70,6 +70,7 @@ static int worker_thread(void *arg)
                mtx_lock(work->mtx);
                if (work->workh)
                        work->err = work->workh(work->arg);
+               work->workh = NULL;
                mtx_unlock(work->mtx);
 
                mtx_lock(&a->mtx);
@@ -151,6 +152,7 @@ static void queueh(int id, void *data, void *arg)
        mtx_unlock(work->mtx);
 
        mtx_lock(&async->mtx);
+       work->cb = NULL;
        list_move(&work->le, &async->freel);
        mtx_unlock(&async->mtx);
 }

cspiel1 (Collaborator, Author) commented Oct 28, 2023

I tried ~10 times. This solves the crash with the double cb calls. But wouldn't it be better within one lock?

        mtx_lock(work->mtx);
-       if (work->cb)
+       if (work->cb) {
                work->cb(work->err, work->arg);
+               work->cb = NULL;
+       }
        mtx_unlock(work->mtx);
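
(For reference, combining this with sreimers' diff above, queueh would end up roughly as follows. This is a reconstruction from the diffs in this thread, not a copy of async.c, and the struct and variable names are assumed from them.)

/* Sketch of queueh with both changes: the callback is invoked at most once
 * and cleared under the same work->mtx lock, then the entry is recycled. */
static void queueh(int id, void *data, void *arg)
{
	struct async_work *work = data;
	struct re_async *async  = arg;
	(void)id;

	mtx_lock(work->mtx);
	if (work->cb) {
		work->cb(work->err, work->arg);
		work->cb = NULL;   /* guards against a second dispatch */
	}
	mtx_unlock(work->mtx);

	mtx_lock(&async->mtx);
	list_move(&work->le, &async->freel);
	mtx_unlock(&async->mtx);
}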

Do you have an idea why the same object is read twice from the pipe?

sreimers (Member) commented:

I tried ~10 times. This solves the crash with the double cb calls. But wouldn't it be better within one lock?

Yes, looks fine.

Do you have an idea why the same object is read twice from the pipe?

Is mqueue_push called twice for this job?

cspiel1 (Collaborator, Author) commented Oct 29, 2023

One mqueue_push and two calls of queueh.

[ RUN      ] test_call_rtp_timeout
pass_rtcp_work:89 HUUUUUU rx=0x54cedf0
pass_rtcp_work:103 rx=0x54e53f0 re_thread_async_main_id arg=0x5511a90
worker_thread:78 HUUUUUU arg=0x5511a90
queueh:155 HUUUUUU call cb of arg 0x5511a90
async_work_main:745 HUUUUUU arg=0x5511a90
queueh:155 HUUUUUU call cb of arg 0x5511a90
async_work_main:745 HUUUUUU arg=0x5511a90
==199974== Invalid read of size 8

Can be reproduced with branch cspiel1/rx_thread_activate:

valgrind --leak-check=full --show-reachable=yes --error-exitcode=42 ./test/selftest test_call_rtcp test_call_rtp_timeout

cspiel1 added a commit to cspiel1/re that referenced this issue Oct 30, 2023
This solves a strange race condition which results in calling `cb` twice for
one async work object.
cspiel1 (Collaborator, Author) commented Oct 30, 2023

Workaround: #993

The workaround does not solve the root of the problem, which seems to be in mqueue or in how the pipe is used. It only avoids the worst consequence: the access to already freed memory.

sreimers (Member) commented:

I will try to reproduce this today; maybe there is a strange bug within the event loop handling.

cspiel1 (Collaborator, Author) commented Oct 30, 2023

Thanks! Maybe the pipe is handled in two re_main loops.

sreimers (Member) commented Oct 31, 2023

I think I have found the root cause. It looks like both epoll loops are handling the same event. Before the refactoring this was not a big problem, since there was a separate table lookup per event loop. Now epoll provides a pointer to the fhs object instead.

Will try to fix this.

edit: wrong assumption, it is executed by the same main thread. Very strange.

sreimers (Member) commented Oct 31, 2023

Log format: mqueue: <push|execute> [data pointer] [thread id]

mqueue: push 0x612000000580 140364056876736                                                                                                                 
mqueue: push 0x612000000700 140364036912832                                                                                                                 
mqueue: push 0x612000000280 140364073662144                                                                                                                 
mqueue: push 0x612000000400 140364065269440  PUSH 1                                 
mqueue: push 0x612000040180 140364056876736                                   
mqueue: push 0x612000040300 140364036912832                                                                                                                 
mqueue: push 0x612000040480 140364036912832                                   
mqueue: push 0x612000000580 140364065269440                                                                                                                 
mqueue: push 0x612000000700 140364065269440                                                                                                                 
mqueue: execute 0x612000000580 140364156978176                                                                                                              
mqueue: execute 0x612000000700 140364156978176                                                                                                              
mqueue: execute 0x612000000280 140364156978176                                                                                                              
mqueue: push 0x612000000400 140364036912832   PUSH 2                                                                                                              
mqueue: execute 0x612000000400 140364156978176   EXEC 1                                                                                                          
mqueue: push 0x612000040180 140364073662144                                                                                                                 
mqueue: push 0x612000040300 140364065269440                                                                                                                 
mqueue: execute 0x612000040180 140364156978176                                                                                                              
mqueue: execute 0x612000040300 140364156978176                                                                                                              
mqueue: execute 0x612000040480 140364156978176                                                                                                              
mqueue: execute 0x612000000580 140364156978176                                                                                                              
mqueue: execute 0x612000000700 140364156978176                                                                                                              
mqueue: execute 0x612000000400 140364156978176  EXEC 2 crash

Now we are getting closer: there is another mqueue push for the same data pointer from a different thread before the execute, which is hard to see at first glance.
Not sure how this could happen.
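
The trace above looks like it logs the data pointer and the calling thread at each push and execute. A guess at what such instrumentation could look like (this is not the actual debug patch; thrd_current() is C11 <threads.h>, and casting the thread handle to unsigned long for printing is a platform assumption that holds on glibc):

#include <stdio.h>
#include <threads.h>

/* Hypothetical tracing helper: log an mqueue action, the data pointer and
 * the id of the calling thread. */
static void trace_mqueue(const char *action, const void *data)
{
	fprintf(stderr, "mqueue: %s %p %lu\n",
		action, data, (unsigned long)thrd_current());
}

/* usage sketch:
 *   trace_mqueue("push", data);     at the start of mqueue_push()
 *   trace_mqueue("execute", data);  in the read handler, before dispatch
 */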

sreimers pushed a commit that referenced this issue Oct 31, 2023
This solves a strange race condition which results in calling `cb` twice for
one async work object.