Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eio_luv: Process.spawn → “Switch hook removed from wrong domain!” #446

Closed
smondet opened this issue Feb 22, 2023 · 11 comments
Closed

Comments

@smondet
Copy link
Contributor

smondet commented Feb 22, 2023

Using Eio_luv.Low_level.Process.spawn 1000s of times, I sometimes (apparently randomly) get:

Invalid_argument("Switch hook removed from wrong domain!")
Raised at Stdlib.invalid_arg in file "stdlib.ml", line 30, characters 20-45
Called from Eio__core__Switch.remove_hook in file "vendor/eio/lib_eio/core/switch.ml", line 20, characters 33-85
Called from Stdlib__Option.iter in file "option.ml" (inlined), line 26, characters 32-35
Called from Eio_luv.Low_level.Process.spawn.on_exit in file "vendor/eio/lib_eio_luv/eio_luv.ml", line 610, characters 8-44
Called from Luv__Process.spawn.(fun) in file "src/process.ml", line 117, characters 12-54

I think I have removed all uses of Domains in the application and I still get it (seems less often though).

(I know Luv is going away but reporting in case it is a deeper issue.)

@patricoferris
Copy link
Collaborator

I'll see if I can look into this a little more to make sure it isn't a deeper issue, but I put some debugging statements in the remove_hook function and was able to reproduce where the switch's domain ID was 2323485528 and the call to Domain.self () in that function was 0...

@patricoferris
Copy link
Collaborator

What's strange is that the only way for the switch domain ID to be set, as far as I can tell, is through the cancellation context here:

{ state = Finished; children; protected; fibers; domain = Domain.self () }
and with printf debugging there all of the ids going in are 0 but when they come back out the other side through remove_hook it has become a different number. I'm wondering if this is some nasty interaction with the callback retention mechanism, shared state and system threads in Luv... in which case perhaps it is another one for #434 ?

@talex5
Copy link
Collaborator

talex5 commented Feb 23, 2023

Can you record it doing that with rr or similar? Would be interesting to see when the 0 changes to another value.

There are certainly some errors in luv's ctypes bindings (e.g. aantron/luv#144), but it's possible the corruption is elsewhere so it would be good to know for sure.

@patricoferris
Copy link
Collaborator

As an update I took a quick look at using rr -- it seems that the thread-sigmask dance in eio_luv was getting in the way a bit (maybe I'm just using rr incorrectly) but when I removed it I couldn't reproduce the bug. It seems the hook is getting completely corrupted somewhere though and we're just lucky that it's managing to read an integer value from that place in memory I think. Adding some debug printing around the hook and not raising the exception seems to segfault but then a lot of the backtraces are lost to caml_callbacks e.g.

#0  0x000055d44c56507a in caml_apply2 ()
#1  <signal handler called>
#2  0x000055d44c6a2f62 in caml_callback2_exn (closure=<optimised out>, closure@entry=100317559916616, arg1=<optimised out>, arg2=<optimised out>, arg2@entry=1)
    at runtime/callback.c:190
#3  0x000055d44c6a342d in caml_callback2 (closure=closure@entry=100317559916616, arg1=<optimised out>, arg2=arg2@entry=1) at runtime/callback.c:261
#4  0x000055d44c672c7d in luv_exit_trampoline (c_handle=<optimised out>, exit_status=0, term_signal=0) at helpers.c:105
#5  0x000055d44c680912 in uv__chld (handle=<optimised out>, signum=<optimised out>) at src/unix/process.c:114
#6  0x000055d44c681fd6 in uv__signal_event (loop=0x55d44db168a0, w=<optimised out>, events=<optimised out>) at src/unix/signal.c:461
#7  0x000055d44c68af00 in uv__io_poll (loop=loop@entry=0x55d44db168a0, timeout=-1) at src/unix/epoll.c:388
#8  0x000055d44c6789fc in uv_run (loop=loop@entry=0x55d44db168a0, mode=mode@entry=UV_RUN_DEFAULT) at src/unix/core.c:389
#9  0x000055d44c66e3a4 in luv_stub_blocking_1_uv_run (x1377=<optimised out>, x1376=<optimised out>) at c_generated_functions.c:1800
#10 <signal handler called>
#11 0x000055d44c596728 in camlLuv_c_generated_functions__fun_6621 () at src/c/luv_c_generated_functions.ml:3216
#12 0x000055d44c570a18 in camlEio_luv__run2_5705 () at lib_eio_luv/eio_luv.ml:1302
#13 0x000055d44c6504eb in camlStdlib__Fun__protect_319 () at fun.ml:33
#14 0x000055d44c56534e in camlDune__exe__Main__entry () at main.ml:2
#15 0x000055d44c560270 in caml_program ()
#16 <signal handler called>
#17 0x000055d44c6c8d9d in caml_startup_common (pooling=<optimised out>, argv=0x7fffba429808) at runtime/startup_nat.c:129
#18 caml_startup_common (argv=0x7fffba429808, pooling=<optimised out>) at runtime/startup_nat.c:85
#19 0x000055d44c6c8e2f in caml_startup_exn (argv=<optimised out>) at runtime/startup_nat.c:136
#20 caml_startup (argv=<optimised out>) at runtime/startup_nat.c:141
--Type <RET> for more, q to quit, c to continue without paging--
#21 caml_main (argv=<optimised out>) at runtime/startup_nat.c:148
#22 0x000055d44c55f8d2 in main (argc=<optimised out>, argv=<optimised out>) at runtime/main.c:37

Because of adding this in on_exit:

(try Option.iter Switch.remove_hook !hook with _ -> Eio.traceln "%a %a" Switch.dump sw (Fmt.option Switch.dump_hook) !hook);

Without catching that exception, we end up falling through to the finally block in the main Eio_luv run function:

Invalid_argument("Switch hook removed from wrong domain!")
[New Thread 640819.640820]

Thread 2 received signal SIGKILL, Killed.
[Switching to Thread 640819.640820]
__read_chk (fd=fd@entry=3, buf=buf@entry=0x6a9a5ecddcc0, nbytes=1, buflen=buflen@entry=65536) at ./debug/read_chk.c:33
33      ./debug/read_chk.c: No such file or directory.
(rr) bt
#0  __read_chk (fd=fd@entry=3, buf=buf@entry=0x6a9a5ecddcc0, nbytes=1, buflen=buflen@entry=65536) at ./debug/read_chk.c:33
#1  0x000055e818df596b in read (__nbytes=<optimised out>, __buf=0x6a9a5ecddcc0, __fd=<optimised out>) at /usr/include/x86_64-linux-gnu/bits/unistd.h:38
#2  caml_unix_read (fd=<optimised out>, buf=<optimised out>, ofs=1, len=<optimised out>) at read_unix.c:32
#3  <signal handler called>
#4  0x000055e818d3380f in camlUnix__read_662 () at unix.ml:262
#5  0x000055e818cd0d5e in camlEio_luv__loop_5984 () at lib_eio_luv/eio_luv.ml:1319
#6  0x000055e818d30067 in camlThread__fun_840 () at thread.ml:48
#7  <signal handler called>
#8  0x000055e818e00850 in caml_callback_exn (closure=<optimised out>, closure@entry=121698424279728, arg=<optimised out>, arg@entry=1) at runtime/callback.c:168
#9  0x000055e818df1d06 in caml_thread_start (v=<optimised out>) at st_stubs.c:547
#10 0x00006a9a5ed83b43 in start_thread (arg=<optimised out>) at ./nptl/pthread_create.c:442
#11 0x00006a9a5ee14bb4 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:100

@talex5
Copy link
Collaborator

talex5 commented Feb 24, 2023

As an update I took a quick look at using rr -- it seems that the thread-sigmask dance in eio_luv was getting in the way a bit (maybe I'm just using rr incorrectly) but when I removed it I couldn't reproduce the bug.

That's odd. I'd expect unmasking signals in the Eio thread would cause memory corruption, because libuv installs signal handlers without using the alt stack (https://github.com/libuv/libuv/blob/988f2bfc4defb9a85a536a3e645834c161143ee0/src/unix/signal.c#L228). For example, that's why Lwt programs suffer memory corruption on OCaml 5 when spawning processes (ocsigen/lwt#981).

Adding some Gc.full_major calls may be another way to make it fail sooner.

@smondet
Copy link
Contributor Author

smondet commented Feb 28, 2023

Do you know a (temporary? ugly? slow?) work around for this?

(That way I can keep testing other things further while waiting for #435 )

@talex5
Copy link
Collaborator

talex5 commented Mar 2, 2023

Do you know a (temporary? ugly? slow?) work around for this?

Not easily, without knowing what's causing it. Would be interesting to know if #435 works correctly if you want to test that. I'll be taking a look at the process stuff next, now that #448 is ready for review.

@talex5
Copy link
Collaborator

talex5 commented Mar 3, 2023

This looks wrong (from https://github.com/aantron/luv/blob/master/src/c/helpers.c):

// CAMLlocal is (probably?) not needed here, because there is already a gc root
// pointing to the callback.
#define GET_REFERENCES(callback_index) \
    value reference_array = *gc_root; \
    value callback = Field(reference_array, callback_index);

#define GET_HANDLE_CALLBACK(callback_index) \
    value *gc_root = uv_handle_get_data((uv_handle_t*)c_handle); \
    GET_REFERENCES(callback_index)

static void luv_exit_trampoline(
    uv_process_t *c_handle, int64_t exit_status, int term_signal)
{
    caml_acquire_runtime_system();
    GET_HANDLE_CALLBACK(LUV_GENERIC_CALLBACK);
    caml_callback2(
        callback, caml_copy_int64(exit_status), Val_int(term_signal));
    caml_release_runtime_system();
}

So, luv gets callback (a local pointer to an OCaml value, possibly in the minor heap), doesn't register it with CAMLlocal (so the GC doesn't know about it), then calls caml_copy_int64, which can trigger a GC and move the callback.

@talex5
Copy link
Collaborator

talex5 commented Mar 3, 2023

This makes it easier to reproduce it:

static void luv_exit_trampoline(
    uv_process_t *c_handle, int64_t exit_status, int term_signal)
{
    caml_acquire_runtime_system();
    GET_HANDLE_CALLBACK(LUV_GENERIC_CALLBACK);
    for (int i = 1; i < 1000; i++ ) caml_copy_int64(1);		// Trigger bug faster
    caml_callback2(
        callback, caml_copy_int64(exit_status), Val_int(term_signal));
    caml_release_runtime_system();
}

@aantron
Copy link

aantron commented Apr 6, 2023

This should now be fixed in Luv master and will be out in 0.5.12.

@talex5
Copy link
Collaborator

talex5 commented Apr 6, 2023

Thanks!

@talex5 talex5 closed this as completed Apr 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants