[ocaml5-issue] Windows failures on threadomain #203

Closed
jmid opened this issue Nov 18, 2022 · 49 comments
Labels
ocaml5-issue A potential issue in the OCaml5 compiler/runtime

Comments

@jmid
Collaborator

jmid commented Nov 18, 2022

We have also observed failures on Windows on the threadomain test combining Domains and Threads.

Here is a fresh occurrence on 5.0.0~beta1:
https://github.com/jmid/multicoretests/actions/runs/3487203320/jobs/5834552533

random seed: 152027115
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
File "src/threadomain/dune", line 15, characters 0-113:
15 | (rule
16 |  (alias runtest)
17 |  (package multicoretests)
18 |  (deps threadomain.exe)
19 |  (action (run ./%{deps} --verbose)))

(cd _build/default/src/threadomain && ./threadomain.exe --verbose)
Command exited with code -1073741819.
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)
random seed: 450754698

This exits with a weird exit code -1073741819

@jmid
Collaborator Author

jmid commented Nov 18, 2022

This was previously observed on #179 - also on the Windows 5.0.0~beta1 combination and with the same symptom of a weird exit code:
#179 (comment)

@shym
Collaborator

shym commented Nov 18, 2022

Following a discussion with @dra27, I wrote a dummy executable to trigger a segfault:

PS> .\a
PS> echo $LASTEXITCODE
-1073741819

(-1073741819 is 0xc0000005, by the way).

@dra27

dra27 commented Nov 19, 2022

(-1073741819 is 0xc0000005, by the way).

Which is indeed access violation / segfault 👍
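
The conversion above can be checked mechanically. A minimal sketch in Python, assuming only that the shell reports the 32-bit NTSTATUS value as a signed integer. The helper names `ntstatus_hex` and `describe` and the small lookup table are illustrative and not part of the test suite:

```python
# Reinterpret a signed 32-bit process exit code as the unsigned NTSTATUS
# value used in the Windows documentation.
# `ntstatus_hex`, `describe` and `KNOWN_STATUS` are hypothetical helpers.

KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",  # the "segfault" code seen here
    0xC000013A: "STATUS_CONTROL_C_EXIT",    # process terminated by Ctrl-C
}


def ntstatus_hex(exit_code: int) -> str:
    """Return the exit code as an 8-digit hexadecimal string."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


def describe(exit_code: int) -> str:
    """Name the status if it appears in the small table above."""
    return KNOWN_STATUS.get(exit_code & 0xFFFFFFFF, "unknown")


if __name__ == "__main__":
    code = -1073741819
    print(ntstatus_hex(code), describe(code))
```

Masking with `0xFFFFFFFF` is the whole trick: it maps the negative two's-complement value back to the unsigned range, so the same helper also decodes other negative exit codes reported by dune.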

@jmid
Collaborator Author

jmid commented Nov 20, 2022

Saw what looks like a deadlock and a 6 hour timeout on threadomain on Windows 5.0.0+trunk, with none of the previous tests having taken a substantial amount of time:
https://github.com/jmid/multicoretests/actions/runs/3497938069/jobs/5857667163

random seed: 528980161
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)
[ ]   52    0    0   52 /  500    60.4s Mash up of threads and domains
[ ]  106    0    0  106 /  500   120.6s Mash up of threads and domainsTerminate batch job (Y/N)? 
^CFatal error: exception User interruption
Error: The operation was canceled.

@jmid
Collaborator Author

jmid commented Nov 22, 2022

Again experienced a likely deadlock and a 6 hour timeout on threadomain, Windows, 5.0.0+trunk
https://github.com/jmid/multicoretests/actions/runs/3516210001/jobs/5892468944

random seed: 392404446
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)
[ ]   59    0    0   59 /  500    60.7s Mash up of threads and domains
[ ]  119    0    0  119 /  500   120.8s Mash up of threads and domains
[ ]  167    0    0  167 /  500   182.1s Mash up of threads and domains
[ ]  219    0    0  219 /  500   244.4s Mash up of threads and domainsTerminate batch job (Y/N)? 
^CFatal error: exception User interruption
Error: The operation was canceled.

@jmid
Collaborator Author

jmid commented Nov 23, 2022

Another likely deadlock and 6 hour timeout on threadomain, Windows, 5.0.0~beta1
https://github.com/jmid/multicoretests/actions/runs/3521871766/jobs/5904199298

random seed: 345763773
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)
[ ]   53    0    0   53 /  500    60.2s Mash up of threads and domains
[ ]  100    0    0  100 /  500   123.1s Mash up of threads and domains
[ ]  157    0    0  157 /  500   183.4s Mash up of threads and domainsTerminate batch job (Y/N)? 
^CFatal error: exception User interruption
Error: The operation was canceled.

@jmid
Collaborator Author

jmid commented Nov 28, 2022

Another timeout on Windows 5.0.0+trunk
https://github.com/jmid/multicoretests/actions/runs/3551218595/jobs/5965232850

random seed: 131534711
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)
[ ]   50    0    0   50 /  500    61.6s Mash up of threads and domainsTerminate batch job (Y/N)? 
^CFatal error: exception User interruption
Error: The operation was canceled.

@jmid
Collaborator Author

jmid commented Nov 28, 2022

A crash on Windows 5.0.0~beta2
https://github.com/jmid/multicoretests/actions/runs/3564269738/jobs/5988029211

random seed: 156968988
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
File "src/threadomain/dune", line 15, characters 0-113:
15 | (rule
16 |  (alias runtest)
17 |  (package multicoretests)
18 |  (deps threadomain.exe)
19 |  (action (run ./%{deps} --verbose)))

(cd _build/default/src/threadomain && ./threadomain.exe --verbose)
Command exited with code -1073741819.
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)

@jmid
Collaborator Author

jmid commented Nov 29, 2022

@jmid
Collaborator Author

jmid commented Nov 30, 2022

Triggered again on Windows 5.0.0~beta2
https://github.com/jmid/multicoretests/actions/runs/3575340404/jobs/6011752541

random seed: 102722974
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
File "src/threadomain/dune", line 15, characters 0-113:
15 | (rule
16 |  (alias runtest)
17 |  (package multicoretests)
18 |  (deps threadomain.exe)
19 |  (action (run ./%{deps} --verbose)))

(cd _build/default/src/threadomain && ./threadomain.exe --verbose)
Command exited with code -1073741819.
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)

@jmid
Collaborator Author

jmid commented Dec 1, 2022

Spotted again on Windows 5.0.0~beta2
https://github.com/jmid/multicoretests/actions/runs/3584338103/jobs/6030878640

random seed: 284870337
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)
[ ]   56    0    0   56 /  500    61.4s Mash up of threads and domains
[ ]  120    0    0  120 /  500   123.0s Mash up of threads and domains
[ ]  187    0    0  187 /  500   183.4s Mash up of threads and domains
[ ]  259    0    0  259 /  500   244.5s Mash up of threads and domains
[ ]  318    0    0  318 /  500   304.9s Mash up of threads and domains
File "src/threadomain/dune", line 15, characters 0-113:
15 | (rule
16 |  (alias runtest)
17 |  (package multicoretests)
18 |  (deps threadomain.exe)
19 |  (action (run ./%{deps} --verbose)))

(cd _build/default/src/threadomain && ./threadomain.exe --verbose)
Command exited with code -1073741819.
[ ]  377    0    0  377 /  500   365.0s Mash up of threads and domains

@jmid
Collaborator Author

jmid commented Dec 1, 2022

Likely deadlock on Windows 5.0.0+trunk causing 6h timeout:
https://github.com/jmid/multicoretests/actions/runs/3590624408/jobs/6044204690

random seed: 273255058
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)
[ ]   46    0    0   46 /  500    61.4s Mash up of threads and domains
Fatal error: exception User interruption
[ ]  100    0    0  100 /  500   121.7s Mash up of threads and domainsTerminate batch job (Y/N)? 
^C
Error: The operation was canceled.

@jmid
Collaborator Author

jmid commented Dec 2, 2022

Another Windows 5.0.0+trunk crash observed:
https://github.com/jmid/multicoretests/actions/runs/3600691148/jobs/6065625839

random seed: 304025706
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
File "src/threadomain/dune", line 15, characters 0-113:
15 | (rule
16 |  (alias runtest)
17 |  (package multicoretests)
18 |  (deps threadomain.exe)
19 |  (action (run ./%{deps} --verbose)))

(cd _build/default/src/threadomain && ./threadomain.exe --verbose)
Command exited with code -1073741819.
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)

@jmid
Collaborator Author

jmid commented Dec 2, 2022

Timeout again after 6h on Windows 5.0.0+trunk:
https://github.com/jmid/multicoretests/actions/runs/3602179393/jobs/6068821426

random seed: 501902873
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
^CFatal error: exception User interruption
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)Terminate batch job (Y/N)? 
Error: The operation was canceled.

@jmid
Collaborator Author

jmid commented Dec 5, 2022

Timeout after 6h on Windows 5.0.0+trunk:
https://github.com/jmid/multicoretests/actions/runs/3603978829/jobs/6072823161

random seed: 114004587
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)
[ ]   60    0    0   60 /  500    60.6s Mash up of threads and domainsTerminate batch job (Y/N)? 
^CFatal error: exception User interruption
Error: The operation was canceled.

@jmid
Collaborator Author

jmid commented Dec 5, 2022

Same test failing on Windows 5.0.0+trunk - but in this one the symptom is different:
https://github.com/jmid/multicoretests/actions/runs/3615216149/jobs/6092171452

random seed: 289823404
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
Fatal error: Fatal error during lock: Invalid argument

File "src/threadomain/dune", line 15, characters 0-113:
15 | (rule
16 |  (alias runtest)
17 |  (package multicoretests)
18 |  (deps threadomain.exe)
19 |  (action (run ./%{deps} --verbose)))

(cd _build/default/src/threadomain && ./threadomain.exe --verbose)
Command exited with code 3.
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)

@jmid
Collaborator Author

jmid commented Dec 5, 2022

Crash observed on Windows 5.0.0~beta2
https://github.com/ocaml-multicore/multicoretests/actions/runs/3618483166/jobs/6098392996

random seed: 57119427
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
File "src/threadomain/dune", line 15, characters 0-113:
15 | (rule
16 |  (alias runtest)
17 |  (package multicoretests)
18 |  (deps threadomain.exe)
19 |  (action (run ./%{deps} --verbose)))

(cd _build/default/src/threadomain && ./threadomain.exe --verbose)
Command exited with code -1073741819.
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)

@shym
Collaborator

shym commented Dec 8, 2022

For the record, deadlock observed locally (mingw64 non-debug version, on powershell) with the following backtraces for the (only!) 4 threads:

Thread 4 (Thread 5116.0x1e2c):
#0  0x00007ff90e610bb1 in ntdll!DbgBreakPoint () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x00007ff90e63cc2e in ntdll!DbgUiRemoteBreakin () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#2  0x00007ff90cd274b4 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/Windows/System32/KERNEL32.DLL
#3  0x00007ff90e5c26a1 in ntdll!RtlUserThreadStart () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#4  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 3 (Thread 5116.0xa64):
#0  0x00007ff90e60dc14 in ntdll!ZwWaitForMultipleObjects () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x00007ff90c120460 in WaitForMultipleObjectsEx () from /cygdrive/c/Windows/System32/KERNELBASE.dll
#2  0x00007ff90c12035e in WaitForMultipleObjects () from /cygdrive/c/Windows/System32/KERNELBASE.dll
#3  0x00007ff636cc9eb8 in do_sema_b_wait_intern (sema=sema@entry=0xbc, nointerrupt=nointerrupt@entry=0, timeout=timeout@entry=4294967295) at /usr/src/debug/mingw64-x86_64-winpthreads-10.0.0-1/src/cond.c:647
#4  0x00007ff636cca1f6 in do_sema_b_wait (val=0x1a86a1463f8, cs=0x1a86a1463d0, timeout=4294967295, nointerrupt=0, sema=0xbc) at /usr/src/debug/mingw64-x86_64-winpthreads-10.0.0-1/src/cond.c:606
#5  do_sema_b_wait (sema=0xbc, nointerrupt=0, timeout=4294967295, cs=0x1a86a1463d0, val=0x1a86a1463f8) at /usr/src/debug/mingw64-x86_64-winpthreads-10.0.0-1/src/cond.c:596
#6  0x00007ff636cca81c in pthread_cond_wait (c=c@entry=0x7ff636d92be0 <all_domains+32>, external_mutex=0x7ff636d92bd8 <all_domains+24>) at /usr/src/debug/mingw64-x86_64-winpthreads-10.0.0-1/src/cond.c:461
#7  0x00007ff636cbf2cd in caml_plat_wait (cond=cond@entry=0x7ff636d92be0 <all_domains+32>) at runtime/platform.c:111
#8  0x00007ff636caaa33 in backup_thread_func (v=0x7ff636d92bc0 <all_domains>) at runtime/domain.c:957
#9  0x00007ff636cccdc3 in pthread_create_wrapper (args=0x1a86a0d3cc0) at /usr/src/debug/mingw64-x86_64-winpthreads-10.0.0-1/src/thread.c:1533
#10 0x00007ff90d2faf5a in msvcrt!_beginthreadex () from /cygdrive/c/Windows/System32/msvcrt.dll
#11 0x00007ff90d2fb02c in msvcrt!_endthreadex () from /cygdrive/c/Windows/System32/msvcrt.dll
#12 0x00007ff90cd274b4 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/Windows/System32/KERNEL32.DLL
#13 0x00007ff90e5c26a1 in ntdll!RtlUserThreadStart () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#14 0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 2 (Thread 5116.0x13a0):
#0  0x00007ff90e60d744 in ntdll!ZwDelayExecution () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x00007ff90c11b03e in SleepEx () from /cygdrive/c/Windows/System32/KERNELBASE.dll
#2  0x00007ff636c95af0 in st_msleep (msec=50) at /cygdrive/c/opam/5b2/.opam-switch/build/ocaml-variants.5.0.0~beta2/otherlibs/systhreads/st_win32.h:24
#3  caml_thread_tick (arg=<optimized out>) at /cygdrive/c/opam/5b2/.opam-switch/build/ocaml-variants.5.0.0~beta2/otherlibs/systhreads/st_pthreads.h:300
#4  0x00007ff636cccdc3 in pthread_create_wrapper (args=0x1a87a392960) at /usr/src/debug/mingw64-x86_64-winpthreads-10.0.0-1/src/thread.c:1533
#5  0x00007ff90d2faf5a in msvcrt!_beginthreadex () from /cygdrive/c/Windows/System32/msvcrt.dll
#6  0x00007ff90d2fb02c in msvcrt!_endthreadex () from /cygdrive/c/Windows/System32/msvcrt.dll
#7  0x00007ff90cd274b4 in KERNEL32!BaseThreadInitThunk () from /cygdrive/c/Windows/System32/KERNEL32.DLL
#8  0x00007ff90e5c26a1 in ntdll!RtlUserThreadStart () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#9  0x0000000000000000 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

Thread 1 (Thread 5116.0x1e60):
#0  0x00007ff90e60d1c4 in ntdll!ZwWriteFile () from /cygdrive/c/Windows/SYSTEM32/ntdll.dll
#1  0x00007ff90c105136 in WriteFile () from /cygdrive/c/Windows/System32/KERNELBASE.dll
#2  0x00007ff90d2e0007 in msvcrt!_write () from /cygdrive/c/Windows/System32/msvcrt.dll
#3  0x00007ff90d2dfb67 in msvcrt!_write () from /cygdrive/c/Windows/System32/msvcrt.dll
#4  0x00007ff636cc5d1a in caml_write_fd (fd=1, flags=<optimized out>, buf=buf@entry=0x1a86a0902ac, n=75) at runtime/win32.c:123
#5  0x00007ff636cb6795 in caml_flush_partial (channel=channel@entry=0x1a86a090260) at runtime/io.c:248
#6  0x00007ff636cb7548 in caml_flush (channel=<optimized out>) at runtime/io.c:263
#7  caml_ml_flush (vchannel=<optimized out>) at runtime/io.c:763
#8  0x00007ff636cc9487 in caml_c_call ()
#9  0x000001a87a1c7f90 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)

@jmid
Collaborator Author

jmid commented Dec 8, 2022

Thanks! The fewer threads the better! 😅

  • Thread 3 looks like a backup thread (backup_thread_func)
  • Thread 2 looks like a tick thread (caml_thread_tick)
  • Thread 1 looks like it is printing/flushing

I suspect Thread 4 is starting up a new pthread?

  • Is it possible to access and inspect caml_state through gdb?
  • Could this be related to the race reported in #11800 on ocaml/ocaml?

@shym
Collaborator

shym commented Dec 8, 2022

I think I understood from an analysis by @dra27 of another set of backtraces that thread 4 is just gdb.
Putting myself in the context of thread 1, I unfortunately get:

(gdb) p caml_state
Missing COFF symbol "caml_state".

but maybe there’s a smart way to know where it is located?

@shym
Collaborator

shym commented Dec 8, 2022

#11800 is supposedly about the debug runtime. This trace is in the normal (not debug) runtime.

@shym
Collaborator

shym commented Dec 8, 2022

Thread 1 is trying to display:

0x1a86a0902ac:  "\033[2K\r[ ]  181    0    0  181 /  500   120.5s Mash up of threads and domains (generating)"

As I don’t understand just how such a write could block, I wonder how ANSI sequences such as "\033[2K" are actually dealt with here.

@jmid
Collaborator Author

jmid commented Dec 8, 2022

#11800 is supposedly about the debug runtime. This trace is in the normal (not debug) runtime.

AFAIU, the debug runtime is the same runtime - just running with additional assertions enabled.
So I was wondering whether it could be the same bug showing up with different symptoms - assert failure under debug runtime (catching the problem earlier) - or deadlock under regular runtime 🤷

@shym
Collaborator

shym commented Dec 8, 2022

This should be the caml_state of the only domain running at that moment.

(gdb) p *all_domains[0].state
$12 = {young_limit = 18446744073709551615, young_ptr = 0x1a86a34f7a8, young_start = 0x1a86a150000, young_end = 0x1a86a350000,
  current_stack = 0x1a86a037550, exn_handler = 0x1a86a03f370, action_pending = 0, c_stack = 0x90bd3ff920, stack_cache = 0x1a86a14ef30,
  gc_regs_buckets = 0x1a86a03f8c0, gc_regs = 0x1a86a03f8c0, minor_tables = 0x1a86a14eda0, mark_stack = 0x1a86a147ab0, marking_done = 0,
  sweeping_done = 0, allocated_words = 0, swept_words = 0, major_work_computed = 0, major_work_todo = 0, major_gc_clock = 0,
  local_roots = 0x90bd3ff8a0, ephe_info = 0x1a86a14eed0, final_info = 0x1a86a14ee50, backtrace_pos = 0, backtrace_active = 0,
  backtrace_buffer = 0x0, backtrace_last_exn = 1, compare_unordered = 0, oo_next_id_local = 21, requested_major_slice = 0,
  requested_minor_gc = 0, requested_external_interrupt = 1, parser_trace = 0, minor_heap_wsz = 262144, shared_heap = 0x1a86a0270d0, id = 0,
  unique_id = 0, pools_to_rescan = 0x1a86a14ef00, pools_to_rescan_len = 4, pools_to_rescan_count = 0, dls_root = 1823114657176,
  extra_heap_resources = 0, extra_heap_resources_minor = 0, dependent_size = 0, dependent_allocated = 0, extern_state = 0x0,
  intern_state = 0x0, stat_minor_words = 3228687, stat_promoted_words = 1804365, stat_major_words = 1804389, stat_minor_collections = 1850,
  stat_forced_major_collections = 0, stat_blocks_marked = 1158945, inside_stw_handler = 0, trap_sp_off = 0, trap_barrier_off = 0,
  trap_barrier_block = 0, external_raise = 0x0, extra_params = {1822847990936, 1822847990952, 140695458153832, 0 <repeats 61 times>}}

young_limit is 0xffffffffffffffff ???
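
A few fields of the dump can be cross-checked with plain arithmetic. The sketch below, assuming a 64-bit build with 8-byte words, confirms that the printed `young_limit` is the all-ones 64-bit value and that the minor heap bounds agree with `minor_heap_wsz`. The variable names mirror the gdb output, the two helper functions are hypothetical, and nothing here reads a live process:

```python
# Values copied from the gdb dump of all_domains[0].state above.
# Assumption: 64-bit build, so one OCaml word is 8 bytes.
WORD_BYTES = 8

young_limit = 18446744073709551615
young_start = 0x1A86A150000
young_end = 0x1A86A350000
young_ptr = 0x1A86A34F7A8
minor_heap_wsz = 262144  # size of the minor heap, in words


def is_all_ones_64(value: int) -> bool:
    """True if value is 0xffffffffffffffff, i.e. (uint64_t)-1."""
    return value == 2**64 - 1


def minor_heap_bytes(start: int, end: int) -> int:
    """Size in bytes of the minor heap delimited by start and end."""
    return end - start


if __name__ == "__main__":
    print("young_limit all ones:", is_all_ones_64(young_limit))
    print("minor heap size matches:",
          minor_heap_bytes(young_start, young_end)
          == minor_heap_wsz * WORD_BYTES)
    print("young_ptr inside heap:", young_start <= young_ptr <= young_end)
```

The heap bounds and pointer are therefore self-consistent, and only `young_limit` stands out. An all-ones limit may simply be a way of forcing the next allocation check onto the slow path (the dump also shows `requested_external_interrupt = 1`), though the thread does not confirm this.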

@jmid
Collaborator Author

jmid commented Dec 20, 2022

Timeout observed again on Windows 5.0.1+trunk
https://github.com/ocaml-multicore/multicoretests/actions/runs/3735075361/jobs/6337885090

random seed: 368379199
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)
[ ]   53    0    0   53 /  500    60.3s Mash up of threads and domainsTerminate batch job (Y/N)? 
^CFatal error: exception User interruption
Error: The operation was canceled.

@jmid
Collaborator Author

jmid commented Dec 21, 2022

Another timeout on Windows 5.0.1+trunk:
https://github.com/ocaml-multicore/multicoretests/actions/runs/3743623784/jobs/6356003585

random seed: 277335132
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)Terminate batch job (Y/N)? 
^CFatal error: exception User interruption
Error: The operation was canceled.

@shym
Collaborator

shym commented Dec 22, 2022

A timeout in spawntree on Windows 5.0.0:
https://github.com/ocaml-multicore/multicoretests/actions/runs/3751703370/jobs/6373017120

random seed: 241483421
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic
[ ]    0    0    0    0 /  100     0.0s domain_spawntree - with Atomic (generating)
Fatal error: exception User interruption
[ ]    7    0    0    7 /  100    65.7s domain_spawntree - with AtomicTerminate batch job (Y/N)? 
^C

I suspect it might be the same cause, as many test runs lead me to think that the failures in threadomain come from stressing domains rather than from interactions with threads.

@jmid
Collaborator Author

jmid commented Dec 22, 2022

Seeing a crash on Threadomain with Windows trunk:
https://github.com/ocaml-multicore/multicoretests/actions/runs/3760352114/jobs/6390963989

random seed: 152292718
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
File "src/threadomain/dune", line 4, characters 7-18:
4 |  (name threadomain)
           ^^^^^^^^^^^
(cd _build/default/src/threadomain && ./threadomain.exe --verbose)
Command exited with code -1073741819.
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)

@shym
Collaborator

shym commented Jan 23, 2023

Seen a segfault in that test in Windows bytecode version 5.0.0 + PR11846 😵‍💫
https://github.com/shym/multicoretests/actions/runs/3967786934/jobs/6800154168#step:12:55

random seed: 177187292
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
File "src/threadomain/dune", line 4, characters 7-18:
4 |  (name threadomain)
           ^^^^^^^^^^^
(cd _build/default/src/threadomain && ./threadomain.exe --verbose)
Command exited with code -1073741819.
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)

@jmid
Collaborator Author

jmid commented Sep 7, 2023

Timeout seen again on MinGW bytecode 5.1 yesterday when merging #390 to main:
https://github.com/ocaml-multicore/multicoretests/actions/runs/6093487147/job/16533243721

random seed: 294174862
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)
[ ]   22    0    0   22 /  500    61.6s Mash up of threads and domains
[ ]   47    0    0   47 /  500   123.2s Mash up of threads and domains
[ ]   74    0    0   74 /  500   184.1s Mash up of threads and domains
[ ]  101    0    0  101 /  500   246.0s Mash up of threads and domains
[ ]  130    0    0  130 /  500   308.0s Mash up of threads and domains
[ ]  157    0    0  157 /  500   370.0s Mash up of threads and domains
[ ]  179    0    0  179 /  500   431.6s Mash up of threads and domains
[ ]  211    0    0  211 /  500   492.8s Mash up of threads and domains
[ ]  242    0    0  242 /  500   554.1s Mash up of threads and domains
Fatal error: exception User interruption
[ ]  264    0    0  264 /  500   614.1s Mash up of threads and domainsTerminate batch job (Y/N)? 
^C
Error: The operation was canceled.

@jmid
Collaborator Author

jmid commented Oct 12, 2023

This triggered twice on the 0.3 branch on both Win bytecode 5.1 and trunk/5.2

5.1 https://github.com/ocaml-multicore/multicoretests/actions/runs/6481561494/job/17599327572

random seed: 48067126
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)
[ ]   26    0    0   26 /  500    60.8s Mash up of threads and domainsTerminate batch job (Y/N)? 
^CFatal error: exception User interruption
Error: The operation was canceled.

trunk/5.2 https://github.com/ocaml-multicore/multicoretests/actions/runs/6481561482/job/17599327577

random seed: 509542723
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)Terminate batch job (Y/N)? 
^CFile "src/threadomain/dune", line 4, characters 7-18:
4 |  (name threadomain)
           ^^^^^^^^^^^
(cd _build/default/src/threadomain && ./threadomain.exe --verbose)
Command exited with code -1073741510.
Fatal error: exception User interruption
Error: The operation was canceled.

@jmid
Collaborator Author

jmid commented Oct 12, 2023

The merge of 0.3 to main triggered a timeout again on Win bytecode trunk:
https://github.com/ocaml-multicore/multicoretests/actions/runs/6483174713/job/17604168009

random seed: 530080289
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)
[ ]   27    0    0   27 /  500    62.0s Mash up of threads and domains
[ ]   52    0    0   52 /  500   122.2s Mash up of threads and domains
[ ]   80    0    0   80 /  500   183.1s Mash up of threads and domains
[ ]  106    0    0  106 /  500   243.4s Mash up of threads and domains
[ ]  135    0    0  135 /  500   304.1s Mash up of threads and domains
[ ]  160    0    0  160 /  500   369.7s Mash up of threads and domains
[ ]  184    0    0  184 /  500   430.0s Mash up of threads and domains
[ ]  208    0    0  208 /  500   490.2s Mash up of threads and domains
[ ]  235    0    0  235 /  500   550.8s Mash up of threads and domains
[ ]  262    0    0  262 /  500   611.5s Mash up of threads and domainsTerminate batch job (Y/N)? 
^CFatal error: exception User interruption
Error: The operation was canceled.

@shym
Collaborator

shym commented Oct 25, 2023

On the branch restoring support for MSVC (e55a77a30699ef15ce23bc2d53d9500d40b0aa8c), livelock with the bytecode version:

random seed: 442357763
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)
...
[ ]   77    0    0   77 /  500   687.5s Mash up of threads and domains
[ ]   82    0    0   82 /  500   748.8s Mash up of threads and domains
Error: The operation was canceled.

(with more than 3 hours between the last two lines) and with the native version:

random seed: 339093797
generated error fail pass / total     time test name

[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains
[ ]    0    0    0    0 /  500     0.0s Mash up of threads and domains (generating)
[ ]   76    0    0   76 /  500    60.1s Mash up of threads and domains
[ ]  172    0    0  172 /  500   121.1s Mash up of threads and domains
^C
[ ]  256    0    0  256 /  500   181.2s Mash up of threads and domains
Error: The operation was canceled.

(again with more than 3 hours between the last two lines). Also seen with seed 150874017.

Logs:

@jmid
Collaborator Author

jmid commented Mar 12, 2024

Closing this long-standing issue as this has been fixed in ocaml/ocaml#12882 🎉
More details are available in ocaml/ocaml#12230 (comment)
