Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Distributed] error in running finalizer: ConcurrencyViolationError(msg="lock must be held") (any_gc_flag) #42126

Closed
krynju opened this issue Sep 5, 2021 · 11 comments · Fixed by #42240
Labels
bug Indicates an unexpected problem or unintended behavior

Comments

@krynju
Copy link
Contributor

krynju commented Sep 5, 2021

This is a concurrency violation on this condition

const any_gc_flag = Condition()

Happens when running tests on krynju/Dagger.jl@a5b5267 (DTable groupby tests with Dagger on processes)

Windows, Julia master

Seems like the notify needs to be wrapped in a while->trylock loop (because it's running in finalizer context), since the lock on that condition doesn't seem to be obtained anywhere and the docs say that the Condition structure is not threadsafe

Will look more into that and open a PR sometime next week

From worker 2:    error in running finalizer: ConcurrencyViolationError(msg="lock must be held")
From worker 2:    concurrency_violation at .\condition.jl:8
From worker 2:    assert_havelock at .\condition.jl:25 [inlined]
From worker 2:    assert_havelock at .\condition.jl:48 [inlined]
From worker 2:    assert_havelock at .\condition.jl:72 [inlined]
From worker 2:    notify at .\condition.jl:132
From worker 2:    #notify#564 at .\condition.jl:130 [inlined]
From worker 2:    notify at .\condition.jl:130 [inlined]
From worker 2:    notify at .\condition.jl:130 [inlined]
From worker 2:    process_worker at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:288 [inlined] 
From worker 2:    send_del_client_no_lock at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:280  
From worker 2:    finalize_ref at C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.8\Distributed\src\remotecall.jl:94
From worker 2:    jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1771 [inlined]
From worker 2:    run_finalizer at /cygdrive/c/buildbot/worker/package_win64/build/src\gc.c:278
From worker 2:    jl_gc_run_finalizers_in_list at /cygdrive/c/buildbot/worker/package_win64/build/src\gc.c:365
From worker 2:    run_finalizers at /cygdrive/c/buildbot/worker/package_win64/build/src\gc.c:394 [inlined]
From worker 2:    run_finalizers at /cygdrive/c/buildbot/worker/package_win64/build/src\gc.c:372
From worker 2:    jl_gc_run_pending_finalizers at /cygdrive/c/buildbot/worker/package_win64/build/src\gc.c:405
From worker 2:    jl_mutex_unlock at /cygdrive/c/buildbot/worker/package_win64/build/src\locks.h:133 [inlined]
From worker 2:    jl_generate_fptr at /cygdrive/c/buildbot/worker/package_win64/build/src\jitlayers.cpp:357
From worker 2:    jl_compile_method_internal at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:1978
From worker 2:    jl_compile_method_internal at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:1932 [inlined]
From worker 2:    _jl_invoke at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2237 [inlined]
From worker 2:    jl_apply_generic at /cygdrive/c/buildbot/worker/package_win64/build/src\gf.c:2427
From worker 2:    jl_apply at /cygdrive/c/buildbot/worker/package_win64/build/src\julia.h:1771 [inlined]
From worker 2:    start_task at /cygdrive/c/buildbot/worker/package_win64/build/src\task.c:881
@jpsamaroo
Copy link
Member

We've tried fixing this before in #38405 and other earlier PRs, and always ended up reverting them due to hangs during testing. @vchuravy has looked into this again somewhat recently, so he may have an idea of why the most recent PR broke things, and what we need to do to fix it.

@krynju
Copy link
Contributor Author

krynju commented Sep 5, 2021

So I assume this is more complicated than just dirty fixing this notify with an active waiting loop on the lock.
I'll try that later, but I suppose it's just an entry to a rabbit hole of other errors down the line

@vchuravy
Copy link
Member

Can you give #42240 a try?

@Sacha0
Copy link
Member

Sacha0 commented Sep 13, 2021

Assuming this issue is reproducible on 1.7 (@krynju?), perhaps we should mark this issue milestone-1.7, as a representative for the Distributed.jl threading issues that have been reported in various places, and have been touched on in the related series of now-reverted PRs, but have fallen by the wayside?

@Sacha0
Copy link
Member

Sacha0 commented Sep 13, 2021

Ref. related earlier tracking issue, which should perhaps also receive the milestone-1.7 label? JuliaLang/Distributed.jl#73

@krynju
Copy link
Contributor Author

krynju commented Sep 13, 2021

@vchuravy Not getting the error now, thanks!
I also ran some other checks and everything seems to work just fine.

@Sacha0 I initially found it on master, but it most likely appears in 1.7 as well. I can check later if needed

@Sacha0 Sacha0 added this to the 1.7 milestone Sep 15, 2021
@Sacha0 Sacha0 added the bug Indicates an unexpected problem or unintended behavior label Sep 15, 2021
@KristofferC
Copy link
Member

So is this a regression vs 1.6? If that is the case, it would be helpful if someone could bisect the offending commit.

@vchuravy
Copy link
Member

Not an regression, just a side-effect of people running things with multiple threads and Distributed.jl not being threadsafe.

@KristofferC
Copy link
Member

I don't think this should be on the milestone then which blocks 1.7 from being released until it is fixed.

@Sacha0
Copy link
Member

Sacha0 commented Sep 15, 2021

Though perhaps not a regression in some sense, folks are hitting this issue on 1.7 where they were not hitting it on 1.6. My best guess is that flipping the task migration bit effectively elevated the severity of this issue.

@vchuravy vchuravy removed this from the 1.7 milestone Sep 16, 2021
@vchuravy
Copy link
Member

Should not block 1.7, but backporting it to 1.7.1 would be good.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants