Skip to content

Conversation

ghost
Copy link

@ghost ghost commented Nov 6, 2019

As far as I can tell, this doesn't much affect parallel performance, and
prevents this exception from triggering if you attempt to define new
functions from multiple threads:

ERROR: TaskFailedException:
cannot eval a new struct type definition while defining another type

I'm opening this as a draft PR to provide an easier short-term fix for #33183, one that can be easily backported to v1.3 (as an alternative to #33553)

EDIT: Oops, dammit, I used the wrong github account again... >.<

@NHDaly
Copy link
Member

NHDaly commented Nov 6, 2019

Oh, I probably should've tried running the tests earlier...:

Test Failed at /buildworker/worker/tester_linux64/build/share/julia/test/channels.jl:313
  Expression: fetch(errstream) == "error in running finalizer: ErrorException(\"task switch not allowed from inside gc finalizer\")\nerror in running finalizer: ErrorException(\"task switch not allowed from inside gc finalizer\")\nerror in running finalizer: ErrorException(\"task switch not allowed from inside gc finalizer\")\n"
   Evaluated: "" == "error in running finalizer: ErrorException(\"task switch not allowed from inside gc finalizer\")\nerror in running finalizer: ErrorException(\"task switch not allowed from inside gc finalizer\")\nerror in running finalizer: ErrorException(\"task switch not allowed from inside gc finalizer\")\n"

I guess this needs to be a spinlock, then.. i'll give that a shot.
EDIT: Oh, wait, jk, i think JL_LOCK is a spinlock? :/ this might be beyond my current knowledge.. will do a bit more reading 😅

@NHDaly
Copy link
Member

NHDaly commented Nov 7, 2019

Debugging Notes:

  • Ah, the test failures above are caused by me putting the JL_UNLOCKs in the wrong scope. I put them inside the JL_TRY/JL_CATCH which reset some global state when they exit, so the state set by JL_LOCK was never fully reset. This would be fixed by just pulling the JL_UNLOCK up to the top scope.

    • I didn't do this originally b/c i was worried about the jl_rethrow() returning without unlocking, but it turns out the JL_LOCK infrastructure is set up to automatically release the lock when throwing an exception.
  • There is also a concern that this is the wrong kind of lock. Those mutexes are meant to be very internal, and disable all sorts of things, such as disallowing executing arbitrary julia, which this code might be doing. Still need to consider what would be better.

  • Finally, the codegen lock might not be the right lock for this... it's not codegen. I originally was wary of adding another lock due to potential deadlocks, but it might be okay.


I've just pushed up two commits that:

  1. Fix the JL_UNLOCK to be in the right scope
  2. Make a "fake" toplevel_lock (currently aliased to codegen lock) since this really should be a higher-level.

@NHDaly
Copy link
Member

NHDaly commented Nov 8, 2019

Cool, so that does seem to have fixed (most of) the test failures. :)
Now windows is failing, with:

Error in testset REPL:
Error During Test at D:\buildbot\worker-tabularasa\tester_win32\build\share\julia\test\testdefs.jl:22
  Got exception outside of a @test
  LoadError: LoadError: "hard kill repl test"

Which from what i can tell is a hard timeout on the test. So it's very possible it's stuck, maybe in a deadlock. :( sigh...


The other thing I learned from offline conversations yesterday that I didn't write up here yet, is that this was never supposed to work. The multithreading released in julia v1.3 makes no promises about supporting compilation or code generation in parallel, which means it doesn't support evaling new functions in parallel from multiple threads. I wasn't aware of this before filing #33183, and we should probably find a good spot to call this out in the v1.3 documentation.

@KristofferC
Copy link
Member

The test error was reported somewhere else so should be unrelated.

@NHDaly
Copy link
Member

NHDaly commented Nov 8, 2019

The test error was reported somewhere else so should be unrelated.

Ah, thanks good to know. :) In that case, I think it would be really nice to get this, or something like it, merged. I know that v1.3 isn't targeting parallel compilation support, but ideally it would at least work slowly (ie lock), rather than not work at all. Thanks for your help yesterday! :)

@ghost ghost marked this pull request as ready for review November 11, 2019 23:10
As far as I can tell, this doesn't much affect parallel performance, and
prevents this exception from triggering if you attempt to define new
functions from multiple threads:

```
ERROR: TaskFailedException:
cannot eval a new struct type definition while defining another type
```
@NHDaly NHDaly force-pushed the nhdaly-interpreter-typedef-lock branch from 30eb908 to 0830f8a Compare November 11, 2019 23:14
@NHDaly
Copy link
Member

NHDaly commented Nov 13, 2019

Rebased to give the tests another shot, but got the same failures. I'm not sure if my code is related.

After talking more with my team about it, the feeling is that this is an important PR to merge. We recognize that 1.3 won't be able to eval or compile in parallel, but it should at least be thread-safe to attempt to do so. We just ask that julia locks around these sections rather than throwing an error.

As one example of work this frustrates, this prevents us from running unrelated test files in parallel because the include invokes eval and this error can occur.

@JeffBezanson
Copy link
Member

Replaced by #33553.

@NHDaly NHDaly deleted the nhdaly-interpreter-typedef-lock branch May 3, 2020 03:58
@NHDaly NHDaly assigned NHDaly and unassigned NHDaly Dec 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants