Occasional segfaults when running with @threads #44460
Comments
Running under an external … It looks like this is an … rr trace of the crash on nightly: https://www.dropbox.com/s/ln805f9tfq3tsau/julia-rr-crash-nightly.tar.zst?dl=0
From looking at the state of the program when reproducing #44460, there appeared to be a possibility that they could race and the data field might already be destroyed before we reached the close callback. This is because uv_return_spawn set the handle to NULL, which can later cause uvfinalize to exit early (if the finalizer gets run on another thread, since we have disabled finalizers on our thread). The GC can then reap the Julia Process object before uv_close has cleaned up the object. We solve this by calling disassociate_julia_struct before dropping the reference to the handle, and we fully address any remaining race condition by having uvfinalize acquire a lock as well. The uv_return_spawn callback also needs to be synchronized with the constructor, since we might arrive there before we have finished allocating the Process struct, leading to missed exit events. Fixes #44460 (cherry picked from commit c591bf2)
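To illustrate the shape of the fix, here is a minimal sketch of the locking pattern described above; the names (`MyProc`, `close_cb!`, `finalize_proc!`) are hypothetical stand-ins, not Julia's actual internals.

```julia
# Minimal sketch of the synchronization pattern: the close callback and the
# finalizer take the same lock, so neither can observe the other's half-finished
# teardown. Names are hypothetical, not Base internals.
mutable struct MyProc
    handle::Ptr{Cvoid}        # stands in for the libuv handle
    lock::ReentrantLock
    MyProc(h) = new(h, ReentrantLock())
end

# Close callback, run by the event loop once the handle is being torn down.
function close_cb!(p::MyProc)
    lock(p.lock) do
        # Detach the Julia object from the handle before dropping the reference,
        # so a finalizer running concurrently cannot see a half-destroyed state.
        p.handle = C_NULL
    end
end

# Finalizer, which may run on another thread; it takes the same lock so it
# cannot race with close_cb! and free resources the callback still needs.
function finalize_proc!(p::MyProc)
    lock(p.lock) do
        if p.handle != C_NULL
            # ... release the handle here ...
            p.handle = C_NULL
        end
    end
end
```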
I also get a crash in my application when pushing into a local array variable within Threads.@threads.
This is what I get with valgrind on my application:
This is what I get without valgrind on my application with julia 1.9-dev:
With Julia 1.8-beta3
Could you be a bit more explicit (perhaps with a code snippet)? As you describe it, that sounds like a data race.
The crash involves a local variable only accessed from a single thread. The snippet below illustrates what my application does, but Julia never crashes with the snippet (nor does valgrind report any issue).
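The referenced snippet did not survive this copy; as a hedged reconstruction of the pattern described (pushing into a local array inside a `Threads.@threads` loop, with hypothetical names), it might look like:

```julia
# Hedged reconstruction of the described pattern, not the commenter's exact code:
# each iteration pushes into a vector that is local to that iteration.
using Base.Threads

function build_results(n)
    results = Vector{Vector{Int}}(undef, n)
    @threads for t in 1:n
        acc = Int[]                 # local array, touched by one thread only
        for i in 1:100_000
            push!(acc, i * t)
        end
        results[t] = acc
    end
    return results
end

build_results(nthreads())
```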
I uploaded the result of running #44460 (comment) with julia 1.9-dev with --bug-
It looks like you need to run
Yeah, the example snippet is not really the best because it generates a lot of data in rr and doesn't always crash quickly. Something I found that sometimes persuades it to crash sooner is clearing all caches and buffers in Linux before starting a recording, but it's not really foolproof.
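For reference, "clearing all caches and buffers" presumably refers to dropping the Linux page cache; a minimal sketch (assumes Linux and root privileges) is:

```julia
# Drop filesystem caches before starting the rr recording (Linux only, needs root).
# This shells out to the standard /proc interface.
run(`sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"`)
```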
@dpinol there doesn't seem to be anything remarkable in your trace that I can see. The process runs for a while (about 4138 iterations of the loop happen), and then something external terminates it with a SIGKILL (9).
@dpinol I'm not able to reproduce the crash running with the latest Julia. Based on the description provided by vtjnash, it's possible that my example is crashing for you simply by running out of memory.
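One way to check the out-of-memory hypothesis (a suggestion, not something from the original thread) is to log free memory while the reproducer runs, so an OOM kill can be correlated with memory exhaustion:

```julia
# Periodically report free memory; Sys.free_memory() returns bytes, so a steady
# decline toward zero before the SIGKILL points at the OOM killer.
@async while true
    @info "free memory (GiB)" round(Sys.free_memory() / 2^30; digits=2)
    sleep(5)
end
```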
Yes, I agree that in this case it runs out of memory. I created a different MWE at #45196. Thanks.
This is mostly a condensed repost of #44019. I was told that julia 1.7.1 was a "very old release to be relying on threading," but the issue still occurs on the latest nightly.
Running the following code will sometimes segfault. It is more common with 32 threads, but will sometimes occur with 16 or fewer threads.
Current Code
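The collapsed code block was not preserved in this copy. As a rough stand-in only (not the reporter's actual program), the reported pattern is an allocation-heavy `@threads` loop run with many threads, for example:

```julia
# Hypothetical stand-in for the missing reproducer: an allocation-heavy
# Threads.@threads loop of the kind the report describes (run with e.g. 16-32 threads).
using Base.Threads

function churn(iters)
    out = Vector{Vector{Float64}}(undef, iters)
    @threads for i in 1:iters
        out[i] = [rand() for _ in 1:1_000]   # many small allocations per task
    end
    return sum(length, out)
end

churn(10_000)
```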
Output of `versioninfo()`
Stack trace from the one time in many that the program managed to print one before dying
The line that originates in my code (line 65 at the top) is just the `@threads` loop.
A valgrind report from the previous issue
Taken from this comment
An rr trace of the bug under Julia 1.7.1 can be found here. However, I have not been able to catch the program crashing under `rr` with the nightly build: trying to use BugReporting.jl will simply complete the program successfully, while running `julia` itself under `rr` has not managed to produce any useful results yet (it has been running for over 24 hours at this point with no apparent progress).
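For completeness, the usual way to capture such a trace is Julia's `--bug-report=rr` flag, which drives BugReporting.jl; the script name and thread count below are placeholders:

```julia
# Launch the reproducer under rr via the --bug-report flag.
# Base.julia_cmd() reuses the current julia binary and its default flags.
run(`$(Base.julia_cmd()) --threads=32 --bug-report=rr repro.jl`)
```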