threads: move safepoint into loop #41411

Merged
merged 1 commit into from
Jun 30, 2021


vtjnash
Member

@vtjnash vtjnash commented Jun 29, 2021

@PallHaraldsson, can you see if this fixes #41407 for you? The stacktrace it printed on interrupt suggests this was the problem.

vtjnash added the labels multithreading (Base.Threads and related functionality), GC (Garbage collector), bugfix (This change fixes an existing bug), and backport 1.7 on Jun 29, 2021
@PallHaraldsson
Contributor

@vtjnash I can check once this is merged and a (nightly) binary is available. It should be merged first, right, since Jeff approved?

@vtjnash
Member Author

vtjnash commented Jun 30, 2021

Yes, note that you can download a binary from any CI run, master or PR. The download URL will be in the first steps of the testing bots.

@PallHaraldsson
Contributor

PallHaraldsson commented Jun 30, 2021

@vtjnash, I'm not sure whether your commit made it slower, or whether one of the many other commits in that range did, I guess?

It's definitely slower (also with -O3), though it may have fixed the hang (I can't reproduce the hang on either commit):

$ hyperfine '~/julia-1.8-DEV-a9412439c3/bin/julia --startup-file=no -t4 -O2 fasta.jl 25000000 > /dev/null'
Benchmark #1: ~/julia-1.8-DEV-a9412439c3/bin/julia --startup-file=no -t4 -O2 fasta.jl 25000000 > /dev/null
  Time (mean ± σ):      4.625 s ±  0.094 s    [User: 10.357 s, System: 0.918 s]
  Range (min … max):    4.457 s …  4.735 s    10 runs
 
vs.

$ hyperfine '~/julia-1.8-DEV-7553ca13cc/bin/julia --startup-file=no -t4 -O2 fasta.jl 25000000 > /dev/null'
Benchmark #1: ~/julia-1.8-DEV-7553ca13cc/bin/julia --startup-file=no -t4 -O2 fasta.jl 25000000 > /dev/null
  Time (mean ± σ):      3.193 s ±  0.113 s    [User: 8.701 s, System: 0.911 s]
  Range (min … max):    3.039 s …  3.371 s    10 runs



$ hyperfine '~/julia-1.8-DEV-a9412439c3/bin/julia -p1 -e ""'
Benchmark #1: ~/julia-1.8-DEV-a9412439c3/bin/julia -p1 -e ""
  Time (mean ± σ):     10.977 s ±  0.113 s    [User: 17.644 s, System: 1.247 s]
  Range (min … max):   10.841 s … 11.217 s    10 runs
 
$ hyperfine '/home/pharaldsson_sym/julia-1.7-DEV-f2ea26d1a1/bin/julia -p1 -e ""'
Benchmark #1: /home/pharaldsson_sym/julia-1.7-DEV-f2ea26d1a1/bin/julia -p1 -e ""
  Time (mean ± σ):      8.193 s ±  0.375 s    [User: 13.235 s, System: 1.253 s]
  Range (min … max):    7.938 s …  9.007 s    10 runs


and this is slower:

$ hyperfine '~/julia-1.8-DEV-7553ca13cc/bin/julia --startup-file=no -t4 -O2 fasta_wo_io.jl 25000000 > /dev/null'
Benchmark #1: ~/julia-1.8-DEV-7553ca13cc/bin/julia --startup-file=no -t4 -O2 fasta_wo_io.jl 25000000 > /dev/null
  Time (mean ± σ):     827.8 ms ±  97.1 ms    [User: 1.790 s, System: 0.550 s]
  Range (min … max):   680.6 ms … 975.7 ms    10 runs

$ hyperfine '~/julia-1.8-DEV-7553ca13cc/bin/julia --startup-file=no -t4 -O2 fasta_wo_io.jl 25000000 > /dev/null'
Benchmark #1: ~/julia-1.8-DEV-7553ca13cc/bin/julia --startup-file=no -t4 -O2 fasta_wo_io.jl 25000000 > /dev/null
  Time (mean ± σ):     762.7 ms ±  98.9 ms    [User: 1.702 s, System: 0.530 s]
  Range (min … max):   623.8 ms … 918.2 ms    10 runs
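
To put the mean times reported above in relative terms, here is a minimal sketch. The numbers are copied from the hyperfine output in this comment; the list layout and the helper name `slowdown_percent` are illustrative assumptions, not something from the thread. Note that the second pair compares a 1.8-DEV build against a 1.7-DEV build, so that difference may include changes unrelated to this PR.

```python
# Mean wall-clock times in seconds, copied from the hyperfine output above.
# Each entry: (label, mean on build a9412439c3, mean on the comparison build).
MEANS = [
    ("fasta.jl -t4 (vs 7553ca13cc)", 4.625, 3.193),
    ('julia -p1 -e "" (vs 1.7-DEV f2ea26d1a1)', 10.977, 8.193),
]


def slowdown_percent(new_mean: float, old_mean: float) -> float:
    """Return how much slower new_mean is than old_mean, in percent."""
    return (new_mean / old_mean - 1.0) * 100.0


for label, new_mean, old_mean in MEANS:
    print(f"{label}: {slowdown_percent(new_mean, old_mean):.1f}% slower")
```

With these inputs the differences come out at roughly 45% and 34%, both well outside the standard deviations hyperfine reported for the individual runs.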

Thanks, you mean at:
https://build.julialang.org/#/builders/69/builds/777

[edited]

Found it: https://julialangnightlies-s3.julialang.org/assert_pretesting/linux/x64/1.8/julia-a9412439c3-linux64.tar.gz

@PallHaraldsson
Contributor

PallHaraldsson commented Jun 30, 2021

A.
I can't debug your "move safepoint into loop" code. I just note that the safepoint is not strictly inside the loop but just ahead of it, and I see the "goto retry;", so it is in some loop.

I don't know enough about this to tell whether this small change or some other one is responsible for the slowdown. If that's the price to pay to get rid of a hang, then I guess it's better, until an even better solution comes along. How confident are you that it's correct now, and that it was incorrect before?

B.
I would really like to replicate the hangs I got repeatedly before, and I'm frustrated that I can't. There was a hanging julia process with 400% CPU load that I killed, and that didn't help. I'm not sure: should the hang be more likely or less likely under higher load? Could this bug have been in Julia for a long time, and e.g. still be there in 1.6, with only the likelihood of triggering it differing?

@PallHaraldsson
Contributor

The stacktrace it printed on interrupt suggests this was the problem.

Can that only happen on I/O? I confirmed the hang with no I/O, narrowed down to only "wait" being the problem.

@PallHaraldsson
Contributor

@vchuravy this may be a fix for #41407 that you closed.

And I actually replicated the hang just now with the old commit; I also located it here (i.e. the first commit there):

$ ps -ef |grep julia
pharald+  8402 18133  0 jún29 pts/11  00:00:00 sh -c ~/julia-1.8-DEV-7553ca13cc/bin/julia --startup-file=no -t4 -O2 fasta.jl 25000000 > /dev/null
pharald+  8403  8402 14 jún29 pts/11  03:45:50 /home/pharaldsson_sym/julia-1.8-DEV-7553ca13cc/bin/julia --startup-file=no -t4 -O2 fasta.jl 25000000
pharald+  8682 18133  0 jún29 pts/11  00:00:00 sh -c ~/julia-1.8-DEV-7553ca13cc/bin/julia --startup-file=no -t4 -O2 fasta.jl 25000000 > /dev/null
pharald+  8683  8682 13 jún29 pts/11  03:36:01 /home/pharaldsson_sym/julia-1.8-DEV-7553ca13cc/bin/julia --startup-file=no -t4 -O2 fasta.jl 25000000
pharald+ 22881 18133  0 jún29 pts/11  00:00:00 sh -c ~/julia-1.8-DEV-7553ca13cc/bin/julia --startup-file=no -p2 -O2 bt.jl 21

The last line must also be a hang from yesterday, with -p2 rather than -t4 and a different program. That program has also slowed down:

$ hyperfine '~/julia-1.8-DEV-a9412439c3/bin/julia --startup-file=no -p2 -O2 bt.jl 21'
Benchmark #1: ~/julia-1.8-DEV-a9412439c3/bin/julia --startup-file=no -p2 -O2 bt.jl 21
  Time (mean ± σ):     25.268 s ±  1.619 s    [User: 47.272 s, System: 2.329 s]
  Range (min … max):   23.995 s … 29.276 s    10 runs
 
$ hyperfine '/home/pharaldsson_sym/julia-1.8-DEV-7553ca13cc/bin/julia --startup-file=no -p2 -O2 bt.jl 21'
Benchmark #1: /home/pharaldsson_sym/julia-1.8-DEV-7553ca13cc/bin/julia --startup-file=no -p2 -O2 bt.jl 21
  Current estimate: 19.710 s (ETA 00:02:58)
^C

Labels
bugfix (This change fixes an existing bug), GC (Garbage collector), multithreading (Base.Threads and related functionality)
Development

Successfully merging this pull request may close these issues.

Julia can hang when using threading, on 1.7 beta2 and 1.8