Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

no-copy stacks #13099

Merged
merged 2 commits into from
Oct 2, 2018
Merged

no-copy stacks #13099

merged 2 commits into from
Oct 2, 2018

Conversation

vtjnash
Copy link
Sponsor Member

@vtjnash vtjnash commented Sep 13, 2015

this is a experimental exploration of the ramifications of the restoration of the !COPY_STACKS makefile option, for discussion and analysis.

this changes thread state to be a unw_context_t instead of a jmp_buf. this would allow unwinding a moved stack (for temporary inspection / remote backtrace of an arbitrary task). however, this increases the size of a jl_task_t object, so I'm not sure this is worthwhile. plus, while it still shouldn't need to do a sigprocmask syscall, it still requires more code to do the context switch. it might be better to figure out how to just make due with the current

this change removes all of the inline assembly code from this file, which is good for portability.

this change also makes it possible / makes use of the possibility of having more than 1 stack, but fewer than N (where N is the number of Tasks that have ever been created). This removes the need to guess jl_stackbase in julia_init and would makes it possible to create additional tasks with a "permanent" stack (for example, for better C-interop with the Gtk event loop to solve JuliaGraphics/Gtk.jl#161), without needing to reserve a new stack for every Task. I believe this change may be beneficial for embedding julia into other contexts (not just Gtk), where modifying the primary stack may be undesirable. plus, since it allows julia_init to be called from anywhere, it may be easier to call it via an FFI (which might not be particularly near the top of the stack) or from the static initializer for a dll.

i have compiled this, with and without !COPY_STACKS, on Ubuntu, Apple, and Windows (all x86_64). I don't have any more exotic machines like FreeBSD to test on, nor have I tried it on ARM or PPC yet.

ping @JeffBezanson @carnaval

@vtjnash
Copy link
Sponsor Member Author

vtjnash commented Sep 15, 2015

A nice discussion of these possibilities, as implemented in the GNU Pth library, is described in the following paper www.gnu.org/s/pth/rse-pmt.ps

@vtjnash vtjnash force-pushed the jn/no_copy_stacks branch 2 times, most recently from 9388766 to 30cd4b1 Compare September 24, 2015 03:39
src/task.c Outdated
@@ -1,9 +1,22 @@
// This file is a part of Julia. License is MIT: http://julialang.org/license

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this whitespace is significant to the add_license_headers script

@vtjnash vtjnash changed the title RFC: ~copy stacks ~copy stacks Sep 24, 2015
@vtjnash vtjnash force-pushed the jn/no_copy_stacks branch 2 times, most recently from 1a8676b to 4f063d8 Compare September 26, 2015 02:02
@vtjnash
Copy link
Sponsor Member Author

vtjnash commented Sep 28, 2015

a created a simple benchmark for involving this change in set of related expensive operations. this is intended to be a model of the client message_handler_loop:

julia> for i = 1:6
       let io = Pipe(), bigdata = tuple(collect(big(1):1000)), stk=0
       Base.link_pipe(io)
       @time @sync begin
         schedule(Base.sync_add(Task( () -> begin
           for i = 1:10
            serialize(io, bigdata)
           end
           close(io.in)
         end, stk)))
         schedule(Base.sync_add(Task( () -> let cnt = 0
           while !eof(io)
             r = deserialize(io)
             #@assert r == bigdata
             cnt += 1
           end
           #@assert cnt == 10
         end, stk)))
       end
       close(io)
       end
       end
stk = 0
  3.585497 seconds (4.26 M allocations: 131.958 MB, 3.11% gc time)
  2.583458 seconds (3.52 M allocations: 101.017 MB, 1.27% gc time)
  2.539079 seconds (3.52 M allocations: 101.017 MB, 0.84% gc time)
  2.535257 seconds (3.52 M allocations: 101.017 MB, 1.17% gc time)
  2.545820 seconds (3.52 M allocations: 101.017 MB, 1.04% gc time)
  2.550698 seconds (3.52 M allocations: 101.017 MB, 0.88% gc time)
stk = 1 MB
  3.647399 seconds (4.26 M allocations: 131.958 MB, 3.30% gc time)
  2.448537 seconds (3.52 M allocations: 101.017 MB, 1.43% gc time)
  2.439833 seconds (3.52 M allocations: 101.017 MB, 0.88% gc time)
  2.446417 seconds (3.52 M allocations: 101.017 MB, 1.21% gc time)
  2.452492 seconds (3.52 M allocations: 101.017 MB, 1.13% gc time)
  2.422734 seconds (3.52 M allocations: 101.017 MB, 0.89% gc time)

this represents an approximate 4.5% speedup.

one more benefit of this not mentioned above, is that these tasks are not pinned to a thread (unlike the usual copy_stack tasks), so it is possible they could be restarted on a different thread than the one that created them.

plus, the obligatory raw numbers for just a ping-pong context switch:

julia> @time @sync let stk = 1024*128, task = Vector{Task}(2)
         for i = 1:2
          task[i] = schedule(Base.sync_add(Task(()->for j = 1:10^7; yieldto(task[mod1(i+1,2)]); end, stk)))
         end
       end
  1.407526 seconds (2.23 k allocations: 113.653 KB)
# ~14 M task switches / second (70 ns / switch)

julia> @time @sync let stk = 0, task = Vector{Task}(2)
         for i = 1:2
          task[i] = schedule(Base.sync_add(Task(()->for j = 1:10^7; yieldto(task[mod1(i+1,2)]); end, stk)))
         end
       end
  4.069411 seconds (2.23 k allocations: 113.591 KB)
# ~4.9 M task switches / second (200 ns / switch)

(numbers computed on an Intel(R) Xeon(R) CPU E7- 8850 @ 2.00GHz)

@vtjnash
Copy link
Sponsor Member Author

vtjnash commented Sep 28, 2015

the ONLY commit that should be backport is 279f0a2 6470a26 "avoid running finish_task inside of throw_internal"

@tkelman
Copy link
Contributor

tkelman commented Sep 28, 2015

before or after 0.4.0? is there a test and/or failure case that's fixing?

@vtjnash
Copy link
Sponsor Member Author

vtjnash commented Sep 28, 2015

it should fix f() = f(); @async f() on ubuntu (so that it throws an error and doesn't segfault) and just generally handle task failure a bit more gracefully. it's not essential right now, I just wanted to add the tag so it doesn't get lost.

@vtjnash
Copy link
Sponsor Member Author

vtjnash commented Sep 28, 2015

status update: this now appears to be stable on all platforms (the travis failure was due to 35c5234, not this PR), and it better aligns Julia's behavior with how many other systems implement this same functionality (for example: the D runtime, win32, ucontext, GNU Pth).

however, while the win32 Fiber API is quite convenient, internally, it is has a large number of implementation flaws. The D language runtime (Boost licensed) has an implementation that we can borrow from to deal with this shortcoming: https://github.com/D-Programming-Language/druntime/blob/1f957372e5dadb92ab1d621d68232dbf8a2dbccf/src/core/thread.d#L4482-L4503

@vtjnash
Copy link
Sponsor Member Author

vtjnash commented Nov 13, 2015

I've squashed the history, so if you want to see the individual commits, look at 0aec198.

@vtjnash
Copy link
Sponsor Member Author

vtjnash commented Nov 16, 2015

this will close #7824

@astrieanna
Copy link
Contributor

When is this likely to be merged? JuliaGraphics/Gtk.jl#161 is an issue for me, and I'd like to know how much effort to invest in rearranging my program to work around it.

@bjarthur
Copy link
Contributor

+1 on JuliaGraphics/Gtk.jl#161 being a problem and wanting this merged if it fixes it.

@timholy
Copy link
Sponsor Member

timholy commented May 10, 2016

Bump (want re JuliaGraphics/Gtk.jl#161). I'm not familiar with the task infrastructure, so unfortunately I can't provide a useful review. Anyone, anyone? Bueller?

@tknopp
Copy link
Contributor

tknopp commented Sep 8, 2016

@vtjnash: What is the status of this? Using @sigatomic is kind of a workaround in Gtk.jl but it would be really neat if that actual issue would be solved.

@vtjnash
Copy link
Sponsor Member Author

vtjnash commented Sep 8, 2016

I've been very low on time to write the custom windows support code to work-around the buggy native version on that platform. the sigatomic will probably stick around in Gtk.jl in some similar form, since it also ensure proper stack return order and some useful exception handling, so I don't see that as too high importance right now to avoid.

@tknopp
Copy link
Contributor

tknopp commented Sep 8, 2016

Well its a little bit annoying to put @sigatomic everywhere in the code just to avoid that the application segfaults. I am currently building a larger Gtk.jl application where this is not so easy. The rule when in segfaults put @sigatomic in the code is also not that satisfying while coding.

@@ -101,10 +145,11 @@ struct _jl_tls_states_t {
struct _jl_module_t *current_module;
struct _jl_task_t *volatile current_task;
struct _jl_task_t *root_task;
//#ifdef COPY_STACKS
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be removed?

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This conditional can't be used in public facing code, but if you do delete COPY_STACKS, then this would get deleted to.

ui/repl.c Outdated
// pwine_get_version = GetProcAddress(hntdll, "wine_get_version");
// return pwine_get_version != 0;
//}
//#endif
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete?

jl_ucontext_t ctx; // saved thread state
void *stkbuf; // malloc'd memory (either copybuf or stack)
size_t bufsz; // actual sizeof stkbuf
unsigned int copy_stack:31; // sizeof stack for copybuf
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this be called something like copy_amount? This name sounds really similar to the boolean COPY_STACKS.

Copy link
Sponsor Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it probably could. I use this as a boolean often enough though too (as the runtime version of COPY_STACKS) that I'm being intentionally a bit ambiguous.

@StefanKarpinski StefanKarpinski changed the title ~copy stacks no-copy stacks Sep 10, 2018
never copy over the root stack:
this is a hybrid approach to COPY_STACK where the root task is never
moved or copied, and all other task stacks are layered into the same
memory area (ptls->basestack + basesize)

several strategies exist for making new stacks:
ucontext_t (where it is available, aka linux)
unw_context_t (as an alternative to ucontext_t that avoids a syscall on task-switch)
makecontext (as a posix standard implemention)
setjmp/longjmp-based implementation (for systems where this is sufficient)
Windows Fibers (implemented here, since we can be more efficient and
    reliable than the official Fibers API)

also, uses an alternate stack for use in collecting stack-overflow backtraces
like posix, but managed manually
Copy link
Sponsor Member

@vchuravy vchuravy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would like to have some more devdocs, but LGTM!

(I think I have read through it 5 times now, so I start to see what is happening here)

static void memcpy_a16(uint64_t *to, uint64_t *from, size_t nb)
{
memcpy((char*)jl_assume_aligned(to, 16), (char*)jl_assume_aligned(from, 16), nb);
//uint64_t *end = (uint64_t*)((char*)from + nb);
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this here just as a explanation of the above?

// select an implementation of stack switching.
// currently only COPY_STACKS is recommended.
#ifndef COPY_STACKS
// select whether to enable the COPY_STACKS stack switching optimization
#define COPY_STACKS
Copy link
Sponsor Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be great to have an explanation here or in the devdocs of the high-level strategy for task-switching.

I also wonder if we need this flag (and what the actual behavior change is that is toggled by it),
should I have the expectation that it generally works?

make -C test read fails and I managed to run out of memory during testall

      From worker 5:	fatal: error thrown and no exception handler available.
      From worker 5:	OutOfMemoryError()
      From worker 5:	rec_backtrace at /home/vchuravy/src/julia/src/stackwalk.c:94
      From worker 5:	record_backtrace at /home/vchuravy/src/julia/src/task.c:185
      From worker 5:	jl_throw at /home/vchuravy/src/julia/src/task.c:374
      From worker 5:	jl_malloc_stack at /home/vchuravy/src/julia/src/gc-stacks.c:138
      From worker 5:	ctx_switch at /home/vchuravy/src/julia/src/task.c:645 [inlined]
      From worker 5:	jl_switchto at /home/vchuravy/src/julia/src/task.c:322
      From worker 5:	unknown function (ip: 0x7f1a41ac0f78)
      From worker 5:	unknown function (ip: 0x7f1a41ac1204)
      From worker 5:	unknown function (ip: 0x7f1a41b2d052)
      From worker 5:	jl_apply_generic at /home/vchuravy/src/julia/src/gf.c:2198
      From worker 5:	jl_apply at /home/vchuravy/src/julia/src/julia.h:1562 [inlined]
      From worker 5:	finish_task at /home/vchuravy/src/julia/src/task.c:172
      From worker 5:	start_task at /home/vchuravy/src/julia/src/task.c:512
      From worker 5:	unknown function (ip: 0xffffffffffffffff)

@vchuravy
Copy link
Sponsor Member

vchuravy commented Oct 2, 2018

I am going to merge this so that we can shake out bugs (if any) and that we can base PARTR off this

@vchuravy vchuravy merged commit d769ad2 into master Oct 2, 2018
@vchuravy vchuravy deleted the jn/no_copy_stacks branch October 2, 2018 12:41
@StefanKarpinski
Copy link
Sponsor Member

Very exciting!

@tknopp
Copy link
Contributor

tknopp commented Oct 6, 2018

@vtjnash @vchuravy: Is this expected to solve
JuliaGraphics/Gtk.jl#161 (as indicated in the OP)? I have tried it and still get segfaults (using the example from JuliaGraphics/Gtk.jl#391 without sigatom).

@StefanKarpinski
Copy link
Sponsor Member

I don’t think so... it just changes the way stack switching is implemented. You still can’t run Julia code not on the main thread.

@tknopp
Copy link
Contributor

tknopp commented Oct 7, 2018

We only have one thread so that is not the issue. Once upon a time everything was working but with the introduction of COPY_STACKS Gtk broke (see
JuliaGraphics/Gtk.jl#391).

@JeffBezanson
Copy link
Sponsor Member

We're still using COPY_STACKS by default; you can try turning it off in src/options.h.

@tknopp
Copy link
Contributor

tknopp commented Oct 9, 2018

@JeffBezanson Many thanks Jeff. It is now working without a crash! This is absolutely fantastic. Can we please please make this default and rush this into a release? Julia deserves high quality Gtk bindings that are not crashing. I know that this is a feature, but at the same time it is a bugfix for JuliaGraphics/Gtk.jl#161 (from 2015) which still leads to major limitations.

@vtjnash: Thanks so much!

@JeffBezanson
Copy link
Sponsor Member

We are working towards that.

@tknopp
Copy link
Contributor

tknopp commented Oct 9, 2018

Is there something missing which needs to be done other than undef COPY_STACKS? I could make a PR that toggles that if you like.

@JeffBezanson
Copy link
Sponsor Member

See #13099 (comment)

@tknopp
Copy link
Contributor

tknopp commented Oct 9, 2018

ok thanks. Then I stay patient and wait for it.

@JeffBezanson
Copy link
Sponsor Member

Working on this now (we're doing a sprint on #22631, so it came up). During the read test we start 65536 tasks, so allocating a stack for each exceeds the kernel's maximum number of allowed mappings per process. We can fix this by falling back to stack-copying if we're running out of mappings. The user can request that a Task have a dedicated stack using Task(func, stack_size), which would error if no mappings are available.

Things to try in the future:

  • Pool allocations, with more than one stack per mmap. This might work around the limit, unless the mprotects somehow prevent that.
  • Allocate smaller stacks by default.
  • Reduce the default stack size as Task allocation pressure increases.
  • Run inference (which can use a lot of stack) on its own stack, allowing most stacks to be smaller.

@StefanKarpinski
Copy link
Sponsor Member

Running inference on its own large stack combined with static analysis of how much stack space a task could potentially need could potentially allow allocating very small "stack" objects. I'm not sure how common tasks where the maximum stack size is statically knowable would actually turn out to be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants