
Allow PTLS to be changed in a function (preparation for task migration) #39168

Closed
wants to merge 4 commits

Conversation

@tkf (Member) commented Jan 10, 2021

This PR tweaks how PTLS getters are generated and lets us insert multiple PTLS getters at arbitrary points in a function. There are two goals motivating this change:

  1. Task migration across threads. To migrate tasks across threads, we need to refetch the PTLS after every point at which a function can yield (i.e., most function calls).

  2. For the Julia-Tapir integration we are working on (or, more generally, LLVM-level compiler support for the task system), we need to be able to split out chunks of LLVM IR as functions. I think relaxing the current requirement that the PTLS be fetched only once, in the entry block, helps this part of the LLVM pass.

Note that this PR does not fully implement these goals, so I'm not 100% sure it is the right approach yet. Also, there are some test failures due to this change, so it's not merge-ready. But it'd be great to get some feedback on the design before I spend more time fixing the bugs.

Demo

I built julia with MIGRATE_TASKS enabled:

diff --git a/src/options.h b/src/options.h
index 3ffbf05b22..5ea220900b 100644
--- a/src/options.h
+++ b/src/options.h
@@ -114,7 +114,7 @@
 #endif

 // allow a suspended Task to restart on a different thread
-//#define MIGRATE_TASKS
+#define MIGRATE_TASKS

 // threading options ----------------------------------------------------------

and then

julia> function f()
           a = Threads.threadid()
           yield()
           b = Threads.threadid()
           (a, b)
       end
f (generic function with 1 method)

julia> @code_llvm debuginfo=:none f()

prints

; Function Attrs: sspstrong
define void @julia_f_175([2 x i64]* noalias nocapture sret([2 x i64]) %0) #0 {
top:
  %thread_ptr4 = call i8* asm "movq %fs:0, $0", "=r"() #5
  %1 = getelementptr i8, i8* %thread_ptr4, i64 -32752
  %2 = bitcast i8* %1 to i16*
  %3 = load i16, i16* %2, align 2
  %4 = sext i16 %3 to i64
  %5 = add nsw i64 %4, 1
  %6 = call nonnull {}* @j_yield_177()
  %thread_ptr = call i8* asm "movq %fs:0, $0", "=r"() #5
  %7 = getelementptr i8, i8* %thread_ptr, i64 -32752
  %8 = bitcast i8* %7 to i16*
  %9 = load i16, i16* %8, align 2
  %10 = sext i16 %9 to i64
  %11 = add nsw i64 %10, 1
  %.sroa.0.0..sroa_idx = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 0
  store i64 %5, i64* %.sroa.0.0..sroa_idx, align 8
  %.sroa.2.0..sroa_idx1 = getelementptr inbounds [2 x i64], [2 x i64]* %0, i64 0, i64 1
  store i64 %11, i64* %.sroa.2.0..sroa_idx1, align 8
  ret void
}

Note that %thread_ptr is loaded twice: once at the beginning (as usual) and again just after the call to @j_yield_177().

(Using multiple threads in julia with MIGRATE_TASKS crashes julia. I don't know whether the crash comes from the scheduler, from this PR, or both.)

How it works

This patch introduces new LLVM placeholder functions julia.reuse_ptls_states and julia.refetch_ptls_states whose signatures are identical to julia.ptls_states. During codegen (emit_function etc.), it inserts julia.reuse_ptls_states just before each instruction that needs the PTLS. It inserts julia.refetch_ptls_states wherever the PTLS needs to be refetched (this is unused in a normal build without MIGRATE_TASKS). So, for the above example, emit_function produces this LLVM IR (@code_llvm debuginfo=:none optimize=false f()):

; Function Attrs: sspstrong
define void @julia_f_235([2 x i64]* noalias nocapture sret([2 x i64]) %0) #0 {
top:
  %1 = alloca [2 x i64], align 8
  %2 = call {}*** @julia.reuse_ptls_states()
  %3 = bitcast {}*** %2 to i16*
  %4 = getelementptr inbounds i16, i16* %3, i64 8
  %5 = load i16, i16* %4, align 2
  %6 = sext i16 %5 to i64
  %7 = add i64 %6, 1
  %8 = call nonnull {}* @j_yield_237()
  %9 = call {}*** @julia.refetch_ptls_states()
  %10 = call {}*** @julia.reuse_ptls_states()
  %11 = bitcast {}*** %10 to i16*
  %12 = getelementptr inbounds i16, i16* %11, i64 8
  %13 = load i16, i16* %12, align 2
  %14 = sext i16 %13 to i64
  %15 = add i64 %14, 1
  %16 = getelementptr inbounds [2 x i64], [2 x i64]* %1, i32 0, i32 0
  store i64 %7, i64* %16, align 8
  %17 = getelementptr inbounds [2 x i64], [2 x i64]* %1, i32 0, i32 1
  store i64 %15, i64* %17, align 8
  %18 = bitcast [2 x i64]* %0 to i8*
  %19 = bitcast [2 x i64]* %1 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 8 %18, i8* %19, i64 16, i1 false)
  ret void
}

In this patch, julia.ptls_states is not emitted by the initial codegen anymore. Instead, I created a pass LowerPTLSReuse that inserts julia.ptls_states where required and merges all @julia.refetch_ptls_states() calls using phi nodes. I think this means that we are manually encoding a particular "task-pure" semantics for @julia.reuse_ptls_states. (Ideally, LLVM would have a notion of pure/const appropriate for this use case. But, looking at isLoadFromConstGV and its comment, I'm guessing LLVM doesn't have one, and that implementing the pass ourselves is a reasonable approach.)
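
For illustration only (this is not the PR's actual code; names like mergePtlsAtJoin and EntryPtls are made up), the phi-merging step could look roughly like this:

#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"

using namespace llvm;

// At a (non-empty) join block reachable both with the entry-block PTLS and
// with a value produced by julia.refetch_ptls_states, merge the two
// candidates with a phi node so that later julia.reuse_ptls_states calls in
// this block can all be rewritten to a single dominating value.
static Value *mergePtlsAtJoin(BasicBlock *Join,
                              BasicBlock *EntryPred, Value *EntryPtls,
                              BasicBlock *RefetchPred, Value *RefetchPtls)
{
    IRBuilder<> Builder(&Join->front()); // phis must sit at the block start
    PHINode *Phi = Builder.CreatePHI(EntryPtls->getType(), 2, "merged_ptls");
    Phi->addIncoming(EntryPtls, EntryPred);
    Phi->addIncoming(RefetchPtls, RefetchPred);
    return Phi;
}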

The LowerPTLSReuse pass is implemented in such a way that it is (supposed to be) safe to run multiple times. This matters because LateLowerGCFrame can introduce new instructions that require the PTLS (i.e., new @julia.reuse_ptls_states() calls), so we need to run LowerPTLSReuse again afterwards. LowerPTLSReuse is also run at the beginning of the pass pipeline, to avoid interfering with optimization passes that may give up when they see calls to an opaque function like @julia.reuse_ptls_states.

Next steps

Ideally:

  1. Get OK for the design from compiler devs
  2. Fix the bugs and make the CI happy
  3. Get OK for the implementation and then merge it

@tkf added the compiler:codegen (Generation of LLVM IR and native code) and multithreading (Base.Threads and related functionality) labels on Jan 10, 2021
Comment on lines +2562 to +2569
// Fetch or insert the call to `julia.ptls_states` in the entry block.
//
// Note: It is OK to use the PTLS of the entry block since the runtime
// (`ctx_switch`) is responsible for maintaining the association of the PTLS
// and the GC frame. What the lowering cares about is the uses of the PTLS
// after re-fetches *other than* the GC frame.
auto ptlsStates = ensureEntryBlockPtls(*F);
// TODO: Ask for a review.
@tkf (Member, Author) commented Jan 10, 2021

Is my comment here correct? By ctx_switch, I was referring to this part:

julia/src/task.c

Lines 402 to 403 in 86639e1

// set up global state for new task
ptls->pgcstack = t->gcstack;

Comment on lines +86 to +94
if (isa<SelectInst>(Ptls)) {
    // TODO: Ask if this branch is required. Since we don't introduce select
    // instructions for PTLS variables explicitly, this won't be necessary if
    // we can be sure that other LLVM passes won't introduce them either.
    // TODO: Test this branch if we keep it.
    if (!InsertResult.second) {
        auto OldPtls = InsertResult.first->second;
        if (OldPtls->comesBefore(Ptls)) {
            BbToPtls[BB] = Ptls;
@tkf (Member, Author)

Do we need this isa<SelectInst>(Ptls) branch?

Comment on lines -285 to -291
if (!ptls_getter)
    return true;

// Look for a call to 'julia.ptls_states'.
ptlsStates = getPtls(F);
if (!ptlsStates)
    return true;
@tkf (Member, Author)

Were these early returns some kind of optimization? If so, we can add something like

if (!usePtls(F))
    return true;

Comment on lines +152 to +153
SmallVector<std::pair<Instruction *, std::unique_ptr<SmallVector<Instruction *>>>>
SourceAndReuses;
@tkf (Member, Author)

I have no idea how to do a SmallVector-of-SmallVectors in "modern" C++ (or rather within LLVM's infrastructure). Is this an OK approach?
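
For comparison, here is a minimal sketch of one alternative (the inline element counts 4 and 8 are arbitrary guesses, not tuned values): SmallVector is movable, so it can be nested by value, which avoids the extra heap indirection of unique_ptr:

#include <utility>
#include "llvm/ADT/SmallVector.h"

namespace llvm { class Instruction; }
using namespace llvm;

// SmallVector has a move constructor, so it can be stored by value inside
// another SmallVector; no unique_ptr wrapper is needed for that.
SmallVector<std::pair<Instruction *, SmallVector<Instruction *, 4>>, 8>
    SourceAndReuses;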

Comment on lines +631 to +634
PM->add(createLowerPTLSReusePass());
#ifdef JL_DEBUG_BUILD
PM->add(createVerifierPass(false));
#endif
@tkf (Member, Author)

I'll remove the PM->add(createVerifierPass(false)); lines before merging.

@yuyichao (Contributor)

This pushes the cost onto code that does not use any related features. Instead, the cost should be on the code that wants to move tasks between threads, by setting the correct content of the thread-local storage.

@tkf (Member, Author) commented Jan 10, 2021

By "the cost should be on the code", do you mean the code in the compiler or the user code generated by the compiler?

@yuyichao (Contributor)

do you mean the code in the compiler or the user code generated by the compiler

Doesn't matter. It's "the code that wants to move tasks between threads". If you want to do this in generated code, then it's the generated code; if you want to do it in the runtime, then it's runtime code.

@tkf (Member, Author) commented Jan 10, 2021

Could you elaborate on your reasoning? If you are talking about the cost in the compiler, the cost would be paid only once during compilation for a given user method. If you are talking about the cost in the generated user code (or also the scheduler), the cost would be paid every time the function is called. I am not following why this distinction "doesn't matter." I think how often we'd need to pay the cost matters when we are assessing the efficiency of the approach.

From your reply, I'm guessing that you are not talking about the cost of compilation but rather the run-time cost of the compiled (generated) code and also the runtime/scheduler. Is this correct?

@yuyichao (Contributor)

Could you elaborate on your reasoning? If you are talking about the cost in the compiler, the cost would be paid only once during compilation for a given user method. If you are talking about the cost in the generated user code (or also the scheduler), the cost would be paid every time the function is called. I am not following why this distinction "doesn't matter." I think how often we'd need to pay the cost matters when we are assessing the efficiency of the approach.

From your reply, I'm guessing that you are not talking about the cost of compilation but rather the run-time cost of the compiled (generated) code and also the runtime/scheduler. Is this correct?

I'm not sure what explanation you want. All of these are talking about the cost of this PR, whereas my previous reply was entirely about what I believe it should be. The two replies are talking about completely different things, so I'm confused about what you are asking.

In any case, the only distinction that matters, as I said at the very beginning, is whether you are adding cost to code that uses the feature vs. code that doesn't. I also never said that the frequency doesn't matter. All I said is that what kind of code it is "that wants to move tasks between threads" doesn't matter at all. It's what it does, i.e., move tasks between threads, that matters. And it matters exactly because this must happen far less frequently than function calls.

@tkf (Member, Author) commented Jan 10, 2021

Thanks for the explanation. I think your reply gives me an extra hint that my guess in the previous comment might be correct. I'll reply assuming that your concern is not the cost of compilation time (i.e., the cost paid once per method; a penalty for making the compiler more complex) but rather the cost at run time (i.e., the cost paid on each method call). More specifically, I'm guessing you are referring in particular to the insertion of julia.refetch_ptls_states after every function invocation. I have a few comments on that:

First of all, refetch is not enabled unless you switch on the C macro MIGRATE_TASKS. So, I'd expect a normal build of Julia to be as efficient as one without this PR. Of course, the IR won't be identical, since this PR inserts extra phi nodes. I'm hoping something like CFGSimplifyPass can eliminate the possible costs associated with this kind of IR. I think landing this on the main branch without MIGRATE_TASKS enabled is a good way to assess the cost (hopefully zero) of the extra phi nodes in real applications.

Second, since the optimizer on the Julia IR can inline function calls, let me emphasize that a function call in Julia code at the syntax level does not imply a PTLS refetch. Having said that, the current implementation does not support eliminating the refetch after LLVM's inliner runs. It's also conceivable that we'd have some analysis pass that can prove certain function calls do not yield; in that case, we can safely remove julia.refetch_ptls_states. So, I think these points further back up the design decision to postpone the insertion of julia.ptls_states to an LLVM pass rather than doing it in emit_function: since we emit a flexible IR, we can do some optimizations down the line.

Third, even if (hypothetically) it turned out that supporting the migration of arbitrary tasks is not possible without significantly sacrificing single-thread performance, we can still offer opt-in migration via a special macro/function, say Threads.@migratable, akin to GC.safepoint(). I think the current approach would support this scenario as well, since we'd just need to insert a yield (with a hint to the scheduler that the task supports migration) and a julia.refetch_ptls_states.

@yuyichao (Contributor)

First of all, refetch is not enabled unless you switch on the C macro MIGRATE_TASKS.

This is not a valid argument unless you promise that this feature will never be on by default.

let me emphasize that a function call in Julia code at the syntax level does not imply a PTLS refetch

Yes, of course, I did not assume that was the case. Still, you are forcing the cost on code that does not even use task switches, because the compiler must make conservative assumptions about function calls to C code and non-inlined functions.

we can still offer opt-in migration via a special macro/function, say Threads.@migratable, akin to GC.safepoint()

No, you can't. A safepoint is a local property whereas task migration is not, unless you guarantee to migrate things back.

@tkf (Member, Author) commented Jan 10, 2021

we can still offer opt-in migration via a special macro/function, say Threads.@migratable, akin to GC.safepoint()

No, you can't. A safepoint is a local property whereas task migration is not, unless you guarantee to migrate things back.

I'm interested in why you think it's not possible. If you think task migration is possible to implement (putting aside efficiency considerations), then supporting a "migration point" would work exactly like supporting a function containing yield in such a compiler. Since all functions called from the function containing the migration point have already returned at the migration point [1], I'm not sure why it's not possible to process the migration point with just the information in the IR of the function, independent of the properties of the callees.

[1] The only way I can imagine this being a problem is when someone develops a custom sub-scheduler by directly using yieldto to manually suspend a function call (then a migration point in any of its users would migrate the entire sub-scheduler). AFAICT, such a framework is not widely used yet. So, I'm leaning towards supposing this is a non-issue and adding a warning to the yieldto docstring.

@yuyichao (Contributor)

then "migration point" support would need to do exactly how a function containing yield would be supported in such compiler.

Well, the fact that this PR exists at all proves this point wrong.

Since all functions called from the function containing the migration point have already returned at the migration point,

Errrr, no, by definition? If a function contains a point, then by definition it hasn't returned yet when reaching that point.

@vchuravy (Member)

So the underlying issue is that we are caching data and that after a task migration that cached data is invalid.

The three approaches I see are:

  1. Conservatively insert refetches after potential yield points
    • This PR
    • Fairly straightforward to implement by repurposing the current ptls infrastructure
    • As pointed out by Yichao, increases the runtime cost regardless of whether a task switch occurred
  2. During compilation, derive a stack table that records where the ptls pointer lives (or even values derived from the ptls, but I fear that's a rabbit hole), and during task migration use that table to fix up the ptls pointer
    • Moves the cost to compilation and actual migration
    • I think this is similar to what Go uses to support preemption
    • No clear implementation strategy -- maybe we could use debug info, but that would be unreliable
  3. Use the ptls pointer once to obtain the current task, and look up information through the task struct instead of the ptls (see the sketch below)
    • Provides a place for task migration to change things as needed
    • Increases the size of the task struct

@yuyichao do you see an alternative path towards task migration (and perhaps even task preemption?)
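
A minimal sketch of approach 3 (all names and layout here are hypothetical illustrations, not the actual Julia runtime API):

// Hypothetical illustration only: fetch the current task once through the
// thread-local pointer, then reach per-thread state via the task struct,
// which the scheduler rewrites when it migrates the task.
extern "C" {
    struct jl_tls_states_t;               // per-thread state (assumed opaque here)
    struct jl_task_t {
        jl_tls_states_t *ptls;            // assumed back-link, fixed up on migration
    };
    jl_task_t *jl_get_current_task(void); // assumed accessor over the PTLS
}

void use_task_local_state(void)
{
    jl_task_t *ct = jl_get_current_task(); // one PTLS read per use site
    jl_tls_states_t *ptls = ct->ptls;      // stays valid across a yield, because
                                           // migration updates ct->ptls for us
    (void)ptls;
}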

@vtjnash (Member) commented Jan 10, 2021

I've been wanting to do 3 on that list for some time. Just waiting for the release to finish to start working on that. The jl_task_t struct already links back to ptls, so it won't necessarily even change the size. In fact, with a little bit of planning, I think it will work out to be equivalent in every place it matters.

@yuyichao (Contributor)

Yes, 3 is what I mentioned in the other issue a while ago. It puts the cost at the right place.

@tkf (Member, Author) commented Jan 10, 2021

A "disadvantage" of approach 3 is that we can't support LICM involving rand() or array[threadid()] in general (which was the reason why I thought we were OK to accept the cost of approach 1, based on the quick discussion in Julia slack last month). Having said that, it is questionable if we should encourage threadid()-based user-space thread-local storage. For thread-safe RNG, there's already #34852 and we can even generalize it by implementing splitter "hyperobject" as suggested in Cilk. It'd be great if we can use approach 3 as an additional incentive to move away from threadid().

@vtjnash (Member) commented Apr 3, 2021

Superseded by #39220

@vtjnash closed this Apr 3, 2021