Allow PTLS to be changed in a function (preparation for task migration) #39168
Conversation
// Fetch or insert the call to `julia.ptls_states` in the entry block.
//
// Note: It is OK to use the PTLS of the entry block since the runtime
// (`ctx_switch`) is responsible for maintaining the association of PTLS
// and GC frame. What the lowering cares about are the uses of the PTLS
// after re-fetches *other than* the GC frame.
auto ptlsStates = ensureEntryBlockPtls(*F);
// TODO: Ask for a review.
Is my comment here correct? By ctx_switch, I was referring to this part:

Lines 402 to 403 in 86639e1

// set up global state for new task
ptls->pgcstack = t->gcstack;
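A minimal sketch of the invariant this comment relies on (the field names follow the quoted task.c lines; the struct layouts and the helper function are illustrative assumptions, not the real definitions): on a context switch, the runtime re-points the current thread's PTLS at the switched-in task's GC frame chain, which is why the entry-block PTLS stays consistent with the GC frame.

// Illustrative sketch only; the real definitions live in julia.h / task.c.
struct jl_gcframe_t;                                 // GC frame chain node (opaque here)
struct jl_task_t { jl_gcframe_t *gcstack; };         // per-task GC frame chain
struct jl_tls_states_t { jl_gcframe_t *pgcstack; };  // per-thread (PTLS) view of it

// The GC-relevant part of a context switch, mirroring the quoted lines:
// the thread's PTLS is re-pointed at the incoming task's GC stack.
static void ctx_switch_gc_part(jl_tls_states_t *ptls, jl_task_t *t)
{
    // set up global state for new task
    ptls->pgcstack = t->gcstack;
}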
if (isa<SelectInst>(Ptls)) {
// TODO: Ask if this branch is required. Since we don't introduce select
// for PTLS variables explicitly, this won't be necessary if we can be
// sure that other LLVM passes won't introduce select instructions.
// TODO: Test this branch if we keep this branch.
if (!InsertResult.second) {
auto OldPtls = InsertResult.first->second;
if (OldPtls->comesBefore(Ptls)) {
BbToPtls[BB] = Ptls;
Do we need this isa<SelectInst>(Ptls) branch?
if (!ptls_getter)
return true;

// Look for a call to 'julia.ptls_states'.
ptlsStates = getPtls(F);
if (!ptlsStates)
return true;
Were these early returns some kind of optimization? If so, we can add something like
if (!usePtls(F))
return true;
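A minimal sketch of what such a hypothetical usePtls(F) helper could look like (usePtls is only a name suggested in the comment above, not an existing function; the placeholder function names are taken from the PR description):

#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"

// Return true if the function contains any call to a PTLS getter or
// placeholder, so the early return is taken only when PTLS is truly unused.
static bool usePtls(llvm::Function &F)
{
    for (llvm::BasicBlock &BB : F) {
        for (llvm::Instruction &I : BB) {
            auto *CI = llvm::dyn_cast<llvm::CallInst>(&I);
            if (!CI)
                continue;
            llvm::Function *Callee = CI->getCalledFunction();
            if (Callee && (Callee->getName() == "julia.ptls_states" ||
                           Callee->getName() == "julia.reuse_ptls_states" ||
                           Callee->getName() == "julia.refetch_ptls_states"))
                return true;
        }
    }
    return false;
}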
SmallVector<std::pair<Instruction *, std::unique_ptr<SmallVector<Instruction *>>>>
    SourceAndReuses;
I have no idea how to do SmallVector-of-SmallVectors in "modern" C++ (or maybe rather in the LLVM infrastructure). Is this an OK approach?
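A possible alternative, as a hedged sketch (assumption: llvm::SmallVector is movable, so it can be nested directly instead of holding the inner vector behind std::unique_ptr; the element types mirror the quoted declaration, and the inline sizes are arbitrary):

#include <utility>
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/Instruction.h"

// Inner list: the reuse sites belonging to one PTLS source instruction.
using ReuseList = llvm::SmallVector<llvm::Instruction *, 4>;

// Outer list of (source, reuses) pairs; no unique_ptr needed because
// SmallVector supports move construction and move assignment.
llvm::SmallVector<std::pair<llvm::Instruction *, ReuseList>, 8> SourceAndReuses;

// Usage sketch:
//   SourceAndReuses.push_back({Source, ReuseList{}});
//   SourceAndReuses.back().second.push_back(ReuseCall);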
PM->add(createLowerPTLSReusePass());
#ifdef JL_DEBUG_BUILD
PM->add(createVerifierPass(false));
#endif
I'll remove the PM->add(createVerifierPass(false)); calls before merge.
This pushes the cost onto code that does not use any related features. Instead, the cost should be on the code that wants to move tasks between threads by setting the correct content of the local storage.
By "the cost should be on the code", do you mean the code in the compiler or the user code generated by the compiler? |
Doesn't matter. It's "the code that wants to move tasks between threads". If you want to do this in generated code then it's the generated code; if you want to do that in the runtime then it's runtime code.
Could you elaborate your reasoning? If you are talking about the cost in the compiler, the cost would be paid only once during compilation for a given user method. If you are talking about the cost in the generated user code (or also the scheduler), the cost would be paid every time the function is called. I am not following why this distinction "doesn't matter." I think how often we'd need to pay the cost matters when we are assessing the efficiency of the approach. From your reply, I'm guessing that you are not talking about the cost of compilation but rather the run-time cost of the compiled (generated) code and also the runtime/scheduler. Is this correct?
I'm not sure what explanation you want. All of these are talking about the cost in this PR, whereas the previous reply was entirely about what I believe it should be. The two replies are talking about completely different things, so I'm confused about what you are asking about. In any case, the only distinction that matters, as I said at the very beginning, is whether you are adding cost to code that uses the feature vs. code that doesn't. I also never said that the frequency doesn't matter. All I said is that what kind of code it is "that wants to move tasks between threads" doesn't matter at all. It's what it does, i.e. move tasks between threads, that matters. And it matters exactly because this must happen way less frequently than function calls.
Thanks for the explanation. I think your reply gives me an extra hint that my guess in the previous comment might be correct. I'll reply assuming that your concern is not the cost of compilation time (i.e., the cost paid once for each method; a penalty for making the compiler more complex) but rather the cost at run time (i.e., the cost paid for each method call). More specifically, I'm guessing you are referring in particular to the insertion of the PTLS refetches.

First of all, the refetch is not enabled unless you switch on the C macro MIGRATE_TASKS.

Second, since the optimizer on the Julia IR can inline function calls, let me emphasize that a function call in Julia code at the syntax level does not imply a PTLS refetch. Having said that, the current implementation does not support elimination of the refetch after LLVM's inliner. It's also conceivable that we'd have some analysis pass that can prove certain function calls do not yield. In that case, we can safely remove the refetch after such calls.

Third, even if (hypothetically) it turned out that supporting the migration of arbitrary tasks is not possible without a significant sacrifice of single-thread performance, we could still offer opt-in migration via a special macro/function.
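To make the second point concrete, here is a minimal sketch of the kind of cleanup such an analysis would enable (refetchIsRedundant is a hypothetical analysis query, not something implemented in this PR; the placeholder names and the identical-signature property come from the PR description below):

#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"

// Hypothetical analysis query (assumption): true if the call this refetch
// guards can be proven never to yield.
bool refetchIsRedundant(const llvm::CallInst *Refetch);

// Downgrade provably redundant refetches to plain reuses; LowerPTLSReuse then
// merges them with the PTLS value that is already available.
static void downgradeRefetches(llvm::Function &F, llvm::Function *ReuseFn)
{
    for (llvm::BasicBlock &BB : F) {
        for (llvm::Instruction &I : BB) {
            auto *CI = llvm::dyn_cast<llvm::CallInst>(&I);
            if (!CI)
                continue;
            llvm::Function *Callee = CI->getCalledFunction();
            if (Callee && Callee->getName() == "julia.refetch_ptls_states" &&
                refetchIsRedundant(CI))
                CI->setCalledFunction(ReuseFn);  // signatures are identical
        }
    }
}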
This is not a valid argument unless you promise that this feature will never be on by default.

Yes, of course, I did not assume that was the case. Still, you are forcing the cost on code that does not even use task switches, because the compiler must make conservative assumptions about function calls to C code and non-inlined functions.

No, you can't. A safe point is a local property, whereas task migration is not, unless you guarantee to migrate things back.
I'm interested in why you think it's not possible. If you think task migration is possible to implement (putting aside efficiency considerations), then "migration point" support would need to do exactly what a function containing such a point already does [1].

[1] The only way I can imagine that would be a problem is when someone develops a custom sub-scheduler by directly using the low-level task-switching API.
Well, the fact that this PR exists at all proves this point wrong.

Errrr, no, by definition? If a function contains such a point, then by definition it hasn't returned yet when reaching that point.
So the underlying issue is that we are caching data and that after a task migration that cached data is invalid. The three approaches I see are:
@yuyichao do you see an alternative path towards task migration (and perhaps even task preemption)?
I've been wanting to do 3 on that list for some time. Just waiting for the release to finish to start working on that. The jl_task_t struct already links back to ptls, so it won't necessarily even change the size. In fact, with a little bit of planning, I think it will work out to be equivalent in every place it matters.
Yes, 3 is what I mentioned in the other issue a while ago. It puts the cost in the right place.
A "disadvantage" of approach 3 is that we can't support LICM involving |
Superseded by #39220
This PR tweaks how PTLS getters are generated and lets us insert multiple PTLS getters at arbitrary points in a function. There are two goals motivating this change:
Task migration across threads. To migrate tasks across threads, we need to refetch the PTLS after every point where a function can yield (i.e., most function calls).
For the Julia-Tapir integration (or, more generally, LLVM-level compiler support for the task system) that we are working on, we need to be able to split out chunks of LLVM IR as separate functions. I think relaxing the current requirement that the PTLS has to be fetched only once, in the entry block, helps this part of the LLVM pass.
Note that this PR does not yet achieve these goals, so I'm not 100% sure it is the right approach. Also, there are some test failures due to this change, so it's not merge-ready. But it'd be great if I could get some feedback on the design before I spend more time fixing the bugs.
Demo
I built julia with MIGRATE_TASKS enabled; the demo then prints LLVM IR in which %thread_ptr is loaded twice: once at the beginning (as usual) and again just after @j_yield_177().

(Using multiple threads in julia with MIGRATE_TASKS crashes julia. I don't know if it's from the scheduler, this PR, or both.)

How it works
This patch introduces new LLVM placeholder functions julia.reuse_ptls_states and julia.refetch_ptls_states, whose signatures are identical to julia.ptls_states. During codegen (emit_function etc.), it inserts julia.reuse_ptls_states just before each instruction that needs the PTLS. It inserts julia.refetch_ptls_states if we need to refetch the PTLS (but this is not used in the normal build, which doesn't enable MIGRATE_TASKS). So, in the above example, the LLVM IR produced by emit_function (as shown by @code_llvm debuginfo=:none optimize=false f()) contains one of these placeholder calls at every point that needs the PTLS.
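As an illustration of that flow, here is a minimal sketch of inserting one such placeholder call before an instruction that needs the PTLS (the placeholder name comes from the PR description; the return type, written to mirror the {}*** signature of julia.ptls_states, and the helper name are assumptions, not the PR's actual codegen code):

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Module.h"

// Insert a call to julia.reuse_ptls_states immediately before `User`.
static llvm::CallInst *emitReusePtls(llvm::Module &M, llvm::Instruction *User)
{
    llvm::LLVMContext &Ctx = M.getContext();
    // Assumed return type, spelled {}*** to mirror julia.ptls_states.
    llvm::Type *T_pjlvalue = llvm::PointerType::getUnqual(llvm::StructType::get(Ctx));
    llvm::Type *T_pppjlvalue = T_pjlvalue->getPointerTo()->getPointerTo();
    llvm::FunctionCallee Reuse = M.getOrInsertFunction(
        "julia.reuse_ptls_states",
        llvm::FunctionType::get(T_pppjlvalue, /*isVarArg=*/false));
    llvm::IRBuilder<> Builder(User);  // insertion point: just before `User`
    return Builder.CreateCall(Reuse);
}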
In this patch, julia.ptls_states is not emitted by the initial codegen anymore. Instead, I created a pass, LowerPTLSReuse, that inserts julia.ptls_states if required and merges all @julia.refetch_ptls_states() calls using phi nodes. I think it means that we are manually denoting a particular "task-pure" semantics of the function @julia.reuse_ptls_states. (Ideally, LLVM would have a notion of pure/const appropriate for this use case. But, looking at isLoadFromConstGV and its comment, I'm guessing LLVM doesn't have this, and implementing the pass ourselves is a reasonable approach.)
The LowerPTLSReuse pass is implemented in such a way that it is (supposed to be) safe to run multiple times. This is because LateLowerGCFrame can introduce new instructions that require the PTLS (i.e., new @julia.reuse_ptls_states() calls), so we need to re-run LowerPTLSReuse again. LowerPTLSReuse is called at the beginning of the LLVM pass pipeline to avoid interfering with optimization passes that may give up when there are calls to an opaque function like @julia.reuse_ptls_states.
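A sketch of the pass shape that the "safe to run multiple times" property suggests (assumption: the legacy pass manager, matching the PM->add(createLowerPTLSReusePass()) hunk quoted earlier; the class body is illustrative, not the PR's implementation):

#include "llvm/IR/Function.h"
#include "llvm/Pass.h"

namespace {
// Shape of a re-runnable lowering pass: it reports a change only when it
// actually rewrote something, so scheduling it early and again after
// LateLowerGCFrame is safe.
struct LowerPTLSReuseSketch : public llvm::FunctionPass {
    static char ID;
    LowerPTLSReuseSketch() : llvm::FunctionPass(ID) {}

    bool runOnFunction(llvm::Function &F) override {
        // 1. Collect julia.reuse_ptls_states / julia.refetch_ptls_states calls in F.
        // 2. If there are none, return false (nothing to lower; a repeat run is a no-op).
        // 3. Otherwise insert julia.ptls_states in the entry block if missing,
        //    merge refetches with phi nodes, rewrite the reuse calls, return true.
        return false;
    }
};
char LowerPTLSReuseSketch::ID = 0;
} // namespace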
Next steps

Ideally: