WIP/RFC: Split GC lowering pass into a platform-agnostic pass and a CPU-specific pass #31135
Conversation
What's the reasoning behind two passes as opposed to just doing codegen based on the target triple, or potentially making the pass a parent class with the codegen parts behind a virtual method? Seems like a lot of the complexity here stems from trying to preserve information across the pass boundary.
Being able to provide an implementation for the second pass in CUDAnative, written in Julia, using the LLVM C API. Kind of what we do now, providing our own implementation for all of […]. I was also imagining this "final" pass taking care of possible other platform-dependent intrinsics, either as the result of other LLVM-level lowering passes or maybe even direct […].
I get the motivation. On the other hand, this pass is very tightly coupled to the representation of the GC intrinsics, so I'm not really sure much is lost by just branching on the target architecture in the code generator and putting those details there. That seems far simpler than trying to encode the entire state data structure into intrinsics in the IR.
@Keno thanks for your feedback! :) I'll try and address your concerns.
EDIT: I forgot about an intrinsic ([…]).
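To make the split being debated here concrete: below is a minimal sketch of what the CPU-specific half could look like, using LLVM's legacy pass interface. The class name and the `julia.new_gc_frame` intrinsic come from elsewhere in this thread; the worklist walk and the lowering body are illustrative, not the PR's actual code.

```cpp
#include "llvm/ADT/SmallVector.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/Module.h"
#include "llvm/Pass.h"

using namespace llvm;

namespace {
// The platform-agnostic pass (LateLowerGCFrame) leaves behind calls to
// abstract intrinsics such as 'julia.new_gc_frame'; the final pass then
// rewrites those calls into IR that is only valid on a CPU.
struct FinalLowerGC : public FunctionPass {
    static char ID;
    FinalLowerGC() : FunctionPass(ID) {}

    bool runOnFunction(Function &F) override {
        // Collect the calls first so we can rewrite them without
        // invalidating the iterators we are walking.
        SmallVector<CallInst *, 8> Worklist;
        for (BasicBlock &BB : F)
            for (Instruction &I : BB)
                if (auto *CI = dyn_cast<CallInst>(&I))
                    if (Function *Callee = CI->getCalledFunction())
                        if (Callee->getName() == "julia.new_gc_frame")
                            Worklist.push_back(CI);

        for (CallInst *CI : Worklist) {
            // CPU-specific lowering would go here, e.g. materializing an
            // alloca for the frame and linking it into the thread-local
            // GC frame list; elided in this sketch.
            (void)CI;
        }
        return !Worklist.empty();
    }
};
} // namespace

char FinalLowerGC::ID = 0;
```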
@Keno Hello. It's been a while since I opened this PR and it's kind of been dormant since then. In the meantime, I've successfully implemented a garbage collector for CUDAnative on top of the changes I proposed here. Garbage collection underpins a wide range of Julia features, so I'd like to upstream my GC into CUDAnative's master branch. However, I'm fairly sure that's just not going to happen until the intrinsics I'm proposing here (or some alternative but equivalent mechanism) are included in Julia itself. I understand if you're not very excited about this admittedly rather large PR, but I would really like to salvage its core ideas. Code-wise, it consists of two main additions: the platform-agnostic GC intrinsics that `LateLowerGCFrame` now emits, and the `FinalLowerGC` pass that lowers them for CPU targets.

I'm not sure how to proceed here. Do you see an avenue for me to upstream some mechanism to support my GPU GC along with a "final" lowering pass? I mostly just want a roadmap I can stick to for adding support for my CUDAnative GC to Julia. I'd be happy to rework this PR into one or more PRs that are more to your liking, but for that I need you to tell me what those PRs should look like.

Also, I'd like to point out that my thesis is due by the end of May. I'm very motivated to work on this PR at the moment, but I also have a lot of other things I need to take care of before that end-of-May deadline, so I can't really afford to experiment with a variety of different approaches before settling on some "optimal" solution. Do you think that, given these constraints, we can make that deadline? Thanks!
CI failure is #30365
```cpp
char FinalLowerGC::ID = 0;
static RegisterPass<FinalLowerGC> X("FinalLowerGC", "Final GC intrinsic lowering pass", false, false);

Pass *createFinalLowerGCPass()
```
Please add a C-API function like this one (line 103 in 5741c7f):

```cpp
extern "C" JL_DLLEXPORT void LLVMExtraAddCombineMulAddPass(LLVMPassManagerRef PM)
```
Alright. I added one.
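The added shim presumably mirrors the pattern quoted above; the commit list further down defines `LLVMExtraAddFinalLowerGCPass`, so a minimal sketch would look roughly like this, assuming the usual `unwrap` conversion helpers from `LegacyPassManager.h` and Julia's `JL_DLLEXPORT` export macro (the body is illustrative):

```cpp
#include "llvm-c/Types.h"
#include "llvm/IR/LegacyPassManager.h"

using namespace llvm;

Pass *createFinalLowerGCPass(); // declared alongside the pass itself

// Expose the new pass through the C API so out-of-tree consumers (e.g.
// CUDAnative, which drives LLVM from Julia) can schedule it in their
// own pass pipelines.
extern "C" JL_DLLEXPORT void LLVMExtraAddFinalLowerGCPass(LLVMPassManagerRef PM)
{
    unwrap(PM)->add(createFinalLowerGCPass());
}
```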
@Keno so far we have been trying to separate the different backend implementations from the main Julia implementation, and I think that strategy has served us reasonably well. It allowed the GPU backend to experiment and iterate quicker than Julia release cycles, and branching on the target would strongly couple the GPU backend to specific versions of Julia. (In addition, as Jonathan says, there are several strategies that backends might want to experiment with.)
@vchuravy Thanks for chiming in! I rebased the PR again and added the C-API function you asked for.
If it's useful to the GPU backend I'm ok with this. I'll try to find the time to read through the details this week.
Codecov Report

```
@@           Coverage Diff           @@
##           master   #31135   +/-  ##
=======================================
  Coverage   81.21%   81.21%
=======================================
  Files         372      372
  Lines       62957    62957
=======================================
  Hits        51131    51131
  Misses      11826    11826
```

Continue to review the full report at Codecov.
Awesome, thanks!
Jonathan, it would be great if you could squash some of these commits (the goal would be to have self-contained changes that pass tests).
Sure thing, Valentin. I'll see what I can do tomorrow. Thanks for helping me navigate this!
- Factor a loop out of LateLowerGCFrame::runOnFunction and into GCLoweringRefs
- Move TBAA MDNode init from LateLowerGCFrame to GCLoweringRefs
- Rename 'llvm-late-gc-helpers.{cpp,h}' to 'llvm-pass-helpers.{cpp,h}'
- Fix a typo in the final GC lowering pass
- Split up 'julia.push_new_gc_frame' into two intrinsics
- Steal name from existing GC frame when lowering 'julia.new_gc_frame'
- Move GC frame pop lowering to final GC lowering pass
- Change how GC helper intrinsics are accessed
- Remove fields from GCLoweringRefs
- Capture module pointer in GCLoweringRefs
- Make the 'julia.new_gc_frame' intrinsic noalias
- Move GC frame access lowering to final GC lowering pass
- Update some outdated comments
- Add some sanity checks to final GC lowering pass
- Include 'llvm-version.h' in final GC lowering pass
- Define tests that test GC frame lowering passes in isolation
- Use 'i32' instead of 'size_t' for 'julia.{new,push}_gc_frame' size args
- Try to fix GC lowering tests: these tests work fine on most machines, but they break on x86 Travis CI builds. The breakage is almost certainly due to a mismatch between integer types in the tests and 'size_t', exposed by recent changes to the GC lowering pass (see the sketch below).
- Define 'LLVMExtraAddFinalLowerGCPass'
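As a side note on the "i32 instead of size_t" commit above, here is a small illustration of the width mismatch it fixes, assuming a pass that previously derived the size type from the target's pointer width (the helper name is made up for this example):

```cpp
#include "llvm/IR/DataLayout.h"
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Type.h"

using namespace llvm;

// A size_t-derived integer type is i64 on 64-bit targets but i32 on
// 32-bit x86, so IR tests written against one width fail on the other.
// Pinning the intrinsics' size arguments to a literal i32 makes the
// tests target-independent.
Type *gcFrameSizeType(const DataLayout &DL, LLVMContext &C, bool PortableI32) {
    return PortableI32
        ? Type::getInt32Ty(C)    // fixed width: identical IR everywhere
        : DL.getIntPtrType(C);   // size_t width: varies with the target
}
```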
I rebased my branch to move bug-fix commits adjacent to the buggy commits (the bugs were usually related to ABIs different from my own machine's), then squashed those commits. I also squashed related commits to reduce the amount of noise.

My chronomancy seems to have confused GitHub, though: the commits in this PR are now laid out in a more or less random order. Sorry about that; I didn't see it coming. At least […]. Here's […]:
- Delay GC allocation lowering into the final GC lowering pass
- Fix a broken LLVM pass test
- Define GC alloc lowering pass tests
- Cast 'julia.gc_alloc_obj' argument
- Add missing annotation to test
- Relax final GC lowering pass test
- Delay 'jl_gc_queue_root' call generation until final GC lowering pass
- Fix two broken assertions
- Use helpers to grab well-known functions
- Use getOrDeclare to get 'julia.gc_alloc_obj' intrinsic (sketched below)
- Insert a newline
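Since a couple of those commits mention `GCLoweringRefs` and `getOrDeclare`, here is a rough sketch of the idiom: look a well-known function up in the module and declare it on first use. The descriptor type and field names are invented for illustration, not copied from the PR:

```cpp
#include "llvm/ADT/StringRef.h"
#include "llvm/IR/DerivedTypes.h"
#include "llvm/IR/Function.h"
#include "llvm/IR/Module.h"

using namespace llvm;

// Describes a well-known function or intrinsic: its name plus a recipe
// for building its type on demand.
struct IntrinsicDescription {
    StringRef Name;
    FunctionType *(*GetType)(LLVMContext &);
};

// Look the function up in the module, declaring it on first use. Passes
// can then share one table of descriptions instead of each re-scanning
// the module by hand.
Function *getOrDeclare(Module &M, const IntrinsicDescription &Desc) {
    if (Function *F = M.getFunction(Desc.Name))
        return F; // already declared or defined in this module
    return Function::Create(Desc.GetType(M.getContext()),
                            Function::ExternalLinkage, Desc.Name, &M);
}

// Example description: 'julia.new_gc_frame' taking an i32 slot count.
// The i8* return type is a placeholder; the real intrinsic returns a
// pointer into the GC frame.
static FunctionType *newGCFrameType(LLVMContext &C) {
    return FunctionType::get(Type::getInt8PtrTy(C),
                             {Type::getInt32Ty(C)}, /*isVarArg=*/false);
}
```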
@Keno good to merge?
@Keno Any chance we can get this merged in the near future?
I reviewed this a while back and thought it was broadly fine, so I went ahead and merged it. There were a couple of minor things, but we can do those as part of a larger look at how this all works, which I don't think we need to do right this second.
Awesome! Thanks!
Might have been good to fix the commit message for that merge, so that it didn't still say work-in-progress, but it seems good to have this now! (With the proviso that we may break it whenever we find that we need to, though that is unlikely to be frequent.)
Hmmm. That's interesting. Your first error ("[…]") […]. However, your second error ("[…]") […].
Hello! This is my first PR here. I'm an MSc student working on making CUDAnative interact more organically with the Julia compiler. @maleadt is my supervisor. (Hi, Tim)
Some background: I'm proposing the changes in this PR in order to make GPU codegen more efficient for constructs that require GC support. The `LateLowerGCFrame` pass performs desirable platform-agnostic optimizations and we'd like to enable it in the CUDAnative pass pipeline. Sadly, we actually can't do that at the moment because `LateLowerGCFrame` also lowers GC intrinsics to LLVM IR that can only be expected to work on a CPU. See [1] for an issue where this is discussed in a little more depth.

This PR retains only the platform-agnostic aspects of `LateLowerGCFrame` and moves the CPU-specific parts into a new pass called `FinalLowerGC`. This doesn't change any of the resulting codegen for CPU platforms, but it should make `LateLowerGCFrame` eligible for reuse by CUDAnative.

I also added a helper file (`llvm-pass-helpers.{cpp,h}`) that contains functionality used by both the `LateLowerGCFrame` and `FinalLowerGC` passes; creating some helper functions and types seemed like a much more appropriate solution than duplicating code shared by `LateLowerGCFrame` and `FinalLowerGC`. The helper functions/types mostly deal with scanning an `llvm::Module&` for Julia types, intrinsics, and well-known functions. I did my best to design the helpers in such a way that they'll also be useful to other passes. For example, the initialization logic in `AllocOpt` could be simplified by using the helpers.

Anyway, I'd love to get some feedback on this PR and I'm happy to make changes accordingly. Thanks!