[SCFToCalyx] Draft SCFToCalyx pass #1630
Conversation
Thanks for sharing the draft @mortbopet. I have taken a quick look, and have some high-level discussion points below. I'll check out this branch and experiment with it myself.

Lowering standard operation primitives

I've been thinking about this problem as well. I think the approach you have, wrapping each Standard operation in a component that uses Comb operations, makes sense. The alternative, defining primitive operations in the Calyx dialect, could also be a useful intermediate step, but wouldn't these ultimately lower into Comb operations later on (presumably in the CalyxToHW pass)? At some point, it would be great to lower to Comb, so that we can take advantage of all the optimizations in that dialect. Given that, I think what you have done so far to lower Standard operations to components using Comb operations is a good start. Taking a step back, is the wrapping with components necessary? I was thinking a …

Let me propose the approach I've been thinking about, which could be used by this pass as well as others. This is about lowering Standard operations in general during HLS flows. I'm imagining something similar to how the Linalg structured op interface works, specifically w.r.t. property 5. We could have a set of patterns that lower Standard/Math operations into their corresponding Comb operations (e.g. …).

Representing loop kernels

You touched on having some sort of … One option I considered, along the lines of the staticlogic.pipeline operation:

```mlir
%c1 = constant 1 : index
%c0 = constant 0 : index
%c64 = constant 64 : index
scf.for %arg3 = %c0 to %c64 step %c1 {
  scf.for %arg4 = %c0 to %c64 step %c1 {
    scf.for %arg5 = %c0 to %c64 step %c1 {
      "staticlogic.pipeline"(%arg3, %arg4, %arg5) ( {
      ^bb0(%arg6: index, %arg7: index, %arg8: index):  // no predecessors
        %0 = memref.load %arg0[%arg6, %arg8] : memref<?x64xf64>
        %1 = memref.load %arg1[%arg8, %arg7] : memref<?x64xf64>
        br ^bb1
      ^bb1:  // pred: ^bb0
        %2 = mulf %0, %1 : f64
        %3 = memref.load %arg2[%arg6, %arg7] : memref<?x64xf64>
        br ^bb2
      ^bb2:  // pred: ^bb1
        %4 = addf %3, %2 : f64
        br ^bb3
      ^bb3:  // pred: ^bb2
        memref.store %4, %arg2[%arg6, %arg7] : memref<?x64xf64>
        "staticlogic.return"() : () -> ()
      }) : (index, index, index) -> ()
    }
  }
}
```

In my mind, the loop nest represents a set of address counters and an FSM controlling their increments, which are fed into the pipeline. The other option uses scf.execute_region:

```mlir
%c1 = constant 1 : index
%c0 = constant 0 : index
%c64 = constant 64 : index
scf.for %arg3 = %c0 to %c64 step %c1 {
  scf.for %arg4 = %c0 to %c64 step %c1 {
    scf.for %arg5 = %c0 to %c64 step %c1 {
      scf.execute_region -> {
      ^bb0:  // no predecessors
        %0 = memref.load %arg0[%arg3, %arg5] : memref<?x64xf64>
        %1 = memref.load %arg1[%arg5, %arg4] : memref<?x64xf64>
        br ^bb1
      ^bb1:  // pred: ^bb0
        %2 = mulf %0, %1 : f64
        %3 = memref.load %arg2[%arg3, %arg4] : memref<?x64xf64>
        br ^bb2
      ^bb2:  // pred: ^bb1
        %4 = addf %3, %2 : f64
        br ^bb3
      ^bb3:  // pred: ^bb2
        memref.store %4, %arg2[%arg3, %arg4] : memref<?x64xf64>
        "staticlogic.return"() : () -> ()
      }
    }
  }
}
```

Is there a benefit to scf.execute_region here? Your design point mentions that basic block arguments should be used to infer registers. The above examples don't explicitly use block arguments, but they could. Another approach might be to analyze the live-ins for the blocks and infer registers appropriately. Perhaps there are other approaches. Curious what will make it easiest for this pass. At the moment, I'm leaning towards "block arguments represent registers", which could let frontends move Values that should be registered into block arguments, while any Values passed into a block through SSA dominance could be combinational in the hardware. Finally, these examples represent the loops using scf.for …

Scheduling infrastructure

I've been playing around with the Scheduling infrastructure in CIRCT. I imagine we would want to apply that earlier in the pipeline, before this pass, to determine the schedule at a high level. This seems to align with your design around basic blocks: we could use the Scheduling infrastructure to determine pipeline stages, and create a block for each stage. That's actually how the above examples were generated. Does this align with your thinking?

Pass infrastructure

I've been thinking about how to structure the phased application of patterns in passes like these. Have you had a look at the Linalg CodegenStrategy, which is built on applyStagedPatterns? It seems like we could build our own HLS "CodegenStrategy", using something like applyStagedPatterns:

- The first stage has a list of RewritePatternSets, and applies each set of patterns in the list in order. This could manage the places where you have sequential calls to runPartialPatternLowering.
- The second stage applies a single RewritePatternSet, which is responsible for cleanups/canonicalizations after the first stage. This could handle the CleanupFuncOps and runControlFlowSimplifications step. It could also include the Comb canonicalization patterns, if that makes sense.
- The final stage applies a callback function to do any last-minute processing that might not have fit into the earlier stages. I'm not sure there is a need for this.

Anyway, curious if you think it is worth exploring. I think the CodegenStrategy has been a success, so perhaps following a similar structure in this pass would make sense.

Memories

I have a local patch to add a MemoryOp primitive to Calyx, similar to the RegisterOp. I can share a PR for that.

Cell interface

It's mentioned as a TODO, and something I was planning to work on after the MemoryOp is added. I can implement a CellOpInterface, and make all the CalyxPrimitive operations (RegisterOp/MemoryOp) implement it. I think that would be useful for this as well.
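To make the staged idea concrete, here is a rough sketch of what such a driver could look like, using only generic MLIR pattern-application APIs. The helper name runStagedLowering and its signature are invented for illustration; this is not the Linalg CodegenStrategy API itself.

```cpp
#include "llvm/ADT/ArrayRef.h"
#include "llvm/ADT/STLExtras.h"
#include "mlir/IR/Operation.h"
#include "mlir/Rewrite/FrozenRewritePatternSet.h"
#include "mlir/Support/LogicalResult.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

using namespace mlir;

// Hypothetical staged driver: a list of lowering pattern sets applied in
// order, followed by one cleanup/canonicalization set, followed by an
// optional callback for any last-minute processing.
static LogicalResult
runStagedLowering(Operation *op,
                  llvm::ArrayRef<FrozenRewritePatternSet> loweringStages,
                  const FrozenRewritePatternSet &cleanupPatterns,
                  llvm::function_ref<LogicalResult(Operation *)> finalize) {
  // Stage 1: apply each lowering pattern set, in order.
  for (const FrozenRewritePatternSet &stage : loweringStages)
    if (failed(applyPatternsAndFoldGreedily(op, stage)))
      return failure();

  // Stage 2: cleanups/canonicalizations after lowering.
  if (failed(applyPatternsAndFoldGreedily(op, cleanupPatterns)))
    return failure();

  // Stage 3: optional final callback.
  return finalize ? finalize(op) : success();
}
```

Stage one would correspond to the sequential partial-lowering calls, stage two to the cleanup/canonicalization step, and stage three to the optional callback described above.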
@mikeurbach Thank you for the elaborate reply, it's great to get a discussion started on this.
The main motivator for doing it this way (inside a component) is to be able to share component instances. This is not so much a concern for this pass in its current state (since a new instance of an operator is instantiated for each operator in the circuit). However, if a significant Calyx infrastructure is built within CIRCT, there would be a need for the ability to share functional units during binding, and to have this expressible in the IR. On lowering …

Next, as you mention, there are arithmetic operations which cannot be lowered, but might map to hard logic units on a target architecture. This could probably tie in with the discussion on target triples, where information on the hard IP available on the target is tabulated. This, however, I think should be distinct from a StdToComb pass, to separate concerns between lowering "simple" arithmetic to "simple" combinational logic and the more complex process of binding operators to a limited set of resources on a target device.

I must admit that I am not a big fan of …

Yes, I see scheduling into pipeline stages as coming before SCFToCalyx. In essence, I think the SCFToCalyx pass should be kept as "dumb" as possible and not try to include too many circuit transformations. Such transformations should be kept to a (to-be) HLS area within CIRCT, containing e.g. transformations of …

These are some great pointers; the implementation in this PR was mainly guided by what I found when working on fixing InstanceOp lowering in StandardToHandshake, but if partial lowering is already a solved problem, then we should use that method. I'll look into it and come back with a more informed comment.

Cool! Does it follow the memories available in the native Calyx compiler?

Great! There are a lot of places where this interface is needed, and it would clean things up a lot.
/// ... -> }
/// }
/// }
struct NestedSeqPattern : mlir::OpRewritePattern<calyx::SeqOp> {
This doesn't seem like it strictly needs to be part of the conversion. Did you consider making this a separate pass?
Just passing through and saw this; the native compiler indeed has a pass for collapsing control.
    signalPassFailure();
    return;
  }
  if (failed(runPartialPatternLowering<InlineExecuteRegionOpPattern>())) {
Seems like this could be made more compact by using logical operations on the results from runPartialPatternLowering:
if(successful(status)) status |= runPartialPatternLowering(...)
Or pass the status by reference into runPartialPatternLowering?
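For what it's worth, LogicalResult has no operator|=, so a literal status |= … won't compile as written; a small short-circuiting helper gets the same effect. A minimal sketch, assuming the stages are wrapped as callables returning LogicalResult (the helper name runStages is made up, and the usage comment stands in for the PR's runPartialPatternLowering<...>() calls):

```cpp
#include <functional>

#include "llvm/ADT/ArrayRef.h"
#include "mlir/Support/LogicalResult.h"

using namespace mlir;

// Run each stage in order, stopping at the first failure so later stages
// never observe IR in an unexpected, partially-lowered state.
static LogicalResult
runStages(llvm::ArrayRef<std::function<LogicalResult()>> stages) {
  for (const auto &stage : stages)
    if (failed(stage()))
      return failure();
  return success();
}

// Illustrative usage inside the pass:
//   if (failed(runStages({
//           [&] { return runPartialPatternLowering<InlineExecuteRegionOpPattern>(); },
//           /* ...more stages... */
//       })))
//     return signalPassFailure();
```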
  PartiallyLowerFuncToComp(mlir::FuncOp funcOp,
                           PatternRewriter &rewriter) const override {
    funcOp.walk([&](mlir::Operation *op) {
      if (op->getDialect()->getNamespace() == "comb") {
Seems like you'd want an interface or a trait here?
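A sketch of what a dialect-based check could look like, assuming the usual CIRCT header layout; an op trait or interface, as the comment suggests, would be the more general solution:

```cpp
#include "circt/Dialect/Comb/CombDialect.h"
#include "mlir/IR/Operation.h"
#include "llvm/Support/Casting.h"

// Identify comb operations via the dialect class instead of comparing the
// raw namespace string.
static bool isCombOp(mlir::Operation *op) {
  return llvm::isa<circt::comb::CombDialect>(op->getDialect());
}
```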
FWIW, the intention of staticlogic.pipeline is that control flow within the loop would already have been reduced to a pure pipeline. This is what allows the use of blocks to represent something that is semantically distinct from control flow.
I see, as in, the design is expected to be fully lowered to …
Thanks for the point-by-point reply @mortbopet. It sounds like we are more or less on the same page. To me, the most interesting part to discuss is the …
+1 for this!
I'm not sure I'm following. You're saying we can't represent instance sharing of ops if we have the above representation? I'm not quite sure I understand why this is the case. To see if a group is using some primitive, we just need to see if any port of the given primitive is found within the group, right?
This is cool.
I should have been clearer; I was arguing for why I opted for using component-wrapped comb operators (see circt/test/Dialect/Calyx/round-trip.mlir, lines 42 to 44 at eb6b178).

The representation used in the PR (wrapping comb ops in a component) was used as a placeholder, to be able to have the in/out port SSA values of the comb operator dominate anything in the calyx.wires/calyx.control ops (to allow for sharing).

The …
I'm wondering if that's too early. Do you (@mortbopet, @mikeurbach) envision transformations of the pipeline once it's formed? My understanding of Calyx (mostly from their ASPLOS paper, no practical experience yet) is that one of its USPs is being a latency-insensitive target IR that takes the burden of emitting the FSMs off the frontend. Statically scheduling components is then an optimisation to simplify the FSM and omit the handshaking signals. @rachitnigam, you said earlier that you are thinking about adding a data flow operator to the language -- is that still on the table? If yes, then maybe …
@mortbopet Thanks for the awesome writeup! I'm super excited to see this work! @mikeurbach @stephenneuendorffer thanks for the comments and context around the other passes in CIRCT. One top-level suggestion: could we set up a meeting with people interested in Calyx + CIRCT so that the Calyx team can update all of you on our current trajectory? @cgyurgyik mentioned that the 25th August CIRCT meeting is going to have some Calyx discussion already. Unfortunately, I'm currently in an inconvenient time zone (India) and cannot make the meeting. If all of you are interested, I'd be happy to coordinate a meeting. Next, a summary of some new things in Calyx that may address some questions in this discussion: …

Since I'll be asleep when the CIRCT meeting happens, I'm preemptively putting up a when2meet link with my availability. I'll also add the link to the CIRCT meeting notes for the 25th. If we end up not needing the meeting, feel free to ignore.
Following the discussion today, I'll post on Discourse about how we might evolve the …

According to when2meet, …
@rachitnigam If/when you have time, would you like to sanity check/play around with this pass? There will have to be some changes when invoke/comb groups are supported in the Calyx dialect, but I think we're closing in on a good initial version of the pass. The pass is currently tested with the tests in …
Happy to try it out in the next few days! However, things are likely to break again once calyxir/calyx#635 lands and makes …
I didn't review the pass itself in detail, but the integration into CIRCT looks good. Thank you for putting this in its own library!
Ran into #1769 & #1770 when trying to push the code generated by this PR into the native Calyx compiler. @mortbopet Looks like we've finished most of the tasks in #1690, and the next reasonable step is getting the code generated by this PR working with the native Calyx compiler. I think it'd be helpful if you could start running the native Calyx compiler on the generated code and telling us where things fail. Happy to help you set up the compiler.
Will do! I already have the native compiler set up locally, but I'll ping you if anything comes up.
// CHECK-NEXT: }
// CHECK-NEXT: calyx.control {
// CHECK-NEXT: calyx.seq {
// CHECK-NEXT: calyx.enable @assign_while_0_init {compiledGroups = []}
What is the compiledGroups attribute used for here?
This would indicate that the Calyx printer for calyx.enable by default prints the compiledGroups attribute of calyx.enable:

circt/include/circt/Dialect/Calyx/CalyxControl.td, lines 127 to 133 at 072abc3:

```tablegen
  let arguments = (ins
    FlatSymbolRefAttr:$groupName,
    OptionalAttr<ArrayAttr>:$compiledGroups
  );
  let assemblyFormat = "$groupName attr-dict";
  let verifier = "return ::verify$cppClass(*this);";
}
```
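If the attribute shouldn't appear in the round-tripped output, one option would be a custom printer that elides it. This is a sketch only: the accessor name groupName() is assumed, and whether to elide the attribute at all is a design question for the dialect.

```cpp
#include "circt/Dialect/Calyx/CalyxOps.h"
#include "mlir/IR/OpImplementation.h"

// Sketch: print the group symbol explicitly and elide both groupName (already
// printed) and compiledGroups from the trailing attribute dictionary.
static void printEnableOp(mlir::OpAsmPrinter &printer,
                          circt::calyx::EnableOp op) {
  printer << " ";
  printer.printSymbolName(op.groupName());
  printer.printOptionalAttrDict(
      op->getAttrs(), /*elidedAttrs=*/{"groupName", "compiledGroups"});
}
```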
When lowering the following SCF program, the generated Calyx program has unused groups:

```mlir
func @main(%a0 : i32, %a1 : i32) -> i32 {
  %b = cmpi uge, %a0, %a1 : i32
  cond_br %b, ^bb1(%a0: i32), ^bb2(%a1: i32)
^bb1(%aa: i32):
  %b11 = subi %aa, %a1 : i32
  %b12 = subi %aa, %a0 : i32
  br ^bb3(%b11, %b12: i32, i32)
^bb2(%ab: i32):
  %b21 = addi %ab, %a1 : i32
  %b22 = addi %ab, %a0 : i32
  br ^bb3(%b21, %b22: i32, i32)
^bb3(%r: i32, %r2: i32):
  %r3 = addi %r, %r2 : i32
  return %r3 : i32
}
```

Such groups can be eliminated using Calyx's …
This should be due to #1802. The output of SCFToCalyx has all groups referenced in the schedule (see checks in …).
Let me try to summarize what this pass accepts as input: …

I think we agree this pass continues to grow and should be landed so we can do more incremental work. But I want to make sure I'm on the same page w.r.t. the goal. As the new pipeline representation emerges, and the scheduling infrastructure is applied, my hope is that this pass can be greatly reduced in scope. For one, I'm still doubting the role of a CDFG at this level. It seems like the frontends we are coming from are able to generate Affine/SCF control flow, which we can schedule and represent as a pipeline without ever getting down to the level of basic blocks. Is there a client that needs CDFG support? If there is, I'd still expect such programs to be scheduled and converted to a staticlogic pipeline before this pass. If we don't want to statically schedule a CDFG, then we already have the Handshake path. Down the road, I'm imagining this pass need only accept the following: …

Does this target line up? I want to make sure that, if I do all the work laid out here, we end up with a simpler lowering to Calyx.
@mikeurbach With regards to the inputs: correct, those are what is currently accepted by the pass. In terms of how this pass fits into the big picture long term, I hope to start looking, sometime within the next couple of months, at interactions between the Handshake and Calyx paths (mixing dynamically and statically scheduled HLS). I'm unsure about whether we want to completely eliminate the option of passing CDFGs to SCFToCalyx, but I do understand the need for separation of concerns. I'm thinking that there might be programs with setup code before, e.g., a pipelined loop, which looks and acts like an FSM, and the sensible thing would be to just pass it directly to Calyx (through SCFToCalyx) instead of trying to make a pipeline out of something which shouldn't be.
I agree, this is a valid concern. Another use case: loops with unbalanced control paths (e.g. simple logic on one side of a branch, a multi-cycle divider on the other) usually also benefit from an FSM-of-basic-blocks execution model, because a pipeline's latency is equal to the latency of the longest control path. The datapaths corresponding to each basic block could be statically scheduled before going into this pass, similar to the new …
I think there is also a simple, understated benefit to keeping a non-pipelined flow working, even in the presence of a pipelined flow: there is always a source of truth w.r.t. program execution. Deciding when and what to pipeline feels like optimization, and there should be another way to quickly get unoptimized but functional designs working. I add "quickly" because traditional SDC + SAT scheduling can get pretty slow with complex programs.
Thanks for the ideas w.r.t. CDFGs and this pass. I can see how they have a place, but I still wonder if that belongs here in the long term, as they aren't really "SCF". Perhaps the pass just needs to be split up into ArithmeticToCalyx, CDFGToCalyx, SCFToCalyx, etc. I'm also wondering about the interactions with Handshake, because when I think about a CDFG that doesn't make sense as a pipeline, it seems Handshake can fill the need. So far we've been talking about Calyx because it can represent a static schedule nicely, but taking advantage of its ability to do latency insensitive things would be interesting too. Perhaps we need to add HandshakeToCalyx to the mix.
This is already supported in the Handshake workflow, which goes all the way to Verilog that simulates. As far as I'm concerned, we are focused on the optimized part now. What we're doing here had better be an improvement on the dynamically scheduled dataflow (for the programs we can analyze and statically schedule). Anyway, I'm rambling. My real concern is that this pass has become quite monolithic. I think this is a reasonable place to start, and we can break it up as we go. This is what happened with the StandardToLLVM pass upstream, for example.
Ah, got it. I guess in that case an interesting but tangential question would be whether it makes sense to lower Handshake to Calyx, since there are still some control/structural optimizations Calyx may be able to do that Handshake can't.

Agreed on both points. Not having the pass be monolithic in the long term would be great, but hats off to @mortbopet for getting it working!
I see a hierarchy of solutions here: generally, statically scheduled things (whether pipelined or not) can get lowered to an FSMD/Calyx style of representation, while the dynamically scheduled options can get lowered to Handshake. An important question is how we combine statically and dynamically scheduled regions, i.e., a Calyx module inside a Handshake region.
I haven't looked through all of the code yet, and this is mostly nits and style/code changes.
This is really cool!! I think it is some great work for the first frontend to the Calyx dialect. However, the code size is making it difficult to review :-)
There has to be some way this PR can be split off, e.g.
- boiler plate
- lowering a stupid simple function with no basic blocks.
- lowering a function that requires registers.
- MemoryOp uses (and optionally splitting off between single / multiple read).
...// index casting, chaining combinationals, multiple returns, ...?
/// inside other utility functions.
template <
    typename F,
    std::enable_if_t<!std::is_void<typename std::result_of<F()>::type>::value,
Nit: Not entirely necessary, but it may be useful to describe why this std::enable_if_t is being used, e.g. // Expecting F to be a non-void function. This isn't easy to read for C++ beginners (can't wait for C++20 concepts support).
I also question whether this is better than the more canonical approach of using IRRewriter::InsertionGuard.
+1 on using IRRewriter::InsertionGuard; it seems like the intention of that is exactly what I intended with persistInsertionPoint.
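For reference, a minimal sketch of the InsertionGuard idiom; the function and block names are invented, and this is just the RAII pattern rather than code from this PR:

```cpp
#include "mlir/IR/PatternMatch.h"

// The guard saves the rewriter's insertion point on construction and restores
// it on destruction, so temporary repositioning cannot leak out of the scope.
static void buildInsideBlock(mlir::PatternRewriter &rewriter,
                             mlir::Block *body) {
  mlir::OpBuilder::InsertionGuard guard(rewriter);
  rewriter.setInsertionPointToStart(body);
  // ... create operations at the start of `body` here; the original insertion
  // point is restored automatically when `guard` goes out of scope.
}
```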
auto assignRegInOp =
    rewriter.create<calyx::AssignOp>(loc, reg.in(), v, Value());
We should use a builder that doesn't take in a value for the guard, if that's what you're trying to do here. Feel free to open an issue for this.
This is already an issue: #1611
Thank you for the reviews @cgyurgyik; what I'll do is close this PR and come up with a PR series in line with what you suggested.
Closing this in favor of a PR series starting from #1812 |
SCF to Calyx conversion
This pass is intended to be part of a high-level synthesis flow for lowering SCF-based code into a statically scheduled FSMD model, through the use of Calyx.

Preface: this is early work and as such is heavily subject to change; I'm mainly hoping for comments on the big picture and the general lowering flow. Various things, such as the lowering of comb and return operations, are based on assumptions and are not necessarily the current gospel of the Calyx dialect. The hope for this PR is also to spur discussion regarding what is missing in the current Calyx dialect to be able to write a complete front-end.

Design

- … ComponentLoweringState/ProgramLoweringState objects, which act as key-value stores (see: 1); a rough sketch of this idea follows below.
- … scf + std control flow operations and arithmetic operations.
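As a rough illustration of the "key-value store" lowering state mentioned in the list above; the class and member names are invented for the sketch and are not the PR's actual classes:

```cpp
#include "mlir/IR/Operation.h"
#include "mlir/IR/Value.h"
#include "llvm/ADT/DenseMap.h"

// Per-component lowering state: maps SSA values encountered during lowering
// to the hardware entities (e.g. registers) created for them, so later
// partial lowerings can look them up instead of recomputing them.
class ComponentLoweringStateSketch {
public:
  void registerValueMapping(mlir::Value value, mlir::Operation *hwEntity) {
    valueMap[value] = hwEntity;
  }
  mlir::Operation *lookup(mlir::Value value) const {
    return valueMap.lookup(value);
  }

private:
  llvm::DenseMap<mlir::Value, mlir::Operation *> valueMap;
};
```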