
[SCFToCalyx] Draft SCFToCalyx pass #1630

Closed

Conversation

mortbopet
Contributor

@mortbopet commented Aug 24, 2021

SCF to Calyx conversion

This pass is intended to be part of a high-level synthesis flow for lowering SCF-based code into a statically scheduled FSMD model, through the use of Calyx.

Preface: this is early work and as such is heavily subject to change; I'm mainly hoping for comments on the big picture and the general lowering flow. Various things, such as the lowering of comb and return operations, are based on assumptions and are not necessarily the current gospel of the Calyx dialect. The hope for this PR is also to spur discussion regarding what is missing in the current Calyx dialect to be able to write a complete front-end.

Design

  • The pass follows a partial lowering style, somewhat similar to what is done in StandardToHandshake. Throughout lowering, state is maintained in the ComponentLoweringState/ProgramLoweringState objects, which act as key-value stores. (see: 1)
  • Basic block arguments are passed through registers (see: 1, 2).
  • Registers are created for basic block arguments, while-loop iter args, and function return values.
  • The contents of basic blocks are executed sequentially in the control schedule.
  • The input program should be a combination of scf + std control flow operations and arithmetic operations.
  • Calyx uses if- and while-constructs for specifying control flow in the schedule. To ensure that only while loops reach the pass, I've proposed a pass for converting SCF for loops to while loops (see the sketch after this list).
  • Control flow backedges through branches are not supported (given the above). Instead, backedges are expected to be raised to while loops before running the conversion pass.
  • Emitting the control schedule is done in two steps:
    1. All possible control paths are traversed and written into the schedule, which simplifies emitting it. (see 1)
    2. A set of simplification patterns is then run to reduce the size of the schedule. (see 1)
  • Index types used in conjunction with memory accesses are truncated to integers of the same width as the address port they target. Index types used in arithmetic, which are not trivially linked to some memref, are assumed to be 32 bits. Proper bitwidth inference should be handled in a future pass.
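
For illustration, here is a minimal sketch of the for-to-while raising mentioned above (std-dialect ops; the body op is a placeholder):

    %c0 = constant 0 : index
    %c1 = constant 1 : index
    %c64 = constant 64 : index
    // Input: a counted loop.
    scf.for %i = %c0 to %c64 step %c1 {
      "test.body"(%i) : (index) -> ()
    }
    // Raised: the induction variable becomes an iter arg of scf.while, and
    // the backedge is implicit in the loop op rather than a branch.
    %last = scf.while (%i = %c0) : (index) -> index {
      %cond = cmpi ult, %i, %c64 : index
      scf.condition(%cond) %i : index
    } do {
    ^bb0(%iv: index):
      "test.body"(%iv) : (index) -> ()
      %next = addi %iv, %c1 : index
      scf.yield %next : index
    }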

@mortbopet changed the title from [SCFToCalyx] Initial commit for SCFToCalyx pass to [SCFToCalyx] Draft SCFToCalyx pass on Aug 24, 2021
@mortbopet added the Calyx and enhancement labels on Aug 24, 2021
@mikeurbach
Contributor

Thanks for sharing the draft @mortbopet. I have taken a quick look, and have some high-level discussion points below. I'll check out this branch and experiment with it myself.

Lowering standard operation primitives

I've been thinking about this problem as well. I think your approach of wrapping each Standard operation in a component that uses Comb operations makes sense.

The alternative of defining primitive operations in the Calyx dialect could also be a useful intermediate step, but wouldn't these ultimately lower into Comb operations later on (presumably in the CalyxToHW pass)? At some point, it would be great to lower to Comb, so that we can take advantage of all the optimizations in that dialect. Given that, I think what you have done so far to lower Standard operations to components using Comb operations is a good start.

Taking a step back, is the wrapping with components necessary? I was thinking a std.addi could lower directly to a comb.add placed inside the calyx.wires section. Could that work?
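
For concreteness, a rough sketch of what I mean, assuming a register %r and its ports already exist (the constant %true is hypothetical):

    calyx.wires {
      // The adder is pure combinational logic, so it could live here directly...
      %sum = comb.add %lhs, %rhs : i32
      calyx.group @write_sum {
        // ...and groups would simply read its result.
        calyx.assign %r.in = %sum : i32
        calyx.assign %r.write_en = %true : i1
        calyx.group_done %r.done : i1
      }
    }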

Let me propose the approach I've been thinking about, which could be used by this pass as well as others. This is about lowering Standard operations in general during HLS flows. I'm imagining something similar to how the Linalg structured op interface works. Specifically w.r.t. property 5. We could have a set of patterns that lower Standard/Math operations into their corresponding Comb operations (e.g. std.addi -> comb.add). And we could have a mechanism to register patterns that lower known operations to "library calls", which are implemented as external modules in hardware (e.g. std.addf -> Xilinx floating point IP instantiation). If a library call pattern matches, we could lower to the "library call", or else fall back to lowering to generic Comb logic when possible. These patterns could be used by this pass, but they could also be used in the Handshake lowering path (which currently lowers Standard operations to FIRRTL operations, but I'm hoping to slowly evolve off FIRRTL). We could also have a standalone StandardToComb pass using these patterns. Would it make sense to pull out the Standard to Comb/library lowering, and plug that into this as a library?
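
As a sketch of the two kinds of patterns (the library module name is made up):

    // Generic pattern: a std op with a direct comb counterpart.
    %0 = addi %a, %b : i32
    // ...becomes:
    %0 = comb.add %a, %b : i32

    // Library-call pattern: no comb equivalent, so instantiate an external
    // module instead (something along these lines):
    %1 = addf %x, %y : f64
    // ...becomes:
    %1 = hw.instance "fadd0" @XilinxFAdd(%x, %y) : (f64, f64) -> f64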

Representing loop kernels

You touched on having some sort of pipeline_for/while operation to represent this. I've been poking around with what that might look like. I was going to write a Discourse post about this today, but I can just share here.

One option I considered along the lines of pipeline_for would be to have a nest of scf.for loops wrapping a staticlogic.pipeline. Something like this:

    %c1 = constant 1 : index
    %c0 = constant 0 : index
    %c64 = constant 64 : index
    scf.for %arg3 = %c0 to %c64 step %c1 {
      scf.for %arg4 = %c0 to %c64 step %c1 {
        scf.for %arg5 = %c0 to %c64 step %c1 {
          "staticlogic.pipeline"(%arg3, %arg4, %arg5) ( {
          ^bb0(%arg6: index, %arg7: index, %arg8: index):  // no predecessors
            %0 = memref.load %arg0[%arg6, %arg8] : memref<?x64xf64>
            %1 = memref.load %arg1[%arg8, %arg7] : memref<?x64xf64>
            br ^bb1
          ^bb1:  // pred: ^bb0
            %2 = mulf %0, %1 : f64
            %3 = memref.load %arg2[%arg6, %arg7] : memref<?x64xf64>
            br ^bb2
          ^bb2:  // pred: ^bb1
            %4 = addf %3, %2 : f64
            br ^bb3
          ^bb3:  // pred: ^bb2
            memref.store %4, %arg2[%arg6, %arg7] : memref<?x64xf64>
            "staticlogic.return"() : () -> ()
          }) : (index, index, index) -> ()
        }
      }
    }

In my mind, the loop nest represents a set of address counters and an FSM controlling their increments, which are fed into the pipeline.

The staticlogic.pipeline is designed to be IsolatedFromAbove, and have the inputs to the pipeline passed as arguments to the first block. Another approach is to use scf.execute_region, which sort of takes the opposite approach: it is not IsolatedFromAbove, and instead relies on SSA dominance to make the inputs available. This composes well with both the SCF dialect and the Affine dialect. This approach was hinted at in this comment: https://llvm.discourse.group/t/better-modelling-of-pipeline-loops/3917/6. That might look something like this:

    %c1 = constant 1 : index
    %c0 = constant 0 : index
    %c64 = constant 64 : index
    scf.for %arg3 = %c0 to %c64 step %c1 {
      scf.for %arg4 = %c0 to %c64 step %c1 {
        scf.for %arg5 = %c0 to %c64 step %c1 {
          scf.execute_region {
          ^bb0:  // no predecessors
            %0 = memref.load %arg0[%arg3, %arg5] : memref<?x64xf64>
            %1 = memref.load %arg1[%arg5, %arg4] : memref<?x64xf64>
            br ^bb1
          ^bb1:  // pred: ^bb0
            %2 = mulf %0, %1 : f64
            %3 = memref.load %arg2[%arg3, %arg4] : memref<?x64xf64>
            br ^bb2
          ^bb2:  // pred: ^bb1
            %4 = addf %3, %2 : f64
            br ^bb3
          ^bb3:  // pred: ^bb2
            memref.store %4, %arg2[%arg3, %arg4] : memref<?x64xf64>
            "staticlogic.return"() : () -> ()
          }
        }
      }
    }

Is there a benefit to staticlogic.pipeline versus scf.execute_region? To me they are quite similar, so I'm wondering if the IsolatedFromAbove version makes any transformation/analysis easier.

Your design point mentions basic block arguments should be used to infer registers. The above examples don't explicitly use block arguments, but they could. Another approach might be to analyze the live-ins for the blocks and infer registers appropriately. Perhaps there are other approaches. Curious what will make it easiest for this pass. At the moment, I'm leaning towards "block arguments represent registers", which could let frontends move Values that should be registered into block arguments, and any Values passed into a block through SSA dominance could be combinational in the hardware.
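
To illustrate the convention I'm leaning towards, a hypothetical stage fragment (names invented):

    // %scale is captured through SSA dominance -> combinational fan-in.
    %scale = constant 3 : i32
    br ^stage1(%init : i32)
    ^stage1(%acc: i32):  // %acc arrives as a block argument -> registered
      %0 = muli %acc, %scale : i32
      br ^stage2(%0 : i32)
    ^stage2(%res: i32):  // %0 crosses a stage boundary as an argument -> registered
      memref.store %res, %mem[%idx] : memref<64xi32>
      "staticlogic.return"() : () -> ()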

Finally, these examples represent the loops using scf.for, but I think your approach to have all scf.fors turned into scf.whiles as a pre-processing step makes sense. Really, I'm curious about what the body of the innermost loop nest should look like.

Scheduling infrastructure

I've been playing around with the Scheduling infrastructure in CIRCT. I imagine we would want to apply that earlier in the pipeline, before this pass, to determine the schedule at a high level. This seems to align with your design around basic blocks: we could use the Scheduling infrastructure to determine pipeline stages, and create a block for each stage. That's actually how the above examples were generated. Does this align with your thinking?

Pass infrastructure

I've been thinking about how to structure the phased application of patterns in passes like these. Have you had a look at the Linalg CodegenStrategy, which is built on applyStagedPatterns?

It seems like we could build our own HLS "CodegenStrategy", using applyStagedPatterns. To summarize, it uses three stages:

The first stage has a list of RewritePatternSets, and applies each set of patterns in the list in order. This could manage the places you have sequential calls to runPartialPatternLowering.

The second stage applies a single RewritePatternSet, which is responsible for cleanups/canonicalizations after the first stage. This could handle the CleanupFuncOps and runControlFlowSimplifications step. It could also include the Comb canonicalization patterns, if that makes sense.

The final stage applies a callback function to do any last minute processing that might not have fit into the earlier stages. I'm not sure there is a need for this.

Anyway, curious if you think it is worth exploring. I think the CodegenStrategy has been a success, so perhaps following a similar structure in this pass would make sense.

Memories

I have a local patch to add a MemoryOp primitive to Calyx, similar to the RegisterOp. I can share a PR for that.

Cell interface

It's mentioned as a TODO, and something I was planning to work on after the MemoryOp is added. I can implement a CellOpInterface, and make all the CalyxPrimitive operations (RegisterOp/MemoryOp) implement it. I think that would be useful for this as well.

@mortbopet
Contributor Author

mortbopet commented Aug 24, 2021

@mikeurbach Thank you for the elaborate reply, it's great to get a discussion started on this.

Lowering standard operation primitives

The main motivator for doing it this way (inside a component) is to be able to share component instances. This is not so much a concern for this pass in its current state (since a new instance of an operator is instantiated for each operator in the circuit). However, if a significant Calyx infrastructure is built within CIRCT, there would be a need for the ability to share functional units during binding, and to have this expressible in the IR.
Benefiting from comb optimizations would, I imagine, happen after a CalyxToHW pass, where the (now bound) set of functional units has been inlined into some design.

On lowering std to comb, I think both of your proposals are great. A standalone pass would make sense, seeing that a partial lowering of std to comb arithmetic (excluding, for instance, float ops) is already implemented in StandardToHandshake, yet in reality has nothing to do with the handshake dialect in and of itself; there is an almost 1:1 correspondence between comb and a subset of std, so a standalone pass would in my view have merit.

Next, as you mention, there are arithmetic operations which cannot be lowered, but might map to hard logic units on a target architecture. This could probably tie in with the discussion on target triples, where information on hard IP available on the target is tabulated. This, however, I think should be distinct from a StdToComb pass, to separate concerns between lowering "simple" arithmetic to "simple" combinational logic, and the more complex process of binding operators to a limited set of resources on a target device.

Representing loop kernels

I must admit that I am not a big fan of the staticlogic.pipeline approach, mainly because it hijacks blocks to be something that they are not. Most critically, (as I understand it) we remove the option of having control flow within a pipelined loop kernel. A control-flow boundary should not imply a pipeline stage boundary; we might have plenty of combinational control flow (using muxes) within a pipeline stage. If pipeline stages are instead distinct (ordered) regions (like the scf.pipeline_for proposal), registers would be inferred from cross-stage dependencies rather than cross-basic-block dependencies. (Again, this contradicts the approach implemented in this PR, which was mainly chosen for its simplicity.)
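
As a sketch of what I mean (the syntax is entirely hypothetical):

    // Pipeline stages as distinct, ordered regions; control flow *within* a
    // stage stays combinational (muxes), and registers are inferred only for
    // values crossing stage boundaries.
    scf.pipeline_for %i = %c0 to %c64 step %c1 {
      stage {
        %0 = memref.load %arg0[%i] : memref<64xi32>
        %1 = select %cond, %0, %c0_i32 : i32  // combinational control flow
      }
      stage {
        // %1 is used across a stage boundary -> inferred register.
        memref.store %1, %arg1[%i] : memref<64xi32>
      }
    }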

Scheduling infrastructure

Yes, I see scheduling into pipeline stages as coming before SCFToCalyx. In essence, I think the SCFToCalyx pass should be kept as "dumb" as possible and not try to include too many circuit transformations. Such transformations should be kept to a (to-be) HLS area within CIRCT, containing e.g. the transformation of scf.for to scf.pipeline_for.
Lowering to scf.while is simple, and doing this lowering mindful of e.g. an scf.pipeline_for body should not complicate things too much, seeing as the while loop only adds a block before and after the loop body.

Pass infrastructure

These are some great pointers; the implementation in this PR was mainly guided by what I found when working on fixing InstanceOp lowering in StandardToHandshake, but if partial lowering is already a solved problem, then we should use that method. I'll look into it and come back with a more informed comment.

Memories

Cool! Does it follow the memories available in the native Calyx compiler?

Cell interface

Great! There are a lot of places where this interface is needed, and it would clean things up a lot.

/// ... -> }
/// }
/// }
struct NestedSeqPattern : mlir::OpRewritePattern<calyx::SeqOp> {
Contributor

This doesn't seem like it strictly needs to be part of the conversion. Did you consider making this a separate pass?

Member

Just passing through and saw this; the native compiler indeed has a pass for collapsing control.

  signalPassFailure();
  return;
}
if (failed(runPartialPatternLowering<InlineExecuteRegionOpPattern>())) {
Contributor

Seems like this could be made more compact by combining the results from runPartialPatternLowering:
if (succeeded(status)) status = runPartialPatternLowering(...)

Or pass the status by reference into runPartialPatternLowering?

PartiallyLowerFuncToComp(mlir::FuncOp funcOp,
                         PatternRewriter &rewriter) const override {
  funcOp.walk([&](mlir::Operation *op) {
    if (op->getDialect()->getNamespace() == "comb") {
Contributor

Seems like you'd want an interface or a trait here?

@stephenneuendorffer
Contributor

I must admit that I am not a big fan of the staticlogic.pipeline approach, mainly because it hijacks blocks to be something that they are not. Most critically, (as I understand it) we remove the option of having control flow within a pipelined loop kernel. A control-flow boundary should not imply a pipeline stage boundary.

FWIW, the intention of staticlogic.pipeline is that control flow within the loop would already have been reduced to a pure pipeline. This is what allows the use of blocks to represent something that is semantically distinct from control flow.

@mortbopet
Contributor Author

I must admit that I am not a big fan of the staticlogic.pipeline approach, mainly because it hijacks blocks to be something that they are not. Most critically, (as I understand it) we remove the option of having control flow within a pipelined loop kernel. A control-flow boundary should not imply a pipeline stage boundary.

FWIW, the intention of staticlogic.pipeline is that control flow within the loop would already have been reduced to a pure pipeline. This is what allows the use of blocks to represent something that is semantically distinct from control flow.

I see; as in, the design is expected to be fully lowered to comb with control flow as muxes? That makes sense; however, it makes me think about what level to lower to before targeting the Calyx flow.

@mikeurbach
Contributor

Thanks for the point-by-point reply @mortbopet. It sounds like we are more or less on the same page. To me, the most interesting part to discuss is the scf.pipeline_for level of abstraction we should be targeting from the higher levels. Looking forward to the ODM tomorrow.

@cgyurgyik
Member

cgyurgyik commented Aug 25, 2021

A better representation might be to add a new operation which represents a primitive cell instance:
%left, %right, %out = calyx.prim sub : i32, i32, i32

+1 for this!

The main motivator for doing it this way (inside a component) is to be able to share component instances. This is not so much a concern for this pass in its current state (since a new instance of an operator is instantiated for each operator in the circuit). However, if a significant Calyx infrastructure is built within CIRCT, there would be a need for the ability to share functional units during binding, and to have this expressible in the IR.

I'm not sure I'm following. You're saying we can't represent instance sharing of ops if we have the above representation? I'm not quite sure I understand why this is the case. To see if a group is using some primitive, we just need to see if any port of the given primitive is found within the group, right?

Let me propose the approach I've been thinking about, which could be used by this pass as well as others. This is about lowering Standard operations in general during HLS flows. I'm imagining something similar to how the Linalg structured op interface works. Specifically w.r.t. property 5. We could have a set of patterns that lower Standard/Math operations into their corresponding Comb operations (e.g. std.addi -> comb.add). And we could have a mechanism to register patterns that lower known operations to "library calls", which are implemented as external modules in hardware (e.g. std.addf -> Xilinx floating point IP instantiation).

This is cool.

@mortbopet
Contributor Author

mortbopet commented Aug 25, 2021

I'm not sure I'm following. You're saying we can't represent instance sharing of ops if we have the above representation? I'm not quite sure I understand why this is the case. To see if a group is using some primitive, we just need to see if any port of the given primitive is found within the group, right?

I should have been clearer; I was arguing for why I opted for using component-wrapped comb ops in this draft PR instead of what is currently done in the Calyx MLIR tests, i.e., using comb ops inside groups:

// CHECK: calyx.group_done %c2.done, %0 ? : i1
%guard = comb.and %c1_i1, %c2.out : i1
calyx.group_done %c2.done, %guard ? : i1

The representation used in the PR (wrapping comb ops in a component) was used as a placeholder, to be able to have the in/out port SSA values of the comb operator dominate anything in the calyx.wires/calyx.control ops (to allow for sharing).

The %left, %right, %out = calyx.prim sub : i32, i32, i32 style was then proposed as a way to replace these wrapped components.
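
E.g., a sketch of how the proposed style could express sharing (constants and register ports are hypothetical):

    %left, %right, %out = calyx.prim sub : i32, i32, i32
    calyx.wires {
      // Two groups bind the same subtractor instance at different times.
      calyx.group @sub_into_ra {
        calyx.assign %left = %a : i32
        calyx.assign %right = %b : i32
        calyx.assign %ra.in = %out : i32
        calyx.assign %ra.write_en = %true : i1
        calyx.group_done %ra.done : i1
      }
      calyx.group @sub_into_rb {
        calyx.assign %left = %c : i32
        calyx.assign %right = %d : i32
        calyx.assign %rb.in = %out : i32
        calyx.assign %rb.write_en = %true : i1
        calyx.group_done %rb.done : i1
      }
    }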

@jopperm
Contributor

jopperm commented Aug 25, 2021

Scheduling infrastructure

Yes, I see scheduling into pipeline stages as coming before SCFToCalyx.

I'm wondering if that's too early. Do you (@mortbopet, @mikeurbach) envision transformations of the pipeline, once it's formed?

My understanding of Calyx (mostly from their ASPLOS paper, no practical experience yet) is that one of its USPs is being a latency-insensitive target IR that takes the burden of emitting the FSMs off the frontend. Statically scheduling components then is an optimisation to simplify the FSM and omit the handshaking signals.

@rachitnigam, you said earlier that you are thinking about adding a data flow operator to the language -- is that still on the table?

If yes, then maybe SCFToCalyx could just map the SCF constructs to the ones in Calyx, and wrap their bodies in such a DF operator, which is scheduled "late" (i.e., really close to the construction of the controller FSM) by circt::scheduling?

@rachitnigam
Contributor

@mortbopet Thanks for the awesome writeup! I'm super excited to see this work! @mikeurbach @stephenneuendorffer thanks for the comments and context around the other passes in CIRCT. One top-level suggestion: could we set up a meeting with people interested in Calyx + CIRCT so that the Calyx team can update all of you on our current trajectory? @cgyurgyik mentioned that the 25th August CIRCT meeting is going to have some Calyx discussion already. Unfortunately, I'm currently in an inconvenient time zone (India) and cannot make the meeting. If all of you are interested, I'd be happy to coordinate a meeting.

Next, a summary of some new things in Calyx that may address some questions in this discussion:

  1. The invoke operator: We've added a new operator to Calyx called invoke which corresponds to function calls. It is used by all our frontends (Dahlia, TCAM, Systolic, TVM) and should probably be a part of the Calyx CIRCT. This test shows how invoke can be used to run a component.
  2. Combinational Groups: We've come to the realization that groups with a constant done condition (usually a 1'd1) are really just trying to encode a combinational circuit, and that these groups need to be treated fundamentally differently from normal groups. For example, enabling a combinational group from a seq is meaningless because it cannot affect any state in the program. To this end, we want to make combinational groups and components a first-class concept in Calyx (see the sketch after this list). The first step towards this has been a new pass (implemented in this PR) that transforms all combinational groups to "real groups". I think it would be cool/useful for the Calyx CIRCT dialect to start off with this concept instead of waiting for Calyx main to implement it first.
  3. Pipelines in Calyx: We have used the linked discussion thread a little bit to outline pipelines in Calyx. I initially thought that we would need a new first-class operator to implement pipelines in Calyx, but have recently figured out a way to encode pipelines using the existing control language. I think this might be a good place to start our discussion on pipelines in Calyx and how they relate to staticlogic.pipeline and other representations.
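
For instance, a first-class comb group in the dialect might look something like this (syntax speculative):

    calyx.wires {
      // A comb group has no done condition; it just names combinational logic.
      calyx.comb_group @lt_cond {
        calyx.assign %lt.left = %i.out : i32
        calyx.assign %lt.right = %c64 : i32
      }
    }
    calyx.control {
      // The while op activates the comb group to compute its condition.
      calyx.while %lt.out with @lt_cond {
        calyx.seq {
          calyx.enable @body
        }
      }
    }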

Since I'll be asleep when the CIRCT meeting happens, I'm preemptively putting up a when2meet link with my availability. I'll also add the link to the CIRCT meeting notes for the 25th. If we end up not needing the meeting, feel free to ignore.

CC @sampsyo @EclecticGriffin @sgpthomas @cgyurgyik

@mikeurbach
Contributor

Following the discussion today, I'll post on Discourse about how we might evolve the staticlogic.pipeline into the most suitable representation for this pass to consume.

@rachitnigam
Contributor

rachitnigam commented Aug 26, 2021

According to when2meet, ~~9.30am-10.30am EST~~ 10am-11am EST seems to work best. Made a discourse post about the meeting: https://llvm.discourse.group/t/calyx-circt-meeting/4174

@mortbopet
Contributor Author

@rachitnigam If/when you have time, would you like to sanity check/play around with this pass? There will have to be some changes when invoke/comb groups are supported in the Calyx dialect, but I think we're closing in on a good initial version for the pass.

The pass is currently tested with the tests in test/Conversion/SCFToCalyx, so those could be your entry points.

@rachitnigam
Contributor

Happy to try it out in the next few days! However, things are likely to break again once calyxir/calyx#635 lands and makes comb-group syntax required, i.e., if <port> with <comb-group> and while <port> with <comb-group>. I'm going to make an issue in the CIRCT repo to track the required implementation changes.

Collaborator

@lattner left a comment

I didn't review the pass itself in detail, but the integration into CIRCT looks good. Thank you for putting this in its own library!

@rachitnigam
Contributor

Ran into #1769 & #1770 when trying to push the code generated by this PR into the native Calyx compiler.

@mortbopet Looks like we've finished most of the tasks in #1690, and the next reasonable step is getting the code generated in this PR working with the native Calyx compiler. I think it'd be helpful if you could start running the native Calyx compiler on the generated code and tell us where things fail. Happy to help you set up the compiler.

@mortbopet
Contributor Author

@mortbopet Looks like we've finished most of the tasks in #1690, and the next reasonable step is getting the code generated in this PR working with the native Calyx compiler. I think it'd be helpful if you could start running the native Calyx compiler on the generated code and tell us where things fail. Happy to help you set up the compiler.

Will do! I already have the native compiler set up locally, but I'll ping you if anything comes up.

// CHECK-NEXT: }
// CHECK-NEXT: calyx.control {
// CHECK-NEXT: calyx.seq {
// CHECK-NEXT: calyx.enable @assign_while_0_init {compiledGroups = []}
Contributor

What is the compiledGroups attribute used for here?

Contributor Author

@mortbopet commented Sep 16, 2021

This would indicate that the Calyx printer for calyx.enable prints the compiledGroups attribute by default; it is picked up by attr-dict in the assembly format:

let arguments = (ins
  FlatSymbolRefAttr:$groupName,
  OptionalAttr<ArrayAttr>:$compiledGroups
);
let assemblyFormat = "$groupName attr-dict";
let verifier = "return ::verify$cppClass(*this);";
}

@rachitnigam
Copy link
Contributor

When lowering the following SCF program, the generated Calyx program has unused groups:

func @main(%a0 : i32, %a1 : i32) -> i32 {
  %b = cmpi uge, %a0, %a1 : i32
  cond_br %b, ^bb1(%a0: i32), ^bb2(%a1: i32)
^bb1(%aa: i32):
  %b11 = subi %aa, %a1 : i32
  %b12 = subi %aa, %a0 : i32
  br ^bb3(%b11, %b12: i32, i32)
^bb2(%ab: i32):
  %b21 = addi %ab, %a1 : i32
  %b22 = addi %ab, %a0 : i32
  br ^bb3(%b21, %b22: i32, i32)
^bb3(%r: i32, %r2: i32):
  %r3 = addi %r, %r2 : i32
  return %r3 : i32
}

Such groups can be eliminated using Calyx's -p dead-group-removal pass, but they usually indicate that the frontend forgot to use a group. If SCFToCalyx is expected to generate some dead groups, then this is not a problem.

@mortbopet
Contributor Author

When lowering the following SCF program, the generated Calyx program has unused groups:

This should be due to #1802. The output of SCFToCalyx has all groups referenced in the schedule (see checks in convert_controlflow.mlir).

@mikeurbach
Contributor

mikeurbach commented Sep 16, 2021

Let me try to summarize what this pass accepts as input:

  • Standard operations that have primitive counterparts in hardware
  • MemRef operations
  • CDFG with basic blocks and branches (i.e., unstructured control flow)
  • Structured control flow with execute region and while

I think we agree this pass will continue to grow, and that it should be landed so we can do more incremental work. But I want to make sure I'm on the same page w.r.t. the goal.

As the new pipeline representation emerges, and the scheduling infrastructure is applied, my hope is this pass can be greatly reduced in scope.

For one, I'm still doubting the role of CDFG at this level. It seems like the frontends we are coming from are able to generate Affine/SCF control flow, which we can schedule and represent as a pipeline without ever getting to the level of basic blocks. Is there a client that needs CDFG support? If there is, I'd still expect such programs to be scheduled and converted to a staticlogic pipeline before this. If we don't want to statically schedule a CDFG, then we already have the Handshake path.

Down the road, I'm imagining this pass need only accept the following:

  • Standard operations and MemRef operations, which can be lowered according to a component library
  • StaticLogic pipeline loops that have been scheduled

Does this target line up? I want to make sure if I do all the work laid out here, we can have a more simplified lowering to Calyx.

@mortbopet
Contributor Author

@mikeurbach With regard to the inputs: correct, those are what the pass currently accepts.

In terms of how this pass fits into the big picture long term, I hope to start looking, sometime within the next couple of months, at interactions between the Handshake and Calyx paths (mixing dynamically and statically scheduled HLS). I'm unsure about whether we want to completely eliminate the option of passing CDFGs to SCFToCalyx, but I do understand the need for separation of concerns. I'm thinking that there might be programs with setup code before e.g. a pipelined loop, which looks and acts like an FSM, and the sensible thing would be to just pass it directly to Calyx (through SCFToCalyx), instead of trying to make a pipeline out of something which shouldn't be one.

@jopperm
Contributor

jopperm commented Sep 17, 2021

I'm thinking that there might be programs with setup code before e.g. a pipelined loop, which looks and acts like an FSM, and the sensible thing would be to just pass it directly to Calyx (through SCFToCalyx), instead of trying to make a pipeline out of something which shouldn't be one.

I agree, this is a valid concern. Another use case: loops with unbalanced control paths (e.g., simple logic on one side of a branch, a multi-cycle divider on the other) usually also benefit from an FSM-of-basic-blocks execution model, because a pipeline's latency is equal to the latency of its longest control path. The datapaths corresponding to each basic block could be statically scheduled before going into this pass, similar to the new staticlogic.pipeline.while op. (Though I'm not sure whether that would gain much over just passing the code to Calyx, other than consistency.)
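
E.g., a sketch with std ops:

    // A pipelined version would run at the latency of the slow path; an
    // FSM-of-basic-blocks version only pays for the divider when taken.
    cond_br %b, ^fast, ^slow
    ^fast:
      %0 = addi %a0, %a1 : i32          // cheap, single-cycle
      br ^join(%0 : i32)
    ^slow:
      %1 = divi_signed %a0, %a1 : i32   // multi-cycle divider
      br ^join(%1 : i32)
    ^join(%r: i32):
      return %r : i32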

@rachitnigam
Contributor

I think there is also a simple, understated benefit to keeping a non-pipelined flow working, even in the presence of a pipelined flow: there is always a source of truth w.r.t. program execution.

Deciding when and what to pipeline feels like an optimization, and there should be another way to quickly get unoptimized but functional designs working. I add "quickly" because traditional SDC + SAT scheduling can get pretty slow with complex programs.

@mikeurbach
Contributor

Thanks for the ideas w.r.t. CDFGs and this pass. I can see how they have a place, but I still wonder if that belongs here in the long term, as they aren't really "SCF". Perhaps the pass just needs to be split up into ArithmeticToCalyx, CDFGToCalyx, SCFToCalyx, etc.

I'm also wondering about the interactions with Handshake, because when I think about a CDFG that doesn't make sense as a pipeline, it seems Handshake can fill the need. So far we've been talking about Calyx because it can represent a static schedule nicely, but taking advantage of its ability to do latency insensitive things would be interesting too. Perhaps we need to add HandshakeToCalyx to the mix.

there should be another way to quickly get unoptimized but functional designs working

This is already supported in the Handshake workflow, which goes all the way to Verilog that simulates. As far as I'm concerned, we are focused on the optimized part now. What we're doing here had better be an improvement on the dynamically scheduled dataflow (for the programs we can analyze and statically schedule).

Anyway, I'm rambling. My real concern is this pass has become quite monolithic. I think this is a reasonable place to start, and we can break it up as we go. This is what happened with the StandardToLLVM pass upstream, for example.

@rachitnigam
Contributor

This is already supported in the Handshake workflow, which goes all the way to Verilog that simulates.

Ah, got it. I guess in that case an interesting but tangential question would be whether it makes sense to lower Handshake to Calyx, since there are still some control/structural optimizations Calyx may be able to do that Handshake can't.

Anyway, I'm rambling. My real concern is this pass has become quite monolithic

Agreed on both points. Not having the pass be monolithic in the long term would be great, but hats off to @mortbopet for getting it working!

@stephenneuendorffer
Contributor

I'm also wondering about the interactions with Handshake, because when I think about a CDFG that doesn't make sense as a pipeline, it seems Handshake can fill the need. So far we've been talking about Calyx because it can represent a static schedule nicely, but taking advantage of its ability to do latency insensitive things would be interesting too. Perhaps we need to add HandshakeToCalyx to the mix.

I see a hierarchy of solutions here:

  • Statically scheduled and pipelined
  • Statically scheduled and not pipelined
  • Dynamically scheduled and pipelined
  • Dynamically scheduled and not pipelined

Generally, statically scheduled things (whether pipelined or not) can get lowered to an FSMD/Calyx style of representation, while the dynamically scheduled options can get lowered to Handshake. An important question is how we combine statically and dynamically scheduled regions, i.e., a Calyx module inside a Handshake region.

Member

@cgyurgyik left a comment

I haven't looked through all of the code yet, and this is mostly nits and style/code changes.

This is really cool!! I think it is some great work for the first frontend to the Calyx dialect. However, the code size is making it difficult to review :-)

There has to be some way this PR can be split up, e.g.

  1. boilerplate
  2. lowering a stupid simple function with no basic blocks.
  3. lowering a function that requires registers.
  4. MemoryOp uses (and optionally splitting off between single / multiple read).
    ... // index casting, chaining combinationals, multiple returns, ...?

/// inside other utility functions.
template <
    typename F,
    std::enable_if_t<!std::is_void<typename std::result_of<F()>::type>::value,
Member

Nit: Not entirely necessary, but it may be useful to describe why this std::enable_if_t is being used, e.g. // Expecting F to be a non-void function. This isn't easy to read for C++ beginners (can't wait for C++20 concepts support).

Member

I also question whether this is better than the more canonical approach of using IRRewriter::InsertionGuard.

Contributor Author

+1 on using IRRewriter::InsertionGuard; its intention seems to be exactly what I intended with persistInsertionPoint.

Comment on lines 493 to 494
auto assignRegInOp =
    rewriter.create<calyx::AssignOp>(loc, reg.in(), v, Value());
Member

We should use a builder that doesn't take in a value for the guard, if that's what you're trying to do here. Feel free to open an issue for this.

Contributor Author

This is already an issue: #1611

@mortbopet
Contributor Author

Thank you for the reviews @cgyurgyik; what I'll do is close this PR and come up with a PR series in line with what you suggested.

@mortbopet
Contributor Author

Closing this in favor of a PR series starting from #1812

@mortbopet closed this Sep 20, 2021
@mortbopet deleted the calyx/scftocalyx branch Sep 24, 2021