-
Notifications
You must be signed in to change notification settings - Fork 310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Arc] Automatic module partition #7650
Conversation
Hey @SpriteOvO, thanks for working on a very challenging and interesting corner of the optimization space 🥳! One of the challenges I see, and you also mentioned, is that different users may have wildly different partitioning requirements. Arcilator probably wants to break the actual simulation workload into parallel tasks, while FPGA partitioning may want to find isolated clusters that have relaxed communication. It's very challenging to build a generic pass that performs well on all needs. You mentioned that your main goals was to find parallelization opportunities for Arcilator. That is awesome 😍 🥳! Arcilator could really benefit from some coarse-grained multi-threading. A lot of the analysis and bookkeeping you have to do comes from the presence of modules and the potential combinational paths that pass through module parts. Have you considered doing your partitioning directly in the Arc dialect? (You can always generalize later if feasible.) After conversion to Arc, the circuit becomes a graph of the necessary computations to advance the design by one cycle. At this stage, state and combinational paths are clearly visible, and there is no module hierarchy or hidden combinational path anymore. This would allow you to focus entirely on the partitioning itself: finding independent regions of computation, maybe grouping them into a new op (maybe some |
Drop in comment: I haven't looked over this PR, but when I had a partitioning pass in the MSFT dialect (it would pull things out deep in the module hierarchy and do cross-module movements but was obnoxiously complex and wasn't actively used), I had an attribute to specify the partition. I required the designer to specify the partition, but I could easily see different passes doing some target-specific heuristics to label the ops then running the design partition pass. |
Thanks @SpriteOvO and @CircuitCoder for doing this job! That's a really important job, T1 is counting on this PR to remove the verilator emulation backend.
We should forbid inout, in the partition boundary, the only usage for inout is for Chip/FPGA IO, which will be emulated via a tri-state IO.
Unlike the overhead of Verilog semantics, it is event driven, thus verilator use the
I personally would see, just pre-analyse all combinational path, and forbid cutting from them, cutting the combinational path delivers a bad performance on simulation and high complexity in the algorithm. |
I apologize for the super delayed response, I fell ill for the last couple of weeks. Partitioning against ARC seems to have a lot of merits. After discussing with @SpriteOvO and @sequencer, we decided to try to migrate this algorithm to the ARC dialect, which @SpriteOvO is already working on, and should result in a PoC after this weekend. My personal reasoning is that: If I understand correctly, partitioning during the ARC pass mainly involves partitioning against states, or more precisely, the new value to be written into states at the end of each cycle. The entire computation subtree would be partitioned into one half, and some operations would be copied into multiple halves. Our original motivation for doing partition with HW dialect is that: (a) We can partition against comb operations, which can be simpler (the graph representation is more intuitive to construct) and reflects well to hardware utilization if the result is synthesized onto FPGA; (b) Can use module boundary (which contains some semantical information) to guide partitioning. It turns out, however, that we overlooked some cases when simple unique partition against operations would not work (combinatory path has to pass through boundary more than one time [2]), thus breaking the correctness of lockstep simulation. So even if partitioning is done against operations, some operations would have to be duplicated. This is actually more difficult than first partitioning against states, and then doing some sort of common subexpression extraction (see [1]). My personal concerns with partitioning with states are:
After finishing the PoC, we intend to use rocket-chip w/ default configuration as a test. [1] We currently implement a simple sub-expression extraction algorithm: [2] Essentially it boils down to this pattern:
If A and B are split apart, then no matter where we place C, there would be a comb path that traverses the boundary at least twice. |
@SpriteOvO has just pushed an update, containing an initial PoC implementation of Arc-based state-directed partitioning. In brief, the partitioning operates on Hence, a multithreaded driver should call the model in the following way: For each thread,
Right now the partitioning happens at the end of the hw -> arc pipeline, contained in two passes: the first one (PartitionPass) is placed before state allocation. This pass marks state writes with assigned chunks and creates shadow states. The second one (PartitionClonePass) is placed after state allocation but before clock -> func. This pass splits the original Changes to other passes introduced in this commit
Implementation detailThe Partition planning involves finding out the dependencies between states, and the computation workload required for updating each state. Based on this information, a partition plan is generated, and each Shadow states are created for each HW state, and are allocated in the chunks' temporary storage. Inside an Between these two passes, states are allocated, and now the size for each storage is fixed. The
Unfinished worksThere are some unfinished (but mostly isolated) works to be done. This list will be updated to reflect the implementation. Most notably, the partitioning planning algorithm is missing.
QuestionsWe have a few questions regarding the inner workings of the other parts of arcilator, which may produce a better implementation for the partitioning procedure:
|
This is some really cool work that you are doing 🥳! Very exciting 🙌 One thing I'm wondering is how this interacts with the LowerState improvements and bugfixes landing as part of #7703. That PR gets rid of the LegalizeStateUpdate pass (which I think you can't use for your data race hazards anyway), and it also gets rid of Do you think we could somehow create an Another thing we need to figure out is how to deal with memory, because that is going to be a major source of headache. I like your idea of aggregating memory writes into a list of changes to be applied later, and then flushing those into simulation memory in the sync phase. That would be very neat! Your implementation currently requires quite a few attributes on operations. These attributes are discardable however, so everything in theory should still work even when you delete these attributes. Is there a way how we could encode the partitioning using something like task ops, where you don't need to put attributes on ops and we could have a single lowering pass that takes all Having a representation of tasks in the IR would also allow you to group together states and memories that are only used inside a thread. The AllocateState pass already does this for you if all state allocation is nested inside a block, and you should be able to trigger this manually using an explicit
should in theory allocate all state in the task into one contiguous block, which is then allocated in the global storage (which should be fine because it's just one region of memory that only this thread touches):
|
Thanks for the insightful reply, @fabianschuiki! 😻 I briefly looked into #7703, and it seems that it can simplify the implementation a lot. All attributes introduced in this PR can actually be dropped if rebased upon #7703 and using the I haven't come up with a good solution to this problem. One way would be to allow @SpriteOvO and I will try to rebase this PR onto #7703, and see if the idea works. Task dependencies
Assuming that we place writes to the same state into the same task, then I think we might be able to infer a finer dependency relation between tasks. If all operations are placed into tasks, then one task needs to be scheduled after another if and only if it reads a state written by another task prior in program order. We don't even need extra structures (operands, attributes) in the IR. Let's use a fancy notation to represent the transitive closure of this relation: This relation can also give us a simple procedure to merge tasks. A and B can be merged if and only if there does not exist C, such that
Therefore we might be able to get an alternative bottom-up partitioning scheme by combining atomic Nevertheless, for the top-down partitioning method, this kind of dependency can still be used to reflect the semantics of State groupingAfter some thought, It seems to me that The downside is that these storages would all be allocated flat, and the user of the model cannot control the allocation. It may be desirable for dynamically scheduled simulations to reuse unused storage spaces between tasks for better cache locality. |
be0a703
to
94ea712
Compare
The rebase is done! In the last force pushed commit, the following modifications are made:
Currently, Ideally, we would want We also found that states depended by control flow structures cannot be written-back within ordinary Now, a single step in statically-scheduled multithread lockstep simulation should look like this:
The added test case is a small GCD calculator. We'll then try to implement supports for memories and test on rocket. |
Really cool stuff! Let me go through this in more detail. The tasks look very interesting 😍. We might be able to split this up into a few smaller PRs such that we can flesh out the different pieces step by step. |
|
||
// The allocation should run in (reversed?) topo-sorted order for preexisting | ||
// substorages to work correctly. | ||
|
||
DenseSet<Value> allocated; | ||
std::function<void(Value)> visit; | ||
visit = [&](Value parent) { | ||
if(allocated.contains(parent)) return; | ||
for(auto childAlloc : opsByStorage[parent]) | ||
if(auto childStorageAlloc = dyn_cast<AllocStorageOp>(childAlloc)) { | ||
auto child = childStorageAlloc.getResult(); | ||
if(opsByStorage.contains(child)) // Used by allocations in this block | ||
visit(child); | ||
} | ||
allocateOps(parent, block, opsByStorage[parent]); | ||
allocated.insert(parent); | ||
}; | ||
|
||
// Actually allocate each operation. | ||
for (auto &[storage, ops] : opsByStorage) | ||
allocateOps(storage, block, ops); | ||
for (auto &[storage, _] : opsByStorage) | ||
visit(storage); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for looking into why this didn't work for your use case! Could you add a small test case to test/Dialect/Arc/allocate-state.mlir
that's broken, and create a separate PR together with your fix for the AllocateState pass?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We added the test in 7678feb, included in this PR. Is it possible that we first got this pr merged into a dev branch, and then break up the changes into smaller PRs?
Thank you!
if (operand.get() == modelStorageArg) { | ||
operand.set(clockStorageArg); | ||
if (storageMapping.contains(operand.get())) { | ||
operand.set(storageMapping.lookup(operand.get())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test to test/Dialect/Arc/lower-clocks-to-funcs.mlir
that checks these new features. One or two new arc.model
ops that trigger these additional arguments to be extracted would be nice. Could you then create a separate PR with just the improvements to LowerClocksToFuncs and that test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After looking closely, it turns out this code path is no longer used because we're not using two storage arguments in arc.model
. Adding a test for this would trigger a verification error on arc.model
saying it cannot have more than one argument, and I'm not confident to just drop that verification, and there is no longer a good reason to have more than one argument.
Therefore we reverted the changes in this file in the update.
It looks like there are three main pieces of work that you're doing:
Since (1) is going to happen later and (2) does not yet deal with memories yet, would it make sense to focus on the task op in (3) first and fleshing out we can describe parallel computation and synchronization in the IR? Let me collect a few thoughts. |
It would be awesome if we could somehow get the The big question is how to represent the nice Maybe something like this: graph TD
t1c[Task 1 Compute
& Local Writes]
t2c[Task 2 Compute
& Local Writes]
t3c[Task 3 Compute
& Local Writes]
t1s[Task 1 Write]
t2s[Task 2 Write]
t3s[Task 3 Write]
t1c -- Data to Write --> t1s
t1c -.-> t2s
t2c -.-> t1s
t2c -- Data to Write --> t2s
t3c -- Data to Write --> t3s
t1s -.-> t3c
t2s -.-> t3c
Here you'd have 6 separate tasks that depend on each other. The writes to global state are split into seperate "Write" tasks, which allows them to have additional edges from other computation task to encode that other computations have to finish before the write in order to avoid RAW hazards. The corresponding MLIR could look like the following: // Task 1 Compute
%t1_data, %t1 = arc.task : i42, !arc.token {
%0 = arc.state_read %s0 : <i42>
%1 = arc.state_read %s1 : <i42>
%2 = arc.state_read %s2 : <i42> // only accessed in this task
%3 = comb.add %0, %1, %2 : i42
arc.state_write %s2, %3 : <i42>
%4 = arc.token_create // used as scheduling dependency with Task 2 Write
arc.yield %3, %4 : i42, !arc.token
}
// Task 2 Compute
%t2_data, %t2 = arc.task : i42, !arc.token {
%0 = arc.state_read %s0 : <i42>
%1 = arc.state_read %s1 : <i42>
%2 = arc.state_read %s3 : <i42> // only accessed in this task
%3 = comb.mul %0, %1, %2 : i42
arc.state_write %s3, %3 : <i42>
%4 = arc.token_create // used as scheduling dependency with Task 1 Write
arc.yield %3, %4 : i42, !arc.token
}
// Task 1 Write
%w1 = arc.task {
arc.token_use %t2 // avoid RAW with Task 2 Compute
arc.state_write %s0, %t1_data
%0 = arc.token_create // used as scheduling dependency with Task 3 Compute
arc.yield %0 : !arc.token
}
// Task 2 Write
%w2 = arc.task {
arc.token_use %t1 // avoid RAW with Task 1 Compute
arc.state_write %s1, %t2_data
%0 = arc.token_create // used as scheduling dependency with Task 3 Compute
arc.yield %0 : !arc.token
}
// Task 3 Compute
%t3_data = arc.task : i42 {
arc.token_use %w1 // avoid WAR with Task 1 Write
arc.token_use %w2 // avoid WAR with Task 2 Write
// These reads want to see the writes in Task 1/2
%0 = arc.state_read %s0 : <i42>
%1 = arc.state_read %s1 : <i42>
%2 = comb.sub %0, %1 : i42
arc.yield %2 : i42
}
// Task 3 Write
arc.task {
arc.state_write %outputPort, %t3_data : <i42>
} What do you think about a structure like this? The benefit of fleshing this out first independently in a few PRs would be that we can work out an LLVM lowering of these tasks to a loop that executes the next ready task (atomically based on a readiness bit in some shared control word). Once we get that to work we'll have a very nice foundation to automatically derive these tasks with a partitioning scheme and execute simulations across multiple threads. (A task would be ready to execute when all other tasks whose SSA results it uses have finished executing. We can track that using bitmasks which we set/clear atomically at runtime.) |
First a little update: we add supports for memory writes (using shadow states for addr & data & mask), and got rocket-core from the
Objdump shows that computations are duplicated A LOT:
Hopefully after adding the graph partitioning library, things will get better. We could also try do dynamic scheduling, or add more synchronization points within each static partition.
Yes, that would be ideal. However I'm not sure how arcilator would insert the required synchronization operations, due to the fact that there might be different executor design. I'm currently leaning towards that we only encode and emit this information, and leave the actual synchronization to the user of the model.
Allowing tasks to produce and consume values seems like an excellent idea. For values that are used within a task but came from outside, is it possible to abstract them into block arguments? (a little bit similar to
Or we can just omit A small question regarding |
b37f650
to
72c9345
Compare
Co-authored-by: Liu Xiaoyi <circuitcoder0@gmail.com>
Co-authored-by: Liu Xiaoyi <circuitcoder0@gmail.com>
72c9345
to
7678feb
Compare
According to @sequencer @fabianschuiki's private chat discussion, this PR can now be merged as an initial stage of work. Thanks to @fabianschuiki for reviewing and @CircuitCoder for great mentoring and help in this work. |
Thanks a lot for tracking this in a separate feature branch and chopping off multiple PRs from there! 🥳 |
Wow, that's awesome! Congratulations 🥳. I'm sure we can figure out a way to make this run a lot faster 🙂. We might even find some simple greedy partitioning scheme before we pull in more complicated graph partitionings. Very nice!
This is what Arcilator has tried to do initially: it only created functions for each clock, plus a passthrough, and expected the user of the model to interact with it. The problem with this approach turned out to be that as a user you really don't want to interact with the details of the model. As a user, you want Arcilator to just run your simulation, or produce a model that has a very straightforward way of interacting with it. Most of the time since the initial Arcilator implementation has been about undoing this assumption that the user will interact with the model properly. I fear the multithreading is going to be the exact same story: the user will want to be able to pass something like From a tooling point of view, it's often better to develop an end-to-end flow initially, and then allow the user to take more detailed control. This makes the implementation somewhat self-documenting: not only does Arcilator formulate tasks and dependencies among them, but it also has a systematic way of lowering those ops into something that just works in LLVM. This ensures that we can have that
Yes that is an excellent idea! We can definitely do that 😃. I'm very much in favor of making these ops If we want to create tasks earlier without having huge block argument lists, we could let the tasks capture values from the environment and only add // Determine the uses of values defined outside the op.
SmallVector<Value> externalOperands;
op.walk([&](Operation *nestedOp) {
for (auto value : nestedOp->getOperands())
if (!op->isAncestor(value.getParentBlock()->getParentOp()))
externalOperands.push_back(value);
});
I think we need to consider all operands as dependencies, not just We could even make the tasks return an SSA value representing the task itself, as a replacement for the token. Something like this: %task, %result1, %result2 = arc.task %op1, %op2, %otherTask, %op3 :
(i42, i42, !arc.task, i42) -> (!arc.task, i42, i42)
{
} The task op definition could include an explicit
What do you think about this? |
This is a draft PR, which mostly serves the purpose of gathering feedbacks. Implemention is imcomplete and very hacky in places. See the "Unresolved issues" section below.
This PR introduces a new pass (
--hw-partition
) into CIRCT to partition HW top-level modules into two or more parts. Each parts contains parts of the logic and states of the original circuit. Additional ports are created for signals that crosses partition bondaries. The pass tries to perserve origional modules and hierarchies.Motivation
The primary motivation is multithreaded simulation with Arcilator. To be more concrete, the targeted typical usecase of this pass is high performance multithreaded simulation using Arcilator, with HW modules compiled from FIRRTL, on a single CPU, with (mostly) static scheduling. Time stepping happens in a lock-step style.
However, doing this on the HW level may also benifit other use cases, e.g. large scale verification with multiple FPGAs.
Unresolved Issues
inout
?Acknowledgement
Thanks to @CircuitCoder for mentoring this work.