[Arc] Automatic module partition #7650

Merged
merged 3 commits into from
Oct 31, 2024


SpriteOvO
Member

This is a draft PR, which mostly serves the purpose of gathering feedback. The implementation is incomplete and very hacky in places. See the "Unresolved Issues" section below.

This PR introduces a new pass (--hw-partition) into CIRCT to partition HW top-level modules into two or more parts. Each part contains a portion of the logic and states of the original circuit. Additional ports are created for signals that cross partition boundaries. The pass tries to preserve the original modules and hierarchies.
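As a rough illustration of the port-creation step (a minimal sketch, not code from this PR; the data layout and the name `cross_partition_ports` are hypothetical), a value needs one extra port per foreign partition that uses it, no matter how many users it has there:

```python
def cross_partition_ports(values, part):
    """values: name -> (bit width, driving op, list of user ops).
    part: op -> partition index.
    Returns one (value, src_part, dst_part, width) entry per needed port."""
    ports = []
    for name, (width, driver, users) in values.items():
        # Partitions that use the value, excluding the one that produces it.
        dst_parts = {part[u] for u in users} - {part[driver]}
        for dst in sorted(dst_parts):
            ports.append((name, part[driver], dst, width))
    return ports

# Hypothetical four-op circuit split into two parts.
values = {
    "x": (8, "a", ["b"]),
    "y": (32, "b", ["c", "d"]),
    "z": (1, "c", ["d"]),
    "w": (16, "a", ["d"]),
}
part = {"a": 0, "b": 0, "c": 1, "d": 1}

ports = cross_partition_ports(values, part)
total_width = sum(width for *_, width in ports)
print(ports, total_width)
```

In this toy input only `y` and `w` cross the boundary, so the total cross-partition width is 48 bits.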

Motivation

The primary motivation is multithreaded simulation with Arcilator. More concretely, the typical targeted use case of this pass is high-performance multithreaded simulation using Arcilator, with HW modules compiled from FIRRTL, on a single CPU, with (mostly) static scheduling. Time stepping happens in a lock-step style.

However, doing this at the HW level may also benefit other use cases, e.g. large-scale verification with multiple FPGAs.

Unresolved Issues

  • How do we deal with inout?
  • How do we utilize operation duplication, i.e. letting a single operation appear in multiple parts to reduce communication? Currently no operation is duplicated. We can also approach this from the other end: partition only by states first, duplicate the entire combinational subtree, and then optimize.
  • What graph partitioning optimization goal should we use? Most graph partitioning algorithms and libraries only optimize for cut size (in our case, the total width of cross-module ports, i.e. the amount of communication), while the size of each component serves as a constraint. METIS supports
  • The implementation of combinational paths is incorrect. Currently it directly edits existing modules. When there are multiple nested instances of the same module, this approach will fail.

Acknowledgement

Thanks to @CircuitCoder for mentoring this work.

@fabianschuiki
Contributor

fabianschuiki commented Sep 30, 2024

Hey @SpriteOvO, thanks for working on a very challenging and interesting corner of the optimization space 🥳!

One of the challenges I see, and you also mentioned, is that different users may have wildly different partitioning requirements. Arcilator probably wants to break the actual simulation workload into parallel tasks, while FPGA partitioning may want to find isolated clusters that have relaxed communication. It's very challenging to build a generic pass that performs well on all needs.

You mentioned that your main goal was to find parallelization opportunities for Arcilator. That is awesome 😍 🥳! Arcilator could really benefit from some coarse-grained multi-threading. A lot of the analysis and bookkeeping you have to do comes from the presence of modules and the potential combinational paths that pass through module parts. Have you considered doing your partitioning directly in the Arc dialect? (You can always generalize later if feasible.) After conversion to Arc, the circuit becomes a graph of the necessary computations to advance the design by one cycle. At this stage, state and combinational paths are clearly visible, and there is no module hierarchy or hidden combinational path anymore. This would allow you to focus entirely on the partitioning itself: finding independent regions of computation, maybe grouping them into a new op (maybe some arc.tasklet?), and then updating the LLVM lowering to allow execution of these tasks on separate threads.

@teqdruid
Contributor

One of the challenges I see, and you also mentioned, is that different users may have wildly different partitioning requirements. Arcilator probably wants to break the actual simulation workload into parallel tasks, while FPGA partitioning may want to find isolated clusters that have relaxed communication. It's very challenging to build a generic pass that performs well on all needs

Drop in comment: I haven't looked over this PR, but when I had a partitioning pass in the MSFT dialect (it would pull things out deep in the module hierarchy and do cross-module movements but was obnoxiously complex and wasn't actively used), I had an attribute to specify the partition. I required the designer to specify the partition, but I could easily see different passes doing some target-specific heuristics to label the ops then running the design partition pass.

@sequencer
Contributor

Thanks @SpriteOvO and @CircuitCoder for doing this job! This is really important work: T1 is counting on this PR to remove the Verilator emulation backend.
Here are my two cents on the problems, but please follow @fabianschuiki's comments to get it upstreamed.

How do we deal with inout?

We should forbid inout at the partition boundary. The only use for inout is chip/FPGA IO, which will be emulated via a tri-state IO.

How do we utilize operation duplication, i.e. a single operation appears in multiple parts to reduce communication? Currently no operation is duplicated. We can also approach this from the other end: partition by only by states first, duplicate the entire combinatory subtree, and then do optimization.

  • States (reg, mem) cannot be duplicated, and they live in the data cache;
  • If modules are partitioned, the same .text can run on different CPUs/FPGAs. The benefit comes from saving LLC, but communication always exists.

What graph partitioning optimization goal should we use? Most graph partitioning algorithm / library only optimize for cut-size (or total width of cross module ports in our case, or amount of communication), while the size of each component serves as constraints.

Verilog semantics are event driven, which carries overhead, so Verilator uses an mtask-based approach to simulation. Arc simulation, by contrast, is a computation over a connection graph. I personally prefer directly partitioning it into multiple small non-overlapping graphs. This suits both FPGA and in-cluster CPU simulation.

The implementation of combinational path is incorrect. Currently it directly edits existing modules. When there are multiple nesting instances of a same module, this approach will fail.

I would personally just pre-analyse all combinational paths and forbid cutting them. Cutting a combinational path hurts simulation performance and adds a lot of complexity to the algorithm.

@CircuitCoder
Contributor

I apologize for the super delayed response. I was ill for the last couple of weeks.

Partitioning against ARC seems to have a lot of merits. After discussing with @SpriteOvO and @sequencer, we decided to try to migrate this algorithm to the ARC dialect, which @SpriteOvO is already working on, and should result in a PoC after this weekend. My personal reasoning is that:

If I understand correctly, partitioning during the ARC pass mainly involves partitioning against states, or more precisely, the new value to be written into states at the end of each cycle. The entire computation subtree would be partitioned into one half, and some operations would be copied into multiple halves.

Our original motivation for partitioning in the HW dialect was that (a) we can partition against comb operations, which can be simpler (the graph representation is more intuitive to construct) and maps well to hardware utilization if the result is synthesized onto FPGA; and (b) we can use module boundaries (which carry some semantic information) to guide partitioning. It turns out, however, that we overlooked some cases where a simple unique partition against operations does not work (a combinational path has to pass through the boundary more than once [2]), thus breaking the correctness of lockstep simulation. So even if partitioning is done against operations, some operations would have to be duplicated. This is actually more difficult than first partitioning against states and then doing some sort of common subexpression extraction (see [1]).

My personal concerns with partitioning with states are:

  • How well does the graph partitioning algorithm perform? The information we give the algorithm is now coarser. Now the graph is constructed using the following procedure:
    • Nodes are states, edge weights are the bit-width of the depended state.
    • The above information alone would push METIS to partition all the hot registers into one half, which results in severe computation imbalance. Therefore we add two node weight constraints: Weight 1 = sum of the bit-widths of all operations within the computation subtree. Weight 2 = bit-width of the state itself. These weights reflect the computation and state allocation balance, respectively. For CPU simulations, we can ignore constraints on weight 2 right now.
  • How much computation is duplicated?
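The graph construction described above could be sketched roughly as follows. This is only an illustration under assumptions: the input layout and the name `build_partition_graph` are hypothetical, and folding both directions of a dependency into one undirected edge is a choice made here because graph partitioners such as METIS generally expect undirected graphs.

```python
def build_partition_graph(states):
    """states: name -> {"width": bits of the state,
                        "reads": states read when computing its next value,
                        "op_widths": bit widths of ops in its computation subtree}.
    Returns (edges, node_weights)."""
    edges = {}
    node_weights = {}
    for name, info in states.items():
        # Weight 1: computation in the subtree. Weight 2: size of the state.
        node_weights[name] = (sum(info["op_widths"]), info["width"])
        for dep in info["reads"]:
            if dep == name:
                continue  # a self-dependency can never cross a boundary
            key = tuple(sorted((name, dep)))
            # Edge weight: bit width of the state being depended on.
            edges[key] = edges.get(key, 0) + states[dep]["width"]
    return edges, node_weights

# Hypothetical three-state design.
states = {
    "pc":    {"width": 32, "reads": ["pc", "stall"], "op_widths": [32, 32]},
    "stall": {"width": 1,  "reads": ["pc"],          "op_widths": [32, 1]},
    "acc":   {"width": 64, "reads": ["acc", "pc"],   "op_widths": [64]},
}
edges, node_weights = build_partition_graph(states)
print(edges)
print(node_weights)
```

The two-element node weight tuples correspond to the two balance constraints; a partitioner that supports multiple constraints could be asked to ignore the second one for CPU simulation.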

After finishing the PoC, we intend to use rocket-chip w/ default configuration as a test.


[1] We currently implement a simple sub-expression extraction algorithm:
If an SSA value is used in multiple halves, and only depends on the states in one half, then the computation of that SSA value is moved into that half, placed after the end-of-cycle state update.
This may break the load balancing, and maybe should only be enabled if the computation subtree is large enough or duplicated enough times.
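The rule in [1] can be written down as a small predicate. This is a sketch only, with hypothetical names and a simplified view where each SSA value is summarized by the states it depends on and the halves (chunks) that use it:

```python
def place_shared_values(value_deps, value_users, state_chunk):
    """value_deps: SSA value -> set of states it depends on.
    value_users: SSA value -> set of chunks that use it.
    state_chunk: state -> chunk that owns (writes) it.
    Returns value -> chunk for values that can be computed once, in the
    chunk owning all of their inputs, instead of being duplicated."""
    placement = {}
    for value, deps in value_deps.items():
        dep_chunks = {state_chunk[s] for s in deps}
        if len(value_users[value]) > 1 and len(dep_chunks) == 1:
            placement[value] = next(iter(dep_chunks))
    return placement

state_chunk = {"a": 0, "b": 0, "c": 1}
value_deps = {"v0": {"a", "b"}, "v1": {"a", "c"}, "v2": {"c"}}
value_users = {"v0": {0, 1}, "v1": {0, 1}, "v2": {1}}

placement = place_shared_values(value_deps, value_users, state_chunk)
print(placement)
```

Here only `v0` qualifies: `v1` depends on states in both halves, and `v2` is used in a single half. A size or duplication-count threshold, as suggested above, would be an extra condition in the `if`.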

[2] Essentially it boils down to this pattern:

reg A -- op C -- A
reg B -/      \- B

If A and B are split apart, then no matter where we place C, there would be a comb path that traverses the boundary at least twice.
To avoid this during graph construction, we would have to analyze regs pairwise to figure out if they need to be placed together, which is just too costly.
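The claim in [2] can be checked by brute force on the three-node pattern. A minimal sketch (the function name is made up for illustration) counts how often each register-to-register path through C crosses the partition boundary:

```python
from itertools import product

def max_crossings(part):
    """part: node -> partition for nodes "A", "B" (registers) and "C" (op).
    Every path has the shape reg -> C -> reg; return the worst-case number
    of partition boundary crossings over all such paths."""
    regs = ["A", "B"]
    return max(
        int(part[src] != part["C"]) + int(part["C"] != part[dst])
        for src, dst in product(regs, regs)
    )

# A and B split apart: try both placements of C.
split_results = [max_crossings({"A": 0, "B": 1, "C": c}) for c in (0, 1)]
# A and B kept together with C.
together = max_crossings({"A": 0, "B": 0, "C": 0})
print(split_results, together)
```

With A and B in different halves, either placement of C leaves a path that crosses twice (B -> C -> B or A -> C -> A), while keeping all three together gives zero crossings.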

@CircuitCoder
Contributor

CircuitCoder commented Oct 24, 2024

@SpriteOvO has just pushed an update, containing an initial PoC implementation of Arc-based state-directed partitioning.

In brief, the partitioning operates on arc.model, and is done by assigning each "real" state write (outputs and hardware states) to one of the partitioned arc.models (called "chunks"). The original state WAR legalization pass is not sufficient, as there are also races between chunks, and thus it is disabled when partitioning is enabled. Instead, the simulation entry point is split into two functions: The first (top_<chunk>_eval) does all computations and writes state updates into a temporary shadow state. The second (top_<chunk>_sync_eval) copies the shadow state into the persistent state.

Hence, a multithreaded driver should call the model in the following way:

For each thread,

<cycle start> -> Sync -> (the leader thread) updates IO
              -> Sync -> call top_<threadId>_eval
              -> Sync -> call top_<threadId>_sync_eval
              -> <cycle end, loop to cycle start>
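The calling sequence above maps onto a barrier-per-phase loop. The following is a toy model only: `chunk_eval` and `chunk_sync_eval` are hypothetical stand-ins for the generated top_<chunk>_eval and top_<chunk>_sync_eval entry points, and the "design" is two made-up registers.

```python
import threading

def run_lockstep(num_cycles):
    num_threads = 2
    barrier = threading.Barrier(num_threads)
    state = {"x": 1, "y": 2}                      # persistent state, shared
    shadow = [{} for _ in range(num_threads)]     # one shadow store per chunk
    io = {"cycle": 0}

    def chunk_eval(tid):
        # Reads any persistent state, writes only this chunk's shadow.
        if tid == 0:
            shadow[0]["x"] = state["x"] + state["y"]
        else:
            shadow[1]["y"] = state["x"]

    def chunk_sync_eval(tid):
        # Copies this chunk's shadow into the persistent state.
        state.update(shadow[tid])

    def worker(tid):
        for _ in range(num_cycles):
            barrier.wait()            # cycle start
            if tid == 0:
                io["cycle"] += 1      # leader thread updates IO
            barrier.wait()
            chunk_eval(tid)
            barrier.wait()            # nobody syncs until all evals are done
            chunk_sync_eval(tid)

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state, io

final_state, final_io = run_lockstep(3)
print(final_state, final_io)
```

Because every eval reads only the persistent state and every sync writes only its own chunk's registers, the result is independent of thread interleaving.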

Right now the partitioning happens at the end of the hw -> arc pipeline, contained in two passes: the first one (PartitionPass) is placed before state allocation. This pass marks state writes with assigned chunks and creates shadow states. The second one (PartitionClonePass) is placed after state allocation but before clock -> func. This pass splits the original arc.model into 2 * numChunks new arc.models, two for each chunk: top_<chunk> and top_<chunk>_sync.

Changes to other passes introduced in this commit

  • Introduced a new block-like op: arc.sync which resides inside the top-level block in an arc.model, or inside an arc.clock_tree. Its semantics is that the operations inside this op should be executed during the "sync" part of the chunk. This is used to facilitate race prevention (also RAW legalization within a chunk itself).

  • During the arc-lower-state pass, we added an option to create an additional storage for temporaries. It's used for the "old clock" state created in this pass, and also the shadow states created during partitioning. Thus each arc.model would have two storages: one for global data (I/O, hardware states) and is shared by all chunks, and one for temporaries used only in this chunk.

    This will ultimately get transformed into arguments of the model entry point. Now users of partitioned models would need to pass two pointers into the model, one shared by all threads, and one unique for each thread.

    Additionally, the "store old clock" operation is put in an arc.sync block if separate temporary storage is enabled, however adding a separate option may be a better choice.

  • The original WAR legalization is completely skipped if partitioning is requested.

  • arc-lower-clocks-to-funcs is modified to support multiple storages.

  • All AllocState ops would be created with an attribute partition-role to assist the partitioning pass.

Implementation detail

The PartitionPass does two things: Partition planning and shadow state creation.

Partition planning involves finding out the dependencies between states, and the computation workload required for updating each state. Based on this information, a partition plan is generated, and each arc.state_write op is marked with a chunk attribute if it is assigned to a chunk (typically this is the case for writes to global states).

Shadow states are created for each HW state, and are allocated in the chunks' temporary storage. Inside an arc.clock_tree, all state writes are redirected to the shadow storage. An arc.sync block is created inside the arc.clock_tree for each chunk, containing the operations for copying from the shadow state back to the global state.

Between these two passes, states are allocated, and now the size for each storage is fixed.

The PartitionClonePass is easier to explain: it clones the model twice for each chunk and removes operations with a chunk attribute not equal to the current chunk. Then, the two models are processed separately:

  • For the main model, all arc.sync ops are simply dropped.
  • For the sync model, all state writes besides those inside an arc.sync op are removed. arc.passthrough and arc.initial are also removed. Then, all arc.sync ops are moved to the end of the parent block and unwrapped. The following CSE pass is expected to clean up the other unused ops.
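To make the filtering concrete, here is a toy model of the cloning step. It is a sketch, not the pass itself: ops are plain dictionaries with made-up `kind`, `chunk`, and `body` fields, and the later CSE cleanup is not modeled.

```python
def ops_for_chunk(ops, chunk):
    # Keep ops without a chunk attribute and ops assigned to this chunk.
    return [op for op in ops if op.get("chunk") in (None, chunk)]

def main_model(ops, chunk):
    # Main model: arc.sync ops are simply dropped.
    return [op for op in ops_for_chunk(ops, chunk) if op["kind"] != "sync"]

def sync_model(ops, chunk):
    kept, tail = [], []
    for op in ops_for_chunk(ops, chunk):
        if op["kind"] == "sync":
            tail.extend(op["body"])   # unwrap and move to the end of the block
        elif op["kind"] not in ("state_write", "passthrough", "initial"):
            kept.append(op)
    return kept + tail

ops = [
    {"kind": "comb", "name": "add"},
    {"kind": "state_write", "name": "w_shadow0", "chunk": 0},
    {"kind": "state_write", "name": "w_shadow1", "chunk": 1},
    {"kind": "sync", "chunk": 0,
     "body": [{"kind": "state_write", "name": "copy0"}]},
    {"kind": "sync", "chunk": 1,
     "body": [{"kind": "state_write", "name": "copy1"}]},
    {"kind": "passthrough", "name": "pt"},
]

main_names = [op["name"] for op in main_model(ops, 0)]
sync_names = [op["name"] for op in sync_model(ops, 0)]
print(main_names, sync_names)
```

In the sync model the `add` op survives only until dead-code cleanup, which matches the expectation that a following pass removes unused ops.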

Unfinished works

There is some unfinished (but mostly isolated) work left to do. This list will be updated to reflect the implementation. Most notably, the partitioning planning algorithm is missing.

  • The partition planning is not implemented. Right now a random chunk is selected for each state. We plan to add an optional library to arcilator: METIS. When it's enabled, partitioning would use its graph partitioning algorithm. When it's disabled, we plan to sequentially split states into chunks with similar storage size.

  • Temporary state storage allocation is super inefficient. Right now it just uses the original lower state, which would allocate ~$\sum_{c \in chunks} temp_c$. Ideally, we would want ~$\max_{c \in chunks} temp_c$.

    Also, we should sort the global state storage based on who writes the state for less cacheline contention.

  • Partitioning with memories is untested and most certainly broken. For memory writes, we should create shadows for pair(addr, data) for each write port. Besides that, memory support should be relatively easy to implement.

  • Arcs right now have to be fully inlined for weight statistics to work properly. See questions below.

  • Done in d71d38f: Shadow states should only be created when needed (is accessed outside the writing chunk). See questions below for RAW within a chunk.

  • Once we figure out the algorithm, the partition-role attribute on state writes should use an enum, not a string. We switched to arc.task
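For the planned fallback without METIS (sequentially splitting states into chunks with similar storage size), one possible greedy scheme looks like this. It is a sketch under assumptions; the function name and the cut-off rule are illustrative, not the planned implementation.

```python
def split_sequential(state_sizes, num_chunks):
    """state_sizes: list of (state name, storage size) in allocation order.
    Walks the list once and starts a new chunk whenever the running total
    reaches the next multiple of the per-chunk target size."""
    total = sum(size for _, size in state_sizes)
    target = total / num_chunks
    chunks = [[] for _ in range(num_chunks)]
    idx, filled = 0, 0
    for name, size in state_sizes:
        if idx < num_chunks - 1 and filled >= target * (idx + 1):
            idx += 1
        chunks[idx].append(name)
        filled += size
    return chunks

state_sizes = [("a", 4), ("b", 4), ("c", 8), ("d", 8), ("e", 8)]
chunks = split_sequential(state_sizes, 2)
print(chunks)
```

This balances storage only. It ignores both communication and computation weight, which is why a graph partitioner remains the preferred option.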

Questions

We have a few questions regarding the inner workings of the other parts of arcilator, which may produce a better implementation for the partitioning procedure:

  • Before the inlining pass, is each arc guaranteed to generate exactly one output value? If this assumption is true, then finding out the computation workload for computing each value would be much easier, and we can move the partitioning planning pass before inlining.
  • Is it possible to simply move all state writes to the end of the block to resolve RAW hazards? The original legalization pass also creates shadow states, so it seems that I may need some hint at a counterexample.

@SpriteOvO SpriteOvO changed the title [HW] Automatic HW module partition [Arc] Automatic HW module partition Oct 24, 2024
@SpriteOvO SpriteOvO changed the title [Arc] Automatic HW module partition [Arc] Automatic module partition Oct 24, 2024
@fabianschuiki
Contributor

fabianschuiki commented Oct 24, 2024

This is some really cool work that you are doing 🥳! Very exciting 🙌

One thing I'm wondering is how this interacts with the LowerState improvements and bugfixes landing as part of #7703. That PR gets rid of the LegalizeStateUpdate pass (which I think you can't use for your data race hazards anyway), and it also gets rid of arc.clock_tree and arc.passthrough entirely. This means that the model will get a lot simpler and it might be easier to work with.

Do you think we could somehow create an arc.task op or similar to encode the eval -> sync -> eval -> sync -> eval -> ... chain of the different threads you were describing? The data hazards you describe at the thread/task boundaries are challenging, and it would be cool if the synchronization that is necessary would somehow be implicitly encoded or communicated in the IR. If we manage to build up a task structure in the IR, that might be a way to allow for many threads to crunch through the tasks in parallel. 🤔

Another thing we need to figure out is how to deal with memory, because that is going to be a major source of headache. I like your idea of aggregating memory writes into a list of changes to be applied later, and then flushing those into simulation memory in the sync phase. That would be very neat!
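The "list of changes flushed in the sync phase" idea can be modeled in a few lines. This is a sketch with a hypothetical `ShadowedMemory` class, not Arcilator's memory representation:

```python
class ShadowedMemory:
    """Memory whose writes are logged during eval and applied during sync."""

    def __init__(self, size):
        self.data = [0] * size
        self.pending = []            # (addr, value) pairs from this cycle

    def read(self, addr):
        # Reads during eval always see the value from the previous cycle.
        return self.data[addr]

    def write(self, addr, value):
        self.pending.append((addr, value))

    def sync(self):
        # Flush in program order, so the last write to an address wins.
        for addr, value in self.pending:
            self.data[addr] = value
        self.pending.clear()

mem = ShadowedMemory(4)
mem.write(1, 42)
before = mem.read(1)
mem.sync()
after = mem.read(1)
print(before, after)
```

A masked write port would log (addr, data, mask) triples instead and merge under the mask in `sync`.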

Your implementation currently requires quite a few attributes on operations. These attributes are discardable however, so everything in theory should still work even when you delete these attributes. Is there a way we could encode the partitioning using something like task ops, where you don't need to put attributes on ops and we could have a single lowering pass that takes all arc.tasks and pulls them out into separate functions that can be scheduled in a thread pool?

Having a representation of tasks in the IR would also allow you to group together states and memories that are only used inside a thread. The AllocateState pass already does this for you if all state allocation is nested inside a block, and you should be able to trigger this manually using an explicit arc.alloc_storage (not sure if this is fully implemented though). For example:

arc.task {
  %localState = arc.alloc_storage %arg0
  %0 = arc.alloc_state %localState
  %1 = arc.alloc_state %localState
  ...
}

should in theory allocate all state in the task into one contiguous block, which is then allocated in the global storage (which should be fine because it's just one region of memory that only this thread touches):

arc.task {
  %localState = arc.storage.get %arg0 ...  // get pointer to the subregion of memory where all local state is
  ...
}

@CircuitCoder
Contributor

Thanks for the insightful reply, @fabianschuiki! 😻

I briefly looked into #7703, and it seems that it can simplify the implementation a lot. All attributes introduced in this PR can actually be dropped if we rebase onto #7703 and use the arc.task abstraction. However, it made me realize that the register-as-clock use case may cause problems with our current cloning algorithm. If I understand correctly, after the LowerState pass in #7703, all state reads/writes now obey program order in MLIR. A clock that combinationally depends on a register will get lowered into an scf.if whose condition depends on at least one arc.state_read, which in turn would observe a prior arc.state_write (the register value in the new phase). Then the writes within the scf.if cannot be arbitrarily made parallel with the prior write to the register it depends on.

I haven't come up with a good solution to this problem. One way would be to allow arc.tasks to arbitrarily depend on each other (see below), but our original targeted use cases for partitioning include static scheduling, which means that it's better to avoid dependencies. Our current idea is to clone the computation subtree of the written value into the following read, which means some duplicated computation. Our rationale is that in real hardware designs, gated clocks or other clocks that depend on registers should not have deep combinational trees. This is somewhat equivalent to how we deal with other computations that are used in multiple chunks: they get duplicated (or don't get DCE'ed).

@SpriteOvO and I will try to rebase this PR onto #7703, and see if the idea works.

Task dependencies

Do you think we could somehow create an arc.task op or similar to encode the eval -> sync -> eval -> sync -> eval -> ... chain of the different threads you were describing?

Assuming that we place writes to the same state into the same task, then I think we might be able to infer a finer dependency relation between tasks. If all operations are placed into tasks, then one task needs to be scheduled after another if and only if it reads a state written by another task prior in program order. We don't even need extra structures (operands, attributes) in the IR. Let's use a fancy notation to represent the transitive closure of this relation: A <_T B means that A needs to be scheduled before B.

This relation can also give us a simple procedure to merge tasks. A and B can be merged if and only if there does not exist C, such that A <_T C and C <_T B. The merge procedure would be (assuming B follows A in program order):

  • Move all tasks between A and B: if C is between A and B in program order, move C before A if C <_T B, or move C after B otherwise.
  • Concat A and B
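The relation and the merge condition can be prototyped directly on read/write sets. This is a sketch under the stated assumption that all writes to a state live in one task; names such as `before_relation` are illustrative.

```python
def before_relation(tasks):
    """tasks: list of (name, reads, writes) in program order.
    Returns the set of pairs (a, b) with a <_T b."""
    rel = set()
    for j, (b, b_reads, _) in enumerate(tasks):
        for a, _, a_writes in tasks[:j]:
            if a_writes & b_reads:       # b reads a state written earlier by a
                rel.add((a, b))
    changed = True
    while changed:                       # naive transitive closure
        changed = False
        for a, b in list(rel):
            for c, d in list(rel):
                if b == c and (a, d) not in rel:
                    rel.add((a, d))
                    changed = True
    return rel

def can_merge(tasks, rel, a, b):
    # Mergeable iff no task C has to run after a and before b.
    return not any((a, c) in rel and (c, b) in rel for c, _, _ in tasks)

tasks = [
    ("A", set(),  {"s0"}),
    ("C", {"s0"}, {"s1"}),
    ("B", {"s1"}, {"s2"}),
    ("D", {"s0"}, {"s3"}),
]
rel = before_relation(tasks)
print(sorted(rel))
print(can_merge(tasks, rel, "A", "B"), can_merge(tasks, rel, "A", "D"))
```

A and B cannot be merged because C sits between them in the relation, whereas A and D can.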

Therefore we might be able to get an alternative bottom-up partitioning scheme by combining atomic arc.tasks, each of which contains the writes for only one state. It might better fit dynamic scheduling by producing some number of "constant-sized tasks", instead of a constant number of "tasks with similar size" given by a top-down graph partitioning algorithm. Using this bottom-up method, we might be able to relax the clock-depends-on-register cloning requirement, instead allowing dependencies between tasks with minimal loss of performance.

Nevertheless, for the top-down partitioning method, this kind of dependency can still be used to reflect the semantics of arc.sync, so arc.sync can just be replaced by it.

State grouping

After some thought, it seems to me that arc.alloc_storage is a better choice when used together with arc.task than our method of using another storage object. The benefit is that we can specify (in the IR) which task shares which storage.

The downside is that these storages would all be allocated flat, and the user of the model cannot control the allocation. It may be desirable for dynamically scheduled simulations to reuse unused storage spaces between tasks for better cache locality.

@CircuitCoder
Contributor

The rebase is done! In the last force pushed commit, the following modifications are made:

  • Rebased onto [Arc] Improve LowerState to never produce read-after-write conflicts #7703. Dropped the logic for handling shadow states that could be added by the old LowerState pass.
  • Added arc.task op, removed arc.sync op. Now partitioning is done through creating arc.task.
  • We noticed AllocateStatePass sometimes cannot handle pre-existing arc.alloc_storage, due to the fact that the sub-storage might be processed after the parent storage. We added a topo-sort step to avoid this problem.
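The ordering fix in the last item amounts to a depth-first post-order walk over storages, so that a sub-storage is laid out before the storage that contains it. A language-neutral sketch of that walk (with hypothetical dictionaries standing in for the pass's internal maps) is:

```python
def allocation_order(ops_by_storage, child_storages):
    """ops_by_storage: storage -> ops allocated in it (only keys matter here).
    child_storages: storage -> sub-storages allocated inside it.
    Returns storages in an order where children precede their parent."""
    order, done = [], set()

    def visit(storage):
        if storage in done:
            return
        for child in child_storages.get(storage, []):
            if child in ops_by_storage:   # only sub-storages that hold ops
                visit(child)
        order.append(storage)
        done.add(storage)

    for storage in ops_by_storage:
        visit(storage)
    return order

ops_by_storage = {"root": ["sub"], "sub": ["subsub", "s0"], "subsub": ["s1"]}
child_storages = {"root": ["sub"], "sub": ["subsub"]}
order = allocation_order(ops_by_storage, child_storages)
print(order)
```

Iterating the map in plain insertion order would have visited `root` first, before the sizes of `sub` and `subsub` were known.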

Currently, arc.task has one optional name attribute and one body block. The semantics is that arc.task would be lowered into a function (<model>_<task>_eval), and tasks with the same name would be merged. If the merge is not possible (due to memory side effects), an error should be emitted. Unnamed tasks are given a unique anonymous name.

Ideally, we would want arc.task to also have two variadic lists of operands: one for storages that are read within the task, one for storages that are written within the task. This would help the ordering/dependency analysis, and would also allow us to verify that the task really only uses these storages. However, I'm still battling with MLIR over adding multiple variadic operand lists, so this and the order violation check during task merging are not implemented yet.

We also found that states that control flow structures depend on cannot be written back within ordinary sync tasks, because other sync tasks may still depend on their old value (basically, the "old clock" states). We added an output task at the end of each lock-step simulation to write back these states and root outputs.

Now, a single step in statically-scheduled multithread lockstep simulation should look like this:

Threads Start           Sync                Sync                              Sync Loop
T 0     -> model_0_eval | model_0_sync_eval | model_output_eval -> inspect IO |
T 1     -> model_1_eval | model_1_sync_eval |                                 |
T 2     -> model_2_eval | model_2_sync_eval |                                 |
...        ...            ...              

The added test case is a small GCD calculator. Next, we'll try to implement support for memories and test on rocket.

@fabianschuiki
Contributor

Really cool stuff! Let me go through this in more detail. The tasks look very interesting 😍. We might be able to split this up into a few smaller PRs such that we can flesh out the different pieces step by step.

Comment on lines 65 to 87

// The allocation should run in (reversed?) topo-sorted order for preexisting
// substorages to work correctly.

DenseSet<Value> allocated;
std::function<void(Value)> visit;
visit = [&](Value parent) {
  if (allocated.contains(parent)) return;
  for (auto childAlloc : opsByStorage[parent])
    if (auto childStorageAlloc = dyn_cast<AllocStorageOp>(childAlloc)) {
      auto child = childStorageAlloc.getResult();
      if (opsByStorage.contains(child)) // Used by allocations in this block
        visit(child);
    }
  allocateOps(parent, block, opsByStorage[parent]);
  allocated.insert(parent);
};

// Actually allocate each operation.
for (auto &[storage, ops] : opsByStorage)
  allocateOps(storage, block, ops);
for (auto &[storage, _] : opsByStorage)
  visit(storage);
}
Contributor

Thanks a lot for looking into why this didn't work for your use case! Could you add a small test case to test/Dialect/Arc/allocate-state.mlir that's broken, and create a separate PR together with your fix for the AllocateState pass?

Contributor

We added the test in 7678feb, included in this PR. Would it be possible to first get this PR merged into a dev branch, and then break up the changes into smaller PRs?

Thank you!

if (operand.get() == modelStorageArg) {
operand.set(clockStorageArg);
if (storageMapping.contains(operand.get())) {
operand.set(storageMapping.lookup(operand.get()));
Contributor

Could you add a test to test/Dialect/Arc/lower-clocks-to-funcs.mlir that checks these new features? One or two new arc.model ops that trigger these additional arguments to be extracted would be nice. Could you then create a separate PR with just the improvements to LowerClocksToFuncs and that test?

Contributor

After looking closely, it turns out this code path is no longer used because we're not using two storage arguments in arc.model. Adding a test for this would trigger a verification error on arc.model saying it cannot have more than one argument, and I'm not confident to just drop that verification, and there is no longer a good reason to have more than one argument.

Therefore we reverted the changes in this file in the update.

@fabianschuiki
Contributor

It looks like there are three main pieces of work that you're doing:

  1. Figuring out how to partition the model (not yet implemented)
  2. Grouping ops into different tasks according to the partitioning
  3. Representing tasks, synchronization, and parallel threads executing the model

Since (1) is going to happen later and (2) does not yet deal with memories, would it make sense to focus on the task op in (3) first and flesh out how we can describe parallel computation and synchronization in the IR? Let me collect a few thoughts.

@fabianschuiki
Contributor

fabianschuiki commented Oct 30, 2024

It would be awesome if we could somehow get the arc.task ops to encode any necessary synchronization, dependencies, and general dataflow as SSA edges between multiple tasks. Ideally, there could be a bunch of arc.task ops and it would be apparent from the use-def chain how these tasks can be scheduled and where synchronization is necessary. Based on that we could add an LLVM lowering that actually implements the synchronization and execution of the tasks. Ideally the arc.model would just work with and without tasks in there, and the pipeline should produce LLVM IR that will either run sequentially or handle the scheduling/execution of the task, without the need for any C++ support code to do that.

The big question is how to represent the nice eval -> sync -> eval -> sync -> ... cycles that you have pointed out. I think eval and sync are actually the same thing from a scheduling/task point of view: the sync is just another task that depends on other tasks to finish before it can execute. But it additionally takes a bunch of values to be written as operands. Maybe we could fully encode the synchronization barrier as a task waiting for all its inputs to become available. The tasks could then pass along some token value to make sure they wait on each other.

Maybe something like this:

graph TD
t1c[Task 1 Compute & Local Writes]
t2c[Task 2 Compute & Local Writes]
t3c[Task 3 Compute & Local Writes]

t1s[Task 1 Write]
t2s[Task 2 Write]
t3s[Task 3 Write]

t1c -- Data to Write --> t1s
t1c -.-> t2s
t2c -.-> t1s
t2c -- Data to Write --> t2s
t3c -- Data to Write --> t3s

t1s -.-> t3c
t2s -.-> t3c

Here you'd have 6 separate tasks that depend on each other. The writes to global state are split into separate "Write" tasks, which allows them to have additional edges from other computation tasks to encode that other computations have to finish before the write in order to avoid RAW hazards. The corresponding MLIR could look like the following:

// Task 1 Compute
%t1_data, %t1 = arc.task : i42, !arc.token {
  %0 = arc.state_read %s0 : <i42>
  %1 = arc.state_read %s1 : <i42>
  %2 = arc.state_read %s2 : <i42>  // only accessed in this task
  %3 = comb.add %0, %1, %2 : i42
  arc.state_write %s2, %3 : <i42>
  %4 = arc.token_create  // used as scheduling dependency with Task 2 Write
  arc.yield %3, %4 : i42, !arc.token
}

// Task 2 Compute
%t2_data, %t2 = arc.task : i42, !arc.token {
  %0 = arc.state_read %s0 : <i42>
  %1 = arc.state_read %s1 : <i42>
  %2 = arc.state_read %s3 : <i42>  // only accessed in this task
  %3 = comb.mul %0, %1, %2 : i42
  arc.state_write %s3, %3 : <i42>
  %4 = arc.token_create  // used as scheduling dependency with Task 1 Write
  arc.yield %3, %4 : i42, !arc.token
}

// Task 1 Write
%w1 = arc.task {
  arc.token_use %t2  // avoid RAW with Task 2 Compute
  arc.state_write %s0, %t1_data
  %0 = arc.token_create  // used as scheduling dependency with Task 3 Compute
  arc.yield %0 : !arc.token
}

// Task 2 Write
%w2 = arc.task {
  arc.token_use %t1  // avoid RAW with Task 1 Compute
  arc.state_write %s1, %t2_data
  %0 = arc.token_create  // used as scheduling dependency with Task 3 Compute
  arc.yield %0 : !arc.token
}

// Task 3 Compute
%t3_data = arc.task : i42 {
  arc.token_use %w1  // avoid WAR with Task 1 Write
  arc.token_use %w2  // avoid WAR with Task 2 Write
  // These reads want to see the writes in Task 1/2
  %0 = arc.state_read %s0 : <i42>
  %1 = arc.state_read %s1 : <i42>
  %2 = comb.sub %0, %1 : i42
  arc.yield %2 : i42
}

// Task 3 Write
arc.task {
  arc.state_write %outputPort, %t3_data : <i42>
}

What do you think about a structure like this? The benefit of fleshing this out first independently in a few PRs would be that we can work out an LLVM lowering of these tasks to a loop that executes the next ready task (atomically based on a readiness bit in some shared control word). Once we get that to work we'll have a very nice foundation to automatically derive these tasks with a partitioning scheme and execute simulations across multiple threads.

(A task would be ready to execute when all other tasks whose SSA results it uses have finished executing. We can track that using bitmasks which we set/clear atomically at runtime.)
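As a rough illustration of the readiness-bitmask idea above, here is a minimal C++ sketch of such a loop. It is not the proposed LLVM lowering and not anything Arcilator emits; `Task`, `runTasks`, and the two control words `claimed` and `done` are hypothetical names introduced for this example. It assumes at most 64 tasks, an acyclic dependency graph, and that every bit in `deps` refers to an existing task (otherwise the loop would never terminate).

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical task descriptor: bit i of `deps` is set if this task uses a
// result of task i and therefore has to wait for it.
struct Task {
  uint64_t deps;
  std::function<void()> body;
};

// Runs every task exactly once on `numThreads` workers and returns once all
// tasks have finished.
inline void runTasks(const std::vector<Task> &tasks, unsigned numThreads) {
  assert(tasks.size() <= 64 && "one control word tracks at most 64 tasks");
  assert(numThreads > 0 && "need at least one worker");
  const uint64_t all =
      tasks.size() == 64 ? ~uint64_t{0} : (uint64_t{1} << tasks.size()) - 1;
  std::atomic<uint64_t> claimed{0}; // bit i set: a worker has taken task i
  std::atomic<uint64_t> done{0};    // bit i set: task i has finished

  auto worker = [&] {
    for (;;) {
      // Acquire pairs with the release below, so the side effects of every
      // task marked as finished are visible to this worker.
      uint64_t finished = done.load(std::memory_order_acquire);
      if (finished == all)
        return;
      bool ran = false;
      for (std::size_t i = 0; i < tasks.size() && !ran; ++i) {
        const uint64_t bit = uint64_t{1} << i;
        if ((tasks[i].deps & ~finished) != 0)
          continue; // at least one dependency has not finished yet
        // Claim the task atomically; skip it if another worker was faster.
        if (claimed.fetch_or(bit, std::memory_order_acq_rel) & bit)
          continue;
        tasks[i].body();
        done.fetch_or(bit, std::memory_order_release);
        ran = true; // re-read `done` before picking the next task
      }
      if (!ran)
        std::this_thread::yield(); // nothing ready right now
    }
  };

  std::vector<std::thread> threads;
  for (unsigned t = 0; t < numThreads; ++t)
    threads.emplace_back(worker);
  for (auto &thread : threads)
    thread.join();
}
```

For the six-task example, the two compute tasks would have `deps == 0`, both write tasks would depend on bits 0 and 1, "Task 3 Compute" on the two write tasks, and "Task 3 Write" on "Task 3 Compute". Busy-waiting with `yield` is only a placeholder here; a real executor would likely want something smarter when no task is ready.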

@CircuitCoder
Contributor

First, a little update: we added support for memory writes (using shadow states for addr, data, and mask) and got rocket-core from the arc-tests repo working with a statically scheduled two-thread parallel execution setup. However, it currently runs slower than non-partitioned serial execution:

  • Baseline serial execution without partitions: 354540 Hz
  • Two partitions with two threads: 294718 Hz
  • Two partitions with one thread: 180212 Hz

Objdump shows that computations are duplicated A LOT:

$ objdump -S ./build/large-master/rocket-partition.o | wc -l
169978
$ objdump -S ./build/large-master/rocket-arc.o | wc -l
90504

Hopefully things will get better after we add the graph partitioning library. We could also try dynamic scheduling, or add more synchronization points within each static partition.


Maybe we could fully encode the synchronization barrier as a task waiting for all its inputs to become available.

Yes, that would be ideal. However, I'm not sure how arcilator would insert the required synchronization operations, since there might be different executor designs. I'm currently leaning towards only encoding and emitting this information, and leaving the actual synchronization to the user of the model.

The structure & syntax for arc.task

Allowing tasks to produce and consume values seems like an excellent idea. For values that are used within a task but came from outside, is it possible to abstract them into block arguments? (a little bit similar to linalg.generic). This might make it easier to do dependency tracking without looking into the body of tasks:

%w2 = arc.task %t1, %t2_data {
^bb0(%a0: !arc.token, %a1: i42):
  arc.token_use %a0
  arc.state_write %s1, %a1
  %0 = arc.token_create
  arc.yield %0 : !arc.token
}

Or we can just omit arc.token_use in the body, and solely use the operands on arc.task to encode dependencies.
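To make the point about dependency tracking concrete, here is a small C++ sketch that derives task dependencies purely from operand and result lists, without looking into task bodies. It models SSA values as plain strings; `TaskOp` and `dependencyMasks` are hypothetical names for this illustration and do not correspond to any existing CIRCT or MLIR API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an arc.task op: only its SSA results and
// operands are modeled, the body is never inspected.
struct TaskOp {
  std::vector<std::string> results;
  std::vector<std::string> operands;
};

// Returns one bitmask per task; bit j of entry i is set if task i uses a
// result of task j. Operands not produced by any task (e.g. states) are
// ignored.
inline std::vector<uint64_t>
dependencyMasks(const std::vector<TaskOp> &tasks) {
  assert(tasks.size() <= 64 && "one mask tracks at most 64 tasks");
  // Map every result value to the index of the task that defines it.
  std::unordered_map<std::string, std::size_t> producer;
  for (std::size_t i = 0; i < tasks.size(); ++i)
    for (const auto &result : tasks[i].results)
      producer[result] = i;

  std::vector<uint64_t> masks(tasks.size(), 0);
  for (std::size_t i = 0; i < tasks.size(); ++i)
    for (const auto &operand : tasks[i].operands) {
      auto it = producer.find(operand);
      if (it != producer.end())
        masks[i] |= uint64_t{1} << it->second;
    }
  return masks;
}
```

With the two compute tasks at indices 0 and 1 and a write task taking `%t1` and `%t2_data` as operands, the write task's mask would have bits 0 and 1 set. The tokens become ordinary operands in this model, which is why a separate `arc.token_use` would no longer be needed to establish the edge.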

A small question regarding arc.token_use: Are arc.token_use potential yield points that can be arbitrarily placed within a task? Or are they handled just like a no-op, and a token is used to carry an artificial schedule-after relation between tasks?

@SpriteOvO SpriteOvO changed the base branch from main to arc-partition October 31, 2024 05:24
@SpriteOvO SpriteOvO force-pushed the partition branch 2 times, most recently from b37f650 to 72c9345 Compare October 31, 2024 05:55
@SpriteOvO SpriteOvO marked this pull request as ready for review October 31, 2024 06:50
@SpriteOvO SpriteOvO requested a review from maerhart as a code owner October 31, 2024 06:50
@SpriteOvO
Member Author

Following a private chat discussion between @sequencer and @fabianschuiki, this PR can now be merged as an initial stage of the work. Thanks to @fabianschuiki for reviewing and to @CircuitCoder for great mentoring and help with this work.

@SpriteOvO SpriteOvO merged commit 576a0a8 into llvm:arc-partition Oct 31, 2024
4 checks passed
@SpriteOvO SpriteOvO deleted the partition branch October 31, 2024 15:06
@fabianschuiki
Contributor

Thanks a lot for tracking this in a separate feature branch and chopping off multiple PRs from there! 🥳

@fabianschuiki
Contributor

fabianschuiki commented Oct 31, 2024

@CircuitCoder [...] and got rocket-core from the arc-tests repo working with a statically scheduled two-thread parallel execution setup

Wow, that's awesome! Congratulations 🥳. I'm sure we can figure out a way to make this run a lot faster 🙂. We might even find some simple greedy partitioning scheme before we pull in more complicated graph partitionings. Very nice!

[...] However, I'm not sure how arcilator would insert the required synchronization operations, since there might be different executor designs. I'm currently leaning towards only encoding and emitting this information, and leaving the actual synchronization to the user of the model.

This is what Arcilator tried to do initially: it only created functions for each clock, plus a passthrough, and expected the user of the model to interact with it. The problem with this approach turned out to be that as a user you really don't want to interact with the details of the model. As a user, you want Arcilator to just run your simulation, or produce a model that has a very straightforward way of interacting with it. Most of the work since the initial Arcilator implementation has gone into undoing this assumption that the user will interact with the model properly.

I fear the multithreading is going to be the exact same story: the user will want to be able to pass something like --threading to Arcilator and expect it to just work with or without that option. I totally agree with you that a power user will want to have access to the threading internals and schedule things on their own, for example by writing their own executors.

From a tooling point of view, it's often better to develop an end-to-end flow initially, and then allow the user to take more detailed control. This makes the implementation somewhat self-documenting: not only does Arcilator formulate tasks and dependencies among them, but it also has a systematic way of lowering those ops into something that just works in LLVM. This ensures that we can have that --threading option on or off and things will not break because of it. The user taking more detailed control of the executor is then an additional feature instead of the norm. I wouldn't be surprised if Arcilator ended up having multiple executors to choose from internally, plus an additional mode where the user can do the entire scheduling on their own 🙂.

Allowing tasks to produce and consume values seems like an excellent idea. For values that are used within a task but came from outside, is it possible to abstract them into block arguments? (a little bit similar to linalg.generic). This might make it easier to do dependency tracking without looking into the body of tasks:

Yes, that is an excellent idea! We can definitely do that 😃. I'm very much in favor of making these ops IsolatedFromAbove and forcing all data to flow through block arguments and results. If we do this, we might have to hold off on creating tasks until after AllocateState, because before that the states are all individual arc.alloc_state ops. If a task writes to 100 states shared with other tasks, it would need 100 block arguments. After AllocateState we can just pass in the !arc.storage. This will get better once @maerhart's planned improvements to the simulation data layout land.

If we want to create tasks earlier without having huge block argument lists, we could let the tasks capture values from the environment and only add IsolatedFromAbove once we have a better solution for arc.alloc_state and friends. There's a little nugget of code you can use to iterate over all dependencies in nested ops that go outside the root op:

  // Determine the uses of values defined outside the op.
  SmallVector<Value> externalOperands;
  op.walk([&](Operation *nestedOp) {
    for (auto value : nestedOp->getOperands())
      if (!op->isAncestor(value.getParentBlock()->getParentOp()))
        externalOperands.push_back(value);
  });

Or we can just omit arc.token_use in the body, and solely use the operands on arc.task to encode dependencies.
[...]
A small question regarding arc.token_use: Are arc.token_use potential yield points that can be arbitrarily placed within a task? Or are they handled just like a no-op, and a token is used to carry an artificial schedule-after relation between tasks?

I think we need to consider all operands as dependencies, not just !arc.token. The token was just an idea to create an additional artificial dependency between tasks in case there is a dependency between side-effects (e.g. RAW, WAR hazards). The token_use was just a no-op as you described, basically to get the task to have a use-def edge to some other task. But I do very much like your idea of just adding these as operands to a task and not having a token_use at all.

We could even make the tasks return an SSA value representing the task itself, as a replacement for the token. Something like this:

%task, %result1, %result2 = arc.task %op1, %op2, %otherTask, %op3 :
  (i42, i42, !arc.task, i42) -> (!arc.task, i42, i42)
{
}

The task op definition could include an explicit token or task result alongside a variadic set of results returned from the task body:

let results = (outs
  TaskType:$task,
  Variadic<AnyType>:$outputs
);

What do you think about this?


5 participants