Skip to content
This repository has been archived by the owner on Nov 5, 2024. It is now read-only.

Multiple cdos #10

Merged
merged 3 commits into from
May 24, 2024
Merged

Multiple cdos #10

merged 3 commits into from
May 24, 2024

Conversation

makslevental
Copy link
Collaborator

@makslevental makslevental commented May 24, 2024

The gist is the PR implements two ways to merge xclbins:

  1. by binding the xclbinutil APIs directly (in xaiepy/xclbinutil.cpp, which is a pybind extension)
  2. by shelling out (in examples/dump.py)

Currently it does work in that can run both kernels and you get numerical accuracy (the right numbers come out) but there is some kind of race condition or some lock beind held or something somewhere because it's slow (wouldn't be suprised if it were slower than actually loading individual xclbins). mlir-aie has a PR up (that I emulated) where the same thing does not occur but dump.py is literally their same code 🤷.

FYI I spent like 20 hours on this and it was horrible but what I learned was that the xclbins do literally nothing that isn't immediately undone by the driver. That means eventually we can get rid of all of this.

cc @powderluv

Comment on lines 8 to 11
XCLBIN_FILE = Path(__file__).parent.absolute() / f"twokernels_new.xclbin"
xclbinutil.merge_xclbins(
"pi.xclbin", "twopi.xclbin", "twopi_kernel.json", str(XCLBIN_FILE)
)
Copy link
Collaborator Author

@makslevental makslevental May 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nirvedhmeshram you can ignore basically everything in this PR except for this line.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested just running the same kernel twice from an xclbin that only has one kernel (in examples/xrt.py) and I see the same thing (actually that just fails). I have no clue but it would not surprise me if the xrt python bindings are just sloppy with respect to ref counting (and hence direct use in C++ doesn't face the problem). Obviously at this point it's safe to say the problem could be absolutely anywhere in the stack 🤷.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored to drop the context every time and now we have this

ert_cmd_state.ERT_CMD_STATE_ERROR
3.14

ert_cmd_state.ERT_CMD_STATE_COMPLETED
6.28

ert_cmd_state.ERT_CMD_STATE_COMPLETED
3.14

ert_cmd_state.ERT_CMD_STATE_COMPLETED
6.28

ert_cmd_state.ERT_CMD_STATE_COMPLETED
3.14

ert_cmd_state.ERT_CMD_STATE_COMPLETED
6.28

ert_cmd_state.ERT_CMD_STATE_COMPLETED
3.14

ert_cmd_state.ERT_CMD_STATE_COMPLETED
6.28

So something goes wrong the first time but not subsequent times. Note, I made a similar sort of change to the single cdo example and this isn't the behavior (no fails there).

@makslevental makslevental force-pushed the multiple_cdos branch 4 times, most recently from 265110b to 9ec92ec Compare May 24, 2024 11:46
@makslevental makslevental enabled auto-merge (squash) May 24, 2024 11:50
@makslevental
Copy link
Collaborator Author

I'm going to merge even though it's not perfect so that the wheels get built and so you can pip install from here https://github.com/nod-ai/prototype-aie-toolchain/releases/tag/release

@makslevental makslevental merged commit 6bb395b into main May 24, 2024
18 checks passed
@makslevental makslevental deleted the multiple_cdos branch May 24, 2024 12:11
@nirvedhmeshram
Copy link

nirvedhmeshram commented May 24, 2024

Thanks Maks! Will try it out.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants