Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issues with the AXI implementation #1071

Open
andrewb1999 opened this issue Jul 1, 2022 · 7 comments
Open

Issues with the AXI implementation #1071

andrewb1999 opened this issue Jul 1, 2022 · 7 comments
Labels
C: FPGA Changes for the FPGA backend Type: Tracker Track various tasks

Comments

@andrewb1999
Copy link
Collaborator

andrewb1999 commented Jul 1, 2022

I spent some time today getting the dot-product design working with AXI on real FPGAs. I just manually edited the verilog for now, but I'll document the issues I found so they can be fixed in the generator at some point. Here's the repo with the working design: https://github.com/andrewb1999/calyx-axi-ex.git

Actual Calyx issues (dot-product.sv)

It's a bad idea to initialize logic (especially resets) that are used as wires rather than registers. I just commented out the entire initial block in the verilog which prevented issues with std_regs not being reset properly (currently none of the logic defined in the main module is being used as a register so this is fine to do).

XML issues (kernel.xml)

The datawidth of the m axi ports need to be 512 because this is the width we actually use in hardware.

Wrapper (wrapper.v)

Starting from the top of wrapper.v here are all the bugs I found and changes I made.

  1. I changed memories_sent to just a single bit reg that checks if all three memory sends are done. This matters because with the original implementation memories_set would be treated as a true boolean if any of the memory sends are done, causing the axi controllers to reset as soon as any of the sends are done (instead of all of them).
  2. A0_addr0, B0_addr0, and v0_addr0 are the wrong bit width which causes Z signals to appear in simulation (and unpredictable behavior on real hardware). The correct thing to do here is define the wires as the correct bitwidth (4 for A0 and B0 and 1 for v0) and then append the appropriate number of zeros before passing into the Memory_controller.
  3. Changed the line:
assign ap_done = counter > timeout || memories_sent == 3'b1;

to

assign ap_done = memories_sent;

Two things are going on here. First, I removed the timeout as this can trigger ap_done being asserted before the data is ready. This could potentially be added back after verifying the rest of the design, but it just produces confusing behavior where the kernel thinks it finished when it actually didn't. Second, because of the previous change to memories_sent, it is now a single bit register so we don't have to do any comparison before setting it equal to ap_done. The previous implementation was also wrong even before changing the memories_sent semantics because ap_done wasn't asserted when all memories were sent, but rather when only A0 was sent (if any other memory was finished sending before A0, it would never assert ap_done).
4. Actually assigned the Done signal properly in SINGLE_PORT_BRAM (it was unconnected before).
5. Added an ap_idle signal to the control s axi lite. This signal is stored in rdata[2] and initialized to 1, becomes 0 when ap_start is asserted, and becomes 1 when ap_done is asserted. This allows the design to work properly with PYNQ and other implementations that rely on the ap_idle signal.
6. Changed all lines in the memory controllers that look like:

assign ARADDR = BASE_ADDRESS + {{58{1'b0}}, copy_addr_offset} << 2;

or

assign AWADDR = BASE_ADDRESS + {{58{1'b0}}, send_addr_offset} << 2;

to

assign ARADDR = BASE_ADDRESS + ({{58{1'b0}}, copy_addr_offset} << 2);

or

assign AWADDR = BASE_ADDRESS + ({{58{1'b0}}, send_addr_offset} << 2);

The parenthesis are required because verilog does the addition before the left shift to it previously would give the wrong ARADDR or AWADDR if the BASE_ADDRESS was anything other than 0.
7. Changed all assigns to WDATA from:

assign WDATA = {{15{bram_read_data}}, bram_read_data};

to

assign WDATA = {{15{8'd0}}, bram_read_data};

This isn't strictly necessary but it prevents writing multiple copies of the output data to the buffer if some of the other signals are wrong.
8. Changed all assigns to WSTRB from:

assign WSTRB = {{15{4'hF}}, 4'hF};

to

assign WSTRB = {{15{4'h0}}, 4'hF};

WSTRB is an axi signal where each bit corresponds to a single byte of the full axi data width. We are using a 512 bit data width on our axi signals, so we have a 64 bit WSTRB value (512/8). We are only sending one uint32 per axi transfer so only the bottom 4 bytes are valid in the axi transfer (therefore we only want to set the bottom four bits of the WSTRB to 1).
9. In the v0 memory controller (Memory_controller_axi_2), I changed:

assign copy_done = copy_addr_offset == 32;

and

assign send_done = send_addr_offset == 32;

to

assign copy_done = copy_addr_offset == 1;

and

assign send_done = send_addr_offset == 1;

This is the most important change and the primary reason we saw no outputs before. Because of how the module increments the bram address and the buffer address, the previous version would write the correct data on the first axi transfer and then write 0 to the same buffer location 31 times (which would obviously overwrite the value). Since this buffer only contains a single uint32, we only want to write values once anyways. Obviously, there is more work that needs to be done here to ensure that this works for all potential buffer sizes.

We should really also change these values from 32 to 8 in the other two memory controllers, although because of how the buffers are allocated and the memory controllers work, it still works with these values set to 32.

Other comment:
At some point we should update the generator to use a minimal version of bursting over AXI. Essentially all this does is pack multiple elements into one 512 bit data width, which reduces the number of axi transfers required significantly with minimal effort. I have an example for read bursting in the opt branch of the calyx-axi-ex repo, but obviously if we get to this point I can help out more.

@andrewb1999
Copy link
Collaborator Author

@nathanielnrn @sampsyo @rachitnigam I updated this issue with all the bugs I found and explanations. Let me know if you have any questions, I have a pretty good grasp on what's going on now so I should be able to help explain stuff.

@rachitnigam
Copy link
Contributor

Thank you for the incredible debugging work @andrewb1999!! This will be really useful in getting a push button FPGA flows for calyx stuff!

@nathanielnrn nathanielnrn added the C: Calyx Extension or change to the Calyx IL label Jul 5, 2022
@nathanielnrn
Copy link
Contributor

This is amazing! I'm implementing changes to code generation now.

Sorry if this question has an obvious answer, I looked around but couldn't find one. Thought it might save time asking here.

Where exactly does the kernel.xml interface with verilog code? In other words, would changes need to be made to any port widths in verilog with respect to

The datawidth of the m axi ports need to be 512 because this is the width we actually use in hardware.

@andrewb1999
Copy link
Collaborator Author

andrewb1999 commented Jul 5, 2022

@nathanielnrn The kernel.xml doesn't really "interface" with the verilog per-say, it is just a simpler description of the underlying verilog that XRT uses to know how to interact with the kernel. What's important here is that the data width of the RDATA and WDATA channels match the datawidth that we specify in the kernel.xml. So for example, m0_axi_RDATA has a datawidth of 512 in the verilog (wire[511:0]), so in the kernel.xml we need to specify that the m0_axi port has a dataWidth of 512. What is currently generated results in verilog where m0_axi_RDATA has a datawidth of 512 (which is correct), but the kernel.xml says the datawidth is actually 64. Does that make sense?

@sampsyo
Copy link
Contributor

sampsyo commented Jul 6, 2022

Maybe one way of putting it is that the kernel.xml file describes what's going on in the Verilog.

One thing that's still unclear to me is where the width 512 comes from in the generator code. This is where the data width seems to stem from:
https://github.com/cucapra/calyx/blob/943a50ebdb1533dabfc6f82f638b43d5744fe7c0/src/backend/xilinx/toplevel.rs#L46

…and there it's set to 32. So this may require some debugging!

@andrewb1999
Copy link
Collaborator Author

That's actually the Control_axi instantiation, so 32 is the correct value there. The issue is with the memory module here:

https://github.com/cucapra/calyx/blob/943a50ebdb1533dabfc6f82f638b43d5744fe7c0/src/backend/xilinx/toplevel.rs#L50

There is a 512 there, but there's also a 64 and a 32 (for address width and the data width of the argument). I think there's just a mismatch somewhere of what value needs to go into the kernel.xml.

@sampsyo
Copy link
Contributor

sampsyo commented Jul 7, 2022

Aha, thanks for the correction, @andrewb1999!! That would explain it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C: FPGA Changes for the FPGA backend Type: Tracker Track various tasks
Projects
None yet
Development

No branches or pull requests

4 participants