Add procs for interfacing with AXI peripherals in ZSTD Decoder #1613

Merged
merged 12 commits into from
Oct 21, 2024

Contributor

@rw1nkler rw1nkler commented Sep 18, 2024

This PR introduces procs that let DSLX modules interface with AXI subordinates. The attached README provides detailed documentation of the new functionality. The provided Verilog simulations show that the new procs can read from and write to RAM located on the AXI bus.

While working on this PR, we encountered a potential issue linked to afc720d, which appears to cause indefinite parsing and type-checking loops for certain files, such as mem_writer.x (provided in this PR).


@lpawelcz
Contributor

Moved the cocotb tests to a separate PR: #1616

@lpawelcz
Contributor

Added performance improvements for the MemWriter proc.

The PR is ready for review, @proppy please take a look.

dslx_top = "AxiReaderInst",
library = ":axi_reader_dslx",
opt_ir_args = {
"top": "__axi_reader__AxiReaderInst__AxiReader_0__16_32_4_4_4_3_2_14_next",
Member

is that still required?

Contributor

Yes, it is required for IR benchmarks because XLS does not allow optimizing the IR of procs that have an empty next().

This is very often the case for parameterized procs.
We must specify an additional Instance proc that spawns the proc in question with specific parameters.
In such a case, the proc marked as the default top in the IR is the Instance proc, which has an empty next(). To benchmark the IR, we have to point to the internal proc as the new top because that proc contains all of the logic.

"streaming_channel_data_suffix": "_data",
"flop_inputs_kind": "skid",
"flop_outputs_kind": "skid",
"clock_period_ps": "1800",
Member

any reason the clock period is different here? should we keep a list of those at the top for easier tuning?

Contributor

The critical path delay from the IR benchmark for this proc exceeds the 750ps that we use throughout the ZSTD Decoder. To give the scheduler valid constraints, we increased the clock period to a value close to the critical path delay. Moved this to a variable at the top of the file, where the 750ps clock period is also defined.
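
A minimal sketch of the arrangement described in this reply, written as plain Python that mirrors the Starlark in the BUILD file. Only `CLOCK_PERIOD_PS = "750"`, the `common_codegen_args | {...}` pattern, and the 1300ps value for `axi_stream_remove_empty` appear in this PR; the override constant name and the exact contents of `common_codegen_args` are assumptions made for illustration.

```python
# Default clock period used throughout the ZSTD Decoder (value from the PR).
CLOCK_PERIOD_PS = "750"

# Override for a proc whose critical path exceeds 750ps.
# Hypothetical constant name; the 1300ps value comes from the review snippet.
AXI_STREAM_REMOVE_EMPTY_CLOCK_PERIOD_PS = "1300"

# Assumed shape of the shared arguments; the real dict may hold other keys.
common_codegen_args = {
    "clock_period_ps": CLOCK_PERIOD_PS,
    "streaming_channel_data_suffix": "_data",
    "flop_inputs_kind": "skid",
    "flop_outputs_kind": "skid",
}

# Dict union: keys on the right-hand side replace the shared defaults.
axi_stream_remove_empty_codegen_args = common_codegen_args | {
    "module_name": "axi_stream_remove_empty",
    "clock_period_ps": AXI_STREAM_REMOVE_EMPTY_CLOCK_PERIOD_PS,
}
```

Keeping every period as a named constant next to the default means a tuning pass only has to touch one place in the file.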


axi_stream_remove_empty_codegen_args = common_codegen_args | {
"module_name": "axi_stream_remove_empty",
"clock_period_ps": "1300",
Member

any reason the clock period is different here? should we keep a list of those at the top for easier tuning?

Contributor

The critical path delay from the IR benchmark for this proc exceeds the 750ps that we use throughout the ZSTD Decoder. To give the scheduler valid constraints, we increased the clock period to a value close to the critical path delay. Moved this to a variable at the top of the file, where the 750ps clock period is also defined.

"streaming_channel_data_suffix": "_data",
"flop_inputs_kind": "skid",
"flop_outputs_kind": "skid",
"clock_period_ps": "2600",
Member

any reason the clock period is different here? should we keep a list of those at the top for easier tuning?

Contributor

The critical path delay from the IR benchmark for this proc exceeds the 750ps that we use throughout the ZSTD Decoder. To give the scheduler valid constraints, we increased the clock period to a value close to the critical path delay. Moved this to a variable at the top of the file, where the 750ps clock period is also defined.

4. Wait for the response submitted on the `resp_s` channel, which indicates
if the write operation was successful or an error occurred.

# Cocotb Simulation
Member

can we move this section to #1616?

Contributor

Done

@proppy
Member

proppy commented Sep 24, 2024

The tests take quite a long time to execute:

[4,813 / 4,816] 18 / 21 tests; Parsing and type checking DSLX source files of target axi_writer_dslx; 2688s linux-sandbox ... (3 actions running)

is that expected?

@rw1nkler
Contributor Author

rw1nkler commented Sep 24, 2024

The tests take quite a long time to execute:

[4,813 / 4,816] 18 / 21 tests; Parsing and type checking DSLX source files of target axi_writer_dslx; 2688s linux-sandbox ... (3 actions running)

is that expected?

This is the problem described in #1615. Reverting afc720d allowed us to pass the type-checking step, but the issue most probably requires a proper fix rather than reverting the functionality.

@proppy
Member

proppy commented Oct 1, 2024

according to #1615 (comment), does this need to be updated?

@lpawelcz
Contributor

lpawelcz commented Oct 4, 2024

@proppy
We rebased the PR and fixed the codebase according to #1615 (comment).

After the rebase we experienced a regression in the place_and_route rule for the SequenceExecutor, hence 9282bfb.

Additionally, we noticed that bazel run calls in https://github.com/google/xls/blob/main/.github/workflows/modules-zstd.yml#L63 execute only the first bazel target from the bazel query output.
In 94a6d6d we fixed this by explicitly looping through all the targets and calling bazel run for one target at a time.

We addressed your review comments and slightly changed the interface of the MemReader/Writer by removing the Ctrl channel that was used to configure the procs with a base address. Now the requests to the procs contain absolute addresses instead of offsets from the base address.

@proppy
Member

proppy commented Oct 10, 2024

After the rebase we experienced a regression in the place_and_route rule for the SequenceExecutor, hence 9282bfb.

did you report it to https://github.com/hdl/bazel_rules_hdl ? /cc @mikesinouye @QuantamHD

@proppy
Member

proppy commented Oct 10, 2024

Additionally, we noticed that bazel run calls in https://github.com/google/xls/blob/main/.github/workflows/modules-zstd.yml#L63 execute only the first bazel target from the bazel query output.

That sounds weird, can you share the shell-expanded bazel command line that only runs the first target?

)

CLOCK_PERIOD_PS = "750"
# Clock periods for modules that exceed the 750ps critical path in IR benchmark
Member

would increasing pipelining help?

Contributor

This is not enough for the scheduler as we constrain it from 3 sides:

  • worst case throughput - 1
  • target clock period - 750ps
  • pipeline stages - as many as required

Combining worst_case_throughput==1 with a fixed clock period means that each next() evaluation must complete within the given clock period.

Currently, some of the procs don't meet this requirement, and scheduling them fails with a message asking to increase the worst_case_throughput. We can do that, or we can stop enforcing a specific clock period for such procs. That way, the module would still evaluate the next() function in a single clock cycle, and we would still have an estimate of the maximum clock period from the IR benchmark.
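
The reasoning in this reply can be sketched as a simple check. This is a deliberately simplified model of the argument, not of the XLS scheduler itself; the function names are hypothetical, and the 1250ps delay is an illustrative number, not a measurement from this PR.

```python
def fits_in_one_cycle(critical_path_ps: int, clock_period_ps: int) -> bool:
    """With worst_case_throughput == 1, one next() evaluation has to
    complete within a single clock period (simplified model)."""
    return critical_path_ps <= clock_period_ps


def relaxed_clock_period_ps(critical_path_ps: int, target_ps: int = 750) -> int:
    """Keep the target period when it is met, otherwise fall back to the
    critical path delay reported by the IR benchmark."""
    return max(critical_path_ps, target_ps)


# Illustrative delay only: a proc that misses 750ps but fits in 1300ps.
example_delay_ps = 1250
print(fits_in_one_cycle(example_delay_ps, 750))
print(fits_in_one_cycle(example_delay_ps, 1300))
print(relaxed_clock_period_ps(example_delay_ps))
```

Adding pipeline stages does not change the first check in this model, which is why the reply treats pipelining alone as insufficient.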

src = ":axi_reader_verilog.opt.ir",
benchmark_ir_args = axi_reader_codegen_args | {
"pipeline_stages": "10",
"top": "__axi_reader__AxiReaderInst__AxiReader_0__16_32_4_4_4_3_2_14_next",
Member

Contributor

It seems that this rule works mostly for simple DSLX functions. I was not able to generate the correct name of the proc with it. Tried e.g.:

  • get_mangled_ir_symbol("AxiReaderInst", "AxiReader", (16, 32, 4, 4, 4, 3, 2, 14)), got: __AxiReaderInst__AxiReader__16_32_4_4_4_3_2_14
  • get_mangled_ir_symbol("AxiReaderInst", "AxiReader", is_proc_next=True), got: __AxiReaderInst__AxiReadernext

It looks like the next() function of a parameterized proc instance that is spawned in another proc is not supported.

Member

ah you're right:

  is_proc_next: A boolean flag denoting whether the symbol is a
    next proc function. The argument is mutually exclusive with arguments:
   'parametric_values' and 'is_implicit_token'.

We should file a feature request for this to be supported. One workaround I can see is to define a wrapper proc rather than putting the parameters in the BUILD file.
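
Until such support exists, the hard-coded `top` string can at least be assembled from its parts. The helper below is hypothetical and not part of XLS; its naming pattern is inferred solely from the single mangled name quoted in this thread, so it may not hold for other procs.

```python
def mangled_next_name(module, parent_proc, proc, instance_index, parametrics):
    """Compose the name of a spawned parametric proc's next() function.

    Pattern inferred from one example in this PR; treat it as an
    assumption, not as the XLS mangling specification.
    """
    params = "_".join(str(p) for p in parametrics)
    return f"__{module}__{parent_proc}__{proc}_{instance_index}__{params}_next"


top = mangled_next_name(
    module="axi_reader",
    parent_proc="AxiReaderInst",
    proc="AxiReader",
    instance_index=0,
    parametrics=(16, 32, 4, 4, 4, 3, 2, 14),
)
print(top)
```

For the arguments shown, this reproduces the `top` value used in the BUILD snippets above.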

@lpawelcz
Contributor

Additionally, we noticed that bazel run calls in https://github.com/google/xls/blob/main/.github/workflows/modules-zstd.yml#L63 execute only the first bazel target from the bazel query output.

That sounds weird, can you share the shell-expanded bazel command line that only runs the first target?

That would be for example:

bazel run -c opt -- //xls/modules/zstd:axi_csr_accessor_opt_ir_benchmark //xls/modules/zstd:block_header_dec_opt_ir_benchmark //xls/modules/zstd:csr_config_opt_ir_benchmark //xls/modules/zstd:dec_mux_opt_ir_benchmark //xls/modules/zstd:frame_header_dec_opt_ir_benchmark //xls/modules/zstd:raw_block_dec_opt_ir_benchmark //xls/modules/zstd:repacketizer_opt_ir_benchmark //xls/modules/zstd:rle_block_dec_opt_ir_benchmark //xls/modules/zstd:sequence_executor_opt_ir_benchmark //xls/modules/zstd:window_buffer_opt_ir_benchmark //xls/modules/zstd:zstd_dec_internal_opt_ir_benchmark //xls/modules/zstd/memory:axi_reader_opt_ir_benchmark //xls/modules/zstd/memory:axi_stream_add_empty_opt_ir_benchmark //xls/modules/zstd/memory:axi_stream_downscaler_opt_ir_benchmark //xls/modules/zstd/memory:axi_stream_remove_empty_opt_ir_benchmark //xls/modules/zstd/memory:axi_writer_opt_ir_benchmark //xls/modules/zstd/memory:mem_reader_internal_opt_ir_benchmark

This will execute the IR benchmark only for the axi_csr_accessor proc.
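
A rough sketch of the one-target-at-a-time approach, shown as Python that only prints the commands. The actual workflow change in 94a6d6d is not reproduced in this thread, so the function name and structure here are assumptions; the target labels are taken from the command line above.

```python
import shlex


def per_target_commands(targets, config="opt"):
    """Return one `bazel run` argv per target.

    `bazel run` builds and runs a single target, so each label gets
    its own invocation instead of sharing one command line.
    """
    return [["bazel", "run", "-c", config, "--", target] for target in targets]


targets = [
    "//xls/modules/zstd:axi_csr_accessor_opt_ir_benchmark",
    "//xls/modules/zstd:block_header_dec_opt_ir_benchmark",
    "//xls/modules/zstd/memory:axi_reader_opt_ir_benchmark",
]

for argv in per_target_commands(targets):
    # Print instead of executing so the sketch runs without Bazel installed.
    print(shlex.join(argv))
```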

bazel run executes only the first target from the list of targets acquired
from the output of bazel query. In order to properly call all targets
it is required to loop through the targets and run one at a time

Signed-off-by: Pawel Czarnecki <pczarnecki@antmicro.com>
Required to pass place_and_route

Signed-off-by: Pawel Czarnecki <pczarnecki@antmicro.com>
rw1nkler and others added 10 commits October 11, 2024 10:09
Co-authored-by: Michal Czyz <mczyz@antmicro.com>
Signed-off-by: Robert Winkler <rwinkler@antmicro.com>
Signed-off-by: Robert Winkler <rwinkler@antmicro.com>
This commit adds an implementation of the AxiReader proc that can be used
to issue AXI read requests as an AXI Manager device.

Signed-off-by: Robert Winkler <rwinkler@antmicro.com>
This commit adds the AxiStreamRemoveEmpty proc, which can be used
to remove bytes marked as containing no data in the Axi Stream

Signed-off-by: Robert Winkler <rwinkler@antmicro.com>
This commit adds AxiStreamDownscaler, which can be used to convert AxiStream
transactions from a wider bus to multiple transactions on a narrower bus

Signed-off-by: Robert Winkler <rwinkler@antmicro.com>
This commit adds MemReader and MemReaderAdv procs for handling
read transactions on the AXI bus.

Signed-off-by: Robert Winkler <rwinkler@antmicro.com>
Internal-tag: [#62924]

Co-authred-by: Pawel Czarnecki <pczarnecki@antmicro.com>
Co-authred-by: Robert Winkler <rwinkler@antmicro.com>
Signed-off-by: Michal Czyz <mczyz@antmicro.com>
Signed-off-by: Pawel Czarnecki <pczarnecki@antmicro.com>
Signed-off-by: Robert Winkler <rwinkler@antmicro.com>
Internal-tag: [#64376]
Signed-off-by: Pawel Czarnecki <pczarnecki@antmicro.com>
Internal-tag: [#65205]
Signed-off-by: Pawel Czarnecki <pczarnecki@antmicro.com>
Co-authored-by: Pawel Czarnecki <pczarnecki@antmicro.com>
Signed-off-by: Robert Winkler <rwinkler@antmicro.com>
@proppy
Member

proppy commented Oct 17, 2024

This will execute the IR benchmark only for the axi_csr_accessor proc.

Looks like this is by design:
https://bazel.build/docs/user-manual#running-executables

The bazel run command is similar to bazel build, except it is used to build and run a single target.

load("@rules_hdl//place_and_route:build_defs.bzl", "place_and_route")
load("@rules_hdl//synthesis:build_defs.bzl", "benchmark_synth", "synthesize_rtl")
load("@rules_hdl//verilog:providers.bzl", "verilog_library")
load("@xls_pip_deps//:requirements.bzl", "requirement")
Member

This looks like an unused import.

Can you run

buildifier --lint=fix xls/modules/zstd/memory/BUILD

on it? (Sorry, we should have built this into the CI.)

Member

Fixing this now on import, but it is always a good idea to run buildifier beforehand.

@copybara-service copybara-service bot merged commit 1130740 into google:main Oct 21, 2024
5 checks passed