Very cool work! Is part of this effort to introduce a type system then? I'm imagining a case where I want to use a
There's a lot of useful hardware that will use a mixture of precisions, and even quantization. Is that within scope?
@bcarlet and @owolabileg and I have been brainstorming for some time about a project that would bring proper support for numerical computation to the Calyx ecosystem. It builds on ideas from #686 but goes a bit further. I think it's a potentially very exciting direction, so I'm writing this down now as a semi-formal proposal for discussion and iteration.
Motivation
We want to give people the tools they need to design good accelerators, and realistic accelerators almost always have to do some kind of numerical computation. That numerical stuff also plays a big role in the efficiency and effectiveness of the hardware: picking the right numerical representation can often make or break the viability of an accelerator project.
Calyx has never gotten in the way of doing whatever you want with numerical representations in your accelerator, but it has also never been all that helpful. Because numerics are so central to the viability of accelerators, and because we purport to help people design good accelerators, I think we should be helpful.
Here are some ways in which Calyx is currently unhelpful when it comes to numerics:

- There is no support for even basic math functions like `sqrt` and `pow`, and we have, at great expense, crafted generator-based implementations of `exp` and `ln`. This is not a sustainable way to support general math computations. See Proposal: libm generator for Calyx (#686) for a more complete argument.
- Switching between numerical representations is entirely manual. Nothing reminds you to use `std_fp_sdiv_pipe` when it comes time to replace your `std_div_pipe`, i.e., when you want to add a fractional part and a sign bit. You need to change all of these in concert, and nothing (not even a type system) checks that you have consistently changed everything together.

The goal of this project is to solve all of these problems.
Comparison to SOTA
In the end, we aim to "beat" the approaches to dealing with numerics in both RTL and classic HLS:

- Classic HLS tools like Vivado HLS offer arbitrary-precision numeric types and a `libm` substitute for the standard math functions. You can get some of the desired "convenient switching" by doing a simple `typedef`, like `typedef ap_fixed<11, 6> mynum`, and then using `mynum` everywhere for all your numbers. However, because its `libm` substitute consists of pre-generated library implementations, it is limited: its fixed-point library, for example, only covers a subset of the functions, and it only goes up to 32 bits; and its floating-point library only supports standard sizes. Also, every math function implementation is "one-size-fits-all"; there is no way to get different efficiency/accuracy trade-offs for a given function depending on application demands. Finally, all math operations (primitive arithmetic and math library calls) are implemented individually; in contrast, we have an opportunity to analyze the program to understand their use in context.
- Much the same applies to other numeric type libraries such as `ac_float`, for similar reasons: the library-based (as opposed to compiler-based) approach.

Uniqueness of the ADL Setting
An important question to ask about this motivation is: "Don't all these problems exist in high-performance software too? And therefore, wouldn't you expect that they'd already be solved in that context?"
There are a few reasons that the accelerator design language (ADL) setting differs from the software setting, and hence why the solutions don't already exist.
Project Synopsis
The proposed project is to build a "good" compiler from FPCore to Calyx that supports convenient switching between numerical formats. FPCore is the language for the FPBench suite of numerical programs, so that's our benchmark suite (or at least a starting point for it). Our goal is to beat commercial HLS tools, starting with Vivado HLS, on the accuracy/efficiency Pareto frontier: we will generate more efficient (smaller and faster) hardware at every target precision level.
The project does not include any fancy way to automatically tune precision. To the extent that we do try out multiple precision levels, we use bog-standard autotuning ("grid search" the precision parameters and evaluate the resulting compiled program empirically).
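To make that concrete, here is a minimal sketch of what the grid search could look like. The `compile_fpcore` and `evaluate` helpers are hypothetical stand-ins for the real toolchain (compile FPCore to Calyx at a given precision, simulate, and measure), not actual APIs:

```python
def grid_search(fpcore_src, inputs, widths=range(8, 33, 4)):
    """Exhaustively sweep fixed-point precision parameters.

    compile_fpcore and evaluate are hypothetical placeholders for the
    real flow: compile to Calyx at the given precision, simulate the
    design on the inputs, and measure the results.
    """
    results = []
    for width in widths:
        for frac in range(1, width):  # leave at least one integer bit
            design = compile_fpcore(fpcore_src, width, frac)
            error, area, latency = evaluate(design, inputs)
            results.append({"width": width, "frac": frac,
                            "error": error, "area": area,
                            "latency": latency})
    return results
```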
MVP Plan
This section proposes a work plan for an "MVP" version of this project, i.e., the minimum I think we need to build to write a good paper or whatever. There are many possible exciting extensions, and I've tried to separate all of them from this MVP version (they are in the next section). I expect that we might end up pursuing some of the extensions during the first phase anyway, but for clarity, I think it's important to focus on the simplest possible version, especially given that even that is pretty engineering-heavy.
1. FPCore to Calyx Compiler
The first phase does the basic work to generate hardware from FPCore code via Calyx. We only use the currently built-in fixed-point numbers and integers. We do not support (libm) math functions.
The compiler takes, as an argument, the single, global fixed-point precision to use for everything in the entire program. FPCore supports explicit precision annotations like `(! :precision binary80 (some subcomputation))`. I think we should just reject these programs for now and only consider the ones that are "pure math" expressions. This gives us the freedom to choose the precision ourselves (easily, with a single global knob).

We do differential testing against an existing implementation of FPCore, such as the reference interpreter.
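As a sketch of what that differential testing could look like at the numerical level, here is a toy harness that checks an emulated fixed-point evaluation of x*x + 1 against a double-precision reference. The error tolerance is an illustrative assumption, not a derived bound:

```python
import random

def to_fixed(x, frac):
    return round(x * (1 << frac))

def to_float(x, frac):
    return x / (1 << frac)

def fixed_mul(a, b, frac):
    # Model the truncating multiply that fixed-point hardware performs.
    return (a * b) >> frac

def diff_test(trials=1000, frac=16):
    """Toy differential test: fixed-point x*x + 1 vs. the float reference."""
    tol = 2.0 ** -(frac - 2)  # a few ULPs of slack (illustrative assumption)
    for _ in range(trials):
        x = random.uniform(-1.0, 1.0)
        xf = to_fixed(x, frac)
        got = to_float(fixed_mul(xf, xf, frac) + to_fixed(1.0, frac), frac)
        assert abs(got - (x * x + 1.0)) <= tol, (x, got)

diff_test()
```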
To run the programs, we need a way to generate (Calyx-compatible) data input files. We will use fud's current data format conversion infrastructure to convert floating-point inputs to fixed-point data for the specific hardware instance.
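The core of that conversion is small. A minimal sketch, assuming saturation on overflow (one possible policy; wrapping or raising an error are equally valid choices for a real converter):

```python
def float_to_fixed(value, width, frac_bits, signed=True):
    """Quantize a float to a width-bit two's-complement fixed-point pattern.

    Saturating on overflow is an illustrative choice for this sketch.
    """
    scaled = round(value * (1 << frac_bits))
    lo = -(1 << (width - 1)) if signed else 0
    hi = (1 << (width - 1)) - 1 if signed else (1 << width) - 1
    scaled = max(lo, min(hi, scaled))   # saturate out-of-range values
    return scaled & ((1 << width) - 1)  # two's-complement bit pattern

print(f"{float_to_fixed(-1.5, 8, 4):08b}")  # -1.5 in Q4.4 -> 11101000
```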
2. Generate Math Functions
This phase uses the simplest approach we can possibly think of to automatically generate Calyx implementations of arbitrary (libm-style) math functions for fixed-point formats. There are two options for this, and we should do whatever is easier.
We extend the FPCore compiler to use these generated functions.
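As one illustration of a "simplest possible" generator, here is a sketch that fits a polynomial to a function on an interval with NumPy, converts the coefficients to fixed point, and evaluates with Horner's rule using only integer adds, multiplies, and shifts (exactly the operations we can already emit in Calyx). The degree, interval, and precision are arbitrary illustrative choices:

```python
import math
import numpy as np

def fixed_poly(f, lo, hi, degree, frac):
    """Fit f on [lo, hi]; return fixed-point coefficients, highest degree first."""
    xs = np.linspace(lo, hi, 512)
    coeffs = np.polyfit(xs, [f(x) for x in xs], degree)
    return [int(round(c * (1 << frac))) for c in coeffs]

def horner_fixed(coeffs, x_fixed, frac):
    """Horner's rule in fixed point; each multiply is rescaled by >> frac."""
    acc = 0
    for c in coeffs:
        acc = ((acc * x_fixed) >> frac) + c
    return acc

frac = 16
coeffs = fixed_poly(math.exp, 0.0, 1.0, degree=4, frac=frac)
x = 0.5
approx = horner_fixed(coeffs, round(x * (1 << frac)), frac) / (1 << frac)
print(approx, math.exp(x))  # should agree to a few decimal places
```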
3. Evaluation
First, we make sure that we cover a reasonable subset of FPBench. We consider adding more stuff that is particularly hardware-oriented (and contributing this new stuff back upstream to the FPBench project!).
Then, we implement our competition: a hardware generator via Vivado HLS. Consider using an existing FPCore compiler, such as core2c, and hacking it up to emit Vivado's `ap_fixed` datatypes and such.

We sweep a range of fixed-point precisions and compare the generated hardware on:

- speed
- area
- numerical accuracy
We aim to Pareto-dominate Vivado HLS. That is, for every program we can generate with Vivado HLS, we want there to be a Calyx-generated program that is as good or better in all 3 ways (at least as fast, at least as small, and at least as precise).
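Concretely, the success criterion can be phrased as a small predicate over measured design points. In this sketch each point is a (latency, area, error) tuple where lower is better on every axis; the exact metric tuple is our assumption:

```python
def covers(ours, theirs):
    """ours is at least as fast, as small, and as precise as theirs."""
    return all(o <= t for o, t in zip(ours, theirs))

def pareto_beats(our_points, their_points):
    """Every baseline point is matched or beaten by some point of ours."""
    return all(any(covers(o, t) for o in our_points) for t in their_points)

# Toy example: (latency_cycles, area_luts, max_error)
baseline = [(120, 900, 1e-3), (80, 1500, 1e-4)]
ours = [(100, 850, 1e-3), (80, 1400, 1e-5)]
print(pareto_beats(ours, baseline))  # True
```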
Extensions
The above is a truly minimal version of this project. After we have the end-to-end thing working, it may or may not actually be better than the baseline. We can consider tacking on any of these extensions to make it more awesome:
- When a program computes a composition like `cos(exp(x))`, we needn't just have two different polynomial-approximation units for the `cos` function and the `exp` function. We could build a special-purpose approximation for the composition of the two functions. (I think this could be a fish-in-a-barrel way to beat Vivado HLS, which obviously cannot do this; a quick sketch of the effect follows below.)
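To see why this might be fish-in-a-barrel, here is a quick NumPy experiment comparing the two options at roughly equal hardware cost: chaining a cubic approximation of `exp` into a cubic approximation of `cos` (six multiplies via Horner) versus one degree-6 polynomial fitted directly to the composition (also six multiplies). The interval and degrees are illustrative assumptions:

```python
import numpy as np

f = lambda x: np.cos(np.exp(x))
xs = np.linspace(0.0, 1.0, 1000)

# Option A: two chained approximations, as separate hardware units compute.
p_exp = np.polyfit(xs, np.exp(xs), 3)
ys = np.linspace(1.0, np.e, 1000)          # range of exp on [0, 1]
p_cos = np.polyfit(ys, np.cos(ys), 3)
chained = np.polyval(p_cos, np.polyval(p_exp, xs))

# Option B: one approximation fitted directly to the composed function.
direct = np.polyval(np.polyfit(xs, f(xs), 6), xs)

print("chained max error:", np.max(np.abs(chained - f(xs))))
print("direct  max error:", np.max(np.abs(direct - f(xs))))
```

In rough experiments like this, the direct fit tends to win at an equal multiply count, which is exactly the effect we would hope to exploit.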
Beyond just making this "vertical slice" of the project better, here are some ideas for "horizontal" extensions that would contribute to the broader Calyx ecosystem:

- A replacement for fud's `json_to_dat` data conversion tools. These have always been sort of janky, TBH, as partially recapped in Optimize fixed-point conversion (#1315). A really, really useful but not very researchy subproject of this project could be to build an actually good data converter: something that handles a wide variety of numerical formats and a wide variety of file formats (human-readable text, binary, hex, JSON, …) and is hopefully written in a fast language (i.e., Rust, not Python).

Related Work
There is a huge amount of related work in the broader topic of efficient numerical computation; the most directly relevant efforts are those that do this specifically for hardware.