Very cool work! Is part of this effort to introduce a type system then? I'm imagining a case where I want to use a
There's a lot of useful hardware that will use a mixture of precisions, and even quantization. Is that within scope?
@bcarlet and @owolabileg and I have been brainstorming for some time about a project that would bring proper support for numerical computation to the Calyx ecosystem. It builds on ideas from #686 but goes a bit further. I think it's a potentially very exciting direction, so I'm writing this down now as a semi-formal proposal for discussion and iteration.
Motivation
We want to give people the tools they need to design good accelerators, and realistic accelerators almost always have to do some kind of numerical computation. That numerical stuff also plays a big role in the efficiency and effectiveness of the hardware: picking the right numerical representation can often make or break the viability of an accelerator project.
Calyx has never gotten in the way of doing whatever you want with numerical representations in your accelerator, but it has also never been all that helpful. Because numerics are so central to the viability of accelerators, and because we purport to help people design good accelerators, I think we should be helpful.
Here are some ways in which Calyx is currently unhelpful when it comes to numerics:

- There is no support for even basic math functions like `sqrt` and `pow`, and we have, at great expense, crafted generator-based implementations of `exp` and `ln`. This is not a sustainable way to support general math computations. See Proposal: libm generator for Calyx (#686) for a more complete argument.
- Switching between numerical representations is entirely manual. Nothing reminds you to use `std_fp_sdiv_pipe` when it comes time to replace your `std_div_pipe`, i.e., when you want to add a fractional part and a sign bit. You need to change all of these in concert, and nothing (not even a type system) checks that you have consistently changed everything together.

The goal of this project is to solve all of these problems.
Comparison to SOTA
In the end, we aim to "beat" the approaches to dealing with numerics in both RTL and classic HLS:

- Classic HLS tools like Vivado HLS offer arbitrary-precision numeric types and a `libm` substitute for the standard math functions. You can get some of the desired "convenient switching" by doing a simple `typedef`, like `typedef ap_fixed<11, 6> mynum`, and then using `mynum` everywhere for all your numbers. However, because its `libm` substitute consists of pre-generated library implementations, it is limited: its fixed-point library, for example, only covers a subset of the functions, and it only goes up to 32 bits; and its floating-point library only supports standard sizes. Also, every math function implementation is "one-size-fits-all"; there is no way to get different efficiency/accuracy trade-offs for a given function depending on application demands. Finally, all math operations (primitive arithmetic and math library calls) are implemented individually; in contrast, we have an opportunity to analyze the program to understand their use in context.
- Much the same applies to other numeric type libraries such as `ac_float`, for similar reasons: the library-based (as opposed to compiler-based) approach.

Uniqueness of the ADL Setting
An important question to ask about this motivation is: "Don't all these problems exist in high-performance software too? And therefore, wouldn't you expect that they'd already be solved in that context?"
There are a few reasons that the accelerator design language (ADL) setting differs from the software setting, and hence why the solutions don't already exist.
Project Synopsis
The proposed project is to build a "good" compiler from FPCore to Calyx that supports convenient switching between numerical formats. FPCore is the language for the FPBench suite of numerical programs, so that's our benchmark suite (or at least a starting point for it). Our goal is to beat commercial HLS tools, starting with Vivado HLS, on the accuracy/efficiency Pareto frontier: we will generate more efficient (smaller and faster) hardware at every target precision level.
The project does not include any fancy way to automatically tune precision. To the extent that we do try out multiple precision levels, we use bog-standard autotuning ("grid search" the precision parameters and evaluate the resulting compiled program empirically).
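To make that concrete, here is a minimal sketch of what the grid search could look like. The `compile_fpcore` and `evaluate` helpers are hypothetical stand-ins for the real toolchain (compile FPCore to Calyx at a given precision, simulate, and measure), not actual APIs:

```python
def grid_search(fpcore_src, inputs, widths=range(8, 33, 4)):
    """Exhaustively sweep fixed-point precision parameters.

    compile_fpcore and evaluate are hypothetical placeholders for the
    real flow: compile to Calyx at the given precision, simulate the
    design on the inputs, and measure the results.
    """
    results = []
    for width in widths:
        for frac in range(1, width):  # leave at least one integer bit
            design = compile_fpcore(fpcore_src, width, frac)
            error, area, latency = evaluate(design, inputs)
            results.append({"width": width, "frac": frac,
                            "error": error, "area": area,
                            "latency": latency})
    return results
```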
MVP Plan
This section proposes a work plan for an "MVP" version of this project, i.e., the minimum I think we need to build to write a good paper or whatever. There are many possible exciting extensions, and I've tried to separate all of them from this MVP version (they are in the next section). I expect that we might end up pursuing some of the extensions during the first phase anyway, but for clarity, I think it's important to focus on the simplest possible version, especially given that even that is pretty engineering-heavy.
1. FPCore to Calyx Compiler
The first phase does the basic work to generate hardware from FPCore code via Calyx. We only use the currently built-in fixed-point numbers and integers. We do not support (libm) math functions.
The compiler takes, as an argument, the single, global fixed-point precision to use for everything in the entire program. FPCore supports explicit precision annotations like `(! :precision binary80 (some subcomputation))`. I think we should just reject these programs for now and only consider the ones that are "pure math" expressions. This gives us the freedom to choose the precision ourselves (easily, with a single global knob).

We do differential testing against an existing implementation of FPCore, such as the reference interpreter.
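As a sketch of what that differential testing could look like at the numerical level, here is a toy harness that checks an emulated fixed-point evaluation of x*x + 1 against a double-precision reference. The error tolerance is an illustrative assumption, not a derived bound:

```python
import random

def to_fixed(x, frac):
    return round(x * (1 << frac))

def to_float(x, frac):
    return x / (1 << frac)

def fixed_mul(a, b, frac):
    # Model the truncating multiply that fixed-point hardware performs.
    return (a * b) >> frac

def diff_test(trials=1000, frac=16):
    """Toy differential test: fixed-point x*x + 1 vs. the float reference."""
    tol = 2.0 ** -(frac - 2)  # a few ULPs of slack (illustrative assumption)
    for _ in range(trials):
        x = random.uniform(-1.0, 1.0)
        xf = to_fixed(x, frac)
        got = to_float(fixed_mul(xf, xf, frac) + to_fixed(1.0, frac), frac)
        assert abs(got - (x * x + 1.0)) <= tol, (x, got)

diff_test()
```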
To run the programs, we need a way to generate (Calyx-compatible) data input files. We will use fud's current data format conversion infrastructure to convert floating-point inputs to fixed-point data for the specific hardware instance.
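The core of that conversion is small. A minimal sketch, assuming saturation on overflow (one possible policy; wrapping or raising an error are equally valid choices for a real converter):

```python
def float_to_fixed(value, width, frac_bits, signed=True):
    """Quantize a float to a width-bit two's-complement fixed-point pattern.

    Saturating on overflow is an illustrative choice for this sketch.
    """
    scaled = round(value * (1 << frac_bits))
    lo = -(1 << (width - 1)) if signed else 0
    hi = (1 << (width - 1)) - 1 if signed else (1 << width) - 1
    scaled = max(lo, min(hi, scaled))   # saturate out-of-range values
    return scaled & ((1 << width) - 1)  # two's-complement bit pattern

print(f"{float_to_fixed(-1.5, 8, 4):08b}")  # -1.5 in Q4.4 -> 11101000
```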
2. Generate Math Functions
This phase uses the simplest approach we can possibly think of to automatically generate Calyx implementations of arbitrary (libm-style) math functions for fixed-point formats. There are two options for this, and we should do whatever is easier.
We extend the FPCore compiler to use these generated functions.
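As one illustration of a "simplest possible" generator, here is a sketch that fits a polynomial to a function on an interval with NumPy, converts the coefficients to fixed point, and evaluates with Horner's rule using only integer adds, multiplies, and shifts (exactly the operations we can already emit in Calyx). The degree, interval, and precision are arbitrary illustrative choices:

```python
import math
import numpy as np

def fixed_poly(f, lo, hi, degree, frac):
    """Fit f on [lo, hi]; return fixed-point coefficients, highest degree first."""
    xs = np.linspace(lo, hi, 512)
    coeffs = np.polyfit(xs, [f(x) for x in xs], degree)
    return [int(round(c * (1 << frac))) for c in coeffs]

def horner_fixed(coeffs, x_fixed, frac):
    """Horner's rule in fixed point; each multiply is rescaled by >> frac."""
    acc = 0
    for c in coeffs:
        acc = ((acc * x_fixed) >> frac) + c
    return acc

frac = 16
coeffs = fixed_poly(math.exp, 0.0, 1.0, degree=4, frac=frac)
x = 0.5
approx = horner_fixed(coeffs, round(x * (1 << frac)), frac) / (1 << frac)
print(approx, math.exp(x))  # should agree to a few decimal places
```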
3. Evaluation
First, we make sure that we cover a reasonable subset of FPBench. We consider adding more stuff that is particularly hardware-oriented (and contributing this new stuff back upstream to the FPBench project!).
Then, we implement our competition: a hardware generator via Vivado HLS. Consider using an existing FPCore compiler, such as core2c, and hacking it up to emit Vivado's `ap_fixed` datatypes and such.

We sweep a range of fixed-point precisions and compare the generated hardware on:

- speed
- area
- numerical accuracy
We aim to Pareto-dominate Vivado HLS. That is, for every program we can generate with Vivado HLS, we want there to be a Calyx-generated program that is as good or better in all 3 ways (at least as fast, at least as small, and at least as precise).
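Concretely, the success criterion can be phrased as a small predicate over measured design points. In this sketch each point is a (latency, area, error) tuple where lower is better on every axis; the exact metric tuple is our assumption:

```python
def covers(ours, theirs):
    """ours is at least as fast, as small, and as precise as theirs."""
    return all(o <= t for o, t in zip(ours, theirs))

def pareto_beats(our_points, their_points):
    """Every baseline point is matched or beaten by some point of ours."""
    return all(any(covers(o, t) for o in our_points) for t in their_points)

# Toy example: (latency_cycles, area_luts, max_error)
baseline = [(120, 900, 1e-3), (80, 1500, 1e-4)]
ours = [(100, 850, 1e-3), (80, 1400, 1e-5)]
print(pareto_beats(ours, baseline))  # True
```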
Extensions
The above is a truly minimal version of this project. After we have the end-to-end thing working, it may or may not actually be better than the baseline. We can consider tacking on any of these extensions to make it more awesome:
- When a program computes a composition like `cos(exp(x))`, we needn't just have two different polynomial-approximation units for the `cos` function and the `exp` function. We could build a special-purpose approximation for the composition of the two functions. (I think this could be a fish-in-a-barrel way to beat Vivado HLS, which obviously cannot do this; a quick sketch of the effect follows below.)
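To see why this might be fish-in-a-barrel, here is a quick NumPy experiment comparing the two options at roughly equal hardware cost: chaining a cubic approximation of `exp` into a cubic approximation of `cos` (six multiplies via Horner) versus one degree-6 polynomial fitted directly to the composition (also six multiplies). The interval and degrees are illustrative assumptions:

```python
import numpy as np

f = lambda x: np.cos(np.exp(x))
xs = np.linspace(0.0, 1.0, 1000)

# Option A: two chained approximations, as separate hardware units compute.
p_exp = np.polyfit(xs, np.exp(xs), 3)
ys = np.linspace(1.0, np.e, 1000)          # range of exp on [0, 1]
p_cos = np.polyfit(ys, np.cos(ys), 3)
chained = np.polyval(p_cos, np.polyval(p_exp, xs))

# Option B: one approximation fitted directly to the composed function.
direct = np.polyval(np.polyfit(xs, f(xs), 6), xs)

print("chained max error:", np.max(np.abs(chained - f(xs))))
print("direct  max error:", np.max(np.abs(direct - f(xs))))
```

In rough experiments like this, the direct fit tends to win at an equal multiply count, which is exactly the effect we would hope to exploit.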
Beyond just making this "vertical slice" of the project better, here are some ideas for "horizontal" extensions that would contribute to the broader Calyx ecosystem:

- A replacement for fud's `json_to_dat` data conversion tools. These have always been sort of janky, TBH, as partially recapped in Optimize fixed-point conversion (#1315). A really, really useful but not very researchy subproject of this project could be to build an actually good data converter: something that handles a wide variety of numerical formats and a wide variety of file formats (human-readable text, binary, hex, JSON, …) and is hopefully written in a fast language (i.e., Rust, not Python).

Related Work
There is a huge amount of related work in the broader topic of efficient numerical computation; the most directly relevant efforts are those that do this specifically for hardware.