Change Default Matrix Multiplication Data Type to Integer; Add Documentation on Numerical Tolerances (#1615)
andrej authored Jul 17, 2024
1 parent 9530ac4 commit ea17ecb
Showing 9 changed files with 283 additions and 62 deletions.
18 changes: 18 additions & 0 deletions aie_kernels/aie2/mm.cc
@@ -366,6 +366,23 @@ void matmul_vectorized_4x4x4_i16_i16(const int16 *__restrict pA,
pC);
}

template <unsigned m, unsigned k, unsigned n>
void matmul_vectorized_4x4x4_i16_i32(const int16 *__restrict pA,
                                     const int16 *__restrict pB,
                                     int32 *__restrict pC) {
  // matmul_vectorized operates on two 4x4 input blocks of A, and two 4x4 input
  // blocks of B in each iteration. Make sure we have at least 2 blocks in each
  // dimension, and that our input matrix is evenly divisible.
  constexpr int r = 4;
  constexpr int s = 4;
  constexpr int t = 4;
  static_assert(m % (2 * r) == 0 && m / (2 * r) > 0);
  static_assert(k % (2 * s) == 0 && k / (2 * s) > 0);
  static_assert(n % (2 * t) == 0 && n / (2 * t) > 0);
  return matmul_vectorized<int16, int32, m / r, k / s, n / t, r, s, t>(pA, pB,
                                                                       pC);
}

template <unsigned m, unsigned k, unsigned n>
void matmul_vectorized_4x8x4_bf16_bf16(const bfloat16 *__restrict pA,
const bfloat16 *__restrict pB,
@@ -416,6 +433,7 @@ extern "C" {

#define combos(X) \
X(int16, i16, int16, i16, 4, 4, 4) \
X(int16, i16, int32, i32, 4, 4, 4) \
X(bfloat16, bf16, bfloat16, bf16, 4, 8, 4) \
X(bfloat16, bf16, float, f32, 4, 8, 4)

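For context, each `X(...)` row supplies (input C type, input name suffix, output C type, output name suffix, r, s, t). Elsewhere in `mm.cc` the `combos` X-macro is applied to generator macros that stamp out the `extern "C"` wrappers whose names the Python designs look up (e.g. `matmul_i16_i32`, `zero_i32`). Below is a minimal sketch of that pattern with a hypothetical generator macro, not the literal one from the file:

```c++
// Hypothetical generator macro, shown only to illustrate how the combos list
// expands; the real generators in mm.cc also forward to the vectorized and
// scalar kernel templates.
#define declare_matmul(ctype_in, tin, ctype_out, tout, r, s, t) \
  void matmul_##tin##_##tout(ctype_in *a, ctype_in *b, ctype_out *c);

combos(declare_matmul)
// expands to:
//   void matmul_i16_i16(int16 *a, int16 *b, int16 *c);
//   void matmul_i16_i32(int16 *a, int16 *b, int32 *c);
//   void matmul_bf16_bf16(bfloat16 *a, bfloat16 *b, bfloat16 *c);
//   void matmul_bf16_f32(bfloat16 *a, bfloat16 *b, float *c);
```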
20 changes: 19 additions & 1 deletion programming_examples/basic/matrix_multiplication/README.md
@@ -16,4 +16,22 @@ Subdirectories in this directory contain example designs that implement matrix m
* [`single_core`](single_core) - This design performs matrix-matrix multiplication on a single AI Engine core.
* [`whole_array`](whole_array) - This design evolves `single_core`, by splitting the computation and parallelizing it. It utilizes all available AI Engine cores simultaneously.
* [`matrix_vector`](matrix_vector) - This design is a specialization to the matrix-vector-multiplication case, which poses unique challenges due to lower computation density. *Work in progress.*

## Note on Numerical Tolerances

This directory contains verification code that ensures the designs in the subdirectories produce the correct output.

The designs can be configured to work on different input and output data types via the Makefile variables `dtype_in` and `dtype_out`.
In the default configuration, all designs consume integer inputs and produce integer outputs.
In this case, the verification checks for strict equivalence between the reference output computed on the host CPU and the output calculated on the AI Engine.
That is, for integer data types, verification only passes if the AI Engine output is bit-for-bit identical to the reference.

For floating point data types, the verification code allows the AI Engine output to deviate from the reference calculated on the host CPU within bounded relative and absolute tolerances (defined in `common.h`).
This is standard practice and is necessary for the following reasons:

- Operations on IEEE 754 floating point values are not associative; the order in which values are multiplied and accumulated can affect the result. All designs in the subdirectories tile the input matrices, multiplying and accumulating sub-matrices in chunks, whereas the reference calculation on the CPU does not perform tiling. Some differences due to this reordering are therefore expected.
- The reference on the host CPU is always computed in `float32`, even if the input data type is `bfloat16`, since the host CPU does not support native `bfloat16` multiplication. The CPU therefore computes results at higher precision and truncates them afterwards, whereas the AI Engine computes natively in the lower-precision data type for better performance.
- If the output data type is lower-precision than the accumulation data type, the tiling in the `K` dimension affects the results. For example, when multiplying `bfloat16` numbers, the AI Engine accumulates results in higher-precision `float32`. Our designs accumulate `k` times (the tile size in the `K` dimension) before writing results back into the output buffer. If the output buffer is lower-precision, results are truncated at that point, so a larger `k` means fewer such truncations. The AI Engine also provides a higher-precision "cascade" data path, which can be used to accumulate results between cores, although none of the designs in this directory currently make use of it.

In summary, different choices of data types, tiling strategies, and AI Engine components can all affect floating point results in subtle ways. These choices present trade-offs that must be weighed case-by-case for the application at hand.
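To make the floating point criterion concrete, the following is a minimal, self-contained sketch of the combined absolute/relative check applied in `common.h`. The `norm` term is an assumption about the part of `nearly_equal` not shown in this diff, and the default tolerance values shown are the `bfloat16` ones.

```c++
#include <algorithm>
#include <cfloat>
#include <cmath>
#include <cstdio>

// Accept `actual` if it is within abs_tol of `expected`, or within rel_tol of
// the combined magnitude of the two values -- whichever bound is looser.
static bool nearly_equal(float expected, float actual, float rel_tol = 0.05f,
                         float abs_tol = 0.5f) {
  float diff = std::fabs(expected - actual);
  float norm = std::min(std::fabs(expected) + std::fabs(actual), FLT_MAX);
  return diff < std::max(abs_tol, rel_tol * norm);
}

int main() {
  std::printf("%d\n", nearly_equal(100.0f, 104.9f)); // 1: relative bound accepts it
  std::printf("%d\n", nearly_equal(100.0f, 115.0f)); // 0: outside both bounds
  std::printf("%d\n", nearly_equal(0.01f, 0.40f));   // 1: absolute bound dominates near zero
}
```

Integer outputs never reach this path: when both tolerances are zero, `verify_single` (below) compares them with `==`.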
81 changes: 69 additions & 12 deletions programming_examples/basic/matrix_multiplication/common.h
@@ -109,11 +109,16 @@ std::vector<uint32_t> load_instr_sequence(std::string instr_path) {
// Matrix / Float / Math
// --------------------------------------------------------------------------

static inline std::int16_t random_int16_t() {
template <typename T>
static inline T get_random();

template <>
std::int16_t get_random<std::int16_t>() {
return (std::int16_t)rand() % 0x10000;
}

static inline std::bfloat16_t random_bfloat16_t() {
template <>
std::bfloat16_t get_random<std::bfloat16_t>() {
// Random numbers should NOT be uniformly between 0 and 1, because that
// would make the matrix product AB always close to 1.
return std::bfloat16_t(4.0 * (float)rand() / (float)(RAND_MAX));
@@ -165,6 +170,51 @@ bool nearly_equal(float a, float b, float epsilon = 128 * FLT_EPSILON,
return diff < std::max(abs_th, epsilon * norm);
}

template <typename T>
static inline float get_abs_tol();
template <typename T>
static inline float get_rel_tol();

template <>
float get_abs_tol<std::int16_t>() {
return 0.0;
}

template <>
float get_abs_tol<std::int32_t>() {
return 0.0;
}

template <>
float get_abs_tol<std::bfloat16_t>() {
return 0.5;
}

template <>
float get_abs_tol<float>() {
return 0.5;
}

template <>
float get_rel_tol<std::int16_t>() {
return 0.0;
}

template <>
float get_rel_tol<std::int32_t>() {
return 0.0;
}

template <>
float get_rel_tol<std::bfloat16_t>() {
return 0.05;
}

template <>
float get_rel_tol<float>() {
return 0.05;
}

template <typename T>
void print_matrix(const std::vector<T> matrix, int n_cols,
int n_printable_rows = 10, int n_printable_cols = 10,
@@ -237,10 +287,14 @@ struct error {

template <typename Tout>
std::optional<struct error<Tout>>
verify_single(std::ostream &os, int row, int col, Tout expected, Tout actual) {
const float absTol = 0.5;
const float relTol = 0.05;
if (!nearly_equal(expected, actual, relTol, absTol)) {
verify_single(std::ostream &os, int row, int col, Tout expected, Tout actual,
float abs_tol, float rel_tol) {
bool match = expected == actual;
if (abs_tol > 0 || rel_tol > 0) {
// Allow for some tolerance for float data types
match = nearly_equal(expected, actual, rel_tol, abs_tol);
}
if (!match) {
return (struct error<Tout>){row, col, expected, actual};
}
return std::nullopt;
@@ -275,7 +329,8 @@ void print_progress_bar(std::ostream &os, double progress, int len = 75) {

template <typename Tin, typename Tout, typename Tacc>
int verify(int M, int N, int K, std::vector<Tin> A, std::vector<Tin> B,
std::vector<Tout> C, int verbosity = 0) {
std::vector<Tout> C, int verbosity = 0, float abs_tol = 0.5,
float rel_tol = 0.05) {
int n_errors = 0;
std::vector<struct error<Tout>> errors;
Tout max_rel_error = (Tout)0.0f;
@@ -285,8 +340,9 @@ int verify(int M, int N, int K, std::vector<Tin> A, std::vector<Tin> B,

for (int row = 0; row < M; row++) {
for (int col = 0; col < N; col++) {
std::optional<struct error<Tout>> error = verify_single(
std::cout, row, col, CRef[row * N + col], C[row * N + col]);
std::optional<struct error<Tout>> error =
verify_single(std::cout, row, col, CRef[row * N + col],
C[row * N + col], abs_tol, rel_tol);
if (error.has_value()) {
if (n_errors < max_printable_errors) {
errors.push_back(*error);
@@ -316,7 +372,8 @@ int verify(int M, int N, int K, std::vector<Tin> A, std::vector<Tin> B,
template <typename Tin, typename Tout, typename Tacc>
int verify_stochastic(int M, int N, int K, std::vector<Tin> A,
std::vector<Tin> B, std::vector<Tout> C, int n_samples,
int verbosity = 0) {
int verbosity = 0, float abs_tol = 0.5,
float rel_tol = 0.05) {
std::mt19937 rng;
auto rows = std::views::iota(0, M);
auto cols = std::views::iota(0, N);
@@ -342,8 +399,8 @@ int verify_stochastic(int M, int N, int K, std::vector<Tin> A,
print_progress_bar(std::cerr, progress);
}
Tout ref = mul_acc<Tin, Tout, Tacc>(M, N, K, row, col, A, B);
std::optional<struct error<Tout>> error =
verify_single(std::cout, row, col, ref, C[row * N + col]);
std::optional<struct error<Tout>> error = verify_single(
std::cout, row, col, ref, C[row * N + col], abs_tol, rel_tol);
if (error.has_value()) {
if (n_errors < max_printable_errors) {
errors.push_back(*error);
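Taken together, the additions above let callers select tolerances from the output type. A hypothetical call site follows; the actual `test.cpp` is not among the hunks shown here, so the helper name is illustrative.

```c++
#include <vector>
// Assumes common.h (above) is included. Hypothetical helper, not part of the
// commit: selects tolerances from the output type, so integer outputs are
// checked exactly (both tolerances 0.0) and float outputs use the loosened
// bounds defined by get_abs_tol/get_rel_tol.
template <typename Tin, typename Tout, typename Tacc>
int check_result(int M, int N, int K, std::vector<Tin> A, std::vector<Tin> B,
                 std::vector<Tout> C, int verbosity = 0) {
  return verify<Tin, Tout, Tacc>(M, N, K, A, B, C, verbosity,
                                 get_abs_tol<Tout>(), get_rel_tol<Tout>());
}
```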
31 changes: 29 additions & 2 deletions programming_examples/basic/matrix_multiplication/makefile-common
@@ -37,6 +37,32 @@ include ${current_dir}../../makefile-common
M?=512
K?=512
N?=512
dtype_in?=i16
dtype_out?=i32

ifeq ($(dtype_in),bf16)
dtype_in_cpp=std::bfloat16_t
endif
ifeq ($(dtype_out),bf16)
dtype_out_cpp=std::bfloat16_t
dtype_acc_cpp=float
endif

ifeq ($(dtype_in),i16)
dtype_in_cpp=int16_t
endif
ifeq ($(dtype_out),i16)
dtype_out_cpp=int16_t
dtype_acc_cpp=int16_t
endif
ifeq ($(dtype_out),i32)
dtype_out_cpp=int32_t
dtype_acc_cpp=int32_t
endif
ifeq ($(dtype_out),f32)
dtype_out_cpp=float
dtype_acc_cpp=float
endif

trace_size?=65536

@@ -46,7 +72,7 @@ xclbin_target?=build/final_${target_suffix}.xclbin
insts_target?=build/insts_${target_suffix}.txt

runargs?=-v 2 --warmup 1 --iters 1
aieargs+=-M $M -K $K -N $N
aieargs+=-M $M -K $K -N $N --dtype_in ${dtype_in} --dtype_out ${dtype_out}

kernels_dir=${srcdir}/../../../../aie_kernels/aie2

@@ -69,7 +95,8 @@ ${xclbin_target}: ${mlir_target} ${kernels:%=build/%.o}
${targetname}.exe: ${srcdir}/test.cpp ${srcdir}/../test.cpp ${srcdir}/../common.h
rm -rf _build
mkdir -p _build
cd _build && ${powershell} cmake -E env CXXFLAGS="-std=c++23 -ggdb" cmake ${srcdir}/.. -D CMAKE_C_COMPILER=gcc-13 -D CMAKE_CXX_COMPILER=g++-13 -DTARGET_NAME=${targetname} -Dsubdir=${subdir}
cd _build && ${powershell} cmake -E env CXXFLAGS="-std=c++23 -ggdb -DDTYPE_IN=${dtype_in_cpp} -DDTYPE_OUT=${dtype_out_cpp} -DDTYPE_ACC=${dtype_acc_cpp}" \
cmake ${srcdir}/.. -D CMAKE_C_COMPILER=gcc-13 -D CMAKE_CXX_COMPILER=g++-13 -DTARGET_NAME=${targetname} -Dsubdir=${subdir}
cd _build && ${powershell} cmake --build . --config Release
ifeq "${powershell}" "powershell.exe"
cp _build/${targetname}.exe $@
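The data types are selected at build time (e.g. `make dtype_in=bf16 dtype_out=f32`), mapped to C++ type names by the `ifeq` blocks above, and handed to the host build through the `-DDTYPE_IN`/`-DDTYPE_OUT`/`-DDTYPE_ACC` flags in `CXXFLAGS`. A hypothetical sketch of how the host code could pick these up follows; the actual `test.cpp` is not shown in this commit view, so the defaults and aliases here are assumptions.

```c++
// Hypothetical host-side plumbing for the -DDTYPE_* flags defined above.
#include <cstdint>
#include <stdfloat> // std::bfloat16_t (C++23, matching -std=c++23 above)

#ifndef DTYPE_IN
#define DTYPE_IN std::int16_t // assumed default, mirroring dtype_in?=i16
#endif
#ifndef DTYPE_OUT
#define DTYPE_OUT std::int32_t // assumed default, mirroring dtype_out?=i32
#endif
#ifndef DTYPE_ACC
#define DTYPE_ACC std::int32_t
#endif

using dtype_in = DTYPE_IN;
using dtype_out = DTYPE_OUT;
using dtype_acc = DTYPE_ACC;
```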
@@ -18,7 +18,7 @@ K?=256
N?=256
m?=64
k?=64
n?=64
n?=32

kernels=mm_${m}x${k}x${n}
aieargs+=-m $m -k $k -n $n
@@ -25,20 +25,37 @@ def main():
argparser.add_argument("-N", type=int, default=256)
argparser.add_argument("-m", type=int, default=64)
argparser.add_argument("-k", type=int, default=64)
argparser.add_argument("-n", type=int, default=64)
argparser.add_argument("-n", type=int, default=32)
argparser.add_argument(
"--dtype_in", type=str, choices=["bf16", "i16"], default="i16"
)
argparser.add_argument(
"--dtype_out", type=str, choices=["bf16", "i16", "f32", "i32"], default="i32"
)
args = argparser.parse_args()
my_matmul(args.M, args.K, args.N, args.m, args.k, args.n)
my_matmul(
args.M, args.K, args.N, args.m, args.k, args.n, args.dtype_in, args.dtype_out
)


def ceildiv(a, b):
return (a + b - 1) // b


def my_matmul(M, K, N, m, k, n):
def my_matmul(M, K, N, m, k, n, dtype_in_str, dtype_out_str):

assert M % m == 0
assert K % k == 0
assert N % n == 0

r = 4
s = 8
t = 4
if dtype_in_str == "bf16":
r = 4
s = 8
t = 4
elif dtype_in_str == "i16":
r = 4
s = 4
t = 4

assert m % r == 0
assert k % s == 0
@@ -48,10 +65,24 @@ def my_matmul(M, K, N, m, k, n):
enable_tracing = False
trace_size = 65536

dtype_in = None
if dtype_in_str == "bf16":
dtype_in = T.bf16
elif dtype_in_str == "i16":
dtype_in = T.i16
dtype_out = None
if dtype_out_str == "bf16":
dtype_out = T.bf16
elif dtype_out_str == "i16":
dtype_out = T.i16
elif dtype_out_str == "f32":
dtype_out = T.f32
elif dtype_out_str == "i32":
dtype_out = T.i32

A_sz = M * K
B_sz = K * N
C_sz = M * N
C_sz_in_bytes = C_sz * 2

M_div_m = M // m
K_div_k = K // k
@@ -66,25 +97,30 @@

with mlir_mod_ctx() as ctx:

C_sz_in_bytes = C_sz * dtype_out().width // 8

@device(AIEDevice.npu1_1col)
def device_body():
memref_a_ty = T.memref(m, k, T.bf16())
memref_b_ty = T.memref(k, n, T.bf16())
memref_c_ty = T.memref(m, n, T.bf16())
memref_a_ty = T.memref(m, k, dtype_in())
memref_b_ty = T.memref(k, n, dtype_in())
memref_c_ty = T.memref(m, n, dtype_out())

ofifo_memref_a_ty = TypeAttr.get(ObjectFifoType.get(memref_a_ty))
ofifo_memref_b_ty = TypeAttr.get(ObjectFifoType.get(memref_b_ty))
ofifo_memref_c_ty = TypeAttr.get(ObjectFifoType.get(memref_c_ty))

# AIE Core Function declarations
zero_scalar = external_func("zero_scalar_bf16", inputs=[memref_c_ty])
zero = external_func("zero_bf16", inputs=[memref_c_ty])
zero_scalar = external_func(
f"zero_scalar_{dtype_out_str}", inputs=[memref_c_ty]
)
zero = external_func(f"zero_{dtype_out_str}", inputs=[memref_c_ty])
matmul_scalar = external_func(
"matmul_scalar_bf16_bf16",
f"matmul_scalar_{dtype_in_str}_{dtype_out_str}",
inputs=[memref_a_ty, memref_b_ty, memref_c_ty],
)
matmul = external_func(
"matmul_bf16_bf16", inputs=[memref_a_ty, memref_b_ty, memref_c_ty]
f"matmul_{dtype_in_str}_{dtype_out_str}",
inputs=[memref_a_ty, memref_b_ty, memref_c_ty],
)

# Tile declarations
@@ -196,9 +232,9 @@ def core_body():
# To/from AIE-array data movement

@FuncOp.from_py_func(
T.memref(A_sz, T.bf16()),
T.memref(B_sz, T.bf16()),
T.memref(C_sz, T.bf16()),
T.memref(A_sz, dtype_in()),
T.memref(B_sz, dtype_in()),
T.memref(C_sz, dtype_out()),
)
def sequence(A, B, C):

Expand All @@ -213,9 +249,7 @@ def sequence(A, B, C):

# only do 5 tile rows at a time before synchronizing, so we can reuse BDs
rows_per_block = 5
for tile_row_block in range(
(M_div_m + rows_per_block - 1) // rows_per_block
):
for tile_row_block in range(ceildiv(M_div_m, rows_per_block)):
C_row_offset = tile_row_block * rows_per_block * m * N
num_tile_rows = min(
[rows_per_block, M_div_m - tile_row_block * rows_per_block]