CpuGemmMatrixMultiplyKernel does not support datatype DataType::BFLOAT16 #1158
Hi @exaithrg Quoting the documentation about BF16 acceleration:
I think the problem here is that you have to pass a GEMMInfo with fast_math enabled to NEGEMM::configure(). Something like below:
Hope this helps |
@morgolock Thank you for your reply. I tried again and completed the following steps. I replaced

// Configure function
sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);

with

// Configure function
GEMMInfo gemm_info;
gemm_info.set_fast_math(true);
sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gemm_info);
// sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);

Then I recompiled. This time I added the extra_cxx_flags="-fPIC" option to the compilation command at the beginning of the issue, to more closely follow the example in the BF16 acceleration documentation.

According to the BF16 acceleration reference, I set fast_math in a GEMMInfo passed to configure(). However, when running the compiled program (./build/examples/neon_sgemm_bf16), I still encountered the same problem, i.e., the program crashes with the same CpuGemmMatrixMultiplyKernel validation error.

The current code of neon_sgemm_bf16.cpp is as follows:

// neon_sgemm_bf16.cpp
/*
* Copyright (c) 2018-2019 Arm Limited.
*
* SPDX-License-Identifier: MIT
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to
* deal in the Software without restriction, including without limitation the
* rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
* sell copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in all
* copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
* SOFTWARE.
*/
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/NEON/NEScheduler.h"
#include "utils/Utils.h"
#include <cstdlib>
using namespace arm_compute;
using namespace utils;
class NESGEMMExample : public Example
{
public:
bool do_setup(int argc, char **argv) override
{
NPYLoader npy0;
NPYLoader npy1;
NPYLoader npy2;
alpha = 1.0f;
beta = 0.0f;
std::ifstream stream;
if (argc > 1)
{
stream.open(argv[1], std::fstream::in);
}
if (argc < 3 || (argc < 4 && stream.bad()))
{
// Print help
std::cout << "Usage: 1) ./build/neon_sgemm input_matrix_1.npy input_matrix_2.npy [input_matrix_3.npy] "
"[alpha = 1] [beta = 0]\n";
std::cout << " 2) ./build/neon_sgemm M N K [alpha = 1.0f] [beta = 0.0f]\n\n";
std::cout << "Too few or no input_matrices provided. Using M=7, N=3, K=5, alpha=1.0f and beta=0.0f\n\n";
src0.allocator()->init(TensorInfo(TensorShape(5U, 7U), 1, DataType::BFLOAT16));
src1.allocator()->init(TensorInfo(TensorShape(3U, 5U), 1, DataType::BFLOAT16));
src2.allocator()->init(TensorInfo(TensorShape(3U, 7U), 1, DataType::BFLOAT16));
}
else
{
if (stream.good()) /* case file1.npy file2.npy [file3.npy] [alpha = 1.0f] [beta = 0.0f] */
{
npy0.open(argv[1]);
npy0.init_tensor(src0, DataType::BFLOAT16);
npy1.open(argv[2]);
npy1.init_tensor(src1, DataType::BFLOAT16);
if (argc > 3)
{
stream.close();
stream.clear();
stream.open(argv[3], std::fstream::in);
if (stream.good()) /* case with third file */
{
npy2.open(argv[3]);
npy2.init_tensor(src2, DataType::BFLOAT16);
if (argc > 4)
{
// Convert string to float
alpha = strtof(argv[4], nullptr);
if (argc > 5)
{
// Convert string to float
beta = strtof(argv[5], nullptr);
}
}
}
else /* case without third file */
{
alpha = strtof(argv[3], nullptr);
if (argc > 4)
{
beta = strtof(argv[4], nullptr);
}
}
}
}
else /* case M N K [alpha = 1.0f] [beta = 0.0f] */
{
size_t M = strtol(argv[1], nullptr, 10);
size_t N = strtol(argv[2], nullptr, 10);
size_t K = strtol(argv[3], nullptr, 10);
src0.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::BFLOAT16));
src1.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::BFLOAT16));
src2.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::BFLOAT16));
if (argc > 4)
{
alpha = strtof(argv[4], nullptr);
if (argc > 5)
{
beta = strtof(argv[5], nullptr);
}
}
}
}
init_sgemm_output(dst, src0, src1, DataType::BFLOAT16);
// Configure function
GEMMInfo gemm_info;
gemm_info.set_fast_math(true);
sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gemm_info);
// sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);
// Allocate all the images
src0.allocator()->allocate();
src1.allocator()->allocate();
dst.allocator()->allocate();
// Fill the input images with either the data provided or random data
if (npy0.is_open())
{
npy0.fill_tensor(src0);
npy1.fill_tensor(src1);
output_filename = "sgemm_out.npy";
is_fortran = npy0.is_fortran();
if (npy2.is_open())
{
src2.allocator()->allocate();
npy2.fill_tensor(src2);
}
}
else
{
src2.allocator()->allocate();
fill_random_tensor(src0, -1.f, 1.f);
fill_random_tensor(src1, -1.f, 1.f);
fill_random_tensor(src2, -1.f, 1.f);
}
// Dummy run for CLTuner
sgemm.run();
return true;
}
void do_run() override
{
// Execute the function
sgemm.run();
}
void do_teardown() override
{
if (!output_filename.empty()) /* Save to .npy file */
{
save_to_npy(dst, output_filename, is_fortran);
}
}
private:
Tensor src0{}, src1{}, src2{}, dst{};
NEGEMM sgemm{};
float alpha{}, beta{};
bool is_fortran{};
std::string output_filename{};
};
/** Main program for sgemm test
*
* @param[in] argc Number of arguments
* @param[in] argv Arguments ( [optional] Matrix A, [optional] Matrix B, [optional] Matrix C, [optional] alpha, [optional] beta )
*/
int main(int argc, char **argv)
{
return utils::run_example<NESGEMMExample>(argc, argv);
}

I'm pretty sure that I passed a gemm_info with fast_math correctly set to sgemm.configure. The GDB output is:

Reading symbols from ./build/examples/neon_sgemm_bf16...
(gdb) b examples/neon_sgemm_bf16.cpp:131
Breakpoint 1 at 0x41baa8: file examples/neon_sgemm_bf16.cpp, line 131.
(gdb) r
Starting program: /mnt/arm64max_share_folder/work/llm_acl/ComputeLibrary/build/examples/neon_sgemm_bf16
warning: Unable to determine the number of hardware watchpoints available.
warning: Unable to determine the number of hardware breakpoints available.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
/mnt/arm64max_share_folder/work/llm_acl/ComputeLibrary/build/examples/neon_sgemm_bf16
Usage: 1) ./build/neon_sgemm input_matrix_1.npy input_matrix_2.npy [input_matrix_3.npy] [alpha = 1] [beta = 0]
2) ./build/neon_sgemm M N K [alpha = 1.0f] [beta = 0.0f]
Too few or no input_matrices provided. Using M=7, N=3, K=5, alpha=1.0f and beta=0.0f
Breakpoint 1, NESGEMMExample::do_setup (this=0x49c390, argc=1,
argv=0xfffffffff338) at examples/neon_sgemm_bf16.cpp:131
131 GEMMInfo gemm_info;
(gdb) l
126 }
127
128 init_sgemm_output(dst, src0, src1, DataType::BFLOAT16);
129
130 // Configure function
131 GEMMInfo gemm_info;
132 gemm_info.set_fast_math(true);
133 sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gemm_info);
134 // sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);
135
(gdb) n
132 gemm_info.set_fast_math(true);
(gdb) n
133 sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gemm_info);
(gdb) p gemm_info
$1 = {_is_a_reshaped = false, _is_b_reshaped = false,
_reshape_b_only_on_first_run = true, _depth_output_gemm3d = 0,
_reinterpret_input_as_3d = false, _retain_internal_weights = false,
_gemmlowp_output_stage = {type = arm_compute::GEMMLowpOutputStageType::NONE,
gemmlowp_offset = 0, gemmlowp_multiplier = 0, gemmlowp_shift = 0,
gemmlowp_min_bound = -2147483648, gemmlowp_max_bound = 2147483647,
gemmlowp_multipliers = std::vector of length 0, capacity 0,
gemmlowp_shifts = std::vector of length 0, capacity 0,
gemmlowp_real_multiplier = 0, is_quantized_per_channel = false,
output_data_type = arm_compute::DataType::UNKNOWN}, _fast_math = true,
_fp_mixed_precision = false, _broadcast_bias = false,
_pretranspose_A = false, _pretranspose_B = false, _activation_info = {
_act = arm_compute::ActivationFunction::IDENTITY, _a = 0, _b = 0,
_enabled = false, _lut = {_M_elems = '\000' <repeats 255 times>},
_lut_fp16 = std::shared_ptr<std::array<__fp16, 65536>> (empty) = {
get() = 0x0}}, _fixed_format = false,
_weight_format = arm_compute::WeightFormat::UNSPECIFIED, _accumulate = false}
(gdb) n
[New Thread 0xfffff603f0c0 (LWP 574)]
[New Thread 0xfffff582f0c0 (LWP 575)]
[New Thread 0xfffff501f0c0 (LWP 576)]
!!!!!!!!!!!!!!!!!!!!!!!!!!!
ERROR in validate_arguments src/cpu/kernels/CpuGemmMatrixMultiplyKernel.cpp:64: ITensor data type not supported by this kernel
!!!!!!!!!!!!!!!!!!!!!!!!!!!
Test FAILED
[Thread 0xfffff603f0c0 (LWP 574) exited]
[Thread 0xfffff501f0c0 (LWP 576) exited]
[Thread 0xfffff582f0c0 (LWP 575) exited]
[Inferior 1 (process 572) exited with code 0377]
(gdb)

It appears that setting fast_math did not resolve the problem. Additionally, I also tried using

// Configure function
GEMMInfo gemm_info;
gemm_info.set_fast_math(true);
sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gemm_info);
// sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);

while keeping the DataType as F32, and no issues occurred. I look forward to your response, thanks |
Hi @exaithrg Could you please run lscpu? I want to make sure your device supports bf16.
|
Hi @exaithrg In order to use the bf16/bf16/bf16 kernel you have to build the library with fixed_format_kernels=1.
There is no support for bf16 input if you don't use fixed format kernels. Hope this helps |
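A sketch of the rebuild, assuming the flag in question is fixed_format_kernels=1 (suggested by the mention of fixed format kernels, and matching the option name in the reporter's original scons command):

```shell
# Hypothetical rebuild: same options as the reporter's original scons command,
# with only fixed_format_kernels switched from 0 to 1.
scons Werror=1 -j128 neon=1 opencl=0 os=linux build=cross_compile examples=1 \
      arch=armv8.6-a-sve2 extra_cxx_flags="-fPIC" fixed_format_kernels=1 \
      toolchain_prefix=aarch64-none-linux-gnu-
```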
@morgolock Thank you for your timely reply. I recompiled the ACL library as suggested, and the test now passes.

In addition, attached are the results from running lscpu on my device:

geng@arm64max:~/work/llm_acl/ComputeLibrary$ lscpu
Architecture: aarch64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: 0x00
Model name: -
Model: 0
Thread(s) per core: 1
Core(s) per cluster: 4
Socket(s): -
Cluster(s): 1
Stepping: 0x0
BogoMIPS: 2000.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svef32mm svef64mm svebf16 i8mm bf16 dgh rng bti ecv wfxt
...
geng@arm64max:~/work/llm_acl/ComputeLibrary/build/tests$ ./arm_compute_validation --filter-id=200
[CORE][21-02-2025 10:29:55][INFO] "Set CPPScheduler to Linear mode, with 1 threads to use\n"
Version = arm_compute_version=v25.02 Build options: {'Werror': '1', 'logging': '1', 'debug': '1', 'asserts': '1', 'neon': '1', 'arch': 'armv8.6-a-sve2', 'extra_cxx_flags': '-fPIC', 'os': 'linux', 'build': 'cross_compile', 'examples': '1', 'opencl': '0', 'openmp': '1', 'cppthreads': '1', 'fixed_format_kernels': '0', 'standalone': '0', 'benchmark_examples': '0', 'validate_examples': '0', 'reference_openmp': '1', 'validation_tests': '1', 'benchmark_tests': '0', 'pmu': '0', 'build_dir': '.', 'toolchain_prefix': 'aarch64-none-linux-gnu-'} Git hash=v25.02-modi
CommandLine = ./arm_compute_validation --filter-id=200
Seed = 3461256592
cpu_has_sve = true
cpu_has_sve2 = true
cpu_has_svef32mm = true
cpu_has_svei8mm = true
cpu_has_svebf16 = true
cpu_has_sme = false
cpu_has_sme2 = false
cpu_has_fp16 = true
cpu_has_bf16 = true
cpu_has_dotprod = true
cpu_has_i8mm = true
CPU0 = GENERIC
CPU1 = GENERIC
CPU2 = GENERIC
CPU3 = GENERIC
Iterations = 1
Threads = 1
Dataset mode = PRECOMMIT
... And the output of the test-passing ./build/examples/neon_sgemm_bf16:

Note that the [ComputeLibrary][21-02-2025 10:43:57][INFO] log line did not print the DataType for tensors a, b, and d, but this did not affect the test's success. We will further inspect the detailed execution and verify the accuracy of the results. I think this issue can be closed as complete now. Thank you again for your invaluable assistance. |
I tried to write a matrix multiplication program in BF16 format using ACL by directly replacing DataType::F32 with DataType::BFLOAT16 in examples/neon_sgemm.cpp, but it failed because CpuGemmMatrixMultiplyKernel does not support datatype BFLOAT16.
First, I compiled ACL library version v25.02 on an x86 platform using aarch64-none-linux-gnu-12.2 with the following command:
scons Werror=1 -j128 logging=1 debug=1 asserts=1 arch=armv8.6-a-sve2 os=linux build=cross_compile examples=1 opencl=0 neon=1 openmp=1 cppthreads=1 fixed_format_kernels=0 standalone=0 benchmark_examples=0 validate_examples=0 reference_openmp=1 validation_tests=0 benchmark_tests=0 pmu=0 build_dir=. toolchain_prefix=aarch64-none-linux-gnu- | tee -i compile_log.log
Then I tried running ./build/examples/neon_sgemm on QEMU-emulated AArch64 Debian 12, and everything worked fine:
Next, I copied ./examples/neon_sgemm.cpp to ./examples/neon_sgemm_bf16.cpp, and performed the following substitution in ./examples/neon_sgemm_bf16.cpp:
from DataType::F32 to DataType::BFLOAT16
The modified code is:
Then I recompiled the entire ACL library, and the new ./build/examples/neon_sgemm_bf16 was successfully generated, but it crashes when running:
I investigated the cause of the crash, in brief:
When executing the following line in examples/neon_sgemm_bf16.cpp:
sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);
the program enters src/cpu/operators/CpuGemm.cpp:389, whose content is:
ARM_COMPUTE_RETURN_ON_ERROR(cpu::kernels::CpuGemmMatrixMultiplyKernel::validate( matrix_a_info, matrix_b_info, &tmp_output_info, alpha, run_interleave_transpose, reshape_info));
Next, the program enters src/cpu/kernels/CpuGemmMatrixMultiplyKernel.cpp, and then crashes at line 64:
The cause of the crash is obvious: CpuGemmMatrixMultiplyKernel.cpp does not support matrix multiplication in DataType::BFLOAT16 format.
However, in arm_compute/runtime/NEON/functions/NEGEMM.h, we clearly see support for BFLOAT16:
But in src/cpu/kernels/CpuGemmMatrixMultiplyKernel.h, it requires that the lhs must be of F16/F32 precision:
I checked the NEGEMM Class Reference (https://artificial-intelligence.sites.arm.com/computelibrary/v25.02/classarm__compute_1_1_n_e_g_e_m_m.xhtml), and the documentation also clearly states that the ACL library should support the BF16 format:
I am now unable to resolve this issue. Therefore, I would like to ask: How can I write a matrix multiplication program with BF16 precision based on the ACL library? I tried directly replacing DataType::F32 with DataType::BFLOAT16 in examples/neon_sgemm.cpp, but it failed because CpuGemmMatrixMultiplyKernel does not support datatype BFLOAT16.