
CpuGemmMatrixMultiplyKernel does not support datatype DataType::BFLOAT16 #1158

Open · exaithrg opened this issue Feb 20, 2025 · 5 comments
exaithrg commented Feb 20, 2025

I tried to write a matrix multiplication program in BF16 format using ACL by directly replacing DataType::F32 with DataType::BFLOAT16 in examples/neon_sgemm.cpp, but it failed because CpuGemmMatrixMultiplyKernel does not support datatype BFLOAT16.

First, I cross-compiled ACL v25.02 on an x86 host with the aarch64-none-linux-gnu 12.2 toolchain, using the following command:

scons Werror=1 -j128 logging=1 debug=1 asserts=1 arch=armv8.6-a-sve2 os=linux build=cross_compile examples=1 opencl=0 neon=1 openmp=1 cppthreads=1 fixed_format_kernels=0 standalone=0 benchmark_examples=0 validate_examples=0 reference_openmp=1 validation_tests=0 benchmark_tests=0 pmu=0 build_dir=. toolchain_prefix=aarch64-none-linux-gnu- | tee -i compile_log.log

Then I ran ./build/examples/neon_sgemm on a QEMU-emulated AArch64 Debian 12 system, and everything worked fine:

geng@arm64max:~/work/llm_acl/ComputeLibrary$ ./build/examples/neon_sgemm 5 6 7

./build/examples/neon_sgemm

 [ComputeLibrary][20-02-2025 04:51:23][INFO]  arm_compute::cpu::CpuGemm::configure() :
 a: Shape=7,5,DataLayout=NCHW,DataType=F32
 b: Shape=6,7,DataLayout=NCHW,DataType=F32
 c: nullptr
 d: Shape=6,5,DataLayout=NCHW,DataType=F32
 alpha: 1.000000
 beta: 0.000000
 gemm_info: {is_a_reshaped=0,is_b_reshaped=0,reshape_b_only_on_first_run=1,depth_output_gemm3d=0,reinterpret_input_as_3d=0,retain_internal_weights=0,fp_mixed_precision=0,broadcast_bias=0,pretranspose_B=0,}

 [CORE][20-02-2025 04:51:23][INFO]  "Set CPPScheduler to Linear mode, with 8 threads to use\n"

Test passed

Next, I copied ./examples/neon_sgemm.cpp to ./examples/neon_sgemm_bf16.cpp and replaced every occurrence of DataType::F32 with DataType::BFLOAT16 in the new file.

The modified code is:

/*
 * Copyright (c) 2018-2019 Arm Limited.
 *
 * SPDX-License-Identifier: MIT
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to
 * deal in the Software without restriction, including without limitation the
 * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
 * sell copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in all
 * copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/NEON/NEScheduler.h"

#include "utils/Utils.h"

#include <cstdlib>

using namespace arm_compute;
using namespace utils;

class NESGEMMExample : public Example
{
public:
    bool do_setup(int argc, char **argv) override
    {
        NPYLoader npy0;
        NPYLoader npy1;
        NPYLoader npy2;
        alpha = 1.0f;
        beta  = 0.0f;

        std::ifstream stream;
        if (argc > 1)
        {
            stream.open(argv[1], std::fstream::in);
        }

        if (argc < 3 || (argc < 4 && stream.bad()))
        {
            // Print help
            std::cout << "Usage: 1) ./build/neon_sgemm input_matrix_1.npy input_matrix_2.npy [input_matrix_3.npy] "
                         "[alpha = 1] [beta = 0]\n";
            std::cout << "       2) ./build/neon_sgemm M N K [alpha = 1.0f] [beta = 0.0f]\n\n";
            std::cout << "Too few or no input_matrices provided. Using M=7, N=3, K=5, alpha=1.0f and beta=0.0f\n\n";

            src0.allocator()->init(TensorInfo(TensorShape(5U, 7U), 1, DataType::BFLOAT16));
            src1.allocator()->init(TensorInfo(TensorShape(3U, 5U), 1, DataType::BFLOAT16));
            src2.allocator()->init(TensorInfo(TensorShape(3U, 7U), 1, DataType::BFLOAT16));
        }
        else
        {
            if (stream.good()) /* case file1.npy file2.npy [file3.npy] [alpha = 1.0f] [beta = 0.0f] */
            {
                npy0.open(argv[1]);
                npy0.init_tensor(src0, DataType::BFLOAT16);
                npy1.open(argv[2]);
                npy1.init_tensor(src1, DataType::BFLOAT16);

                if (argc > 3)
                {
                    stream.close();
                    stream.clear();
                    stream.open(argv[3], std::fstream::in);
                    if (stream.good()) /* case with third file */
                    {
                        npy2.open(argv[3]);
                        npy2.init_tensor(src2, DataType::BFLOAT16);

                        if (argc > 4)
                        {
                            // Convert string to float
                            alpha = strtof(argv[4], nullptr);

                            if (argc > 5)
                            {
                                // Convert string to float
                                beta = strtof(argv[5], nullptr);
                            }
                        }
                    }
                    else /* case without third file */
                    {
                        alpha = strtof(argv[3], nullptr);

                        if (argc > 4)
                        {
                            beta = strtof(argv[4], nullptr);
                        }
                    }
                }
            }
            else /* case M N K [alpha = 1.0f] [beta = 0.0f] */
            {
                size_t M = strtol(argv[1], nullptr, 10);
                size_t N = strtol(argv[2], nullptr, 10);
                size_t K = strtol(argv[3], nullptr, 10);

                src0.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::BFLOAT16));
                src1.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::BFLOAT16));
                src2.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::BFLOAT16));

                if (argc > 4)
                {
                    alpha = strtof(argv[4], nullptr);

                    if (argc > 5)
                    {
                        beta = strtof(argv[5], nullptr);
                    }
                }
            }
        }

        init_sgemm_output(dst, src0, src1, DataType::BFLOAT16);

        // Configure function
        sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);

        // Allocate all the images
        src0.allocator()->allocate();
        src1.allocator()->allocate();
        dst.allocator()->allocate();

        // Fill the input images with either the data provided or random data
        if (npy0.is_open())
        {
            npy0.fill_tensor(src0);
            npy1.fill_tensor(src1);

            output_filename = "sgemm_out.npy";
            is_fortran      = npy0.is_fortran();

            if (npy2.is_open())
            {
                src2.allocator()->allocate();
                npy2.fill_tensor(src2);
            }
        }
        else
        {
            src2.allocator()->allocate();

            fill_random_tensor(src0, -1.f, 1.f);
            fill_random_tensor(src1, -1.f, 1.f);
            fill_random_tensor(src2, -1.f, 1.f);
        }

        // Dummy run for CLTuner
        sgemm.run();

        return true;
    }
    void do_run() override
    {
        // Execute the function
        sgemm.run();
    }
    void do_teardown() override
    {
        if (!output_filename.empty()) /* Save to .npy file */
        {
            save_to_npy(dst, output_filename, is_fortran);
        }
    }

private:
    Tensor      src0{}, src1{}, src2{}, dst{};
    NEGEMM      sgemm{};
    float       alpha{}, beta{};
    bool        is_fortran{};
    std::string output_filename{};
};

/** Main program for sgemm test
 *
 * @param[in] argc Number of arguments
 * @param[in] argv Arguments ( [optional] Matrix A, [optional] Matrix B, [optional] Matrix C, [optional] alpha, [optional] beta )
 */
int main(int argc, char **argv)
{
    return utils::run_example<NESGEMMExample>(argc, argv);
}

Then I recompiled the entire ACL library. The new ./build/examples/neon_sgemm_bf16 was generated successfully, but it crashes at runtime:

geng@arm64max:~/work/llm_acl/ComputeLibrary$ ./build/examples/neon_sgemm_bf16 5 6 7

./build/examples/neon_sgemm_bf16

!!!!!!!!!!!!!!!!!!!!!!!!!!!

ERROR in validate_arguments src/cpu/kernels/CpuGemmMatrixMultiplyKernel.cpp:64: ITensor data type  not supported by this kernel No such file or directory
!!!!!!!!!!!!!!!!!!!!!!!!!!!

Test FAILED

I investigated the cause of the crash. In brief:

When executing the following line in examples/neon_sgemm_bf16.cpp:

sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);

the program enters src/cpu/operators/CpuGemm.cpp:389, whose content is:

ARM_COMPUTE_RETURN_ON_ERROR(cpu::kernels::CpuGemmMatrixMultiplyKernel::validate(
            matrix_a_info, matrix_b_info, &tmp_output_info, alpha, run_interleave_transpose, reshape_info));

Next, the program enters src/cpu/kernels/CpuGemmMatrixMultiplyKernel.cpp, and then crashes at line 64:

ARM_COMPUTE_RETURN_ERROR_ON_DATA_TYPE_CHANNEL_NOT_IN(lhs, 1, DataType::F16, DataType::F32);

The cause of the crash is obvious: CpuGemmMatrixMultiplyKernel.cpp does not support matrix multiplication in DataType::BFLOAT16 format.
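To make the failure concrete, here is a paraphrase of what the check at line 64 effectively does (my reading of the macro's intent, not its exact expansion): only F16 and F32 pass validation, so a BFLOAT16 lhs is rejected before any kernel is even configured.

// Paraphrase of validate_arguments() at CpuGemmMatrixMultiplyKernel.cpp:64,
// not the literal macro expansion: anything outside {F16, F32} is rejected.
if (lhs->data_type() != DataType::F16 && lhs->data_type() != DataType::F32)
{
    return ARM_COMPUTE_CREATE_ERROR(ErrorCode::RUNTIME_ERROR,
                                    "ITensor data type not supported by this kernel");
}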

However, in arm_compute/runtime/NEON/functions/NEGEMM.h, we clearly see support for BFLOAT16:

     * Valid data type configurations:
     * |src0         |src1        |src2      |dst            |
     * |:------------|:-----------|:---------|:--------------|
     * |F32          |F32         |F32       |F32            |
     * |F16          |F16         |F16       |F16            |
     * |BFLOAT16     |BFLOAT16    |BFLOAT16  |BFLOAT16       |

But in src/cpu/kernels/CpuGemmMatrixMultiplyKernel.h, the lhs is required to be of F16/F32 precision:

class CpuGemmMatrixMultiplyKernel : public ICpuKernel<CpuGemmMatrixMultiplyKernel>
{
...
     * @param[in]  lhs            Left-handside tensor info containing the interleaved Matrix A or the vector A. Data types supported: F16/F32

I checked the NEGEMM Class Reference (https://artificial-intelligence.sites.arm.com/computelibrary/v25.02/classarm__compute_1_1_n_e_g_e_m_m.xhtml), and the documentation also clearly states that the ACL library should support the BF16 format:

Member Function Documentation: configure()

void configure(const ITensor *a, const ITensor *b, const ITensor *c, ITensor *d,
               float alpha, float beta, const GEMMInfo &gemm_info = GEMMInfo())

Initialise the kernel's inputs, output.

Valid data layouts: All

Valid data type configurations:

src0     | src1     | src2     | dst
F32      | F32      | F32      | F32
F16      | F16      | F16      | F16
BFLOAT16 | BFLOAT16 | BFLOAT16 | BFLOAT16

I am unable to resolve this on my own, so I would like to ask: how can I write a BF16-precision matrix multiplication program on top of the ACL library, given that directly replacing DataType::F32 with DataType::BFLOAT16 in examples/neon_sgemm.cpp fails in CpuGemmMatrixMultiplyKernel?

exaithrg changed the title from "./examples/neon_sgemm.cpp failed when change datatype to DataType::BFLOAT16" to "CpuGemmMatrixMultiplyKernel does not support datatype DataType::BFLOAT16" on Feb 20, 2025

morgolock commented Feb 20, 2025

Hi @exaithrg

Quoting the documentation about BF16 acceleration:

To enable BF16 acceleration when running FP32 "fast-math" has to be enabled and that works only for Neon convolution layer using cpu gemm. In this scenario on CPU: the CpuGemmConv2d kernel performs the conversion from FP32, type of input tensor, to BF16 at block level to exploit the arithmetic capabilities dedicated to BF16. Then transforms back to FP32, the output tensor type.
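For reference, on that convolution path fast math is an explicit parameter of NEGEMMConvolutionLayer::configure. A minimal sketch under that assumption (signature and shapes reflect my reading of recent ACL headers, not anything in this issue):

// FP32 tensors in, FP32 out; enable_fast_math = true lets the convolution's
// internal GEMM convert blocks to BF16 where the CPU supports it.
Tensor src, weights, dst;
src.allocator()->init(TensorInfo(TensorShape(32U, 32U, 16U), 1, DataType::F32));
weights.allocator()->init(TensorInfo(TensorShape(3U, 3U, 16U, 8U), 1, DataType::F32));
dst.allocator()->init(TensorInfo(TensorShape(30U, 30U, 8U), 1, DataType::F32));

NEGEMMConvolutionLayer conv;
conv.configure(&src, &weights, nullptr, &dst, PadStrideInfo(1, 1, 0, 0),
               WeightsInfo(), Size2D(1U, 1U), ActivationLayerInfo(),
               /* enable_fast_math */ true);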

I think the problem here is that you have to pass fast_math=true into NEGEMM::configure to enable BF16.

Something like below:

GEMMInfo gemm_info;
gemm_info.set_fast_math(true);
sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gemm_info);

Hope this helps


exaithrg commented Feb 20, 2025

@morgolock Thank you for your reply. I tried again and completed the following steps:

  1. Copied ./examples/neon_sgemm.cpp to ./examples/neon_sgemm_bf16.cpp
  2. Replaced all DataType::F32 with DataType::BFLOAT16 in ./examples/neon_sgemm_bf16.cpp
  3. Replaced the original line 131:
        // Configure function
        sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);

with

        // Configure function
        GEMMInfo gemm_info;
        gemm_info.set_fast_math(true);
        sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gemm_info);
        // sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);

Then I recompiled. This time I added the extra_cxx_flags="-fPIC" option to the compilation command from the beginning of the issue, to follow the example in the BF16 acceleration documentation more closely:

scons Werror=1 -j128 logging=1 debug=1 asserts=1 neon=1 arch=armv8.6-a-sve2 extra_cxx_flags="-fPIC" os=linux build=cross_compile examples=1 opencl=0 openmp=1 cppthreads=1 fixed_format_kernels=0 standalone=0 benchmark_examples=0 validate_examples=0 reference_openmp=1 validation_tests=0 benchmark_tests=0 pmu=0 build_dir=. toolchain_prefix=aarch64-none-linux-gnu- | tee -i compile_log.log

According to the BF16 acceleration reference, I set neon=1, arch=armv8.6-a-sve2, extra_cxx_flags="-fPIC".

However, when running the compiled ./build/examples/neon_sgemm_bf16, I still hit the same problem: the program fails at src/cpu/kernels/CpuGemmMatrixMultiplyKernel.cpp:64 for the same reason:

ARM_COMPUTE_RETURN_ERROR_ON_DATA_TYPE_CHANNEL_NOT_IN(lhs, 1, DataType::F16, DataType::F32);

The current code of neon_sgemm_bf16.cpp is as follows:

// neon_sgemm_bf16.cpp
/*
 * Copyright (c) 2018-2019 Arm Limited.
 *
 * SPDX-License-Identifier: MIT
 *
 * Permission is hereby granted, free of charge, to any person obtaining a copy
 * of this software and associated documentation files (the "Software"), to
 * deal in the Software without restriction, including without limitation the
 * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
 * sell copies of the Software, and to permit persons to whom the Software is
 * furnished to do so, subject to the following conditions:
 *
 * The above copyright notice and this permission notice shall be included in all
 * copies or substantial portions of the Software.
 *
 * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
 * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
 * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
 * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
 * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
 * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 * SOFTWARE.
 */
#include "arm_compute/core/Types.h"
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/NEON/NEScheduler.h"

#include "utils/Utils.h"

#include <cstdlib>

using namespace arm_compute;
using namespace utils;

class NESGEMMExample : public Example
{
public:
    bool do_setup(int argc, char **argv) override
    {
        NPYLoader npy0;
        NPYLoader npy1;
        NPYLoader npy2;
        alpha = 1.0f;
        beta  = 0.0f;

        std::ifstream stream;
        if (argc > 1)
        {
            stream.open(argv[1], std::fstream::in);
        }

        if (argc < 3 || (argc < 4 && stream.bad()))
        {
            // Print help
            std::cout << "Usage: 1) ./build/neon_sgemm input_matrix_1.npy input_matrix_2.npy [input_matrix_3.npy] "
                         "[alpha = 1] [beta = 0]\n";
            std::cout << "       2) ./build/neon_sgemm M N K [alpha = 1.0f] [beta = 0.0f]\n\n";
            std::cout << "Too few or no input_matrices provided. Using M=7, N=3, K=5, alpha=1.0f and beta=0.0f\n\n";

            src0.allocator()->init(TensorInfo(TensorShape(5U, 7U), 1, DataType::BFLOAT16));
            src1.allocator()->init(TensorInfo(TensorShape(3U, 5U), 1, DataType::BFLOAT16));
            src2.allocator()->init(TensorInfo(TensorShape(3U, 7U), 1, DataType::BFLOAT16));
        }
        else
        {
            if (stream.good()) /* case file1.npy file2.npy [file3.npy] [alpha = 1.0f] [beta = 0.0f] */
            {
                npy0.open(argv[1]);
                npy0.init_tensor(src0, DataType::BFLOAT16);
                npy1.open(argv[2]);
                npy1.init_tensor(src1, DataType::BFLOAT16);

                if (argc > 3)
                {
                    stream.close();
                    stream.clear();
                    stream.open(argv[3], std::fstream::in);
                    if (stream.good()) /* case with third file */
                    {
                        npy2.open(argv[3]);
                        npy2.init_tensor(src2, DataType::BFLOAT16);

                        if (argc > 4)
                        {
                            // Convert string to float
                            alpha = strtof(argv[4], nullptr);

                            if (argc > 5)
                            {
                                // Convert string to float
                                beta = strtof(argv[5], nullptr);
                            }
                        }
                    }
                    else /* case without third file */
                    {
                        alpha = strtof(argv[3], nullptr);

                        if (argc > 4)
                        {
                            beta = strtof(argv[4], nullptr);
                        }
                    }
                }
            }
            else /* case M N K [alpha = 1.0f] [beta = 0.0f] */
            {
                size_t M = strtol(argv[1], nullptr, 10);
                size_t N = strtol(argv[2], nullptr, 10);
                size_t K = strtol(argv[3], nullptr, 10);

                src0.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::BFLOAT16));
                src1.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::BFLOAT16));
                src2.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::BFLOAT16));

                if (argc > 4)
                {
                    alpha = strtof(argv[4], nullptr);

                    if (argc > 5)
                    {
                        beta = strtof(argv[5], nullptr);
                    }
                }
            }
        }

        init_sgemm_output(dst, src0, src1, DataType::BFLOAT16);

        // Configure function
        GEMMInfo gemm_info;
        gemm_info.set_fast_math(true);
        sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gemm_info);
        // sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);

        // Allocate all the images
        src0.allocator()->allocate();
        src1.allocator()->allocate();
        dst.allocator()->allocate();

        // Fill the input images with either the data provided or random data
        if (npy0.is_open())
        {
            npy0.fill_tensor(src0);
            npy1.fill_tensor(src1);

            output_filename = "sgemm_out.npy";
            is_fortran      = npy0.is_fortran();

            if (npy2.is_open())
            {
                src2.allocator()->allocate();
                npy2.fill_tensor(src2);
            }
        }
        else
        {
            src2.allocator()->allocate();

            fill_random_tensor(src0, -1.f, 1.f);
            fill_random_tensor(src1, -1.f, 1.f);
            fill_random_tensor(src2, -1.f, 1.f);
        }

        // Dummy run for CLTuner
        sgemm.run();

        return true;
    }
    void do_run() override
    {
        // Execute the function
        sgemm.run();
    }
    void do_teardown() override
    {
        if (!output_filename.empty()) /* Save to .npy file */
        {
            save_to_npy(dst, output_filename, is_fortran);
        }
    }

private:
    Tensor      src0{}, src1{}, src2{}, dst{};
    NEGEMM      sgemm{};
    float       alpha{}, beta{};
    bool        is_fortran{};
    std::string output_filename{};
};

/** Main program for sgemm test
 *
 * @param[in] argc Number of arguments
 * @param[in] argv Arguments ( [optional] Matrix A, [optional] Matrix B, [optional] Matrix C, [optional] alpha, [optional] beta )
 */
int main(int argc, char **argv)
{
    return utils::run_example<NESGEMMExample>(argc, argv);
}

I'm pretty sure a gemm_info with fast_math correctly set was passed to sgemm.configure. The GDB output is:

Reading symbols from ./build/examples/neon_sgemm_bf16...
(gdb) b examples/neon_sgemm_bf16.cpp:131
Breakpoint 1 at 0x41baa8: file examples/neon_sgemm_bf16.cpp, line 131.
(gdb) r
Starting program: /mnt/arm64max_share_folder/work/llm_acl/ComputeLibrary/build/examples/neon_sgemm_bf16
warning: Unable to determine the number of hardware watchpoints available.
warning: Unable to determine the number of hardware breakpoints available.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".

/mnt/arm64max_share_folder/work/llm_acl/ComputeLibrary/build/examples/neon_sgemm_bf16

Usage: 1) ./build/neon_sgemm input_matrix_1.npy input_matrix_2.npy [input_matrix_3.npy] [alpha = 1] [beta = 0]
       2) ./build/neon_sgemm M N K [alpha = 1.0f] [beta = 0.0f]

Too few or no input_matrices provided. Using M=7, N=3, K=5, alpha=1.0f and beta=0.0f


Breakpoint 1, NESGEMMExample::do_setup (this=0x49c390, argc=1,
    argv=0xfffffffff338) at examples/neon_sgemm_bf16.cpp:131
131             GEMMInfo gemm_info;
(gdb) l
126             }
127
128             init_sgemm_output(dst, src0, src1, DataType::BFLOAT16);
129
130             // Configure function
131             GEMMInfo gemm_info;
132             gemm_info.set_fast_math(true);
133             sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gemm_info);
134             // sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);
135
(gdb) n
132             gemm_info.set_fast_math(true);
(gdb) n
133             sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gemm_info);
(gdb) p gemm_info
$1 = {_is_a_reshaped = false, _is_b_reshaped = false,
  _reshape_b_only_on_first_run = true, _depth_output_gemm3d = 0,
  _reinterpret_input_as_3d = false, _retain_internal_weights = false,
  _gemmlowp_output_stage = {type = arm_compute::GEMMLowpOutputStageType::NONE,
    gemmlowp_offset = 0, gemmlowp_multiplier = 0, gemmlowp_shift = 0,
    gemmlowp_min_bound = -2147483648, gemmlowp_max_bound = 2147483647,
    gemmlowp_multipliers = std::vector of length 0, capacity 0,
    gemmlowp_shifts = std::vector of length 0, capacity 0,
    gemmlowp_real_multiplier = 0, is_quantized_per_channel = false,
    output_data_type = arm_compute::DataType::UNKNOWN}, _fast_math = true,
  _fp_mixed_precision = false, _broadcast_bias = false,
  _pretranspose_A = false, _pretranspose_B = false, _activation_info = {
    _act = arm_compute::ActivationFunction::IDENTITY, _a = 0, _b = 0,
    _enabled = false, _lut = {_M_elems = '\000' <repeats 255 times>},
    _lut_fp16 = std::shared_ptr<std::array<__fp16, 65536>> (empty) = {
      get() = 0x0}}, _fixed_format = false,
  _weight_format = arm_compute::WeightFormat::UNSPECIFIED, _accumulate = false}
(gdb) n
[New Thread 0xfffff603f0c0 (LWP 574)]
[New Thread 0xfffff582f0c0 (LWP 575)]
[New Thread 0xfffff501f0c0 (LWP 576)]
!!!!!!!!!!!!!!!!!!!!!!!!!!!

ERROR in validate_arguments src/cpu/kernels/CpuGemmMatrixMultiplyKernel.cpp:64: ITensor data type  not supported by this kernel
!!!!!!!!!!!!!!!!!!!!!!!!!!!

Test FAILED
[Thread 0xfffff603f0c0 (LWP 574) exited]
[Thread 0xfffff501f0c0 (LWP 576) exited]
[Thread 0xfffff582f0c0 (LWP 575) exited]
[Inferior 1 (process 572) exited with code 0377]
(gdb)	

It appears that setting fast_math=true in gemm_info does not alter the program's data flow during execution: neon_sgemm_bf16 still goes into CpuGemm::validate, which calls cpu::kernels::CpuGemmMatrixMultiplyKernel::validate and then fails on ARM_COMPUTE_RETURN_ERROR_ON_DATA_TYPE_CHANNEL_NOT_IN(lhs, 1, DataType::F16, DataType::F32).
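My reading of CpuGemm::validate (paraphrased from memory, so treat the exact structure as an assumption) is that the optimised assembly path is tried first, and the generic F16/F32-only kernels are only validated as the fallback:

// Rough paraphrase of src/cpu/operators/CpuGemm.cpp, not the exact source:
// if the assembly dispatch rejects the configuration (as it seems to for
// plain BF16 tensors here), validation falls through to the generic
// kernels, which only accept F16/F32.
const bool run_optimised = bool(CpuGemmAssemblyDispatch::validate(a, b, c, d, asm_info));
if (!run_optimised)
{
    // This is where the BFLOAT16 configuration fails:
    ARM_COMPUTE_RETURN_ON_ERROR(kernels::CpuGemmMatrixMultiplyKernel::validate(
        matrix_a_info, matrix_b_info, &tmp_output_info, alpha, run_interleave_transpose, reshape_info));
}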

Additionally, I tried using

        // Configure function
        GEMMInfo gemm_info;
        gemm_info.set_fast_math(true);
        sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gemm_info);
        // sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta);

while keeping the DataType as F32, and no issues occurred.
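For clarity, a minimal sketch of the combination that did work (same scaffolding as the example above, tensors kept in F32):

// F32 tensors plus fast_math: validation passes, and BF16 may be used
// internally where the hardware supports it.
src0.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::F32));
src1.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::F32));
init_sgemm_output(dst, src0, src1, DataType::F32);

GEMMInfo gemm_info;
gemm_info.set_fast_math(true);
sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gemm_info);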

I look forward to your response. Thanks!

morgolock commented:

Hi @exaithrg

Could you please run arm_compute_validation and share the output of the first few lines as shown below:

I want to make sure your device supports bf16.

ComputeLibrary]$ LD_LIBRARY_PATH=./build/:$LD_LIBRARY_PATH ./build/tests/arm_compute_validation --filter-id=200
Version = arm_compute_version=v0.0-unreleased Build options: {'Werror': '0', 'debug': '0', 'neon': '1', 'opencl': '0', 'embed_kernels': '0', 'validation_tests': '1', 'os': 'linux', 'arch': 'armv8a', 'build': 'native', 'multi_isa': '1', 'fixed_format_kernels': '1', 'openmp': '0', 'cppthreads': '1', 'asserts': '0', 'logging': '0'} Git hash=b'ac1ee52827530974d94ad3604fe343b60da686a3'
CommandLine = ./build/tests/arm_compute_validation --filter-id=200 
Seed = 1384251364
cpu_has_sve = false
cpu_has_sve2 = false
cpu_has_svef32mm = false
cpu_has_svei8mm = false
cpu_has_svebf16 = false
cpu_has_sme = false
cpu_has_sme2 = false
cpu_has_fp16 = true
cpu_has_bf16 = false
cpu_has_dotprod = true
cpu_has_i8mm = false

morgolock commented:

Hi @exaithrg

In order to use the bf16/bf16/bf16 kernel, you have to build the library with fixed_format_kernels=1 and make the following changes to your test:

         GEMMInfo gi;
         gi.set_fixed_format(true);   
         gi.set_weight_format(arm_compute::WeightFormat::OHWIo4i4);
 
         // Configure function
         sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gi);

There is no support for bf16 input if you don't use fixed format kernels.
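A possible refinement, if it helps: rather than hard-coding OHWIo4i4, I believe NEGEMM::has_opt_impl can be asked which fixed-format weight layout the selected kernel expects. The sketch below assumes the signature I remember; please verify it against your ACL version:

// Assumed signature: NEGEMM::has_opt_impl(WeightFormat &, a, b, c, d, alpha, beta, gemm_info).
// Query the expected fixed-format weight layout instead of hard-coding it.
arm_compute::WeightFormat expected_wf = arm_compute::WeightFormat::ANY;
GEMMInfo gi;
gi.set_fixed_format(true);
gi.set_weight_format(arm_compute::WeightFormat::ANY);
Status status = NEGEMM::has_opt_impl(expected_wf, src0.info(), src1.info(), nullptr, dst.info(), alpha, beta, gi);
if (bool(status))
{
    gi.set_weight_format(expected_wf); // use the layout the kernel actually wants
}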

Hope this helps


exaithrg commented Feb 21, 2025

@morgolock Thank you for your timely reply. I recompiled the ACL library using fixed_format_kernels=1 as per your instructions and made the necessary changes in the test code. Now the BF16 matrix multiplication works properly. I will further study what these new configuration options actually mean.
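To recap for anyone who finds this issue later, the working BF16 setup assembled from this thread is:

// 1) Build the library with fixed_format_kernels=1 (plus the flags above).
// 2) Keep src0/src1/dst in DataType::BFLOAT16.
// 3) Configure NEGEMM with a fixed-format GEMMInfo:
GEMMInfo gi;
gi.set_fixed_format(true);
gi.set_weight_format(arm_compute::WeightFormat::OHWIo4i4);
sgemm.configure(&src0, &src1, nullptr, &dst, alpha, beta, gi);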

In addition, here are the results of running lscpu and arm_compute_validation in my QEMU environment:

geng@arm64max:~/work/llm_acl/ComputeLibrary$ lscpu
Architecture:             aarch64
  CPU op-mode(s):         32-bit, 64-bit
  Byte Order:             Little Endian
CPU(s):                   4
  On-line CPU(s) list:    0-3
Vendor ID:                0x00
  Model name:             -
    Model:                0
    Thread(s) per core:   1
    Core(s) per cluster:  4
    Socket(s):            -
    Cluster(s):           1
    Stepping:             0x0
    BogoMIPS:             2000.00
    Flags:                fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fph
                          p asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 s
                          m3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc fl
                          agm ssbs sb paca pacg dcpodp sve2 sveaes svepmull sveb
                          itperm svesha3 svesm4 flagm2 frint svei8mm svef32mm sv
                          ef64mm svebf16 i8mm bf16 dgh rng bti ecv wfxt
...
geng@arm64max:~/work/llm_acl/ComputeLibrary/build/tests$ ./arm_compute_validation --filter-id=200
 [CORE][21-02-2025 10:29:55][INFO]  "Set CPPScheduler to Linear mode, with 1 threads to use\n"
Version = arm_compute_version=v25.02 Build options: {'Werror': '1', 'logging': '1', 'debug': '1', 'asserts': '1', 'neon': '1', 'arch': 'armv8.6-a-sve2', 'extra_cxx_flags': '-fPIC', 'os': 'linux', 'build': 'cross_compile', 'examples': '1', 'opencl': '0', 'openmp': '1', 'cppthreads': '1', 'fixed_format_kernels': '0', 'standalone': '0', 'benchmark_examples': '0', 'validate_examples': '0', 'reference_openmp': '1', 'validation_tests': '1', 'benchmark_tests': '0', 'pmu': '0', 'build_dir': '.', 'toolchain_prefix': 'aarch64-none-linux-gnu-'} Git hash=v25.02-modi
CommandLine = ./arm_compute_validation --filter-id=200
Seed = 3461256592
cpu_has_sve = true
cpu_has_sve2 = true
cpu_has_svef32mm = true
cpu_has_svei8mm = true
cpu_has_svebf16 = true
cpu_has_sme = false
cpu_has_sme2 = false
cpu_has_fp16 = true
cpu_has_bf16 = true
cpu_has_dotprod = true
cpu_has_i8mm = true
CPU0 = GENERIC
CPU1 = GENERIC
CPU2 = GENERIC
CPU3 = GENERIC
Iterations = 1
Threads = 1
Dataset mode = PRECOMMIT
...

And the output of the test-passing ./build/examples/neon_sgemm_bf16:

geng@arm64max:~/work/llm_acl/ComputeLibrary$ ./build/examples/neon_sgemm_bf16 50 60 70

./build/examples/neon_sgemm_bf16

 [ComputeLibrary][21-02-2025 10:43:57][INFO]  arm_compute::cpu::CpuGemm::configure() :
 a: Shape=70,50,DataLayout=NCHW,DataType=
 b: Shape=60,70,DataLayout=NCHW,DataType=
 c: nullptr
 d: Shape=60,50,DataLayout=NCHW,DataType=
 alpha: 1.000000
 beta: 0.000000
 gemm_info: {is_a_reshaped=0,is_b_reshaped=0,reshape_b_only_on_first_run=1,depth_output_gemm3d=0,reinterpret_input_as_3d=0,retain_internal_weights=0,fp_mixed_precision=0,broadcast_bias=0,pretranspose_B=0,}

 [CORE][21-02-2025 10:43:57][INFO]  "Set CPPScheduler to Linear mode, with 4 threads to use\n"
 [CORE][21-02-2025 10:43:57][INFO]  "Set CPPScheduler to Linear mode, with 4 threads to use\n"

Test passed

Note that the [ComputeLibrary][21-02-2025 10:43:57][INFO] log line did not print the DataType for tensors a, b, and d, but this did not affect the test's success. We will further inspect the detailed execution process and verify the accuracy of the results.

I think this issue can be closed as complete now. Thank you again for your invaluable assistance.
