AArch64: matmul acl kernel aborted with Assertion failure #2674

Open
xiang1guo opened this issue Feb 12, 2025 · 0 comments
Labels
platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 sighting Suspicious library behavior. Should be promoted to a bug when confirmed

xiang1guo commented Feb 12, 2025

Background

Recently, the graph team started investigating the skipped (failing) graph test cases in #2168. For the SDPA/MHA/MQA-related cases, I can finally reproduce the issue with the primitive matmul API. Let's restore these test cases based on this issue.

cpu-graph-gqa-cpp
cpu-graph-mqa-cpp
cpu-graph-sdpa-cpp
cpu-graph-sdpa-stacked-qkv-cpp
test_graph_unit_dnnl_large_partition_usm_cpu(SDPA related case)
test_graph_unit_dnnl_sdp_decomp_usm_cpu

Summary

I wrote an example with the primitive API. The case fails in the acl_matmul kernel; it passes after removing the kernel here. The failure log is as follows:

ubuntu@ip-172-31-44-162:~/oneDNN/build$ ./examples/primitives-matmul-cpp 
fill data ...
primitive creation ...
primitive execution ...
primitives-matmul-cpp: src/core/NEON/kernels/arm_gemm/gemm_hybrid_indirect.hpp:453: void arm_gemm::GemmHybridIndirect<strategy, To, Tw, Tr, OutputStage, SeparateQuantize, FixedFormat>::execute_stateless(const ndcoord_t&, const ndcoord_t&, int, arm_gemm::GemmArrays<To, Tw, Tr>&) [with strategy = arm_gemm::cls_sve_hybrid_fp32_mla_6x4VL; To = float; Tw = float; Tr = float; OutputStage = arm_gemm::Nothing; bool SeparateQuantize = false; bool FixedFormat = false; arm_gemm::ndcoord_t = arm_gemm::NDCoordinate<6>]: Assertion `FixedFormat || _B_transposed' failed.
Aborted (core dumped)

Version

Latest oneDNN main branch; the commit I used: af1410c

Environment

  • system: Linux 22.04.1-Ubuntu SMP aarch64 aarch64 aarch64 GNU/Linux
  • gcc: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  • cmake: 3.22.1

Steps to reproduce

  • Code change
    Add the following example in /path-to-oneDNN-repo/oneDNN/examples/primitives/test.cpp
/*******************************************************************************
* Copyright 2020-2023 Intel Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
*     http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*******************************************************************************/
#include <algorithm>
#include <chrono>
#include <cmath>
#include <iostream>
#include <random>
#include <string>
#include <vector>
#include <omp.h>
#include "example_utils.hpp"
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;
using tag = memory::format_tag;
using dt = memory::data_type;
using md = memory::desc;
// number of omp threads.
#define NTHREADS 4
// number of loop for perf measurement.
#define NLOOP 10

void fill_data(memory &mem) {
    float *handle = static_cast<float *>(mem.get_data_handle());
    // get_size() returns bytes; divide by sizeof(float) for the element count.
    size_t size = mem.get_desc().get_size() / sizeof(float);

    if (handle == nullptr || size == 0) {
        return;
    }

#pragma omp parallel num_threads(NTHREADS)
    {
        // Per-thread RNG: sharing one mt19937 across threads is a data race.
        std::mt19937 mt(static_cast<unsigned>(time(0)) + omp_get_thread_num());
        std::uniform_real_distribution<float> dist(-100.0f, 100.0f);
#pragma omp for
        for (size_t i = 0; i < size; i++) {
            handle[i] = dist(mt);
        }
    }
}

static const memory::dim MBO = NTHREADS, MBI = 16, M1 = 384, K1 = 384, N1 = 64;
void matmul_decomp(dnnl::engine::kind engine_kind) {
    auto datatype = dt::f32;
    memory::dims src1_dims = {MBO, MBI, M1, K1};
    memory::dims wei1_dims = {MBO, MBI, K1, N1};
    memory::dims dst1_dims = {MBO, MBI, M1, N1};

    auto src1_md = memory::desc(src1_dims, datatype, tag::abcd);
    auto wei1_md = memory::desc(wei1_dims, datatype, tag::abdc);
    auto dst1_md = memory::desc(dst1_dims, datatype, tag::abcd);

    dnnl::engine eng(engine::kind::cpu, 0);
    auto src1 = memory(src1_md, eng);
    auto wei1 = memory(wei1_md, eng);
    auto dst1 = memory(dst1_md, eng);

    std::cout << "fill data ..." << std::endl;
    fill_data(src1);
    fill_data(wei1);
    
    std::cout << "primitive creation ..." << std::endl;
    memory::dims sub_src1_dims = {1, 1, M1, K1};
    memory::dims sub_wei1_dims = {1, 1, K1, N1};
    memory::dims sub_dst1_dims = {1, 1, M1, N1};

    auto sub_src1_md = memory::desc(sub_src1_dims, datatype, tag::abcd);
    auto sub_wei1_md = memory::desc(sub_wei1_dims, datatype, tag::abdc);
    auto sub_dst1_md = memory::desc(sub_dst1_dims, datatype, tag::abcd);
    // must use user mode to support concurrent execution
    primitive_attr attr;
    attr.set_scratchpad_mode(scratchpad_mode::user);

    auto sub_mm1_pd = matmul::primitive_desc(
            eng, sub_src1_md, sub_wei1_md, sub_dst1_md, attr);
    auto sub_mm1_prim = matmul(sub_mm1_pd);

    std::vector<memory> sub_src1_vec(NTHREADS);   
    std::vector<memory> sub_wei1_vec(NTHREADS);
    std::vector<memory> sub_dst1_vec(NTHREADS);
    std::vector<memory> sub_scratchpad_vec(NTHREADS);
    
    memory::desc max_scratchpad = sub_mm1_pd.scratchpad_desc();

    for (size_t id = 0; id < NTHREADS; ++id) {
        sub_src1_vec[id] = memory(sub_src1_md, eng);
        sub_dst1_vec[id] = memory(sub_dst1_md, eng);
        sub_wei1_vec[id] = memory(sub_wei1_md, eng);
        sub_scratchpad_vec[id] = memory(max_scratchpad, eng);
    }

    const auto loop = [&](size_t bo, size_t bi) {
        const int tid = omp_get_thread_num();
        auto &sub_src1 = sub_src1_vec[tid];
        auto &sub_wei1 = sub_wei1_vec[tid];
        auto &sub_dst1 = sub_dst1_vec[tid];
        auto &sub_scratchpad = sub_scratchpad_vec[tid];
        // every thread has its own stream.
        dnnl::stream strm(eng);
        // Offsets below are element counts; scale by sizeof(float) since the
        // base pointers are advanced as char*.
        const size_t sub_src1_offset = M1 * K1 * (bi + MBI * bo) * sizeof(float);
        const size_t sub_wei1_offset = K1 * N1 * (bi + MBI * bo) * sizeof(float);
        const size_t sub_dst1_offset = M1 * N1 * (bi + MBI * bo) * sizeof(float);
        sub_src1.set_data_handle(
                static_cast<char *>(src1.get_data_handle()) + sub_src1_offset);
        sub_wei1.set_data_handle(
                static_cast<char *>(wei1.get_data_handle()) + sub_wei1_offset);
        sub_dst1.set_data_handle(
                static_cast<char *>(dst1.get_data_handle()) + sub_dst1_offset);
        sub_mm1_prim.execute(strm,
                {{DNNL_ARG_SRC, sub_src1}, {DNNL_ARG_WEIGHTS, sub_wei1},
                        {DNNL_ARG_DST, sub_dst1},
                        {DNNL_ARG_SCRATCHPAD, sub_scratchpad}});

    };

    std::cout << "primitive execution ..." << std::endl;

    for (int time = 0; time < NLOOP; ++time) {
#pragma omp parallel for collapse(2) num_threads(NTHREADS)
        for (size_t bo = 0; bo < MBO; ++bo) {
            for (size_t bi = 0; bi < MBI; ++bi) {
                loop(bo, bi);
            }
        }
    }
}

int main(int argc, char **argv) {
    auto exit_code
            = handle_example_errors(matmul_decomp, parse_engine_kind(argc, argv));
    return exit_code;
}
  • Build library:
1. setup ACL library
    git clone --branch v24.11.1 --depth 1 https://github.com/ARM-software/ComputeLibrary.git 
    git checkout 1f3bf6bbc4a1a57b5915fc0a19b195ae53acc06d
    scons -j4 Werror=0 debug=0 neon=1 opencl=0 embed_kernels=0 os=linux arch=armv8.2-a build=native multi_isa=1 fixed_format_kernels=1 cppthreads=0 openmp=1 examples=0 validation_tests=0
2. export ACL_ROOT_DIR=/path/to/ComputeLibrary
3. build oneDNN
    cmake .. -DDNNL_AARCH64_USE_ACL=ON -DONEDNN_BUILD_GRAPH=ON -DDNNL_CPU_RUNTIME=OMP -DONEDNN_WERROR=ON -DDNNL_BUILD_FOR_CI=ON -DONEDNN_TEST_SET=NIGHTLY -DCMAKE_BUILD_TYPE=Debug
    make -j 4
  • Run test:
./build/examples/primitives-test-cpp

Some tips

I tried different numbers of OMP threads, and the test passes with a smaller thread count.
I also found that some other shapes (e.g. 384x64 : 64x384) pass the test, so it is likely related to the kernel implementation of the transpose.

@xiang1guo xiang1guo added platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 sighting Suspicious library behavior. Should be promoted to a bug when confirmed labels Feb 12, 2025
@TaoLv TaoLv changed the title Aarc64: matmul acl kernel aborted with Assertion failure AArch64: matmul acl kernel aborted with Assertion failure Feb 12, 2025