AArch64: matmul acl kernel aborted with Assertion failure #2674

Open
xiang1guo opened this issue Feb 12, 2025 · 0 comments
Labels
platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 sighting Suspicious library behavior. Should be promoted to a bug when confirmed

xiang1guo commented Feb 12, 2025

Background

Recently, the graph team started investigating the skipped (failing) graph test cases in #2168. For the SDPA/MHA/MQA-related cases, I can finally reproduce the issue with the primitive matmul API. Let's restore these test cases based on this issue.

cpu-graph-gqa-cpp
cpu-graph-mqa-cpp
cpu-graph-sdpa-cpp
cpu-graph-sdpa-stacked-qkv-cpp
test_graph_unit_dnnl_large_partition_usm_cpu(SDPA related case)
test_graph_unit_dnnl_sdp_decomp_usm_cpu

Summary

I wrote an example with the primitive API. The case fails in the acl_matmul kernel; it passes after removing the kernel here. The failure log is as follows:

ubuntu@ip-172-31-44-162:~/oneDNN/build$ ./examples/primitives-matmul-cpp 
fill data ...
primitive creation ...
primitive execution ...
primitives-matmul-cpp: src/core/NEON/kernels/arm_gemm/gemm_hybrid_indirect.hpp:453: void arm_gemm::GemmHybridIndirect<strategy, To, Tw, Tr, OutputStage, SeparateQuantize, FixedFormat>::execute_stateless(const ndcoord_t&, const ndcoord_t&, int, arm_gemm::GemmArrays<To, Tw, Tr>&) [with strategy = arm_gemm::cls_sve_hybrid_fp32_mla_6x4VL; To = float; Tw = float; Tr = float; OutputStage = arm_gemm::Nothing; bool SeparateQuantize = false; bool FixedFormat = false; arm_gemm::ndcoord_t = arm_gemm::NDCoordinate<6>]: Assertion `FixedFormat || _B_transposed' failed.
Aborted (core dumped)

Version

Latest oneDNN main branch; the commit I used: af1410c

Environment

  • system: Linux 22.04.1-Ubuntu SMP aarch64 aarch64 aarch64 GNU/Linux
  • gcc: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
  • cmake: 3.22.1

Steps to reproduce

  • Code change
    Add the following example in /path-to-oneDNN-repo/oneDNN/examples/primitives/test.cpp
/*******************************************************************************
* Copyright 2020-2023 Intel Corporation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
*     http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*******************************************************************************/
#include <algorithm>
#include <chrono>
#include <cmath>
#include <iostream>
#include <random>
#include <string>
#include <vector>
#include <omp.h>
#include "example_utils.hpp"
#include "oneapi/dnnl/dnnl.hpp"
using namespace dnnl;
using tag = memory::format_tag;
using dt = memory::data_type;
using md = memory::desc;
// number of omp threads.
#define NTHREADS 4
// number of loop for perf measurement.
#define NLOOP 10

void fill_data(memory &mem) {
    float *handle = static_cast<float *>(mem.get_data_handle());
    // get_size() returns bytes; divide by sizeof(float) for the element count.
    size_t size = mem.get_desc().get_size() / sizeof(float);

    if (handle == nullptr || size == 0) {
        return;
    }

#pragma omp parallel num_threads(NTHREADS)
    {
        // Per-thread RNG: sharing one mt19937 across threads is a data race.
        std::mt19937 mt(static_cast<unsigned>(time(0)) + omp_get_thread_num());
        std::uniform_real_distribution<float> dist(-100.0f, 100.0f);
#pragma omp for
        for (size_t i = 0; i < size; i++) {
            handle[i] = dist(mt);
        }
    }
}

static const memory::dim MBO = NTHREADS, MBI = 16, M1 = 384, K1 = 384, N1 = 64;
void matmul_decomp(dnnl::engine::kind engine_kind) {
    auto datatype = dt::f32;
    memory::dims src1_dims = {MBO, MBI, M1, K1};
    memory::dims wei1_dims = {MBO, MBI, K1, N1};
    memory::dims dst1_dims = {MBO, MBI, M1, N1};

    auto src1_md = memory::desc(src1_dims, datatype, tag::abcd);
    auto wei1_md = memory::desc(wei1_dims, datatype, tag::abdc);
    auto dst1_md = memory::desc(dst1_dims, datatype, tag::abcd);

    dnnl::engine eng(engine::kind::cpu, 0);
    auto src1 = memory(src1_md, eng);
    auto wei1 = memory(wei1_md, eng);
    auto dst1 = memory(dst1_md, eng);

    std::cout << "fill data ..." << std::endl;
    fill_data(src1);
    fill_data(wei1);
    
    std::cout << "primitive creation ..." << std::endl;
    memory::dims sub_src1_dims = {1, 1, M1, K1};
    memory::dims sub_wei1_dims = {1, 1, K1, N1};
    memory::dims sub_dst1_dims = {1, 1, M1, N1};

    auto sub_src1_md = memory::desc(sub_src1_dims, datatype, tag::abcd);
    auto sub_wei1_md = memory::desc(sub_wei1_dims, datatype, tag::abdc);
    auto sub_dst1_md = memory::desc(sub_dst1_dims, datatype, tag::abcd);
    // must use user mode to support concurrent execution
    primitive_attr attr;
    attr.set_scratchpad_mode(scratchpad_mode::user);

    auto sub_mm1_pd = matmul::primitive_desc(
            eng, sub_src1_md, sub_wei1_md, sub_dst1_md, attr);
    auto sub_mm1_prim = matmul(sub_mm1_pd);

    std::vector<memory> sub_src1_vec(NTHREADS);   
    std::vector<memory> sub_wei1_vec(NTHREADS);
    std::vector<memory> sub_dst1_vec(NTHREADS);
    std::vector<memory> sub_scratchpad_vec(NTHREADS);
    
    memory::desc max_scratchpad = sub_mm1_pd.scratchpad_desc();

    for (size_t id = 0; id < NTHREADS; ++id) {
        sub_src1_vec[id] = memory(sub_src1_md, eng);
        sub_dst1_vec[id] = memory(sub_dst1_md, eng);
        sub_wei1_vec[id] = memory(sub_wei1_md, eng);
        sub_scratchpad_vec[id] = memory(max_scratchpad, eng);
    }

    const auto loop = [&](size_t bo, size_t bi) {
        const int tid = omp_get_thread_num();
        auto &sub_src1 = sub_src1_vec[tid];
        auto &sub_wei1 = sub_wei1_vec[tid];
        auto &sub_dst1 = sub_dst1_vec[tid];
        auto &sub_scratchpad = sub_scratchpad_vec[tid];
        // every thread has its own stream.
        dnnl::stream strm(eng);
        // Offsets below are element counts; scale by sizeof(float) since the
        // base pointers are advanced as char*.
        const size_t sub_src1_offset = M1 * K1 * (bi + MBI * bo) * sizeof(float);
        const size_t sub_wei1_offset = K1 * N1 * (bi + MBI * bo) * sizeof(float);
        const size_t sub_dst1_offset = M1 * N1 * (bi + MBI * bo) * sizeof(float);
        sub_src1.set_data_handle(
                static_cast<char *>(src1.get_data_handle()) + sub_src1_offset);
        sub_wei1.set_data_handle(
                static_cast<char *>(wei1.get_data_handle()) + sub_wei1_offset);
        sub_dst1.set_data_handle(
                static_cast<char *>(dst1.get_data_handle()) + sub_dst1_offset);
        sub_mm1_prim.execute(strm,
                {{DNNL_ARG_SRC, sub_src1}, {DNNL_ARG_WEIGHTS, sub_wei1},
                        {DNNL_ARG_DST, sub_dst1},
                        {DNNL_ARG_SCRATCHPAD, sub_scratchpad}});

    };

    std::cout << "primitive execution ..." << std::endl;

    for (int time = 0; time < NLOOP; ++time) {
#pragma omp parallel for collapse(2) num_threads(NTHREADS)
        for (size_t bo = 0; bo < MBO; ++bo) {
            for (size_t bi = 0; bi < MBI; ++bi) {
                loop(bo, bi);
            }
        }
    }
}

int main(int argc, char **argv) {
    auto exit_code
            = handle_example_errors(matmul_decomp, parse_engine_kind(argc, argv));
    return exit_code;
}
  • Build library:
1. setup ACL library
    git clone --branch v24.11.1 --depth 1 https://github.com/ARM-software/ComputeLibrary.git 
    git checkout 1f3bf6bbc4a1a57b5915fc0a19b195ae53acc06d
    scons -j4 Werror=0 debug=0 neon=1 opencl=0 embed_kernels=0 os=linux arch=armv8.2-a build=native multi_isa=1 fixed_format_kernels=1 cppthreads=0 openmp=1 examples=0 validation_tests=0
2. export ACL_ROOT_DIR=/path/to/ComputeLibrary
3. build oneDNN
    cmake .. -DDNNL_AARCH64_USE_ACL=ON -DONEDNN_BUILD_GRAPH=ON -DDNNL_CPU_RUNTIME=OMP -DONEDNN_WERROR=ON -DDNNL_BUILD_FOR_CI=ON -DONEDNN_TEST_SET=NIGHTLY -DCMAKE_BUILD_TYPE=Debug
    make -j 4
  • Run test:
./build/examples/primitives-test-cpp

Some tips

I tried different numbers of OMP threads, and the test passes with a smaller thread count.
I also found that some other shapes (e.g. 384x64 : 64x384) pass the test, so it is likely related to the kernel implementation of the transpose.

@xiang1guo xiang1guo added platform:cpu-aarch64 Codeowner: @oneapi-src/onednn-cpu-aarch64 sighting Suspicious library behavior. Should be promoted to a bug when confirmed labels Feb 12, 2025
@TaoLv TaoLv changed the title Aarc64: matmul acl kernel aborted with Assertion failure AArch64: matmul acl kernel aborted with Assertion failure Feb 12, 2025