feat: op perf opt #38


Merged: 103 commits into dev-refactoring, Apr 21, 2025

Conversation

@chraac chraac (Owner) commented Apr 15, 2025

Related to #34

Passed test-backend-ops on a Snapdragon 8 Gen 2 (sd 8gen2) device:

unload rpcmem lib successfully
  OPT_STEP_ADAMW(type=f32,ne=[10,5,4,3]): not supported [hexagon-npu] 
  5297/5297 tests passed
  Backend hexagon-npu: OK

Backend 2/5: qnn-npu
  Skipping
Backend 3/5: qnn-gpu
  Skipping
Backend 4/5: qnn-cpu
  Skipping
Backend 5/5: CPU
  Skipping
5/5 backends passed
OK

Full logs:
test-backend-ops_all.debug.hexagon.android.366b8b5.log
test-backend-ops_all.debug.hexagon.android.366b8b5_logcat.log

@chraac chraac requested a review from Copilot April 20, 2025 03:25

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces performance optimizations and new backend implementations for the Hexagon NPU. Key changes include:

  • New host-side support for device initialization, memory buffering, and graph handling.
  • Addition of device-level implementations for tensor operations and operations such as matrix multiplication and element-wise arithmetic.
  • Cleanup of deprecated QNN backend definitions.

Reviewed Changes

Copilot reviewed 57 out of 59 changed files in this pull request and generated 2 comments.

Summary per file:
  • ggml/src/ggml-qnn/npu/host/host.cpp: Implements host backend interfaces and device initialization routines.
  • ggml/src/ggml-qnn/npu/host/graph.{hpp,cpp}: Introduces graph management for tensor operations on the host side.
  • ggml/src/ggml-qnn/npu/host/device.{hpp,cpp}: Establishes a new device interface and support functions for the Hexagon NPU.
  • ggml/src/ggml-qnn/npu/host/buffer.{hpp,cpp}: Adds buffer allocation and tensor initialization using RPC memory (see the allocation sketch below).
  • ggml/src/ggml-qnn/npu/device/*: Provides implementations for tensor operations, op intrinsics, and graph execution.
  • ggml/include/ggml-qnn.h: Removes deprecated QNN-specific definitions.
Files not reviewed (2)
  • ggml/src/ggml-qnn/CMakeLists.txt: Language not supported
  • ggml/src/ggml-qnn/npu/CMakeLists.txt: Language not supported
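The buffer module called out above allocates tensor storage from RPC (FastRPC/ION) memory so that the host and the Hexagon NPU can address the same bytes. As a rough, hypothetical illustration of that allocation path using the Hexagon SDK rpcmem API (the helper function below is invented for this sketch and is not the PR's actual host_buffer code):

```cpp
// Hypothetical sketch: allocate a DSP-shareable buffer via the Hexagon SDK rpcmem API.
// The PR's host_buffer loads the rpcmem library dynamically and wraps the pointer in a
// ggml backend buffer; the helper name and error handling here are illustrative only.
#include <cstddef>
#include <cstdio>
#include "rpcmem.h"  // Hexagon SDK FastRPC shared-memory API

void * alloc_npu_shared_buffer(size_t size) {
    // RPCMEM_HEAP_ID_SYSTEM with default flags returns ION memory that FastRPC can map
    // into the DSP address space, so host writes are visible to the NPU without copies.
    void * data = rpcmem_alloc(RPCMEM_HEAP_ID_SYSTEM, RPCMEM_DEFAULT_FLAGS, (int) size);
    if (data == nullptr) {
        fprintf(stderr, "failed to allocate rpc memory, size: %zu MB\n", size / (1 << 20));
    }
    return data;  // release with rpcmem_free(data) when the buffer is destroyed
}
```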

@chraac chraac requested a review from Copilot April 20, 2025 10:23

@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds new Hexagon NPU support with performance optimizations for operator execution. Key changes include:

  • Implementation of host and device backend interfaces for Hexagon NPU.
  • New tensor, buffer, and graph modules to enable op offloading and improved operator implementations.
  • Introduction of several op implementations, including matrix multiplication and element‐wise operations, while removing outdated qnn backend definitions.

Reviewed Changes

Copilot reviewed 57 out of 59 changed files in this pull request and generated 1 comment.

Summary per file:
  • ggml/src/ggml-qnn/npu/host/host.cpp: Adds host backend functions and device proxy implementation.
  • ggml/src/ggml-qnn/npu/host/graph.{hpp,cpp}: Introduces host graph management for tensor operations.
  • ggml/src/ggml-qnn/npu/host/device.{hpp,cpp}: Implements NPU device initialization, memory, and op support.
  • ggml/src/ggml-qnn/npu/host/buffer.{hpp,cpp}: Provides buffer allocation and tensor initialization support.
  • ggml/src/ggml-qnn/npu/device/*: Defines op implementations, tensor, graph, and device APIs.
  • ggml/include/ggml-qnn.h: Removes legacy QNN backend definitions.
Files not reviewed (2)
  • ggml/src/ggml-qnn/CMakeLists.txt: Language not supported
  • ggml/src/ggml-qnn/npu/CMakeLists.txt: Language not supported

@chraac chraac changed the title from "[WIP] feat: op perf opt" to "feat: op perf opt" on Apr 20, 2025
@chraac chraac requested a review from Copilot April 20, 2025 15:09

@Copilot Copilot AI left a comment

Pull Request Overview

This PR implements performance optimizations for Hexagon NPU operator processing by adding new device, graph, tensor, buffer, and operator implementations. Key changes include:

  • Introduction of new host device and backend classes for NPU initialization and operation offloading.
  • Implementation of host graph management and enhanced memory/buffer handling.
  • New operator implementations (matrix multiplication and element-wise ops) using HVX intrinsics for improved compute performance.
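To make the HVX point concrete, here is a minimal, hypothetical sketch of an element-wise f32 add in the shape such kernels usually take; it assumes 128-byte HVX vectors, aligned inputs, and an element count that is a multiple of the vector width, and it is not the PR's actual kernel:

```cpp
// Minimal HVX sketch (illustrative, not the PR's kernel): element-wise f32 add.
// Assumes src0/src1/dst are 128-byte aligned and count is a multiple of 32 floats.
#include <cstddef>
#include <hexagon_types.h>
#include <hvx_hexagon_protos.h>

static void hvx_add_f32(const float * src0, const float * src1, float * dst, size_t count) {
    constexpr size_t kFloatsPerVec = 128 / sizeof(float);  // one HVX vector holds 32 floats
    const HVX_Vector * v0 = reinterpret_cast<const HVX_Vector *>(src0);
    const HVX_Vector * v1 = reinterpret_cast<const HVX_Vector *>(src1);
    HVX_Vector *       vd = reinterpret_cast<HVX_Vector *>(dst);
    for (size_t i = 0; i < count / kFloatsPerVec; ++i) {
        HVX_Vector sum = Q6_Vqf32_vadd_VsfVsf(v0[i], v1[i]);  // add in the qf32 intermediate format
        vd[i] = Q6_Vsf_equals_Vqf32(sum);                     // convert back to IEEE float for the store
    }
    // A production kernel would also handle unaligned pointers and a remainder tail.
}
```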

Reviewed Changes

Copilot reviewed 57 out of 59 changed files in this pull request and generated 1 comment.

Summary per file:
  • ggml/src/ggml-qnn/npu/host/host_device.hpp: Added npu_device and npu_backend classes for NPU device handling.
  • ggml/src/ggml-qnn/npu/host/host_device.cpp: Implements device initialization, RPC memory setup, and logging.
  • ggml/src/ggml-qnn/npu/host/host.cpp: Defines host interface functions for device context management.
  • ggml/src/ggml-qnn/npu/host/graph.{hpp,cpp}: Introduces the host_graph class for graph creation, update, and compute.
  • ggml/src/ggml-qnn/npu/host/buffer.{hpp,cpp}: Implements host_buffer and host_buffer_type for buffer allocation.
  • ggml/src/ggml-qnn/npu/device/{tensor, op_*, graph, device}.{hpp,cpp}: New operator, tensor, and graph implementations with HVX intrinsics.
  • ggml/include/ggml-qnn.h: Updates the public interface by removing deprecated backend definitions.
Files not reviewed (2)
  • ggml/src/ggml-qnn/CMakeLists.txt: Language not supported
  • ggml/src/ggml-qnn/npu/CMakeLists.txt: Language not supported

@chraac chraac requested a review from Copilot April 21, 2025 03:52

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces performance optimizations for operations on the Hexagon NPU backend. Key changes include:

  • New implementations and refinements for device, tensor, and operation (e.g. mul_mat, element-wise) handling.
  • Enhancements for RPC memory allocation, graph management, and device interfacing.
  • Updates to backend header interfaces and the removal of deprecated files.

Reviewed Changes

Copilot reviewed 57 out of 59 changed files in this pull request and generated no comments.

Summary per file:
  • ggml/src/ggml-qnn/npu/host/host_device.hpp & .cpp: Introduce and initialize the npu_device and backend classes with RPC memory management.
  • ggml/src/ggml-qnn/npu/host/host.cpp: Implement the backend device proxy and host device interfaces.
  • ggml/src/ggml-qnn/npu/host/graph.{hpp,cpp}: Create and update the host graph structure for tensor management.
  • ggml/src/ggml-qnn/npu/host/buffer.{hpp,cpp}: Add host buffer and tensor initialization using RPC memory with proper mapping.
  • ggml/src/ggml-qnn/npu/device/{tensor,op_mul_mat,op_impl,graph,device}.{hpp,cpp}: Provide optimized implementations for tensor operations, matrix multiplication, and graph execution (a scalar reference for mul_mat follows below).
  • ggml/include/ggml-qnn.h: Update the public interface headers to reflect backend changes and remove legacy definitions.
Files not reviewed (2)
  • ggml/src/ggml-qnn/CMakeLists.txt: Language not supported
  • ggml/src/ggml-qnn/npu/CMakeLists.txt: Language not supported
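As orientation for the optimized mul_mat mentioned above, the device kernel has to reproduce ggml's mul_mat semantics; a plain scalar equivalent (illustrative only, ignoring strides, quantized types, and the HVX vectorization the PR actually uses) looks roughly like this:

```cpp
// Scalar reference for ggml-style mul_mat semantics, for orientation only.
// src0 holds m rows of length k and src1 holds n rows of length k (row-major);
// dst[j * m + i] is the dot product of src0 row i with src1 row j.
static void mul_mat_ref_f32(const float * src0, const float * src1, float * dst,
                            int m, int n, int k) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float sum = 0.0f;
            for (int l = 0; l < k; ++l) {
                sum += src0[i * k + l] * src1[j * k + l];
            }
            dst[j * m + i] = sum;
        }
    }
}
```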
Comments suppressed due to low confidence (1)

ggml/src/ggml-qnn/npu/device/op_impl.cpp:136

  • There appears to be a typo in the format specifier for dst.ne[i]: '%l;d' should likely be '%lld' to match the cast. Please update to ensure proper logging.
DEVICE_LOG_DEBUG("src0.ne[%zu] and dst.ne[%zu] not match: %lld vs %l;d\n", i, i, (long long) src0.ne[i], (long long) dst.ne[i]);
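With the suggested specifier applied, the call would read:

```cpp
DEVICE_LOG_DEBUG("src0.ne[%zu] and dst.ne[%zu] not match: %lld vs %lld\n", i, i,
                 (long long) src0.ne[i], (long long) dst.ne[i]);
```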

@chraac chraac requested a review from Copilot April 21, 2025 04:00

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces performance optimizations and new implementations for the Hexagon NPU backend. Key changes include:

  • New host-side device, graph, and buffer implementations to manage NPU resources.
  • Optimized operator implementations (e.g. matrix multiplication) using Hexagon HVX intrinsics.
  • Updates to backend and device interfaces to improve resource handling and logging.

Reviewed Changes

Copilot reviewed 57 out of 59 changed files in this pull request and generated no comments.

Summary per file:
  • ggml/src/ggml-qnn/npu/host/host_device.hpp: New header defining the NPU device interface and associated functions.
  • ggml/src/ggml-qnn/npu/host/host_device.cpp: Implementation of device initialization, support checks, and logging.
  • ggml/src/ggml-qnn/npu/host/host.cpp: New host interface to create device proxies and manage backend context.
  • ggml/src/ggml-qnn/npu/host/graph.{hpp,cpp}: Graph management implementation with caching and update logic.
  • ggml/src/ggml-qnn/npu/host/buffer.{hpp,cpp}: Buffer and tensor allocation via RPC memory handling.
  • ggml/src/ggml-qnn/npu/device/{tensor.hpp,op_mul_mat.{hpp,cpp}}: New device operator implementations using Hexagon intrinsics.
  • ggml/src/ggml-qnn/npu/device/{op_impl.hpp,op_impl.cpp}: Operator implementation and support verification logic.
  • ggml/src/ggml-qnn/npu/device/device.cpp: Device open/close and tensor/graph operations for the NPU backend.
  • ggml/include/ggml-qnn.h: Minor cleanup in the backend header (removed unused code).
Files not reviewed (2)
  • ggml/src/ggml-qnn/CMakeLists.txt: Language not supported
  • ggml/src/ggml-qnn/npu/CMakeLists.txt: Language not supported
Comments suppressed due to low confidence (1)

ggml/src/ggml-qnn/npu/host/buffer.cpp:138

  • [nitpick] Consider using the '%zu' format specifier for size_t instead of casting to int for clarity and consistency in the error message.
LOG_ERROR("failed to allocate rpc memory, size: %d MB\n", (int) (size / (1 << 20)));
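With the suggested '%zu' specifier (assuming size is a size_t), the message would become:

```cpp
LOG_ERROR("failed to allocate rpc memory, size: %zu MB\n", size / (1 << 20));
```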

@chraac chraac merged commit beff5c4 into dev-refactoring Apr 21, 2025
@github-project-automation github-project-automation bot moved this from In progress to Done in qnn backend Apr 21, 2025

Labels: enhancement (New feature or request)
Projects: qnn backend (Status: Done)
1 participant