
feat: perf opt part4 #43


Merged
merged 50 commits into from
May 27, 2025

Conversation


@chraac chraac commented May 26, 2025

Related to #34
Closes #32

Passed test-backend-ops on Snapdragon 8 Gen 2 (8gen2):

  5547/5547 tests passed
  Backend hexagon-npu: OK

Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

Full log:
test-backend-ops_all.debug.hexagon.8409dd1e9.log
test-backend-ops_all.debug.hexagon.8409dd1e9_logcat

Performance

The following table compares execution time across precision formats for a 16384×16384 matrix multiplied by a 16384×1 vector on the tested device:

| device | commit | type | src0_dim | src1_dim | cpu_time (us) | host_total (us) | host_param_update (us) | device_total (us) | device_dequant (us) | device_compute (us) |
|--------|--------|------|----------|----------|---------------|-----------------|------------------------|-------------------|---------------------|---------------------|
| 8gen2 | 8409dd1 | F32 | 16384x16384 | 16384x1 | 38935 | 285529 | 296 | 89518 | 0 | 89518 |
| 8gen2 | 8409dd1 | Q8_0 | 16384x16384 | 16384x1 | 8930 | 327385 | 774 | 255894 | 245178 | 10716 |
| 8gen2 | 8409dd1 | Q4_0 | 16384x16384 | 16384x1 | 12503 | 143390 | 735 | 96932 | 86927 | 10005 |

For more detail, see the Hexagon NPU FastRPC Backend Overview.

@chraac chraac requested a review from Copilot May 26, 2025 14:12
@chraac chraac self-assigned this May 26, 2025
@chraac chraac added the enhancement New feature or request label May 26, 2025
@chraac chraac linked an issue May 26, 2025 that may be closed by this pull request

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds batched parameter updates for NPU tensors, integrates performance tracking and profiling, expands quantized and unary operation support, and refactors threading and memory utilities.

  • Introduce npu_device_tensor_update_config and update_params APIs for bulk tensor updates
  • Enhance host_graph and graph to use new update APIs and add scoped performance trackers
  • Extend device-side ops with RMS normalization, refined quantization/dequant functions, thread-barrier support, and VTCM quota handling

Reviewed Changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
ggml/src/ggml-qnn/npu/host/tensor.hpp Add update_params & update_hosts_params_only, static asserts
ggml/src/ggml-qnn/npu/host/host_device.cpp Switch debug logs to use ggml_op_desc
ggml/src/ggml-qnn/npu/host/graph.hpp Add tensor update config vector
ggml/src/ggml-qnn/npu/host/graph.cpp Populate and pass update configs, add performance tracker
ggml/src/ggml-qnn/npu/host/buffer.cpp Include buffer size in mmap error log
ggml/src/ggml-qnn/npu/device/vtcm_mem.hpp Strengthen is_valid check
ggml/src/ggml-qnn/npu/device/util.hpp Add power utilities, include <HAP_power.h>
ggml/src/ggml-qnn/npu/device/thread_pool.hpp Double default stack size, rename max-thread constant
ggml/src/ggml-qnn/npu/device/tensor.hpp Introduce kMaxParamsCount, add update_config, param getters
ggml/src/ggml-qnn/npu/device/quants.hpp Use dequantized_element_type alias, make input const
ggml/src/ggml-qnn/npu/device/quants.cpp Refactor quant dequant loops, add generic HVX loads
ggml/src/ggml-qnn/npu/device/op_types.hpp Expand compute_params, refine get_thread_work_slice, new constants
ggml/src/ggml-qnn/npu/device/op_mul_mat.hpp Remove duplicate constants, add reduction helpers
ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp Refine dot-product loops, dependency cleanup TODO
ggml/src/ggml-qnn/npu/device/op_impl.hpp Declare requires_thread_barrier
ggml/src/ggml-qnn/npu/device/op_impl.cpp Add unary RMS-NORM op support, refine template logic
ggml/src/ggml-qnn/npu/device/graph.hpp Store VTCM quota size
ggml/src/ggml-qnn/npu/device/graph.cpp Pass VTCM quota to compute, add scoped trackers, thread sync
ggml/src/ggml-qnn/npu/device/device.cpp Log device open/close, validate support_op args, new graph API
ggml/src/ggml-qnn/npu/CMakeLists.txt Link qprintf_static, add subdirectory include

@chraac chraac requested a review from Copilot May 26, 2025 16:27

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces performance improvements and refactoring in the Hexagon NPU backend for GGML. Key changes include refactored tensor update and source handling functions, enhancements in thread pool stack size and synchronization, and updates to quantized matrix multiplication and unary operations.

Reviewed Changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
ggml/src/ggml-qnn/npu/host/tensor.hpp Added static_asserts and refactored update functions for tensor parameters.
ggml/src/ggml-qnn/npu/device/thread_pool.hpp Increased default thread stack size and renamed thread count constants.
ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp Updated VTCM cache fallback handling and adjusted plane slicing for quantized computations.
ggml/src/ggml-qnn/npu/device/graph.cpp Added VTCM quota initialization and performance tracking in graph compute.
ggml/src/ggml-qnn/npu/device/op_impl.cpp Extended support for the new RMS_NORM op and introduced a new thread barrier requirement flag.
ggml/src/ggml-qnn/npu/device/device.cpp & CMakeLists.txt Various log and dependency updates.
Comments suppressed due to low confidence (2)

ggml/src/ggml-qnn/npu/device/thread_pool.hpp:15

  • [nitpick] Verify that increasing the default thread stack size to 32KB is justified by the workload and does not lead to unnecessary memory overhead on constrained systems.
constexpr const size_t kDefaultStackSize = 1024 * 32;  // 32KB

ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp:221

  • Ensure that clamping 'src0_plane_slice_row_count' based on vtcm_quota_size is intentional and that it does not inadvertently truncate necessary rows for correct matrix multiplication.
src0_plane_slice_row_count = std::min(params->vtcm_quota_size / src0_actual_row_size, src0_plane_slice_row_count);
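The clamp quoted above keeps each cached slice of src0 within the thread's VTCM budget. A minimal sketch of that logic, with illustrative (not PR-sourced) quota and row-size numbers:

```cpp
#include <algorithm>
#include <cstddef>

// Sketch of the quoted clamp: the number of src0 rows cached per VTCM slice
// is limited so the slice fits within vtcm_quota_size. Fewer rows per slice
// means more slices (and more dequant passes), but never an over-quota
// allocation; correctness is preserved because the column loop advances by
// the clamped count.
size_t clamp_slice_rows(size_t vtcm_quota_size, size_t src0_actual_row_size,
                        size_t requested_rows) {
    return std::min(vtcm_quota_size / src0_actual_row_size, requested_rows);
}
```

For example, with a hypothetical 256 KB quota and 4 KB rows, a request for 128 rows is clamped to 64, while a request for 32 rows passes through unchanged.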

@@ -19,11 +21,13 @@ class host_tensor {

explicit host_tensor(ggml_tensor * tensor, int buffer_fd, uint64_t offset, remote_handle64 device_handle) :
_device_handle(device_handle) {
static_assert(sizeof(npu_device_tensor_config) < 100, "npu_device_tensor_config size too large");

Copilot AI May 26, 2025


Consider replacing the magic number '100' with a named constant or macro for improved clarity and maintainability.

Suggested change
static_assert(sizeof(npu_device_tensor_config) < 100, "npu_device_tensor_config size too large");
static_assert(sizeof(npu_device_tensor_config) < NPU_DEVICE_TENSOR_CONFIG_MAX_SIZE, "npu_device_tensor_config size too large");



bool is_unary_op_supported(const npu_device_tensor_spec & src0, const npu_device_tensor_spec & src1,
const npu_device_tensor_spec & dst, npu_device_tensor_op op) {
if (op != NPU_OP_RMS_NORM) {

Copilot AI May 26, 2025


[nitpick] Since the RMS_NORM operation currently supports only F32 data type, consider adding explicit documentation or comments to clarify the data type restriction for future maintainers.

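For reference, RMS normalization (the op this comment refers to) scales each row by the reciprocal root-mean-square of its elements. A scalar F32 reference, independent of the PR's HVX implementation (the assumption here is only that the NPU path matches ggml's CPU reference semantics; `eps` is an op parameter):

```cpp
#include <cmath>
#include <cstddef>

// Scalar reference for RMS_NORM on F32 data:
//   dst[i] = src[i] / sqrt(mean(src^2) + eps)
// eps guards against division by zero for all-zero rows.
void rms_norm_f32(const float *src, float *dst, size_t n, float eps) {
    float sum_sq = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum_sq += src[i] * src[i];
    }
    const float scale = 1.0f / std::sqrt(sum_sq / (float)n + eps);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * scale;
    }
}
```

The mean-of-squares reduction is also why this op benefits from a thread barrier when split across workers: partial sums must be combined before any thread can apply the scale.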

@chraac chraac requested a review from Copilot May 27, 2025 13:23

@Copilot Copilot AI left a comment


Pull Request Overview

This PR implements performance optimizations and code refactoring for the Hexagon NPU backend. Key changes include:

  • Refactoring tensor parameter updates (e.g. renaming set_op to update_config and consolidating source setting in host_tensor).
  • Enhancements to thread pool, VTCM cache handling, and logging for improved debugging and performance.
  • Updates and new support for RMS_NORM and quantized matrix multiplication operations.

Reviewed Changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
ggml/src/ggml-qnn/npu/host/tensor.hpp Added type traits include and refactored tensor parameter update logic with static_asserts.
ggml/src/ggml-qnn/npu/host/host_device.cpp Updated logging to use descriptive op information.
ggml/src/ggml-qnn/npu/host/graph*.{hpp,cpp} Introduced tensor update configurations and performance tracker scopes during update.
ggml/src/ggml-qnn/npu/host/buffer.cpp Enhanced error logging with additional context data.
ggml/src/ggml-qnn/npu/device/vtcm_mem.hpp Modified is_valid() to include a size check for improved validity checking.
ggml/src/ggml-qnn/npu/device/util.hpp Introduced a new power_utils helper class using updated power API functions.
ggml/src/ggml-qnn/npu/device/thread_pool.hpp Increased default stack size and renamed constants for better clarity in thread pool usage.
ggml/src/ggml-qnn/npu/device/tensor.hpp Updated API for configuring tensor source handles and parameter queries.
ggml/src/ggml-qnn/npu/device/quants*.{hpp,cpp} Updated dequantization routines with new HVX intrinsics and improved block loading helpers.
ggml/src/ggml-qnn/npu/device/op_types.hpp Revised thread work slicing and introduced additional constants for vector alignment.
ggml/src/ggml-qnn/npu/device/op_mul_mat*.{hpp,cpp} Refactored matrix multiplication implementation with VTCM caching improvements and logging.
ggml/src/ggml-qnn/npu/device/op_impl*.{hpp,cpp} Added unary op support (RMS_NORM), thread barrier requirement flag, and shape validation.
ggml/src/ggml-qnn/npu/device/graph*.{hpp,cpp} Propagated VTCM quota size setup and integrated performance tracking during graph compute.
ggml/src/ggml-qnn/npu/device/device.cpp Updated device API functions, including improved error checking and tensor configuration updates.
ggml/src/ggml-qnn/npu/CMakeLists.txt Added dependency on qprintf library and updated target linkage.
Comments suppressed due to low confidence (1)

ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp:247

  • The loop uses 'src0_plane_slice_row_count' as the increment for the column index. It would be helpful to add a comment clarifying the intended relationship between the slice row count and the number of elements processed per iteration to ensure clarity for future maintainers.
for (int64_t col_idx = start_end_element.first; col_idx < start_end_element.second; col_idx += src0_plane_slice_row_count) {

@@ -260,6 +432,14 @@ compute_func_type get_compute_func(tensor * dst) {
return get_compute_func_impl(dst->get_op(), dst->get_type());
}

bool requires_thread_barrier(npu_device_tensor_op op) {

Copilot AI May 27, 2025


Since a new 'requires_thread_barrier' flag has been introduced in the op_capabilities table, consider adding documentation or inline comments explaining the rationale for when a thread barrier is required for an op to aid future code reviews and debugging.

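The per-op flag the comment asks to document can be sketched like this. The names `requires_thread_barrier` and `op_capabilities` come from the PR; the enum values and which ops carry the flag are assumptions for illustration. The rationale: ops that reduce across the work split between threads (matmul row reductions, RMS_NORM's mean of squares) must synchronize workers before results are combined, while purely element-wise ops need no barrier.

```cpp
// Hypothetical op ids; the real backend's enum is larger.
enum npu_device_tensor_op { NPU_OP_ADD, NPU_OP_MUL_MAT, NPU_OP_RMS_NORM, NPU_OP_COUNT };

// Hypothetical capability entry, indexed by op. Flag values are assumptions:
// reduction-style ops get a barrier, element-wise ops do not.
struct op_capability {
    npu_device_tensor_op op;
    bool requires_thread_barrier;
};

constexpr op_capability kOpCapabilities[NPU_OP_COUNT] = {
    { NPU_OP_ADD,      false },  // element-wise: threads never share results
    { NPU_OP_MUL_MAT,  true  },  // partial dot products combined across threads
    { NPU_OP_RMS_NORM, true  },  // row statistics reduced before scaling
};

bool requires_thread_barrier(npu_device_tensor_op op) {
    return kOpCapabilities[op].requires_thread_barrier;
}
```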

@chraac chraac merged commit c23ab46 into dev-refactoring May 27, 2025
@github-project-automation github-project-automation bot moved this from In progress to Done in qnn backend May 27, 2025
Labels
enhancement New feature or request
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Bug]Fix GGML_OP_RMS_NORM
1 participant