
feat: perf opt part4 #43


Merged
merged 50 commits into from
May 27, 2025

Conversation


@chraac chraac commented May 26, 2025

Related to #34
Closes #32

Passed test-backend-ops on Snapdragon 8 Gen 2 (8gen2):

  5547/5547 tests passed
  Backend hexagon-npu: OK

Backend 2/2: CPU
  Skipping
2/2 backends passed
OK

Full log:
test-backend-ops_all.debug.hexagon.8409dd1e9.log
test-backend-ops_all.debug.hexagon.8409dd1e9_logcat

Performance

The following table compares execution time across precision formats for a 16384×16384 matrix multiplied by a 16384×1 vector on the tested device:

| device | commit | type | src0_dim | src1_dim | cpu_time (us) | host_total (us) | host_param_update (us) | device_total (us) | device_dequant (us) | device_compute (us) |
|--------|--------|------|----------|----------|---------------|-----------------|------------------------|-------------------|---------------------|---------------------|
| 8gen2 | 8409dd1 | F32 | 16384x16384 | 16384x1 | 38935 | 285529 | 296 | 89518 | 0 | 89518 |
| 8gen2 | 8409dd1 | Q8_0 | 16384x16384 | 16384x1 | 8930 | 327385 | 774 | 255894 | 245178 | 10716 |
| 8gen2 | 8409dd1 | Q4_0 | 16384x16384 | 16384x1 | 12503 | 143390 | 735 | 96932 | 86927 | 10005 |

For more detail, see the Hexagon NPU FastRPC Backend Overview.

@chraac chraac requested a review from Copilot May 26, 2025 14:12
@chraac chraac self-assigned this May 26, 2025
@chraac chraac added the enhancement New feature or request label May 26, 2025
@chraac chraac linked an issue May 26, 2025 that may be closed by this pull request

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds batched parameter updates for NPU tensors, integrates performance tracking and profiling, expands quantized and unary operation support, and refactors threading and memory utilities.

  • Introduce npu_device_tensor_update_config and update_params APIs for bulk tensor updates
  • Enhance host_graph and graph to use new update APIs and add scoped performance trackers
  • Extend device-side ops with RMS normalization, refined quantization/dequant functions, thread-barrier support, and VTCM quota handling

Reviewed Changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
ggml/src/ggml-qnn/npu/host/tensor.hpp Add update_params & update_hosts_params_only, static asserts
ggml/src/ggml-qnn/npu/host/host_device.cpp Switch debug logs to use ggml_op_desc
ggml/src/ggml-qnn/npu/host/graph.hpp Add tensor update config vector
ggml/src/ggml-qnn/npu/host/graph.cpp Populate and pass update configs, add performance tracker
ggml/src/ggml-qnn/npu/host/buffer.cpp Include buffer size in mmap error log
ggml/src/ggml-qnn/npu/device/vtcm_mem.hpp Strengthen is_valid check
ggml/src/ggml-qnn/npu/device/util.hpp Add power utilities, include <HAP_power.h>
ggml/src/ggml-qnn/npu/device/thread_pool.hpp Double default stack size, rename max-thread constant
ggml/src/ggml-qnn/npu/device/tensor.hpp Introduce kMaxParamsCount, add update_config, param getters
ggml/src/ggml-qnn/npu/device/quants.hpp Use dequantized_element_type alias, make input const
ggml/src/ggml-qnn/npu/device/quants.cpp Refactor quant dequant loops, add generic HVX loads
ggml/src/ggml-qnn/npu/device/op_types.hpp Expand compute_params, refine get_thread_work_slice, new constants
ggml/src/ggml-qnn/npu/device/op_mul_mat.hpp Remove duplicate constants, add reduction helpers
ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp Refine dot-product loops, dependency cleanup TODO
ggml/src/ggml-qnn/npu/device/op_impl.hpp Declare requires_thread_barrier
ggml/src/ggml-qnn/npu/device/op_impl.cpp Add unary RMS-NORM op support, refine template logic
ggml/src/ggml-qnn/npu/device/graph.hpp Store VTCM quota size
ggml/src/ggml-qnn/npu/device/graph.cpp Pass VTCM quota to compute, add scoped trackers, thread sync
ggml/src/ggml-qnn/npu/device/device.cpp Log device open/close, validate support_op args, new graph API
ggml/src/ggml-qnn/npu/CMakeLists.txt Link qprintf_static, add subdirectory include

@chraac chraac requested a review from Copilot May 26, 2025 16:27

@Copilot Copilot AI left a comment


Pull Request Overview

This PR introduces performance improvements and refactoring in the Hexagon NPU backend for GGML. Key changes include refactored tensor update and source handling functions, enhancements in thread pool stack size and synchronization, and updates to quantized matrix multiplication and unary operations.

Reviewed Changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
ggml/src/ggml-qnn/npu/host/tensor.hpp Added static_asserts and refactored update functions for tensor parameters.
ggml/src/ggml-qnn/npu/device/thread_pool.hpp Increased default thread stack size and renamed thread count constants.
ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp Updated VTCM cache fallback handling and adjusted plane slicing for quantized computations.
ggml/src/ggml-qnn/npu/device/graph.cpp Added VTCM quota initialization and performance tracking in graph compute.
ggml/src/ggml-qnn/npu/device/op_impl.cpp Extended support for the new RMS_NORM op and introduced a new thread barrier requirement flag.
ggml/src/ggml-qnn/npu/device/device.cpp & CMakeLists.txt Various log and dependency updates.
Comments suppressed due to low confidence (2)

ggml/src/ggml-qnn/npu/device/thread_pool.hpp:15

  • [nitpick] Verify that increasing the default thread stack size to 32KB is justified by the workload and does not lead to unnecessary memory overhead on constrained systems.
constexpr const size_t kDefaultStackSize = 1024 * 32;  // 32KB

ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp:221

  • Ensure that clamping 'src0_plane_slice_row_count' based on vtcm_quota_size is intentional and that it does not inadvertently truncate necessary rows for correct matrix multiplication.
src0_plane_slice_row_count = std::min(params->vtcm_quota_size / src0_actual_row_size, src0_plane_slice_row_count);
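The clamp quoted above keeps each cached slice of src0 within the thread's VTCM budget. A minimal sketch of that logic, with illustrative (not PR-sourced) quota and row-size numbers:

```cpp
#include <algorithm>
#include <cstddef>

// Sketch of the quoted clamp: the number of src0 rows cached per VTCM slice
// is limited so the slice fits within vtcm_quota_size. Fewer rows per slice
// means more slices (and more dequant passes), but never an over-quota
// allocation; correctness is preserved because the column loop advances by
// the clamped count.
size_t clamp_slice_rows(size_t vtcm_quota_size, size_t src0_actual_row_size,
                        size_t requested_rows) {
    return std::min(vtcm_quota_size / src0_actual_row_size, requested_rows);
}
```

For example, with a hypothetical 256 KB quota and 4 KB rows, a request for 128 rows is clamped to 64, while a request for 32 rows passes through unchanged.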

@@ -19,11 +21,13 @@ class host_tensor {

explicit host_tensor(ggml_tensor * tensor, int buffer_fd, uint64_t offset, remote_handle64 device_handle) :
_device_handle(device_handle) {
static_assert(sizeof(npu_device_tensor_config) < 100, "npu_device_tensor_config size too large");

Copilot AI May 26, 2025


Consider replacing the magic number '100' with a named constant or macro for improved clarity and maintainability.

Suggested change
static_assert(sizeof(npu_device_tensor_config) < 100, "npu_device_tensor_config size too large");
static_assert(sizeof(npu_device_tensor_config) < NPU_DEVICE_TENSOR_CONFIG_MAX_SIZE, "npu_device_tensor_config size too large");



bool is_unary_op_supported(const npu_device_tensor_spec & src0, const npu_device_tensor_spec & src1,
const npu_device_tensor_spec & dst, npu_device_tensor_op op) {
if (op != NPU_OP_RMS_NORM) {

Copilot AI May 26, 2025


[nitpick] Since the RMS_NORM operation currently supports only F32 data type, consider adding explicit documentation or comments to clarify the data type restriction for future maintainers.

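For reference, RMS normalization (the op this comment refers to) scales each row by the reciprocal root-mean-square of its elements. A scalar F32 reference, independent of the PR's HVX implementation (the assumption here is only that the NPU path matches ggml's CPU reference semantics; `eps` is an op parameter):

```cpp
#include <cmath>
#include <cstddef>

// Scalar reference for RMS_NORM on F32 data:
//   dst[i] = src[i] / sqrt(mean(src^2) + eps)
// eps guards against division by zero for all-zero rows.
void rms_norm_f32(const float *src, float *dst, size_t n, float eps) {
    float sum_sq = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum_sq += src[i] * src[i];
    }
    const float scale = 1.0f / std::sqrt(sum_sq / (float)n + eps);
    for (size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * scale;
    }
}
```

The mean-of-squares reduction is also why this op benefits from a thread barrier when split across workers: partial sums must be combined before any thread can apply the scale.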

@chraac chraac requested a review from Copilot May 27, 2025 13:23

@Copilot Copilot AI left a comment


Pull Request Overview

This PR implements performance optimizations and code refactoring for the Hexagon NPU backend. Key changes include:

  • Refactoring tensor parameter updates (e.g. renaming set_op to update_config and consolidating source setting in host_tensor).
  • Enhancements to thread pool, VTCM cache handling, and logging for improved debugging and performance.
  • Updates and new support for RMS_NORM and quantized matrix multiplication operations.

Reviewed Changes

Copilot reviewed 32 out of 32 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
ggml/src/ggml-qnn/npu/host/tensor.hpp Added type traits include and refactored tensor parameter update logic with static_asserts.
ggml/src/ggml-qnn/npu/host/host_device.cpp Updated logging to use descriptive op information.
ggml/src/ggml-qnn/npu/host/graph*.{hpp,cpp} Introduced tensor update configurations and performance tracker scopes during update.
ggml/src/ggml-qnn/npu/host/buffer.cpp Enhanced error logging with additional context data.
ggml/src/ggml-qnn/npu/device/vtcm_mem.hpp Modified is_valid() to include a size check for improved validity checking.
ggml/src/ggml-qnn/npu/device/util.hpp Introduced a new power_utils helper class using updated power API functions.
ggml/src/ggml-qnn/npu/device/thread_pool.hpp Increased default stack size and renamed constants for better clarity in thread pool usage.
ggml/src/ggml-qnn/npu/device/tensor.hpp Updated API for configuring tensor source handles and parameter queries.
ggml/src/ggml-qnn/npu/device/quants*.{hpp,cpp} Updated dequantization routines with new HVX intrinsics and improved block loading helpers.
ggml/src/ggml-qnn/npu/device/op_types.hpp Revised thread work slicing and introduced additional constants for vector alignment.
ggml/src/ggml-qnn/npu/device/op_mul_mat*.{hpp,cpp} Refactored matrix multiplication implementation with VTCM caching improvements and logging.
ggml/src/ggml-qnn/npu/device/op_impl*.{hpp,cpp} Added unary op support (RMS_NORM), thread barrier requirement flag, and shape validation.
ggml/src/ggml-qnn/npu/device/graph*.{hpp,cpp} Propagated VTCM quota size setup and integrated performance tracking during graph compute.
ggml/src/ggml-qnn/npu/device/device.cpp Updated device API functions, including improved error checking and tensor configuration updates.
ggml/src/ggml-qnn/npu/CMakeLists.txt Added dependency on qprintf library and updated target linkage.
Comments suppressed due to low confidence (1)

ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp:247

  • The loop uses 'src0_plane_slice_row_count' as the increment for the column index. It would be helpful to add a comment clarifying the intended relationship between the slice row count and the number of elements processed per iteration to ensure clarity for future maintainers.
for (int64_t col_idx = start_end_element.first; col_idx < start_end_element.second; col_idx += src0_plane_slice_row_count) {

@@ -260,6 +432,14 @@ compute_func_type get_compute_func(tensor * dst) {
return get_compute_func_impl(dst->get_op(), dst->get_type());
}

bool requires_thread_barrier(npu_device_tensor_op op) {

Copilot AI May 27, 2025


Since a new 'requires_thread_barrier' flag has been introduced in the op_capabilities table, consider adding documentation or inline comments explaining the rationale for when a thread barrier is required for an op to aid future code reviews and debugging.

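The per-op flag the comment asks to document can be sketched like this. The names `requires_thread_barrier` and `op_capabilities` come from the PR; the enum values and which ops carry the flag are assumptions for illustration. The rationale: ops that reduce across the work split between threads (matmul row reductions, RMS_NORM's mean of squares) must synchronize workers before results are combined, while purely element-wise ops need no barrier.

```cpp
// Hypothetical op ids; the real backend's enum is larger.
enum npu_device_tensor_op { NPU_OP_ADD, NPU_OP_MUL_MAT, NPU_OP_RMS_NORM, NPU_OP_COUNT };

// Hypothetical capability entry, indexed by op. Flag values are assumptions:
// reduction-style ops get a barrier, element-wise ops do not.
struct op_capability {
    npu_device_tensor_op op;
    bool requires_thread_barrier;
};

constexpr op_capability kOpCapabilities[NPU_OP_COUNT] = {
    { NPU_OP_ADD,      false },  // element-wise: threads never share results
    { NPU_OP_MUL_MAT,  true  },  // partial dot products combined across threads
    { NPU_OP_RMS_NORM, true  },  // row statistics reduced before scaling
};

bool requires_thread_barrier(npu_device_tensor_op op) {
    return kOpCapabilities[op].requires_thread_barrier;
}
```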

@chraac chraac merged commit c23ab46 into dev-refactoring May 27, 2025
@github-project-automation github-project-automation bot moved this from In progress to Done in qnn backend May 27, 2025
Labels
enhancement New feature or request
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Bug]Fix GGML_OP_RMS_NORM
1 participant