feat: perf opt part4 #43
Conversation
Pull Request Overview
This PR adds batched parameter updates for NPU tensors, integrates performance tracking and profiling, expands quantized and unary operation support, and refactors threading and memory utilities.
- Introduce npu_device_tensor_update_config and update_params APIs for bulk tensor updates
- Enhance host_graph and graph to use the new update APIs and add scoped performance trackers
- Extend device-side ops with RMS normalization, refined quantization/dequantization functions, thread-barrier support, and VTCM quota handling
Reviewed Changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated 5 comments.
Show a summary per file
File | Description |
---|---|
ggml/src/ggml-qnn/npu/host/tensor.hpp | Add update_params & update_hosts_params_only, static asserts |
ggml/src/ggml-qnn/npu/host/host_device.cpp | Switch debug logs to use ggml_op_desc |
ggml/src/ggml-qnn/npu/host/graph.hpp | Add tensor update config vector |
ggml/src/ggml-qnn/npu/host/graph.cpp | Populate and pass update configs, add performance tracker |
ggml/src/ggml-qnn/npu/host/buffer.cpp | Include buffer size in mmap error log |
ggml/src/ggml-qnn/npu/device/vtcm_mem.hpp | Strengthen is_valid check |
ggml/src/ggml-qnn/npu/device/util.hpp | Add power utilities, include <HAP_power.h> |
ggml/src/ggml-qnn/npu/device/thread_pool.hpp | Double default stack size, rename max-thread constant |
ggml/src/ggml-qnn/npu/device/tensor.hpp | Introduce kMaxParamsCount , add update_config , param getters |
ggml/src/ggml-qnn/npu/device/quants.hpp | Use dequantized_element_type alias, make input const |
ggml/src/ggml-qnn/npu/device/quants.cpp | Refactor quant dequant loops, add generic HVX loads |
ggml/src/ggml-qnn/npu/device/op_types.hpp | Expand compute_params , refine get_thread_work_slice , new constants |
ggml/src/ggml-qnn/npu/device/op_mul_mat.hpp | Remove duplicate constants, add reduction helpers |
ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp | Refine dot-product loops, dependency cleanup TODO |
ggml/src/ggml-qnn/npu/device/op_impl.hpp | Declare requires_thread_barrier |
ggml/src/ggml-qnn/npu/device/op_impl.cpp | Add unary RMS-NORM op support, refine template logic |
ggml/src/ggml-qnn/npu/device/graph.hpp | Store VTCM quota size |
ggml/src/ggml-qnn/npu/device/graph.cpp | Pass VTCM quota to compute, add scoped trackers, thread sync |
ggml/src/ggml-qnn/npu/device/device.cpp | Log device open/close, validate support_op args, new graph API |
ggml/src/ggml-qnn/npu/CMakeLists.txt | Link qprintf_static , add subdirectory include |
Pull Request Overview
This PR introduces performance improvements and refactoring in the Hexagon NPU backend for GGML. Key changes include refactored tensor update and source handling functions, enhancements in thread pool stack size and synchronization, and updates to quantized matrix multiplication and unary operations.
Reviewed Changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated 3 comments.
Show a summary per file
File | Description |
---|---|
ggml/src/ggml-qnn/npu/host/tensor.hpp | Added static_asserts and refactored update functions for tensor parameters. |
ggml/src/ggml-qnn/npu/device/thread_pool.hpp | Increased default thread stack size and renamed thread count constants. |
ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp | Updated VTCM cache fallback handling and adjusted plane slicing for quantized computations. |
ggml/src/ggml-qnn/npu/device/graph.cpp | Added VTCM quota initialization and performance tracking in graph compute. |
ggml/src/ggml-qnn/npu/device/op_impl.cpp | Extended support for the new RMS_NORM op and introduced a new thread barrier requirement flag. |
ggml/src/ggml-qnn/npu/device/device.cpp & CMakeLists.txt | Various log and dependency updates. |
Comments suppressed due to low confidence (2)
ggml/src/ggml-qnn/npu/device/thread_pool.hpp:15
- [nitpick] Verify that increasing the default thread stack size to 32KB is justified by the workload and does not lead to unnecessary memory overhead on constrained systems.
constexpr const size_t kDefaultStackSize = 1024 * 32; // 32KB
ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp:221
- Ensure that clamping 'src0_plane_slice_row_count' based on vtcm_quota_size is intentional and that it does not inadvertently truncate necessary rows for correct matrix multiplication.
src0_plane_slice_row_count = std::min(params->vtcm_quota_size / src0_actual_row_size, src0_plane_slice_row_count);
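The clamp queried above can be illustrated in isolation. This is a minimal sketch with made-up numbers, assuming the intent is that one cached plane slice must fit inside the per-thread VTCM quota; the function name is hypothetical:

```cpp
#include <algorithm>
#include <cstddef>

// Cap the number of src0 rows cached per plane slice so that one slice
// (rows * bytes-per-row) never exceeds the thread's VTCM quota.
size_t clamp_slice_rows(size_t vtcm_quota_size, size_t src0_actual_row_size,
                        size_t requested_rows) {
    size_t rows_that_fit = vtcm_quota_size / src0_actual_row_size;
    return std::min(rows_that_fit, requested_rows);
}
```

Under this reading the clamp never drops rows from the computation: it only shrinks the slice granularity, and the outer loop makes additional passes to cover the remaining rows.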
@@ -19,11 +21,13 @@ class host_tensor {

    explicit host_tensor(ggml_tensor * tensor, int buffer_fd, uint64_t offset, remote_handle64 device_handle) :
        _device_handle(device_handle) {
        static_assert(sizeof(npu_device_tensor_config) < 100, "npu_device_tensor_config size too large");
Consider replacing the magic number '100' with a named constant or macro for improved clarity and maintainability.
static_assert(sizeof(npu_device_tensor_config) < 100, "npu_device_tensor_config size too large");
Suggested change:
static_assert(sizeof(npu_device_tensor_config) < NPU_DEVICE_TENSOR_CONFIG_MAX_SIZE, "npu_device_tensor_config size too large");
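A self-contained sketch of what the suggestion amounts to, using a stand-in struct and a hypothetical constant name (the real struct and bound live in the backend headers):

```cpp
#include <cstddef>
#include <cstdint>

// Named bound instead of a magic '100': the intent ("config must stay small
// enough for cheap RPC marshalling") becomes explicit at the assert site.
constexpr size_t kMaxTensorConfigSize = 100;  // illustrative value

struct npu_device_tensor_config_example {  // stand-in for the real struct
    uint64_t handle;
    int32_t  dims[4];
    int32_t  type;
};

static_assert(sizeof(npu_device_tensor_config_example) < kMaxTensorConfigSize,
              "npu_device_tensor_config size too large");
```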
bool is_unary_op_supported(const npu_device_tensor_spec & src0, const npu_device_tensor_spec & src1,
                           const npu_device_tensor_spec & dst, npu_device_tensor_op op) {
    if (op != NPU_OP_RMS_NORM) {
[nitpick] Since the RMS_NORM operation currently supports only F32 data type, consider adding explicit documentation or comments to clarify the data type restriction for future maintainers.
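A documented type gate of the kind the reviewer asks for might look like this. Everything here is illustrative: the enum values and function name are stand-ins for the backend's `npu_device_*` types, sketching only how the F32-only restriction could be stated explicitly for maintainers:

```cpp
// Stand-in enums for the backend's op and dtype identifiers.
enum npu_op_example { NPU_OP_RMS_NORM_EX, NPU_OP_OTHER_EX };
enum npu_dtype_example { NPU_F32_EX, NPU_F16_EX };

bool is_rms_norm_supported(npu_op_example op, npu_dtype_example src_type,
                           npu_dtype_example dst_type) {
    if (op != NPU_OP_RMS_NORM_EX) {
        return false;
    }
    // NOTE: RMS_NORM currently supports only F32 tensors; reject anything
    // else so the host falls back to another backend's implementation.
    return src_type == NPU_F32_EX && dst_type == NPU_F32_EX;
}
```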
Pull Request Overview
This PR implements performance optimizations and code refactoring for the Hexagon NPU backend. Key changes include:
- Refactoring tensor parameter updates (e.g. renaming set_op to update_config and consolidating source setting in host_tensor).
- Enhancements to thread pool, VTCM cache handling, and logging for improved debugging and performance.
- Updates and new support for RMS_NORM and quantized matrix multiplication operations.
Reviewed Changes
Copilot reviewed 32 out of 32 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
ggml/src/ggml-qnn/npu/host/tensor.hpp | Added type traits include and refactored tensor parameter update logic with static_asserts. |
ggml/src/ggml-qnn/npu/host/host_device.cpp | Updated logging to use descriptive op information. |
ggml/src/ggml-qnn/npu/host/graph*.{hpp,cpp} | Introduced tensor update configurations and performance tracker scopes during update. |
ggml/src/ggml-qnn/npu/host/buffer.cpp | Enhanced error logging with additional context data. |
ggml/src/ggml-qnn/npu/device/vtcm_mem.hpp | Modified is_valid() to include a size check for improved validity checking. |
ggml/src/ggml-qnn/npu/device/util.hpp | Introduced a new power_utils helper class using updated power API functions. |
ggml/src/ggml-qnn/npu/device/thread_pool.hpp | Increased default stack size and renamed constants for better clarity in thread pool usage. |
ggml/src/ggml-qnn/npu/device/tensor.hpp | Updated API for configuring tensor source handles and parameter queries. |
ggml/src/ggml-qnn/npu/device/quants*.{hpp,cpp} | Updated dequantization routines with new HVX intrinsics and improved block loading helpers. |
ggml/src/ggml-qnn/npu/device/op_types.hpp | Revised thread work slicing and introduced additional constants for vector alignment. |
ggml/src/ggml-qnn/npu/device/op_mul_mat*.{hpp,cpp} | Refactored matrix multiplication implementation with VTMC caching improvements and logging. |
ggml/src/ggml-qnn/npu/device/op_impl*.{hpp,cpp} | Added unary op support (RMS_NORM), thread barrier requirement flag, and shape validation. |
ggml/src/ggml-qnn/npu/device/graph*.{hpp,cpp} | Propagated VTMC quota size setup and integrated performance tracking during graph compute. |
ggml/src/ggml-qnn/npu/device/device.cpp | Updated device API functions, including improved error checking and tensor configuration updates. |
ggml/src/ggml-qnn/npu/CMakeLists.txt | Added dependency on qprintf library and updated target linkage. |
Comments suppressed due to low confidence (1)
ggml/src/ggml-qnn/npu/device/op_mul_mat.cpp:247
- The loop uses 'src0_plane_slice_row_count' as the increment for the column index. It would be helpful to add a comment clarifying the intended relationship between the slice row count and the number of elements processed per iteration to ensure clarity for future maintainers.
for (int64_t col_idx = start_end_element.first; col_idx < start_end_element.second; col_idx += src0_plane_slice_row_count) {
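One reading of the stride the reviewer flags: each iteration consumes exactly one VTCM-cached plane slice, so the column index advances by the slice's row count. The helper below is a hypothetical extraction of that pattern (names and the range-collecting shape are illustrative, not the backend's code):

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Enumerate the [begin, end) sub-ranges the loop would visit when stepping
// by slice_rows: one cached slice per iteration, with the final slice
// truncated to the remaining elements.
std::vector<std::pair<int64_t, int64_t>> slice_ranges(int64_t start, int64_t end,
                                                      int64_t slice_rows) {
    std::vector<std::pair<int64_t, int64_t>> ranges;
    for (int64_t col_idx = start; col_idx < end; col_idx += slice_rows) {
        int64_t slice_end = std::min(col_idx + slice_rows, end);
        ranges.emplace_back(col_idx, slice_end);  // one cached slice per iteration
    }
    return ranges;
}
```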
@@ -260,6 +432,14 @@ compute_func_type get_compute_func(tensor * dst) {
    return get_compute_func_impl(dst->get_op(), dst->get_type());
}

bool requires_thread_barrier(npu_device_tensor_op op) {
Since a new 'requires_thread_barrier' flag has been introduced in the op_capabilities table, consider adding documentation or inline comments explaining the rationale for when a thread barrier is required for an op to aid future code reviews and debugging.
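A capabilities table documented the way the reviewer suggests might look like the sketch below. The ops, flag values, and per-op rationales are assumptions for illustration only; the actual table and its entries live in op_impl.cpp:

```cpp
// Illustrative op-capabilities table: the comment on each entry records WHY
// a barrier is (or is not) required, so reviewers need not re-derive it.
enum op_id_example { OP_ADD_EX, OP_MUL_MAT_EX, OP_RMS_NORM_EX2, OP_COUNT_EX };

struct op_capabilities_example {
    bool requires_thread_barrier;
};

constexpr op_capabilities_example kOpCaps[OP_COUNT_EX] = {
    /* OP_ADD      */ { false },  // element-wise; thread slices are independent
    /* OP_MUL_MAT  */ { true  },  // threads share cached src slices in VTCM
    /* OP_RMS_NORM */ { true  },  // row statistics must be complete before reuse
};

bool requires_thread_barrier_ex(op_id_example op) {
    return kOpCaps[op].requires_thread_barrier;
}
```

The general rule such a comment would state: an op needs a barrier when a worker thread may read data another worker is still producing, or when a shared cache is repartitioned between ops.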
Related to #34
Closing #32
Passed test-backend-ops on SD 8 Gen 2. Full log:
test-backend-ops_all.debug.hexagon.8409dd1e9.log
test-backend-ops_all.debug.hexagon.8409dd1e9_logcat
Performance
The following table compares execution time across different precision formats for a large 16384×16384 matrix multiplied by a 16384×1 vector on some devices:
For more detail, see: Hexagon NPU FastRPC Backend Overview