Commit 9568209

Use weight cache for quantized tensor scale data
Summary:

When the XNNPACK weight cache is enabled and a model with qb4- or qc8-quantized linear weights is run, an assertion fires that is intended to verify that all data lives in the weight cache. This can be reproduced by running the XNNPACK backend linear op tests with the weight cache enabled.

The root cause appears to be that tensor scale data was bypassing the weight cache, likely an oversight in the initial implementation. This is not a correctness issue, but it trips the assertion above and uses marginally more memory than necessary.

This PR updates the XNNPACK compileModel call to load scale data through the weight cache instead of placing it in the unpacked_buffers list. With this change, the linear op tests pass with the weight cache enabled.

Differential Revision: D82862629
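For context, below is a minimal, self-contained sketch of the routing change: unpacked data is fetched through the cache, which then owns it, rather than being parked in a separate buffer list. WeightsCacheStub and the entry name are simplified stand-ins for illustration, not the real ExecuTorch XNNWeightsCache API; the actual call is weights_cache->load_unpacked_data(data_name), as shown in the diff further down.

#include <cstdint>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Simplified stand-in for the weight cache: it owns every unpacked
// blob, so a later packing pass can assert that any pointer it sees
// is cache-backed (the assertion the commit message refers to).
struct WeightsCacheStub {
  std::map<std::string, std::vector<uint8_t>> entries;

  const uint8_t* load_unpacked_data(const std::string& name) {
    auto it = entries.find(name);
    return it == entries.end() ? nullptr : it->second.data();
  }
};

int main() {
  WeightsCacheStub cache;
  cache.entries["linear.weight.scales"] = {1, 2, 3, 4};

  // Before this commit (conceptually): scale data came from the
  // named-data map and was pushed onto a separate unpacked_buffers
  // list, bypassing the cache. After: the lookup goes through the
  // cache, so the cache's ownership invariant holds for scales too.
  const uint8_t* scale = cache.load_unpacked_data("linear.weight.scales");
  if (scale == nullptr) {
    std::cerr << "Failed to load scales from cache\n";
    return 1;
  }
  std::cout << "first scale byte: " << static_cast<int>(scale[0]) << "\n";
  return 0;
}

The design point is the ownership invariant: once every unpacked blob is loaded through the cache, the all-data-resident assertion can hold, and no extra staging copy is kept alive, which accounts for the marginal memory saving mentioned above.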
1 parent 07d1092

1 file changed: backends/xnnpack/runtime/XNNCompiler.cpp (+20 −0)
@@ -440,6 +440,15 @@ Error defineTensor(
         qparams->scale_buffer_idx());
     const std::string& data_name =
         scale_buffer_offset->named_key()->str();
+#ifdef ENABLE_XNNPACK_WEIGHTS_CACHE
+    auto load_result = weights_cache->load_unpacked_data(data_name);
+    ET_CHECK_OR_RETURN_ERROR(
+        load_result.ok(),
+        Internal,
+        "Failed to load block scales from cache: %u.",
+        load_result.error());
+    scale = reinterpret_cast<const float*>(load_result.get());
+#else // ENABLE_XNNPACK_WEIGHTS_CACHE disabled
     Result<FreeableBuffer> scale_buffer =
         named_data_map->get_data(data_name.c_str());
     ET_CHECK_OR_RETURN_ERROR(
@@ -450,6 +459,7 @@ Error defineTensor(
         static_cast<uint32_t>(scale_buffer.error()));
     scale = reinterpret_cast<const float*>(scale_buffer.get().data());
     freeable_buffers.push_back(std::move(scale_buffer.get()));
+#endif
   }
   status = xnn_define_channelwise_quantized_tensor_value_v2(
       /*subgraph=*/subgraph_ptr,
@@ -488,6 +498,15 @@ Error defineTensor(
         qparams->scale_buffer_idx());
     const std::string& data_name =
         scale_buffer_offset->named_key()->str();
+#ifdef ENABLE_XNNPACK_WEIGHTS_CACHE
+    auto load_result = weights_cache->load_unpacked_data(data_name);
+    ET_CHECK_OR_RETURN_ERROR(
+        load_result.ok(),
+        Internal,
+        "Failed to load tensor scales from cache: %u.",
+        load_result.error());
+    scale_data = reinterpret_cast<const uint16_t*>(load_result.get());
+#else // ENABLE_XNNPACK_WEIGHTS_CACHE disabled
     Result<FreeableBuffer> scale_buffer =
         named_data_map->get_data(data_name.c_str());
     ET_CHECK_OR_RETURN_ERROR(
@@ -499,6 +518,7 @@ Error defineTensor(
     scale_data =
         reinterpret_cast<const uint16_t*>(scale_buffer.get().data());
     freeable_buffers.push_back(std::move(scale_buffer.get()));
+#endif
     scale_numel = qparams->num_scales();
   } else {
     // Read fp32 scales, convert to bf16.
