Support string tensors in DeduplicateInitializersPass and call_onnx_api #169

iksnagreb · 2025-08-27T12:56:15Z

Getting the number of bytes in a string tensor needs special treatment: the STRING data type does not define a bitwidth, so the size has to be computed by flattening the strings into a sequence of bytes. See call_onnx_api, which queries .nbytes to temporarily remove large initializers.
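To make the size computation concrete, here is a minimal sketch of counting the bytes of a string tensor by summing the lengths of its elements. The helper name `string_tensor_nbytes` is hypothetical and is not taken from the PR; the actual implementation in onnx_ir may differ.

```python
from typing import Sequence


def string_tensor_nbytes(strings: Sequence[bytes]) -> int:
    """Hypothetical helper: payload size of a flat sequence of byte strings."""
    # STRING elements have no fixed bitwidth, so itemsize * size does not apply.
    # Instead, add up the length of every individual element.
    return sum(len(s) for s in strings)


values = [b"hello", b"", b"onnx"]
total = string_tensor_nbytes(values)
print(total)
```

An empty string contributes zero bytes, so the result only reflects the actual payload, not any padding.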

String initializers need to be converted to bytes via the .string_data method instead of the .tobytes method in DeduplicateInitializersPass.

Also adds mapping of np.bytes_ to STRING in DataType.from_numpy, which was probably just overlooked before, as object and np.str_ (unicode strings) are already handled. This is related to microsoft/onnxscript#2514
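For context, NumPy can hold string data in three kinds of arrays: fixed-width bytes (`np.bytes_`), fixed-width unicode (`np.str_`), and `object`. The sketch below only uses NumPy; the function `is_string_like` is a hypothetical stand-in and not the `DataType.from_numpy` implementation.

```python
import numpy as np


def is_string_like(dtype) -> bool:
    """Hypothetical check: could this NumPy dtype carry ONNX STRING data?"""
    # "S" = fixed-width bytes (np.bytes_), "U" = unicode (np.str_), "O" = object
    return np.dtype(dtype).kind in ("S", "U", "O")


bytes_array = np.array([b"a", b"bc"])                # scalar type is np.bytes_
text_array = np.array(["a", "bc"])                   # scalar type is np.str_
object_array = np.array([b"a", b"bc"], dtype=object)

for array in (bytes_array, text_array, object_array):
    print(array.dtype.kind, is_string_like(array.dtype))
```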

Getting the number of bytes in a string tensor needs special treatment as the STRING data type does not define a bitwidth and the size needs to be computed from flattening the strings into a sequence of bytes. See call_onnx_api querying .nbytes to temporarily remove large initializers. String initializers need to be converted to bytes via the .string_data method instead of the .tobytes method in DeduplicateInitializersPass. Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

codecov · 2025-08-27T12:57:40Z

Codecov Report

❌ Patch coverage is 84.61538% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.72%. Comparing base (ce1d0e6) to head (8ae3925).
⚠️ Report is 5 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines	Patch %	Lines
src/onnx_ir/_enums.py	66.66%	0 Missing and 1 partial ⚠️
...onnx_ir/passes/common/initializer_deduplication.py	85.71%	0 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #169      +/-   ##
==========================================
+ Coverage   76.35%   76.72%   +0.36%     
==========================================
  Files          40       40              
  Lines        4809     4816       +7     
  Branches      952      952              
==========================================
+ Hits         3672     3695      +23     
+ Misses        840      827      -13     
+ Partials      297      294       -3


src/onnx_ir/_core.py

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

src/onnx_ir/_core.py

src/onnx_ir/passes/common/initializer_deduplication.py

justinchuby · 2025-08-27T14:41:14Z

Could you make similar changes to DeduplicateHashedInitializersPass as well if possible? Thanks!

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

justinchuby · 2025-08-27T15:59:54Z

Could you add a test in https://github.com/onnx/ir-py/blob/main/src/onnx_ir/passes/common/initializer_deduplication_test.py?

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

iksnagreb · 2025-08-27T19:22:26Z

I wonder if there is a particular reason to not implement .tobytes for StringTensor? Otherwise the changes to the two deduplicate initializer passes could be unified by simply implementing the .tobytes method without weird workarounds checking for the data type just to do basically .string_data().tobytes() instead?

justinchuby · 2025-08-27T19:25:59Z

The onnx proto def does not support string as bytes value. So I didn’t see a need for the method at the time. Maybe it’s time to revisit this?

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

src/onnx_ir/passes/common/initializer_deduplication_test.py

justinchuby · 2025-08-28T00:05:37Z

@iksnagreb I thought about this more and still have this question: In onnx the strings are spec'ed to be utf8: https://github.com/onnx/onnx/blob/72accdac9bacb61b5d4d871b10a9368b1912cdfd/onnx/onnx.proto#L699-L704 which does not allow null terminators. In theory we shouldn't see the null spacing between each element?

justinchuby · 2025-08-28T00:07:48Z

>>> np.array((b"a", b"b"))
array([b'a', b'b'], dtype='|S1')
>>> np.array((b"a", b"b")).tobytes()
b'ab'

This is what I got
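The null bytes in question only show up when the elements have different lengths. A small NumPy-only sketch of that case, extending the session above:

```python
import numpy as np

same_length = np.array((b"a", b"b"))    # every element is 1 byte wide
mixed_length = np.array((b"a", b"bc"))  # elements are padded to 2 bytes

# With equal lengths there is nothing to pad, so the bytes are contiguous.
same_bytes = same_length.tobytes()

# With mixed lengths NumPy uses a fixed item size and pads shorter
# elements with trailing null bytes.
mixed_bytes = mixed_length.tobytes()

print(same_length.itemsize, same_bytes)
print(mixed_length.itemsize, mixed_bytes)
```

So the padding is a property of NumPy's fixed-width layout, not of the strings themselves.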

src/onnx_ir/_core.py

src/onnx_ir/passes/common/initializer_deduplication.py

iksnagreb · 2025-08-28T10:55:55Z

Trying to summarize: the .string_data method corresponds to the referenced onnx spec above, the proposed .tobytes seems to be out of spec, so it should probably not be added to the StringTensor aiming to implement the onnx spec(?) and should not be used for serializing the tensor (as there could be differences in null-terminator and padding bytes, and .string_data already implements that according to the spec). However, it is sufficient for the purpose of getting a key to uniquely identify tensors for deduplication. So maybe we could keep this local and give it another name to make the intended use clear?
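To illustrate what a deduplication key has to guarantee, here is a minimal pure-Python sketch. Both `naive_key` and `dedup_key` are hypothetical names invented for this example; they are not the helper that ended up in the pass. The point is only that the key must preserve element boundaries, otherwise differently grouped strings collide.

```python
from typing import Sequence, Tuple


def naive_key(strings: Sequence[bytes]) -> bytes:
    """Hypothetical key that simply concatenates all elements."""
    # Element boundaries are lost here, which makes collisions possible.
    return b"".join(strings)


def dedup_key(
    shape: Sequence[int], strings: Sequence[bytes]
) -> Tuple[Tuple[int, ...], Tuple[bytes, ...]]:
    """Hypothetical key that keeps the shape and every element separate."""
    return (tuple(shape), tuple(strings))


first = [b"ab", b"c"]
second = [b"a", b"bc"]

# Same concatenated bytes, but different tensors.
print(naive_key(first) == naive_key(second))
print(dedup_key((2,), first) == dedup_key((2,), second))
```

A fixed-width byte layout with padding also keeps the boundaries intact, which is why it can serve as a key even though it is not a spec-conforming serialization.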

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

src/onnx_ir/passes/common/initializer_deduplication.py

justinchuby

Thank you!

justinchuby · 2025-08-28T15:19:09Z

Sorry for nitpicking, but could you move the comment inside the function definition? That way it reads as more closely related to the code.

justinchuby · 2025-08-28T15:19:42Z

Also commits need to be signed off

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

src/onnx_ir/passes/common/initializer_deduplication.py

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

justinchuby · 2025-09-05T14:17:03Z

src/onnx_ir/_enums.py

            DataType.FLOAT8E8M0,
        }

+    def is_string(self) -> bool:


I think this needs a versionadded line too

Sorry, #177
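For readers unfamiliar with the request, a `versionadded` line is a Sphinx directive placed in the docstring. The sketch below uses a trimmed-down stand-in enum, not the real `DataType` class, and the version string `X.Y.Z` is a placeholder because the thread does not state the release number.

```python
import enum


class DataType(enum.IntEnum):
    """Trimmed-down stand-in for illustration only."""

    FLOAT = 1
    STRING = 8

    def is_string(self) -> bool:
        """Return True if the data type is STRING.

        .. versionadded:: X.Y.Z
        """
        # X.Y.Z above is a placeholder, not the actual release number.
        return self == DataType.STRING


print(DataType.STRING.is_string(), DataType.FLOAT.is_string())
```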

iksnagreb requested review from titaiwangms and a team as code owners August 27, 2025 12:56

github-advanced-security bot found potential problems Aug 27, 2025

src/onnx_ir/_core.py (fixed)

Linting: Avoid extraneous parentheses

55512d0

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

justinchuby reviewed Aug 27, 2025

src/onnx_ir/_core.py (outdated, resolved)

justinchuby reviewed Aug 27, 2025

src/onnx_ir/_core.py (outdated, resolved)

justinchuby reviewed Aug 27, 2025

src/onnx_ir/passes/common/initializer_deduplication.py (outdated, resolved)

justinchuby added the module: passes label Aug 27, 2025

Remove size specialization of StringTensor which should have been nbytes

e4acdad

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

iksnagreb mentioned this pull request Aug 27, 2025

[Optimizer] Fix reinterpretation of strings in _get_numpy_value microsoft/onnxscript#2514

Merged

iksnagreb added 2 commits August 27, 2025 20:40

Add mapping of np.bytes_ to STRING in DataType.from_numpy

7341661

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

Support string tensors in DeduplicateHashedInitializersPass as well

94a0a5c

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

Add test cases for deduplicating string initializers

4e87c81

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

justinchuby reviewed Aug 27, 2025

src/onnx_ir/passes/common/initializer_deduplication_test.py (resolved)

justinchuby reviewed Aug 28, 2025

src/onnx_ir/_core.py (resolved)

justinchuby reviewed Aug 28, 2025

src/onnx_ir/passes/common/initializer_deduplication.py (resolved)

iksnagreb added 3 commits August 28, 2025 13:01

Test not deduplicating strings with differently grouped bytes sequence

5fbe4ee

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

Add test cases for StringTensor.nbytes

32ddce6

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

Address linting issues

8a4902d

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

justinchuby reviewed Aug 28, 2025

src/onnx_ir/passes/common/initializer_deduplication.py (outdated, resolved)

justinchuby approved these changes Aug 28, 2025


Comment on bytes representation for initializer deduplication key

8795661

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

iksnagreb force-pushed the string-properties branch from 8476ad0 to 8795661 Compare August 28, 2025 15:23

github-advanced-security bot found potential problems Aug 28, 2025

src/onnx_ir/passes/common/initializer_deduplication.py (fixed)

src/onnx_ir/passes/common/initializer_deduplication.py (fixed)

Address linting issues

8ae3925

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>

justinchuby approved these changes Aug 28, 2025


justinchuby enabled auto-merge (squash) August 28, 2025 15:55

justinchuby merged commit 7b9b90b into onnx:main Aug 28, 2025
20 checks passed

justinchuby reviewed Sep 5, 2025


