Skip to content

Conversation

iksnagreb
Copy link
Contributor

@iksnagreb iksnagreb commented Aug 27, 2025

Getting the number of bytes in a string tensor needs special treatment as the STRING data type does not define a bitwidth but needs to be computed from flattening the strings into a sequence of bytes. See call_onnx_api querying .nbytes to temporarily remove large initializers.

String initializers need to be converted to bytes via the .string_data method instead of the .tobytes method in DeduplicateInitializersPass.

Also adds mapping of np.bytes_ to STRING in DataType.from_numpy, which was probably just overlooked before, as object and np.str_ (unicode strings) are already handled. This is related to microsoft/onnxscript#2514

Getting the number of bytes in a string tensor needs special treatment
as the STRING data type does not define a bitwidth and the size needs to
be computed from flattening the strings into a sequence of bytes. See
call_onnx_api querying .nbytes to temporarily remove large initializers.

String initializers need to be converted to bytes via the .string_data
method instead of the .tobytes method in DeduplicateInitializersPass.

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>
@iksnagreb iksnagreb requested review from titaiwangms and a team as code owners August 27, 2025 12:56
Copy link

codecov bot commented Aug 27, 2025

Codecov Report

❌ Patch coverage is 84.61538% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.72%. Comparing base (ce1d0e6) to head (8ae3925).
⚠️ Report is 5 commits behind head on main.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
src/onnx_ir/_enums.py 66.66% 0 Missing and 1 partial ⚠️
...onnx_ir/passes/common/initializer_deduplication.py 85.71% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #169      +/-   ##
==========================================
+ Coverage   76.35%   76.72%   +0.36%     
==========================================
  Files          40       40              
  Lines        4809     4816       +7     
  Branches      952      952              
==========================================
+ Hits         3672     3695      +23     
+ Misses        840      827      -13     
+ Partials      297      294       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>
@justinchuby
Copy link
Member

Could you make similar changes to DeduplicateHashedInitializersPass as well if possible? Thanks!

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>
@justinchuby
Copy link
Member

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>
Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>
@iksnagreb
Copy link
Contributor Author

I wonder if there is a particular reason to not implement .tobytes for StringTensor? Otherwise the changes to the two deduplicate initializer passes could be unified by simply implementing the .tobytes method without weird workarounds checking for the data type just to do basically .string_data().tobytes() instead?

@justinchuby
Copy link
Member

justinchuby commented Aug 27, 2025

The onnx proto def does not support string as bytes value. So I didn’t see a need for the method at the time. Maybe it’s time to revisit this?

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>
@justinchuby
Copy link
Member

@iksnagreb I thought about this more and still have this question: In onnx the strings are spec'ed to be utf8: https://github.com/onnx/onnx/blob/72accdac9bacb61b5d4d871b10a9368b1912cdfd/onnx/onnx.proto#L699-L704 which does not allow null terminators. In theory we shouldn't see the null spacing between each element?

@justinchuby
Copy link
Member

>>> np.array((b"a", b"b"))
array([b'a', b'b'], dtype='|S1')
>>> np.array((b"a", b"b")).tobytes()
b'ab'

This is what I got

@iksnagreb
Copy link
Contributor Author

iksnagreb commented Aug 28, 2025

Trying to summarize: the .string_data method corresponds to the referenced onnx spec above, the proposed .tobytes seems to be out of spec, so it should probably not be added to the StringTensor aiming to implement the onnx spec(?) and should not be used for serializing the tensor (as there could be differences in null-terminator and padding bytes and .string_dataalready implements that according to the spec). However, it is sufficient for the purpose of getting a key to uniquely identify tensors for deduplication. So maybe we could keep this local and give it another name to make the intended use clear?

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>
Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>
Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>
Copy link
Member

@justinchuby justinchuby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@justinchuby
Copy link
Member

Sorry for nitpicking but could you move the comment to inside the function definition? This way they can be seen more related

@justinchuby
Copy link
Member

Also commits need to be signed off

Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>
Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>
@justinchuby justinchuby enabled auto-merge (squash) August 28, 2025 15:55
@justinchuby justinchuby merged commit 7b9b90b into onnx:main Aug 28, 2025
20 checks passed
DataType.FLOAT8E8M0,
}

def is_string(self) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs a versionadded line too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, #177

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants