Support string tensors in DeduplicateInitializersPass and call_onnx_api #169
Getting the number of bytes in a string tensor needs special treatment as the STRING data type does not define a bitwidth and the size needs to be computed from flattening the strings into a sequence of bytes. See call_onnx_api querying .nbytes to temporarily remove large initializers.
String initializers need to be converted to bytes via the .string_data method instead of the .tobytes method in DeduplicateInitializersPass.
Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main     #169      +/-   ##
==========================================
+ Coverage   76.35%   76.72%   +0.36%
==========================================
  Files          40       40
  Lines        4809     4816       +7
  Branches      952      952
==========================================
+ Hits         3672     3695      +23
+ Misses        840      827      -13
+ Partials      297      294       -3
Signed-off-by: Christoph Berganski <christoph.berganski@gmail.com>
Could you make similar changes to DeduplicateHashedInitializersPass as well if possible? Thanks!
I wonder if there is a particular reason to not implement
The ONNX proto definition does not support a string as a bytes value, so I didn’t see a need for the method at the time. Maybe it’s time to revisit this?
@iksnagreb I thought about this more and still have a question: in ONNX, strings are specified to be UTF-8 (https://github.com/onnx/onnx/blob/72accdac9bacb61b5d4d871b10a9368b1912cdfd/onnx/onnx.proto#L699-L704), which does not allow null terminators. So in theory we shouldn't see null spacing between the elements?
This is what I got
Trying to summarize: the
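For readers following the null-spacing question, here is a minimal NumPy-only illustration of one place such nulls can come from. It assumes the strings are held in a fixed-width NumPy array; whether this is exactly the effect seen in this thread is not confirmed by the conversation. The variable names are made up for the example.

```python
import numpy as np

# Fixed-width byte strings: every element occupies the width of the longest one.
fixed_bytes = np.array([b"a", b"bcd"])
print(fixed_bytes.tobytes())  # b'a\x00\x00bcd' (shorter element is null-padded)
print(fixed_bytes.tolist())   # [b'a', b'bcd'] (padding is stripped per element)

# Fixed-width unicode strings are stored as 4 bytes per character,
# so the raw buffer contains nulls even for plain ASCII text.
fixed_unicode = np.array(["a", "bc"])
print(len(fixed_unicode.tobytes()))  # 16 = 2 elements * 2 characters * 4 bytes
```

The padding belongs to NumPy's in-memory layout, not to the UTF-8 encoding of the individual strings.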
Thank you!
Sorry for nitpicking, but could you move the comment inside the function definition? That way it reads as more closely related to the code.
Also, commits need to be signed off.
8476ad0 to 8795661
        DataType.FLOAT8E8M0,
    }

    def is_string(self) -> bool:
I think this needs a versionadded line too
Sorry, #177
Getting the number of bytes in a string tensor needs special treatment, as the STRING data type does not define a bitwidth; instead, the size needs to be computed by flattening the strings into a sequence of bytes. See call_onnx_api querying .nbytes to temporarily remove large initializers.
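As a rough sketch of the size computation described above, the snippet below sums the encoded length of each element of a NumPy string array. It is NumPy-only and is not the code from this PR; `string_tensor_nbytes` is a hypothetical helper name, and encoding `str` elements as UTF-8 is an assumption.

```python
import numpy as np


def string_tensor_nbytes(array: np.ndarray) -> int:
    """Sum the byte lengths of all elements of a string array (hypothetical helper)."""
    total = 0
    for item in array.flatten():
        # Elements are str for object/np.str_ arrays and bytes for np.bytes_ arrays.
        # Assumption: str elements are counted by their UTF-8 encoding.
        encoded = item.encode("utf-8") if isinstance(item, str) else bytes(item)
        total += len(encoded)
    return total


print(string_tensor_nbytes(np.array(["a", "bc", "déf"], dtype=object)))  # 7
print(string_tensor_nbytes(np.array([b"ab", b"c"])))  # 3
```

Note that "é" takes two bytes in UTF-8, so the count is a byte count rather than a character count.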
String initializers need to be converted to bytes via the .string_data method instead of the .tobytes method in DeduplicateInitializersPass.
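To illustrate why a raw buffer may not be a good content key for strings, here is a small NumPy-only sketch. It does not use the `.string_data` method or the pass from this PR; `string_dedup_key` is a hypothetical helper, and the fixed-width arrays are only an assumed stand-in for string initializers.

```python
import numpy as np


def string_dedup_key(array: np.ndarray) -> tuple:
    """Build a hashable key from the shape and per-element bytes (hypothetical helper)."""
    items = tuple(
        # Assumption: str elements are compared by their UTF-8 encoding.
        item.encode("utf-8") if isinstance(item, str) else bytes(item)
        for item in array.flatten()
    )
    return (array.shape, items)


narrow = np.array([b"a", b"b"], dtype="S1")
wide = np.array([b"a", b"b"], dtype="S4")

# The raw buffers differ because the wider dtype pads each element with nulls.
print(narrow.tobytes() == wide.tobytes())  # False
# The per-element keys agree, since the logical contents are the same.
print(string_dedup_key(narrow) == string_dedup_key(wide))  # True
```

Keeping the elements as a tuple, rather than concatenating them, avoids treating `["ab", "c"]` and `["a", "bc"]` as equal.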
Also adds mapping of np.bytes_ to STRING in DataType.from_numpy, which was probably just overlooked before, as object and np.str_ (unicode strings) are already handled. This is related to microsoft/onnxscript#2514