Description
Cloud AutoML models store their vocabulary as a weight of dtype "string". The infrastructure for string tensors already exists in core, but we cannot yet serialize the string values into the .bin file.
The proposal is to follow the protobuf spec for strings, which encodes strings as UTF-8, and then use the built-in TextDecoder in JS to decode the UTF-8 bytes into JS strings.
The numpy array of strings in Python will follow the built-in serialization (UTF-8 strings separated by a null byte), which can then be converted into string[] using new TextDecoder().decode(bytes).split('\u0000').
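As a rough illustration of the round trip described above, here is a minimal TypeScript sketch. The helper names encodeStrings and decodeStrings are hypothetical and not part of any existing API. The encoding side uses TextEncoder only to stand in for whatever the Python converter would write.

```typescript
// Hypothetical helpers illustrating the proposed format; not an existing API.

// Join the strings with the separator, then encode the result as UTF-8.
function encodeStrings(values: string[], separator = '\u0000'): Uint8Array {
  return new TextEncoder().encode(values.join(separator));
}

// Decode the UTF-8 bytes into one JS string, then split on the separator.
function decodeStrings(bytes: Uint8Array, separator = '\u0000'): string[] {
  return new TextDecoder('utf-8').decode(bytes).split(separator);
}

// Multi-byte characters survive because U+0000 encodes to a single 0x00 byte,
// and that byte never appears inside another character's UTF-8 sequence.
const vocab: string[] = ['cat', 'naïve', '日本語'];
const bytes: Uint8Array = encodeStrings(vocab);
const roundTripped: string[] = decodeStrings(bytes);
```

Two limits of this scheme are worth keeping in mind. A string that itself contains U+0000 would be split in two. An empty byte range also decodes to [''] rather than [], so the tensor shape is still needed to tell an empty tensor from a single empty string.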
The separator code point (\u0000) and the byte length of the string tensor will be stored as metadata in the WeightsManifestEntry. We should store the separator so that we can change it later while remaining backwards compatible with previously serialized files.
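To make the role of the metadata concrete, the sketch below shows how a loader might use it. The field names delimiter and byteLength, the StringWeightEntry interface, and the readStringWeight helper are all assumptions made for illustration. They are not the actual WeightsManifestEntry definition.

```typescript
// Hypothetical manifest entry for a string weight. The `delimiter` and
// `byteLength` field names are assumptions, not the real schema.
interface StringWeightEntry {
  name: string;
  shape: number[];
  dtype: 'string';
  delimiter: string;   // separator code point used when the file was written
  byteLength: number;  // number of bytes this weight occupies in the .bin data
}

// Read one string weight starting at `offset` and report where the next
// weight begins. Strings have no fixed element size, so the byte length
// cannot be derived from the shape and must come from the manifest.
function readStringWeight(
    bin: Uint8Array, offset: number,
    entry: StringWeightEntry): {values: string[], nextOffset: number} {
  const slice = bin.subarray(offset, offset + entry.byteLength);
  const values = new TextDecoder('utf-8').decode(slice).split(entry.delimiter);
  return {values, nextOffset: offset + entry.byteLength};
}

// Build a toy buffer: a two-element vocabulary followed by four bytes that
// stand in for the next weight in the file.
const vocabBytes = new TextEncoder().encode(['yes', 'no'].join('\u0000'));
const trailing = new Uint8Array([1, 2, 3, 4]);
const bin = new Uint8Array(vocabBytes.length + trailing.length);
bin.set(vocabBytes, 0);
bin.set(trailing, vocabBytes.length);

const entry: StringWeightEntry = {
  name: 'vocab',
  shape: [2],
  dtype: 'string',
  delimiter: '\u0000',
  byteLength: vocabBytes.length,
};
const result = readStringWeight(bin, 0, entry);
```

Because the loader reads the separator from the entry instead of hard-coding it, files written with a different separator in the future could be decoded by the same code path.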