Support weight serialization of string tensors #1598

@dsmilkov

Cloud AutoML models store their vocabulary as a weight of dtype "string". The infrastructure for string tensors already exists in core, but we cannot yet serialize the string values into the .bin file.

The proposal is to follow the protobuf spec for strings, which encodes strings as UTF-8, and then use the built-in TextDecoder in JS to decode the UTF-8 bytes into JS strings.

The NumPy array of strings in Python will follow the built-in serialization (UTF-8 strings separated by a null byte), which can then be converted into string[] using new TextDecoder().decode(bytes).split('\u0000').
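A minimal sketch of the round trip described above, assuming the values are simply joined with the null separator before encoding. The helper names `encodeStrings` and `decodeStrings` are hypothetical and not part of any existing API; only `TextEncoder` and `TextDecoder` are built-ins.

```javascript
// Hypothetical helpers for illustration only.
const SEPARATOR = '\u0000';

// Join the strings with the separator and encode the result as UTF-8 bytes.
function encodeStrings(values) {
  return new TextEncoder().encode(values.join(SEPARATOR));
}

// Decode UTF-8 bytes and split on the separator to recover the strings.
function decodeStrings(bytes, separator = SEPARATOR) {
  return new TextDecoder('utf-8').decode(bytes).split(separator);
}

const vocab = ['hello', 'wörld', '日本語'];
const bytes = encodeStrings(vocab);
const roundTripped = decodeStrings(bytes);
```

Two caveats with this sketch: a payload that ends with a trailing separator would decode with an extra empty string at the end, and a string that itself contains `\u0000` would be split in two, so the writer and reader need to agree on both conventions.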

The separator code point (\u0000) and the byte length of the string tensor will be stored as metadata in the WeightsManifestEntry. Storing the separator lets us change it later while staying backwards compatible with previously serialized files.
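A sketch of how a reader might use that metadata, assuming the manifest entry gains two extra fields. The field names `byteLength` and `separator` and the helper `readStringWeight` are hypothetical; the issue does not fix their names. The byte length is needed because, unlike numeric dtypes, the size of a string tensor in the .bin file cannot be derived from its shape.

```javascript
// Hypothetical manifest entry; `byteLength` and `separator` are assumed field names.
const entry = {
  name: 'vocab',
  shape: [3],
  dtype: 'string',
  byteLength: 0, // filled in once the values are encoded
  separator: '\u0000',
};

// Encode the values the same way the writer would.
const encoded = new TextEncoder().encode(['a', 'bb', 'ccc'].join(entry.separator));
entry.byteLength = encoded.byteLength;

// Simulate a .bin buffer in which other weight data follows the string tensor.
const bin = new Uint8Array(encoded.byteLength + 4);
bin.set(encoded, 0);
bin.set([1, 2, 3, 4], encoded.byteLength);

// Slice exactly `byteLength` bytes at `offset`, then decode and split.
function readStringWeight(buffer, offset, manifestEntry) {
  const slice = buffer.subarray(offset, offset + manifestEntry.byteLength);
  return new TextDecoder('utf-8').decode(slice).split(manifestEntry.separator);
}

const decoded = readStringWeight(bin, 0, entry);
```

Because the reader takes the separator from the entry rather than hard-coding it, files written with a different separator in the future would still decode correctly.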
