
Avro Adapter - Dictionary Encoding #731

@martin-traverse

Describe the enhancement requested

I'd like to add dictionary-encoded values to the Avro adapter and get them working for a full round trip with both schema and data. I'm doing some (unrelated) work on dictionary-encoded values at the moment, so it's easy for me to work on this at the same time.

My current thinking is that dictionary encoding will be supported for string values only, encoded as Avro enums. For write operations the entire dictionary will need to be specified up front. That is fine for a single batch; once we add multi-batch, there will be limitations on streams with dictionary updates. When reading, the index type should be the smallest signed int type that will hold all the values. Avro has no concept of ordering, so reading will always create unordered dictionaries and ordering will be lost in a round trip. I think this approach is correct - if anyone has a different opinion please do shout!
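
To make the proposal concrete, here is a rough sketch of the mapping in Python. It is illustrative only and is not the adapter's code: the function names (`dictionary_to_avro_enum`, `smallest_signed_index_bits`, `encode_indices`) are hypothetical, and it assumes dictionary indices start at 0 and that every dictionary value is a valid Avro enum symbol.

```python
import re

# Avro names, including enum symbols, must match this pattern per the Avro spec.
AVRO_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def dictionary_to_avro_enum(name, values):
    """Build an Avro enum schema (as a dict) from a full string dictionary.

    The whole dictionary must be known up front, because the enum symbols
    are part of the schema.
    """
    if len(set(values)) != len(values):
        raise ValueError("dictionary values must be unique")
    for value in values:
        if not AVRO_NAME.fullmatch(value):
            raise ValueError(f"not a valid Avro enum symbol: {value!r}")
    return {"type": "enum", "name": name, "symbols": list(values)}


def smallest_signed_index_bits(num_symbols):
    """Return the bit width (8, 16 or 32) of the smallest signed int type
    that can hold every index in the range 0 .. num_symbols - 1."""
    if num_symbols < 0:
        raise ValueError("num_symbols must be non-negative")
    max_index = max(num_symbols - 1, 0)
    for bits in (8, 16, 32):
        if max_index <= 2 ** (bits - 1) - 1:
            return bits
    # Avro encodes enum positions as a 32-bit int, so nothing larger applies.
    raise ValueError("too many symbols for an Avro enum")


def encode_indices(values, dictionary):
    """Map each string value to its position in the dictionary."""
    positions = {symbol: i for i, symbol in enumerate(dictionary)}
    return [positions[value] for value in values]
```

One consequence worth noting: because Avro enum symbols are restricted to name-like strings, a dictionary containing values such as `"hello world"` could not be written as an enum under this sketch and would need some other handling.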

This will be part 3 in the Avro adapter series, following #615 and #698, so file-level capabilities will be part 4; hope that's ok. A PR will follow in a few days for review.
