
ExternalSorter Fails to Spill Dictionaries #4658

@tustvold

Describe the bug

ExternalSorter currently spills data using the Arrow IPC format. Unfortunately, the IPC file format does not support replacing a dictionary with the same ID. Consequently, if ExternalSorter spills two batches with dictionary-encoded columns, it fails with the following error:

Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches
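
To make the failure mode concrete, here is a minimal toy model of the constraint. It is not the arrow-rs IPC writer: `DictColumn` and `ToyFileWriter` are hypothetical names, and the sketch assumes the writer simply remembers the first dictionary it sees for each ID and rejects any later batch that carries a different one.

```rust
use std::collections::HashMap;

/// A dictionary-encoded column: each entry of `keys` indexes into `values`.
/// (Hypothetical type for illustration only.)
#[derive(Debug, Clone, PartialEq)]
struct DictColumn {
    dict_id: i64,
    values: Vec<String>,
    keys: Vec<usize>,
}

/// Toy stand-in for a file-format writer that allows only one dictionary
/// per ID across all batches. This is an assumption-based sketch, not the
/// real arrow-rs API.
#[derive(Default)]
struct ToyFileWriter {
    dictionaries: HashMap<i64, Vec<String>>,
    batches_written: usize,
}

impl ToyFileWriter {
    fn write(&mut self, col: &DictColumn) -> Result<(), String> {
        // Every key must point at an existing dictionary value.
        if col.keys.iter().any(|&k| k >= col.values.len()) {
            return Err("dictionary key out of bounds".to_string());
        }
        // A different dictionary for an ID we have already seen is rejected.
        let replaced = self
            .dictionaries
            .get(&col.dict_id)
            .map_or(false, |existing| existing != &col.values);
        if replaced {
            return Err(format!(
                "dictionary replacement detected for id {}",
                col.dict_id
            ));
        }
        // First time we see this ID: remember its dictionary.
        self.dictionaries
            .entry(col.dict_id)
            .or_insert_with(|| col.values.clone());
        self.batches_written += 1;
        Ok(())
    }
}
```

In this model, writing a second batch that reuses dictionary ID 0 with different values returns an error, which mirrors what happens when two independently encoded batches are spilled to the same file.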

To Reproduce

Expected behavior

Various options present themselves:

  • Number the dictionary IDs consistently across DataFusion (a monumental task)
  • Write batches to separate files
  • Spill to the row format instead of Arrow IPC

Of these, I think the last is the most compelling. I plan to work on this in the coming month.
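
As a rough illustration of why a row-oriented spill sidesteps the problem, the sketch below writes each row's materialized value instead of a key plus a shared dictionary. This is not the arrow-rs row encoding. `spill_rows` and `read_rows` are hypothetical helpers, and the length-prefixed layout is an assumption chosen only to keep the example short.

```rust
/// Append one length-prefixed byte string per row to `out`.
/// `values` is the batch's dictionary and `keys` selects a value per row.
/// Panics if a key is out of bounds (acceptable for a sketch).
fn spill_rows(values: &[&str], keys: &[usize], out: &mut Vec<u8>) {
    for &k in keys {
        let bytes = values[k].as_bytes();
        // 4-byte little-endian length, then the value itself.
        out.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
        out.extend_from_slice(bytes);
    }
}

/// Read back every row written by `spill_rows`.
/// Assumes the buffer is well formed; a truncated buffer would panic.
fn read_rows(mut buf: &[u8]) -> Vec<String> {
    let mut rows = Vec::new();
    while buf.len() >= 4 {
        let (len_bytes, rest) = buf.split_at(4);
        let len = u32::from_le_bytes([
            len_bytes[0],
            len_bytes[1],
            len_bytes[2],
            len_bytes[3],
        ]) as usize;
        let (value, rest) = rest.split_at(len);
        rows.push(String::from_utf8_lossy(value).into_owned());
        buf = rest;
    }
    rows
}
```

Because each spilled row carries its own value, two batches with unrelated dictionaries can be appended to the same buffer without any dictionary ID bookkeeping. The cost is that repeated values are no longer deduplicated on disk.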

Additional context

It is possible that Ballista also runs into this issue.

Labels

bug
