Closed
Labels
bug: Something isn't working
Description
Describe the bug
ExternalSorter currently spills data using the Arrow IPC file format. Unfortunately, the IPC file format does not support replacing a dictionary that has the same ID. Consequently, if ExternalSorter spills two batches with dictionary-encoded columns whose dictionaries differ, it fails with:
Dictionary replacement detected when writing IPC file format. Arrow IPC files only support a single dictionary for a given field across all batches
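To make the failure mode concrete, here is a minimal sketch of the constraint using only the standard library. It is a toy model, not the arrow crate's API: `Batch`, `ToyFileWriter`, and `spill_two_batches` are hypothetical names invented for illustration, and the only behaviour modelled is "a dictionary ID may be bound to one dictionary per file".

```rust
use std::collections::HashMap;

/// A batch reduced to what matters here: one dictionary-encoded column.
/// (Hypothetical type, not an Arrow RecordBatch.)
struct Batch {
    dict_id: i64,
    dictionary: Vec<String>,
    keys: Vec<usize>,
}

/// Toy stand-in for an IPC *file* writer: it remembers the dictionary
/// written for each ID and refuses to replace it.
#[derive(Default)]
struct ToyFileWriter {
    written: HashMap<i64, Vec<String>>,
    rows: usize,
}

impl ToyFileWriter {
    fn write(&mut self, batch: &Batch) -> Result<(), String> {
        // A different dictionary under an already-used ID is a replacement.
        let replaced = self
            .written
            .get(&batch.dict_id)
            .map_or(false, |existing| *existing != batch.dictionary);
        if replaced {
            return Err(format!(
                "dictionary replacement detected for id {}",
                batch.dict_id
            ));
        }
        self.written
            .entry(batch.dict_id)
            .or_insert_with(|| batch.dictionary.clone());
        self.rows += batch.keys.len();
        Ok(())
    }
}

/// Two spilled batches that reuse dictionary ID 0 with different values.
fn spill_two_batches() -> Result<(), String> {
    let first = Batch {
        dict_id: 0,
        dictionary: vec!["a".into(), "b".into()],
        keys: vec![0, 1, 0],
    };
    let second = Batch {
        dict_id: 0,
        dictionary: vec!["x".into(), "y".into()],
        keys: vec![1, 0],
    };
    let mut writer = ToyFileWriter::default();
    writer.write(&first)?;
    writer.write(&second) // fails: ID 0 is already bound to ["a", "b"]
}
```

In this model, `spill_two_batches()` returns an `Err` on the second write, while writing the same batch twice succeeds because the dictionary is unchanged.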
To Reproduce
Expected behavior
Various options present themselves:
- Number the dictionary IDs consistently across DataFusion (a monumental task)
- Write batches to separate files
- Spill to the row format instead of Arrow IPC
Of these, I think the last is the most compelling. I plan to work on this in the coming month.
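As a rough illustration of why a row-oriented spill sidesteps the problem, the sketch below decodes dictionary keys into their values before writing, so the spilled bytes carry no dictionary state at all. This is not DataFusion's row format: `DictColumn`, `decode_to_rows`, `spill_rows`, and `read_rows` are hypothetical names, and the length-prefixed encoding is an assumption chosen for brevity.

```rust
/// One dictionary-encoded string column (hypothetical, simplified type).
struct DictColumn {
    dictionary: Vec<String>,
    keys: Vec<usize>,
}

/// Materialize each key into its value; returns None on an out-of-range key.
fn decode_to_rows(col: &DictColumn) -> Option<Vec<String>> {
    col.keys
        .iter()
        .map(|&k| col.dictionary.get(k).cloned())
        .collect()
}

/// Append rows to a spill buffer as length-prefixed UTF-8 bytes.
fn spill_rows(buf: &mut Vec<u8>, rows: &[String]) {
    for row in rows {
        buf.extend_from_slice(&(row.len() as u32).to_le_bytes());
        buf.extend_from_slice(row.as_bytes());
    }
}

/// Read every row back from a buffer produced by `spill_rows`.
fn read_rows(buf: &[u8]) -> Vec<String> {
    let mut rows = Vec::new();
    let mut pos = 0;
    while pos + 4 <= buf.len() {
        let len =
            u32::from_le_bytes([buf[pos], buf[pos + 1], buf[pos + 2], buf[pos + 3]]) as usize;
        pos += 4;
        rows.push(String::from_utf8_lossy(&buf[pos..pos + len]).into_owned());
        pos += len;
    }
    rows
}
```

Because each row is self-contained, two batches with unrelated dictionaries can be appended to the same spill buffer and read back in order. The trade-off is that the compactness of dictionary encoding is lost in the spilled data.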
Additional context
It is possible that Ballista also runs into this issue.