Closed
Labels
enhancement: New feature or request
Description
Is your feature request related to a problem or challenge?
Today, when DataFusion spills files to disk, it uses the Arrow IPC format.
Here is the code:
datafusion/datafusion/physical-plan/src/spill.rs
Lines 60 to 88 in 988a535
pub(crate) fn spill_record_batches(
    batches: &[RecordBatch],
    path: PathBuf,
    schema: SchemaRef,
) -> Result<(usize, usize)> {
    let mut writer = IPCStreamWriter::new(path.as_ref(), schema.as_ref())?;
    for batch in batches {
        writer.write(batch)?;
    }
    writer.finish()?;
    debug!(
        "Spilled {} batches of total {} rows to disk, memory released {}",
        writer.num_batches,
        writer.num_rows,
        human_readable_size(writer.num_bytes),
    );
    Ok((writer.num_rows, writer.num_bytes))
}

fn read_spill(sender: Sender<Result<RecordBatch>>, path: &Path) -> Result<()> {
    let file = BufReader::new(File::open(path)?);
    let reader = StreamReader::try_new(file, None)?;
    for batch in reader {
        sender
            .blocking_send(batch.map_err(Into::into))
            .map_err(|e| exec_datafusion_err!("{e}"))?;
    }
    Ok(())
}
The IPC reader currently re-validates that all the data written is valid Arrow data (for example, that the strings are valid UTF-8).
The upcoming arrow-rs release (Release arrow-rs / parquet minor version 54.3.0 (Mar 2025) arrow-rs#7107) has the ability to disable this validation.
Disabling the validation resulted in a 3x performance increase in the arrow benchmarks.
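To make the cost concrete, here is a small standard-library-only analogy (it is not the arrow-rs code path). It contrasts a decode that re-checks UTF-8 with one that trusts the producer. The function names `decode_checked` and `decode_trusted` are made up for illustration.

```rust
/// Validating path: scans every byte to confirm the buffer is valid UTF-8.
fn decode_checked(bytes: Vec<u8>) -> Result<String, std::string::FromUtf8Error> {
    String::from_utf8(bytes)
}

/// Trusting path: skips the scan entirely.
///
/// # Safety
/// The caller must guarantee that `bytes` is valid UTF-8, for example because
/// the same process produced it from a `String` and nothing modified it since.
unsafe fn decode_trusted(bytes: Vec<u8>) -> String {
    // SAFETY: upheld by the caller, per the contract documented above.
    unsafe { String::from_utf8_unchecked(bytes) }
}
```

The checked version does work proportional to the data size on every read, while the trusted version is essentially free but moves the correctness burden onto the caller. That is the same trade-off as skipping IPC validation for files DataFusion wrote itself.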
Here are the relevant arrow-rs PRs / issues:
- Improve Arrow-IPC performance by avoiding Unsafe Unchecked IPC Read RecordBatch arrow-rs#3287
- Add with_skip_validation flag to IPC StreamReader, FileReader and FileDecoder arrow-rs#7120
Describe the solution you'd like
I would like to disable validation when reading the spill files back in.
Describe alternatives you've considered
- Disable validation when reading spill files in
- Justify that change with comments explaining that we trust that nothing has modified the files after DataFusion wrote them
- Add / use a benchmark showing the performance benefit of doing this
Additional context