Improve spill performance: Disable re-validation of spilled files #15320

@alamb

Description

Is your feature request related to a problem or challenge?

Today, when DataFusion spills data to disk, it writes the files using the Arrow IPC stream format.

Here is the code:

pub(crate) fn spill_record_batches(
    batches: &[RecordBatch],
    path: PathBuf,
    schema: SchemaRef,
) -> Result<(usize, usize)> {
    let mut writer = IPCStreamWriter::new(path.as_ref(), schema.as_ref())?;
    for batch in batches {
        writer.write(batch)?;
    }
    writer.finish()?;
    debug!(
        "Spilled {} batches of total {} rows to disk, memory released {}",
        writer.num_batches,
        writer.num_rows,
        human_readable_size(writer.num_bytes),
    );
    Ok((writer.num_rows, writer.num_bytes))
}

fn read_spill(sender: Sender<Result<RecordBatch>>, path: &Path) -> Result<()> {
    let file = BufReader::new(File::open(path)?);
    let reader = StreamReader::try_new(file, None)?;
    for batch in reader {
        sender
            .blocking_send(batch.map_err(Into::into))
            .map_err(|e| exec_datafusion_err!("{e}"))?;
    }
    Ok(())
}

When reading these files back, the IPC reader currently re-validates that all of the data is valid Arrow data (for example, that the strings are valid UTF-8).
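To make that cost concrete, here is a minimal sketch (not DataFusion code; the buffers are made up for illustration) of the difference between checked and unchecked construction of a string array in arrow-rs, which is the same kind of check the IPC reader performs:

use arrow::array::StringArray;
use arrow::buffer::{Buffer, OffsetBuffer};

fn main() {
    // Raw buffers, as they might come out of a spill file: offsets plus bytes
    let offsets = OffsetBuffer::new(vec![0i32, 5, 10].into());
    let values = Buffer::from(b"helloworld".as_ref());

    // Checked construction: verifies the offsets and that the bytes are
    // valid UTF-8, which requires a pass over all of the data
    let checked = StringArray::try_new(offsets.clone(), values.clone(), None).unwrap();
    assert_eq!(checked.value(0), "hello");

    // Unchecked construction: skips those checks entirely
    // SAFETY: these buffers came from a trusted writer, so they are known
    // to be valid
    let unchecked = unsafe { StringArray::new_unchecked(offsets, values, None) };
    assert_eq!(unchecked.value(1), "world");
}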

Disabling this validation resulted in a 3x performance increase in the arrow-rs benchmarks.

Here are the relevant arrow-rs PRs / issues:

Describe the solution you'd like

I would like to disable validation when reading the spill files back in.
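
For example, the change could look something like the sketch below. The unsafe with_skip_validation method on StreamReader is my understanding of the API added by the arrow-rs work referenced above; treat its exact name and signature as an assumption to be confirmed against the arrow-rs version DataFusion uses:

fn read_spill(sender: Sender<Result<RecordBatch>>, path: &Path) -> Result<()> {
    let file = BufReader::new(File::open(path)?);
    let reader = StreamReader::try_new(file, None)?;
    // SAFETY: DataFusion itself wrote this file during the same query and
    // nothing else should have modified it, so the bytes are trusted Arrow
    // IPC data and re-validating them (e.g. UTF-8 checks) is unnecessary.
    // NOTE: `with_skip_validation` is assumed here from the arrow-rs
    // skip-validation work referenced above.
    let reader = unsafe { reader.with_skip_validation(true) };
    for batch in reader {
        sender
            .blocking_send(batch.map_err(Into::into))
            .map_err(|e| exec_datafusion_err!("{e}"))?;
    }
    Ok(())
}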

Describe alternatives you've considered

  1. Disable validation when reading spill files back in
  2. Justify that change with comments explaining that we trust that nothing modified the files after DataFusion wrote them
  3. Add / use a benchmark showing the performance benefit of doing this (see the sketch after this list)
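
For item 3, here is a minimal, self-contained sketch of such a measurement, using std::time::Instant rather than DataFusion's criterion benches, and again assuming a with_skip_validation-style API exists on StreamReader:

use std::fs::File;
use std::io::BufReader;
use std::sync::Arc;
use std::time::Instant;

use arrow::array::{ArrayRef, StringArray};
use arrow::ipc::reader::StreamReader;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Write a string-heavy batch to a temporary IPC stream file, mimicking a spill
    let strings: Vec<String> = (0..1_000_000).map(|i| format!("row-{i}")).collect();
    let array: ArrayRef = Arc::new(StringArray::from(strings));
    let batch = RecordBatch::try_from_iter([("col", array)])?;

    let path = std::env::temp_dir().join("spill_bench.arrow");
    let mut writer = StreamWriter::try_new(File::create(&path)?, &batch.schema())?;
    writer.write(&batch)?;
    writer.finish()?;

    // Read it back with validation (today's behavior)
    let start = Instant::now();
    let reader = StreamReader::try_new(BufReader::new(File::open(&path)?), None)?;
    for batch in reader {
        std::hint::black_box(batch?);
    }
    println!("validated read:   {:?}", start.elapsed());

    // Read it back with validation disabled (the proposed behavior)
    // NOTE: `with_skip_validation` is assumed from the arrow-rs work above
    let start = Instant::now();
    let reader = StreamReader::try_new(BufReader::new(File::open(&path)?), None)?;
    let reader = unsafe { reader.with_skip_validation(true) };
    for batch in reader {
        std::hint::black_box(batch?);
    }
    println!("unvalidated read: {:?}", start.elapsed());

    Ok(())
}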

Additional context

Labels: enhancement (New feature or request)
