Improve spill performance: Disable re-validation of spilled files #15320

@alamb

Description

Is your feature request related to a problem or challenge?

Today, when DataFusion spills data to disk, it writes the files using the Arrow IPC stream format.

Here is the code:

pub(crate) fn spill_record_batches(
    batches: &[RecordBatch],
    path: PathBuf,
    schema: SchemaRef,
) -> Result<(usize, usize)> {
    let mut writer = IPCStreamWriter::new(path.as_ref(), schema.as_ref())?;
    for batch in batches {
        writer.write(batch)?;
    }
    writer.finish()?;
    debug!(
        "Spilled {} batches of total {} rows to disk, memory released {}",
        writer.num_batches,
        writer.num_rows,
        human_readable_size(writer.num_bytes),
    );
    Ok((writer.num_rows, writer.num_bytes))
}

fn read_spill(sender: Sender<Result<RecordBatch>>, path: &Path) -> Result<()> {
    let file = BufReader::new(File::open(path)?);
    let reader = StreamReader::try_new(file, None)?;
    for batch in reader {
        sender
            .blocking_send(batch.map_err(Into::into))
            .map_err(|e| exec_datafusion_err!("{e}"))?;
    }
    Ok(())
}

When reading these files back, the IPC reader currently re-validates that all of the data is valid Arrow data (for example, that the strings are valid UTF-8).
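To make that cost concrete, here is a minimal sketch (not DataFusion code; the buffers are made up for illustration) of the difference between checked and unchecked construction of a string array in arrow-rs, which is the same kind of check the IPC reader performs:

use arrow::array::StringArray;
use arrow::buffer::{Buffer, OffsetBuffer};

fn main() {
    // Raw buffers, as they might come out of a spill file: offsets plus bytes
    let offsets = OffsetBuffer::new(vec![0i32, 5, 10].into());
    let values = Buffer::from(b"helloworld".as_ref());

    // Checked construction: verifies the offsets and that the bytes are
    // valid UTF-8, which requires a pass over all of the data
    let checked = StringArray::try_new(offsets.clone(), values.clone(), None).unwrap();
    assert_eq!(checked.value(0), "hello");

    // Unchecked construction: skips those checks entirely
    // SAFETY: these buffers came from a trusted writer, so they are known
    // to be valid
    let unchecked = unsafe { StringArray::new_unchecked(offsets, values, None) };
    assert_eq!(unchecked.value(1), "world");
}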

Disabling this validation resulted in a 3x performance increase in the arrow-rs benchmarks.

Here are the relevant arrow-rs PRs / issues:

Describe the solution you'd like

I would like to disable validation when reading the spill files back in.
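
For example, the change could look something like the sketch below. The unsafe with_skip_validation method on StreamReader is my understanding of the API added by the arrow-rs work referenced above; treat its exact name and signature as an assumption to be confirmed against the arrow-rs version DataFusion uses:

fn read_spill(sender: Sender<Result<RecordBatch>>, path: &Path) -> Result<()> {
    let file = BufReader::new(File::open(path)?);
    let reader = StreamReader::try_new(file, None)?;
    // SAFETY: DataFusion itself wrote this file during the same query and
    // nothing else should have modified it, so the bytes are trusted Arrow
    // IPC data and re-validating them (e.g. UTF-8 checks) is unnecessary.
    // NOTE: `with_skip_validation` is assumed here from the arrow-rs
    // skip-validation work referenced above.
    let reader = unsafe { reader.with_skip_validation(true) };
    for batch in reader {
        sender
            .blocking_send(batch.map_err(Into::into))
            .map_err(|e| exec_datafusion_err!("{e}"))?;
    }
    Ok(())
}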

Describe alternatives you've considered

  1. Disable validation when reading spill files back in
  2. Justify that change with comments explaining that we trust that nothing modified the files after DataFusion wrote them
  3. Add / use a benchmark showing the performance benefit of doing this (see the sketch after this list)
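
For item 3, here is a minimal, self-contained sketch of such a measurement, using std::time::Instant rather than DataFusion's criterion benches, and again assuming a with_skip_validation-style API exists on StreamReader:

use std::fs::File;
use std::io::BufReader;
use std::sync::Arc;
use std::time::Instant;

use arrow::array::{ArrayRef, StringArray};
use arrow::ipc::reader::StreamReader;
use arrow::ipc::writer::StreamWriter;
use arrow::record_batch::RecordBatch;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Write a string-heavy batch to a temporary IPC stream file, mimicking a spill
    let strings: Vec<String> = (0..1_000_000).map(|i| format!("row-{i}")).collect();
    let array: ArrayRef = Arc::new(StringArray::from(strings));
    let batch = RecordBatch::try_from_iter([("col", array)])?;

    let path = std::env::temp_dir().join("spill_bench.arrow");
    let mut writer = StreamWriter::try_new(File::create(&path)?, &batch.schema())?;
    writer.write(&batch)?;
    writer.finish()?;

    // Read it back with validation (today's behavior)
    let start = Instant::now();
    let reader = StreamReader::try_new(BufReader::new(File::open(&path)?), None)?;
    for batch in reader {
        std::hint::black_box(batch?);
    }
    println!("validated read:   {:?}", start.elapsed());

    // Read it back with validation disabled (the proposed behavior)
    // NOTE: `with_skip_validation` is assumed from the arrow-rs work above
    let start = Instant::now();
    let reader = StreamReader::try_new(BufReader::new(File::open(&path)?), None)?;
    let reader = unsafe { reader.with_skip_validation(true) };
    for batch in reader {
        std::hint::black_box(batch?);
    }
    println!("unvalidated read: {:?}", start.elapsed());

    Ok(())
}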

Additional context

Labels: enhancement (New feature or request)
