Skip to content

Conversation

2010YOUY01
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

See the rationale part of the first attempt PR #14975
The included changes and implementation are different than the above PR, it will be explained below.

What changes are included in this PR?

  1. Add max_temp_directory_size field inside DiskManager to keep track of current total disk usage for temporary files, by default it's 100GB.
  2. Update RefCountedTempFile:
    • Included a reference to DiskManager that created it
    • Added a utility function update_disk_usage() to update the global disk usage. After modifying the managed tempfile, the caller also has to call this function to do the update, to make sure when disk limit is exceeded an error will be thrown. (Currently RefCountedTempFile is only used for spill files inside DataFusion, so I think this additional interface is okay to add)

Are these changes tested?

Yes, integration test is included for queries exceed/not-exceed the disk limit.

Are there any user-facing changes?

@2010YOUY01 2010YOUY01 changed the title feat: Add config max_temp_directory_size to limit max disk usage for spilling queries feat: Add config max_temp_directory_size to limit max disk usage for spilling queries Apr 1, 2025
@2010YOUY01 2010YOUY01 requested a review from Copilot April 1, 2025 06:56
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds a new configuration to limit temporary disk usage for spilling queries by introducing the max_temp_directory_size setting in DiskManager and updating related components.

  • Adds max_temp_directory_size and used_disk_space to DiskManager to track and enforce the disk usage limit.
  • Updates RefCountedTempFile and InProgressSpillFile to update the global disk usage after file modifications.
  • Introduces integration tests to verify behavior when the disk spill limit is reached versus not reached.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
datafusion/physical-plan/src/spill/spill_manager.rs Updated error documentation for spill functions.
datafusion/physical-plan/src/spill/in_progress_spill_file.rs Enhanced error docs and updated disk usage after appending batches.
datafusion/execution/src/disk_manager.rs Added disk usage tracking fields, methods and updated temporary file handling.
datafusion/core/tests/memory_limit/mod.rs Added tests validating disk usage limits for spilling queries.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@github-actions github-actions bot added core Core DataFusion crate execution Related to the execution crate labels Apr 1, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @2010YOUY01 -- this looks 👨‍🍳 👌 to me ❤️

if let Some(writer) = &mut self.writer {
let (spilled_rows, spilled_bytes) = writer.write(batch)?;
if let Some(in_progress_file) = &mut self.in_progress_file {
in_progress_file.update_disk_usage()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is quite nice that this is encapsulated as part of InProgressSpillFile

/// tempfiles are cleaned up.
#[tokio::test]
async fn test_disk_spill_limit_not_reached() -> Result<()> {
let disk_spill_limit = 100 * 1024 * 1024; // 100MB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need to generate 100MB to test temporary file space? Could we perhaps lower this to something less resource intensive like 1MB (and reduce the argument to generate_series)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

5506637 reduced the disk and memory usage of UT to < 1MB

@2010YOUY01
Copy link
Contributor Author

@alamb Thank you. I filed #15553 for configuring this setting in datafusion-cli.

@alamb alamb merged commit 387541c into apache:main Apr 3, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 3, 2025

Thanks again @2010YOUY01

nirnayroy pushed a commit to nirnayroy/datafusion that referenced this pull request May 2, 2025
…r spilling queries (apache#15520)

* Add disk limit field inside disk manager

* Implement disk usage tracking

* Update datafusion/execution/src/disk_manager.rs

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Let unit-test use less memory

* reduce UT's memory and disk usage to < 1MB

* typo

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate execution Related to the execution crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

limit max disk usage for spilling queries

2 participants