feat: Add config max_temp_directory_size to limit max disk usage for spilling queries #15520
Pull Request Overview
This PR adds a new configuration to limit temporary disk usage for spilling queries by introducing the max_temp_directory_size setting in DiskManager and updating related components.
- Adds max_temp_directory_size and used_disk_space to DiskManager to track and enforce the disk usage limit.
- Updates RefCountedTempFile and InProgressSpillFile to update the global disk usage after file modifications.
- Introduces integration tests to verify behavior when the disk spill limit is reached versus not reached.
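The accounting described in the overview can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the type name DiskBudget and the methods grow, shrink and used are hypothetical, and only the field names max_temp_directory_size and used_disk_space are taken from the overview.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the accounting part of a disk manager.
/// Only the two field names come from the PR overview.
#[derive(Debug)]
pub struct DiskBudget {
    /// Upper bound, in bytes, on space held by temporary files.
    max_temp_directory_size: u64,
    /// Bytes currently attributed to live temporary files.
    used_disk_space: AtomicU64,
}

impl DiskBudget {
    pub fn new(max_temp_directory_size: u64) -> Self {
        Self {
            max_temp_directory_size,
            used_disk_space: AtomicU64::new(0),
        }
    }

    /// Bytes currently counted against the limit.
    pub fn used(&self) -> u64 {
        self.used_disk_space.load(Ordering::Relaxed)
    }

    /// Record `bytes` of new temporary data. The counter is left
    /// unchanged and an error is returned if the limit would be exceeded.
    pub fn grow(&self, bytes: u64) -> Result<(), String> {
        self.used_disk_space
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |used| {
                used.checked_add(bytes)
                    .filter(|total| *total <= self.max_temp_directory_size)
            })
            .map(|_| ())
            .map_err(|used| {
                format!(
                    "adding {} bytes to {} already used would exceed the {}-byte limit",
                    bytes, used, self.max_temp_directory_size
                )
            })
    }

    /// Release `bytes`, for example when a temporary file is deleted.
    pub fn shrink(&self, bytes: u64) {
        let _ = self
            .used_disk_space
            .fetch_update(Ordering::Relaxed, Ordering::Relaxed, |used| {
                Some(used.saturating_sub(bytes))
            });
    }
}
```

In this sketch a query that spills would call grow as it writes and shrink when its files are cleaned up. An atomic counter is one plausible choice because several spilling operators may share the same budget concurrently.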
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| datafusion/physical-plan/src/spill/spill_manager.rs | Updated error documentation for spill functions. |
| datafusion/physical-plan/src/spill/in_progress_spill_file.rs | Enhanced error docs and updated disk usage after appending batches. |
| datafusion/execution/src/disk_manager.rs | Added disk usage tracking fields, methods and updated temporary file handling. |
| datafusion/core/tests/memory_limit/mod.rs | Added tests validating disk usage limits for spilling queries. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Thanks @2010YOUY01 -- this looks 👨🍳 👌 to me ❤️
if let Some(writer) = &mut self.writer {
    let (spilled_rows, spilled_bytes) = writer.write(batch)?;
    if let Some(in_progress_file) = &mut self.in_progress_file {
        in_progress_file.update_disk_usage()?;
It is quite nice that this is encapsulated as part of InProgressSpillFile
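To illustrate why encapsulating the update is convenient, here is a rough sketch of a spill file that refreshes the shared accounting on every append, so callers cannot forget to do it. Everything in it is an assumption made for illustration: SharedUsage and SpillFile are hypothetical types, an in-memory buffer stands in for the real temporary file, and only the method name update_disk_usage is borrowed from the diff above.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical shared counter and limit, owned by whoever manages temp files.
pub struct SharedUsage {
    limit: u64,
    used: AtomicU64,
}

impl SharedUsage {
    pub fn new(limit: u64) -> Self {
        Self { limit, used: AtomicU64::new(0) }
    }

    pub fn used(&self) -> u64 {
        self.used.load(Ordering::Relaxed)
    }
}

/// Hypothetical spill file; a Vec<u8> stands in for the file on disk.
pub struct SpillFile {
    usage: Arc<SharedUsage>,
    contents: Vec<u8>,
    /// Size most recently reported to the shared counter.
    reported_size: u64,
}

impl SpillFile {
    pub fn new(usage: Arc<SharedUsage>) -> Self {
        Self { usage, contents: Vec::new(), reported_size: 0 }
    }

    /// Append data and immediately refresh the shared accounting.
    pub fn append(&mut self, batch: &[u8]) -> Result<(), String> {
        self.contents.extend_from_slice(batch);
        self.update_disk_usage()
    }

    /// Report the growth since the last call; error once over the limit.
    /// The bytes are already written, so they stay counted until drop.
    fn update_disk_usage(&mut self) -> Result<(), String> {
        let current = self.contents.len() as u64;
        let delta = current - self.reported_size; // append-only, never negative
        let total = self.usage.used.fetch_add(delta, Ordering::Relaxed) + delta;
        self.reported_size = current;
        if total > self.usage.limit {
            return Err(format!(
                "temporary files use {} bytes, over the {}-byte limit",
                total, self.usage.limit
            ));
        }
        Ok(())
    }
}

impl Drop for SpillFile {
    /// Give the reported bytes back when the file goes away.
    fn drop(&mut self) {
        self.usage.used.fetch_sub(self.reported_size, Ordering::Relaxed);
    }
}
```

Reporting only the delta since the last update keeps the shared counter correct when many files grow at once, and releasing the space in Drop mirrors the idea that cleanup of a temp file should also free its share of the budget.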
/// tempfiles are cleaned up.
#[tokio::test]
async fn test_disk_spill_limit_not_reached() -> Result<()> {
    let disk_spill_limit = 100 * 1024 * 1024; // 100MB
Do we really need to generate 100MB to test temporary file space? Could we perhaps lower this to something less resource-intensive, like 1MB (and reduce the argument to generate_series)?
Commit 5506637 reduced the unit test's disk and memory usage to < 1MB.
Thanks again @2010YOUY01
…r spilling queries (apache#15520)
* Add disk limit field inside disk manager
* Implement disk usage tracking
* Update datafusion/execution/src/disk_manager.rs
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Let unit-test use less memory
* reduce UT's memory and disk usage to < 1MB
* typo
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Which issue does this PR close?
Rationale for this change
See the rationale section of the first-attempt PR, #14975.
The changes and implementation here differ from that PR, as explained below.
What changes are included in this PR?
- max_temp_directory_size field inside DiskManager, to limit the total disk usage for temporary files, which is tracked as files are written; by default the limit is 100GB.
- RefCountedTempFile:
  - The DiskManager that created it
  - update_disk_usage() to update the global disk usage. After modifying the managed tempfile, the caller must also call this function, so that an error is returned when the disk limit is exceeded. (Currently RefCountedTempFile is only used for spill files inside DataFusion, so I think this additional interface is okay to add.)
Are these changes tested?
Yes, integration tests are included for queries that exceed and do not exceed the disk limit.
Are there any user-facing changes?