[Proposal] Expose Spilling Progress Interface in DataFusion #19697

@xudong963

Background

Currently:

  1. SpillMetrics (per operator) are updated only at the end of a spill.
  2. DiskManager tracks used_disk_space (current total) but doesn't expose a structured "progress" view.

Proposed Changes

  1. Real-time Metric Updates in SpillMetrics: modify InProgressSpillFile so that the spilled_bytes and spill_file_count metrics are updated as soon as data is written to disk.
  • Initial update: In append_batch, when the IPCStreamWriter is first created, immediately call update_disk_usage() on the file and add the size (schema/header) to spilled_bytes.
  • Incremental update: After each writer.write(batch) call, call update_disk_usage() and add the delta size to spilled_bytes.
  • Final update: In finish(), call update_disk_usage() after finishing the writer and add the remaining delta size (footer/metadata) to spilled_bytes.
  2. Spilling Progress Interface in DiskManager: expose the current global state of the disk manager.
  • New SpillingProgress struct
    pub struct SpillingProgress {
        /// Total bytes currently used on disk for spilling
        pub current_bytes: u64,
        /// Total number of active spill files
        pub active_files_count: usize,
    }
  • Implement spilling_progress(&self) -> SpillingProgress
  3. Delegate Interface in RuntimeEnv: provide a convenient entry point for users.
    let progress = ctx.runtime_env().spilling_progress();
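
To make the delta accounting in step 1 concrete, here is a minimal, std-only sketch. It is not DataFusion code: `SpillCounters` and `InProgressSpill` are hypothetical stand-ins for SpillMetrics and InProgressSpillFile, and `update_disk_usage` here takes the observed file size as an argument purely for illustration.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for the spilled_bytes / spill_file_count metrics.
#[derive(Default)]
pub struct SpillCounters {
    pub spilled_bytes: AtomicUsize,
    pub spill_file_count: AtomicUsize,
}

/// Hypothetical stand-in for an in-progress spill file.
pub struct InProgressSpill {
    counters: Arc<SpillCounters>,
    /// On-disk size that has already been added to `spilled_bytes`.
    last_reported_size: usize,
    /// Whether this file has already been counted in `spill_file_count`.
    counted: bool,
}

impl InProgressSpill {
    pub fn new(counters: Arc<SpillCounters>) -> Self {
        Self { counters, last_reported_size: 0, counted: false }
    }

    /// Report the file's current on-disk size. Only the growth since the
    /// previous report is added to the metric, so the initial, incremental,
    /// and final updates can all go through the same path.
    pub fn update_disk_usage(&mut self, current_size: usize) {
        if !self.counted {
            // Count the file once, on the first write (schema/header).
            self.counters.spill_file_count.fetch_add(1, Ordering::Relaxed);
            self.counted = true;
        }
        let delta = current_size.saturating_sub(self.last_reported_size);
        self.counters.spilled_bytes.fetch_add(delta, Ordering::Relaxed);
        self.last_reported_size = self.last_reported_size.max(current_size);
    }
}
```

Because each call adds only the difference from the last reported size, the metric always equals the file size seen so far, and nothing is counted twice when finish() reports the footer.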
    

Users could then call this API to get real-time spilling progress. For our use case, we want to call it from the SQL UI to give users real-time feedback about their queries.
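
As a rough illustration of steps 2 and 3, the snapshot could be built from two atomic counters. This is a sketch under assumptions, not the actual DiskManager: `SketchDiskManager` and its `file_created`, `file_grew`, and `file_dropped` hooks are hypothetical names, and only the `SpillingProgress` struct and `spilling_progress` method mirror the proposal.

```rust
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};

/// Snapshot type as proposed above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SpillingProgress {
    /// Total bytes currently used on disk for spilling
    pub current_bytes: u64,
    /// Total number of active spill files
    pub active_files_count: usize,
}

/// Hypothetical, simplified disk manager that tracks only global totals.
#[derive(Default)]
pub struct SketchDiskManager {
    used_disk_space: AtomicU64,
    active_files_count: AtomicUsize,
}

impl SketchDiskManager {
    /// Called when a spill file is created (hypothetical hook).
    pub fn file_created(&self) {
        self.active_files_count.fetch_add(1, Ordering::Relaxed);
    }

    /// Called when a spill file grows by `delta` bytes (hypothetical hook).
    pub fn file_grew(&self, delta: u64) {
        self.used_disk_space.fetch_add(delta, Ordering::Relaxed);
    }

    /// Called when a spill file of `size` bytes is deleted (hypothetical hook).
    pub fn file_dropped(&self, size: u64) {
        self.used_disk_space.fetch_sub(size, Ordering::Relaxed);
        self.active_files_count.fetch_sub(1, Ordering::Relaxed);
    }

    /// Read both counters. The two loads are independent, so the snapshot is
    /// approximate under concurrent writes, which is acceptable for a UI.
    pub fn spilling_progress(&self) -> SpillingProgress {
        SpillingProgress {
            current_bytes: self.used_disk_space.load(Ordering::Relaxed),
            active_files_count: self.active_files_count.load(Ordering::Relaxed),
        }
    }
}
```

A UI could poll `spilling_progress()` on a timer and render the two fields. Because the call only reads atomics, polling would not block operators that are writing spill files.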
