Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Account for memory usage in SortPreservingMerge (#5885) #7130

Merged
merged 5 commits into from
Aug 9, 2023

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Jul 28, 2023

Which issue does this PR close?

Closes #5885
Closes #6382 (an earlier version of this code)

Rationale for this change

Merging takes memory. It is most pronounced for dictionaries where the RowConverter actually interns the dictionary values and thus can take an appreciable amount of memory for high cardinality dictionaries. We have seen this be 10s of GB in certain IOx cases.

Thus it is important that the streaming_merge and things that use it, like Sort and SortPreservingMerge properly account for the memory used while merging.

What changes are included in this PR?

This is based on the changes from @tustvold in #6382:

  1. Thread a MemoryReservation through to streaming_merge
  2. Adds two new config parameters sort_spill_reservation_bytes and sort_in_place_threshold_bytes that control the level of spilling
  3. Adds new tests

There is some subtlety related to reserving memory for this merge up front when doing a spilling Sort, which I describe inline in comments

Are these changes tested?

Yes

Are there any user-facing changes?

If memory limits are configured, some plans will now error rather than exceed that limit.

@github-actions github-actions bot added the core Core DataFusion crate label Jul 28, 2023
@alamb alamb force-pushed the alamb/sort-merge-accounting branch 2 times, most recently from dc9c2b6 to f5019c9 Compare August 2, 2023 17:57
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Aug 2, 2023
}

#[tokio::test]
async fn sort_spill_reservation() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test demonstrates why the sort_spill_reservation_bytes is needed -- if it is insufficiently large, spilling may fail (because it runs out of memory when trying to write to the spill file). If someone hits this they can increase the value of the memory reserved for merge

use datafusion::physical_plan::SendableRecordBatchStream;
use datafusion_common::assert_contains;
use datafusion::physical_plan::{ExecutionPlan, SendableRecordBatchStream};
use datafusion_common::{assert_contains, Result};

use datafusion::prelude::{SessionConfig, SessionContext};
use datafusion_execution::TaskContext;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests in this file can be cleaned up significantly, but I will do so as a follow on PR to keep the size of this one down

}

impl TestCase {
// TODO remove expected errors and memory limits and query from constructor
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will do this as a follow on PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

datafusion/common/src/config.rs Outdated Show resolved Hide resolved
@alamb alamb force-pushed the alamb/sort-merge-accounting branch from f17afb0 to 9dbee0b Compare August 2, 2023 18:35
/// When sorting, below what size should data be concatenated
/// and sorted in a single RecordBatch rather than sorted in
/// batches and merged.
pub sort_in_place_threshold_bytes: usize, default = 1024 * 1024
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a behavior change. This constant was hard coded in sort.rs -- I have just pulled it out into its own config setting so I can write tests

@alamb alamb mentioned this pull request Aug 3, 2023
@alamb alamb force-pushed the alamb/sort-merge-accounting branch from 9dbee0b to ac0aea2 Compare August 3, 2023 17:57
@@ -84,13 +85,16 @@ pub struct RowCursorStream {
column_expressions: Vec<Arc<dyn PhysicalExpr>>,
/// Input streams
streams: FusedStreams,
/// Tracks the memory used by `converter`
reservation: MemoryReservation,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We observed this to be a key consumer of memory for large dictionary encoded data

// enough memory to sort if we don't try to merge it all at once
(partition_size * 5) / 2,
)
// use a single partiton so only a sort is needed
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test demonstrates the need for reserving memory up front for the spill -- and shows that if someone hits the error they can increased the memory set aside for the merge and it will work

@alamb
Copy link
Contributor Author

alamb commented Aug 4, 2023

I ran the sort benchmarks and they are basically the same. I think the 9% slower measure is due to a high variance in the benchmark (which I should look into if/when I have time). I saw similar variations when I compared main to itself

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ alamb_sort-merge-accounting ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ Qsort utf8   │ 60974.35ms │                  66454.87ms │ 1.09x slower │
│ Qsort int    │ 78390.32ms │                  82235.04ms │    no change │
│ Qsort        │ 66029.57ms │                  65439.66ms │    no change │
│ decimal      │            │                             │              │
│ Qsort        │ 84864.85ms │                  87846.28ms │    no change │
│ integer      │            │                             │              │
│ tuple        │            │                             │              │
│ Qsort utf8   │ 62470.81ms │                  63878.11ms │    no change │
│ tuple        │            │                             │              │
│ Qsort mixed  │ 72625.32ms │                  74784.97ms │    no change │
│ tuple        │            │                             │              │
└──────────────┴────────────┴─────────────────────────────┴──────────────┘

@alamb alamb marked this pull request as ready for review August 4, 2023 13:21
@alamb
Copy link
Contributor Author

alamb commented Aug 4, 2023

I have also tested this on our code upstream and it definitely helps account for some of the difference between tracked and actual memory.

@yjshen
Copy link
Member

yjshen commented Aug 8, 2023

I'll review this carefully today.

Copy link
Member

@yjshen yjshen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great to me! Thanks @alamb!

use datafusion::physical_optimizer::PhysicalOptimizerRule;
use datafusion::physical_plan::common::batch_byte_size;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would probably remove this method and use RecordBatch::get_array_memory_size in the repo.

Copy link
Contributor Author

@alamb alamb Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good idea -- I will do so in a follow on PR

Update: #7245

/// A handle to the runtime to get Disk spill files
/// Reservation for the merging of in-memory batches. If the sort
/// might spill, `sort_spill_reservation_bytes` will be
/// pre-reserved to ensure there is some space for this sort/merg.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// pre-reserved to ensure there is some space for this sort/merg.
/// pre-reserved to ensure there is some space for this sort/merge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in f87705e

// Release the memory reserved for merge back to the pool so
// there is some left when `in_memo_sort_stream` requests an
// allocation.
self.merge_reservation.free();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 239 to 248
/// How much memory is set aside, for each spillable sort, to
/// ensure an in-memory merge can occur. This setting has no
/// if the sort can not spill (there is no `DiskManager`
/// configured)
///
/// As part of spilling to disk, in memory data must be sorted
/// / merged before writing the file. This in-memory
/// sort/merge requires memory as well, so To avoid allocating
/// once memory is exhausted, DataFusion sets aside this
/// many bytes before.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe:

/// Specifies the reserved memory for each spillable sort operation to
/// facilitate an in-memory merge.
///
/// When a sort operation spills to disk, the in-memory data must be
/// sorted and merged before being written to a file. This setting reserves
/// a specific amount of memory for that in-memory sort/merge process.
///
/// Note: This setting is irrelevant if the sort operation cannot spill
/// (i.e., if there's no `DiskManager` configured).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a much better wording. Thank you @yjshen -- in f87705e

}

let streams = std::mem::take(&mut self.in_mem_batches)
.into_iter()
.map(|batch| {
let metrics = self.metrics.baseline.intermediate();
Ok(spawn_buffered(self.sort_batch_stream(batch, metrics)?, 1))
let reservation = self.reservation.split(batch.get_array_memory_size());
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

@crepererum crepererum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me (modulo what @yjshen already said).

Copy link
Contributor Author

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review @yjshen and @crepererum

use datafusion::physical_optimizer::PhysicalOptimizerRule;
use datafusion::physical_plan::common::batch_byte_size;
Copy link
Contributor Author

@alamb alamb Aug 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good idea -- I will do so in a follow on PR

Update: #7245

Comment on lines 239 to 248
/// How much memory is set aside, for each spillable sort, to
/// ensure an in-memory merge can occur. This setting has no
/// if the sort can not spill (there is no `DiskManager`
/// configured)
///
/// As part of spilling to disk, in memory data must be sorted
/// / merged before writing the file. This in-memory
/// sort/merge requires memory as well, so To avoid allocating
/// once memory is exhausted, DataFusion sets aside this
/// many bytes before.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a much better wording. Thank you @yjshen -- in f87705e

/// A handle to the runtime to get Disk spill files
/// Reservation for the merging of in-memory batches. If the sort
/// might spill, `sort_spill_reservation_bytes` will be
/// pre-reserved to ensure there is some space for this sort/merg.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in f87705e

@alamb
Copy link
Contributor Author

alamb commented Aug 9, 2023

@gruuya I am quite confident that this PR will conflict with #7180 -- is it ok if I merge this one before #7180?

@alamb alamb mentioned this pull request Aug 9, 2023
@gruuya
Copy link
Contributor

gruuya commented Aug 9, 2023

@gruuya I am quite confident that this PR will conflict with #7180 -- is it ok if I merge this one before #7180?

Oh definitely, please go ahead. I treat #7180 more as an experiment at this point, cheers.

@yjshen
Copy link
Member

yjshen commented Aug 9, 2023

Great, let's merge this! Thanks @alamb @crepererum @gruuya!

@yjshen yjshen merged commit 161c6d3 into apache:main Aug 9, 2023
@alamb
Copy link
Contributor Author

alamb commented Aug 9, 2023

Thanks @yjshen and @gruuya

@alamb alamb deleted the alamb/sort-merge-accounting branch August 9, 2023 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SortPreservingMerge does not account for memory usage
4 participants