Commit 8985aee

Add batch_size comment

1 parent c07b1b5 commit 8985aee

File tree: 1 file changed, +18 -1

Diff for: parquet/src/arrow/record_reader.rs (+18 -1)
@@ -169,7 +169,24 @@ where
         // If repetition levels present, we don't know how much more to read
         // in order to read the requested number of records, therefore read at least
-        // MIN_BATCH_SIZE, otherwise read exactly what was requested
+        // MIN_BATCH_SIZE, otherwise read **exactly** what was requested. This helps
+        // to avoid a degenerate case where the buffers are never fully drained.
+        //
+        // Consider the scenario where the user is requesting batches of MIN_BATCH_SIZE.
+        //
+        // When transitioning across a row group boundary, this will read some remainder
+        // `r` from the row group, before reading MIN_BATCH_SIZE from the next row group,
+        // leaving `MIN_BATCH_SIZE + r` in the buffer.
+        //
+        // The client will then only split off the `MIN_BATCH_SIZE` they actually wanted,
+        // leaving behind `r`. This will continue indefinitely.
+        //
+        // Aside from wasting cycles splitting and shuffling buffers unnecessarily, this
+        // prevents dictionary preservation from functioning correctly, as the buffer
+        // is never emptied, which is what would allow a new dictionary to be registered.
+        //
+        // This degenerate case can still occur for repeated fields, but
+        // it is avoided for the more common case of a non-repeated field
         let batch_size = match &self.rep_levels {
             Some(_) => max(num_records - records_read, MIN_BATCH_SIZE),
             None => num_records - records_read,
