Commit 4c70984
[Data] Fix file size ordering in download partitioning with multiple URI columns (#58517)
The `_sample_sizes` method was using `as_completed()` to collect file
sizes, which returns results in completion order rather than submission
order. This scrambled the file sizes list so it no longer corresponded
to the input URI order.
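
To illustrate the ordering behavior described above, here is a minimal sketch using only the standard library. The `slow_size` and `fast_size` functions and the `release_slow` event are hypothetical stand-ins for file size lookups; they are not part of the patched code. The event forces the first-submitted task to finish last, so the difference between completion order and submission order is reproducible.

```python
import threading
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical helpers: the first lookup is held back until the second is done.
release_slow = threading.Event()

def slow_size():
    release_slow.wait(timeout=5)  # blocks until the main thread releases it
    return 100

def fast_size():
    return 200

with ThreadPoolExecutor(max_workers=2) as executor:
    futures = [executor.submit(slow_size), executor.submit(fast_size)]

    completion_order = []
    for future in as_completed(futures):
        # Results arrive as tasks finish, not in the order they were submitted.
        completion_order.append(future.result())
        release_slow.set()  # let the slow task finish after the first result

    # Reading the futures list directly preserves submission order.
    submission_order = [f.result() for f in futures]

print(completion_order, submission_order)
```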
When multiple URI columns are used, `_estimate_nrows_per_partition`
calls `zip(*sampled_file_sizes_by_column.values())` on line 284, which
assumes file sizes from different columns align by row index. The
scrambled ordering caused file sizes from different rows to be
combined, producing incorrect partition size estimates.
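
The misalignment can be seen in a small sketch. The column names and byte counts below are made up for illustration, and summing each row is an assumption about how the zipped sizes might be consumed; only the `zip(*...values())` call comes from the description above.

```python
# Hypothetical sampled sizes (bytes) for three rows with two URI columns.
sampled_file_sizes_by_column = {
    "image_uri": [100, 200, 300],
    "label_uri": [10, 20, 30],
}

# When both lists follow input order, zip pairs sizes from the same row.
aligned_rows = list(zip(*sampled_file_sizes_by_column.values()))
aligned_totals = [sum(row) for row in aligned_rows]

# Same values, but one column's list came back in completion order.
scrambled = dict(sampled_file_sizes_by_column, label_uri=[30, 10, 20])
misaligned_rows = list(zip(*scrambled.values()))
misaligned_totals = [sum(row) for row in misaligned_rows]

print(aligned_totals)     # [110, 220, 330]
print(misaligned_totals)  # [130, 210, 320]
```

Each per-row total in the second case mixes sizes from two different rows, so any estimate derived from per-row sizes no longer describes real rows.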
## Changes
- Pre-allocate the `file_sizes` list with the correct size
- Use a `future_to_file_index` mapping to track the original submission
order
- Place results at their correct positions regardless of completion
order
- Add assertion to verify list length matches expected size
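
The changes listed above can be sketched roughly as follows. This is not the actual `_sample_sizes` implementation: the function name `sample_sizes_in_order`, its parameters, and the `fake_get_size` helper are assumptions made for illustration. Only the pre-allocated `file_sizes` list, the `future_to_file_index` mapping, and the length assertion come from the commit description.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def sample_sizes_in_order(uris, get_size, max_workers=4):
    """Return one size per URI, in the same order as `uris` (hypothetical helper)."""
    # Pre-allocate so each result can be written to its original position.
    file_sizes = [None] * len(uris)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Remember which input index each future belongs to.
        future_to_file_index = {
            executor.submit(get_size, uri): index for index, uri in enumerate(uris)
        }
        # Completion order is arbitrary; the mapping restores input order.
        for future in as_completed(future_to_file_index):
            file_sizes[future_to_file_index[future]] = future.result()

    assert len(file_sizes) == len(uris)
    return file_sizes

def fake_get_size(uri):
    # Hypothetical lookup: shorter URIs take longer, so later submissions
    # tend to finish first.
    time.sleep(0.05 * (4 - len(uri)))
    return len(uri) * 100

sizes = sample_sizes_in_order(["a", "bb", "ccc"], fake_get_size)
print(sizes)  # [100, 200, 300]
```

Because every result is written by index, the returned list matches the input order even when the tasks finish in a different order.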
## Related issues
#58464 (comment)
---------
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
1 parent 7498739 commit 4c70984
1 file changed, +12 -7 lines changed