[FEAT] Estimate materialized size of ScanTask better from Parquet reads #3302

Status: Open, wants to merge 5 commits into base: main

@jaychia (Contributor) commented Nov 15, 2024

No description provided.

@github-actions github-actions bot added the enhancement New feature or request label Nov 15, 2024

codspeed-hq bot commented Nov 15, 2024

CodSpeed Performance Report

Merging #3302 will degrade performance by 18.26%

Comparing jay/better-scan-task-estimations-2 (6c7bd68) with main (4470192)

Summary

❌ 1 regressions
✅ 16 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

| Benchmark | main | jay/better-scan-task-estimations-2 | Change |
| --- | --- | --- | --- |
| test_iter_rows_first_row[100 Small Files] | 334.6 ms | 409.4 ms | -18.26% |

@@ -269,6 +270,9 @@ pub(crate) fn split_by_row_groups(

*chunk_spec = Some(ChunkSpec::Parquet(curr_row_group_indices));
*size_bytes = Some(curr_size_bytes as u64);

// Re-estimate the size bytes in memory
new_estimated_size_bytes_in_memory = t.estimated_materialized_size_bytes.map(|est| {
    (est as f64 * (curr_num_rows as f64 / file.num_rows as f64)) as usize
});
Review comment from the PR author (Contributor):

@kevinzwang could you take a look at this logic for splitting ScanTasks and for predicting the resulting estimated materialized size in bytes of each split?

Looks like we're doing some crazy stuff wrt modifying the FileMetadata and I couldn't really figure out if it is safe to do this.
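
For reference, the re-estimation in the hunk above scales the whole-file estimate by the fraction of rows that land in the split. A minimal standalone sketch of that arithmetic follows. The function name `scale_estimate_by_rows` and its parameters are hypothetical and not part of this PR, and the zero-row guard is an added assumption rather than something the diff does.

```rust
/// Hypothetical helper (not part of the PR): scale a whole-file in-memory
/// size estimate down to the rows that a split ScanTask will read.
fn scale_estimate_by_rows(
    file_estimate_bytes: Option<usize>,
    split_num_rows: usize,
    file_num_rows: usize,
) -> Option<usize> {
    // Assumption: if the file reports zero rows there is nothing to scale,
    // so return no estimate instead of dividing by zero.
    if file_num_rows == 0 {
        return None;
    }
    file_estimate_bytes.map(|est| {
        // Fraction of the file's rows covered by this split.
        let fraction = split_num_rows as f64 / file_num_rows as f64;
        // Float-to-integer `as` casts truncate toward zero.
        (est as f64 * fraction) as usize
    })
}
```

Under these assumptions, a split holding 25 of 100 rows from a file estimated at 1000 bytes would be estimated at 250 bytes. This treats every row as contributing equally to the materialized size, which may not hold when row groups differ in encoding or value widths.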
