Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rust][datafusion] optimize count(*) queries on parquet sources #75

Closed
alamb opened this issue Apr 26, 2021 · 1 comment
Closed

[rust][datafusion] optimize count(*) queries on parquet sources #75

alamb opened this issue Apr 26, 2021 · 1 comment
Labels
arrow Changes to the arrow crate

Comments

@alamb
Copy link
Contributor

alamb commented Apr 26, 2021

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-8902

Currently, as far as I can tell, when you perform a select count(*) from dataset in datafusion against a parquet dataset, the way this is implemented is by doing a scan on column 0, and counting up all of the rows (specifically I think it counts the # of rows in each batch).

 

However, for the specific case of just counting everythign in a parquet file, you can just read the rowcount from the footer metadata, so it's O(1) instead of O(n)

@alamb alamb added the arrow Changes to the arrow crate label Apr 26, 2021
@alamb
Copy link
Contributor Author

alamb commented Apr 26, 2021

wrong repo

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
arrow Changes to the arrow crate
Projects
None yet
Development

No branches or pull requests

2 participants