[rust][datafusion] optimize count(*) queries on parquet sources #75

alamb · 2021-04-26T12:31:06Z

Note: migrated from original JIRA: https://issues.apache.org/jira/browse/ARROW-8902

Currently, as far as I can tell, when you run select count(*) from dataset in DataFusion against a Parquet dataset, it is implemented by scanning column 0 and counting all of the rows (specifically, I think it sums the number of rows in each batch).

However, for the specific case of counting every row in a Parquet file, you can read the row count directly from the footer metadata, so it's O(1) instead of O(n).
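To make the difference concrete, here is a minimal sketch in plain Rust. It does not use the real parquet or DataFusion APIs: `FooterMeta`, `RowGroupMeta`, `count_by_scan`, `count_from_footer`, and `row_group_total` are hypothetical names invented for illustration. The only thing it assumes from the format is that the footer records a total row count for the file as well as a row count per row group.

```rust
/// Hypothetical stand-in for one row-group entry in a Parquet footer.
pub struct RowGroupMeta {
    pub num_rows: i64,
}

/// Hypothetical stand-in for the decoded footer of a single Parquet file.
pub struct FooterMeta {
    /// Total row count recorded for the whole file.
    pub num_rows: i64,
    pub row_groups: Vec<RowGroupMeta>,
}

/// Roughly what the issue describes today: materialize column 0 batch by
/// batch and add up the batch lengths. Work grows with the amount of data.
pub fn count_by_scan(column0_batches: &[Vec<i32>]) -> i64 {
    column0_batches.iter().map(|batch| batch.len() as i64).sum()
}

/// The proposed shortcut: answer from footer metadata, touching no column data.
pub fn count_from_footer(footer: &FooterMeta) -> i64 {
    footer.num_rows
}

/// Cross-check: the per-row-group counts should add up to the file total.
pub fn row_group_total(footer: &FooterMeta) -> i64 {
    footer.row_groups.iter().map(|rg| rg.num_rows).sum()
}
```

For a file with row groups of 3 and 2 rows, all three functions should return 5. The shortcut only applies when the query has no filter: once a WHERE clause is involved, the footer total no longer answers the question and a scan (or statistics-based pruning) is needed.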

alamb · 2021-04-26T12:41:07Z

wrong repo

alamb added the arrow (Changes to the arrow crate) label Apr 26, 2021

alamb closed this as completed Apr 26, 2021

jorgecarleitao added the invalid label May 14, 2021
