Error on dataframe count using arrow dataset #800

timsaucer · 2024-08-09T13:18:20Z

Describe the bug
Calling count() on a DataFrame whose source is a pyarrow.dataset raises an error.

To Reproduce
You can point the snippet below at any Parquet file.

from datafusion import SessionContext
import pyarrow.dataset as ds

ctx = SessionContext()
file_path = "/some-path/datafusion-python/examples/tpch/data/lineitem.parquet"
pyarrow_dataset = ds.dataset([file_path])

ctx.register_dataset("pyarrow_dataset", pyarrow_dataset)
df = ctx.table("pyarrow_dataset").select("l_orderkey", "l_partkey", "l_linenumber")

df.limit(3).show()
df.count()

This generates the following output. The show() call demonstrates that the file is read correctly.

DataFrame()
+------------+-----------+--------------+
| l_orderkey | l_partkey | l_linenumber |
+------------+-----------+--------------+
| 1          | 155190    | 1            |
| 1          | 67310     | 2            |
| 1          | 63700     | 3            |
+------------+-----------+--------------+
Traceback (most recent call last):
  File "/Users/tsaucer/src/personal/arrow_rs_dataset_read/count_dataset_read.py", line 16, in <module>
    df.count()
  File "/Users/tsaucer/src/personal/datafusion-python/python/datafusion/dataframe.py", line 507, in count
    return self.df.count()
           ^^^^^^^^^^^^^^^
Exception: External error: Arrow error: External error: ArrowException: Invalid argument error: must either specify a row count or at least one column

Expected behavior
count() should return the number of rows in this dataset.

A workaround is to aggregate and count:

from datafusion import col, functions as f
df.aggregate([], [f.count(col("l_orderkey"))]).show()

Additional context
In my investigation, I found that we register arrow datasets by creating a TableProvider in src/dataset.rs and then the execution calls happen in src/dataset_exec.rs.

timsaucer · 2024-08-09T13:22:15Z

I cannot reproduce this in the datafusion repo because, as far as I can find, arrow-rs doesn't have a concept of a dataset comparable to pyarrow's.

Michael-J-Ward · 2024-08-27T20:21:06Z

I tracked down the error.

It occurs when attempting to convert the pyarrow result into an arrow-rs RecordBatch:

datafusion-python/src/dataset_exec.rs, lines 61 to 65 in ae7470e:

// NOTE: This is where the failure actually occurs.
// It occurs because `from_pyarrow_bound` uses the default `RecordBatchOptions` which does *not* allow a batch with no columns.
// See https://github.com/apache/arrow-rs/pull/1552 for more details.
let extracted = next_batch.extract::<PyArrowType<_>>().expect("failed to extract batch");
Some(Ok(extracted.0))

arrow-rs does support creating "count"-like RecordBatches with no columns, but doing so requires setting a row count in RecordBatchOptions (see apache/arrow-rs#1552).

However, the from_pyarrow_bound method only uses the default options:
https://github.com/apache/arrow-rs/blob/b711f23a136e0b094a70a4aafb020d4bb9f60619/arrow/src/pyarrow.rs#L334-L392
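
To make the failure mode concrete, here is a toy model in plain Python of the validation rule named in the error message. It is only a sketch: `BatchOptions` and `make_batch` are hypothetical names invented for illustration, not the arrow-rs or pyarrow API. The assumption is that a count needs no column data, so the batch handed back from pyarrow has zero columns, and with default options there is nothing left to derive the row count from.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class BatchOptions:
    """Hypothetical stand-in for arrow-rs `RecordBatchOptions`."""
    row_count: Optional[int] = None  # explicit row count, if the caller knows it


def make_batch(columns: Sequence[Sequence], options: Optional[BatchOptions] = None) -> int:
    """Validate a batch as the error message describes and return its row count."""
    options = options or BatchOptions()
    if options.row_count is not None:
        rows = options.row_count          # caller supplied the count explicitly
    elif columns:
        rows = len(columns[0])            # infer the count from the first column
    else:
        # Zero columns and no explicit count: the case hit by df.count()
        raise ValueError("must either specify a row count or at least one column")
    if any(len(c) != rows for c in columns):
        raise ValueError("all columns must have the same length")
    return rows


print(make_batch([[1, 1, 1]]))                    # column present: 3
print(make_batch([], BatchOptions(row_count=3)))  # no columns, explicit count: 3
try:
    make_batch([])                                # default options, no columns
except ValueError as err:
    print(err)
```

In this model the fix corresponds to passing the row count through explicitly whenever the batch has no columns, rather than relying on the defaults.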

timsaucer added the bug label Aug 9, 2024

Michael-J-Ward mentioned this issue Aug 20, 2024: Tracking DF 41 release #828 (closed, 9 tasks)

Michael-J-Ward linked a pull request Aug 27, 2024 that will close this issue: fix: Calling count on a pyarrow dataset results in an error #843 (open)

This was referenced Aug 28, 2024:
Allow converting empty pyarrow.RecordBatch to arrow::RecordBatch apache/arrow-rs#6318 (closed)
Support zero column RecordBatches in pyarrow integration (use RecordBatchOptions when converting a pyarrow RecordBatch) apache/arrow-rs#6320 (merged)
