Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error on dataframe count using arrow dataset #800

Open
timsaucer opened this issue Aug 9, 2024 · 2 comments · May be fixed by #843
Open

Error on dataframe count using arrow dataset #800

timsaucer opened this issue Aug 9, 2024 · 2 comments · May be fixed by #843
Labels
bug Something isn't working

Comments

@timsaucer
Copy link
Contributor

Describe the bug
When using a pyarrow.dataset as your source and performing a dataframe count operation you get an error.

To Reproduce
You can point the below snippet to any parquet file.

from datafusion import SessionContext
import pyarrow.dataset as ds

ctx = SessionContext()
file_path = "/some-path/datafusion-python/examples/tpch/data/lineitem.parquet"
pyarrow_dataset = ds.dataset([file_path])

ctx.register_dataset("pyarrow_dataset", pyarrow_dataset)
df = ctx.table("pyarrow_dataset").select("l_orderkey", "l_partkey", "l_linenumber")

df.limit(3).show()
df.count()

This generates the following output. The show is to demonstrate the file is read appropriately.

DataFrame()
+------------+-----------+--------------+
| l_orderkey | l_partkey | l_linenumber |
+------------+-----------+--------------+
| 1          | 155190    | 1            |
| 1          | 67310     | 2            |
| 1          | 63700     | 3            |
+------------+-----------+--------------+
Traceback (most recent call last):
  File "/Users/tsaucer/src/personal/arrow_rs_dataset_read/count_dataset_read.py", line 16, in <module>
    df.count()
  File "/Users/tsaucer/src/personal/datafusion-python/python/datafusion/dataframe.py", line 507, in count
    return self.df.count()
           ^^^^^^^^^^^^^^^
Exception: External error: Arrow error: External error: ArrowException: Invalid argument error: must either specify a row count or at least one column

Expected behavior
count() should return the number of rows in this dataset.

Work around is to aggregate and count

from datafusion import col, functions as f
df.aggregate([], [f.count(col("l_orderkey"))]).show()

Additional context
In my investigation, I found that we register arrow datasets by creating a TableProvider in src/dataset.rs and then the execution calls happen in src/dataset_exec.rs.

@timsaucer timsaucer added the bug Something isn't working label Aug 9, 2024
@timsaucer
Copy link
Contributor Author

I cannot reproduce this in the datafusion repo because arrow-rs doesn't appear to have the concept of dataset in the same way as pyarrow that I can find.

@Michael-J-Ward
Copy link
Contributor

I tracked down the error.

It occurs when attempting to convert the pyarrow result into an arrow-rs RecordBatch

// NOTE: This is where the failure actually occurs.
// It occurs because `from_pyarrow_bound` uses the default `RecordBatchOptions` which does *not* allow a batch with no columns.
// See https://github.com/apache/arrow-rs/pull/1552 for more details.
let extracted = next_batch.extract::<PyArrowType<_>>().expect("failed to extract batch");
Some(Ok(extracted.0))

arrow-rs does have an option for creating "count" like RecordBatches but it requires an additional config in RecordBatchOptions: apache/arrow-rs#1552

However, the from_pyarrow_bound method only uses the default options.
https://github.com/apache/arrow-rs/blob/b711f23a136e0b094a70a4aafb020d4bb9f60619/arrow/src/pyarrow.rs#L334-L392

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants