Make it easier to create a Pandas dataframe from DataFusion query results #139
Comments
What do you think about the following snippet?
df = Table.from_batches(batches).to_pandas()
Thanks for sharing @krzysztof-kwitt! I think that works - I tried to implement a
Example:
import pyarrow as pa
from datafusion import SessionContext

ctx = SessionContext()
batch_1 = pa.RecordBatch.from_arrays(
    [pa.array([0.1, -0.7, 0.55])], names=["value"]
)
batch_2 = pa.RecordBatch.from_arrays(
    [pa.array([0.5, -0.6, 0.8])], names=["value"]
)
# Each inner list is one partition of record batches.
df = ctx.create_dataframe([[batch_1, batch_2]])
I saw in the documentation that the pyarrow Table has a few methods that might be helpful when constructing dataframes or getting the results out. Do you think it would be useful to implement those methods as well?
I don't think we need to use other
This is how it has been implemented in DuckDB:
PolarsDataFrame DuckDBPyRelation::ToPolars(idx_t batch_size) {
    auto arrow = ToArrowTable(batch_size);
    return py::cast<PolarsDataFrame>(pybind11::module_::import("polars").attr("DataFrame")(arrow));
}
https://github.com/duckdb/duckdb/pull/6181/files
@krzysztof-kwitt thanks for the suggestion. I've created a PR with additional export functions that, among other things, simplify export to Polars DataFrames; see #236. You can convert DataFusion dataframes to Polars like this:
polars_df = df.to_polars()
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
DataFrame.collect returns a list of PyArrow record batches. Each batch can be turned into a Pandas dataframe, but I do not know how to efficiently create a single Pandas dataframe that contains the data from all of the batches.
Describe the solution you'd like
Either an example for this, or new features to help with this. Perhaps a DataFrame.collect_single_batch could work.
Describe alternatives you've considered
None
Additional context
None