BUG: iterrows() on an awkward pandas column with equal-length rows results in a ValueError #55

Open
Girmii opened this issue Jun 8, 2024 · 1 comment

Girmii commented Jun 8, 2024

Reproducible Example

import awkward as ak
import awkward_pandas as akpd
import pandas as pd

# numbers = [[1, 2, 3], [4, 5], [6]]  # no error
numbers = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # error
letters = ["A", "B", "C"]

numbers_ak = ak.from_iter(numbers)
numbers_akpd = akpd.from_awkward(numbers_ak)

df = pd.DataFrame({"letters": letters, "numbers": numbers_akpd})

for idx, row in df.iterrows():
    print(f"{idx} - {row['letters']}, {row['numbers']}")
File .venv/lib/python3.11/site-packages/pandas/core/internals/blocks.py:2253, in EABackedBlock.get_values(self, dtype)
   2251     values = values.astype(object)
   2252 # TODO(EA2D): reshape not needed with 2D EAs
-> 2253 return np.asarray(values).reshape(self.shape)

ValueError: cannot reshape array of size 9 into shape

Issue Description

I reported this issue to the pandas repository, but was asked to check here first whether the problem lies in awkward_pandas (see pandas issue 58927).


Calling iterrows() on a DataFrame that contains an awkward array as a column raises a ValueError (see the stack trace above). The error only occurs when all rows of the awkward array have the same length. In that case, the calls to values.astype(object) and/or np.asarray(values) in the get_values function of the pandas/core/internals/blocks.py module produce a 2D array instead of a 1D array of nested lists.
When the awkward array is actually jagged, the same calls produce the array in the correct 1D form (see the commented-out line in the code example) and iterrows() works as intended.
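
The shape difference described above can be reproduced with NumPy alone. The following is a minimal sketch that uses plain Python lists as a stand-in for the awkward column; it does not call awkward_pandas or pandas internals, and the target shape `(1, n_rows)` as well as the helper `to_object_1d` are assumptions made for illustration only.

```python
import numpy as np

equal_rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]   # all rows have length 3
ragged_rows = [[1, 2, 3], [4, 5], [6]]           # rows differ in length

# Equal-length rows are unpacked into a 2D array, even with dtype=object.
equal_arr = np.asarray(equal_rows, dtype=object)
# Ragged rows cannot be unpacked, so each row stays a list in a 1D array.
ragged_arr = np.asarray(ragged_rows, dtype=object)

print(equal_arr.shape, ragged_arr.shape)

# Assumed target shape for one column holding n_rows values (illustrative).
target_shape = (1, len(equal_rows))

try:
    equal_arr.reshape(target_shape)   # 9 elements do not fit into 1 x 3
    reshape_failed = False
except ValueError:
    reshape_failed = True


def to_object_1d(rows):
    """Hypothetical helper: build a 1D object array with one list per slot."""
    out = np.empty(len(rows), dtype=object)
    for i, row in enumerate(rows):
        out[i] = row   # scalar assignment stores the list itself
    return out


safe_arr = to_object_1d(equal_rows)
print(safe_arr.reshape(target_shape).shape)
```

Filling a pre-allocated object array element by element keeps each row as a single object, so the result stays 1D whether or not the rows happen to share a length.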

Expected Behavior

I would expect iterrows() to iterate over the DataFrame rows without raising an error, yielding for each row a Series in which the awkward column holds that row's value.
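
As a rough illustration of that expectation, the sketch below runs the same loop on a DataFrame whose "numbers" column holds plain Python lists (object dtype) rather than an awkward array. The plain-list column is a stand-in assumed for this example, not the awkward_pandas dtype from the report.

```python
import pandas as pd

letters = ["A", "B", "C"]
numbers = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # equal-length rows

# Plain lists are stored as an object column, one list per row.
df = pd.DataFrame({"letters": letters, "numbers": numbers})

collected = []
for idx, row in df.iterrows():
    # Each row is a Series indexed by column name.
    collected.append((idx, row["letters"], row["numbers"]))
    print(f"{idx} - {row['letters']}, {row['numbers']}")
```

Here each yielded row carries the full nested list for its index, which is the behavior the report asks for with an awkward column.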

Installed Versions

awkward         2.6.5
awkward_pandas  2023.8.0
numpy           1.26.3
pandas          2.2.0
@martindurant
Member

In the version of this library on main, we have changed the library quite substantially, to make it simpler while supporting more dataframe libraries. As a result, the pandas "awkward" dtype will disappear, and only the .ak accessor (on series and dataframes) will remain as the way to get awkward's vectorised nested/ragged operations. The data columns themselves will tend to be stored in arrow layout, which is becoming the pandas standard.

That's a rather long way of saying that iterrows() will "just work", as it does for any other data type that pandas already knows about.

Exactly how to get your data stored as arrow is another matter, and one that pandas seems a bit confused about (see here). With #56, which I just posted, you could do

df["numbers"] = df.numbers.ak.to_output()

(note that you don't need your data to be in arrow storage before using .ak)
