Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

REGR: passing dask arrays to Series or DataFrame #38645

Closed
keewis opened this issue Dec 22, 2020 · 23 comments · Fixed by #42577
Closed

REGR: passing dask arrays to Series or DataFrame #38645

keewis opened this issue Dec 22, 2020 · 23 comments · Fixed by #42577
Labels
Constructors Series/DataFrame/Index/pd.array Constructors Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@keewis
Copy link
Contributor

keewis commented Dec 22, 2020

Code Sample, a copy-pastable example

import pandas as pd
import dask.array as da
a = da.ones((12,), chunks=4)
s = pd.Series(a, index=range(12))
print(s.dtype)

Problem description

This has been detected by xarray's upstream-dev CI (environment): with 1.1.3, the dtype is float64 while on master (installed from scipy-wheels-nightly) this became object (and the series / dataframe contains dask scalars). Was that change intentional? Poking around on the merged PR list, this might have been #38563 (not sure, though).

To be clear, for us this only affects test code and since it would compute anyways we can easily work around this by computing the dask array before passing it to pd.Series or pd.DataFrame.

See also pydata/xarray#4717.

cc @TomAugspurger

@keewis keewis added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 22, 2020
@jorisvandenbossche jorisvandenbossche added Regression Functionality that used to work in a prior pandas version and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 23, 2020
@jorisvandenbossche
Copy link
Member

@keewis thanks for the report! Can confirm the change in behaviour.

cc @jbrockmendel

@jorisvandenbossche jorisvandenbossche added this to the 1.3 milestone Dec 23, 2020
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Dec 23, 2020
@simonjayhawkins
Copy link
Member

Poking around on the merged PR list, this might have been #38563 (not sure, though).

can confirm, first bad commit: [cec2f5f] REF: handle non-list_like cases upfront in sanitize_array (#38563)

@jbrockmendel
Copy link
Member

the low-level place to fix this would be in is_list_like, question is if we can do that without a big performance hit

@jorisvandenbossche jorisvandenbossche changed the title BUG: passing dask arrays to Series or DataFrame REGR: passing dask arrays to Series or DataFrame Jan 6, 2021
@jbrockmendel jbrockmendel added the Constructors Series/DataFrame/Index/pd.array Constructors label Jun 8, 2021
@jbrockmendel
Copy link
Member

One way to fix on dask's end would to be to implement Array.__iter__. Is that a viable option?

@keewis
Copy link
Contributor Author

keewis commented Jun 27, 2021

I'm going to forward this to the dask devs: cc @TomAugspurger, @jsignell, @jrbourbeau

@simonjayhawkins simonjayhawkins modified the milestones: 1.3, 1.3.1 Jun 30, 2021
@jsignell
Copy link
Contributor

One way to fix on dask's end would to be to implement Array.__iter__. Is that a viable option?

In that scenario would the output of Array.__iter__ be an iterable of real data values? Currently the output is an iterable of dask arrays, which have .dtype. It seems like returning real values would potentially trigger computation prematurely and returning generated value with the right dtype seems potentially confusing as well.

For comparison, dd.DataFrame has an __iter__ but it just looks at the _meta not at the real data (so it knows about columns but not about rows).

@jbrockmendel
Copy link
Member

In that scenario would the output of Array.iter be an iterable of real data values?

I think it literally just needs to have a __iter__ attribute, doesn't matter what it returns (even would be OK if it raised NotImplementedError)

@jsignell
Copy link
Contributor

This is what it looks like if I have Array.__iter__ raise NotImplementedError

In [1]: import pandas as pd
   ...: import dask.array as da
   ...: a = da.ones((12,), chunks=4)
   ...: s = pd.Series(a, index=range(12))
   ...: print(s.dtype)
---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-1-2e11dcb4eba5> in <module>
      2 import dask.array as da
      3 a = da.ones((12,), chunks=4)
----> 4 s = pd.Series(a, index=range(12))
      5 print(s.dtype)

~/conda/envs/dask-upstream/lib/python3.8/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
    436                     data = data.copy()
    437             else:
--> 438                 data = sanitize_array(data, index, dtype, copy)
    439 
    440                 manager = get_option("mode.data_manager")

~/conda/envs/dask-upstream/lib/python3.8/site-packages/pandas/core/construction.py in sanitize_array(data, index, dtype, copy, raise_cast_failure, allow_2d)
    562         # materialize e.g. generators, convert e.g. tuples, abc.ValueView
    563         # TODO: non-standard array-likes we can convert to ndarray more efficiently?
--> 564         data = list(data)
    565 
    566         if dtype is not None or len(data) == 0:

~/dask/dask/array/core.py in __iter__(self)
   1343 
   1344     def __iter__(self):
-> 1345         raise NotImplementedError
   1346 
   1347     def __len__(self):

NotImplementedError: 

@jsignell
Copy link
Contributor

But I just noticed that dask.Series evaluated the data in the __iter__ method. So it might be reasonable for dask.Array to do the same. I'll open a PR on dask to carry on this discussion.

@jbrockmendel
Copy link
Member

Yah, NotImplementedError was probably too cute. What happens with list(dask_array) now?

@jsignell
Copy link
Contributor

I just opened the PR on dask so we can carry on the dask-side of the conversation over there. dask/dask#7888

@mrocklin
Copy link
Contributor

One way to fix on dask's end would to be to implement Array.iter. Is that a viable option?

I would expect Pandas to try some of the __array__ protocols first. I think that np.asarray(my_dask_array) should efficiently produce something sensible.

@jbrockmendel
Copy link
Member

We can probably do that in sanitize_array, which would avoid the problem with the NotImplementedError

@jsignell
Copy link
Contributor

Good point! If you can add that to sanitize_array then I don't think any changes are needed in dask!

@jbrockmendel
Copy link
Member

We still need the __iter__ method to exist, because the is_list_like check is explicitly a hasattr(obj, "__iter__")

@mrocklin
Copy link
Contributor

I guess my hope would be that Pandas would first check "is this thing array-like" if the answer is "no" then it would ask "ok, well, maybe it's list-like?" To me it makes sense to start with the more efficient things (numpy-ish) and then go down the list of less efficient options until we find something that works.

I don't know all of the history/nuance here though. Please ignore my comments above if they don't make sense.

@jbrockmendel
Copy link
Member

That's absolutely reasonable. In fact there's a comment https://github.com/pandas-dev/pandas/blob/master/pandas/core/construction.py#L563 about doing exactly that.

That would make the conversion more efficient, but in order for the conversion to be done at all, we need to have is_list_like(obj), and that uses the hasattr(obj, "__iter__") check. In principle we could make that fall back to checking for __array__, but is_list_like is optimized to the bone so im reticent.

@mrocklin
Copy link
Contributor

I'm proposing a check further up in that if-elif-else chain, somewhere after if isinstance(data, np.ndarray) but before the final else clause that runs the is_list_like call. In Dask we tend to use a check that is similar to if hasattr(data, "shape") and hasattr(data, "dtype").

@mrocklin
Copy link
Contributor

Or I guess hasattr(data, "__array__") would probably be more direct here

if hasattr(data, "__array__"):
    return sanitize_array(np.asarray(data), ...)

@mrocklin
Copy link
Contributor

Oh! Unless you're saying that this function only gets called if there is an __iter__ method. That would make more sense why this is an issue. My apologies for misunderstanding.

@jsignell
Copy link
Contributor

Ok @jbrockmendel I opened a PR on the dask side to implement __iter__ on Array. It fails rather spectacularly with current pandas, but hopefully, that should give you the hook you need?

In [1]: import pandas as pd
   ...: import dask.array as da
   ...: a = da.ones((12,), chunks=4)
   ...: s = pd.Series(a, index=range(12))
   ...: s
Out[1]: 
0     dask.array<getitem, shape=(), dtype=float64, c...
1     dask.array<getitem, shape=(), dtype=float64, c...
2     dask.array<getitem, shape=(), dtype=float64, c...
3     dask.array<getitem, shape=(), dtype=float64, c...
4     dask.array<getitem, shape=(), dtype=float64, c...
5     dask.array<getitem, shape=(), dtype=float64, c...
6     dask.array<getitem, shape=(), dtype=float64, c...
7     dask.array<getitem, shape=(), dtype=float64, c...
8     dask.array<getitem, shape=(), dtype=float64, c...
9     dask.array<getitem, shape=(), dtype=float64, c...
10    dask.array<getitem, shape=(), dtype=float64, c...
11    dask.array<getitem, shape=(), dtype=float64, c...
dtype: object

@jbrockmendel
Copy link
Member

ill make a pandas PR to use __array__ as discussed above. once that is merged we can confirm that implementing __iter__ fixes the problem.

FWIW i'd implement __iter__ so that list(obj) == list(np.array(obj))`

@jsignell
Copy link
Contributor

There was some discussion on the dask side and people feel that having a greedy __iter__ is too much of a gotcha. Too easy to call by mistake. See dask/dask#7889 for reference.

mrocklin pushed a commit to dask/dask that referenced this issue Jul 21, 2021
This doesn't fix the original issue pandas-dev/pandas#38645, but hopefully it'll make it easier for pandas to know that it should sanitize dask.arrays.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Constructors Series/DataFrame/Index/pd.array Constructors Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants