REGR: passing dask arrays to Series or DataFrame #38645
Comments
@keewis thanks for the report! Can confirm the change in behaviour. |
the low-level place to fix this would be in |
One way to fix on dask's end would be to implement |
I'm going to forward this to the dask devs: cc @TomAugspurger, @jsignell, @jrbourbeau |
In that scenario would the output of For comparison, |
I think it literally just needs to have a |
This is what it looks like if I have
In [1]: import pandas as pd
...: import dask.array as da
...: a = da.ones((12,), chunks=4)
...: s = pd.Series(a, index=range(12))
...: print(s.dtype)
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
<ipython-input-1-2e11dcb4eba5> in <module>
2 import dask.array as da
3 a = da.ones((12,), chunks=4)
----> 4 s = pd.Series(a, index=range(12))
5 print(s.dtype)
~/conda/envs/dask-upstream/lib/python3.8/site-packages/pandas/core/series.py in __init__(self, data, index, dtype, name, copy, fastpath)
436 data = data.copy()
437 else:
--> 438 data = sanitize_array(data, index, dtype, copy)
439
440 manager = get_option("mode.data_manager")
~/conda/envs/dask-upstream/lib/python3.8/site-packages/pandas/core/construction.py in sanitize_array(data, index, dtype, copy, raise_cast_failure, allow_2d)
562 # materialize e.g. generators, convert e.g. tuples, abc.ValueView
563 # TODO: non-standard array-likes we can convert to ndarray more efficiently?
--> 564 data = list(data)
565
566 if dtype is not None or len(data) == 0:
~/dask/dask/array/core.py in __iter__(self)
1343
1344 def __iter__(self):
-> 1345 raise NotImplementedError
1346
1347 def __len__(self):
NotImplementedError: |
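The failure in the traceback above can be reproduced in miniature without pandas or dask. The sketch below is only an illustration: `LazyArray` and `materialize` are hypothetical names, standing in for a lazy array whose `__iter__` raises and for the `data = list(data)` fallback shown in the traceback.

```python
class LazyArray:
    """Hypothetical stand-in for a lazy array that refuses iteration."""

    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def __iter__(self):
        # Mirrors the behaviour shown in the traceback: iteration is disabled.
        raise NotImplementedError


def materialize(data):
    # Same fallback as the `data = list(data)` line in the traceback.
    return list(data)


try:
    materialize(LazyArray(12))
except NotImplementedError:
    outcome = "NotImplementedError"
else:
    outcome = "ok"

print(outcome)
```

Because `list()` asks for an iterator first, the error raised in `__iter__` propagates before any element is read.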
But I just noticed that |
Yah, NotImplementedError was probably too cute. What happens with |
I just opened the PR on dask so we can carry on the dask-side of the conversation over there. dask/dask#7888 |
I would expect Pandas to try some of the |
We can probably do that in sanitize_array, which would avoid the problem with the NotImplementedError |
Good point! If you can add that to sanitize_array then I don't think any changes are needed in dask! |
We still need the |
I guess my hope would be that Pandas would first check "is this thing array-like" if the answer is "no" then it would ask "ok, well, maybe it's list-like?" To me it makes sense to start with the more efficient things (numpy-ish) and then go down the list of less efficient options until we find something that works. I don't know all of the history/nuance here though. Please ignore my comments above if they don't make sense. |
That's absolutely reasonable. In fact there's a comment https://github.com/pandas-dev/pandas/blob/master/pandas/core/construction.py#L563 about doing exactly that. That would make the conversion more efficient, but in order for the conversion to be done at all, we need to have |
I'm proposing a check further up in that if-elif-else chain, somewhere after |
Or I guess
    if hasattr(data, "__array__"):
        return sanitize_array(np.asarray(data), ...) |
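The ordering suggested in this part of the thread (try the array-like route first, fall back to list-like) could look roughly like the sketch below. This is not pandas' `sanitize_array`; `to_ndarray` and `LazyOnes` are hypothetical names used only for illustration, and the only library call relied on is `np.asarray`, which consults `__array__` before treating an object as a sequence.

```python
import numpy as np


class LazyOnes:
    """Hypothetical lazy array: convertible via __array__, but not iterable."""

    def __init__(self, n):
        self._n = n

    def __len__(self):
        return self._n

    def __iter__(self):
        raise NotImplementedError

    def __array__(self, dtype=None, copy=None):
        # Materialize the data only when NumPy asks for it.
        arr = np.ones(self._n)
        return arr if dtype is None else arr.astype(dtype)


def to_ndarray(data):
    # Hypothetical helper: most efficient option first, then fall back.
    if isinstance(data, np.ndarray):
        return data
    if hasattr(data, "__array__"):
        # Array-like: let the object convert itself in one step.
        return np.asarray(data)
    # List-like (generators, tuples, ...): materialize element by element.
    return np.asarray(list(data))


converted = to_ndarray(LazyOnes(12))
print(converted.dtype, converted.shape)
```

With this ordering the `__iter__` of the lazy object is never called, so it does not matter whether it raises or yields lazy scalars.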
Oh! Unless you're saying that this function only gets called if there is an |
Ok @jbrockmendel I opened a PR on the dask side to implement
In [1]: import pandas as pd
...: import dask.array as da
...: a = da.ones((12,), chunks=4)
...: s = pd.Series(a, index=range(12))
...: s
Out[1]:
0 dask.array<getitem, shape=(), dtype=float64, c...
1 dask.array<getitem, shape=(), dtype=float64, c...
2 dask.array<getitem, shape=(), dtype=float64, c...
3 dask.array<getitem, shape=(), dtype=float64, c...
4 dask.array<getitem, shape=(), dtype=float64, c...
5 dask.array<getitem, shape=(), dtype=float64, c...
6 dask.array<getitem, shape=(), dtype=float64, c...
7 dask.array<getitem, shape=(), dtype=float64, c...
8 dask.array<getitem, shape=(), dtype=float64, c...
9 dask.array<getitem, shape=(), dtype=float64, c...
10 dask.array<getitem, shape=(), dtype=float64, c...
11 dask.array<getitem, shape=(), dtype=float64, c...
dtype: object |
I'll make a pandas PR to use FWIW I'd implement |
There was some discussion on the dask side and people feel that having a greedy |
This doesn't fix the original issue pandas-dev/pandas#38645, but hopefully it'll make it easier for pandas to know that it should sanitize dask.arrays.
Code Sample, a copy-pastable example
Problem description
This has been detected by xarray's upstream-dev CI (environment): with 1.1.3, the dtype is float64 while on master (installed from scipy-wheels-nightly) this became object (and the series / dataframe contains dask scalars). Was that change intentional? Poking around on the merged PR list, this might have been #38563 (not sure, though).
To be clear, for us this only affects test code and since it would compute anyways we can easily work around this by computing the dask array before passing it to pd.Series or pd.DataFrame.
See also pydata/xarray#4717.
cc @TomAugspurger