Implement DataArray.to_dask_dataframe() #7635

Merged on Apr 28, 2023 (75 commits)
Showing changes from 21 commits
Commits
fe5e7ad
Add feature to convert dataarray to dask dataframe. This is for the …
dsgreen2 Mar 14, 2023
4174d50
Add test for new method dataarray.to_dask_dataframe()
dsgreen2 Mar 16, 2023
ae32e83
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Mar 16, 2023
ba47c16
Changes after review
dsgreen2 Mar 17, 2023
924d2dd
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Mar 17, 2023
3b9165c
Updating method-dataarray-to-daskdataframe branch with remote branch
dsgreen2 Mar 18, 2023
ed559d3
Corrections in docstring and import
dsgreen2 Mar 18, 2023
65f9fab
docstring correction
dsgreen2 Mar 18, 2023
bffe3cb
Remove name parameter
dsgreen2 Mar 21, 2023
a7e423c
Add feature to convert dataarray to dask dataframe. This is for the …
dsgreen2 Mar 14, 2023
5c7a48f
Add test for new method dataarray.to_dask_dataframe()
dsgreen2 Mar 16, 2023
43a8b9b
Changes after review
dsgreen2 Mar 17, 2023
aff8eca
Corrections in docstring and import
dsgreen2 Mar 18, 2023
9e4f816
docstring correction
dsgreen2 Mar 18, 2023
bcd06c8
Remove name parameter
dsgreen2 Mar 21, 2023
4cca0a8
Corrected doc/whats-new.rst
dsgreen2 Mar 23, 2023
a2ee962
Merge branch 'method-dataarray-to-daskdataframe' of github.com:dsgree…
dsgreen2 Mar 23, 2023
34a2ac5
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Mar 24, 2023
8c50d1e
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Mar 24, 2023
2c9ffb3
Update whats-new.rst
dsgreen2 Mar 27, 2023
88f10bc
Merge branch 'pydata:main' into method-dataarray-to-daskdataframe
dsgreen2 Mar 27, 2023
b19cd6f
update
dsgreen2 Mar 27, 2023
1db928f
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Mar 27, 2023
aa6c123
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Mar 28, 2023
dd77cff
Merge branch 'pydata:main' into method-dataarray-to-daskdataframe
dsgreen2 Mar 28, 2023
82bdd8e
Updating local branch
dsgreen2 Mar 28, 2023
a8500a2
Space corrections in docstring
dsgreen2 Mar 28, 2023
75e20b4
Whitespace correction in docstring
dsgreen2 Mar 28, 2023
65d3a93
Add white space in docstring line
dsgreen2 Mar 28, 2023
69811fe
Whitespace correction
dsgreen2 Mar 28, 2023
4bae82c
Update line npartitions=1
dsgreen2 Mar 28, 2023
3d3b092
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 28, 2023
ee30fe4
Revert "Update line npartitions=1"
dsgreen2 Mar 28, 2023
9828442
Updating local branch
dsgreen2 Mar 28, 2023
6c2c5e3
Add whitespace in npartitions=1
dsgreen2 Mar 28, 2023
71c0656
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 28, 2023
d89f8e0
Change example in docstring
dsgreen2 Mar 28, 2023
7f3f9da
Change example in docstring
dsgreen2 Mar 28, 2023
b545282
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 28, 2023
e733b0f
Change example in docstring
dsgreen2 Mar 28, 2023
fd4501a
Example changes
dsgreen2 Mar 28, 2023
dffea4b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 28, 2023
0ab7b16
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Mar 31, 2023
6a31726
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Apr 1, 2023
cb2feaa
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Apr 1, 2023
40a3458
Update xarray/core/dataarray.py
dsgreen2 Apr 4, 2023
02f09d3
Update doc/whats-new.rst
dsgreen2 Apr 4, 2023
127db51
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Apr 4, 2023
33d4960
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Apr 7, 2023
5a2b24f
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Apr 8, 2023
ee086a1
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Apr 13, 2023
c732783
Add name check
dsgreen2 Apr 15, 2023
2e9cf38
Resolve conflicts
dsgreen2 Apr 16, 2023
38ebe32
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Apr 16, 2023
ab619f9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 16, 2023
e577a1c
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Apr 17, 2023
d07a3c6
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Apr 19, 2023
694ba77
Add test for unnamed dataarray.
dsgreen2 Apr 19, 2023
53e9b4d
Update remote branch
dsgreen2 Apr 19, 2023
4866d8b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 19, 2023
c563989
Remove scalar array test
dsgreen2 Apr 20, 2023
c2acb6f
Remove scalar test
dsgreen2 Apr 20, 2023
87e993d
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 20, 2023
b60a226
Change error message
dsgreen2 Apr 20, 2023
4552d18
Change unnamed array error
dsgreen2 Apr 20, 2023
2466903
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 20, 2023
8bfbf3f
Update xarray/tests/test_dataarray.py
dcherian Apr 20, 2023
6eb67b3
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Apr 20, 2023
1f67c6f
Update whats-new.rst
dsgreen2 Apr 20, 2023
8b5c607
Update whats-new.rst
dsgreen2 Apr 20, 2023
7c2e50c
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Apr 21, 2023
8116bb1
Merge branch 'main' into method-dataarray-to-daskdataframe
dsgreen2 Apr 27, 2023
e80ece0
Update doc/whats-new.rst
dcherian Apr 28, 2023
dd58648
Merge branch 'main' into method-dataarray-to-daskdataframe
dcherian Apr 28, 2023
7ed49ad
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 28, 2023
1 change: 1 addition & 0 deletions doc/api.rst
@@ -631,6 +631,7 @@ DataArray methods
DataArray.from_iris
DataArray.from_series
DataArray.to_cdms2
DataArray.to_dask_dataframe
DataArray.to_dataframe
DataArray.to_dataset
DataArray.to_dict
2 changes: 2 additions & 0 deletions doc/whats-new.rst
@@ -19,6 +19,7 @@ What's New

v2023.04.0 (unreleased)
-----------------------
- Added new method :py:meth:`DataArray.to_dask_dataframe`, which converts a DataArray into a dask DataFrame (:issue:`7409`).

New Features
~~~~~~~~~~~~
@@ -70,6 +71,7 @@ New Features
  By `Deepak Cherian <https://github.com/dcherian>`_.
- Improved performance in ``open_dataset`` for datasets with large object arrays (:issue:`7484`, :pull:`7494`).
  By `Alex Goodman <https://github.com/agoodm>`_ and `Deepak Cherian <https://github.com/dcherian>`_.
- Added new method :py:meth:`DataArray.to_dask_dataframe`, which converts a DataArray into a dask DataFrame (:issue:`7409`).
Review comment (Contributor):
Remove.


Breaking changes
~~~~~~~~~~~~~~~~
56 changes: 56 additions & 0 deletions xarray/core/dataarray.py
@@ -57,6 +57,10 @@

    from numpy.typing import ArrayLike

    try:
        from dask.dataframe import DataFrame as DaskDataFrame
    except ImportError:
        DaskDataFrame = None  # type: ignore
    try:
        from dask.delayed import Delayed
    except ImportError:
@@ -6670,6 +6674,58 @@ def resample(
            **indexer_kwargs,
        )

    def to_dask_dataframe(
        self,
        dim_order: Sequence[Hashable] | None = None,
        set_index: bool = False,
    ) -> DaskDataFrame:
        """Convert this array into a dask.dataframe.DataFrame.

        Parameters
        ----------
        dim_order : Sequence of Hashable or None, optional
            Hierarchical dimension order for the resulting dataframe.
            Array content is transposed to this order and then written out as flat
            vectors in contiguous order, so the last dimension in this list
            will be contiguous in the resulting DataFrame. This has a major influence
            on which operations are efficient on the resulting dask dataframe.

Review comment (Contributor): Suggested change

Reply (Contributor Author, @dsgreen2, Apr 1, 2023):
I have made the changes. Please review them. Thanks.

        set_index : bool, default: False
            If set_index=True, the dask DataFrame is indexed by this array's
            coordinate. Since dask DataFrames do not support multi-indexes,
            set_index only works if the array has exactly one dimension.

        Returns
        -------
        dask.dataframe.DataFrame

        Examples
        --------
        >>> da = xr.DataArray(
        ...     np.arange(4 * 3 * 2).reshape(4, 3, 2),
        ...     dims=("time", "lat", "lon"),
        ...     coords={
        ...         "time": np.arange(4),
        ...         "lat": [-30, -20, -10],
        ...         "lon": [120, 130],
        ...     },
        ...     name="eg_dataarray",
        ...     attrs={"units": "Celsius", "description": "Random temperature data"},
        ... )
        >>> da.to_dask_dataframe(["lat", "lon", "time"])
        Dask DataFrame Structure:
                         lat    lon   time  eg_dataarray
        npartitions=1
        0              int64  int64  int64         int64
        23               ...    ...    ...           ...
        Dask Name: concat-indexed, 1 graph layer

        """

        name = self.name if self.name is not None else _THIS_ARRAY
Review comment (Contributor):

I think we should raise an error for an unnamed DataArray.

This line is sticking a private xarray sentinel object in a user's DataFrame; it's going to cause issues.

import xarray as xr
import xarray.core.utils
from xarray.core.dataarray import _THIS_ARRAY

self = xr.DataArray([1, 2, 3])

name = self.name or _THIS_ARRAY
ds = self._to_dataset_whole(name, shallow_copy=False)
df = ds.to_dask_dataframe(dim_order=None, set_index=True)

type(df.columns[0])
xarray.core.utils.ReprObject

Reply (Contributor Author):
I have made the changes as suggested. Please review them.
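
The check requested in this thread could look roughly like the sketch below. It is only an illustration: `require_name` is a hypothetical helper, not part of xarray, and the error message is made up here, since the wording used in the later commits is not visible in this view.

```python
import xarray as xr


def require_name(da: xr.DataArray) -> str:
    """Return the array's name as a column label, or raise if it is unnamed.

    Hypothetical helper for illustration only; not part of xarray's API.
    """
    if da.name is None:
        # Illustrative message; the wording used in xarray may differ.
        raise ValueError(
            "Cannot convert an unnamed DataArray to a dask DataFrame; "
            "set a name first, e.g. with da.rename('values')."
        )
    return str(da.name)


named = xr.DataArray([1, 2, 3], dims="x", name="foo")
unnamed = xr.DataArray([1, 2, 3], dims="x")

print(require_name(named))

try:
    require_name(unnamed)
except ValueError as err:
    print(f"ValueError: {err}")
```

Testing `da.name is None` rather than relying on truthiness keeps falsy but valid names (for example `0` or an empty string) from being treated as missing, which is the same distinction the diff line above makes with `is not None`.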

        ds = self._to_dataset_whole(name, shallow_copy=False)
        return ds.to_dask_dataframe(dim_order, set_index)

    # this needs to be at the end, or mypy will confuse with `str`
    # https://mypy.readthedocs.io/en/latest/common_issues.html#dealing-with-conflicting-names
    str = utils.UncachedAccessor(StringAccessor["DataArray"])
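
The `dim_order` behaviour described in the docstring (transpose to the requested order, then write out flat vectors) can be illustrated with plain NumPy. This is a sketch of the resulting row ordering only, using the shapes from the docstring example; it is not xarray's implementation, and the variable names are chosen here for illustration.

```python
import numpy as np

# Array with dims ("time", "lat", "lon") and sizes (4, 3, 2),
# mirroring the docstring example.
dims = ("time", "lat", "lon")
data = np.arange(4 * 3 * 2).reshape(4, 3, 2)

# Requested order, as in da.to_dask_dataframe(["lat", "lon", "time"]).
dim_order = ["lat", "lon", "time"]

# Transpose to the requested order, then flatten in C (row-major) order.
axes = [dims.index(d) for d in dim_order]
flat = data.transpose(axes).reshape(-1)

# The last dimension in dim_order ("time") varies fastest, so the first
# four values step through time at fixed lat and lon: 0, 6, 12, 18.
print(flat[:8])
```

The test in this PR checks the same relationship for a 2-D array by comparing `arr_np.transpose().reshape(-1)` with the column values returned for `dim_order=["A", "B"]`.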
29 changes: 29 additions & 0 deletions xarray/tests/test_dataarray.py
@@ -3205,6 +3205,35 @@ def test_to_dataframe_0length(self) -> None:
        assert len(actual) == 0
        assert_array_equal(actual.index.names, list("ABC"))

    @requires_dask
    def test_to_dask_dataframe(self) -> None:
        arr_np = np.arange(3 * 4).reshape(3, 4)
        arr = DataArray(arr_np, [("B", [1, 2, 3]), ("A", list("cdef"))], name="foo")
        expected = arr.to_series()
        actual = arr.to_dask_dataframe()["foo"]

        assert_array_equal(actual.values, expected.values)

        actual = arr.to_dask_dataframe(dim_order=["A", "B"])["foo"]
        assert_array_equal(arr_np.transpose().reshape(-1), actual.values)

        # regression test for coords with different dimensions

        arr.coords["C"] = ("B", [-1, -2, -3])
        expected = arr.to_series().to_frame()
        expected["C"] = [-1] * 4 + [-2] * 4 + [-3] * 4
        expected = expected[["C", "foo"]]
        actual = arr.to_dask_dataframe()[["C", "foo"]]

        assert_array_equal(expected.values, actual.values)
        assert_array_equal(expected.columns.values, actual.columns.values)

        with pytest.raises(ValueError, match="does not match the set of dimensions"):
            arr.to_dask_dataframe(dim_order=["B", "A", "C"])

        with pytest.raises(ValueError, match=r"cannot convert a scalar"):
            arr.sel(A="c", B=2).to_dataframe()

    def test_to_pandas_name_matches_coordinate(self) -> None:
        # coordinate with same name as array
        arr = DataArray([1, 2, 3], dims="x", name="x")