Merging a dask dataframe with a pandas series fails #915

Open
mrocklin opened this issue Feb 29, 2024 · 1 comment

Comments

@mrocklin
Member

In [1]: import dask

In [2]: import dask_expr as dd
   ...: df = dd.datasets.timeseries()

In [3]: df.head()
Out[3]:
                        name    id         x         y
timestamp
2000-01-01 00:00:00  Michael  1006  0.927520 -0.442859
2000-01-01 00:00:01    Kevin  1018 -0.411144 -0.037667
2000-01-01 00:00:02   Yvonne   974 -0.648850 -0.515754
2000-01-01 00:00:03   Yvonne   994  0.463103  0.560937
2000-01-01 00:00:04   Yvonne  1002 -0.511311 -0.308211

In [4]: df.merge(df.name.head())
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-254398e6f615> in ?()
----> 1 df.merge(df.name.head())

~/workspace/dask-expr/dask_expr/_collection.py in ?(self, right, how, on, left_on, right_on, left_index, right_index, suffixes, indicator, shuffle_method, npartitions, broadcast)
   2572         an internal ``shuffle``, because shuffling places all rows that have the same
   2573         index in the same partition. To avoid this error, make sure all rows with the
   2574         same ``on``-column value can fit on a single partition.
   2575         """
-> 2576         return merge(
   2577             self,
   2578             right,
   2579             how,

~/workspace/dask-expr/dask_expr/_collection.py in ?(left, right, how, on, left_on, right_on, left_index, right_index, suffixes, indicator, shuffle_method, npartitions, broadcast)
   4779     for o in [on, left_on, right_on]:
   4780         if isinstance(o, FrameBase):
   4781             raise NotImplementedError()
   4782     if not on and not left_on and not right_on and not left_index and not right_index:
-> 4783         on = [c for c in left.columns if c in right.columns]
   4784         if not on:
   4785             left_index = right_index = True
   4786

~/workspace/dask-expr/dask_expr/_collection.py in ?(.0)
-> 4783 def merge(
   4784     left,
   4785     right,
   4786     how="inner",

~/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, name)
   6289             and name not in self._accessors
   6290             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6291         ):
   6292             return self[name]
-> 6293         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'columns'

In [5]: df.head().merge(df.name.head())
Out[5]:
       name    id         x         y
0   Michael  1006  0.927520 -0.442859
1     Kevin  1018 -0.411144 -0.037667
2    Yvonne   974 -0.648850 -0.515754
3    Yvonne   974 -0.648850 -0.515754
4    Yvonne   974 -0.648850 -0.515754
5    Yvonne   994  0.463103  0.560937
6    Yvonne   994  0.463103  0.560937
7    Yvonne   994  0.463103  0.560937
8    Yvonne  1002 -0.511311 -0.308211
9    Yvonne  1002 -0.511311 -0.308211
10   Yvonne  1002 -0.511311 -0.308211

@phofl
Collaborator

phofl commented Mar 2, 2024

That doesn't work in dask/dask either. I'm not sure how trivial this is to add.

The easiest solution might be to cast the Series to a DataFrame before merging.
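
A minimal sketch of that cast, using plain pandas so it runs without a cluster. The helper name `as_frame` is hypothetical and is not part of dask-expr or dask; it only illustrates converting a named Series with `Series.to_frame()` so that the `right.columns` lookup from the traceback has something to work with.

```python
import pandas as pd


def as_frame(right):
    """Hypothetical helper: return `right` as a DataFrame so `.columns` exists."""
    if isinstance(right, pd.Series):
        if right.name is None:
            # An unnamed Series has no column name to join on.
            raise ValueError("Cannot merge a Series without a name")
        return right.to_frame()
    return right


left = pd.DataFrame(
    {"name": ["Michael", "Kevin", "Yvonne"], "id": [1006, 1018, 974]}
)
right = pd.Series(["Kevin", "Yvonne"], name="name")

right_frame = as_frame(right)

# Same common-column lookup as in the traceback, now on a DataFrame.
common = [c for c in left.columns if c in right_frame.columns]
merged = left.merge(right_frame, on=common)
```

Until something like this exists inside the library, the user-side workaround would presumably be `df.merge(df.name.head().to_frame())`, assuming the merge accepts a pandas DataFrame on the right-hand side.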
