Merging a dask dataframe with a pandas series fails #915

mrocklin · 2024-02-29T19:11:50Z

In [1]: import dask

In [2]: import dask_expr as dd
   ...: df = dd.datasets.timeseries()

In [3]: df.head()
Out[3]:
                        name    id         x         y
timestamp
2000-01-01 00:00:00  Michael  1006  0.927520 -0.442859
2000-01-01 00:00:01    Kevin  1018 -0.411144 -0.037667
2000-01-01 00:00:02   Yvonne   974 -0.648850 -0.515754
2000-01-01 00:00:03   Yvonne   994  0.463103  0.560937
2000-01-01 00:00:04   Yvonne  1002 -0.511311 -0.308211

In [4]: df.merge(df.name.head())

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-4-254398e6f615> in ?()
----> 1 df.merge(df.name.head())

~/workspace/dask-expr/dask_expr/_collection.py in ?(self, right, how, on, left_on, right_on, left_index, right_index, suffixes, indicator, shuffle_method, npartitions, broadcast)
   2572         an internal ``shuffle``, because shuffling places all rows that have the same
   2573         index in the same partition. To avoid this error, make sure all rows with the
   2574         same ``on``-column value can fit on a single partition.
   2575         """
-> 2576         return merge(
   2577             self,
   2578             right,
   2579             how,

~/workspace/dask-expr/dask_expr/_collection.py in ?(left, right, how, on, left_on, right_on, left_index, right_index, suffixes, indicator, shuffle_method, npartitions, broadcast)
   4779     for o in [on, left_on, right_on]:
   4780         if isinstance(o, FrameBase):
   4781             raise NotImplementedError()
   4782     if not on and not left_on and not right_on and not left_index and not right_index:
-> 4783         on = [c for c in left.columns if c in right.columns]
   4784         if not on:
   4785             left_index = right_index = True
   4786

~/workspace/dask-expr/dask_expr/_collection.py in ?(.0)
-> 4783 def merge(
   4784     left,
   4785     right,
   4786     how="inner",

~/mambaforge/envs/etl-tpch/lib/python3.11/site-packages/pandas/core/generic.py in ?(self, name)
   6289             and name not in self._accessors
   6290             and self._info_axis._can_hold_identifiers_and_holds_name(name)
   6291         ):
   6292             return self[name]
-> 6293         return object.__getattribute__(self, name)

AttributeError: 'Series' object has no attribute 'columns'

In [5]: df.head().merge(df.name.head())
Out[5]:
       name    id         x         y
0   Michael  1006  0.927520 -0.442859
1     Kevin  1018 -0.411144 -0.037667
2    Yvonne   974 -0.648850 -0.515754
3    Yvonne   974 -0.648850 -0.515754
4    Yvonne   974 -0.648850 -0.515754
5    Yvonne   994  0.463103  0.560937
6    Yvonne   994  0.463103  0.560937
7    Yvonne   994  0.463103  0.560937
8    Yvonne  1002 -0.511311 -0.308211
9    Yvonne  1002 -0.511311 -0.308211
10   Yvonne  1002 -0.511311 -0.308211

phofl · 2024-03-02T16:19:08Z

That doesn't work in dask/dask either, and I'm not sure how trivial it is to add.

The easiest solution might be to cast the Series to a DataFrame before merging.
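
A minimal sketch of that cast, for illustration only. `ensure_frame` is a hypothetical helper, not part of dask or dask-expr, and the small frames below stand in for `df` and `df.name.head()` from the report. The demo uses plain pandas so it runs without a cluster. The assumption, not verified here, is that passing the resulting one-column DataFrame to the dask `merge` avoids the `right.columns` lookup that raised the `AttributeError`.

```python
import pandas as pd


def ensure_frame(right):
    """Return `right` as a DataFrame (hypothetical helper, not a dask API).

    A named pandas Series becomes a one-column DataFrame; anything else
    is returned unchanged.
    """
    if isinstance(right, pd.Series):
        if right.name is None:
            # Without a name there is no column to join on.
            raise ValueError("Cannot merge an unnamed Series")
        return right.to_frame()
    return right


# Stand-ins for `df` and `df.name.head()` from the report.
left = pd.DataFrame(
    {"name": ["Michael", "Kevin", "Yvonne"], "id": [1006, 1018, 974]}
)
right = pd.Series(["Kevin", "Yvonne"], name="name")

right_frame = ensure_frame(right)  # now has a `.columns` attribute
merged = left.merge(right_frame)   # inner join on the shared "name" column

# Assumed equivalent for the dask case in the report:
#     df.merge(df.name.head().to_frame())
```

Doing the cast at the call site keeps the workaround independent of any change to `merge` itself.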
