02_bag groupby vs foldby example fails due to known bug #213

Closed
@gmiretti

Description

What happened:

In the notebook 02_bags.ipynb, in the groupby vs. foldby example on the accounts data, groupby does not collect all records for a key into one group. Each name appears several times with a partial count, so the result differs from the equivalent foldby code.

MVCE:

%run prep.py -d accounts
import json
import os

import dask.bag as db
from dask.distributed import Client

client = Client(n_workers=4)
filename = os.path.join('data', 'accounts.*.json.gz')
lines = db.read_text(filename)
js = lines.map(json.loads)  # parse each JSON line into a dict
# Warning, this one takes a while...
result = js.groupby(lambda item: item['name']).starmap(lambda k, v: (k, len(v))).compute()
print(sorted(result))

Shows

[('Alice', 285), ('Alice', 287), ('Alice', 308), ('Alice', 311), ('Bob', 216), ('Bob', 216), ('Bob', 234), ('Bob', 234), ('Charlie', 219), ('Charlie', 219), ('Charlie', 234), ('Charlie', 238), .... , ('Zelda', 259), ('Zelda', 259), ('Zelda', 281), ('Zelda', 284)]

What you expected to happen:

It should show the same output as this code using foldby:

# This one is comparatively fast and produces the same result.
from operator import add
def incr(tot, _):
    return tot + 1

result = js.foldby(key='name', 
                   binop=incr, 
                   initial=0, 
                   combine=add, 
                   combine_initial=0).compute()
print(sorted(result))

Output:

[('Alice', 1191), ('Bob', 900), ('Charlie', 910), ...., ('Zelda', 1083)]
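
The partial counts in the groupby output add up to the foldby totals for every name shown, which suggests groupby is splitting each key across several groups rather than dropping records. A minimal check in plain Python, using only the values visible in the two outputs above (the elided entries are left out, and the variable names are just for illustration):

```python
from collections import Counter

# Values copied from the groupby output shown in this report (elided names omitted).
groupby_result = [
    ('Alice', 285), ('Alice', 287), ('Alice', 308), ('Alice', 311),
    ('Bob', 216), ('Bob', 216), ('Bob', 234), ('Bob', 234),
    ('Charlie', 219), ('Charlie', 219), ('Charlie', 234), ('Charlie', 238),
    ('Zelda', 259), ('Zelda', 259), ('Zelda', 281), ('Zelda', 284),
]

# Values copied from the foldby output shown in this report.
foldby_result = [('Alice', 1191), ('Bob', 900), ('Charlie', 910), ('Zelda', 1083)]

# Sum the partial counts that groupby reported for each name.
merged = Counter()
for name, count in groupby_result:
    merged[name] += count

print(sorted(merged.items()))
print(sorted(merged.items()) == foldby_result)
```

Summing the partial groups this way is only a diagnostic; it does not replace a fix in groupby itself.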

Anything else we need to know?:

This appears to have been a known issue for a while. It was first reported in dask 2.20 and fixed in dask/dask#6640, but it is still open in dask/dask#6723.

I am filing the bug report here because the bug has been around for a while, and it took me some time to find these issues.
Regarding what to do next, I would like to hear the maintainers' opinions. The options I see are:

  • wait for the fix, keeping open the issue until then
  • also reference this issue in the tutorial
  • change the tutorial to group by numerics
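
On the last option: one plausible reason numeric keys might sidestep the problem is Python's per-process hash randomization for strings. This is an assumption about the cause, not something confirmed in this report. The sketch below only illustrates that mechanism with the standard library; the helper `hash_in_fresh_process` is hypothetical and it does not call dask at all.

```python
import os
import subprocess
import sys


def hash_in_fresh_process(expr, seed):
    """Evaluate hash(<expr>) in a new interpreter started with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({expr}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)


# Two interpreters with different hash seeds stand in for two worker processes.
str_hashes = {hash_in_fresh_process("'Alice'", seed) for seed in (1, 2)}
int_hashes = {hash_in_fresh_process("12345", seed) for seed in (1, 2)}

print("distinct hashes for the string key:", len(str_hashes))
print("distinct hashes for the integer key:", len(int_hashes))
```

If the assumption holds, any routing of keys based on `hash()` could send the same name to different places depending on the process, while an integer key would be routed consistently.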

Environment:

  • Dask version: 2.20 and 2021.05.0
  • Python version: 3.8.10
  • Operating System: Ubuntu 18.04.5
  • Install method (conda): conda env create -f binder/environment.yml for 2.20 and conda update dask for 2021.05.0

Also tested using Binder from the link in the README.
