02_bag groupby vs foldby example fails due to known bug #213

Closed
gmiretti opened this issue May 28, 2021 · 1 comment

What happened:

In the notebook 02_bags.ipynb, in the groupby vs foldby example that uses the accounts data, groupby does not group all records for a given name together. It returns several partial groups per name, so its result differs from the equivalent foldby code.

MVCE:

%run prep.py -d accounts
import json
import os

import dask.bag as db
from dask.distributed import Client

client = Client(n_workers=4)
filename = os.path.join('data', 'accounts.*.json.gz')
lines = db.read_text(filename)
js = lines.map(json.loads)  # parse each JSON line into a dict
# Warning, this one takes a while...
result = js.groupby(lambda item: item['name']).starmap(lambda k, v: (k, len(v))).compute()
print(sorted(result))

Shows

[('Alice', 285), ('Alice', 287), ('Alice', 308), ('Alice', 311), ('Bob', 216), ('Bob', 216), ('Bob', 234), ('Bob', 234), ('Charlie', 219), ('Charlie', 219), ('Charlie', 234), ('Charlie', 238), .... , ('Zelda', 259), ('Zelda', 259), ('Zelda', 281), ('Zelda', 284)]

What you expected to happen:

It should show the same output as this code using foldby:

# This one is comparatively fast and produces the same result.
from operator import add
def incr(tot, _):
    return tot + 1

result = js.foldby(key='name', 
                   binop=incr, 
                   initial=0, 
                   combine=add, 
                   combine_initial=0).compute()
print(sorted(result))

Output:

[('Alice', 1191), ('Bob', 900), ('Charlie', 910), ...., ('Zelda', 1083)]
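A quick sanity check on the two outputs above, using only the values that are actually shown (the elided names are left out): summing the partial groupby counts per name reproduces the foldby totals. This would be consistent with groupby grouping within subsets of the data rather than globally, though that interpretation is a guess and is not confirmed here. The helper name `merge_partial_counts` is made up for this sketch.

```python
from collections import defaultdict

# Partial counts copied from the groupby output above (only the names shown there).
groupby_result = [
    ("Alice", 285), ("Alice", 287), ("Alice", 308), ("Alice", 311),
    ("Bob", 216), ("Bob", 216), ("Bob", 234), ("Bob", 234),
    ("Charlie", 219), ("Charlie", 219), ("Charlie", 234), ("Charlie", 238),
    ("Zelda", 259), ("Zelda", 259), ("Zelda", 281), ("Zelda", 284),
]

# Totals copied from the foldby output above.
foldby_result = {"Alice": 1191, "Bob": 900, "Charlie": 910, "Zelda": 1083}


def merge_partial_counts(pairs):
    """Sum the counts of all (name, count) pairs that share a name."""
    totals = defaultdict(int)
    for name, count in pairs:
        totals[name] += count
    return dict(totals)


merged = merge_partial_counts(groupby_result)
print(merged == foldby_result)
```

So no records appear to be lost for these names; they are split across several groups that share a key.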

Anything else we need to know?:

This appears to have been a known issue for some time. It was first reported against dask 2.20 and addressed in dask/dask#6640, but it is still tracked as open in dask/dask#6723.

I am filing the bug report here because the bug has been around for a while, and it took me some time to find these issues.
Regarding what to do next, I would like to know the maintainers' opinions. The options I see are:

  • wait for the fix, keeping open the issue until then
  • also reference this issue in the tutorial
  • change the tutorial to group by numerics
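The third option assumes the problem is tied to string keys. One plausible reason, not confirmed here against the linked issues, is that CPython randomizes `str` hashes per process while integer hashes are stable, so separate worker processes could disagree on where a string key belongs. A minimal sketch of that Python behavior follows; it involves no dask at all, and `hash_in_fresh_process` is a hypothetical helper written for this illustration.

```python
import os
import subprocess
import sys


def hash_in_fresh_process(expr, seed):
    """Evaluate hash(<expr>) in a new interpreter started with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({expr}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(out.stdout)


# Two interpreters with different hash seeds stand in for two worker processes.
str_hashes = {hash_in_fresh_process("'Alice'", seed) for seed in (1, 2)}
int_hashes = {hash_in_fresh_process("42", seed) for seed in (1, 2)}

# String hashes depend on the seed; small integer hashes do not.
print("distinct str hashes:", len(str_hashes))
print("distinct int hashes:", len(int_hashes))
```

If that assumption holds, grouping by a numeric field would avoid the issue in the tutorial, at the cost of a less readable example.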

Environment:

  • Dask version: 2.20 and 2021.05.0
  • Python version: 3.8.10
  • Operating System: Ubuntu 18.04.5
  • Install method (conda): conda env create -f binder/environment.yml for 2.20 and conda update dask for 2021.05.0

Also tested using Binder from the link in the README.

gmiretti changed the title from "02_bag groupby vs foldby example fails due known bug" to "02_bag groupby vs foldby example fails due to known bug" on May 28, 2021

jsignell (Member) commented Jul 8, 2022

This notebook has been removed from the most recent version of the tutorial, so I'll go ahead and close this issue. But thank you for reporting it!

jsignell closed this as completed on Jul 8, 2022