Description
What happened:
In the notebook 02_bags.ipynb, in the groupby vs foldby example using account data, groupby doesn't group all the data and shows a different result than the equivalent foldby code.
MVCE:
%run prep.py -d accounts
import json
import os
from dask.distributed import Client
import dask.bag as db
client = Client(n_workers=4)
filename = os.path.join('data', 'accounts.*.json.gz')
lines = db.read_text(filename)
# Parse each line of JSON into a dict, as the notebook does
js = lines.map(json.loads)
# Warning, this one takes a while...
result = js.groupby(lambda item: item['name']).starmap(lambda k, v: (k, len(v))).compute()
print(sorted(result))
Shows
[('Alice', 285), ('Alice', 287), ('Alice', 308), ('Alice', 311), ('Bob', 216), ('Bob', 216), ('Bob', 234), ('Bob', 234), ('Charlie', 219), ('Charlie', 219), ('Charlie', 234), ('Charlie', 238), .... , ('Zelda', 259), ('Zelda', 259), ('Zelda', 281), ('Zelda', 284)]
What you expected to happen:
It should show the same output as this code using foldby:
# This one is comparatively fast and produces the same result.
from operator import add
def incr(tot, _):
    return tot + 1
result = js.foldby(key='name',
                   binop=incr,
                   initial=0,
                   combine=add,
                   combine_initial=0).compute()
print(sorted(result))
Output:
[('Alice', 1191), ('Bob', 900), ('Charlie', 910), ...., ('Zelda', 1083)]
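As a quick sanity check, the partial groupby counts can be summed per name and compared with the foldby totals. This is a minimal sketch in plain Python, using only the names printed in full in the two outputs above; the variable names are mine. If the sums match, groupby is splitting each group into several pieces rather than dropping records.

```python
from collections import Counter

# Partial (name, count) pairs copied from the groupby output above
# (only the names the report prints in full).
groupby_result = [
    ('Alice', 285), ('Alice', 287), ('Alice', 308), ('Alice', 311),
    ('Bob', 216), ('Bob', 216), ('Bob', 234), ('Bob', 234),
    ('Charlie', 219), ('Charlie', 219), ('Charlie', 234), ('Charlie', 238),
    ('Zelda', 259), ('Zelda', 259), ('Zelda', 281), ('Zelda', 284),
]

# Totals copied from the foldby output above.
foldby_result = {'Alice': 1191, 'Bob': 900, 'Charlie': 910, 'Zelda': 1083}

# Sum the partial counts for each name.
merged = Counter()
for name, count in groupby_result:
    merged[name] += count

print(sorted(merged.items()))
# [('Alice', 1191), ('Bob', 900), ('Charlie', 910), ('Zelda', 1083)]
print(dict(merged) == foldby_result)
# True
```

For these four names the four partial counts add up exactly to the foldby total, so each name appears to be grouped separately in four places instead of once overall.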
Anything else we need to know?:
This looks like it has been a known issue for a while. It was first reported in dask 2.20 and fixed in dask/dask#6640, but it is still open in dask/dask#6723.
I put the bug report here because the problem has been present in the tutorial for a while, and it took me some time to find these issues.
Regarding what to do next, I would like to know the maintainers' opinions. The options I see are:
- wait for the fix, keeping this issue open until then
- also reference this issue in the tutorial
- change the tutorial to group by numerics
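On the last option: my guess, which nothing in this report confirms, is that numeric keys would help because Python randomizes string hashes per process while integer hashes are fixed. The sketch below only demonstrates that Python behavior, without using dask. `hash_in_fresh_process` is a hypothetical helper, and the seeds 1 and 2 stand in for the random seeds that separate worker processes would pick.

```python
import os
import subprocess
import sys

def hash_in_fresh_process(expr, seed):
    """Return hash(<expr>) evaluated in a new interpreter with the given PYTHONHASHSEED."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    out = subprocess.run(
        [sys.executable, "-c", f"print(hash({expr}))"],
        env=env, capture_output=True, text=True, check=True,
    )
    return int(out.stdout)

# Same string key, two processes with different hash seeds.
str_hashes = {hash_in_fresh_process("'Alice'", seed) for seed in (1, 2)}
# Same integer key, same two seeds.
int_hashes = {hash_in_fresh_process("42", seed) for seed in (1, 2)}

print("distinct hashes for 'Alice':", len(str_hashes))
print("distinct hashes for 42:", len(int_hashes))
```

If the groupby shuffle does rely on hash() inside each worker process, this would explain why the same name ends up in several groups, and why grouping by a numeric field would avoid the problem in the tutorial.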
Environment:
- Dask version: 2.20 and 2021.05.0
- Python version: 3.8.10
- Operating System: Ubuntu 18.04.5
- Install method (conda): conda env create -f binder/environment.yml for 2.20, and conda update dask for 2021.05.0; also tested using Binder from the link in the Readme