In the notebook 02_bags.ipynb, in the groupby vs foldby example using the account data, groupby doesn't group all the data and shows a different result than the equivalent foldby code.
MVCE:
%run prep.py -d accounts
import json
import os
from dask.distributed import Client
import dask.bag as db
client = Client(n_workers=4)
filename = os.path.join('data', 'accounts.*.json.gz')
lines = db.read_text(filename)
# Parse each JSON line into a dict, as the notebook does before this example.
js = lines.map(json.loads)
# Warning, this one takes a while...
result = js.groupby(lambda item: item['name']).starmap(lambda k, v: (k, len(v))).compute()
print(sorted(result))
It should show the same output as this code using foldby:
# This one is comparatively fast and produces the same result.
from operator import add
def incr(tot, _):
    # Count one more item for this key; the item itself is ignored.
    return tot + 1
result = js.foldby(key='name',
                   binop=incr,
                   initial=0,
                   combine=add,
                   combine_initial=0).compute()
print(sorted(result))
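For reference, here is a minimal pure-Python sketch of the result both snippets are expected to agree on. It does not use dask. `foldby_sketch` and the sample records are hypothetical, made up for illustration, and the two-step fold only models the roles of `binop` and `combine`; it is not dask's actual implementation.

```python
from collections import Counter
from operator import add


def incr(tot, _):
    # Same counting binop as in the issue: ignore the item, bump the total.
    return tot + 1


def foldby_sketch(partitions, key, binop, initial, combine, combine_initial):
    """Pure-Python model of a foldby-style reduction (an assumption, not dask code)."""
    # Step 1: fold each partition independently into {key: partial_result}.
    partials = []
    for part in partitions:
        acc = {}
        for item in part:
            k = item[key]
            acc[k] = binop(acc.get(k, initial), item)
        partials.append(acc)
    # Step 2: merge the per-partition partial results for each key.
    merged = {}
    for acc in partials:
        for k, v in acc.items():
            merged[k] = combine(merged.get(k, combine_initial), v)
    return sorted(merged.items())


# Hypothetical records standing in for the accounts data, split into two partitions.
partitions = [
    [{"name": "Alice"}, {"name": "Bob"}, {"name": "Alice"}],
    [{"name": "Bob"}, {"name": "Alice"}],
]

counts = foldby_sketch(partitions, "name", incr, 0, add, 0)

# What a correct group-then-count over all partitions should give.
expected = sorted(Counter(item["name"] for part in partitions for item in part).items())

print(counts)
print(counts == expected)
```

A correct groupby must collect every record for a name across all partitions before counting, so its output should match `expected` here. In the report above it does not.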
This looks like it has been a known issue for a while. It was first reported in dask 2.20 and fixed in dask/dask#6640, but it is still open in dask/dask#6723.
I put the bug report here because the bug has been around for a while, and it took me some time to find these issues.
Regarding what to do next, I would like to know the maintainers' opinions. The options I see are:
wait for the fix, keeping open the issue until then
also reference this issue in the tutorial
change the tutorial to group by numerics
Environment:
Dask version: 2.20 and 2021.05.0
Python version: 3.8.10
Operating System: Ubuntu 18.04.5
Install method (conda): conda env create -f binder/environment.yml for 2.20 and conda update dask for 2021.05.0
Also tested using Binder from the link in the README
gmiretti changed the title from "02_bag groupby vs foldby example fails due known bug" to "02_bag groupby vs foldby example fails due to known bug" on May 28, 2021