[Bug]: DaskRunner GBK is broken, pending upstream fix in dask #29365
Comments
.take-issue |
I spent a little time looking at this. I found it's possible to minimally reproduce this with dask only, configured to use a With this script:

from distributed import Client, LocalCluster
from dask import bag as db
from pprint import pprint

def run():
    cluster = LocalCluster()
    client = Client(cluster)
    b = db.from_sequence([('a', 1),
                          ('a', 2),
                          ('b', 3),
                          ('b', 4)])
    grouped = b.groupby(lambda s: s[0])
    for i in range(10):
        print(i)
        pprint(list(grouped))

if __name__ == '__main__':
    run()

I get this output:
This is the same behavior I see in a simple beam groupby test. I'm not sure how to make sense of this. Some possible explanations:
Actually it looks like it must be the latter. Switching to integer keys works as expected (as you observed in the beam test):
We should probably raise an issue with dask. |
Wow @TheNeuralBit this is right on the money, perfect reproducer! I can translate this into a Dask issue, if you like? Edit: Working with your script a bit locally myself first, to see if I can narrow down the source of the bug a bit further... looking like it's something to do with the interaction between dask bag and |
As Brian discovered during our sync on this today, the Dask issues already exist:
This is indeed the root issue, resulting from the fact that Python's built-in I will ping the Dask issues now. |
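The sentence above is truncated in this copy of the thread. Assuming it refers to Python's built-in hash() being salted per interpreter process for string keys, the following stdlib-only sketch shows the effect. It does not use dask. The helper names (hash_in_fresh_process, compare_workers) are hypothetical, and the two subprocesses only stand in for two workers. PYTHONHASHSEED is pinned to two different values so the disagreement is reproducible.

```python
import os
import subprocess
import sys


def hash_in_fresh_process(key_literal: str, seed: int) -> int:
    """Return hash(<key_literal>) as computed by a new interpreter.

    PYTHONHASHSEED is set explicitly so the two simulated workers use
    different salts; by default each interpreter picks a random one.
    """
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", f"print(hash({key_literal}))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


def compare_workers(key_literal: str) -> bool:
    """True if two simulated workers agree on the hash of the key."""
    first = hash_in_fresh_process(key_literal, seed=1)
    second = hash_in_fresh_process(key_literal, seed=2)
    return first == second


if __name__ == "__main__":
    print("str key agrees across processes:", compare_workers("'a'"))
    print("int key agrees across processes:", compare_workers("1"))
```

If a shuffle picks the target partition from hash(key), workers that disagree on the hash of a string key will route the same key to different partitions, which would match the integer-versus-string difference seen in the reproducer.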
It's really surprising that not many people have run into this! I noticed the dask bag docs do nudge users away from groupby:
So I guess it is conceivable that no one is relying on this. It occurred to me that we are also partitioning with a hash in Beam, for the DataFrame API. I refreshed my memory on this and found it's using
Presumably Dask DataFrames are doing something similar and that's why we don't see this issue there. |
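For contrast, here is a minimal sketch of process-independent partitioning. The function Beam's DataFrame API actually uses is cut off in the comment above, so this is not that function, and it is not what Dask does either. It assumes a content-based digest (hashlib.blake2b over repr(key)) as a stand-in for any stable hash. stable_partition and group_by_key are hypothetical names.

```python
import hashlib
from collections import defaultdict


def stable_partition(key, npartitions: int) -> int:
    """Map a key to a partition index using a digest of its repr.

    Unlike hash(), the digest does not depend on a per-process salt,
    so every worker computes the same index for the same key. Using
    repr() is an assumption that only suits simple keys (str, int,
    None, tuples of those).
    """
    digest = hashlib.blake2b(repr(key).encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % npartitions


def group_by_key(records, npartitions: int = 2):
    """Route (key, value) pairs to partitions, then group within each."""
    partitions = defaultdict(lambda: defaultdict(list))
    for key, value in records:
        partitions[stable_partition(key, npartitions)][key].append(value)
    return {index: dict(groups) for index, groups in partitions.items()}


if __name__ == "__main__":
    records = [("a", 1), ("a", 2), ("b", 3), ("b", 4)]
    print(group_by_key(records))
```

Because the partition index depends only on the key's content, each key lands in exactly one partition no matter which process computes the index.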
Pleased to report that dask/distributed#8400 provides a path forward for this. I'll open a draft PR bumping the dask version we use in Beam (to be completed/merged following a release of |
assert_that
I've updated the issue title to reflect my current understanding of this issue. The previously mentioned dask |
What happened?
As released in 2.51.0, the DaskRunner's GBK implementation only works under certain conditions (e.g., if the key type is an integer). It does not work for many commonly-used patterns, including:
assert_that:
beam/sdks/python/apache_beam/testing/util.py, Line 298 in 07d5d53
beam/sdks/python/apache_beam/transforms/util.py, Line 258 in 07d5d53
While both of these are problematic, the latter is especially so because, given the ubiquity of assert_that in the tests, it more-or-less means we can't test (or, therefore, develop) the DaskRunner at all until this issue is fixed.
Over in cisaacstern/beam-daskrunner-gbk-bugs#1, I have developed a relatively comprehensive demonstration of both of these failure modes. Based on experiments there, it looks like the root cause of this issue relates to the partitioning of the underlying Dask Bag collection. As evidence of this, when the following implementation of Create is mocked into 2.51.0 during the test session, both of these GBK-related bugs are resolved (copied from here):
I am nearly certain that this issue is not with dask.bag.Bag.groupby, which AFAICT works fine, e.g., with partitioned collections containing string keys. Rather, this appears to have something to do with how the DaskRunner is wrapping partitioned collections.
(Note: I have tried to develop an MRE for the assert_that failure mode that does not involve assert_that, but thus far have come up empty-handed. Initially, I thought that it might have to do with using None as a key, which happens there, but the linked test cases show that, at least in a minimal example, the DaskRunner's GBK can handle None keys.)
I am happy to take this issue, as it is currently blocking me from finishing #27618. I may need to enlist the support of some more Dask-savvy folks in getting to the bottom of this; perhaps @jacobtomlinson will be keen!
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components