New Dask Arrow-based strings cause test failures #335

jakirkham · 2023-08-03T21:01:29Z

In the presence of pyarrow, dask by default assumes dataframe columns of dtype object to hold strings and converts them to pyarrow strings (see dask/dask#10139 (comment)).

This creates problems revealed by failing tests (e.g. test_dask_image/test_ndmeasure/test_find_objects.py::test_3d_find_objects)

dask-image/dask_image/ndmeasure/_utils/_find_objects.py

Lines 68 to 70 in 67540af

    meta = dd.utils.make_meta([(i, object) for i in range(ndim)])
    if isinstance(df1, Delayed):
        df1 = dd.from_delayed(df1, meta=meta)

dd.from_delayed(df1, meta=meta).compute().dtypes

Working install:

0 object
1 object
2 object
dtype: object

Failing install:

0 string[pyarrow]
1 string[pyarrow]
2 string[pyarrow]
dtype: object

The failing test had come up when releasing v2023.08.0 in conda-forge/dask-image-feedstock#14.

@jakirkham found that pyarrow is installed with the conda distribution of dask, but not when installing via pip, where it is just part of the [complete] extra.

Also @jakirkham found that the above described conflicting behaviour can be turned off using the dask configuration.

He did this for the tests performed by the dask-image conda feedstock on v2023.08.0.
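
For reference, a minimal sketch of switching the conversion off through the dask configuration. The option name `dataframe.convert-string` is the one mentioned further down in this thread; whether to set it for the whole process or only around specific code is a judgment call.

```python
import dask

# Scoped: the override only applies inside the `with` block.
with dask.config.set({"dataframe.convert-string": False}):
    scoped_value = dask.config.get("dataframe.convert-string")

# Global: the override stays in effect for the rest of the process.
dask.config.set({"dataframe.convert-string": False})
global_value = dask.config.get("dataframe.convert-string")

print(scoped_value, global_value)
```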


jakirkham · 2023-08-03T21:02:07Z

Please feel free to edit and fill this issue out Marvin 🙂

Just needed a placeholder for tracking

jakirkham · 2023-08-03T21:22:36Z

Would also be good to make a note in the issue where feedback is being collected: dask/dask#10139

Ideally with a simple reproducer

m-albert · 2023-08-03T22:58:37Z

Despite the passing tests, users who installed dask-image via conda could still run into the problem described above when using ndmeasure.find_objects (need to check whether more functions are affected).

Perhaps a suitable fix for this on the dask-image side would be to set dask.config.set({"dataframe.convert-string": False}) using a context manager around the affected functionality? See this recommendation.

Edit:
Confirmed that using with dask_config.set({'dataframe.convert-string': False}): around

dask-image/dask_image/ndmeasure/__init__.py

Lines 243 to 247 in 67540af

    bag = db.from_sequence(arrays)
    result = bag.fold(functools.partial(_find_objects, label_image.ndim), split_every=2).to_delayed()
    meta = dd.utils.make_meta([(i, object) for i in range(label_image.ndim)])
    result = delayed(compute)(result)[0]  # avoid the user having to call compute twice on result
    result = dd.from_delayed(result, meta=meta, prefix="find-objects-", verify_meta=False)

and

dask-image/dask_image/ndmeasure/_utils/_find_objects.py

Lines 68 to 74 in 67540af

    meta = dd.utils.make_meta([(i, object) for i in range(ndim)])
    if isinstance(df1, Delayed):
        df1 = dd.from_delayed(df1, meta=meta)
    if isinstance(df2, Delayed):
        df2 = dd.from_delayed(df2, meta=meta)
    ddf = dd.merge(df1, df2, how="outer", left_index=True, right_index=True)
    result = ddf.apply(_merge_bounding_boxes, ndim=ndim, axis=1, meta=meta)

fixes the errors.
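
One way this wrapping could be packaged is as a small decorator, sketched below. `without_string_conversion` and `current_setting` are hypothetical names, not something that exists in dask-image. Both code paths above would need it, presumably because the setting has to be in effect wherever `dd.from_delayed` is called, including inside functions that only run at compute time.

```python
import functools

import dask


def without_string_conversion(func):
    """Hypothetical helper: run `func` with dask's string conversion disabled."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # The override is only active for the duration of the call.
        with dask.config.set({"dataframe.convert-string": False}):
            return func(*args, **kwargs)

    return wrapper


@without_string_conversion
def current_setting():
    # Stand-in for the affected code; it just reports the active setting.
    return dask.config.get("dataframe.convert-string")


print(current_setting())
```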

GenevieveBuckley · 2023-08-04T02:20:54Z

> Would also be good to make a note in issue (where feedback is being collected): dask/dask#10139
>
> Ideally with a simple reproducer

I've made a comment here, but no reproducer (I'm not planning to do more work on this, it's open for anyone who wants it)

jakirkham · 2023-08-04T06:33:16Z

Yeah think we are not seeing this in CI as it requires a newer version of Dask than we are testing. Perhaps we should upgrade one of the CI environments (like 3.11) to a very recent Dask version

Tbh I've not looked deeply into the Dask Arrow work. Have heard about it mainly in passing. So not sure how dask.config handles it

Should add this pain point is not unique to us. We had to disable this feature in Dask-SQL recently as well ( dask-contrib/dask-sql#1206 ). Unclear whether this is due to upstream bugs or if we need to make changes

GenevieveBuckley · 2023-08-08T00:25:33Z

> Yeah think we are not seeing this in CI as it requires a newer version of Dask than we are testing. Perhaps we should upgrade one of the CI environments (like 3.11) to a very recent Dask version

We could add an "upstream" CI environment, that just uses whatever the latest (or even pre-release?) versions are, maybe?

jakirkham · 2023-08-08T00:39:20Z

There are Dask nightly packages. So that would be easy to add

m-albert · 2023-08-09T10:37:41Z

As far as I understand, to reproduce the test failure in CI we'd need, in addition to a recent dask version, arrow>=7 present in the environment. This is a regular dependency of the conda dask distribution, but not of the PyPI one.
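
A quick way to check whether a given environment meets these conditions might look like the following. This is just a sketch using the standard library; `installed_version` is a made-up helper name, and the `>=7` threshold is taken from the comment above rather than verified against dask.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report what is installed; a missing pyarrow would explain passing tests.
for name in ("dask", "pyarrow"):
    print(name, installed_version(name))
```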

jakirkham · 2024-02-26T19:48:37Z

Is this still an issue with recent Dask releases? Asking as they may have fixed something upstream since this occurred

GenevieveBuckley · 2024-02-27T02:57:32Z

I'm not seeing any flaky/failing tests, so I don't think this is still happening currently. I'll close the issue, and if it pops up again we can re-open.

jakirkham · 2024-02-27T04:06:32Z

Yeah I think part of the issue before was CI doesn't capture this edge case. Though maybe it should

m-albert · 2024-03-08T13:00:29Z

Reopening as this issue just came up in #355.

jakirkham · 2024-05-21T05:09:12Z

Think this may have been fixed in the intervening time. The test suite no longer fails for me

jakirkham mentioned this issue Aug 3, 2023

dask-image v2023.8.1 conda-forge/dask-image-feedstock#14

Merged

jakirkham changed the title ~~New Arrow strings cause test failures~~ New Dask Arrow-based strings cause test failures Aug 3, 2023

GenevieveBuckley mentioned this issue Aug 4, 2023

[FEEDBACK] User experience with arrow strings dask/dask#10139

Open

GenevieveBuckley closed this as completed Feb 27, 2024

m-albert mentioned this issue Mar 8, 2024

find_objects throws AttributeError #355

Closed

m-albert reopened this Mar 8, 2024

jakirkham mentioned this issue May 21, 2024

Update to 2024.5.1 conda-forge/dask-image-feedstock#30

Merged

m-albert mentioned this issue Jul 23, 2024

fix KeyError: "None of [Index(['0_x', '1_x', '0_y', '1_y'], dtype='object')] are in the [columns]" in find_objects #384

Open
