Add Python 3.12 support #1317
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
@@            Coverage Diff             @@
##             main    #1317      +/-   ##
==========================================
- Coverage   85.63%   85.60%   -0.04%
==========================================
  Files          51       51
  Lines       11286    11287       +1
==========================================
- Hits         9665     9662       -3
- Misses       1621     1625       +4 |
Tests passed, but I get this warning in Python 3.12:
|
Things I have done in this PR, other than upgrading to Python 3.12:
|
Do you have a reproducer for the invalid line? |
Searching that file for the text |
One option would be to convert to a legacy dask dataframe, where all of those things would continue working, at least until we entirely remove that system. |
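A hedged sketch of that option, assuming the dask-expr builds of the time expose a to_legacy_dataframe method (the method name is my assumption; check the installed version):
import dask
dask.config.set({"dataframe.query-planning": True})

import pandas as pd
import dask.dataframe as dd

ddf = dd.from_pandas(pd.DataFrame({"x": range(4)}), npartitions=2)

# Assumption: dask-expr collections provide `to_legacy_dataframe`, which
# returns a classic dask.dataframe collection backed by a plain task graph.
legacy = ddf.to_legacy_dataframe()
print(type(legacy))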
An MRE for the dask keys could be this:
import dask.dataframe as dd
import pandas as pd
import numpy as np
import datashader as ds
df = pd.DataFrame(
    {
        "x": np.array([0.0] * 10 + [1] * 10),
        "y": np.array([0.0] * 5 + [1] * 5 + [0] * 5 + [1] * 5),
        "i32": np.arange(20, dtype="i4"),
    }
)
ddf = dd.from_pandas(df, npartitions=2)
ddf = ddf.repartition(npartitions=2)  # If I comment out this line, it does not error.
c = ds.Canvas(plot_width=2, plot_height=2, x_range=(0, 1), y_range=(0, 1))
c.points(ddf, "x", "y", ds.count("i32")) |
Thanks @hoxbro. The issue at play here is as follows:
Both change the expression (an optimization step), which makes the keys in the graph no longer match (see the diagnostic sketch after this comment). There are different ways of addressing this:
|
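To make the key mismatch concrete, here is a small diagnostic sketch (my illustration, not code from this PR) that checks whether the keys recorded from a collection are still present in the graph that is later executed; with the dask-expr version discussed here and the repartition step from the MRE above, this is where they diverged:
import numpy as np
import pandas as pd
import dask.dataframe as dd

df = pd.DataFrame({"x": np.arange(4.0), "y": np.arange(4.0)})
ddf = dd.from_pandas(df, npartitions=2).repartition(npartitions=2)

keys = ddf.__dask_keys__()    # keys recorded up front, as datashader does
graph = ddf.__dask_graph__()  # graph that is later optimized and executed

# Diagnostic: are the recorded keys still present in the graph?
print([k in graph for k in keys])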
I will push it and see how the CI likes it. However, I can see at least one problem still exists:
import dask
dask.config.set({"dataframe.query-planning": True})
import geopandas as gpd
import dask.dataframe as dd
import spatialpandas as sp
from geodatasets import get_path
world = sp.GeoDataFrame(gpd.read_file(get_path("nybb")))
type(dd.from_pandas(world, npartitions=2))  # dask_expr._collection.DataFrame
Edit: I ran one cell, restarted the kernel, and then ran the second. |
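My reading of that edit (stated as an assumption, not confirmed in the thread): the query-planning setting is read when dask.dataframe is first imported, so it must be set before that import, hence the kernel restart between cells:
import dask
dask.config.set({"dataframe.query-planning": False})  # must run before the import below

import dask.dataframe as dd  # picks up the setting at first import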
There seems to be one other type of failing test:
import dask.dataframe as dd
import pandas as pd
import datashader as ds
df = pd.DataFrame({"x": [0] * 5, "y": [0] * 5})
ddf = dd.from_pandas(df, npartitions=2)
cvs = ds.Canvas(plot_width=10, plot_height=10)
agg = cvs.line(ddf, "x", "y", line_width=0, agg=ds.reductions.any())
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
Cell In[5], line 9
6 ddf = dd.from_pandas(df, npartitions=2)
8 cvs = ds.Canvas(plot_width=10, plot_height=10)
----> 9 agg = cvs.line(ddf, "x", "y", line_width=0, agg=ds.reductions.any())
File ~/projects/holoviz/repos/datashader/datashader/core.py:472, in Canvas.line(self, source, x, y, agg, axis, geometry, line_width, antialias)
464 if not isinstance(non_cat_agg, (
465 rd.any, rd.count, rd.max, rd.min, rd.sum, rd.summary, rd._sum_zero,
466 rd._first_or_last, rd.mean, rd.max_n, rd.min_n, rd._first_n_or_last_n,
467 rd._max_or_min_row_index, rd._max_n_or_min_n_row_index, rd.where,
468 )):
469 raise NotImplementedError(
470 f"{type(non_cat_agg)} reduction not implemented for antialiased lines")
--> 472 return bypixel(source, self, glyph, agg, antialias=glyph.antialiased)
File ~/projects/holoviz/repos/datashader/datashader/core.py:1335, in bypixel(source, canvas, glyph, agg, antialias)
1333 with warnings.catch_warnings():
1334 warnings.filterwarnings('ignore', r'All-NaN (slice|axis) encountered')
-> 1335 return bypixel.pipeline(source, schema, canvas, glyph, agg, antialias=antialias)
File ~/projects/holoviz/repos/datashader/datashader/utils.py:114, in Dispatcher.__call__(self, head, *rest, **kwargs)
112 typ = type(head)
113 if typ in lk:
--> 114 return lk[typ](head, *rest, **kwargs)
115 for cls in getmro(typ)[1:]:
116 if cls in lk:
File ~/projects/holoviz/repos/datashader/datashader/data_libraries/dask.py:51, in dask_pipeline(df, schema, canvas, glyph, summary, antialias, cuda)
48 graph = df.__dask_graph__()
50 dsk.update(optimize(graph, keys))
---> 51 return scheduler(dsk, name)
File ~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/threaded.py:90, in get(dsk, keys, cache, num_workers, pool, **kwargs)
87 elif isinstance(pool, multiprocessing.pool.Pool):
88 pool = MultiprocessingPoolExecutor(pool)
---> 90 results = get_async(
91 pool.submit,
92 pool._max_workers,
93 dsk,
94 keys,
95 cache=cache,
96 get_id=_thread_get_id,
97 pack_exception=pack_exception,
98 **kwargs,
99 )
101 # Cleanup pools associated to dead threads
102 with pools_lock:
File ~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py:512, in get_async(submit, num_workers, dsk, result, cache, get_id, rerun_exceptions_locally, pack_exception, raise_exception, callbacks, dumps, loads, chunksize, **kwargs)
510 _execute_task(task, data) # Re-execute locally
511 else:
--> 512 raise_exception(exc, tb)
513 res, worker_id = loads(res_info)
514 state["cache"][key] = res
File ~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py:320, in reraise(exc, tb)
318 if exc.__traceback__ is not tb:
319 raise exc.with_traceback(tb)
--> 320 raise exc
File ~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/local.py:225, in execute_task(key, task_info, dumps, loads, get_id, pack_exception)
223 try:
224 task, data = loads(task_info)
--> 225 result = _execute_task(task, data)
226 id = get_id()
227 result = dumps((result, id))
File ~/miniconda3/envs/holoviz/lib/python3.12/site-packages/dask/core.py:127, in _execute_task(arg, cache, dsk)
123 func, args = arg[0], arg[1:]
124 # Note: Don't assign the subtask results to a variable. numpy detects
125 # temporaries by their reference count and can execute certain
126 # operations in-place.
--> 127 return func(*(_execute_task(a, cache) for a in args))
128 elif not ishashable(arg):
129 return arg
File ~/projects/holoviz/repos/datashader/datashader/data_libraries/dask.py:245, in line.<locals>.chunk(df, df2)
243 plot_start = True
244 if df2 is not None:
--> 245 df = concat([df.iloc[-1:], df2])
246 plot_start = False
247 aggs = create(shape)
AttributeError: 'tuple' object has no attribute 'iloc' |
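For context, a toy graph (my illustration, not datashader's actual code) that reproduces the mechanism behind this AttributeError: when a task references a key that is not present in the graph, dask passes the key tuple through as a literal argument:
import pandas as pd
from dask.threaded import get

def make_part():
    return pd.DataFrame({"x": [1, 2]})

def tail(df):
    return df.iloc[-1:]

dsk = {
    ("part", 0): (make_part,),
    # ("part", 1) is not a key in this graph (as if an optimizer had
    # renamed it away), so dask hands the tuple itself to `tail`:
    "out": (tail, ("part", 1)),
}

get(dsk, "out")  # AttributeError: 'tuple' object has no attribute 'iloc'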
That's not as trivial as this one. There are two options:
What's the priority for getting this to pass? |
I'll have a look at the geopandas PR and see if I can crib from them. In the long run we definitely want full compatibility, but we will make a call on whether to release with a short-term patch based on the effort required. |
I think the only test failing now is related to spatialpandas's Dask support, which likely needs to be updated in spatialpandas itself. Because of this, the environment variable is needed for now. The last failing test outside of that was related to attribute assignment being ignored with dask-expr, whereas it works with classic dask.dataframe; see the MRE below. It seems to be related to dask/dask-expr#785, and I can open an issue upstream if needed.
import dask
import importlib
import pandas as pd
import dask.dataframe as dd
for b in [True, False]:
    print("\ndataframe.query-planning =", b)
    dask.config.set({"dataframe.query-planning": b})
    dd = importlib.reload(dd)

    df = pd.DataFrame(data=dict(cat1=["a", "b", "c", "a", "b"], cat2=["a", "b", "c", "a", "b"]))
    ddf = dd.from_pandas(df, npartitions=2)
    ddf["cat1"] = ddf["cat1"].astype("category")
    ddf.cat2 = ddf.cat2.astype("category")
    print(ddf.dtypes) |
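A possible workaround (my suggestion, not something settled in this thread) is to stick to item assignment, which both backends honor:
# Continuing the MRE above: item assignment is respected by both classic
# dask.dataframe and dask-expr, while attribute assignment is silently
# ignored by dask-expr (see dask/dask-expr#785):
ddf["cat2"] = ddf["cat2"].astype("category")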
This was fixed on Friday, fortunately. |
Wow, thanks everybody! It takes a village! |
With numba now supporting Python 3.12, we should also add support for it.
We need conda-build to be released on conda-forge before we can solve on conda.
Builds on top of #1314