Recursive tokenization #3515
Conversation
xref pydata/sparse#300
```python
# Test DataArray and Variable
da_a = DataArray(a)
da_b = DataArray(b)
assert dask.base.tokenize(da_a) != dask.base.tokenize(da_b)
```
Just so I'm clear, these are two different (but equal) xarray objects, so we don't want dask to think that they are the same?
They aren't equal: `da_b[5000] == 2`, while `da_a[5000] == 1`.
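For context, a minimal sketch of the kind of setup this implies; the construction of `a` and `b` is an assumption, only the inequality at index 5000 comes from the discussion:

```python
import dask.base
import numpy as np
from xarray import DataArray

# Hypothetical arrays matching the discussion: identical everywhere
# except at index 5000 (da_a[5000] == 1, da_b[5000] == 2).
a = np.ones(10000)
b = a.copy()
b[5000] = 2

da_a = DataArray(a)
da_b = DataArray(b)

# Equal metadata but different data must yield different tokens;
# otherwise dask could wrongly reuse cached results across the two.
assert dask.base.tokenize(da_a) != dask.base.tokenize(da_b)
```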
```diff
@@ -856,6 +856,10 @@ def test_dask_token():
     import dask

+    s = sparse.COO.from_numpy(np.array([0, 0, 1, 2]))
```
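For context, a hedged sketch of how the sparse-backed object from this diff is presumably exercised; everything beyond the `s = ...` line is an assumption about the surrounding test:

```python
import dask.base
import numpy as np
import sparse
from xarray import DataArray

s = sparse.COO.from_numpy(np.array([0, 0, 1, 2]))
a = DataArray(s)

# COO.__array__ raises rather than silently densifying, so this call
# would fail if tokenize() internally touched a.values; succeeding
# here shows tokenization never coerces the backend to numpy.
token = dask.base.tokenize(a)
```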
`xfail` this instead?
Yeah, that was an option, but it would have completely disabled the test, and the test is really about not accidentally invoking `self.values` (which in turn invokes `self._variable._data.__array__()`) on NEP18-compatible backends. It is not really about testing sparse specifically; it just conveniently relies on the fact that `COO.__array__` raises an exception. A more correct, but also more laborious, test could have created a NEP18-compatible dummy class on the fly (see the sketch below).
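For illustration, a minimal sketch of such a dummy class, assuming only the NEP 18 protocol; the name `NEP18Stub` and its behaviour are hypothetical, not the test that was actually written:

```python
import numpy as np

class NEP18Stub:
    """Hypothetical duck array: NEP18-compatible, but any attempt to
    coerce it to a plain numpy array (via __array__) raises."""

    def __init__(self, shape, dtype=float):
        self.shape = shape
        self.dtype = np.dtype(dtype)
        self.ndim = len(shape)

    def __array__(self, dtype=None):
        raise RuntimeError("implicit conversion to numpy is forbidden")

    def __array_function__(self, func, types, args, kwargs):
        return NotImplemented

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        return NotImplemented
```

Wrapping such an object in a `DataArray` and tokenizing it would then fail loudly on any accidental `__array__` call, without depending on sparse at all.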
* upstream/master:
  Allow appending datetime & boolean variables to zarr stores (pydata#3504)
  warn if dim is passed to rolling operations. (pydata#3513)
  Deprecate allow_lazy (pydata#3435)
  Recursive tokenization (pydata#3515)
  format indexing.rst code with black (pydata#3511)
* upstream/master:
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  Allow appending datetime & boolean variables to zarr stores (pydata#3504)
  warn if dim is passed to rolling operations. (pydata#3513)
  Deprecate allow_lazy (pydata#3435)
  Recursive tokenization (pydata#3515)
* upstream/master: (22 commits)
  Added fill_value for unstack (pydata#3541)
  Add DatasetGroupBy.quantile (pydata#3527)
  ensure rename does not change index type (pydata#3532)
  Leave empty slot when not using accessors
  interpolate_na: Add max_gap support. (pydata#3302)
  units & deprecation merge (pydata#3530)
  Fix set_index when an existing dimension becomes a level (pydata#3520)
  add Variable._replace (pydata#3528)
  Tests for module-level functions with units (pydata#3493)
  Harmonize `FillValue` and `missing_value` during encoding and decoding steps (pydata#3502)
  FUNDING.yml (pydata#3523)
  Allow appending datetime & boolean variables to zarr stores (pydata#3504)
  warn if dim is passed to rolling operations. (pydata#3513)
  Deprecate allow_lazy (pydata#3435)
  Recursive tokenization (pydata#3515)
  format indexing.rst code with black (pydata#3511)
  add missing pint integration tests (pydata#3508)
  DOC: update bottleneck repo url (pydata#3507)
  add drop_sel, drop_vars, map to api.rst (pydata#3506)
  remove syntax warning (pydata#3505)
  ...
After misreading the dask documentation (https://docs.dask.org/en/latest/custom-collections.html#deterministic-hashing), I was under the impression that the output of `__dask_tokenize__` would be recursively parsed, as happens for `__getstate__` or `__reduce__`. That's not the case: the output of `__dask_tokenize__` is just fed into `str()`, so it has to be made explicitly recursive!
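As a concrete illustration of "explicitly recursive" (a sketch under assumptions: `Wrapped` is a made-up container, while `dask.base.normalize_token` is the real dispatcher dask exposes for this):

```python
import numpy as np
from dask.base import normalize_token, tokenize

class Wrapped:
    # Hypothetical container, used only to illustrate the point.
    def __init__(self, data):
        self.data = data

    def __dask_tokenize__(self):
        # The return value is not tokenized recursively; it is just
        # str()-ed. Normalizing the payload explicitly makes the token
        # reflect the actual array contents rather than its repr.
        return normalize_token((type(self).__name__, self.data))

a = Wrapped(np.array([1, 2, 3]))
b = Wrapped(np.array([1, 2, 4]))
assert tokenize(a) != tokenize(b)  # contents differ -> tokens differ
```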