Add test variants with compressible data #714

Merged 9 commits on Mar 22, 2023


@crusaderky crusaderky commented Mar 13, 2023

Closes #696
Run the following tests twice, once with uncompressible data and once with highly compressible data, to highlight how the network stack behaves differently in the two use cases:

  • test_anom_mean
  • test_climatic_mean (currently skipped)
  • test_vorticity
  • test_double_diff
  • test_dot_product
  • test_map_overlap_sample
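
As a rough illustration of the two variants (not the code in this PR), the sketch below builds one uncompressible and one highly compressible payload and reports how far each shrinks. `make_payload`, `compressed_fraction` and `run_variants` are hypothetical names, and `zlib` stands in for whichever compressor the cluster actually uses. In a pytest suite the same boolean flag would typically be supplied through `pytest.mark.parametrize`.

```python
import os
import zlib


def make_payload(nbytes, compressible):
    """Hypothetical helper: build a payload of nbytes bytes for one variant."""
    if compressible:
        # Constant bytes: about as compressible as data gets.
        return b"\x01" * nbytes
    # Random bytes: effectively uncompressible.
    return os.urandom(nbytes)


def compressed_fraction(payload):
    """Compressed size divided by original size (zlib, default level)."""
    return len(zlib.compress(payload)) / len(payload)


def run_variants(nbytes=1_000_000):
    """Run the same measurement once per variant, keyed by the flag."""
    return {
        flag: compressed_fraction(make_payload(nbytes, flag))
        for flag in (False, True)
    }


results = run_variants()
for flag, fraction in results.items():
    print(f"compressible={flag}: compressed to {fraction:.1%} of original size")
```
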

This PR increases the overall runtime from 48min to 50min.

I've deliberately not touched test_basic_sum, which is always compressible, or test_rechunk_*, which are always uncompressible. They already have a fair number of permutations, and I didn't feel that doubling everything would be worth it; the cost would be in readability more than in runtime.

@crusaderky crusaderky self-assigned this Mar 13, 2023
@crusaderky crusaderky marked this pull request as ready for review March 13, 2023 15:33

crusaderky commented Mar 14, 2023

I ran an A/B test on distributed#7593 and I'm observing a very modest (5%) but consistent speedup in test_filter_then_average. The test uses data that is compressible at 37%. The other tests show no change at all, including those running on data that is compressible at 99%. The reason is that the data compressible at 37% takes 140ms per chunk to compress, whereas identically sized data full of ones takes 14ms per chunk.
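
A minimal sketch of how one might compare the two kinds of data locally, under loud assumptions: `mixed_payload` and `measure` are hypothetical helpers, `zlib` is a stand-in for the real compressor, and "partly compressible" is modelled crudely as a random block followed by a block of ones. It is not meant to reproduce the 140ms and 14ms figures above, only to show the kind of measurement involved.

```python
import os
import time
import zlib


def mixed_payload(nbytes, random_fraction):
    """Hypothetical generator: a random block followed by a block of ones."""
    n_random = int(nbytes * random_fraction)
    return os.urandom(n_random) + b"\x01" * (nbytes - n_random)


def measure(payload):
    """Return (compressed size / original size, seconds spent compressing)."""
    start = time.perf_counter()
    compressed = zlib.compress(payload)
    elapsed = time.perf_counter() - start
    return len(compressed) / len(payload), elapsed


# Same size, very different content.
ones_ratio, ones_seconds = measure(mixed_payload(4_000_000, 0.0))
mixed_ratio, mixed_seconds = measure(mixed_payload(4_000_000, 0.6))

print(f"all ones: ratio={ones_ratio:.3f}, time={ones_seconds * 1000:.1f}ms")
print(f"mixed:    ratio={mixed_ratio:.3f}, time={mixed_seconds * 1000:.1f}ms")
```

Timings depend on the machine and the compressor, so only the compressed-size ratios are stable enough to compare between runs.
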

I need to scrap the current algorithm and synthetically create something similar to the zarr dataset.

@crusaderky
Contributor Author

Holy guacamole 😱

image

@crusaderky
Contributor Author

This is ready for review and merge.
The findings are discussed in dask/distributed#7655.

Contributor

@milesgranger milesgranger left a comment

LGTM. 👍

@crusaderky crusaderky merged commit 0717e4b into main Mar 22, 2023
@crusaderky crusaderky deleted the guido/compressible branch March 22, 2023 13:33
@crusaderky
Contributor Author

ty @milesgranger for review
