Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Run on zarr-python v3 with zarr-format v2 #288

Merged
merged 5 commits into from
Jan 14, 2025

Conversation

tomwhite
Copy link
Contributor

This is like #284, but the Zarr format is always v2, even when zarr-python v3 is being used. This is the approach I took for the Zarr v3 support in sgkit. This limits the scope of the changes, however there are still test failures under both zarr-python v2 and v3.

This still needs quite a lot of work, so I've opened it as a draft.

@jeromekelleher
Copy link
Contributor

Nice 👍

@coveralls
Copy link
Collaborator

coveralls commented Dec 3, 2024

Coverage Status

coverage: 98.795% (-0.08%) from 98.87%
when pulling b82fdc4 on tomwhite:zarr-python-v3-with-v2-format
into 883a37e on sgkit-dev:main.

@tomwhite
Copy link
Contributor Author

tomwhite commented Dec 3, 2024

I rebased this branch on main after #292 went in.

The failures are due to a problem with non-contiguous arrays that is fixed in zarr-developers/zarr-python#2515, but that hasn't been released yet.

The other thing I noticed is that the validation tests are a lot slower running on v3 than v2

With this PR and zarr-python v2:

pytest -vs -k "test_by_validating" --durations=20
18.65s call     tests/test_vcf_examples.py::test_by_validating[field_type_combos.vcf.gz]
2.90s call     tests/test_vcf_examples.py::test_by_validating[chr_m_indels.vcf.gz]
2.33s call     tests/test_vcf_examples.py::test_by_validating[1kg_2020_chrM.vcf.gz]
0.29s call     tests/test_vcf_examples.py::test_by_validating[sample_old_tabix.vcf.gz]
0.29s call     tests/test_vcf_examples.py::test_by_validating[sample.vcf.gz]
0.29s call     tests/test_vcf_examples.py::test_by_validating_split[sample.vcf.gz-3.split-files0]
0.29s call     tests/test_vcf_examples.py::test_by_validating_split[sample.vcf.gz-3.split-files1]
0.27s call     tests/test_vcf_examples.py::test_by_validating[issue_251.vcf.gz]
0.13s call     tests/test_vcf_examples.py::test_by_validating[sample_no_genotypes.vcf.gz]
0.12s call     tests/test_plink.py::test_by_validating[0-10-1]
0.12s call     tests/test_vcf_examples.py::test_by_validating[sample_no_genotypes_with_gt_header.vcf.gz]
0.08s call     tests/test_vcf_examples.py::test_by_validating_split[out_of_order_contigs.vcf.gz-2.split-files4]
0.08s call     tests/test_vcf_examples.py::test_by_validating_split[out_of_order_contigs.vcf.gz-2.split-files3]
0.08s call     tests/test_vcf_examples.py::test_by_validating_split[out_of_order_contigs.vcf.gz-2.split-files2]
0.08s call     tests/test_vcf_examples.py::test_by_validating[out_of_order_contigs.vcf.gz]
0.06s call     tests/test_plink.py::test_by_validating[0-3-10]
0.04s call     tests/test_plink.py::test_by_validating[0-33-3]
0.03s call     tests/test_[plink.py](https://plink.py/)::test_by_validating[0-10-10]
0.03s call     tests/test_plink.py::test_by_validating[0-99-10]

With this PR and zarr-python v3:

pytest -vs -k "test_by_validating and not field_type_combos" --durations=20
148.06s call     tests/test_vcf_examples.py::test_by_validating[chr_m_indels.vcf.gz]
22.24s call     tests/test_vcf_examples.py::test_by_validating[1kg_2020_chrM.vcf.gz]
0.61s call     tests/test_vcf_examples.py::test_by_validating[sample.vcf.gz]
0.60s call     tests/test_vcf_examples.py::test_by_validating_split[sample.vcf.gz-3.split-files0]
0.59s call     tests/test_vcf_examples.py::test_by_validating[sample_old_tabix.vcf.gz]
0.59s call     tests/test_vcf_examples.py::test_by_validating_split[sample.vcf.gz-3.split-files1]
0.32s call     tests/test_plink.py::test_by_validating[0-10-1]
0.32s call     tests/test_vcf_examples.py::test_by_validating[issue_251.vcf.gz]
0.24s call     tests/test_vcf_examples.py::test_by_validating[sample_no_genotypes.vcf.gz]
0.17s call     tests/test_vcf_examples.py::test_by_validating[sample_no_genotypes_with_gt_header.vcf.gz]
0.15s call     tests/test_plink.py::test_by_validating[0-3-10]
0.13s call     tests/test_vcf_examples.py::test_by_validating_split[out_of_order_contigs.vcf.gz-2.split-files4]
0.13s call     tests/test_vcf_examples.py::test_by_validating_split[out_of_order_contigs.vcf.gz-2.split-files2]
0.13s call     tests/test_vcf_examples.py::test_by_validating_split[out_of_order_contigs.vcf.gz-2.split-files3]
0.12s call     tests/test_vcf_examples.py::test_by_validating[out_of_order_contigs.vcf.gz]
0.09s call     tests/test_plink.py::test_by_validating[0-33-3]
0.07s call     tests/test_plink.py::test_by_validating[0-10-10]
0.04s call     tests/test_plink.py::test_by_validating[0-99-10]

I'll see if I can find out what is causing this slowdown.

@tomwhite
Copy link
Contributor Author

tomwhite commented Dec 3, 2024

I'll see if I can find out what is causing this slowdown.

It's because in v2 zarr.Array implements __iter__, which caches chunks, whereas v3 does not implement __iter__ so the chunk is retrieved for each element being iterated over.

Here's one of the places where this is happening:

format_fields[vcf_name] = vcf_type, vcf_number, iter(root[colname])

@jeromekelleher
Copy link
Contributor

Thanks for tracking that one down @tomwhite - I added the original __iter__ functionality to Zarr FWIW

@tomwhite tomwhite force-pushed the zarr-python-v3-with-v2-format branch from e38859f to b866285 Compare January 7, 2025 11:18
@tomwhite
Copy link
Contributor Author

tomwhite commented Jan 7, 2025

I updated this to reflect all the changes upstream in Zarr (there have been a lot in the last month!). The main changes are:

There is one failing test test_double_encode_partition, where the serialized size of Zarr chunks changes when written for a second time. It doesn't seem to be caused by non-deterministic compression (zarr-developers/zarr-python#1519), since I still see it when disabling compression, so this will need further investigation.

@tomwhite
Copy link
Contributor Author

tomwhite commented Jan 7, 2025

  • Replace group.array() with group.create_dataset(), since the former no longer seems to work in v3.

Opened zarr-developers/zarr-python#2667

@tomwhite
Copy link
Contributor Author

There is one failing test test_double_encode_partition, where the serialized size of Zarr chunks changes when written for a second time. It doesn't seem to be caused by non-deterministic compression (zarr-developers/zarr-python#1519), since I still see it when disabling compression, so this will need further investigation.

After some hunting I tracked this down, see zarr-developers/zarr-python#2696.

In the meantime, I have excluded test_double_encode_partition from the Zarr testing, so we can go ahead and merge this.

@tomwhite tomwhite marked this pull request as ready for review January 13, 2025 16:55
Copy link
Contributor

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@jeromekelleher jeromekelleher merged commit bb380eb into sgkit-dev:main Jan 14, 2025
16 checks passed
@tomwhite tomwhite mentioned this pull request Feb 4, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants