Add some Zarr-based datatypes #19040

Draft
wants to merge 37 commits into
base: dev

davelopez
Contributor

@davelopez davelopez commented Oct 22, 2024

Requires #17614

Includes the following datatypes:

General Zarr datatypes

  • CompressedZarrZipArchive (zarr.zip): represents a Zarr ZipStore. It seems to have some limitations (it doesn't behave exactly like a DirectoryStore), or I couldn't make it work 100%, but it might be useful in some cases.
  • ZarrDirectory (zarr): represents a Zarr DirectoryStore. Contains the Zarr structure in the extra_files_path of the dataset. I wonder if there is a way to make the "main" dataset point to the actual root folder in extra_files, like a symlink, but I don't know if this really makes sense.
  • ZarrRemoteUri (zarr_uri): represents a remote URI to a Zarr store. I'm not sure this is the best way to handle this; it feels more like a "wonky workaround". It is just a text file with the URL (without the protocol, otherwise Galaxy will try to download it as a pasted URL) pointing to a remote S3 bucket containing a Zarr store (some examples can be found at https://www.openmicroscopy.org/2020/11/04/zarr-data.html). Obviously, this might not work if authentication is involved, so there must be a better way to handle those cases.
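
Since a zarr_uri dataset stores the URL without its protocol, a consuming tool probably has to add one back before opening the store. A minimal sketch of that step, assuming plain HTTPS access; `to_store_url` and its `default_scheme` parameter are hypothetical names for illustration and are not part of this PR:

```python
def to_store_url(raw: str, default_scheme: str = "https") -> str:
    """Turn the text content of a zarr_uri dataset into a full URL.

    Hypothetical helper: the scheme to prepend is an assumption, since the
    dataset itself deliberately stores the URI without one.
    """
    uri = raw.strip()
    if not uri:
        raise ValueError("empty Zarr URI")
    if "://" in uri:
        # A scheme is already present, so leave the URI untouched.
        return uri
    return f"{default_scheme}://{uri}"


# URI taken from the manual testing instructions in this PR.
print(to_store_url("uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0062A/6001240.zarr\n"))
```

The resulting string could then be handed to whatever opens the store; stores that need credentials would still require extra handling.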

OME-Zarr datatypes

  • CompressedOMEZarrZipArchive (ome_zarr.zip): Similar to CompressedZarrZipArchive but expects to find an OME/METADATA.ome.xml file in the store root so it can be easily converted/extracted to an OMEZarr directory.
  • OMEZarr (ome_zarr): Similar to ZarrDirectory but identifies this datatype as an OME-Zarr image.
  • OMEZarrRemoteUri (ome_zarr_uri): A subclass of ZarrRemoteUri just to be specific about this uri being an OME-Zarr remote uri.
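
The check that separates an ome_zarr.zip from a plain zarr.zip could look roughly like the sketch below. It uses only the standard library and assumes the store root is the archive root; `looks_like_ome_zarr_zip` is a hypothetical name, and the actual sniffing code in this PR may differ:

```python
import io
import zipfile

# Path named in the datatype description above.
OME_METADATA_PATH = "OME/METADATA.ome.xml"


def looks_like_ome_zarr_zip(source) -> bool:
    """Return True if the zip archive has the OME metadata file at its root.

    Hypothetical helper; `source` may be a file path or a binary file object.
    """
    try:
        with zipfile.ZipFile(source) as archive:
            return OME_METADATA_PATH in archive.namelist()
    except zipfile.BadZipFile:
        return False


# Build a tiny in-memory archive to exercise the check.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(".zgroup", '{"zarr_format": 2}')
    archive.writestr(OME_METADATA_PATH, "<OME/>")

print(looks_like_ome_zarr_zip(buffer))
```

An archive whose store sits inside a top-level folder would fail this check, so a real implementation may need to locate the store root first.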

How to test the changes?

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    • For testing the "directory" variant datatype:
    • For testing the "remote URI" variant datatype:
      • Upload a text file or create a new one pasting the following URI as content: uk1s3.embassy.ebi.ac.uk/idr/zarr/v0.4/idr0062A/6001240.zarr. Notice there is no protocol (https://). You can find more examples here.
      • Make sure the dataset with the URI is detected as zarr_uri or ome_zarr_uri, or set the type manually in the upload.
    • You can use this "simple" tool wrapper to test or make your own:
<tool id="zarr_test" name="Zarr Test" version="2.18.3+galaxy0" profile="23.0">
    <description>test wrapper for zarr format</description>
    <requirements>
        <requirement type="package" version="2.18.3">zarr</requirement>
        <requirement type="package" version="2024.9.0">s3fs</requirement>
    </requirements>
    <command detect_errors="exit_code"><![CDATA[
        python $script
    ]]></command>
    <configfiles>
        <configfile name="script"><![CDATA[
import zarr
import s3fs
import zipfile
from sys import stdout

input_store = None
if '$zarrinput.extension' == 'zarr':
    print('Using local directory store as input')
    # TODO: Is there a way for the dataset file to directly reference the extra_files_path?
    input_store = '$zarrinput.extra_files_path/$zarrinput.metadata.store_root'
elif '$zarrinput.extension' == 'zarr_uri':
    print('Using remote S3 store as input')
    input_store = '$zarrinput.metadata.remote_uri'
elif '$zarrinput.extension' == 'zarr.zip':
    compression = int('$zarrinput.metadata.compression' or 0)
    print(f'Using zipped store as input with compression {compression}')
    input_store = zarr.ZipStore('$zarrinput', mode='r', compression=compression)
else:
    raise ValueError('Unsupported input format')

if input_store is None:
    raise ValueError('Unable to determine input store. Your input dataset may be an unsupported format')

input_zarr = zarr.open(input_store, mode='r')

# Create the output store where the new zarr will be written
output_zarr = zarr.open('$zarroutput.extra_files_path', mode='w')

# Do some processing here for testing.
# copy_all copies between groups; copy_store expects raw stores, not the groups opened above.
zarr.copy_all(input_zarr, output_zarr, log=stdout)

foo = output_zarr.create_group('foo')
bar = foo.create_group('bar')
baz = bar.zeros('baz', shape=(10000, 10000), chunks=(1000, 1000), dtype='i4')
        ]]></configfile>
    </configfiles>
    <inputs>
        <param name="zarrinput" type="data" format="zarr,zarr_uri,zarr.zip" label="Zarr input"/>
    </inputs>
    <outputs>
        <data name="zarroutput" format="zarr" label="${tool.name} on ${on_string}: New ZARR"/>
    </outputs>
    <help><![CDATA[

simple wrapper to test zarr datatype

    ]]></help>
</tool>
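
To sanity-check what a wrapper like the one above wrote into extra_files_path, the store root and Zarr format version can be read from the metadata files without importing zarr. This is a rough sketch using only the standard library; `find_zarr_root` is a hypothetical helper, not the detection code in this PR. It relies on Zarr v2 keeping `zarr_format` in `.zgroup`/`.zarray` and Zarr v3 keeping it in `zarr.json`:

```python
import json
import os
import tempfile

# Metadata files that mark a Zarr node: v2 uses .zgroup/.zarray, v3 uses zarr.json.
MARKERS = (".zgroup", ".zarray", "zarr.json")


def find_zarr_root(extra_files_path):
    """Return (root_dir, zarr_format) for the first Zarr node in a top-down walk.

    Hypothetical helper; returns None when no marker file is found.
    """
    for dirpath, dirnames, filenames in os.walk(extra_files_path):
        dirnames.sort()  # make the traversal order deterministic
        for marker in MARKERS:
            if marker in filenames:
                with open(os.path.join(dirpath, marker)) as handle:
                    meta = json.load(handle)
                return dirpath, meta.get("zarr_format")
    return None


# Build a throwaway v2-style layout: <tmp>/store/.zgroup and <tmp>/store/foo/.zgroup
with tempfile.TemporaryDirectory() as tmp:
    nested = os.path.join(tmp, "store", "foo")
    os.makedirs(nested)
    for folder in (os.path.join(tmp, "store"), nested):
        with open(os.path.join(folder, ".zgroup"), "w") as handle:
            json.dump({"zarr_format": 2}, handle)
    root, version = find_zarr_root(tmp)
    print(os.path.relpath(root, tmp), version)
```

Because the walk is top-down, the directory returned never has a Zarr-marked ancestor, which roughly matches the idea of a store root relative to extra_files_path.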

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

wm75 and others added 28 commits October 14, 2024 09:34
This will go into a sub-datatype when needed.
This should probably go in sub-classes that expect specific directory structures.
Compressed (Upload) -> Directory (Unpack) -> Compressed (Download)
- Rename generic to ZarrDirectory
- Detect Zarr version in metadata
- Add zarr.zip datatype
…file

Instead of the default behavior of downloading an empty file.
Should help when opening the Zarr ZipStore if there is compression involved
@davelopez davelopez mentioned this pull request Oct 25, 2024
6 tasks

2 participants