open_virtual_dataset fails to open tiffs #291

elyall · 2024-11-08T22:32:06Z

open_virtual_dataset says it supports TIFF files however it currently fails and is lacking test coverage. The issue starts here where the kerchunk reference to a zarr array is treated as an HDF reference that has one or more groups.

Code to reproduce

from pathlib import Path

import numpy as np
from PIL import Image
from virtualizarr import open_virtual_dataset
from virtualizarr.backend import FileType


def write_random_tiff(file_path):
    array = np.random.randint(0, 255, (128, 128), dtype=np.uint8)
    img = Image.fromarray(array)
    img.save(file_path)


file_path = Path.cwd() / "test.tif"
write_random_tiff(file_path)

open_virtual_dataset(str(file_path), filetype=FileType.tiff)

Throws:

File virtualizarr/translators/kerchunk.py:46, in extract_group(vds_refs, group)

ValueError: Multiple HDF Groups found. Must specify group= keyword to select one of []

virtualizarr = 1.1.0

The text was updated successfully, but these errors were encountered:

TomNicholas · 2024-11-08T23:36:13Z

Ah sorry about this @elyall !

We do have some rudimentary tests for tiff but as the tiff reader is just a thin wrapper around the kerchunk tiff reader I (naively) expected this to "just work". I agree that we should remedy that with better test coverage.

Thanks for the reproducible example, I'll have a look at what might be going on now.

TomNicholas · 2024-11-09T00:09:07Z

So I've looked into this, and I'm a bit confused by what kerchunk returns, and am ignorant of exactly what information is stored in a tiff file. The kerchunk tiff reader basically returns this information about your example tiff file:

{'refs': {'.zattrs': '{"_ARRAY_DIMENSIONS":["Y","X"]}', '.zarray': '{\n "chunks": [\n  128,\n  128\n ],\n "compressor": null,\n "dtype": "|u1",\n "fill_value": 0,\n "filters": null,\n "order": "C",\n "shape": [\n  128,\n  128\n ],\n "zarr_format": 2\n}', '0.0': ['/private/var/folders/zt/b84f41811p5_vygj1gjp6qx40000gn/T/pytest-of-tom/pytest-38/test_random_tiff0/rand.tiff', 122, 16384]}}

VirtualiZarr's tiff reader then attempts to parse this and wrap it up into a nicer xarray.Dataset containing virtualizarr.ManifestArray objects. That's the same way that several of VirtualiZarr's other readers work (and are much better tested).

What confuses me here is how am I supposed to know the name of the array you created? Do tiff files have named arrays/variables? Is the expected behaviour to default to some naming convention? Examples I find online of using rioxarray and rasterio are for opening GeoTIFF files - do they have more information in them about variable names? This is related to the failure you're seeing, because the virtualizarr parsing logic is expecting kerchunk to return something that has information about group names and array names.

Also is saying that the array dimensions are called "X" and "Y" correct? Or did kerchunk just make that up?

dcherian · 2024-11-09T00:13:30Z

Do tiff files have named arrays/variables? Is the expected behaviour to default to some naming convention?

They do not. there is just a single array, and sometimes overviews.

TomNicholas · 2024-11-09T00:17:04Z

I've opened #292 to add your example as a test case and start trying to fix it @elyall.

Do tiff files have named arrays/variables? Is the expected behaviour to default to some naming convention?

They do not. there is just a single array, and sometimes overviews.

Interesting... So if it's just one unnamed array, then "open_dataset" doesn't even make sense - the tiff reader should really only support open_dataarray, where we return a single unnamed xr.DataArray.

EDIT: My suggestion seems consistent with what open_rasterio does, returning a single DataArray - https://knowledge.dea.ga.gov.au/notebooks/How_to_guides/Opening_GeoTIFFs_NetCDFs/#Opening-a-single-GeoTIFF

elyall · 2024-11-09T00:55:02Z

Yes a dataarray is a better fit. I just need the ManifestArray object.

TomNicholas · 2024-11-09T01:01:56Z

Cool. I'll be able to continue this next week, but you're welcome to work on it instead of you're interested.

mdsumner · 2024-11-09T01:36:23Z

There can be multiple arrays, including overviews. In fact before GeoTIFF had explicit overviews the multiple arrays (subdatasets in GDAL parlance) were used for overviews in some software (maybe only one tho ...).

I'm interested to explore this so hoping I can help a little 🙏

elyall · 2024-11-09T06:59:57Z

@mdsumner I'm not familiar with GeoTIFF but I'm guessing it's similar to other TIFF specs I'm aware of: a multi-page TIFF with specific metadata.

Readers like tifffile can read individual pages of multi-page TIFFs so each page probably should be treated as individual chunks wrapped in their own ManifestArray. And the default keys could be the page indices. There are lots of TIFF metadata specs out there and a subset probably have something that translates well into keys, but handling all of those cases would be difficult. It would be easiest to just allow the user to inject keys as an argument during creation.

mdsumner · 2024-11-09T11:44:58Z

Does multi-page mean (multi)subdataset analogously to multi-variable in NetCDF?

I'm not familiar with the tiff spec, mostly only with GeoTIFF usage details and the abstractions

"Pages" could be like bands, or layers, like actual pages in a pdf where the shape doesn't change, nor the pixel type. Sadly format idioms leak into these abstractions, so please indulge my questions I expect you'll have the same needs for clarification 🙏

TomNicholas · 2024-11-09T15:05:56Z

Thanks both of you.

Readers like tifffile can read individual pages of multi-page TIFFs so each page probably should be treated as individual chunks wrapped in their own ManifestArray.

Does multi-page mean (multi)subdataset analogously to multi-variable in NetCDF?

This was going to be my question too. In Xarray terminology what we need to know is whether each page of a multi-page TIFF maps to

Different slices (potentially chunked) of a single Variable/DataArray (that's what your example implies @elyall - maybe this is all we need),
Different (named) variables in the same Dataset (i.e. different arrays that can have different shapes but any common dimensions have consistent shape),
Something even more messy, which might only be representable as groups of a xarray.DataTree, or a list of unnamed DataArrays.

I'm keen to try and get this working, but it would be much appreciated if someone who actually works with (Geo)TIFF files regularly was interesting in helping maintain VirtualiZarr's TIFF reader. The rest of the package is designed to be as modular as possible to minimize the effort required, and we already have some developers who have helpfully contributed and taken ownership over other specific readers (e.g. DMR++). cc'ing @scottyhq and @sharkinsspatial as people who may be interested, but any lurkers feel free to pipe up 🪈

elyall · 2024-11-09T15:43:26Z

I've only used option 1 where all pages are slices in one array and therefore have matching dimensions, but that is not always the case.

Here's a great overview of internal TIFF structure. The Pyramid TIFF is an example of arrays of different sizes (though same dimensions) that would require separate keys (in this case a DataTree would work, example here).

As we can see it's complicated since from the underlying structure there are a lot of different TIFF specs that require their own data loaders. All IFD and SubIFD ImageBytes should be individual ManifestArrays. The one consistent thing is they are sequential so as a start the keys/schema could be input as an argument. From there individual use cases could be handled as desired.

I'm happy to help and potentially maintain a limited scope reader (TIFF spec itself shouldn't change much) but not all of the many edge cases.

mdsumner · 2024-11-10T03:50:18Z

I'm getting the same error still as the OP ??

import virtualizarr
virtualizarr.__version__
'1.1.1.dev12+g9e7d430'

from pathlib import Path

import numpy as np
from PIL import Image
from virtualizarr import open_virtual_dataset
from virtualizarr.backend import FileType


def write_random_tiff(file_path):
    array = np.random.randint(0, 255, (128, 128), dtype=np.uint8)
    img = Image.fromarray(array)
    img.save(file_path)


file_path = Path.cwd() / "test.tif"
write_random_tiff(file_path)

open_virtual_dataset(str(file_path), filetype=FileType.tiff)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/scratch/pawsey0973/mdsumner/.local/lib/python3.12/site-packages/virtualizarr/backend.py", line 190, in open_virtual_dataset
    vds = backend_cls.open_virtual_dataset(
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/pawsey0973/mdsumner/.local/lib/python3.12/site-packages/virtualizarr/readers/tiff.py", line 48, in open_virtual_dataset
    refs = extract_group(refs, group)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/scratch/pawsey0973/mdsumner/.local/lib/python3.12/site-packages/virtualizarr/translators/kerchunk.py", line 46, in extract_group
    raise ValueError(
ValueError: Multiple HDF Groups found. Must specify group= keyword to select one of []

I'm really keen to explore this, but having issues getting a working environment. will try again later :)

elyall · 2024-11-10T19:47:19Z

I got a little ahead of myself as kerchunk's tiff_to_zarr should already solve most of this problem. The heavy lifting is actually occurring in tifffile, specifically in tifffile.ZarrTiffStore. This if statement determines whether the output is a single array (DataArray) or multiple arrays (Dataset/DataTree). If multiple arrays they are sequentially indexed as you can see here. However tiff_to_zarr then plays with some metadata and potentially changes the keys (I'll need to dig into it more).

It would be great if we had an example TIFF file that has multiple arrays that someone understands well to cover the Dataset route. I found this sample GeoTIFF but it appears to be a single array. I could generate a multiscale OME-TIFF file but the keys are sequential and the data fits better in a DataTree than a Dataset.

mdsumner · 2024-11-11T03:44:17Z

there's a multi tiff here:

https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif

gdalinfo /vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
Driver: GTiff/GeoTIFF
Files: /vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
Size is 20, 20
Image Structure Metadata:
  INTERLEAVE=BAND
Subdatasets:
  SUBDATASET_1_NAME=GTIFF_DIR:1:/vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
  SUBDATASET_1_DESC=Page 1 (20P x 20L x 1B)
  SUBDATASET_2_NAME=GTIFF_DIR:2:/vsicurl/https://github.com/OSGeo/gdal/raw/refs/heads/master/autotest/gcore/data/twoimages.tif
  SUBDATASET_2_DESC=Page 2 (20P x 20L x 1B)
Corner Coordinates:
Upper Left  (    0.0,    0.0)
Lower Left  (    0.0,   20.0)
Upper Right (   20.0,    0.0)
Lower Right (   20.0,   20.0)
Center      (   10.0,   10.0)
Band 1 Block=20x20 Type=Byte, ColorInterp=Gray

TomNicholas · 2024-11-11T20:59:39Z

Oh tifffile allows you to open tiffs as Zarr arrays?! Let's just use that! Then we can skip the kerchunk step entirely. Our tiff reader could re-use code from the (WIP) Zarr reader in #271.

TomNicholas · 2024-11-12T17:13:25Z

Okay I tried an using an approach like

from tifffile import imread

store = imread(filepath, aszarr=True)

# TODO exception handling for TIFF files with multiple arrays
za = zarr.open_array(store=store, mode="r")

but ended up down a rabbit hole because a) tifffile doesn't appear to support zarr-python v3 yet and b) determining the actual byte ranges is still tricky - in kerchunk.tiff.tiff_to_zarr I think it happens in the store.write_fsspec step.

EDIT: I posted where I got to in #295 in case it's ever of any use.

I think we should try to get the tiff reader working using an implementation that relies on the kerchunk reader first, then as a follow-up think about if we can eliminate kerchunk from that process.

elyall · 2024-11-13T23:38:02Z

a) tifffile doesn't appear to support zarr-python v3 yet

It does not at the moment.

and b) determining the actual byte ranges is still tricky - in kerchunk.tiff.tiff_to_zarr I think it happens in the store.write_fsspec step.

That's correct, and write_fsspec is a method of tifffile.ZarrTiffStore. It's designed to write to a file, but kerchunk.tiff.tiff_to_zarr writes it to an in-memory text buffer and then converts it back to json.

there's a multi tiff here:

@mdsumner That file has two arrays with the same shape and dimensions. tifffile.ZarrTiffStore.write_fsspec for better or worse treats them as a single array with two chunks, one for each image, which fits better in a xarray.DataArray. The example we need is one where a TIFF has multiple arrays of different shapes, dimensions, or coords (and probably keys for each array that hopefully tifffile can parse), which would need to be wrapped in a xarray.Dataset or xarray.DataTree. OR we need an example where tifffile.ZarrTiffStore.write_fsspec understands a TIFF spec and places references to matching arrays (same shape and dimensions) under different zarr groups.

I think we should try to get the tiff reader working using an implementation that relies on the kerchunk reader first, then as a follow-up think about if we can eliminate kerchunk from that process.

Sounds good. Right now I can help most with developing tests but I'll take a look at some of the other VirtualiZarr readers and try to get a grasp of what they ultimately produce.

mdsumner · 2024-11-14T00:34:06Z

I think that's fine, multiple chunks, I think that's how they're intended to be interpreted, not as different arrays.(It's very old style)

TomNicholas · 2024-11-14T22:22:46Z

Thanks both. I put my attempt to bypass kerchunk up in #295. I've also submitted #296 to change the documentation to no longer falsely advertise that we can read tiff files when we cannot yet.

I think we should try to get the tiff reader working using an implementation that relies on the kerchunk reader first, then as a follow-up think about if we can eliminate kerchunk from that process.

Sounds good. Right now I can help most with developing tests but I'll take a look at some of the other VirtualiZarr readers and try to get a grasp of what they ultimately produce.

It really shouldn't be particularly difficult to finish #292 - it mostly requires rejigging the virtualizarr.translators.kerchunk module to handle parsing the dict that kerchunk.tiff_to_zarr returns.

elyall · 2024-11-14T23:03:22Z

it mostly requires rejigging the virtualizarr.translators.kerchunk module to handle parsing the dict that kerchunk.tiff_to_zarr returns

I've done that here. The only real change was to have virtual_vars_from_kerchunk_refs accept arr_refs as input instead of refs. Now a single array as is produced from kerchunk.tiff_to_zarr can be passed in.

The question is, do you want to create a dataarray_from_kerchunk_refs equivalent to dataset_from_kerchunk_refs or do you want to add an optional key argument to dataset_from_kerchunk_refs that assigns the array to a key and then wraps it in a Dataset? Or another alternative I haven't thought of.

TomNicholas · 2024-11-15T22:12:20Z

#297 looks good, thank you @elyall !

The question is, do you want to create a dataarray_from_kerchunk_refs equivalent to dataset_from_kerchunk_refs or do you want to add an optional key argument to dataset_from_kerchunk_refs that assigns the array to a key and then wraps it in a Dataset?

Yeah good question. The main advantage of coercing a single-image TIFF into a Dataset would be so that the user can always know that open_virtual_dataset would succeed on a .tiff file first time. Having an optional key argument wouldn't help with that, because then you have to call open_virtual_dataset('image.tiff'), get an error, then call open_virtual_dataset('image.tiff', key='dummy_var_name'), when at that point you may as well just do open_virtual_dataarray('image.tiff').

Or another alternative I haven't thought of.

We could image providing a new overloaded opening function like

def open_tiff(filepath: str) -> Dataset | DataArray | DataTree:
    ...

but I don't really like that idea either, because it's likely the next line of code just breaks depending on which type was returned, and it breaks the correspondence with xarray.open_dataset/open_datatree etc.

(Note that we have basically the same problem with not knowing if GRIB/netCDF files map to Dataset or DataTree without inspecting them once first to find out if they contain sub-groups.)

So I think I prefer that we add open_virtual_dataarray (like I started doing in #295), and depending on the type of TIFF file one of open_virtual_dataarray/dataset/datatree should succeed.

elyall mentioned this issue Nov 8, 2024

Manifest storage transformer zarr-developers/zarr-specs#287

Open

TomNicholas added bug Something isn't working Kerchunk Relating to the kerchunk library / specification itself readers labels Nov 8, 2024

TomNicholas mentioned this issue Nov 9, 2024

Fix tiff reader #292

Draft

5 tasks

This was referenced Nov 14, 2024

Tiff zarr reader #295

Draft

Clarify which features are currently available in FAQ #296

Merged

TomNicholas mentioned this issue Nov 18, 2024

Support HDF4? #216

Open

TomNicholas mentioned this issue Dec 20, 2024

Align accepted filetypes with docstring description #361

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

open_virtual_dataset fails to open tiffs #291

open_virtual_dataset fails to open tiffs #291

elyall commented Nov 8, 2024 •

edited

Loading

TomNicholas commented Nov 8, 2024 •

edited

Loading

TomNicholas commented Nov 9, 2024

dcherian commented Nov 9, 2024

TomNicholas commented Nov 9, 2024 •

edited

Loading

elyall commented Nov 9, 2024 •

edited

Loading

TomNicholas commented Nov 9, 2024

mdsumner commented Nov 9, 2024 •

edited

Loading

elyall commented Nov 9, 2024

mdsumner commented Nov 9, 2024 •

edited

Loading

TomNicholas commented Nov 9, 2024 •

edited

Loading

elyall commented Nov 9, 2024 •

edited

Loading

mdsumner commented Nov 10, 2024

elyall commented Nov 10, 2024 •

edited

Loading

mdsumner commented Nov 11, 2024

TomNicholas commented Nov 11, 2024

TomNicholas commented Nov 12, 2024 •

edited

Loading

elyall commented Nov 13, 2024 •

edited

Loading

mdsumner commented Nov 14, 2024

TomNicholas commented Nov 14, 2024

elyall commented Nov 14, 2024 •

edited

Loading

TomNicholas commented Nov 15, 2024

open_virtual_dataset fails to open tiffs #291

open_virtual_dataset fails to open tiffs #291

Comments

elyall commented Nov 8, 2024 • edited Loading

TomNicholas commented Nov 8, 2024 • edited Loading

TomNicholas commented Nov 9, 2024

dcherian commented Nov 9, 2024

TomNicholas commented Nov 9, 2024 • edited Loading

elyall commented Nov 9, 2024 • edited Loading

TomNicholas commented Nov 9, 2024

mdsumner commented Nov 9, 2024 • edited Loading

elyall commented Nov 9, 2024

mdsumner commented Nov 9, 2024 • edited Loading

TomNicholas commented Nov 9, 2024 • edited Loading

elyall commented Nov 9, 2024 • edited Loading

mdsumner commented Nov 10, 2024

elyall commented Nov 10, 2024 • edited Loading

mdsumner commented Nov 11, 2024

TomNicholas commented Nov 11, 2024

TomNicholas commented Nov 12, 2024 • edited Loading

elyall commented Nov 13, 2024 • edited Loading

mdsumner commented Nov 14, 2024

TomNicholas commented Nov 14, 2024

elyall commented Nov 14, 2024 • edited Loading

TomNicholas commented Nov 15, 2024

elyall commented Nov 8, 2024 •

edited

Loading

TomNicholas commented Nov 8, 2024 •

edited

Loading

TomNicholas commented Nov 9, 2024 •

edited

Loading

elyall commented Nov 9, 2024 •

edited

Loading

mdsumner commented Nov 9, 2024 •

edited

Loading

mdsumner commented Nov 9, 2024 •

edited

Loading

TomNicholas commented Nov 9, 2024 •

edited

Loading

elyall commented Nov 9, 2024 •

edited

Loading

elyall commented Nov 10, 2024 •

edited

Loading

TomNicholas commented Nov 12, 2024 •

edited

Loading

elyall commented Nov 13, 2024 •

edited

Loading

elyall commented Nov 14, 2024 •

edited

Loading