[Bug]: NWB Zarr to HDMF export fails #211

Open
rcpeene opened this issue Aug 8, 2024 · 8 comments · May be fixed by hdmf-dev/hdmf#1171

Comments

@rcpeene

rcpeene commented Aug 8, 2024

What happened?

Trying to export a Zarr NWB file to HDF5, but it yields an error.

Steps to Reproduce

Running the following snippet on an NWB file:

    from hdmf_zarr.nwb import NWBZarrIO  # Zarr read backend for NWB
    from pynwb import NWBHDF5IO  # HDF5 write backend for NWB

    with NWBZarrIO(str(zarr_filename), mode='r') as read_io:  # Create Zarr IO object for read
        with NWBHDF5IO(hdmf_filename, 'w') as export_io:  # Create HDF5 IO object for write
            export_io.export(src_io=read_io, write_args=dict(link_data=False))  # Export from Zarr to HDF5

I can't share the NWB file for licensing reasons.



Traceback

```shell
/opt/conda/lib/python3.9/site-packages/hdmf/common/table.py:489: UserWarning: An attribute 'name' already exists on DynamicTable 'eye_tracking' so this column cannot be accessed as an attribute, e.g., table.name; it can only be accessed using other methods, e.g., table['name'].
  self.__set_table_attr(col)
Traceback (most recent call last):
  File "/root/capsule/./code/run_capsule.py", line 57, in <module>
    if __name__ == "__main__": run()
  File "/root/capsule/./code/run_capsule.py", line 47, in run
    export_io.export(src_io=read_io, write_args=dict(link_data=False))  # Export from Zarr to HDF5
  File "/opt/conda/lib/python3.9/site-packages/hdmf/utils.py", line 668, in func_call
    return func(args[0], **pargs)
  File "/opt/conda/lib/python3.9/site-packages/pynwb/__init__.py", line 399, in export
    super().export(**kwargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/utils.py", line 668, in func_call
    return func(args[0], **pargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 458, in export
    super().export(**ckwargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/utils.py", line 668, in func_call
    return func(args[0], **pargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/io.py", line 166, in export
    self.write_builder(builder=bldr, **write_args)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/utils.py", line 668, in func_call
    return func(args[0], **pargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 836, in write_builder
    self.write_group(self.__file, gbldr, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/utils.py", line 668, in func_call
    return func(args[0], **pargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 1018, in write_group
    self.write_group(group, sub_builder, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/utils.py", line 668, in func_call
    return func(args[0], **pargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 1023, in write_group
    self.write_dataset(group, sub_builder, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/utils.py", line 668, in func_call
    return func(args[0], **pargs)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 1326, in write_dataset
    dset = self.__list_fill__(parent, name, data, options)
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 1492, in __list_fill__
    raise e
  File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 1490, in __list_fill__
    dset[:] = data
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/opt/conda/lib/python3.9/site-packages/h5py/_hl/dataset.py", line 898, in __setitem__
    val = numpy.asarray(val, dtype=dtype.base, order='C')
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 579, in __array__
    a = self[...]
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 800, in __getitem__
    result = self.get_basic_selection(pure_selection, fields=fields)
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 926, in get_basic_selection
    return self._get_basic_selection_nd(selection=selection, out=out, fields=fields)
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 968, in _get_basic_selection_nd
    return self._get_selection(indexer=indexer, out=out, fields=fields)
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 1343, in _get_selection
    self._chunk_getitems(
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 2183, in _chunk_getitems
    self._process_chunk(
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 2096, in _process_chunk
    chunk = self._decode_chunk(cdata)
  File "/opt/conda/lib/python3.9/site-packages/zarr/core.py", line 2366, in _decode_chunk
    chunk = chunk.view(self._dtype)
ValueError: When changing to a smaller dtype, its size must be a divisor of the size of original dtype
```

Operating System: Linux
Python Executable: Python
Python Version: 3.9
Package Versions: No response


@oruebel
Contributor

oruebel commented Aug 9, 2024

Thanks for including the code and traceback. The issue appears to be due to some conversion between data types when exporting from Zarr to HDF5:

ValueError: When changing to a smaller dtype, its size must be a divisor of the size of original dtype

This error originates from here in the HDMF library when writing to disk:

File "/opt/conda/lib/python3.9/site-packages/hdmf/backends/hdf5/h5tools.py", line 1490, in list_fill
dset[:] = data

> I can't share the NWB file for licensing reasons

Since you can't share the original data file, we'll probably need your help to get to the root of this.

Option 1 would be, if you could share a "dummy" file that has the same issue, then we could investigate. That is, we don't really need the real data to debug; a file that looks similar and raises the same error should be fine. A sketch of such a dummy file follows.
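A hypothetical minimal dummy file might be built along these lines (untested sketch; the epoch/tags content and the filename are assumptions for illustration, and whether this particular file triggers the same error would need checking):

```python
from datetime import datetime
from hdmf_zarr.nwb import NWBZarrIO
from pynwb import NWBFile

# Build a tiny NWB file whose epochs table has a string "tags" column,
# then write it with the Zarr backend.
nwbfile = NWBFile(
    session_description="dummy session",
    identifier="dummy-id",
    session_start_time=datetime.now().astimezone(),
)
nwbfile.add_epoch(start_time=0.0, stop_time=1.0, tags=["a", "bb"])

with NWBZarrIO("dummy.nwb.zarr", mode="w") as io:
    io.write(nwbfile)
```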

Option 2 is to do a bit more retracing of steps on your end so we can at least figure out what case causes it and reproduce the issue on our end. A first step here would be to output all the properties of the dataset and data when the exception occurs, e.g., by adding a print statement before the exception is raised in line 1492, something along the lines of `print("parent", parent, "\n", "name", name, "\n", "dset", dset, "\n", "dset.dtype", dset.dtype, "\n", "data.dtype", data.dtype, "\n", "data", data)`, so that we can see what data types are being converted.
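In context, that diagnostic would look roughly like this inside `__list_fill__` (a sketch assuming the hdmf version from the traceback; exact line numbers will vary):

```python
# hdmf/backends/hdf5/h5tools.py, inside __list_fill__ (near the lines
# shown in the traceback above)
try:
    dset[:] = data
except Exception as e:
    # Report the dataset/data properties before re-raising so the
    # mismatched dtypes show up next to the traceback.
    print("parent", parent, "\n", "name", name, "\n",
          "dset", dset, "\n", "dset.dtype", dset.dtype, "\n",
          "data.dtype", data.dtype, "\n", "data", data)
    raise e
```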

@rcpeene
Author

rcpeene commented Aug 12, 2024

@oruebel I've received permission to share the file directly with you for examination as long as it isn't distributed. Would a OneDrive link work?

@oruebel
Contributor

oruebel commented Aug 12, 2024

> I've received permission to share the file directly with you for examination as long as it isn't distributed. Would a OneDrive link work?

Sure, a OneDrive link should be fine. Feel free to send via Slack or email oruebel@lbl.gov so we can take a look. We'll treat the data confidentially and not share it with others.

@rcpeene
Author

rcpeene commented Aug 13, 2024

Invite email sent.

@rcpeene
Author

rcpeene commented Aug 14, 2024

Any updates here? It's one of the last things holding up our data pipeline.

@oruebel
Contributor

oruebel commented Aug 15, 2024

As far as I can tell, the issue seems to occur when copying /intervals/flash_block_presentations/tags. My guess is that this is likely due to the following:

  • Zarr does not natively support variable-length strings; instead, strings are stored as object dtype with an encoding
  • HDF5IO in turn gets confused and chooses the wrong dtype

I'll need to do a bit more digging to confirm, but the fix will likely need to be in HDMF. A possible workaround may be to wrap /intervals/flash_block_presentations/tags with H5DataIO before calling export to explicitly set the dtype, but I have not tested this yet.
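A rough, untested sketch of that workaround (the column lookup and the in-place replacement of the column's data via the container's fields dict are assumptions and a hack for illustration, not a documented recipe):

```python
import numpy as np
from hdmf.backends.hdf5.h5_utils import H5DataIO
from hdmf_zarr.nwb import NWBZarrIO
from pynwb import NWBHDF5IO

with NWBZarrIO(str(zarr_filename), mode="r") as read_io:
    nwbfile = read_io.read()

    # Find the VectorData column holding the strings.
    table = nwbfile.intervals["flash_block_presentations"]
    tags = next(col for col in table.columns if col.name == "tags")

    # Materialize the Zarr-backed strings as a plain NumPy unicode array and
    # wrap it in H5DataIO so HDF5IO writes an explicit string dtype instead
    # of inferring one from the Zarr object array. Overwriting the private
    # fields dict is a hack, shown only for illustration.
    tags.fields["data"] = H5DataIO(data=np.asarray(tags.data[:], dtype=str))
    tags.set_modified()

    with NWBHDF5IO(hdmf_filename, "w") as export_io:
        export_io.export(src_io=read_io, nwbfile=nwbfile,
                         write_args=dict(link_data=False))
```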

@oruebel
Contributor

oruebel commented Aug 15, 2024

What is confusing to me is that when printing from HDF5IO it shows `<zarr.core.Array '/intervals/flash_block_presentations/tags' (1011,) <U0 read-only>`, but when opening the file with Zarr manually it shows `<zarr.core.Array '/intervals/flash_block_presentations/tags' (1011,) object read-only>`. I'm not sure why the dtype would be `<U0` instead of `object`. It looks like, because of this, it is actually the read of the data from Zarr itself that is failing.
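For context, this is how zarr-python 2.x represents variable-length strings when creating an array directly (a minimal standalone illustration, not the file above):

```python
import numcodecs
import zarr

# Zarr has no native variable-length string dtype; strings are stored as an
# object-dtype array with a VLenUTF8 object codec.
z = zarr.array(["a", "bb", "ccc"], dtype=object, object_codec=numcodecs.VLenUTF8())
print(z)        # <zarr.core.Array (3,) object>
print(z.dtype)  # object
print(z[:])     # array(['a', 'bb', 'ccc'], dtype=object)
```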

@oruebel linked a pull request Aug 15, 2024 that will close this issue
@oruebel
Contributor

oruebel commented Aug 15, 2024

It appears the issue is that the ObjectMapper in HDMF uses .astype('U') to enforce that the dtype of the dataset is unicode, as specified in the schema. For Zarr datasets this fails because Zarr does not support 'U' as a dtype for variable-length strings.

I submitted a PR to HDMF for this: hdmf-dev/hdmf#1171. With this change I was able to convert the file to HDF5.
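A minimal sketch of that mismatch (assuming zarr-python 2.x and its lazy astype view; untested against the actual file):

```python
import numcodecs
import numpy as np
import zarr

# NumPy can cast an object array of strings with astype('U'); it scans the
# elements and picks a concrete itemsize.
arr = np.array(["a", "bb"], dtype=object)
print(arr.astype("U").dtype)  # <U2

# A Zarr object array has no 'U' equivalent: astype('U') produces a lazy
# view whose dtype is the zero-sized '<U0', and reading through that view
# fails while decoding chunks -- consistent with the '<U0' and the
# ValueError seen earlier in this thread.
z = zarr.array(["a", "bb"], dtype=object, object_codec=numcodecs.VLenUTF8())
view = z.astype("U")
print(view.dtype)  # <U0
view[:]            # ValueError: When changing to a smaller dtype, ...
```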

@rly added the "category: bug" (errors in the code or code behavior) and "priority: medium" (non-critical problem and/or affecting only a small set of users) labels Aug 15, 2024