
improve html representation of datasets #1100

Open
wants to merge 32 commits into
base: dev

Conversation

h-mayorquin

@h-mayorquin h-mayorquin commented Apr 19, 2024

Motivation

Improve the display of the data in the HTML representation of containers. Note that this PR focuses on datasets that have already been written to a file. For the in-memory representation, the question of what to do with data wrapped in an iterator or a DataIO subtype can, I think, be addressed in another PR.

How to test the behavior?

HDF5

I have been using this script:

from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from pynwb.testing.mock.file import mock_NWBFile
from hdmf.backends.hdf5.h5_utils import H5DataIO
from pynwb.testing.mock.ophys import mock_ImagingPlane, mock_TwoPhotonSeries

import numpy as np

data = np.random.rand(500_000, 384)
timestamps = np.arange(500_000)
data = H5DataIO(data=data, compression=True, chunks=True)

nwbfile = mock_NWBFile()
electrical_series = mock_ElectricalSeries(data=data, nwbfile=nwbfile, rate=None, timestamps=timestamps)

imaging_plane = mock_ImagingPlane(grid_spacing=[1.0, 1.0], nwbfile=nwbfile)


data = H5DataIO(data=np.random.rand(2, 2, 2), compression=True, chunks=True)
two_photon_series = mock_TwoPhotonSeries(name="TwoPhotonSeries", imaging_plane=imaging_plane, data=data, nwbfile=nwbfile)


# Write to file
from pynwb import NWBHDF5IO
with NWBHDF5IO('ecephys_tutorial.nwb', 'w') as io:
    io.write(nwbfile)



from pynwb import NWBHDF5IO

io = NWBHDF5IO('ecephys_tutorial.nwb', 'r')
nwbfile = io.read()
nwbfile

image

Zarr

import os

import numpy as np
from numcodecs import Blosc, Delta
from hdmf_zarr import ZarrDataIO
from hdmf_zarr.nwb import NWBZarrIO
from pynwb.testing.mock.ecephys import mock_ElectricalSeries
from pynwb.testing.mock.file import mock_NWBFile

filters = [Delta(dtype="i4")]  # optional; only used if passed to ZarrDataIO below

data_with_zarr_data_io = ZarrDataIO(
    data=np.arange(100000000, dtype='i4').reshape(10000, 10000),
    chunks=(1000, 1000),
    compressor=Blosc(cname='zstd', clevel=3, shuffle=Blosc.SHUFFLE),
    # filters=filters,
)

timestamps = np.arange(10000)

data = data_with_zarr_data_io

nwbfile = mock_NWBFile()
electrical_series_name = "ElectricalSeries"
rate = None
electrical_series = mock_ElectricalSeries(name=electrical_series_name, data=data, nwbfile=nwbfile, timestamps=timestamps, rate=None)


path = "zarr_test.nwb.zarr"
absolute_path = os.path.abspath(path)
with NWBZarrIO(path=path, mode="w") as io:
    io.write(nwbfile)
    
from hdmf_zarr.nwb import NWBZarrIO

path = "zarr_test.nwb.zarr"

io = NWBZarrIO(path=path, mode="r")
nwbfile = io.read()
nwbfile

image

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Does the PR clearly describe the problem and the solution?
  • Have you reviewed our Contributing Guide?
  • Does the PR use "Fix #XXX" notation to tell GitHub to close the relevant issue numbered XXX when the PR is merged?

@h-mayorquin h-mayorquin marked this pull request as ready for review April 23, 2024 15:14

codecov bot commented Apr 23, 2024

Codecov Report

Attention: Patch coverage is 82.81250% with 11 lines in your changes missing coverage. Please review.

Project coverage is 89.09%. Comparing base (06a62b9) to head (03c9f8f).

Files with missing lines             Patch %   Lines
src/hdmf/utils.py                    87.50%    2 Missing and 2 partials ⚠️
src/hdmf/backends/io.py              50.00%    3 Missing ⚠️
src/hdmf/backends/hdf5/h5tools.py    84.61%    1 Missing and 1 partial ⚠️
src/hdmf/container.py                84.61%    1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##              dev    #1100   +/-   ##
=======================================
  Coverage   89.08%   89.09%           
=======================================
  Files          45       45           
  Lines        9890     9944   +54     
  Branches     2816     2825    +9     
=======================================
+ Hits         8811     8860   +49     
- Misses        763      765    +2     
- Partials      316      319    +3     


@rly rly requested a review from stephprince April 24, 2024 01:21
@rly rly added the category: enhancement improvements of code or code behavior label Apr 24, 2024
@rly rly added this to the 3.14.0 milestone Apr 24, 2024
Contributor

@stephprince stephprince left a comment


This looks great! Thanks for the PR.

Could you add tests for the data html representation with hdf5 and zarr? I think we mainly have string equivalence tests for this kind of thing.

I'm also wondering if it would be nice to have the hdf5 dataset info displayed in a similar table format as the zarr arrays to make it more consistent across backends. I think we should be able to replicate this using the hdf5 dataset info as an input to a method like this: https://github.com/zarr-developers/zarr-python/blob/9d046ea0d2878af7d15b3de3ec3036fe31661340/zarr/util.py#L402
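The table format suggested above could be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the zarr or hdmf implementation: the helper name `info_items_to_html` and the sample values are assumptions made for the example. The final assertion shows the kind of string-equivalence check mentioned for the tests.

```python
from html import escape


def info_items_to_html(items, title=None):
    """Render (key, value) pairs as a small HTML table.

    Hypothetical helper for illustration only; not part of hdmf or zarr.
    """
    rows = "".join(
        f"<tr><th>{escape(str(key))}</th><td>{escape(str(value))}</td></tr>"
        for key, value in items
    )
    caption = f"<caption>{escape(title)}</caption>" if title else ""
    return f"<table>{caption}<tbody>{rows}</tbody></table>"


# Sample values are made up for the example.
hdf5_info = [("Data type", "float64"), ("Shape", (2, 2, 2)), ("Chunk shape", (2, 2, 2))]
html = info_items_to_html(hdf5_info, title="HDF5 dataset")

expected = (
    "<table><caption>HDF5 dataset</caption><tbody>"
    "<tr><th>Data type</th><td>float64</td></tr>"
    "<tr><th>Shape</th><td>(2, 2, 2)</td></tr>"
    "<tr><th>Chunk shape</th><td>(2, 2, 2)</td></tr>"
    "</tbody></table>"
)
assert html == expected  # string-equivalence check
```

Feeding both backends through one rendering function like this is what would keep the output consistent; only the step that collects the key/value pairs would differ per backend.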

@h-mayorquin
Author

OK, I added table formatting for hdf5:

image

@h-mayorquin
Author

h-mayorquin commented Apr 26, 2024

@stephprince
Concerning the test: yes, I can do it, but can you help me create a container that contains array data? I just don't have experience with the bare-bones object. This is my attempt:

import numpy as np
from hdmf.container import Container

container = Container(name="Container")
container.__fields__ = {
    "name": "data",
    "description": "test data",
}

test_data = np.array([1, 2, 3, 4, 5])
setattr(container, "data", test_data)
container.fields

But the data is not added as a field. How can I move forward?

@h-mayorquin
Author

Related:

hdmf-dev/hdmf-zarr#186

@h-mayorquin
Author

I added handling for division by zero. Check out what happens with external files (like video):

image

From this example:

import remfile
import h5py

asset_path = "sub-CSHL049/sub-CSHL049_ses-c99d53e6-c317-4c53-99ba-070b26673ac4_behavior+ecephys+image.nwb"
recording_asset = dandiset.get_asset_by_path(path=asset_path)
url = recording_asset.get_content_url(follow_redirects=True, strip_query=True)
file_path = url

rfile = remfile.File(file_path)
file = h5py.File(rfile, 'r')

from pynwb import NWBHDF5IO

io = NWBHDF5IO(file=file, mode='r')

nwbfile = io.read()
nwbfile
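
The division-by-zero handling mentioned above can be illustrated with a small sketch. It assumes the ratio in question is uncompressed size over stored size, and that a dataset pointing at external files may report zero stored bytes; the function name `compression_ratio` and the numbers are hypothetical.

```python
def compression_ratio(uncompressed_bytes, stored_bytes):
    """Return uncompressed/stored size, or "undefined" when nothing is stored.

    Hypothetical helper for illustration; not the code from this PR.
    """
    if stored_bytes == 0:
        # Nothing is stored in the file itself, so the ratio has no meaning.
        return "undefined"
    return uncompressed_bytes / stored_bytes


print(compression_ratio(1_536_000_000, 384_000_000))  # 4.0
print(compression_ratio(1_024, 0))  # undefined
```

Returning a label instead of raising keeps the HTML representation usable even when one field cannot be computed.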

@rly
Contributor

rly commented Oct 2, 2024

@stephprince when you have time, can you review this?

@rly rly modified the milestones: 3.14.5, 3.14.6 Oct 3, 2024
@stephprince
Contributor

Rereading this discussion, I believe where we left off is that we want to remove the backend-specific logic from the Container class. To do so, it was proposed that:

In this PR we:

  • Add HDMFIO.generate_dataset_html(dataset) which would implement a minimalist representation
  • Implement HDF5IO.generate_dataset_html(h5py.Dataset) to represent an h5py.Dataset

In a separate PR on hdmf_zarr we would:

  • Implement ZarrIO.generate_dataset_html(zarr.Array)

In the Container class, it would look like this:

read_io = self.get_read_io()  # if the Container was read from file, this will give you the IO object that read it
if read_io is not None:
    html_repr = read_io.generate_dataset_html(my_dataset)
else:
    # The file was not read from disk, so the dataset should be a numpy array or a list
    html_repr = None  # placeholder: build the in-memory representation here
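
The dispatch described above can be sketched with plain stand-in classes. Everything here is hypothetical (`BaseIO`, `TableIO`, `PlainIO`, `Holder`) and only mirrors the shape of the proposal: a generic static method on the base IO class, an optional backend override, and a container that asks its read IO for the representation. The in-memory fallback shown is an assumption, not something settled in this thread.

```python
class BaseIO:
    """Stand-in for a generic IO class (hypothetical, not hdmf's HDMFIO)."""

    @staticmethod
    def generate_dataset_html(dataset):
        # Minimal representation that works for any backend
        return f"<pre>{dataset!r}</pre>"


class TableIO(BaseIO):
    """Stand-in for a backend that knows more about its datasets."""

    @staticmethod
    def generate_dataset_html(dataset):
        return f"<table><tr><th>Length</th><td>{len(dataset)}</td></tr></table>"


class PlainIO(BaseIO):
    """Backend without an override; it inherits the minimal representation."""


class Holder:
    """Stand-in for a container that may or may not have been read from a file."""

    def __init__(self, dataset, read_io=None):
        self.dataset = dataset
        self._read_io = read_io

    def get_read_io(self):
        return self._read_io

    def dataset_html(self):
        read_io = self.get_read_io()
        if read_io is not None:
            # Backend-specific logic lives in the IO class, not here
            return read_io.generate_dataset_html(self.dataset)
        # Not read from a file: fall back to the generic representation (assumption)
        return BaseIO.generate_dataset_html(self.dataset)


print(Holder([1, 2, 3], read_io=TableIO()).dataset_html())
print(Holder([1, 2, 3], read_io=PlainIO()).dataset_html())
print(Holder([1, 2, 3]).dataset_html())
```

With this layout a backend that has not implemented its own method still produces output through the inherited base method, which is why a separate backend package can add its override later without changes to the container.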

@h-mayorquin did you want to do this? Otherwise I can go ahead and make the proposed changes to finish up this PR.

@h-mayorquin
Author

Hi, @stephprince

I think this is a good summary.

I am not sure how to decouple HDF5IO.generate_dataset_html(h5py.Dataset) here, as hdmf seems tightly coupled with hdf5. Or is the idea that we only want to exclude zarr?

This has been in the back of my mind for a while, but every time I came back to it I had other priorities. It would be great if you have time to finish it.

@stephprince
Contributor

@h-mayorquin yes I can take a look at it and finish it up

@stephprince
Contributor

I have pushed the updates we discussed:

  • added utility functions generate_array_html_repr and get_basic_array_info to the utils module to get basic array info and generate an array html table
  • added a static HDMFIO.generate_dataset_html() method; the HDF5/Zarr implementations collect information from the dataset and then generate the actual html representation
  • updated Container._generate_array_html() to use these methods
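
As a rough illustration of what "basic array info" could contain, here is a standard-library-only sketch. The names `format_bytes` and `describe_array` are hypothetical and are not the utilities added in this PR; the example shape matches the 500,000 x 384 float64 array from the test script in the PR description.

```python
from math import prod


def format_bytes(num_bytes):
    """Format a byte count using binary prefixes (KiB, MiB, ...)."""
    size = float(num_bytes)
    for unit in ("bytes", "KiB", "MiB", "GiB", "TiB"):
        # Stop at the first unit that fits, or at the largest unit available
        if size < 1024 or unit == "TiB":
            return f"{size:.2f} {unit}"
        size /= 1024


def describe_array(shape, dtype_name, itemsize):
    """Collect backend-independent facts about an array into a dict."""
    return {
        "Data type": dtype_name,
        "Shape": shape,
        "Array size": format_bytes(prod(shape) * itemsize),
    }


info = describe_array(shape=(500_000, 384), dtype_name="float64", itemsize=8)
print(info["Array size"])  # 1.43 GiB
```

A dictionary like this can be extended with backend-specific entries (chunking, compression) before being rendered, so the shared part is computed in one place.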

I tested a Zarr implementation that looks like this and can submit a PR in hdmf_zarr for that:

from hdmf.utils import generate_array_html_repr


def generate_dataset_html(dataset):
    """Generates an html representation for a dataset for the ZarrIO class"""

    # get info from zarr array and generate html repr
    zarr_info_dict = dict(dataset.info_items())
    repr_html = generate_array_html_repr(zarr_info_dict, dataset, "Zarr Array")

    return repr_html

@oruebel @h-mayorquin if you could please review and let me know if there are any remaining concerns

@h-mayorquin
Author

Looks good to me, thanks for taking this on.

Comment on lines +1609 to +1610
def generate_dataset_html(dataset):
"""Generates an html representation for a dataset for the HDF5IO class"""
Contributor


It would be useful to raise a corresponding issue on HDMF_ZARR to have generate_dataset_html implemented on ZarrIO as well (if we have not done this yet). @stephprince can you make an issue?

Comment on lines +191 to +192
@staticmethod
def generate_dataset_html(dataset):
Contributor


I believe this function would be triggered when using ZarrIO. Did we test that this indeed works with ZarrIO?

Labels
category: enhancement improvements of code or code behavior

4 participants