
Append a Dataset of References #1135

Merged
merged 39 commits into from
Aug 22, 2024

Conversation

mavaylon1
Contributor

@mavaylon1 mavaylon1 commented Jun 27, 2024

Motivation

What was the reasoning behind this change? Please explain the changes briefly.

How to test the behavior?

Show how to reproduce the new behavior (can be a bug fix or a new feature)

Checklist

  • Did you update CHANGELOG.md with your changes?
  • Does the PR clearly describe the problem and the solution?
  • Have you reviewed our Contributing Guide?
  • Does the PR use "Fix #XXX" notation to tell GitHub to close the relevant issue numbered XXX when the PR is merged?


codecov bot commented Jul 2, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.91%. Comparing base (acc3d78) to head (d5ad0e4).
Report is 1 commit behind head on dev.

Additional details and impacted files
@@            Coverage Diff             @@
##              dev    #1135      +/-   ##
==========================================
+ Coverage   88.89%   88.91%   +0.01%     
==========================================
  Files          45       45              
  Lines        9844     9857      +13     
  Branches     2799     2802       +3     
==========================================
+ Hits         8751     8764      +13     
  Misses        776      776              
  Partials      317      317              


@mavaylon1 mavaylon1 mentioned this pull request Jul 3, 2024
@mavaylon1 mavaylon1 changed the title from "Zarr Append a Dataset of References" to "Append a Dataset of References" Jul 8, 2024
@mavaylon1 mavaylon1 marked this pull request as ready for review July 13, 2024 18:45
@mavaylon1 mavaylon1 requested a review from rly July 13, 2024 18:45
Review comment on docs/source/install_users.rst (outdated, resolved)
Co-authored-by: Ryan Ly <rly@lbl.gov>
@rly
Contributor

rly commented Jul 13, 2024

Minor suggestion to a test. Looks good otherwise.

@mavaylon1 mavaylon1 requested a review from rly July 23, 2024 01:26
@rly
Contributor

rly commented Jul 25, 2024

I added a test that raises an unexpected error:

self = <Closed HDF5 file>, name = <HDF5 object reference (null)>

    @with_phil
    def __getitem__(self, name):
        """ Open an object in the file """
    
        if isinstance(name, h5r.Reference):
            oid = h5r.dereference(name, self.id)
            if oid is None:
>               raise ValueError("Invalid HDF5 object reference")
E               ValueError: Invalid HDF5 object reference

We just chatted in person, but noting here that you were going to take a look at it.

@mavaylon1
Contributor Author

@rly There may be a workaround, but I think the problem described below is enough to justify starting on the proxy idea.

def append(self, arg):
    # Walk up the parent chain to find the root builder
    child = arg
    while True:
        if child.parent is not None:
            parent = child.parent
            child = parent
        else:
            parent = child
            break
    self.io.manager.build(parent)
    builder = self.io.manager.build(arg)

    # Create Reference
    ref = self.io._create_ref(builder)
    append_data(self.dataset, ref)

When a user calls append with a reference, we build the root builder first. We then call _create_ref, which tries to create a reference via return self.__file[path].ref. This fails with KeyError: "Unable to open object 'new'", because it is trying to create a reference to an object, i.e., the new baz, within the file; however, that object is not in the file. It exists only in the root builder.

Why isn't it in the file? I am in append mode, right? Let's ignore the reference for now and just add the new baz. It works (sort of). When you read the file back, the new baz is not there. We need to call write again; once we do, it is there. That means when we try to create a reference, it looks for the new baz in the file and finds nothing, because the object is not added until write (which we never call during append).

Earlier I said in conversation that you do not need to call write. That is half true. With my prior method (seen below), you do not need to call write to append to a dataset of references, but you do need to call write to add a new baz, because it is itself a new group.
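To make the failure mode concrete, here is a minimal, purely illustrative sketch (not HDMF code; the names written, staged, and create_ref are hypothetical stand-ins): a reference can only target an object already written to the file, while the new baz exists only in the in-memory builder tree until write runs.

```python
# Hypothetical model of the failure: create_ref mimics file[path].ref,
# which can only resolve objects actually present in the file on disk.
written = {'/', '/bazs'}                  # paths written to the file
staged = {'/', '/bazs', '/bazs/new'}      # paths in the in-memory builder tree

def create_ref(path):
    """Stand-in for self.__file[path].ref: resolves only written objects."""
    if path not in written:
        raise KeyError("Unable to open object 'new'")
    return ('ref', path)

create_ref('/bazs')          # fine: '/bazs' is already in the file

try:
    create_ref('/bazs/new')  # fails: staged in the builder tree, never written
    error = None
except KeyError as e:
    error = str(e)
print('Raised:', error)
```

The point of the sketch is only that the lookup and the staging area are two different namespaces; until write copies the staged object into the file, the reference lookup has nothing to find.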

Earlier I had:

def append(self, arg):
    # Get Builder
    builder = self.io.manager.build(arg)

    # Get HDF5 Reference
    ref = self.io._create_ref(builder)
    append_data(self.dataset, ref)

This creates a reference, but it is not found by the test self.assertIs(read_bucket1.baz_data.data[10], read_bucket1.bazs["new"]), because the reference path is just \. That is wrong; it needs to be '/bazs/new'.

Note: yes, this is the same code as in hdmf-zarr. I started to wonder whether this could live in hdmf itself: append calls _create_ref, so all we would need is a unique _create_ref method per backend. In other words, we wouldn't need a zarr PR containing this logic, just some name changes, probably.

@rly
Copy link
Contributor

rly commented Jul 29, 2024

I see. Tricky indeed. You can't create an HDF5 reference to an object that isn't in the file yet, and rebuilding the whole hierarchy on each append is not ideal. A proxy makes sense. I can't think of another workaround without severely limiting and documenting the ways in which you cannot append.

Note: yes, this is the same code as in hdmf-zarr. I started to wonder whether this could live in hdmf itself: append calls _create_ref, so all we would need is a unique _create_ref method per backend. In other words, we wouldn't need a zarr PR containing this logic, just some name changes, probably.

That sounds useful to look into. You may be able to refactor it and some fields into the base HDMFDataset class.
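One way the refactor could look, as a hedged sketch only (the class names BaseReferenceDataset, HDF5ReferenceDataset, and ZarrReferenceDataset are illustrative, not HDMF's actual classes): the shared append logic lives on a base class, and each backend overrides only _create_ref with its native reference type.

```python
from abc import ABC, abstractmethod

class BaseReferenceDataset(ABC):
    """Shared append logic; subclasses supply the backend-native reference."""

    def __init__(self):
        self.dataset = []                      # stand-in for the on-disk dataset

    def append(self, builder_path):
        ref = self._create_ref(builder_path)   # backend-specific step
        self.dataset.append(ref)               # shared step

    @abstractmethod
    def _create_ref(self, builder_path):
        """Create a backend-native reference to the object at builder_path."""

class HDF5ReferenceDataset(BaseReferenceDataset):
    def _create_ref(self, builder_path):
        return ('h5ref', builder_path)         # e.g. file[path].ref in h5py

class ZarrReferenceDataset(BaseReferenceDataset):
    def _create_ref(self, builder_path):
        return {'source': '.', 'path': builder_path}  # e.g. a reference dict

h5 = HDF5ReferenceDataset()
h5.append('/bazs/new')

zr = ZarrReferenceDataset()
zr.append('/bazs/new')
```

Under this shape, the zarr side would need only its own _create_ref, matching the "just some name changes" idea above.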

@mavaylon1
Contributor Author

Add documentation here: NeurodataWithoutBorders/pynwb#1951

Review comment on src/hdmf/query.py (outdated, resolved)
mavaylon1 and others added 4 commits August 21, 2024 17:28
@mavaylon1 mavaylon1 merged commit e0bedca into dev Aug 22, 2024
29 checks passed
@mavaylon1 mavaylon1 deleted the zarr_append branch August 22, 2024 15:45