Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Render axes attributes as list of double quoted strings #155

Merged
merged 26 commits into from
Nov 15, 2021

Conversation

carterbox
Copy link
Collaborator

@carterbox carterbox commented Nov 3, 2021

Previously, arrays of strings or byte arrays were truncated to only show the first element, but this is incompatible with the Nexus standard which expects axes to be provided as an array of strings.

This PR changes the behavior of utils.decode_byte_string() and h5tree.renderAttribute so that all types of strings are converted to strings and arrays of strings are converted to lists of strings. This allows rendering of strings and lists of strings with double quotes. However, it also causes some knock-on effects where unrelated functions which always expected a string, now need to deal with lists of strings. These cases which caused existing tests to fail were patched.

Closes #129

Previously, arrays of strings or byte arrays were truncated to
only show the first element, but this is incompatible with the
Nexus standard which expects axes to be provided as an array of
strings.

Closes prjemian#129
@carterbox carterbox force-pushed the 128-double-quote-string branch from 2e4efd4 to 81ea1ce Compare November 3, 2021 21:01
punx/tests/test_issue129.py Outdated Show resolved Hide resolved
@prjemian prjemian added the task label Nov 5, 2021
@prjemian prjemian added this to the v1.0.0-release milestone Nov 5, 2021
@carterbox carterbox requested a review from prjemian November 7, 2021 21:55
@prjemian
Copy link
Owner

prjemian commented Nov 10, 2021

Actually, this is not quite what I had in mind. More like (with this file created):

import h5py

with h5py.File("/tmp/demo.h5", "w") as h5:
    nxentry = h5.create_group("Scan")
    nxentry.attrs["NX_class"] = "NXentry"

    nxdata = nxentry.create_group("data")
    nxdata.attrs["NX_class"] = "NXdata"
    nxdata.attrs["axes"] = "two_theta"
    nxdata.attrs["signal"] = "counts"
    nxdata.attrs["two_theta_indices"] = 0
    nxdata.attrs["multi_str"] = "one two three".split()

    ds = nxentry.create_dataset(
        "counts",
        data=[
            1037, 1318, 1704, 2857, 4516
        ]
    )
    ds.attrs["units"] = "counts"

    ds = nxentry.create_dataset(
        "two_theta",
        data=[
            17.92608, 17.92591, 17.92575, 17.92558, 17.92541
        ]
    )
    ds.attrs["units"] = "degrees"

Should give a report like this:

(bluesky_2022_1) prjemian@zap:~/.../prjemian/punx$ punx tree /tmp/demo.h5 

!!! WARNING: this program is not ready for distribution.

/tmp/demo.h5 : NeXus data file
  Scan:NXentry
    @NX_class = NXentry
    counts:NX_INT64[5] = [1037, 1318, 1704, 2857, 4516]
      @units = counts
    two_theta:NX_FLOAT64[5] = [17.92608, 17.92591, 17.92575, 17.92558, 17.92541]
      @units = degrees
    data:nxdata
      @NX_class = NXdata
      @axes = two_theta
      @multi_str = ["one", "two", "three"]
      @signal = counts
      @two_theta_indices = 0

Today, the code renders this one attribute: @multi_str = one (it only gets the first element in the list)

Thinking that all string attributes should be rendered with double quotes and lists of strings should be rendered with square brackets.

@prjemian
Copy link
Owner

prjemian commented Nov 10, 2021

more like this:

(bluesky_2022_1) prjemian@zap:~/.../prjemian/punx$ punx tree /tmp/demo.h5 

!!! WARNING: this program is not ready for distribution.

/tmp/demo.h5 : NeXus data file
  Scan:NXentry
    @NX_class = "NXentry"
    counts:NX_INT64[5] = [1037, 1318, 1704, 2857, 4516]
      @units = "counts"
    two_theta:NX_FLOAT64[5] = [17.92608, 17.92591, 17.92575, 17.92558, 17.92541]
      @units = "degrees"
    data:nxdata
      @NX_class = "NXdata"
      @axes = "two_theta"
      @multi_str = ["one", "two", "three"]
      @signal = "counts"
      @two_theta_indices = 0

@carterbox
Copy link
Collaborator Author

I changed the behavior of decode_byte_string because I don't think anyone would use numpy.array(["this is a test"]). When they should be doing numpy.array(["this", "is", "a", "test"]). Is that OK?

Copy link
Owner

@prjemian prjemian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@carterbox This looks good. Are there any other test cases to construct? If not, merge when ready.

@carterbox
Copy link
Collaborator Author

I don't know why the CI passes. There are a few failing tests for me locally!

@prjemian
Copy link
Owner

Please copy the complete the console failure report here.

Some attributes were being rendered not using the _renderAttributes
function. Now this function has been refactored, so that all attribute
rendering occurs in the same code segment.
@carterbox
Copy link
Collaborator Author

/home/dching/Documents/punx/punx/data/Data_Q.h5 : NeXus data file
  sasentry01:NXentry
    @NX_class = ["NXentry"]
    @canSAS_class = ["SASentry"]
    end_time:NX_CHAR = b'2014-04-25T09:34:06.087'
    start_time:NX_CHAR = b'2014-04-25T09:33:43.637'
    sasdata01:NXdata
      @I_axes = ["Q,Q"]
      @NX_class = ["NXdata"]
      @Q_indices = ["0,1"]
      @canSAS_class = ["SASdata"]
      @canSAS_version = ["0.1"]
      @signal = ["I"]
      I:NX_FLOAT32[100,100] = __array
        __array = [
            [0.0, 0.0, 0.0, '...', 0.0]
            [0.0, 0.0, 0.0, '...', 0.0]
            [0.0, 0.0, 0.0, '...', 0.0]
            ...
            [0.0, 0.0, 0.0, '...', 0.0]
          ]
        @IGORWaveType = 2
        @axes = ["Q"]
        @signal = [1]
        @uncertainty = [""]
      Q:NX_FLOAT32[100,100] = __array
        __array = [
            [0.46678692, 0.45850673, 0.45038438, '...', 0.45968014]
            [0.45850673, 0.4500742, 0.44179693, '...', 0.45126957]
            [0.45038444, 0.44179687, 0.43336153, '...', 0.4430146]
            ...
            [0.45968014, 0.45126957, 0.4430146, '...', 0.45246175]
          ]
        @IGORWaveType = 2

This example file for the validation tests is now "wrong" because it uses a numpy array of a single string. What should I do about that?

@carterbox
Copy link
Collaborator Author

carterbox commented Nov 11, 2021

The validate tests are currently failing where they expect a string but get a list instead. For example:

punx/validate.py:239: in validate
    self.build_address_catalog()
punx/validate.py:269: in build_address_catalog
    self._group_address_catalog_(None, self.h5)
punx/validate.py:297: in _group_address_catalog_
    self._group_address_catalog_(parent, group[item])
punx/validate.py:293: in _group_address_catalog_
    obj = get_subject(parent, group)
punx/validate.py:283: in get_subject
    v = ValidationItem(parent, o)
punx/validate.py:409: in __init__
    self.classpath = self.determine_NeXus_classpath()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <punx.validate.ValidationItem object at 0x7f062d75f700>

    def determine_NeXus_classpath(self):
        """
        determine the NeXus class path
    
        :see: http://download.nexusformat.org/sphinx/preface.html#class-path-specification
    
        EXAMPLE
    
        Given this NeXus data file structure::
    
            /
                entry: NXentry
                    data: NXdata
                        @signal = data
                        data: NX_NUMBER
    
        For the "signal" attribute of this HDF5 address: ``/entry/data``,
        its NeXus class path is: ``/NXentry/NXdata@signal``
    
        The ``@signal`` attribute has the value of ``data`` which means
        that the local field named ``data`` is the plottable data.
    
        The HDF5 address of the plottable data is: ``/entry/data/data``,
        its NeXus class path is: ``/NXentry/NXdata/data``
        """
        if self.name == SLASH:
            return ""
        else:
            h5_obj = self.h5_object
    
            classpath = str(self.parent.classpath)
            if classpath == CLASSPATH_OF_NON_NEXUS_CONTENT:
                logger.log(INFORMATIVE, "%s is not NeXus content", h5_obj.name)
                return CLASSPATH_OF_NON_NEXUS_CONTENT
    
            if not classpath.endswith(SLASH):
                if utils.isHdf5Group(h5_obj):
                    if "NX_class" in h5_obj.attrs:
                        nx_class = utils.decode_byte_string(h5_obj.attrs["NX_class"])
>                       if nx_class.startswith("NX"):
E                       AttributeError: 'list' object has no attribute 'startswith'

punx/validate.py:486: AttributeError

Fails because nx_class is ["NX"] instead of "NX"

@prjemian
Copy link
Owner

I agree that the examples are incorrect. They can be used to test for invalid structures. Will the proper structure be caught by other tests? If so, then the test can be modified to check that invalid structure is detected.

It is possible that other tests and data files may need such review.

@carterbox
Copy link
Collaborator Author

carterbox commented Nov 12, 2021

unitests aren't being run on this PR. It's because I'm using my own fork instead of using branches here. Should I add:

on: [push, pull_request]

to the GitHub action?

@carterbox carterbox requested a review from prjemian November 12, 2021 19:27
@prjemian
Copy link
Owner

Should I add:

Sure.

Copy link
Owner

@prjemian prjemian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look fine to me.

@prjemian
Copy link
Owner

prjemian commented Nov 13, 2021 via email

@prjemian
Copy link
Owner

prjemian commented Nov 14, 2021

The point is that @axes in the existing test is not a list of str. The test must construct and test a test data file with two or more independent axes and a @signal with commensurate dimensions. 2-D (or higher) data. For this, need to consult the NeXus documentation and construct the data. Also need to confirm the data type of the @axes attribute, whether str or list(str).

@prjemian
Copy link
Owner

prjemian commented Nov 14, 2021

In the NeXus documentation (in the Index), the @axes attribute can be added either to the NXdata group (preferred), or to a dataset (discouraged, this is the old way, still valid but discouraged).

The 2-D example shows the structure is list(str) as shown:

@axes=["time","pressure"]

where the datasets time and pressure exist in the same group.

The unit test should create a data file with the structure of the 2-D example and assert the structure and content of the @axes attribute and also verify that the named axes exist as datasets in the same group.

@prjemian
Copy link
Owner

Random numbers for the actual data (of the datasets) are sufficient. These values will not be tested.

@prjemian
Copy link
Owner

Removed the print() calls since they do not contribute to the tests and will not appear in the pytest console output.

@prjemian
Copy link
Owner

Number data types are OS-dependent as shown in this example when testing on Windows 10:

        # compare line-by-line, except for file name
        # TODO: data size is OS-dependent?
        for ref, xture in zip(report[1:], structure[1:]):
>           assert ref.strip() == xture.strip()
E           AssertionError: assert 'data:NX_INT32[4,3] = [ ... ]' == 'data:NX_INT64[4,3] = [ ... ]'
E             - data:NX_INT64[4,3] = [ ... ]
E             ?            ^^
E             + data:NX_INT32[4,3] = [ ... ]
E             ?            ^^

punx\tests\test_issue129.py:77: AssertionError

@prjemian
Copy link
Owner

@carterbox This is ready for your comment. Should I recycle my review approval?

Copy link
Collaborator Author

@carterbox carterbox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@carterbox carterbox merged commit 7875832 into prjemian:main Nov 15, 2021
@carterbox carterbox deleted the 128-double-quote-string branch November 15, 2021 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tree: use double quotes when rendering string values
2 participants