Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_yaml.pyx deletes comments when dumping using recent versions of ruamel.yaml #1495

Closed
gtristan opened this issue Aug 12, 2021 · 10 comments · Fixed by #1546
Closed

_yaml.pyx deletes comments when dumping using recent versions of ruamel.yaml #1495

gtristan opened this issue Aug 12, 2021 · 10 comments · Fixed by #1546

Comments

@gtristan
Copy link
Contributor

gtristan commented Aug 12, 2021

With recent versions of ruamel.yaml, our test case tests/internals/yaml.py::test_roundtrip_dump is failing, and deleting comments from the test file.

This is filed upstream, but is not reproducible with vanilla ruamel, this only happens with our _yaml.pyx customizations.

ruamel shallow bisection report

Version 0.17.6

Since this version of ruamel.yaml, our loading/dumping routine results in deleting comments from the roundtrip-test.yaml sample file.

The roundtrip-test.yaml file is changed in the following way:

24,26d23
< 
< # We roundtrip integers and floats as strings, to prevent truncation.
< 
32,35d28
< 
< # We also roundtrip booleans in various forms as strings to prevent
< # normalisation to 'True' and 'False'
< 
51d43
< 
53,55d44
< 
< # That is all
< 

Version 0.17.5

In this version of ruamel.yaml, the test is broken more severely with these versions, instead of losing comments we simply get the following crash (which is an instance of ruamel.yaml issue 385):

datafiles = local('/home/tristan/work/buildstream/.tox/py39-nocover/tmp/test_roundtrip_dump_True_0'), fromdisk = True

    @pytest.mark.datafiles(os.path.join(DATA_DIR))
    @pytest.mark.parametrize("fromdisk", [(True), (False)])
    def test_roundtrip_dump(datafiles, fromdisk):
        filename = os.path.join(datafiles.dirname, datafiles.basename, "roundtrip-test.yaml")
        with open(filename, "r") as fh:
            rt_raw = fh.read()
        if fromdisk:
>           rt_loaded = _yaml.roundtrip_load(filename)

tests/internals/yaml.py:415: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
src/buildstream/_yaml.pyx:427: in buildstream._yaml.roundtrip_load
    contents = roundtrip_load_data(data, filename=filename)
src/buildstream/_yaml.pyx:461: in buildstream._yaml.roundtrip_load_data
    contents = yaml.load(contents, yaml.RoundTripLoader, preserve_quotes=True)
.tox/py39-nocover/lib/python3.9/site-packages/ruamel/yaml/main.py:1067: in load
    return loader._constructor.get_single_data()
.tox/py39-nocover/lib/python3.9/site-packages/ruamel/yaml/constructor.py:120: in get_single_data
    node = self.composer.get_single_node()
.tox/py39-nocover/lib/python3.9/site-packages/ruamel/yaml/composer.py:76: in get_single_node
    document = self.compose_document()
.tox/py39-nocover/lib/python3.9/site-packages/ruamel/yaml/composer.py:99: in compose_document
    node = self.compose_node(None, None)
.tox/py39-nocover/lib/python3.9/site-packages/ruamel/yaml/composer.py:142: in compose_node
    node = self.compose_mapping_node(anchor)
.tox/py39-nocover/lib/python3.9/site-packages/ruamel/yaml/composer.py:222: in compose_mapping_node
    item_value = self.compose_node(node, item_key)
.tox/py39-nocover/lib/python3.9/site-packages/ruamel/yaml/composer.py:142: in compose_node
    node = self.compose_mapping_node(anchor)
.tox/py39-nocover/lib/python3.9/site-packages/ruamel/yaml/composer.py:222: in compose_mapping_node
    item_value = self.compose_node(node, item_key)
.tox/py39-nocover/lib/python3.9/site-packages/ruamel/yaml/composer.py:142: in compose_node
    node = self.compose_mapping_node(anchor)
.tox/py39-nocover/lib/python3.9/site-packages/ruamel/yaml/composer.py:222: in compose_mapping_node
    item_value = self.compose_node(node, item_key)
.tox/py39-nocover/lib/python3.9/site-packages/ruamel/yaml/composer.py:112: in compose_node
    if self.parser.check_event(AliasEvent):
.tox/py39-nocover/lib/python3.9/site-packages/ruamel/yaml/parser.py:145: in check_event
    self.current_event = self.state()
.tox/py39-nocover/lib/python3.9/site-packages/ruamel/yaml/parser.py:636: in parse_block_mapping_value
    return self.parse_block_node_or_indentless_sequence()
.tox/py39-nocover/lib/python3.9/site-packages/ruamel/yaml/parser.py:350: in parse_block_node_or_indentless_sequence
    return self.parse_node(block=True, indentless_sequence=True)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <ruamel.yaml.loader.RoundTripLoader object at 0x7ff65bd85040>, block = True, indentless_sequence = True

    def parse_node(self, block=False, indentless_sequence=False):
        # type: (bool, bool) -> Any
        if self.scanner.check_token(AliasToken):
            token = self.scanner.get_token()
            event = AliasEvent(token.value, token.start_mark, token.end_mark)  # type: Any
            self.state = self.states.pop()
            return event
    
        anchor = None
        tag = None
        start_mark = end_mark = tag_mark = None
        if self.scanner.check_token(AnchorToken):
            token = self.scanner.get_token()
            self.move_token_comment(token)
            start_mark = token.start_mark
            end_mark = token.end_mark
            anchor = token.value
            if self.scanner.check_token(TagToken):
                token = self.scanner.get_token()
                tag_mark = token.start_mark
                end_mark = token.end_mark
                tag = token.value
        elif self.scanner.check_token(TagToken):
            token = self.scanner.get_token()
            start_mark = tag_mark = token.start_mark
            end_mark = token.end_mark
            tag = token.value
            if self.scanner.check_token(AnchorToken):
                token = self.scanner.get_token()
                start_mark = tag_mark = token.start_mark
                end_mark = token.end_mark
                anchor = token.value
        if tag is not None:
            handle, suffix = tag
            if handle is not None:
                if handle not in self.tag_handles:
                    raise ParserError(
                        'while parsing a node',
                        start_mark,
                        _F('found undefined tag handle {handle!r}', handle=handle),
                        tag_mark,
                    )
                tag = self.transform_tag(handle, suffix)
            else:
                tag = suffix
        # if tag == '!':
        #     raise ParserError("while parsing a node", start_mark,
        #             "found non-specific tag '!'", tag_mark,
        #      "Please check 'http://pyyaml.org/wiki/YAMLNonSpecificTag'
        #     and share your opinion.")
        if start_mark is None:
            start_mark = end_mark = self.scanner.peek_token().start_mark
        event = None
        implicit = tag is None or tag == '!'
        if indentless_sequence and self.scanner.check_token(BlockEntryToken):
            comment = None
            pt = self.scanner.peek_token()
>           if self.loader and self.loader.comment_handling is None:
E           AttributeError: 'RoundTripLoader' object has no attribute 'comment_handling'

.tox/py39-nocover/lib/python3.9/site-packages/ruamel/yaml/parser.py:413: AttributeError

Version 0.17.4

This is the last version of ruamel.yaml which behaves how we expect it to.

@gtristan
Copy link
Contributor Author

It looks to me that the main reasons we are overriding these deep ruamel codepaths are:

  • To load all scalar values as strings, this helps our internal codepaths in various places and ensures we never treat things like version numbers as if they were float/double values
  • For the same reason, to ensure that ruamel.yaml does not normalize scalar values, like turning the user's intentional True into a true, introducing needless changes in their .bst files
  • For performance reasons, we have a cython Representer object which is one of the measures taken to optimize the load process, see 865fbce

The upstream maintainer has suggested pinning the ruamel.yaml version as a solution, given that it will be hard to preserve API stability at these lower levels of the ruamel API (at the higher levels, we haven't had much trouble with the ruamel.yaml API over the years).

Unfortunately, I very much doubt that pinning the ruamel.yaml version can be a solution, as we cannot be sure what versions of ruamel.yaml a distro chooses to ship, and the virtual environment install path is really only useful to users who know what python is or have the patience to fiddle with such things.

Some potential ways forward:

  • Try to fix this by rewriting our custom Representer, potentially maintaining different representer objects as needed for different host ruamel.yaml implementations, this however seems expensive
  • Try to remove the Representer and reduce the amount of deep customization
    • It seems to me we've made a lot of optimizations in the loading codepaths, and it's worth re-evaluating if this particular optimization is worth the cost of keeping up with a churning ruamel.yaml
    • Surely we don't need this much customization only for the other required features, such as treating all scalars as strings

@nanonyme
Copy link
Contributor

nanonyme commented Sep 4, 2021

In addition to this, roundtrip_load is deprecated and will be removed.

@nanonyme
Copy link
Contributor

nanonyme commented Sep 4, 2021

Regarding: "For the same reason, to ensure that ruamel.yaml does not normalize scalar values, like turning the user's intentional True into a true, introducing needless changes in their .bst files" I've always wondered why this is done. It seems like borken behaviour and confuses the boundary between YAML declarative files and Python element implementation. If you want to have True value, then it should be a string.

@nanonyme
Copy link
Contributor

nanonyme commented Sep 4, 2021

"To load all scalar values as strings, this helps our internal codepaths in various places and ensures we never treat things like version numbers as if they were float/double values" this is iffier. In cases whether typing is not possible, user errors are quite probably when strict YAML interpretation is used. That said, again, this is about blurring the fact that the input is YAML. By making it really really be treated as regular YAML there will be less pain if you ever end up having to switch to a different YAML implementation.

@gtristan
Copy link
Contributor Author

gtristan commented Sep 9, 2021

By making it really really be treated as regular YAML there will be less pain if you ever end up having to switch to a different YAML implementation.

I haven't looked very far into the spec, but as far as I can see, types such as booleans or null values are said to have a canonical form, I'm not sure it says anything about what implementations may or may not accept for these forms (like "True", "trUe", "TRUE" etc, it's unclear whether an implementation should support these or not).

That said, I also cannot find any recommendations in the standard regarding round tripping behavior - I don't think anyone expects that a scalar value should be forced into it's canonical form as a result of round tripping, and I rather think the point of round tripping is to preserve the original content unless it was modified.

I don't think I agree at this point that ruamel.yaml provides the best UX for these types, and I don't see any evidence that suggests that other round tripping YAML implementations would also fail to retain the original literal string representation of unmodified scalar values.

@nanonyme
Copy link
Contributor

nanonyme commented Sep 9, 2021

I don't see alternative forms here https://yaml.org/spec/1.2/#id2803629. I think the things you mentioned should be considered strings

@nanonyme
Copy link
Contributor

nanonyme commented Sep 9, 2021

I also consider it wrong if ruamel.yaml normalizes strings to booleans

@nanonyme
Copy link
Contributor

nanonyme commented Sep 9, 2021

Note if BuildStream converts string "True" to boolean and expects ruamel.yaml not to dump it as true, then the bug is clearly in BuildStream.

@gtristan
Copy link
Contributor Author

gtristan commented Sep 9, 2021

I also consider it wrong if ruamel.yaml normalizes strings to booleans

Its not completely clear to me, ruamel.yaml does not normalize strings per se, so long as they are quoted or use a YAML syntax which clearly denotes the scalar value to be a string, it does however introduce changes to valid YAML when roundtripping, so I would say that the roundtripping is imperfect in the sense that some user preferences are lost.

Note if BuildStream converts string "True" to boolean and expects ruamel.yaml not to dump it as true, then the bug is clearly in BuildStream.

BuildStream forces all scalars to be treated as strings (both at load and save time) and this works around normalizations while also ensuring that things like version numbers can be written plainly without quotes or complex YAML syntaxes.

To refer back to the upstream wontfix issue, we can clearly see that plain usage of ruamel.yaml does introduce changes to boolean and null scalar values: https://sourceforge.net/p/ruamel-yaml/tickets/390/#c65a

@abderrahim
Copy link
Contributor

I've noticed this with ruamel.yaml too. I suggest sticking to ruamel.yaml < 0.17 for now, and update once we figure out how to port to the new version (or better yet, how to support both versions).

abderrahim added a commit to abderrahim/buildstream that referenced this issue Dec 13, 2021
Also drop roundtrip_load_data that was unused

Fixes apache#1495
abderrahim added a commit to abderrahim/buildstream that referenced this issue Dec 13, 2021
Also drop roundtrip_load_data that was unused

Fixes apache#1495
abderrahim added a commit to abderrahim/buildstream that referenced this issue Dec 29, 2021
Also drop roundtrip_load_data that was unused

Fixes apache#1495
abderrahim added a commit to abderrahim/buildstream that referenced this issue Feb 12, 2022
Also drop roundtrip_load_data that was unused

Fixes apache#1495
abderrahim added a commit to abderrahim/buildstream that referenced this issue Mar 16, 2022
Also drop roundtrip_load_data that was unused

Fixes apache#1495
abderrahim added a commit to abderrahim/buildstream that referenced this issue Mar 19, 2022
Also drop roundtrip_load_data that was unused

Fixes apache#1495
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants