Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix trimValueWhitespaces removing needed white-spaces #226

Conversation

MartinP7r
Copy link
Contributor

@MartinP7r MartinP7r commented Aug 14, 2021

Hello everyone!

I found that the logic for trimming white-spaces (trimValueWhitespaces) may inadvertently remove white-spaces from inside the string value of the element, if XMLParser reports the content of a single element as multiple strings.

This happens for example with non-ascii characters like German umlauts:
Lösen is split up and reported separately as L and ösen.
Or where we have "mixed" strings, e.g. containing a segment of Japanese or Chinese characters:
Japanese 日本語 is split into Japanese and 日本語.

If there's a white-space adjacent to the split, as in the Japanese example above, the white-space gets trimmed before it's joined together. Resulting in Japanese日本語 without a space in the middle.

I added failing tests with the first commit.


see also: https://stackoverflow.com/a/1076951/2064473

The parser object may send the delegate several parser:foundCharacters: messages to report the characters of an element. Because string may be only part of the total character content for the current element, you should append it to the current accumulation of characters until the element changes.

see: https://stackoverflow.com/a/1076951/2064473

"The parser object may send the delegate several parser:foundCharacters: messages to report the characters of an element. Because string **may be only part of the total character content for the current element**, you should append it to the current accumulation of characters until the element changes."
@MartinP7r MartinP7r force-pushed the fix/nonStandardCharacter_SpacePreservation branch from 5ab0d61 to 7c693a0 Compare August 14, 2021 08:01
@MaxDesiatov
Copy link
Collaborator

@wooj2 Since you've introduced this in the first place, WDYT about this change?

@MartinP7r
Copy link
Contributor Author

Thank you for having a look, @MaxDesiatov .

I have an idea for a patch, but will wait on @wooj2 to weigh in before I spend time on it.

Refactor `process()` function name to `trimWhitespacesIfNeeded()`
@MartinP7r
Copy link
Contributor Author

MartinP7r commented Aug 20, 2021

I added a possible fix which seems to be not to intrusive.

It seems like a commit in #92 caused the change and added process(string:) in parser(_:foundCharacters:).
I kept it there for the relevant check for emptiness only and moved the actual trimming into parser(_:didEndElement:namespaceURI:qualifiedName:) where it's sure that no further strings will be appended.

Tests are passing, but please let me know if you can think of possible regressions.

@codecov
Copy link

codecov bot commented Aug 20, 2021

Codecov Report

Merging #226 (48393da) into main (887de88) will increase coverage by 0.06%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main     #226      +/-   ##
==========================================
+ Coverage   73.94%   74.00%   +0.06%     
==========================================
  Files          46       46              
  Lines        2437     2443       +6     
==========================================
+ Hits         1802     1808       +6     
  Misses        635      635              
Impacted Files Coverage Δ
Sources/XMLCoder/Auxiliaries/XMLCoderElement.swift 96.55% <100.00%> (+0.06%) ⬆️
Sources/XMLCoder/Auxiliaries/XMLStackParser.swift 94.05% <100.00%> (+0.12%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 887de88...48393da. Read the comment docs.

@wooj2
Copy link
Contributor

wooj2 commented Aug 20, 2021

Thank you for having a look, @MaxDesiatov .

I have an idea for a patch, but will wait on @wooj2 to weigh in before I spend time on it.

Sorry for the delay. There was some confusion around trimValueWhitespaces in that I was mistakenly identified as the one who introduced this flag. That is not the case :).

If I understand the problem correctly, it seems that in the case of "Stilles Örtchen", parser(XMLParser, foundCharacters:) is getting called once w/ the value of "Stilles " and subsequently with "Örtchen". When trimValueWhitespaces is set to true, the XMLParser delegate incorrectly strips out the space between s and Ö.

Looking at your approach for a fix, I think applying trimValueWhitespaces to the value of the element after the entire element's value has been parsed(parser(XMLParser, didEndElement:...)) is elegant, and seems to make sense. The only comment that I would have is maybe adding some test cases

Nested case:

private let deCharacterXMLNested = """
<si><t xml:space="preserve">\(deCharacterString)</t></si>
""".data(using: .utf8)!

...snip...

    func testNonStandardCharactersNested() throws {
        let decoder = XMLDecoder(trimValueWhitespaces: true)

        let resultGerman = try XMLDecoder().decode(Item.self, from: deCharacterXMLNested)

        XCTAssertEqual(resultGerman.text, deCharacterString)
    }

Spaces preserved:

private let deCharacterXMLSpaced = """
<t>     \(deCharacterString)    </t>
""".data(using: .utf8)!

...snip...

    func testNonStandardCharactersSpaced() throws {
        let decoder = XMLDecoder(trimValueWhitespaces: false)

        let resultGerman = try decoder.decode(String.self, from: deCharacterXMLSpaced)

        XCTAssertEqual(resultGerman, "     \(deCharacterString)    ")
    }

I'll defer to @MaxDesiatov for the final approval and/or any other comments.

for nested xml and untrimmed case

thanks to @wooj2 for the suggestion!
@MartinP7r
Copy link
Contributor Author

MartinP7r commented Aug 21, 2021

Sorry for the confusion @wooj2!
Thank you for having a look at my solution and especially for suggesting the additional tests.
I think I don't fully understand the XML specification for whitespace preservation, yet, so I highly appreciate it!
I went ahead and added the tests. I hope that's OK.


If I may, I have another slightly different question about whitespaces in XML.
(I can open a new issue if that makes more sense.)

In http://www.xmlplease.com/xml/xmlspace/ under 3. Whitespace only text nodes it says that something like

<root xml:space="preserve">
    <test> This is     great. </test>
</root>

would have the consecutive spaces between "is" and "great" (...) be reduced to just one space character. like in normal HTML parsing. This doesn't seem to be the case for the current implementation of XMLCoder.
I.e. a test as below

private let multipleSpacesBetween = "A      word"
private let multipleSpacesBetweenXML = """
<t>\(multipleSpacesBetween)</t>
""".data(using: .utf8)!

...snip...

    func testMultipleSpacesBetween() throws {
        let decoder = XMLDecoder(trimValueWhitespaces: true)

        let resultJapanese = try decoder.decode(String.self, from: multipleSpacesBetweenXML)

        XCTAssertEqual(resultJapanese, "A word")
    }

produces

XCTAssertEqual failed: ("A      word") is not equal to ("A word")

I'm sorry if I've missed something, but could you tell me if that is the way it should be or might this warrant a fix?

Best regards,
Martin

@MaxDesiatov
Copy link
Collaborator

Sorry, I'm a bit confused about XCTAssertEqual failed: ("A word") is not equal to ("A word"). Does this make any sense to you? These strings seem to be equal to each other at a first glance. Am I missing something? Is the comparison flawed, or is there some other whitespace character there that makes these strings not equal to each other?

@MartinP7r
Copy link
Contributor Author

MartinP7r commented Aug 22, 2021

I'm sorry! I did not realize that GitHub's markdown parser processes whitespace differently for code in single-backticks than for three-backticks. So, the single-backtick version reduces consecutive whitespaces into one, just like normal HTML (which is ironically just what I was asking about and the documentation I linked is about).

I edited it to use three-backticks, so the error message displays correctly now.

XCTAssertEqual failed: ("A      word") is not equal to ("A word")

edit:
As I said, this question is probably unrelated to the current PR.
I just wanted to ask whether or not this is working as intended or if it should be as described in the linked documentation. I'd be happy to open a separate issue or PR with the failing tests above for this if you think it's relevant.

@MartinP7r MartinP7r marked this pull request as ready for review August 22, 2021 01:53
@MaxDesiatov
Copy link
Collaborator

Stuff like this related to the actual low level parsing of XML is a question about Foundation's XMLParser behavior. We only provide helpers on top of it.

I wonder how with this caveat in the standard one is supposed to encode or decode any strings with multiple spaces in them? The fact that multiple whitespaces aren't currently collapsed into one is either a happy accident or a deliberate decision made by Foundation authors. Either way, if that's not directly related to this PR, it does make sense to open a separate issue or PR for that.

@MartinP7r
Copy link
Contributor Author

That makes a lot of sense. Thank you for the explanation!🙂
Since it's the underlying parser's behavior this is working as intended.
Thank you for taking the time to answer this and sorry for the excursion.

@MaxDesiatov
Copy link
Collaborator

No problem at all, thanks for digging into all of these details! Very illuminating and super helpful 🙂

@MaxDesiatov
Copy link
Collaborator

BTW, is there still anything to do within the scope of this PR? Or is it fully ready for review now?

@MartinP7r
Copy link
Contributor Author

From my perspective it's complete.
Please let me know if you think there's something missing.

@MaxDesiatov MaxDesiatov changed the title [WIP] fix trimValueWhitespaces removing needed white-spaces Fix trimValueWhitespaces removing needed white-spaces Aug 22, 2021
Copy link
Collaborator

@MaxDesiatov MaxDesiatov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, many thanks!

@MaxDesiatov MaxDesiatov merged commit 5c5b3f2 into CoreOffice:main Aug 22, 2021
@mflint
Copy link
Collaborator

mflint commented Nov 5, 2021

Thank-you for raising this issue, and fixing it, @MartinP7r.

I found that this was affecting text containing &amp;. For example, "one &amp; two" would be parsed into the string one&two. I love investigating a problem, and then finding that someone else has already fixed it! ❤️

@MaxDesiatov
Copy link
Collaborator

Thanks for letting me know this is a widespread problem. I'll tag 0.13.1 with this fix soon.

@MartinP7r
Copy link
Contributor Author

MartinP7r commented Nov 7, 2021

@mflint, thank you for your kind words! It's very rewarding to hear that this is helpful to others 😃

Also, thank you for bringing up your use case. It might not be entirely necessary, but since XML entities like &amp;, &lt; or &gt; are somewhat special, I opened a PR (#234) adding another test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants