Fix `trimValueWhitespaces` removing needed white-spaces #226
see: https://stackoverflow.com/a/1076951/2064473 "The parser object may send the delegate several parser:foundCharacters: messages to report the characters of an element. Because string **may be only part of the total character content for the current element**, you should append it to the current accumulation of characters until the element changes."
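The accumulation pattern described in that quote can be sketched roughly as follows. This is a minimal illustration using Foundation's `XMLParser` directly, not XMLCoder's actual code; the `TextCollector` class and its `values` property are hypothetical names introduced here. It resets the buffer at every start tag, so it only suits flat, text-only elements.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives here on non-Apple platforms
#endif

/// Hypothetical delegate (not part of XMLCoder): collects the text of each
/// element by appending every chunk the parser reports.
final class TextCollector: NSObject, XMLParserDelegate {
    private var buffer = ""
    private(set) var values: [String] = []

    func parser(_ parser: XMLParser, didStartElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?,
                attributes attributeDict: [String: String] = [:]) {
        // A new element starts: begin a fresh accumulation.
        buffer = ""
    }

    func parser(_ parser: XMLParser, foundCharacters string: String) {
        // `string` may be only part of the element's content, so append it
        // untouched instead of trimming or storing it on its own.
        buffer += string
    }

    func parser(_ parser: XMLParser, didEndElement elementName: String,
                namespaceURI: String?, qualifiedName qName: String?) {
        // Trim once, after all chunks of the element have been joined.
        values.append(buffer.trimmingCharacters(in: .whitespacesAndNewlines))
    }
}

let xml = "<t> Japanese 日本語 </t>".data(using: .utf8)!
let parser = XMLParser(data: xml)
let collector = TextCollector()
parser.delegate = collector
let parsed = parser.parse()
print(parsed, collector.values)
```

Because trimming happens only in `didEndElement`, it does not matter how many `foundCharacters` calls the parser makes for one element.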
5ab0d61 to 7c693a0
@wooj2 Since you've introduced this in the first place, WDYT about this change?
Thank you for having a look, @MaxDesiatov. I have an idea for a patch, but will wait on @wooj2 to weigh in before I spend time on it.
Refactor `process()` function name to `trimWhitespacesIfNeeded()`
I added a possible fix which seems to be not too intrusive. It seems like a commit in #92 caused the change and added […]. Tests are passing, but please let me know if you can think of possible regressions.
Codecov Report
@@            Coverage Diff             @@
##             main     #226      +/-   ##
==========================================
+ Coverage   73.94%   74.00%   +0.06%
==========================================
  Files          46       46
  Lines        2437     2443       +6
==========================================
+ Hits         1802     1808       +6
  Misses        635      635
Sorry for the delay. There was some confusion around […]
If I understand the problem correctly, it seems that in the case of "Stilles Örtchen", […]
Looking at your approach for a fix, I think applying […]
Nested case: […]
Spaces preserved: […]
I'll defer to @MaxDesiatov for the final approval and/or any other comments.
for nested XML and untrimmed case. Thanks to @wooj2 for the suggestion!
Sorry for the confusion, @wooj2! If I may, I have another, slightly different question about whitespace in XML. In http://www.xmlplease.com/xml/xmlspace/, under "3. Whitespace only text nodes", it says that something like

<root xml:space="preserve">
<test> This is great. </test>
</root>

would have […]

private let multipleSpacesBetween = "A word"
private let multipleSpacesBetweenXML = """
<t>\(multipleSpacesBetween)</t>
""".data(using: .utf8)!
...snip...
func testMultipleSpacesBetween() throws {
    let decoder = XMLDecoder(trimValueWhitespaces: true)
    let resultJapanese = try decoder.decode(String.self, from: multipleSpacesBetweenXML)
    XCTAssertEqual(resultJapanese, "A word")
}

produces […]

I'm sorry if I've missed something, but could you tell me whether that is the way it should be, or might this warrant a fix? Best regards,
Sorry, I'm a bit confused about […]
I'm sorry! I did not realize that GitHub's markdown parser processes whitespace differently for code in single-backticks than for three-backticks. So, the single-backtick version reduces consecutive whitespaces into one, just like normal HTML (which is ironically just what I was asking about and the documentation I linked is about). I edited it to use three-backticks, so the error message displays correctly now.
Stuff like this, related to the actual low-level parsing of XML, is a question about Foundation's […]
I wonder how, with this caveat in the standard, one is supposed to encode or decode any strings with multiple spaces in them. The fact that multiple whitespaces aren't currently collapsed into one is either a happy accident or a deliberate decision made by the Foundation authors. Either way, if that's not directly related to this PR, it does make sense to open a separate issue or PR for it.
That makes a lot of sense. Thank you for the explanation! 🙂
No problem at all, thanks for digging into all of these details! Very illuminating and super helpful 🙂
BTW, is there still anything to do within the scope of this PR? Or is it fully ready for review now?
From my perspective it's complete.
LGTM, many thanks!
Thank you for raising this issue, and fixing it, @MartinP7r. I found that this was affecting text containing […]
Thanks for letting me know this is a widespread problem. I'll tag 0.13.1 with this fix soon.
Hello everyone!

I found that the logic for trimming white-spaces (`trimValueWhitespaces`) may inadvertently remove white-spaces from inside the string value of the element, if `XMLParser` reports the content of a single element as multiple strings.

This happens, for example, with non-ASCII characters like German umlauts: `Lösen` is split up and reported separately as `L` and `ösen`.

Or where we have "mixed" strings, e.g. containing a segment of Japanese or Chinese characters: `Japanese 日本語` is split into `Japanese` and `日本語`.

If there's a white-space adjacent to the split, as in the Japanese example above, the white-space gets trimmed before the pieces are joined together, resulting in `Japanese日本語` without a space in the middle.

I added failing tests with the first commit.

see also: https://stackoverflow.com/a/1076951/2064473
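To make the failure mode concrete, here is a small standalone sketch of the difference between trimming each reported chunk and trimming once after joining. It does not use XMLCoder; the `chunks` array is an assumed example of how the parser might split the text, with the space placed on the first chunk for illustration.

```swift
import Foundation

// Assumed chunks for one element's text, as they might arrive in
// separate foundCharacters callbacks (the split point is illustrative).
let chunks = ["Japanese ", "日本語"]

// Trimming every chunk before joining drops the space next to the split.
let trimmedPerChunk = chunks
    .map { $0.trimmingCharacters(in: .whitespacesAndNewlines) }
    .joined()

// Joining first and trimming once only touches the outer edges.
let trimmedAfterJoin = chunks
    .joined()
    .trimmingCharacters(in: .whitespacesAndNewlines)

print(trimmedPerChunk)   // Japanese日本語
print(trimmedAfterJoin)  // Japanese 日本語
```

The second variant only removes leading and trailing white-space of the whole value, which is presumably what a user enabling `trimValueWhitespaces` expects.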