-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug fixes for XML Encoding and Decoding #288
Conversation
* Removes deprecation on XML.stringToValue(). It now provides unescaping for strings to convert XML entities back into values. * New unescape function to handle XML entities -> value conversion.
@eduveks. Please try this pull request. The implementation does not encode all unicode values. In your example, only the Euro sign would be escaped. Please ensure that you place the following at the top of your generated XML before transporting to another system: <?xml version=\"1.0\" encoding=\"UTF-8\"?> So for your example in #285, the xml sent to another system would look something like this: <?xml version="1.0" encoding="UTF-8"?>
<root><amount>10,00 €</amount><description>Ação Válida</description></root> At this time, I'm not willing to encode more than what is in this PR as it now matches what is in the Also, looking at XML specs, as long as you specify the encoding at the top as UTF-8, all readers should properly read the unicode characters. |
@johnjaylward Thanks for the quick response. What would it take to implement this in such a way that existing applications are not changed at all? |
@stleary most applications should not be affected. This is really only performing the entity encode on the same character space that |
No applications affected would be preferable. Going forward, changes to existing behavior will only be accepted in the case of bugs where the existing behavior is clearly worse than the fix. I called this out in the wiki, but probably did not phrase or emphasize it enough. Fixed now. |
If you want I can replace the check in the XML to only escape ISO Control characters: That is a much tighter range, and those really should be escaped. Do you think that range is acceptable? |
…he encoding to only encode control characters as defined by ISO standard.
@stleary the latest push has the check for ISO control characters only. No applications should be affected for normal characters now. I will update the PR description accordingly. |
I think we need to try to do this in such a way that there is no risk of changing existing applications. Or are you saying that this is a larger problem, and JSON-Java does not handle ISO control chars according to the spec? |
Correct. Control characters are technically not allowed in XML. Most parsers will allow things like \n (new line) but it's supposed to be encoded. |
I fixed the wording in the description. I incorrectly noted that the HTM LEntities reversibility was a change in this PR, but it is not. The handling of HTMLEntities (or any unrecognized named entity) is the same now as on master. The change is the handling of valid XML entities when encountered in an XML document and parsing to JSON. |
…an entire code point, not surrogate pairs like in JSON.
What would it take to make these changes optional? In other words, the developer has to opt-in in order to get the new behavior? |
Good news! @johnjaylward, to work correctly I was forced to change your code:
Do not works properly because the latin character with accents is not detected as special. But if you use:
Works very well! Before the for I declared the asciiEncoder:
By this way in the XMLTest.testJsonToXmlEscape now the result of XML.toString(json) will be:
It's is perfect because ç, ã, á, ... will be now very well encoded! With this change I already tested in my services and works like a charm. Right? |
@eduveks 2 things to note with your modification to my pull request:
@stleary I could look into either adding another flag parameter, or creating a new method. However, I feel that both changes in this PR are bug fixes, not features. When decoding XML, it should be the correct and proper behaviour to also decode the escape entities. Then when output somewhere else it should be the callers responsibility to re-encode it to whatever format they need. By leaving encoded entities like < or & it could lead to un-noticed bugs elsewhere. I think I know where you are coming from, that people using this feature probably have a decoding call themselves somewhere to make up for this bug. But I just can't see that as a good reason to leave this bug in there. |
@stleary I'm going through the XML specification and want to clean this up to better match the XML specification. The 3 sections I'm looking at in particular are:
I'd like to change the processing to better match the specs in these areas. My proposed changes include:
I think the changes for section 2.2 are important, but I'm not sure on 2.11. The spec says it has to perform "as if it normalized all line breaks". However, the note in section 2.3 states that it expectes the processor to remove all #xD characters. This not only seems unclear, but our parser currently doesn't seem to have any issues with new lines; I'm not sure we need any changes there. Maybe some new test cases are in order that have mixed new line characters? |
Thanks, have not had time to review this yet. Suddenly got super-busy at work. Some quick thoughts:
|
Recommend accepting this change, and starting the 3 day timer. After merge, new Maven releases will be put on hold for several months, in case any problems surface. |
This pull request is in reference to issue #285. It modifies how the XML parser handles string values and also updates the
escape
function to properly encode control characters using XML entities. Better reversibility is now available that entities are decoded on read. Previously they were encoded but not decoded.What problem does this code solve?
Fixes bug where control characters were being placed in XML as unescaped values. Also unescapes encoded values when reading XML into a JSONObject to support full reversibility of XML -> JSON -> XML.
Risks
Control Character Changes
Low risk bug fix. No users should be expecting control characters output as unicode characters. They will now see ISO Control characters encoded using the &#xXXXX; notation as defined by the running JVMs
Character.isISOControl(int)
method.Reversibility Changes
Low risk. Full reversibility is not offered with the XML class but improvements have been made in recent releases for Arrays (#170) and other minor items. Full reversibility is supposed to be offered through there JSONML class as it uses a normalized XML structure.
HTML Entities are still not supported and will still break reversibility in both XML and JSONML classes, however regular XML entities are now fully supported and are properly converted between the two object types. Most XML parsers will fail if they see an HTML specific entity when run in strict mode, so this is not a major concern here.
Example of broken HTML entities (not valid XML and should break most parsers):
This would turn into JSON that looks like this:
If the JSON is converted back to XML, proper XML would be generated like so:
The resulting XML does not match the original. This is probably the desired behavior, but should be noted. This has not changed in this pull request. This is current behaviour that is being noted here.
The parsing that has changed is in the case of valid XML entities. Now when valid XML entities are encountered in an XML document, they will be decoded upon conversion to JSON. The JSON -> XML conversion will still encode the entities (< > ' " &) as well as now encoding control characters. The XML -> JSON -> XML -> JSON will have the proper values over multiple iterations.
Changes to the API?
Yes, I removed the deprecation that I previously added to
XML.stringToValue
. I also added a new public static method to the XML class calledunescape
to complement the existing public static methodescape
.Will this require a new release?
Can be rolled into the next release, pending comments.
Should the documentation be updated?
No documentation updates required.
Does it break the unit tests?
No, but new unit tests have been added in a separate pull request.