-
Notifications
You must be signed in to change notification settings - Fork 92
Discard control characters in Utility.escape #203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I encountered some input that happened to contain some control characters. This PR removes them from the output. The previous behaviour used to exclude control characters in the smaller range. This PR excludes the ones between 0x7F (ASCII 127) and 0x9F (159), plus possible future updates that Java supports.
Hosam, Thanks for reporting this. What version of scala-xml are you using?
Previous to your PR or a previously released version of scala-xml? Recently, the code responsible for this was changed, but the escaping shouldn't have changed. It should remained the same. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure this change is desirable.
The C0 control characters are all illegal except CR, LF, TAB. That is, they're not allowed in XML 1.0 documents. However, the C1 controls, while discouraged (excepting U+0085, which is not discouraged), are legal in XML 1.0 and XML 1.1. This change makes formerly legal characters now illegal.
That's my reading, as well.
Agree, but it is probably true that scala-xml should technically be making 0x7F illegal? Presumably, there aren't a lot of ^? lurking out there in XML-land. |
@ashawley I'm using scala-xml v1.1.0. The previous behaviour is in relation to this PR. @mbeckerle Thanks for the information. I'm not an expert here. My rationale is that control characters such as I'm no longer sure about the best course of action. What do you recommend? |
We should probably study the issue further with tests. Unfortunately, we don't have good test coverage on |
Thanks, @ashawley. Unfortunately, I didn't have time to work on this further but thanks for your PR. 👍 |
Ok, the new tests against
Above, So this PR would effectively filter out I suppose it also raises some of the same line of questions that were raised when we switched to the UTF-8 file encoding in #121. In particular, can we assume every user of scala-xml works with a UTF-8 file encoding? It seems like this change assumes so. I don't know enough about Unicode to know whether there is a risk of regression for those who are in something other than UTF-8. |
I believe this issue is unrelated to what character set encoding is being used. The problematic C1 control codes that are the so called isoControl characters are characters, not bytes in some encoding. If a data file contains UTF-16 encoding, these control code simply occupy 16 bits instead of 8. They are Unicode-standard code points. I believe we need to adhere to the XML standard here, and cannot make any character code, not even 7F illegal when the XML standard says it is legal. Apache Daffodil (incubator) a data parser that would definitely be broken hugely by such a change, and our users would encounter the issue whenever they try to parse data that contains these character codes and render it as XML. That said, I'm not opposed to an option where these C1 controls are treated the same as the C0 controls, (except for U+0085 which should always be allowed) it just can't be the default behavior, which I believe must strictly respect the XML standards. |
Indeed, the XML spec says that XML "document authors" are "encouraged" to avoid these and other characters. I suppose the compiler should be changed to avoid allowing them in XML literals. All other XML data that passes through this library is in the author's hands. I spent some more time looking at this, and it might be possible to add an option that can avoid these "discouraged control characters". We could probably add a parameter to |
I'm closing this since as per @ashawley's review, simply discarding the characters doesn't seem like the right path forward here. Perhaps a new PR could take up his suggestion about an alternate path. |
I encountered some input that happened to contain some control characters. This PR removes them from the output.
The previous behaviour used to exclude control characters in the smaller range. This PR excludes the ones between
0x7F
(ASCII 127) and0x9F
(ASCII 159), plus possible future updates that Java supports.