Skip to content

Discard control characters in Utility.escape #203

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Discard control characters in Utility.escape #203

wants to merge 1 commit into from

Conversation

hosamaly
Copy link
Contributor

@hosamaly hosamaly commented Apr 5, 2018

I encountered some input that happened to contain some control characters. This PR removes them from the output.

The previous behaviour used to exclude control characters in the smaller range. This PR excludes the ones between 0x7F (ASCII 127) and 0x9F (ASCII 159), plus possible future updates that Java supports.

I encountered some input that happened to contain some control characters. This PR removes them from the output.

The previous behaviour used to exclude control characters in the smaller range. This PR excludes the ones between 0x7F (ASCII 127) and 0x9F (159), plus possible future updates that Java supports.
@ashawley
Copy link
Member

ashawley commented Apr 5, 2018

Hosam,

Thanks for reporting this.

What version of scala-xml are you using?

The previous behaviour used to exclude control characters in the smaller range.

Previous to your PR or a previously released version of scala-xml?

Recently, the code responsible for this was changed, but the escaping shouldn't have changed. It should remained the same.

Copy link

@mbeckerle mbeckerle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure this change is desirable.

The C0 control characters are all illegal except CR, LF, TAB. That is, they're not allowed in XML 1.0 documents. However, the C1 controls, while discouraged (excepting U+0085, which is not discouraged), are legal in XML 1.0 and XML 1.1. This change makes formerly legal characters now illegal.

@ashawley
Copy link
Member

ashawley commented Apr 6, 2018

The C0 control characters are all illegal except CR, LF, TAB. That is, they're not allowed in XML 1.0 documents. However, the C1 controls, while discouraged (excepting U+0085, which is not discouraged), are legal in XML 1.0 and XML 1.1.

That's my reading, as well.

This change makes formerly legal characters now illegal.

Agree, but it is probably true that scala-xml should technically be making 0x7F illegal? Presumably, there aren't a lot of ^? lurking out there in XML-land.

@hosamaly
Copy link
Contributor Author

hosamaly commented Apr 6, 2018

@ashawley I'm using scala-xml v1.1.0. The previous behaviour is in relation to this PR.

@mbeckerle Thanks for the information. I'm not an expert here. My rationale is that control characters such as 0x7F, 0x9D, and 0x9F are probably a security risk (although I'm not an expert there either).

I'm no longer sure about the best course of action. What do you recommend?

@ashawley
Copy link
Member

ashawley commented Apr 6, 2018

We should probably study the issue further with tests. Unfortunately, we don't have good test coverage on Utility.escape in my opinion. I'd be willing to accept a new pull request that documents the current behavior in unit tests. After we merge it, we can better study your proposal, because we are friendly to your change in some form. I will try to work on writing unit tests, but feel free to beat me to it.

@hosamaly
Copy link
Contributor Author

hosamaly commented Apr 9, 2018

Thanks, @ashawley. Unfortunately, I didn't have time to work on this further but thanks for your PR. 👍

@ashawley
Copy link
Member

Ok, the new tests against Utility.escape show that the proposed change to Utility.escape causes only two differences:

[error] Test scala.xml.UtilityTest.escapeAsciiControlCharsTest failed: expected:< \n\r\t[^?]> but was:< \n\r\t[]>, took 0.042 sec
[error] Test scala.xml.UtilityTest.escapeUnicodeExtendedControlCodesTest failed: expected:<[�]> but was:<[]>, took 0.004 sec

Above, ^? is DEL (\u007F) and the is \u0080.

So this PR would effectively filter out 0x7F from output as well as the range of obscure continuation bytes in UTF-8 (0x80 to 0x9F). That's the extent of this PR. It does exactly what the javadoc Hosam linked to for java.lang.Character.isISOControl says. That seems acceptable to me, but it raises the concerns Mike Beckerle raised about the "Characters" section of the XML standard.

I suppose it also raises some of the same line of questions that were raised when we switched to the UTF-8 file encoding in #121. In particular, can we assume every user of scala-xml works with a UTF-8 file encoding? It seems like this change assumes so. I don't know enough about Unicode to know whether there is a risk of regression for those who are in something other than UTF-8.

@mbeckerle
Copy link

I believe this issue is unrelated to what character set encoding is being used.

The problematic C1 control codes that are the so called isoControl characters are characters, not bytes in some encoding. If a data file contains UTF-16 encoding, these control code simply occupy 16 bits instead of 8. They are Unicode-standard code points.

I believe we need to adhere to the XML standard here, and cannot make any character code, not even 7F illegal when the XML standard says it is legal. Apache Daffodil (incubator) a data parser that would definitely be broken hugely by such a change, and our users would encounter the issue whenever they try to parse data that contains these character codes and render it as XML.

That said, I'm not opposed to an option where these C1 controls are treated the same as the C0 controls, (except for U+0085 which should always be allowed) it just can't be the default behavior, which I believe must strictly respect the XML standards.

@ashawley
Copy link
Member

ashawley commented May 6, 2018

Indeed, the XML spec says that XML "document authors" are "encouraged" to avoid these and other characters. I suppose the compiler should be changed to avoid allowing them in XML literals. All other XML data that passes through this library is in the author's hands.

I spent some more time looking at this, and it might be possible to add an option that can avoid these "discouraged control characters". We could probably add a parameter to XML.write or PrettyPrinter.format.

@SethTisue
Copy link
Member

I'm closing this since as per @ashawley's review, simply discarding the characters doesn't seem like the right path forward here. Perhaps a new PR could take up his suggestion about an alternate path.

@SethTisue SethTisue closed this Dec 5, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants