Discard control characters in Utility.escape #203

hosamaly · 2018-04-05T19:10:19Z

I encountered some input that happened to contain some control characters. This PR removes them from the output.

The previous behaviour used to exclude control characters in the smaller range. This PR excludes the ones between 0x7F (ASCII 127) and 0x9F (ASCII 159), plus possible future updates that Java supports.

I encountered some input that happened to contain some control characters. This PR removes them from the output. The previous behaviour used to exclude control characters in the smaller range. This PR excludes the ones between 0x7F (ASCII 127) and 0x9F (159), plus possible future updates that Java supports.

ashawley · 2018-04-05T20:21:53Z

Hosam,

Thanks for reporting this.

What version of scala-xml are you using?

The previous behaviour used to exclude control characters in the smaller range.

Previous to your PR or a previously released version of scala-xml?

Recently, the code responsible for this was changed, but the escaping shouldn't have changed. It should remained the same.

mbeckerle

I am not sure this change is desirable.

The C0 control characters are all illegal except CR, LF, TAB. That is, they're not allowed in XML 1.0 documents. However, the C1 controls, while discouraged (excepting U+0085, which is not discouraged), are legal in XML 1.0 and XML 1.1. This change makes formerly legal characters now illegal.

ashawley · 2018-04-06T04:23:23Z

The C0 control characters are all illegal except CR, LF, TAB. That is, they're not allowed in XML 1.0 documents. However, the C1 controls, while discouraged (excepting U+0085, which is not discouraged), are legal in XML 1.0 and XML 1.1.

That's my reading, as well.

This change makes formerly legal characters now illegal.

Agree, but it is probably true that scala-xml should technically be making 0x7F illegal? Presumably, there aren't a lot of ^? lurking out there in XML-land.

hosamaly · 2018-04-06T10:13:51Z

@ashawley I'm using scala-xml v1.1.0. The previous behaviour is in relation to this PR.

@mbeckerle Thanks for the information. I'm not an expert here. My rationale is that control characters such as 0x7F, 0x9D, and 0x9F are probably a security risk (although I'm not an expert there either).

I'm no longer sure about the best course of action. What do you recommend?

ashawley · 2018-04-06T14:28:30Z

We should probably study the issue further with tests. Unfortunately, we don't have good test coverage on Utility.escape in my opinion. I'd be willing to accept a new pull request that documents the current behavior in unit tests. After we merge it, we can better study your proposal, because we are friendly to your change in some form. I will try to work on writing unit tests, but feel free to beat me to it.

hosamaly · 2018-04-09T10:01:51Z

Thanks, @ashawley. Unfortunately, I didn't have time to work on this further but thanks for your PR. 👍

ashawley · 2018-04-11T19:06:32Z

Ok, the new tests against Utility.escape show that the proposed change to Utility.escape causes only two differences:

[error] Test scala.xml.UtilityTest.escapeAsciiControlCharsTest failed: expected:< \n\r\t[^?]> but was:< \n\r\t[]>, took 0.042 sec
[error] Test scala.xml.UtilityTest.escapeUnicodeExtendedControlCodesTest failed: expected:<[�]> but was:<[]>, took 0.004 sec

Above, ^? is DEL (\u007F) and the� is \u0080.

So this PR would effectively filter out 0x7F from output as well as the range of obscure continuation bytes in UTF-8 (0x80 to 0x9F). That's the extent of this PR. It does exactly what the javadoc Hosam linked to for java.lang.Character.isISOControl says. That seems acceptable to me, but it raises the concerns Mike Beckerle raised about the "Characters" section of the XML standard.

I suppose it also raises some of the same line of questions that were raised when we switched to the UTF-8 file encoding in #121. In particular, can we assume every user of scala-xml works with a UTF-8 file encoding? It seems like this change assumes so. I don't know enough about Unicode to know whether there is a risk of regression for those who are in something other than UTF-8.

mbeckerle · 2018-04-23T14:31:17Z

I believe this issue is unrelated to what character set encoding is being used.

The problematic C1 control codes that are the so called isoControl characters are characters, not bytes in some encoding. If a data file contains UTF-16 encoding, these control code simply occupy 16 bits instead of 8. They are Unicode-standard code points.

I believe we need to adhere to the XML standard here, and cannot make any character code, not even 7F illegal when the XML standard says it is legal. Apache Daffodil (incubator) a data parser that would definitely be broken hugely by such a change, and our users would encounter the issue whenever they try to parse data that contains these character codes and render it as XML.

That said, I'm not opposed to an option where these C1 controls are treated the same as the C0 controls, (except for U+0085 which should always be allowed) it just can't be the default behavior, which I believe must strictly respect the XML standards.

ashawley · 2018-05-06T17:52:18Z

Indeed, the XML spec says that XML "document authors" are "encouraged" to avoid these and other characters. I suppose the compiler should be changed to avoid allowing them in XML literals. All other XML data that passes through this library is in the author's hands.

I spent some more time looking at this, and it might be possible to add an option that can avoid these "discouraged control characters". We could probably add a parameter to XML.write or PrettyPrinter.format.

SethTisue · 2020-12-05T19:49:50Z

I'm closing this since as per @ashawley's review, simply discarding the characters doesn't seem like the right path forward here. Perhaps a new PR could take up his suggestion about an alternate path.

mbeckerle reviewed Apr 5, 2018

View reviewed changes

ashawley mentioned this pull request Apr 7, 2018

Add tests for Utility.escape #206

Merged

SethTisue closed this Dec 5, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Discard control characters in Utility.escape #203

Discard control characters in Utility.escape #203

Uh oh!

hosamaly commented Apr 5, 2018 •

edited

Loading

Uh oh!

ashawley commented Apr 5, 2018

Uh oh!

mbeckerle left a comment

Uh oh!

ashawley commented Apr 6, 2018

Uh oh!

hosamaly commented Apr 6, 2018 •

edited

Loading

Uh oh!

ashawley commented Apr 6, 2018

Uh oh!

hosamaly commented Apr 9, 2018

Uh oh!

ashawley commented Apr 11, 2018

Uh oh!

mbeckerle commented Apr 23, 2018

Uh oh!

ashawley commented May 6, 2018

Uh oh!

SethTisue commented Dec 5, 2020

Uh oh!

Uh oh!

Discard control characters in Utility.escape #203

Discard control characters in Utility.escape #203

Uh oh!

Conversation

hosamaly commented Apr 5, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ashawley commented Apr 5, 2018

Uh oh!

mbeckerle left a comment

Choose a reason for hiding this comment

Uh oh!

ashawley commented Apr 6, 2018

Uh oh!

hosamaly commented Apr 6, 2018 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ashawley commented Apr 6, 2018

Uh oh!

hosamaly commented Apr 9, 2018

Uh oh!

ashawley commented Apr 11, 2018

Uh oh!

mbeckerle commented Apr 23, 2018

Uh oh!

ashawley commented May 6, 2018

Uh oh!

SethTisue commented Dec 5, 2020

Uh oh!

Uh oh!

hosamaly commented Apr 5, 2018 •

edited

Loading

hosamaly commented Apr 6, 2018 •

edited

Loading