Skip to content

ISO-8859-1 is used as default encoding #121

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Rogach opened this issue Nov 8, 2016 · 18 comments
Closed

ISO-8859-1 is used as default encoding #121

Rogach opened this issue Nov 8, 2016 · 18 comments
Milestone

Comments

@Rogach
Copy link
Contributor

Rogach commented Nov 8, 2016

ISO-8859-1 (also known as latin1) is used as default encoding of XML.save:
https://github.com/scala/scala-xml/blob/master/src/main/scala/scala/xml/XML.scala#L67

This results in errors if document contains non-latin1 characters. Why this encoding is used by default instead of UTF-8?

@ashawley
Copy link
Member

ashawley commented Nov 8, 2016

Why this encoding is used by default instead of UTF-8?

Presumably for backwards compatability.

In 2003, ISO-8859-1 became the default for save in 5d19bda.

Changing the default to UTF-8 would be a good improvement to make in a major release.

Would it help to add a mention of ISO-8859-1 to the documentation, with an example of using UTF-8?

    scala.xml.XML.save("foo.xml", <foo/>, "UTF-8")

@ashawley
Copy link
Member

ashawley commented Nov 8, 2016

That ISO-8859-1 was the default, was mentioned in Scala 2.6.1:

http://www.scala-lang.org/api/2.6.1/scala/xml/XML$object.html

Today, ISO-8859-1 is not mentioned:

http://scala-lang.org/files/archive/api/2.12.0/scala-xml/scala/xml/XML$.html

It was removed in 31b4fe1.

@Rogach
Copy link
Contributor Author

Rogach commented Nov 8, 2016

I understand the value of backwards compatibility, but sadly the next major release would probably never happen.

Yes, adding a mention to documentation would definitely help! Actually, if value was displayed in actual function signature (enc: String = "ISO-8859-1") it would be immediately obvious, but since it is factored out into value scaladoc only displays enc: String = encoding.

Should I make a pull request with this change?

P.S. "Another big XML commit" - and here I am, teaching people about value of good commit messages =)

@SethTisue
Copy link
Member

I'm not sure I agree that an actual fix would need to wait until the next major version

@Rogach
Copy link
Contributor Author

Rogach commented Nov 8, 2016

How else would it be possible? It would probably break code that depends on latin1 characters outside of ASCII range.

@SethTisue
Copy link
Member

I guess it depends how much meaning you attach to the version number. I'm not a semver purist, but some are. I would argue that user code that didn't specify an encoding had a latent bug in it and we should feel free to break that code, with a prominent explanation in the release note.

Basically I think we should ship a fix sooner rather than later. I don't have a strong opinion about the version number.

@Rogach
Copy link
Contributor Author

Rogach commented Nov 8, 2016

I, for one, am very much for it - my code would not break :)

@ashawley
Copy link
Member

ashawley commented Nov 9, 2016

The value of encoding in save is used in the XML declaration:

<?xml version='1.0' encoding='ISO-8859-1'?>

However, the default is not to put in an XML declaration. Only if someone did one of these two would the explicit mention of ISO-8859-1 be included:

scala.xml.XML.save("foo.xml", <foo/>, "ISO-8859-1", true)
scala.xml.XML.save("foo.xml", <foo/>, xmlDecl = true)

The default also writes bytes in ISO-8859-1. As rogach alluded, ASCII is forward compatible with UTF-8, but ISO-8859-1 is not.

According to the XML spec, "parsed entities which are stored in an encoding other than UTF-8 or UTF-16 must begin with a text declaration". So writing something other than ASCII, UTF-8 or UTF-16 without an appropriate declaration, as this library does, is not compliant.

Here's proof from our friend, the Xerces parsing library, that chokes on a simple example:

scala> val latin1 = ((32 to 126).toStream ++ (160 to 255)).map(_.toChar)
latin1: scala.collection.immutable.Stream[Char] = Stream( , ?)

scala> scala.xml.XML.save("foo.xml", <x>{ latin1.mkString }</x>)

scala> scala.xml.XML.loadFile("foo.xml")
com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
...

Adding a declaration, with xmlDecl set to true, allows it to be parsed:

scala> scala.xml.XML.save("foo.xml", <x>{ latin1.mkString }</x>, xmlDecl = true)

scala> scala.xml.XML.loadFile("foo.xml")
res3: scala.xml.Elem = <x> !&quot;#$%&amp;'()*+,-./0123456789:;&lt;=&gt;?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×
ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ</x>

The proposal to make UTF-8 default, would make save compliant, and fix the bug:

scala> scala.xml.XML.save("foo.xml", <x>{ latin1.mkString }</x>, "UTF-8")

res3: scala.xml.Elem = <x> !&quot;#$%&amp;'()*+,-./0123456789:;&lt;=&gt;?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz
{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×
ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ</x>

It is a "latent bug", as Seth suggests. For users of save not specifying an encoding, fixing the default to be UTF-8 could introduce bugs in their code, but that depends on

  1. if they're using the high-order bits of ISO-8859-1, and
  2. what is consuming the invalid XML they produce.

There's a chance that something in (2) could behave in a more permissive or dumber way than Xerces did above.

@Rogach
Copy link
Contributor Author

Rogach commented Nov 10, 2016

@ashawley - awesome summary!

I think that most other libs will expect utf-8 as default by now, so keeping latin1 as default encoding only creates more latent bugs in the wild.

Question to developers: should I make a pull request with the change or somebody else will do it?

@ashawley
Copy link
Member

Yeah, seems it's worth a PR that fixes the issue. Thank you

@SethTisue
Copy link
Member

Should it default to java.nio.charset.Charset.defaultCharset, or to some fixed default such as UTF-8?

@Rogach
Copy link
Contributor Author

Rogach commented Nov 10, 2016

I would prefer fixed default - it is saner and less prone to surprises. And I believe everybody should use UTF-8 as default charset anyway.

@Rogach
Copy link
Contributor Author

Rogach commented Nov 10, 2016

Here's the PR: #122

ashawley added a commit to ashawley/scala-xml that referenced this issue Nov 11, 2016
Try to closely mimic bug in XML.save and XML.loadFile, but write tests
that don't use the file system.

Will fail in 1.0.6 and earlier:

    expected:<...klmnopqrstuvwxyz{|}~[ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]</x>>
    but was:<...klmnopqrstuvwxyz{|}~[????????????????????????????????????????????????????????????????????????????????????????????????]</x>>

Will be fixed in scala#122.
Rogach pushed a commit to Rogach/scala-xml that referenced this issue Nov 11, 2016
Try to closely mimic bug in XML.save and XML.loadFile, but write tests
that don't use the file system.

Will fail in 1.0.6 and earlier:

    expected:<...klmnopqrstuvwxyz{|}~[ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]</x>>
    but was:<...klmnopqrstuvwxyz{|}~[????????????????????????????????????????????????????????????????????????????????????????????????]</x>>

Will be fixed in scala#122.
Rogach pushed a commit to Rogach/scala-xml that referenced this issue Feb 4, 2017
Try to closely mimic bug in XML.save and XML.loadFile, but write tests
that don't use the file system.

Will fail in 1.0.6 and earlier:

    expected:<...klmnopqrstuvwxyz{|}~[ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]</x>>
    but was:<...klmnopqrstuvwxyz{|}~[????????????????????????????????????????????????????????????????????????????????????????????????]</x>>

Will be fixed in scala#122.
Rogach pushed a commit to Rogach/scala-xml that referenced this issue May 24, 2017
Try to closely mimic bug in XML.save and XML.loadFile, but write tests
that don't use the file system.

Will fail in 1.0.6 and earlier:

    expected:<...klmnopqrstuvwxyz{|}~[ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖרÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]</x>>
    but was:<...klmnopqrstuvwxyz{|}~[????????????????????????????????????????????????????????????????????????????????????????????????]</x>>

Will be fixed in scala#122.
@ashawley
Copy link
Member

The fix was merged in #122. Thanks for reporting this and for submitting the fix. With any luck, this will be released in 1.0.7 relatively soon.

@kaabawan
Copy link

What is the proper way to default it to UTF-8?
I tried -Dfile.encoding=UTF8 JVM option but it did not work

@Rogach
Copy link
Contributor Author

Rogach commented Nov 29, 2019

@kaabawan - If you use scala-xml version 1.1.0 or above it should default to using UTF-8 automatically. If you are using older scala-xml, then you will need to explicitly pass proper encoding into the .save method, since ISO-8859-1 was hardcoded.

@SethTisue
Copy link
Member

@kaabawan note that it's possible to declare a dependency on 1.1.0 but not actually get 1.1.0, as described in #195 ; you might want to check your actual classpath and make sure you're actually getting the version you expected

@ashawley
Copy link
Member

ashawley commented Nov 29, 2019

It's also documented in the Wiki (which is linked in the last comment in that issue):

https://github.com/scala/scala-xml/wiki/Getting-started

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants