ISO-8859-1 is used as default encoding #121
Comments
Presumably for backwards compatibility. In 2003, ISO-8859-1 became the default for Changing the default to UTF-8 would be a good improvement to make in a major release. Would it help to add a mention of ISO-8859-1 to the documentation, with an example of using UTF-8?
    scala.xml.XML.save("foo.xml", <foo/>, "UTF-8") |
The Scala 2.6.1 API documentation mentioned that ISO-8859-1 was the default: http://www.scala-lang.org/api/2.6.1/scala/xml/XML$object.html
Today, ISO-8859-1 is not mentioned: http://scala-lang.org/files/archive/api/2.12.0/scala-xml/scala/xml/XML$.html
The mention was removed in 31b4fe1. |
I understand the value of backwards compatibility, but sadly the next major release will probably never happen. Yes, adding a mention to the documentation would definitely help! Actually, if the value was displayed in the actual function signature ( Should I make a pull request with this change? P.S. "Another big XML commit" - and here I am, teaching people about the value of good commit messages =) |
I'm not sure I agree that an actual fix would need to wait until the next major version |
How else would it be possible? It would probably break code that depends on latin1 characters outside of ASCII range. |
I guess it depends how much meaning you attach to the version number. I'm not a semver purist, but some are. I would argue that user code that didn't specify an encoding had a latent bug in it, and we should feel free to break that code, with a prominent explanation in the release notes. Basically I think we should ship a fix sooner rather than later. I don't have a strong opinion about the version number. |
I, for one, am very much for it - my code would not break :) |
The value of <?xml version='1.0' encoding='ISO-8859-1'?> However, the default is not to put in an XML declaration. Only if someone did one of these two would the explicit mention of ISO-8859-1 be included:
    scala.xml.XML.save("foo.xml", <foo/>, "ISO-8859-1", true)
    scala.xml.XML.save("foo.xml", <foo/>, xmlDecl = true)
The default also writes bytes in ISO-8859-1. As rogach alluded, ASCII is forward compatible with UTF-8, but ISO-8859-1 is not. According to the XML spec, "parsed entities which are stored in an encoding other than UTF-8 or UTF-16 must begin with a text declaration". So writing something other than ASCII, UTF-8 or UTF-16 without an appropriate declaration, as this library does, is not compliant. Here's proof from our friend, the Xerces parsing library, that chokes on a simple example:
    scala> val latin1 = ((32 to 126).toStream ++ (160 to 255)).map(_.toChar)
    latin1: scala.collection.immutable.Stream[Char] = Stream( , ?)
    scala> scala.xml.XML.save("foo.xml", <x>{ latin1.mkString }</x>)
    scala> scala.xml.XML.loadFile("foo.xml")
    com.sun.org.apache.xerces.internal.impl.io.MalformedByteSequenceException: Invalid byte 1 of 1-byte UTF-8 sequence.
    ...
Adding a declaration, with
The proposal to make UTF-8 default, would make
It is a "latent bug", as Seth suggests. For users of
There's a chance that something in (2) could behave in a more permissive or dumber way than Xerces did above. |
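For readers without scala-xml at hand, the byte-level mismatch behind the Xerces failure can be reproduced with the JDK alone. This is only a sketch: the object and method names (`StrictUtf8Demo`, `decodeStrictUtf8`) are made up for illustration, and it uses a strict `CharsetDecoder` as a stand-in for a parser that assumes UTF-8 when no declaration is present. It does not call scala-xml or Xerces.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

// Hypothetical helper, not part of scala-xml.
object StrictUtf8Demo {
  // Same character range as the REPL session above: printable ASCII plus 160 to 255.
  val latin1: String = ((32 to 126) ++ (160 to 255)).map(_.toChar).mkString

  // Decode bytes as UTF-8 and report malformed input instead of replacing it.
  def decodeStrictUtf8(bytes: Array[Byte]): Either[CharacterCodingException, String] = {
    val decoder = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try Right(decoder.decode(ByteBuffer.wrap(bytes)).toString)
    catch { case e: CharacterCodingException => Left(e) }
  }

  def main(args: Array[String]): Unit = {
    val asLatin1 = latin1.getBytes(StandardCharsets.ISO_8859_1)
    val asUtf8   = latin1.getBytes(StandardCharsets.UTF_8)
    // Bytes written as ISO-8859-1 are rejected by a strict UTF-8 decoder.
    println(decodeStrictUtf8(asLatin1).isLeft)
    // The same text written as UTF-8 decodes back unchanged.
    println(decodeStrictUtf8(asUtf8) == Right(latin1))
    // Pure ASCII is byte-identical in both encodings, so it is unaffected.
    println(decodeStrictUtf8("plain ascii".getBytes(StandardCharsets.ISO_8859_1)).isRight)
  }
}
```

Only the characters in the 160 to 255 range trigger the failure, which is why code that wrote nothing but ASCII never noticed the default.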
@ashawley - awesome summary! I think that most other libs will expect UTF-8 as the default by now, so keeping latin1 as the default encoding only creates more latent bugs in the wild. Question to developers: should I make a pull request with the change, or will somebody else do it? |
Yeah, seems it's worth a PR that fixes the issue. Thank you |
Should it default to |
I would prefer a fixed default - it is saner and less prone to surprises. And I believe everybody should use UTF-8 as the default charset anyway. |
Here's the PR: #122 |
Try to closely mimic bug in XML.save and XML.loadFile, but write tests that don't use the file system. Will fail in 1.0.6 and earlier:
expected:<...klmnopqrstuvwxyz{|}~[ ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]</x>>
but was:<...klmnopqrstuvwxyz{|}~[????????????????????????????????????????????????????????????????????????????????????????????????]</x>>
Will be fixed in scala#122.
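A test that avoids the file system, as the commit message describes, can be approximated with in-memory streams. The following is a rough sketch under assumptions, not the test from the commit: `RoundTripDemo`, `roundTrip` and `sample` are hypothetical names, and plain JDK readers and writers stand in for `XML.save` and `XML.loadFile`. A lenient reader substitutes a replacement character for each malformed byte, which is one plausible way to end up with a run of placeholder characters like the one in the failure output.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, InputStreamReader, OutputStreamWriter}
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical helper, not the test from the commit.
object RoundTripDemo {
  // A small document with one character outside ASCII (e with acute accent).
  val sample: String = "<x>caf\u00e9</x>"

  // Write text into an in-memory buffer with one charset, read it back with another.
  def roundTrip(text: String, writeAs: Charset, readAs: Charset): String = {
    val out = new ByteArrayOutputStream()
    val writer = new OutputStreamWriter(out, writeAs)
    writer.write(text)
    writer.close() // flushes the encoder into the buffer

    val reader = new InputStreamReader(new ByteArrayInputStream(out.toByteArray), readAs)
    val sb = new StringBuilder
    var c = reader.read()
    while (c != -1) { sb.append(c.toChar); c = reader.read() }
    reader.close()
    sb.toString
  }

  def main(args: Array[String]): Unit = {
    // Mismatched charsets: the reader replaces the malformed byte with U+FFFD.
    println(roundTrip(sample, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8) == sample)
    // Matching charsets: the text survives unchanged.
    println(roundTrip(sample, StandardCharsets.UTF_8, StandardCharsets.UTF_8) == sample)
  }
}
```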
The fix was merged in #122. Thanks for reporting this and for submitting the fix. With any luck, this will be released in 1.0.7 relatively soon. |
What is the proper way to default it to UTF-8? |
@kaabawan - If you use |
It's also documented in the Wiki (which is linked in the last comment in that issue): https://github.com/scala/scala-xml/wiki/Getting-started |
ISO-8859-1 (also known as latin1) is used as the default encoding of XML.save: https://github.com/scala/scala-xml/blob/master/src/main/scala/scala/xml/XML.scala#L67
This results in errors if the document contains non-latin1 characters. Why is this encoding used by default instead of UTF-8?
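To illustrate why non-latin1 characters are a problem, here is a minimal sketch using only the JDK. The names (`UnmappableDemo`, `fitsIn`, `encodeStrict`, `throughCharset`) are invented for this example. Whether a particular writer throws or silently substitutes depends on how its encoder is configured, so the sketch shows both behaviours and makes no claim about which one scala-xml exhibits internally.

```scala
import java.nio.CharBuffer
import java.nio.charset.{CharacterCodingException, Charset, StandardCharsets}

// Hypothetical helper, not part of scala-xml.
object UnmappableDemo {
  // Greek small letter pi, which has no ISO-8859-1 code point.
  val pi: String = "\u03c0"

  // True if every character of `text` can be represented in `charset`.
  def fitsIn(text: String, charset: Charset): Boolean =
    charset.newEncoder().canEncode(text)

  // Strict encoding: a fresh encoder reports unmappable characters as an exception.
  def encodeStrict(text: String, charset: Charset): Either[CharacterCodingException, Array[Byte]] =
    try {
      val buf = charset.newEncoder().encode(CharBuffer.wrap(text))
      val bytes = new Array[Byte](buf.remaining())
      buf.get(bytes)
      Right(bytes)
    } catch { case e: CharacterCodingException => Left(e) }

  // Lenient encoding: String.getBytes substitutes the charset's replacement byte.
  def throughCharset(text: String, charset: Charset): String =
    new String(text.getBytes(charset), charset)

  def main(args: Array[String]): Unit = {
    println(fitsIn(pi, StandardCharsets.ISO_8859_1))             // false
    println(encodeStrict(pi, StandardCharsets.ISO_8859_1).isLeft) // true
    println(throughCharset(pi, StandardCharsets.ISO_8859_1))      // ?
    println(throughCharset(pi, StandardCharsets.UTF_8) == pi)     // true
  }
}
```

With UTF-8 every Unicode character is representable, so neither failure mode can occur on the writing side.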