ParsingException on unicode U+FFFF character #254

anilkumarmyla · 2018-03-16T18:18:30Z

Self-explanatory with the following code:

Welcome to Scala 2.12.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_162).
Type in expressions for evaluation. Or try :help.

scala> import spray.json._
import spray.json._

scala> val a: String = """{"hello":"a\uFFFFworld"}"""
a: String = {"hello":"a�world"}

scala> a.parseJson
spray.json.JsonParser$ParsingException: Unexpected end-of-input at input index 11 (line 1, position 12), expected '"':
{"hello":"a
           ^

  at spray.json.JsonParser.fail(JsonParser.scala:217)
  at spray.json.JsonParser.require(JsonParser.scala:200)
  at spray.json.JsonParser.string(JsonParser.scala:148)
  at spray.json.JsonParser.value(JsonParser.scala:67)
  at spray.json.JsonParser.members$1(JsonParser.scala:85)
  at spray.json.JsonParser.object(JsonParser.scala:90)
  at spray.json.JsonParser.value(JsonParser.scala:64)
  at spray.json.JsonParser.parseJsValue(JsonParser.scala:46)
  at spray.json.JsonParser.parseJsValue(JsonParser.scala:42)
  at spray.json.JsonParser$.apply(JsonParser.scala:28)
  at spray.json.RichString.parseJson(package.scala:50)
  ... 36 elided

scala> a.getBytes
res1: Array[Byte] = Array(123, 34, 104, 101, 108, 108, 111, 34, 58, 34, 97, -17, -65, -65, 119, 111, 114, 108, 100, 34, 125)

scala> a.replaceAll("\uFFFF", "").parseJson
res2: spray.json.JsValue = {"hello":"aworld"}

scala> a.replaceAll("\uFFFF", "").getBytes
res3: Array[Byte] = Array(123, 34, 104, 101, 108, 108, 111, 34, 58, 34, 97, 119, 111, 114, 108, 100, 34, 125)

scala>


ramanmishra · 2018-05-08T17:07:05Z

Can you please share your build.sbt or your spray-json library version?

anilkumarmyla · 2018-05-10T02:56:46Z

> can you please share your build.sbt. or sprayJson library version.

happens with the latest version - 1.3.4

jrudolph · 2018-07-10T14:50:19Z

Hi @anilkumarmyla, that's by design (but it could be documented better). According to the Unicode standard, \uffff is a noncharacter that is reserved for "process-internal" use. That's exactly how it is used inside of spray-json: it designates the end of input.
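
To illustrate why an in-band sentinel produces the "Unexpected end-of-input" error reported above, here is a minimal sketch of a cursor that returns U+FFFF once the input is exhausted. The names `SentinelSketch`, `Cursor`, and `charsBeforeEoi` are hypothetical and chosen for illustration; this is not spray-json's actual source, only a simplified model of the behavior described in this comment.

```scala
// Hypothetical, simplified model of a sentinel-based cursor (not spray-json's code).
object SentinelSketch {
  // Noncharacter reused as an in-band end-of-input marker (assumption for this sketch).
  val EOI: Char = '\uFFFF'

  final class Cursor(input: String) {
    private var index = -1

    // Returns the next character, or EOI once the input is exhausted.
    def advance(): Char = {
      index += 1
      if (index < input.length) input.charAt(index) else EOI
    }
  }

  // Counts how many characters are read before the cursor reports end-of-input.
  def charsBeforeEoi(input: String): Int = {
    val cursor = new Cursor(input)
    var count = 0
    while (cursor.advance() != EOI) count += 1
    count
  }
}
```

With this model, `SentinelSketch.charsBeforeEoi("a\uFFFFworld")` stops after the first character, because a raw U+FFFF in the data is indistinguishable from the marker, while `SentinelSketch.charsBeforeEoi("aworld")` reads all six characters. That matches the report, where the failure is flagged at the index of the U+FFFF character.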

plokhotnyuk · 2018-07-10T15:14:06Z

FYI:

"Because of this complicated history and confusing changes of wording in the standard over the years regarding what are now known as noncharacters, there is still considerable disagreement about their use and whether they should be considered "illegal" or "invalid" in various contexts. Particularly for implementations prior to Unicode 3.1, it should not be surprising to find legacy behavior treating U+FFFE and U+FFFF as invalid in Unicode 16-bit strings. And U+FFFF and U+10FFFF are, indeed, known to be used in various implementations as sentinels. For example, the value FFFF is used for WEOF in Windows implementations.

For up-to-date Unicode implementations, however, one should use caution when choosing sentinel values. U+FFFF and U+10FFFF still have interesting numerical properties which render them likely choices for internal use as sentinels, but implementers should be aware of the fact that those values, as for all noncharacters in the standard, are also valid in Unicode strings, must be converted between UTFs, and may be encountered in Unicode data—not necessarily used with the same interpretation as for one's own sentinel use. Just be careful out there!"

http://www.unicode.org/faq/private_use.html#sentinel6
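
The point that noncharacters "must be converted between UTFs" can be checked with the JDK alone. The sketch below (the object name `NoncharacterRoundTrip` is hypothetical) encodes a string containing U+FFFF to UTF-8 and decodes it back; the three bytes it produces for U+FFFF are the same `-17, -65, -65` visible in the `getBytes` output in the original report.

```scala
import java.nio.charset.StandardCharsets.UTF_8

// Hypothetical helper object, for illustration only.
object NoncharacterRoundTrip {
  val original: String = "a\uFFFFworld"

  // U+FFFF encodes to the UTF-8 bytes EF BF BF, i.e. -17, -65, -65 as signed bytes.
  val encoded: Array[Byte] = original.getBytes(UTF_8)

  // Decoding restores the same string, noncharacter included.
  val decoded: String = new String(encoded, UTF_8)

  val roundTrips: Boolean = decoded == original
}
```

So the JVM treats U+FFFF as ordinary string data; the restriction discussed in this thread comes from the parser's use of it as a sentinel, not from the string or its encoding.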

jrudolph · 2018-07-10T15:27:33Z

Thanks for the added information. There's also the paragraph directly before:

"Unicode 4.0 also added an entire new informative section about noncharacters, which recommended the use of U+FFFF and U+10FFFF "for internal purposes as sentinels." That new text also stated that "[noncharacters] are forbidden for use in open interchange of Unicode text data," a claim which was stronger than the formal definition. And it made a contrast between noncharacters and "valid character value[s]", implying that noncharacters were not valid. Of course, noncharacters could not be interpreted in open interchange, but the text in this section had not really caught up with the implications of the change of wording in the conformance requirements for UTFs. The text still echoed the sense of "invalid" associated with noncharacters in Unicode 3.0."

So, yes it's complicated but I also think it's arguably still a good enough solution right now. Let's reopen to add a note to the documentation that those code points are not supported by the parser.
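
Until the documentation note exists, callers may want to detect the unsupported code point before parsing. The following is a minimal sketch using only standard `String` operations; the `EoiGuard` object and its method names are hypothetical. It can either report the presence of a raw U+FFFF or rewrite it as the six-character JSON escape sequence. Whether the parser accepts that escaped form is an assumption that this thread does not confirm, and the rewrite only yields valid JSON when the character sits inside a string literal.

```scala
// Hypothetical pre-parse guard, for illustration only.
object EoiGuard {
  // The code point the parser reserves as its end-of-input marker.
  private val Sentinel: Char = '\uFFFF'

  // Six-character JSON escape: a backslash followed by "uFFFF".
  // Built by concatenation so the source never contains a literal unicode escape here.
  private val JsonEscape: String = "\\" + "uFFFF"

  // True when the text contains a raw U+FFFF character.
  def containsSentinel(json: String): Boolean =
    json.indexOf(Sentinel.toInt) >= 0

  // Replaces every raw U+FFFF with its JSON escape sequence.
  def escapeSentinel(json: String): String =
    json.replace(Sentinel.toString, JsonEscape)
}
```

A caller could check `EoiGuard.containsSentinel(a)` and fail with a clearer message, or pass `EoiGuard.escapeSentinel(a)` to the parser instead of stripping the character with `replaceAll` as in the original report.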

jrudolph closed this as completed Jul 10, 2018

jrudolph reopened this Jul 10, 2018
