Skip to content
Closed
68 changes: 67 additions & 1 deletion json/src/ser.rs
Original file line number Diff line number Diff line change
Expand Up @@ -494,8 +494,12 @@ pub fn escape_bytes<W>(wr: &mut W, bytes: &[u8]) -> Result<()>
try!(wr.write_all(b"\""));

let mut start = 0;
let mut last_byte = 0u8;

for (i, byte) in bytes.iter().enumerate() {
let last_byte_was_c2 = last_byte == b'\xC2';
last_byte = *byte;

let escaped = match *byte {
b'"' => b"\\\"",
b'\\' => b"\\\\",
Expand All @@ -504,6 +508,28 @@ pub fn escape_bytes<W>(wr: &mut W, bytes: &[u8]) -> Result<()>
b'\n' => b"\\n",
b'\r' => b"\\r",
b'\t' => b"\\t",
b'\x00' ... b'\x1F' | b'\x7F' => {
if start < i {
try!(wr.write_all(&bytes[start..i]));
}

try!(write!(wr,"\\u{:04X}", *byte));

start = i + 1;

continue;
},
b'\x80' ... b'\x9F' if last_byte_was_c2 => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think instead of trying to implement half of a UTF8 decoder here, it would be better to have escape_str not implemented in terms of escape_bytes but instead use .chars() to walk through the UTF8 string.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I absolutely agree and was about to open an issue to question the soundness of exposing a method which promise to encode binary content even when it is not supported by the standard.

Removing this method and only implementing escape_str() will provide these benefits:

  • Doesn't give the false hope that the library can magically escape binary string despite the standard not allowing it. (Trying to do it anyway will result in UB)
  • Guarantying that the generated JSON will be valid and parseable by other valid JSON implementations
  • Simplifying the implementation which would be able to use .chars() like said above

In general I feel it to be more Rusty to enforce a sensible interface sticking to what the standard specifies and then to guaranty valid JSON generation.

I'll post new patches using .chars().

if start < (i - 1) {
try!(wr.write_all(&bytes[start..(i - 1)]));
}

try!(write!(wr,"\\u{:04X}", *byte));

start = i + 1;

continue;
},
_ => { continue; }
};

Expand All @@ -529,7 +555,47 @@ pub fn escape_bytes<W>(wr: &mut W, bytes: &[u8]) -> Result<()>
pub fn escape_str<W>(wr: &mut W, value: &str) -> Result<()>
where W: io::Write
{
escape_bytes(wr, value.as_bytes())
let mut start = 0;
Copy link
Member

@oli-obk oli-obk May 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment mentioning why this doesn't simply forward to escape_bytes

Edit: nevermind, it should go away anyway...


try!(wr.write_all(b"\""));

for (i, char) in value.char_indices() {
let escaped = match char {
'"' => b"\\\"",
'\\' => b"\\\\",
'\u{08}' => b"\\b",
'\u{0c}' => b"\\f",
'\n' => b"\\n",
'\r' => b"\\r",
'\t' => b"\\t",
'\u{00}' ... '\u{1F}' | '\u{7F}' | '\u{80}' ... '\u{9F}' => {
// only Unicode C0 control characters ('\u{00}' ... '\u{1F}') are mandated to be escaped by ECMA-404.
// DEL ('\u{7F}') and C1 ('\u{80}' ... '\u{9F}') control characters are also escaped for convenience.

debug_assert_eq!(char.len_utf16(), 1); // C0, DEL and C1 control characters fit on one utf16 code unit by specification.
try!(write!(wr, "{}\\u{:04X}", &value[start..i], char as u32));

start = i + char.len_utf8();
continue;
},
_ => continue
};

if start < i {
try!(wr.write_all(&value[start..i].as_bytes()));
Copy link
Member

@oli-obk oli-obk May 1, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be before the match above? Otherwise you loose all chars before a control sequence. Also add a test for such situations.

Or rather, repeat it inside the control sequence arm, so it doesn't write every char by its own

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic if based on the one used in escape_bytes().

I've also added tons of new tests to validate this new code: commit

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I understand what you missed.
This line does what in only one call to write!() what the two calls to write_all!() do below the match.

try!(write!(wr, "{}\u{:04X}", &value[start..i], char as u32));

In the control character code block, I merged the first write_all!() into the formated write!() because it seems to me to be more efficient in this special case.

}
try!(wr.write_all(escaped));

debug_assert_eq!(char.len_utf8(), 1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test for non control sequence utf8 chars in a string. The previous code handled those I think?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the new tests pretty much cover everything ;-)
commit

start = i + 1;
}

if start != value.len() {
try!(wr.write_all(&value[start..].as_bytes()));
}

try!(wr.write_all(b"\""));
Ok(())
}

#[inline]
Expand Down
Loading