-
Notifications
You must be signed in to change notification settings - Fork 622
Correctly escape ASCII control characters in strings #58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
00212bc
dd47b9d
8013546
659e99d
83a8775
026f49b
f523b41
6413201
32eb21e
632a555
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -494,8 +494,12 @@ pub fn escape_bytes<W>(wr: &mut W, bytes: &[u8]) -> Result<()> | |
| try!(wr.write_all(b"\"")); | ||
|
|
||
| let mut start = 0; | ||
| let mut last_byte = 0u8; | ||
|
|
||
| for (i, byte) in bytes.iter().enumerate() { | ||
| let last_byte_was_c2 = last_byte == b'\xC2'; | ||
| last_byte = *byte; | ||
|
|
||
| let escaped = match *byte { | ||
| b'"' => b"\\\"", | ||
| b'\\' => b"\\\\", | ||
|
|
@@ -504,6 +508,28 @@ pub fn escape_bytes<W>(wr: &mut W, bytes: &[u8]) -> Result<()> | |
| b'\n' => b"\\n", | ||
| b'\r' => b"\\r", | ||
| b'\t' => b"\\t", | ||
| b'\x00' ... b'\x1F' | b'\x7F' => { | ||
| if start < i { | ||
| try!(wr.write_all(&bytes[start..i])); | ||
| } | ||
|
|
||
| try!(write!(wr,"\\u{:04X}", *byte)); | ||
|
|
||
| start = i + 1; | ||
|
|
||
| continue; | ||
| }, | ||
| b'\x80' ... b'\x9F' if last_byte_was_c2 => { | ||
| if start < (i - 1) { | ||
| try!(wr.write_all(&bytes[start..(i - 1)])); | ||
| } | ||
|
|
||
| try!(write!(wr,"\\u{:04X}", *byte)); | ||
|
|
||
| start = i + 1; | ||
|
|
||
| continue; | ||
| }, | ||
| _ => { continue; } | ||
| }; | ||
|
|
||
|
|
@@ -529,7 +555,47 @@ pub fn escape_bytes<W>(wr: &mut W, bytes: &[u8]) -> Result<()> | |
| pub fn escape_str<W>(wr: &mut W, value: &str) -> Result<()> | ||
| where W: io::Write | ||
| { | ||
| escape_bytes(wr, value.as_bytes()) | ||
| let mut start = 0; | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add a comment mentioning why this doesn't simply forward to escape_bytes Edit: nevermind, it should go away anyway... |
||
|
|
||
| try!(wr.write_all(b"\"")); | ||
|
|
||
| for (i, char) in value.char_indices() { | ||
| let escaped = match char { | ||
| '"' => b"\\\"", | ||
| '\\' => b"\\\\", | ||
| '\u{08}' => b"\\b", | ||
| '\u{0c}' => b"\\f", | ||
| '\n' => b"\\n", | ||
| '\r' => b"\\r", | ||
| '\t' => b"\\t", | ||
| '\u{00}' ... '\u{1F}' | '\u{7F}' | '\u{80}' ... '\u{9F}' => { | ||
| // only Unicode C0 control characters ('\u{00}' ... '\u{1F}') are mandated to be escaped by ECMA-404. | ||
| // DEL ('\u{7F}') and C1 ('\u{80}' ... '\u{9F}') control characters are also escaped for convenience. | ||
|
|
||
| debug_assert_eq!(char.len_utf16(), 1); // C0, DEL and C1 control characters fit on one utf16 code unit by specification. | ||
| try!(write!(wr, "{}\\u{:04X}", &value[start..i], char as u32)); | ||
|
|
||
| start = i + char.len_utf8(); | ||
| continue; | ||
| }, | ||
| _ => continue | ||
| }; | ||
|
|
||
| if start < i { | ||
| try!(wr.write_all(&value[start..i].as_bytes())); | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shouldn't this be before the match above? Otherwise you loose all chars before a control sequence. Also add a test for such situations. Or rather, repeat it inside the control sequence arm, so it doesn't write every char by its own
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This logic if based on the one used in I've also added tons of new tests to validate this new code: commit
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think I understand what you missed.
In the control character code block, I merged the first write_all!() into the formated write!() because it seems to me to be more efficient in this special case. |
||
| } | ||
| try!(wr.write_all(escaped)); | ||
|
|
||
| debug_assert_eq!(char.len_utf8(), 1); | ||
|
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add a test for non control sequence utf8 chars in a string. The previous code handled those I think?
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the new tests pretty much cover everything ;-) |
||
| start = i + 1; | ||
| } | ||
|
|
||
| if start != value.len() { | ||
| try!(wr.write_all(&value[start..].as_bytes())); | ||
| } | ||
|
|
||
| try!(wr.write_all(b"\"")); | ||
| Ok(()) | ||
| } | ||
|
|
||
| #[inline] | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of trying to implement half of a UTF8 decoder here, it would be better to have
escape_strnot implemented in terms ofescape_bytesbut instead use.chars()to walk through the UTF8 string.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I absolutely agree and was about to open an issue to question the soundness of exposing a method which promise to encode binary content even when it is not supported by the standard.
Removing this method and only implementing escape_str() will provide these benefits:
In general I feel it to be more Rusty to enforce a sensible interface sticking to what the standard specifies and then to guaranty valid JSON generation.
I'll post new patches using .chars().