-
Notifications
You must be signed in to change notification settings - Fork 622
Correctly escape ASCII control characters in strings #58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
The travis build fails on rust nightly but it doesn't seem to be related to this PR changes. |
|
7F and 80-9F are also Unicode control characters. |
json/src/ser.rs
Outdated
| try!(wr.write_all(&bytes[start..i])); | ||
| } | ||
|
|
||
| try!(wr.write_fmt(format_args!("\\u{:04X}", c))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rustdoc for write_fmt says not to use it, use write! instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Corrected
|
It's true that theses codepoints are considered as control characters by Unicode itself, but they are not explicitly mentioned in the JSON specification: However it would be absolutely legal and probably a good idea to escape them anyway. Ruby does escape them for instance. I'll add them. |
Verify the escaping of: - DEL: Ox7F - C1 list: 0x80-0x9F
|
|
||
| continue; | ||
| }, | ||
| b'\x80' ... b'\x9F' if last_byte_was_c2 => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think instead of trying to implement half of a UTF8 decoder here, it would be better to have escape_str not implemented in terms of escape_bytes but instead use .chars() to walk through the UTF8 string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I absolutely agree and was about to open an issue to question the soundness of exposing a method which promise to encode binary content even when it is not supported by the standard.
Removing this method and only implementing escape_str() will provide these benefits:
- Doesn't give the false hope that the library can magically escape binary string despite the standard not allowing it. (Trying to do it anyway will result in UB)
- Guarantying that the generated JSON will be valid and parseable by other valid JSON implementations
- Simplifying the implementation which would be able to use .chars() like said above
In general I feel it to be more Rusty to enforce a sensible interface sticking to what the standard specifies and then to guaranty valid JSON generation.
I'll post new patches using .chars().
|
Now that It will however break the API so I don't know if I should proceed. Maybe we could reimplement |
This new implementation does escape Unicode C0, DEL and C1 control characters. It also use its own logic and does not rely on ser::escape_bytes(). Escaping C0 control characters is mandated by ECMA-404. Escaping DEL and C1 control characters is a useful convenience often done by other JSON implementations.
json/src/ser.rs
Outdated
| start = i + char.len_utf8(); | ||
| continue; | ||
| }, | ||
| _ => { continue; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: no block, no extra spaces
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand, could you be more specific?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well the indentation made sense with the first patterns, but here a _ => continue, is enough
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
|
The escape_bytes codepaths aren't tested anymore, since they arent called in the string conversion. Add specific tests? Edit: nevermind, they can't be reached by serialization anyway... |
| }; | ||
|
|
||
| if start < i { | ||
| try!(wr.write_all(&value[start..i].as_bytes())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be before the match above? Otherwise you loose all chars before a control sequence. Also add a test for such situations.
Or rather, repeat it inside the control sequence arm, so it doesn't write every char by its own
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic if based on the one used in escape_bytes().
I've also added tons of new tests to validate this new code: commit
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I understand what you missed.
This line does what in only one call to write!() what the two calls to write_all!() do below the match.
try!(write!(wr, "{}\u{:04X}", &value[start..i], char as u32));
In the control character code block, I merged the first write_all!() into the formated write!() because it seems to me to be more efficient in this special case.
|
Wonderful. |
|
I also ran the benchmarks and we can see that some tests took a noticeable performance hit:
Here are the most noticeable: However this new code better reuse UTF-8 code from the stdlib and, as the PR says, correctly escapes control characters. My CPU: i7-3517U |
|
And BTW, the "newly rewritten but soon to be removed" code of |
|
Thanks for calling out the performance difference but I am not concerned about that. When we have two correct implementations that we need to decide between, we can turn to performance to guide us, but here we are talking about an incorrect implementation vs a correct implementation so performance is secondary. @oli-obk I think your feedback has been addressed and this is ready to merge. |
This patch escapes ASCII control characters in the range 0x00...0x1f, in accordance with the JSON spec. Fixes serde-rs#58
This PR fixes #51.
I also took the liberty to add more tests in both string serialization and deserialization since it's very easy to miss some subtle parts of the standard.
The code for escaping control characters is definitely not very elegant but the other solutions I found were either using a vec (with the associated heap allocation and bound checking), having more code duplication or having worst control flow.
I someone suggest a better solution, I'd be happy to amend this PR.