Correctly escape ASCII control characters in strings #58

PaulGrandperrin · 2016-04-25T23:49:21Z

This PR fixes #51.

I also took the liberty of adding more tests for both string serialization and deserialization, since it's very easy to miss some subtle parts of the standard.

The code for escaping control characters is definitely not very elegant, but the other solutions I found either used a Vec (with the associated heap allocation and bounds checking), had more code duplication, or had worse control flow.

If someone suggests a better solution, I'd be happy to amend this PR.

PaulGrandperrin · 2016-04-25T23:58:57Z

The Travis build fails on Rust nightly, but that doesn't seem to be related to this PR's changes.

dtolnay · 2016-04-26T00:51:14Z

7F and 80-9F are also Unicode control characters.

dtolnay · 2016-04-26T01:03:59Z

json/src/ser.rs

+                    try!(wr.write_all(&bytes[start..i]));
+                }
+
+                try!(wr.write_fmt(format_args!("\\u{:04X}", c)));


The rustdoc for write_fmt says not to use it, use write! instead.
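For illustration, a minimal sketch of the suggested form. The helper name write_unicode_escape is hypothetical and not code from this PR, and it uses the ? operator where the 2016 code used try!:

```rust
use std::io::{self, Write};

/// Writes `byte` as a JSON `\uXXXX` escape.
/// (`write_unicode_escape` is a hypothetical helper, not part of the PR.)
fn write_unicode_escape<W: Write>(wr: &mut W, byte: u8) -> io::Result<()> {
    // `write!` builds the `format_args!` value and calls `write_fmt` for us,
    // so there is no need to call `write_fmt` by hand.
    write!(wr, "\\u{:04X}", byte)
}
```

Calling it with 0x1F on a Vec<u8> writer should produce the six bytes of \u001F.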

PaulGrandperrin · 2016-04-26T07:43:09Z

It's true that these code points are considered control characters by Unicode itself, but they are not explicitly mentioned in the JSON specification:
http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf

However it would be absolutely legal and probably a good idea to escape them anyway.

Ruby does escape them for instance.

I'll add them.

Verify the escaping of: - DEL: 0x7F - C1 list: 0x80-0x9F
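As a rough sketch of the three ranges under discussion, here is a hypothetical predicate (is_json_control is an illustrative name, not a function from this crate). It is written with the current ..= range pattern rather than the ... form used in 2016:

```rust
/// Hypothetical predicate: true for the code points discussed in this thread.
fn is_json_control(c: char) -> bool {
    match c {
        '\u{00}'..='\u{1F}' => true, // C0: escaping is required by ECMA-404
        '\u{7F}' => true,            // DEL: optional, but legal to escape
        '\u{80}'..='\u{9F}' => true, // C1: optional, but legal to escape
        _ => false,
    }
}
```

Taken together, these ranges are the Unicode general category Cc, so the predicate should agree with the standard library's char::is_control for every char.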

dtolnay · 2016-04-28T17:28:18Z

json/src/ser.rs

+
+                continue;
+            },
+            b'\x80' ... b'\x9F' if last_byte_was_c2 => {


I think instead of trying to implement half of a UTF8 decoder here, it would be better to have escape_str not implemented in terms of escape_bytes but instead use .chars() to walk through the UTF8 string.

I absolutely agree, and I was about to open an issue to question the soundness of exposing a method which promises to encode binary content even though the standard does not support it.

Removing this method and only implementing escape_str() will provide these benefits:

It doesn't give the false hope that the library can magically escape binary strings despite the standard not allowing it. (Trying to do it anyway will result in UB.)

Guaranteeing that the generated JSON will be valid and parseable by other valid JSON implementations.

Simplifying the implementation, which would be able to use .chars() as suggested above.

In general, I feel it is more Rusty to enforce a sensible interface that sticks to what the standard specifies and thus to guarantee valid JSON generation.

I'll post new patches using .chars().
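To make the "half of a UTF8 decoder" point concrete, here is a small sketch with two hypothetical helpers (neither is from this PR). A C1 control character is encoded in UTF-8 as the byte 0xC2 followed by 0x80-0x9F, but the same trailing byte also appears in unrelated characters, so a byte-level scanner has to track the preceding byte while a char-level walk does not:

```rust
/// Returns the UTF-8 bytes of a single `char` (hypothetical helper).
fn utf8_bytes(c: char) -> Vec<u8> {
    let mut buf = [0u8; 4];
    c.encode_utf8(&mut buf).as_bytes().to_vec()
}

/// Counts control characters by walking decoded `char`s, so no
/// "was the last byte 0xC2?" bookkeeping is needed (hypothetical helper).
fn count_controls(s: &str) -> usize {
    s.chars().filter(|c| c.is_control()).count()
}
```

For example, U+0085 encodes as 0xC2 0x85 while U+00C5 encodes as 0xC3 0x85: both end in 0x85, but only the first is a control character.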

PaulGrandperrin · 2016-04-30T20:57:17Z

Now that ser::escape_str() has its own implementation, I think we can remove ser::escape_bytes() and replace all references to it.

It will however break the API so I don't know if I should proceed.

Maybe we could reimplement ser::escape_bytes() using ser::escape_str() with String::from_utf8 or String::from_utf8_lossy or String::from_utf8_unchecked
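A hedged sketch of how the safe options differ on invalid input. The two wrapper names are hypothetical, and std::str::from_utf8 stands in here as the borrowed counterpart of String::from_utf8:

```rust
use std::borrow::Cow;

/// Strict conversion: `None` when `bytes` is not valid UTF-8
/// (hypothetical wrapper around `std::str::from_utf8`).
fn bytes_to_str_strict(bytes: &[u8]) -> Option<&str> {
    std::str::from_utf8(bytes).ok()
}

/// Lossy conversion: each invalid sequence becomes U+FFFD
/// (hypothetical wrapper around `String::from_utf8_lossy`).
fn bytes_to_str_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}
```

The third option, String::from_utf8_unchecked, is an unsafe function that skips validation entirely, so passing it invalid bytes is undefined behavior. That makes it a poor fit for a method that accepts arbitrary bytes.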

This new implementation escapes Unicode C0, DEL and C1 control characters. It also uses its own logic and does not rely on ser::escape_bytes(). Escaping C0 control characters is mandated by ECMA-404. Escaping DEL and C1 control characters is a useful convenience often provided by other JSON implementations.

dtolnay · 2016-05-01T06:59:04Z

Nicely done. Thanks for seeing this through. I filed #60 to get rid of escape_bytes in the next breaking release.

@oli-obk or @erickt please take a look.

oli-obk · 2016-05-01T07:06:54Z

json/src/ser.rs

+                start = i + char.len_utf8();
+                continue;
+            },
+            _        => { continue; }


Nit: no block, no extra spaces

I don't understand, could you be more specific?

Well, the indentation made sense for the first patterns, but here a plain _ => continue, is enough

oli-obk · 2016-05-01T07:10:55Z

The escape_bytes codepaths aren't tested anymore, since they aren't called in the string conversion. Add specific tests?

Edit: nevermind, they can't be reached by serialization anyway...

oli-obk · 2016-05-01T07:20:40Z

json/src/ser.rs

+        };
+
+        if start < i {
+            try!(wr.write_all(&value[start..i].as_bytes()));


Shouldn't this be before the match above? Otherwise you lose all chars before a control sequence. Also add a test for such situations.

Or rather, repeat it inside the control sequence arm, so it doesn't write every char by its own

This logic is based on the one used in escape_bytes().

I've also added tons of new tests to validate this new code: commit

I think I understand what you missed.
This line does, in a single call to write!(), what the two calls to write_all() below the match do.

try!(write!(wr, "{}\\u{:04X}", &value[start..i], char as u32));

In the control character code block, I merged the first write_all() into the formatted write!() because it seems to me to be more efficient in this special case.
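The flushing logic discussed in this thread can be sketched roughly as follows. This is a reconstruction under assumptions, not the PR's exact code: it uses char_indices() and the ? operator, and the set of short escapes is the standard JSON one.

```rust
use std::io::{self, Write};

/// Sketch of a `char`-based JSON string escaper (not the PR's exact code).
fn escape_str<W: Write>(wr: &mut W, value: &str) -> io::Result<()> {
    wr.write_all(b"\"")?;
    let mut start = 0; // byte offset of the first char not yet written
    for (i, c) in value.char_indices() {
        let short = match c {
            '"' => "\\\"",
            '\\' => "\\\\",
            '\u{08}' => "\\b",
            '\u{0C}' => "\\f",
            '\n' => "\\n",
            '\r' => "\\r",
            '\t' => "\\t",
            _ if c.is_control() => {
                // One formatted write flushes the pending run of plain
                // chars and then emits the \uXXXX escape.
                write!(wr, "{}\\u{:04X}", &value[start..i], c as u32)?;
                start = i + c.len_utf8();
                continue;
            }
            _ => continue, // plain char: leave it in the pending run
        };
        // Flush the pending run, then the two-character escape.
        wr.write_all(value[start..i].as_bytes())?;
        wr.write_all(short.as_bytes())?;
        start = i + c.len_utf8();
    }
    // Flush whatever follows the last escaped character.
    wr.write_all(value[start..].as_bytes())?;
    wr.write_all(b"\"")
}
```

Because the pending run is flushed inside every escaping arm, the characters before a control character are not lost, and plain characters are still written in whole slices rather than one by one.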

oli-obk · 2016-05-01T12:33:11Z

Wonderful. ~~I don't understand the algorithm, but~~ got it now. the tests lgtm

PaulGrandperrin · 2016-05-01T15:24:26Z

I also ran the benchmarks and we can see that some tests took a noticeable performance hit:

master (7bc9b0a)

test bench_log::bench_copy                                      ... bench:          35 ns/iter (+/- 2) = 17285 MB/s
test bench_log::bench_decoder                                   ... bench:      18,073 ns/iter (+/- 1,326) = 33 MB/s
test bench_log::bench_deserializer                              ... bench:       5,493 ns/iter (+/- 137) = 110 MB/s
test bench_log::bench_encoder                                   ... bench:       3,336 ns/iter (+/- 56) = 181 MB/s
test bench_log::bench_manual_serialize_my_mem_writer0_escape    ... bench:       3,252 ns/iter (+/- 399) = 186 MB/s
test bench_log::bench_manual_serialize_my_mem_writer0_no_escape ... bench:       2,003 ns/iter (+/- 50) = 302 MB/s
test bench_log::bench_manual_serialize_my_mem_writer1_escape    ... bench:       1,934 ns/iter (+/- 78) = 312 MB/s
test bench_log::bench_manual_serialize_my_mem_writer1_no_escape ... bench:       1,124 ns/iter (+/- 163) = 538 MB/s
test bench_log::bench_manual_serialize_vec_escape               ... bench:       2,052 ns/iter (+/- 163) = 294 MB/s
test bench_log::bench_manual_serialize_vec_no_escape            ... bench:       1,315 ns/iter (+/- 36) = 460 MB/s
test bench_log::bench_serializer                                ... bench:       2,605 ns/iter (+/- 112) = 232 MB/s
test bench_log::bench_serializer_my_mem_writer0                 ... bench:       3,613 ns/iter (+/- 689) = 167 MB/s
test bench_log::bench_serializer_my_mem_writer1                 ... bench:       2,190 ns/iter (+/- 35) = 276 MB/s
test bench_log::bench_serializer_slice                          ... bench:       3,069 ns/iter (+/- 1,791) = 197 MB/s
test bench_log::bench_serializer_vec                            ... bench:       2,369 ns/iter (+/- 352) = 255 MB/s

this PR (632a555)

test bench_log::bench_copy                                      ... bench:          35 ns/iter (+/- 2) = 17285 MB/s
test bench_log::bench_decoder                                   ... bench:      18,576 ns/iter (+/- 170) = 32 MB/s
test bench_log::bench_deserializer                              ... bench:       5,594 ns/iter (+/- 905) = 108 MB/s
test bench_log::bench_encoder                                   ... bench:       3,214 ns/iter (+/- 174) = 188 MB/s
test bench_log::bench_manual_serialize_my_mem_writer0_escape    ... bench:       3,692 ns/iter (+/- 310) = 163 MB/s
test bench_log::bench_manual_serialize_my_mem_writer0_no_escape ... bench:       1,998 ns/iter (+/- 196) = 302 MB/s
test bench_log::bench_manual_serialize_my_mem_writer1_escape    ... bench:       2,580 ns/iter (+/- 37) = 234 MB/s
test bench_log::bench_manual_serialize_my_mem_writer1_no_escape ... bench:       1,103 ns/iter (+/- 39) = 548 MB/s
test bench_log::bench_manual_serialize_vec_escape               ... bench:       2,698 ns/iter (+/- 199) = 224 MB/s
test bench_log::bench_manual_serialize_vec_no_escape            ... bench:       1,286 ns/iter (+/- 117) = 470 MB/s
test bench_log::bench_serializer                                ... bench:       3,188 ns/iter (+/- 67) = 189 MB/s
test bench_log::bench_serializer_my_mem_writer0                 ... bench:       4,138 ns/iter (+/- 330) = 146 MB/s
test bench_log::bench_serializer_my_mem_writer1                 ... bench:       2,707 ns/iter (+/- 55) = 223 MB/s
test bench_log::bench_serializer_slice                          ... bench:       3,786 ns/iter (+/- 2,035) = 159 MB/s
test bench_log::bench_serializer_vec                            ... bench:       2,914 ns/iter (+/- 74) = 207 MB/s

Here are the most noticeable:
manual_serialize_my_mem_writer1_escape : 75% of original speed
manual_serialize_vec_escape: 76% of original speed
serializer: 81% of original speed
serializer_my_mem_writer1: 81% of original speed
serializer_slice: 81% of original speed
serializer_vec: 81% of original speed
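As a worked check of these figures, assuming the percentages are ratios of the reported MB/s throughput (the comment does not say which column was used), a tiny hypothetical helper reproduces them:

```rust
/// New throughput as a rounded percentage of the old throughput
/// (hypothetical helper; inputs are the MB/s columns above).
fn relative_speed(old_mb_s: f64, new_mb_s: f64) -> f64 {
    (new_mb_s / old_mb_s * 100.0).round()
}
```

For instance, manual_serialize_my_mem_writer1_escape went from 312 MB/s to 234 MB/s, which is 75%, and serializer went from 232 MB/s to 189 MB/s, which rounds to 81%.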

However, this new code makes better use of the UTF-8 code in the stdlib and, as the PR says, correctly escapes control characters.

My CPU: i7-3517U

PaulGrandperrin · 2016-05-01T15:39:40Z

And BTW, the "newly rewritten but soon to be removed" code of escape_bytes() also passes all the hundreds of new tests and has almost exactly the same performance profile as the new escape_str().

dtolnay · 2016-05-01T16:14:09Z

Thanks for calling out the performance difference but I am not concerned about that. When we have two correct implementations that we need to decide between, we can turn to performance to guide us, but here we are talking about an incorrect implementation vs a correct implementation so performance is secondary.

@oli-obk I think your feedback has been addressed and this is ready to merge.

oli-obk · 2016-05-01T17:03:14Z

> @oli-obk I think your feedback has been addressed and this is ready to merge.

As I said on the issue, I want to clear up the rules on merging with @erickt first.

But this looks good to merge to me

This patch escapes ASCII control characters in the range 0x00...0x1f, in accordance with the JSON spec. Fixes serde-rs#58

dtolnay reviewed Apr 26, 2016

PaulGrandperrin added 3 commits April 26, 2016 09:35

Correctly escape control characters in strings

00212bc

Add more string serialization tests

dd47b9d

Add more string deserialization tests

8013546

PaulGrandperrin force-pushed the master branch from 6ae244d to 8013546 Compare April 26, 2016 07:35

PaulGrandperrin added 3 commits April 26, 2016 18:58

Escape ASCII DEL control character in strings

659e99d

Escape Unicode C1 control characters in strings

83a8775

Add more tests on string control character escaping

026f49b

Verify the escaping of: - DEL: 0x7F - C1 list: 0x80-0x9F

PaulGrandperrin force-pushed the master branch from 4a0aeaa to 026f49b Compare April 26, 2016 16:59

Fix Unicode C1 control character escaping

f523b41

dtolnay reviewed Apr 28, 2016

PaulGrandperrin force-pushed the master branch from 516a1e7 to a342eb4 Compare April 30, 2016 20:59

PaulGrandperrin force-pushed the master branch from a342eb4 to 6413201 Compare April 30, 2016 21:18

dtolnay mentioned this pull request May 1, 2016

Remove escape_bytes in 0.8.0 #60

Closed

oli-obk reviewed May 1, 2016

Add tons of new tests in test_write_str()

32eb21e

Minor coding style fix in ser::escape_str()

632a555

raphlinus mentioned this pull request May 4, 2016

Need to escape ASCII control characters #51

Closed

raphlinus added a commit to raphlinus/json that referenced this pull request May 4, 2016

Correctly escape ASCII control characters in strings

55cc618

This patch escapes ASCII control characters in the range 0x00...0x1f, in accordance with the JSON spec. Fixes serde-rs#58

raphlinus mentioned this pull request May 4, 2016

Correctly escape ASCII control characters in strings #65

Closed

PaulGrandperrin mentioned this pull request May 5, 2016

WIP: Allow different string escaping strategies #66

Closed

anowell mentioned this pull request May 10, 2016

stdout/stderr control characters aren't properly escaped algorithmiaio/langpacks#5

Closed

npgm mentioned this pull request Jun 25, 2016

Correctly escape ASCII control characters in strings #85

Merged

dtolnay closed this Jun 25, 2016
