
Conversation

@PaulGrandperrin commented Apr 25, 2016

This PR fixes #51.

I also took the liberty to add more tests in both string serialization and deserialization since it's very easy to miss some subtle parts of the standard.

The code for escaping control characters is definitely not very elegant, but the other solutions I found either used a Vec (with the associated heap allocation and bounds checking), had more code duplication, or had worse control flow.

If someone suggests a better solution, I'd be happy to amend this PR.

@PaulGrandperrin (Author)

The Travis build fails on Rust nightly, but it doesn't seem to be related to this PR's changes.

@dtolnay (Member) commented Apr 26, 2016

7F and 80-9F are also Unicode control characters.

json/src/ser.rs Outdated
try!(wr.write_all(&bytes[start..i]));
}

try!(wr.write_fmt(format_args!("\\u{:04X}", c)));
Member:

The rustdoc for write_fmt says not to use it, use write! instead.
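
In other words, the change amounts to the following (same behaviour; the write! macro forwards to write_fmt via format_args! under the hood, variable names taken from the quoted diff):

// before: calling write_fmt directly
try!(wr.write_fmt(format_args!("\\u{:04X}", c)));
// after: let the write! macro do the forwarding
try!(write!(wr, "\\u{:04X}", c));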

Author:

Corrected

@PaulGrandperrin (Author)

It's true that these codepoints are considered control characters by Unicode itself, but they are not explicitly mentioned in the JSON specification:
http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf

However, it would be absolutely legal and probably a good idea to escape them anyway.

Ruby, for instance, does escape them.

I'll add them.


continue;
},
b'\x80' ... b'\x9F' if last_byte_was_c2 => {
Member:

I think instead of trying to implement half of a UTF-8 decoder here, it would be better to have escape_str not implemented in terms of escape_bytes, but instead use .chars() to walk through the UTF-8 string.

Author:

I absolutely agree, and I was about to open an issue to question the soundness of exposing a method which promises to encode binary content even though that is not supported by the standard.

Removing this method and only implementing escape_str() will provide these benefits:

  • Doesn't give the false hope that the library can magically escape binary strings despite the standard not allowing it. (Trying to do so anyway will result in UB.)
  • Guaranteeing that the generated JSON will be valid and parseable by other conforming JSON implementations
  • Simplifying the implementation, which can use .chars() as suggested above

In general I feel it is more Rusty to enforce a sensible interface that sticks to what the standard specifies, and thereby to guarantee valid JSON generation.

I'll post new patches using .chars().
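
For illustration, a minimal, self-contained sketch of the .chars()/char_indices() approach under discussion, written with today's ? operator instead of try!; the signature and details are assumptions, not the PR's actual json/src/ser.rs code, which differs in particulars such as the error type and additional shorthand escapes:

use std::io::{self, Write};

fn escape_str<W: Write>(wr: &mut W, value: &str) -> io::Result<()> {
    // Opening quote.
    wr.write_all(b"\"")?;

    let mut start = 0;
    for (i, ch) in value.char_indices() {
        let escaped = match ch {
            '"' => "\\\"",
            '\\' => "\\\\",
            '\n' => "\\n",
            '\r' => "\\r",
            '\t' => "\\t",
            // C0 controls (mandatory per ECMA-404), plus DEL and C1 as a convenience.
            '\u{0}'..='\u{1F}' | '\u{7F}'..='\u{9F}' => {
                // Flush the pending unescaped run and the \uXXXX escape in one call.
                write!(wr, "{}\\u{:04X}", &value[start..i], ch as u32)?;
                start = i + ch.len_utf8();
                continue;
            }
            _ => continue,
        };
        wr.write_all(value[start..i].as_bytes())?;
        wr.write_all(escaped.as_bytes())?;
        start = i + ch.len_utf8();
    }

    // Flush the tail, then the closing quote.
    wr.write_all(value[start..].as_bytes())?;
    wr.write_all(b"\"")
}

Tracking start and flushing contiguous unescaped runs in a single write keeps the common no-escape case cheap.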

@PaulGrandperrin (Author)

Now that ser::escape_str() has its own implementation, I think we can remove ser::escape_bytes() and replace all references to it.

It would, however, break the API, so I don't know if I should proceed.

Maybe we could reimplement ser::escape_bytes() using ser::escape_str() together with String::from_utf8, String::from_utf8_lossy, or String::from_utf8_unchecked.
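
For example, a purely hypothetical shim along these lines (reusing the escape_str signature sketched earlier; String::from_utf8_lossy substitutes U+FFFD for invalid UTF-8 rather than emitting invalid JSON, whereas from_utf8_unchecked would silently trust the caller):

// Hypothetical compatibility shim, not part of this PR.
fn escape_bytes<W: Write>(wr: &mut W, bytes: &[u8]) -> io::Result<()> {
    let s = String::from_utf8_lossy(bytes);
    escape_str(wr, &s)
}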

This new implementation does escape Unicode C0, DEL and C1 control characters.
It also uses its own logic and does not rely on ser::escape_bytes().

Escaping C0 control characters is mandated by ECMA-404.
Escaping DEL and C1 control characters is a useful convenience often
done by other JSON implementations.
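
Concretely, the behaviour described here can be pinned down with assertions like these (illustrative only; they use the escape_str sketch shown earlier and its {:04X} uppercase-hex formatting):

let mut out = Vec::new();
escape_str(&mut out, "\u{001F}\u{007F}\u{0080}").unwrap();
// 0x1F (C0) is mandatory per ECMA-404; 0x7F (DEL) and 0x80 (C1) are escaped as a convenience.
assert_eq!(out, br#""\u001F\u007F\u0080""#.to_vec());
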
@dtolnay (Member) commented May 1, 2016

Nicely done. Thanks for seeing this through. I filed #60 to get rid of escape_bytes in the next breaking release.

@oli-obk or @erickt please take a look.

json/src/ser.rs Outdated
start = i + char.len_utf8();
continue;
},
_ => { continue; }
Member:

Nit: no block, no extra spaces

Author:

I don't understand, could you be more specific?

Member:

Well, the indentation made sense with the first patterns, but here a plain _ => continue, is enough.

Author:

Done

@oli-obk (Member) commented May 1, 2016

The escape_bytes codepaths aren't tested anymore, since they aren't called in the string conversion. Add specific tests?

Edit: nevermind, they can't be reached by serialization anyway...

};

if start < i {
try!(wr.write_all(&value[start..i].as_bytes()));
@oli-obk (Member), May 1, 2016:

Shouldn't this be before the match above? Otherwise you lose all chars before a control sequence. Also add a test for such situations.

Or rather, repeat it inside the control-sequence arm, so it doesn't write every char on its own.

Author:

This logic is based on the one used in escape_bytes().

I've also added tons of new tests to validate this new code: commit

Author:

I think I understand what you missed.
This line does in only one call to write!() what the two calls to write_all() do below the match:

try!(write!(wr, "{}\\u{:04X}", &value[start..i], char as u32));

In the control-character code block, I merged the first write_all() into the formatted write!() because it seems to me to be more efficient in this special case.
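
Spelled out, the equivalence being described is roughly this (variable names as in the quoted diff):

// two separate writes: flush the pending unescaped run, then the escape
try!(wr.write_all(value[start..i].as_bytes()));
try!(write!(wr, "\\u{:04X}", char as u32));
// merged into a single formatted write inside the control-character arm
try!(write!(wr, "{}\\u{:04X}", &value[start..i], char as u32));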

@oli-obk (Member) commented May 1, 2016

Wonderful. I didn't understand the algorithm at first, but I get it now. The tests LGTM.

@PaulGrandperrin (Author)

I also ran the benchmarks and we can see that some tests took a noticeable performance hit:

Before (old implementation):

test bench_log::bench_copy                                      ... bench:          35 ns/iter (+/- 2) = 17285 MB/s
test bench_log::bench_decoder                                   ... bench:      18,073 ns/iter (+/- 1,326) = 33 MB/s
test bench_log::bench_deserializer                              ... bench:       5,493 ns/iter (+/- 137) = 110 MB/s
test bench_log::bench_encoder                                   ... bench:       3,336 ns/iter (+/- 56) = 181 MB/s
test bench_log::bench_manual_serialize_my_mem_writer0_escape    ... bench:       3,252 ns/iter (+/- 399) = 186 MB/s
test bench_log::bench_manual_serialize_my_mem_writer0_no_escape ... bench:       2,003 ns/iter (+/- 50) = 302 MB/s
test bench_log::bench_manual_serialize_my_mem_writer1_escape    ... bench:       1,934 ns/iter (+/- 78) = 312 MB/s
test bench_log::bench_manual_serialize_my_mem_writer1_no_escape ... bench:       1,124 ns/iter (+/- 163) = 538 MB/s
test bench_log::bench_manual_serialize_vec_escape               ... bench:       2,052 ns/iter (+/- 163) = 294 MB/s
test bench_log::bench_manual_serialize_vec_no_escape            ... bench:       1,315 ns/iter (+/- 36) = 460 MB/s
test bench_log::bench_serializer                                ... bench:       2,605 ns/iter (+/- 112) = 232 MB/s
test bench_log::bench_serializer_my_mem_writer0                 ... bench:       3,613 ns/iter (+/- 689) = 167 MB/s
test bench_log::bench_serializer_my_mem_writer1                 ... bench:       2,190 ns/iter (+/- 35) = 276 MB/s
test bench_log::bench_serializer_slice                          ... bench:       3,069 ns/iter (+/- 1,791) = 197 MB/s
test bench_log::bench_serializer_vec                            ... bench:       2,369 ns/iter (+/- 352) = 255 MB/s

After (this PR):

test bench_log::bench_copy                                      ... bench:          35 ns/iter (+/- 2) = 17285 MB/s
test bench_log::bench_decoder                                   ... bench:      18,576 ns/iter (+/- 170) = 32 MB/s
test bench_log::bench_deserializer                              ... bench:       5,594 ns/iter (+/- 905) = 108 MB/s
test bench_log::bench_encoder                                   ... bench:       3,214 ns/iter (+/- 174) = 188 MB/s
test bench_log::bench_manual_serialize_my_mem_writer0_escape    ... bench:       3,692 ns/iter (+/- 310) = 163 MB/s
test bench_log::bench_manual_serialize_my_mem_writer0_no_escape ... bench:       1,998 ns/iter (+/- 196) = 302 MB/s
test bench_log::bench_manual_serialize_my_mem_writer1_escape    ... bench:       2,580 ns/iter (+/- 37) = 234 MB/s
test bench_log::bench_manual_serialize_my_mem_writer1_no_escape ... bench:       1,103 ns/iter (+/- 39) = 548 MB/s
test bench_log::bench_manual_serialize_vec_escape               ... bench:       2,698 ns/iter (+/- 199) = 224 MB/s
test bench_log::bench_manual_serialize_vec_no_escape            ... bench:       1,286 ns/iter (+/- 117) = 470 MB/s
test bench_log::bench_serializer                                ... bench:       3,188 ns/iter (+/- 67) = 189 MB/s
test bench_log::bench_serializer_my_mem_writer0                 ... bench:       4,138 ns/iter (+/- 330) = 146 MB/s
test bench_log::bench_serializer_my_mem_writer1                 ... bench:       2,707 ns/iter (+/- 55) = 223 MB/s
test bench_log::bench_serializer_slice                          ... bench:       3,786 ns/iter (+/- 2,035) = 159 MB/s
test bench_log::bench_serializer_vec                            ... bench:       2,914 ns/iter (+/- 74) = 207 MB/s

Here are the most noticeable regressions:
manual_serialize_my_mem_writer1_escape : 75% of original speed
manual_serialize_vec_escape: 76% of original speed
serializer: 81% of original speed
serializer_my_mem_writer1: 81% of original speed
serializer_slice: 81% of original speed
serializer_vec: 81% of original speed

However, this new code better reuses the UTF-8 handling from the standard library and, as the PR description says, correctly escapes control characters.

My CPU: i7-3517U

@PaulGrandperrin (Author)

And BTW, the "newly rewritten but soon to be removed" code of escape_bytes() also passes all the hundreds of new tests and has almost exactly the same performance profile as the new escape_str().

@dtolnay (Member) commented May 1, 2016

Thanks for calling out the performance difference, but I am not concerned about that. When we have two correct implementations to decide between, we can turn to performance to guide us; here, though, we are talking about an incorrect implementation vs. a correct one, so performance is secondary.

@oli-obk I think your feedback has been addressed and this is ready to merge.

@oli-obk (Member) commented May 1, 2016

> @oli-obk I think your feedback has been addressed and this is ready to merge.

As I said on the issue, I want to clear up the rules on merging with @erickt first.

But this looks good to merge to me.

raphlinus added a commit to raphlinus/json that referenced this pull request May 4, 2016
This patch escapes ASCII control characters in the range 0x00...0x1f, in accordance with the JSON spec.

Fixes serde-rs#58
@dtolnay closed this Jun 25, 2016