-
-
Notifications
You must be signed in to change notification settings - Fork 798
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write 4-byte characters (surrogate pairs) instead of escapes #1335
Conversation
|
The change that you have made to PackageVersion.java.in breaks the build - please revert it. |
4c25457
to
5eabe2e
Compare
@pjfanning Thanks for your review. I've added the feature. Not sure if StreamWriteFeature is the right place... |
src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java
Outdated
Show resolved
Hide resolved
Seems ok to me. Do you know if other Generator implementations handle surrogates ok? Eg WriterBasedJsonGenerator. |
@rnetuka @cowtowncoder It looks like the Windows CI build cannot handle UTF-8 chars in source files. |
969c112
to
8a197e7
Compare
@pjfanning WriterBasedJsonGenerator works fine and writes the surrogate pairs automatically. As for UTF-8 chars in source files, do you mean the ones in the new test? If so, I can just replace them with escapes. |
The Windows build doesn't like any of your non-ASCII chars. Could you have escape all the non-ASCII chars? |
Overall looks good; as long as overhead is localized into handling of 4-byte cases it should be fine (performance would be my main concern in general). |
src/test/java/com/fasterxml/jackson/core/json/StringGenerationTest.java
Outdated
Show resolved
Hide resolved
src/test/java/com/fasterxml/jackson/core/json/StringGenerationTest.java
Outdated
Show resolved
Hide resolved
@rnetuka Forgot to say: big thank you for contributing this!!! I'll try to make sure this gets in 2.18.0 release; as I said this looks pretty good already. |
8a197e7
to
bd3e8db
Compare
@cowtowncoder Based on you review, I've updated the PR with following changes:
I've renamed the new feature to COMBINE_UNICODE_SURROGATES which I believe will better describe its purpose. I've also put the feature only to JsonGenerator (previously I put it also in StreamWriteFeature, but that would make it kind of duplicate). Thank you for you review. If you have any more comments, just let me know. |
bd3e8db
to
8f0de85
Compare
@rnetuka Just realized I don't have a CLA from you. However, I do have corporate-CLA from Red Hat -- and it looks like you are employed by Red Hat. If this is the case, I think you are covered and we are good. Also: I see failed unit tests -- is this PR still work-in-progress? |
8f0de85
to
6fad16d
Compare
@cowtowncoder Yes, I'm Red Hat employee. As for the test, I've noticed Surrogate223Test#surrogatesCharBacked is written for an option to escape the surrogate pairs for StringWriter/WriterBasedJsonGenerator. WriterBasedJsonGenerator always combines the surrogate pairs together, so I've removed the latter part of the test. Please, take a look and confirm this is ok for you. To sum it up, if merged at this point: Unless you request other changes, I consider this PR done. |
private int _outputSurrogatePair(char highSurrogate, char lowSurrogate, int outputPtr) { | ||
String s = String.valueOf(highSurrogate) + lowSurrogate; | ||
byte[] bytes = s.getBytes(StandardCharsets.UTF_8); | ||
System.arraycopy(bytes, 0, _outputBuffer, outputPtr, bytes.length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yikes. This is rather wasteful way of doing this... Let me think we really shouldn't have do too this much allocation of things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would
private int _outputSurrogatePair(char highSurrogate, char lowSurrogate, int outputPtr) {
int unicode = 0x10000;
unicode += (highSurrogate & 0x03FF) << 10;
unicode += (lowSurrogate & 0x03FF);
byte[] utf8 = new byte[4];
utf8[0] = (byte) (0b11110000 + ((unicode & 0b00000000_00011100_00000000_00000000) >> 18));
utf8[1] = (byte) (0b10000000 + ((unicode & 0b00000000_00000011_11110000_00000000) >> 12));
utf8[2] = (byte) (0b10000000 + ((unicode & 0b00000000_00000000_00001111_11000000) >> 6));
utf8[3] = (byte) (0b10000000 + (unicode & 0b00000000_00000000_00000000_00111111));
System.arraycopy(utf8, 0, _outputBuffer, outputPtr, utf8.length);
return outputPtr + utf8.length;
}
be better?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. :)
Can also eliminate temporary buffer, assign directly in _outputBuffer
.
Thank you for working with me here!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copied with minor modifications; also realized unit test was not verifying actual round-trip encode/decode, added. Now catches some of problems I tried.
(ideally we'd have more tests for surrogate pairs but will have to do for now).
Will merge.
Thank you for contributing this, @rnetuka ! Will merge in 2.18 for inclusion in 2.18.0 release. |
Issue: #223
A fix for issue #223. Some languages (Japanese, Chinese, ...) use 4-byte characters and it's annoying to have them marshaled as escapes.
I've tried to fix this with minimum intrusions. I've only added this feature/fix to _writeStringSegment2 method(s) and left the rest without it. If you like the changes for other methods calling _outputMultiByteChar, just let me know.