Fix Unicode handling in Java bindings #533
Merged
Summary:
This PR fixes how the Java bindings deal with Unicode.
Java's Unicode handling methods in JNI (e.g. `GetStringUTFChars`, `NewStringUTF`) deal with "modified UTF-8" encoding. According to most descriptions I found online, UTF-8 and modified UTF-8 are quite similar, with one of the main differences being how they deal with embedded null bytes: modified UTF-8 encodes the null character (U+0000) as the two bytes 0xC0 0x80 instead of a single zero byte.
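Unfortunately, this does not mean that UTF-8 and modified UTF-8 are interoperable in cases without embedded null bytes. In fact, the two encodings can produce different byte sequences with different lengths for the same string: modified UTF-8 encodes a supplementary character (above U+FFFF) as two three-byte surrogate halves, whereas standard UTF-8 uses a single four-byte sequence.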
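To make the length mismatch concrete, here is a minimal sketch that encodes a single code point both ways. The function names `encode_utf8` and `encode_modified_utf8` are hypothetical and are not part of the bindings or of JNI; the sketch assumes valid code points (at most U+10FFFF) and does no validation.

```cpp
#include <string>

// Standard UTF-8 encoding of one code point.
// Assumption: cp <= 0x10FFFF. Surrogate values are not rejected, which lets
// the modified encoder below reuse the three-byte branch.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Modified UTF-8 encoding of one code point, following the two rules that
// differ from standard UTF-8.
std::string encode_modified_utf8(char32_t cp) {
    if (cp == 0) {
        // U+0000 becomes two bytes so the result never contains a zero byte.
        return std::string("\xC0\x80", 2);
    }
    if (cp < 0x10000) {
        return encode_utf8(cp);  // identical to standard UTF-8 in this range
    }
    // Supplementary characters: split into a UTF-16 surrogate pair and
    // encode each half as its own three-byte sequence (six bytes total).
    cp -= 0x10000;
    char32_t high = 0xD800 + (cp >> 10);
    char32_t low = 0xDC00 + (cp & 0x3FF);
    return encode_utf8(high) + encode_utf8(low);
}
```

For U+1F600, `encode_utf8` yields four bytes while `encode_modified_utf8` yields six, so a decoder expecting standard UTF-8 will not accept the JNI output for that character.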
This causes problems when doing something like this:
because the JNI string methods in the Java bindings will encode strings in "modified UTF-8" and NumPy on the Python side will try to decode them as UTF-8.
To fix this issue, I changed the Java bindings to explicitly do a UTF-8 conversion in C++.
Java internally represents strings in UTF-16, so the bindings now get that representation and explicitly convert it to UTF-8 (and do the reverse when going from C++ to Java).
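The Java-to-C++ direction can be sketched roughly as follows. This is an illustration of the technique, not the code from this PR: `utf16_to_utf8` and `append_utf8` are hypothetical names, the input is a plain `std::u16string` so the example runs without a JVM, and replacing unpaired surrogates with U+FFFD is an assumption about error handling. In real bindings, the UTF-16 code units could come from the JNI functions `GetStringChars` and `GetStringLength`.

```cpp
#include <cstddef>
#include <string>

// Append the standard UTF-8 encoding of one code point to `out`.
static void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Convert UTF-16 code units to standard UTF-8.
// Assumption: unpaired surrogates are replaced with U+FFFD.
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: combine with the following low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;  // the low surrogate has been consumed
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate without a preceding high surrogate
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Because surrogate pairs are combined before encoding, a supplementary character comes out as a single four-byte sequence that a standard UTF-8 decoder accepts. The C++-to-Java direction would decode UTF-8 into UTF-16 code units in the mirror-image way.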
Test Plan: