Fix Unicode handling in Java bindings #533

VivekPanyam · 2022-02-19T02:23:27Z

Summary:

This PR fixes how the Java bindings deal with Unicode.

Java's Unicode handling methods in JNI (e.g. GetStringUTFChars, NewStringUTF) deal with "modified UTF-8" encoding.

According to most descriptions I found online, UTF-8 and modified UTF-8 are quite similar with one of the main differences being how they deal with embedded null bytes:

Unfortunately, this does not mean that UTF-8 and modified UTF-8 are interoperable in cases without embedded null bytes. In fact, the encodings can be quite different and have different lengths.

This causes problems when doing something like this:

Java -> Neuropod Java bindings -> C++ -> Neuropod Python Model

because the JNI string methods in the Java bindings will encode strings in "modified UTF-8" and Numpy on the python side will try to decode them as UTF-8.

To fix this issue, I changed the Java bindings to explicitly do a UTF-8 conversion in C++.

Java internally represents strings in UTF-16 so the bindings now get that representation and explicitly convert to UTF-8 (and do the reverse when going from C++ to Java).

Test Plan:

Changed the Java string model tests to include Unicode characters
Added a test to run a Python Neuropod model from Java (failed before implementing the fix, succeeded after the fix)

codecov · 2022-02-19T03:51:13Z

Codecov Report

Merging #533 (cde3f5c) into master (c4ed8c3) will increase coverage by 0.00%.
The diff coverage is 90.47%.

@@           Coverage Diff           @@
##           master     #533   +/-   ##
=======================================
  Coverage   86.58%   86.58%           
=======================================
  Files         104      104           
  Lines        6940     6955   +15     
=======================================
+ Hits         6009     6022   +13     
- Misses        931      933    +2

Impacted Files	Coverage Δ
...rce/neuropod/bindings/java/src/main/native/utils.h	`100.00% <ø> (ø)`
...thon_bridge/_neuropod_native_bootstrap/executor.py	`90.14% <50.00%> (-2.51%)`	⬇️
...rc/main/native/com_uber_neuropod_NeuropodTensor.cc	`72.26% <100.00%> (ø)`
...ce/neuropod/bindings/java/src/main/native/utils.cc	`72.30% <100.00%> (+6.26%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c4ed8c3...cde3f5c. Read the comment docs.

Fix Unicode handling in Java bindings

cde3f5c

VivekPanyam mentioned this pull request Feb 19, 2022

Support for newer versions of Torchscript #530

Open

voe09 approved these changes Feb 22, 2022

View reviewed changes

VivekPanyam merged commit 2c45e86 into master Feb 23, 2022

VivekPanyam deleted the fix_java_unicode branch February 23, 2022 00:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Unicode handling in Java bindings #533

Fix Unicode handling in Java bindings #533

VivekPanyam commented Feb 19, 2022

codecov bot commented Feb 19, 2022 •

edited

Loading

Fix Unicode handling in Java bindings #533

Fix Unicode handling in Java bindings #533

Conversation

VivekPanyam commented Feb 19, 2022

Summary:

Test Plan:

codecov bot commented Feb 19, 2022 • edited Loading

Codecov Report

codecov bot commented Feb 19, 2022 •

edited

Loading