Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Unicode handling in Java bindings #533

Merged
merged 1 commit into from
Feb 23, 2022
Merged

Conversation

VivekPanyam
Copy link
Collaborator

Summary:

This PR fixes how the Java bindings deal with Unicode.

Java's Unicode handling methods in JNI (e.g. GetStringUTFChars, NewStringUTF) deal with "modified UTF-8" encoding.

According to most descriptions I found online, UTF-8 and modified UTF-8 are quite similar with one of the main differences being how they deal with embedded null bytes:

Unfortunately, this does not mean that UTF-8 and modified UTF-8 are interoperable in cases without embedded null bytes. In fact, the encodings can be quite different and have different lengths.

This causes problems when doing something like this:

Java -> Neuropod Java bindings -> C++ -> Neuropod Python Model

because the JNI string methods in the Java bindings will encode strings in "modified UTF-8" and Numpy on the python side will try to decode them as UTF-8.

To fix this issue, I changed the Java bindings to explicitly do a UTF-8 conversion in C++.

Java internally represents strings in UTF-16 so the bindings now get that representation and explicitly convert to UTF-8 (and do the reverse when going from C++ to Java).

Test Plan:

  • Changed the Java string model tests to include Unicode characters
  • Added a test to run a Python Neuropod model from Java (failed before implementing the fix, succeeded after the fix)

@codecov
Copy link

codecov bot commented Feb 19, 2022

Codecov Report

Merging #533 (cde3f5c) into master (c4ed8c3) will increase coverage by 0.00%.
The diff coverage is 90.47%.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #533   +/-   ##
=======================================
  Coverage   86.58%   86.58%           
=======================================
  Files         104      104           
  Lines        6940     6955   +15     
=======================================
+ Hits         6009     6022   +13     
- Misses        931      933    +2     
Impacted Files Coverage Δ
...rce/neuropod/bindings/java/src/main/native/utils.h 100.00% <ø> (ø)
...thon_bridge/_neuropod_native_bootstrap/executor.py 90.14% <50.00%> (-2.51%) ⬇️
...rc/main/native/com_uber_neuropod_NeuropodTensor.cc 72.26% <100.00%> (ø)
...ce/neuropod/bindings/java/src/main/native/utils.cc 72.30% <100.00%> (+6.26%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c4ed8c3...cde3f5c. Read the comment docs.

@VivekPanyam VivekPanyam merged commit 2c45e86 into master Feb 23, 2022
@VivekPanyam VivekPanyam deleted the fix_java_unicode branch February 23, 2022 00:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants