utf-8 surrogate lossy conversion inconsistency #56786
Labels: T-libs-api (Relevant to the library API team, which will review and decide on the PR/issue.)
Status: Closed

estebank added the T-libs-api label on Dec 13, 2018
jnqnfe added a commit to jnqnfe/gong that referenced this issue on Dec 14, 2018:

On Windows, `std` lossily converts an unpaired surrogate encoding into a single Unicode replacement character, whereas on Unix, and also via `String`'s raw byte conversion, three Unicode replacement characters are used. This comes from the fact that different code paths are used which take different approaches: `std::sys_common::wtf8::to_string_lossy` for Windows `OsStr` lossy conversion, vs. `core`'s `run_utf8_validation` function for Unix and for creating a `String` from raw bytes. The inconsistency causes a problem with correct short option argument byte consumption tracking in our `OsStr` parsing code, due to using lossy `OsStr` conversion in combination with `std::str::from_utf8`. A [bug report][1] and [pull request][2] have been filed against `std`.

[1]: rust-lang/rust#56786
[2]: rust-lang/rust#56787
jnqnfe added a commit to jnqnfe/rust that referenced this issue on Dec 14, 2018:

With this commit, lossy UTF-8 conversion of OsStr/OsString on Windows will output three Unicode replacement characters (U+FFFD), one per byte, for surrogate byte sequences, instead of just one, making it consistent with lossy conversion on Unix and with the lossy conversion of raw byte sequences. fixes rust-lang#56786
Reconsidering this, perhaps I was assessing it too much from the "internal" UTF-8 representation point of view. Although there is an inconsistency, it does not really matter, and changing it would make things less efficient and introduce a breaking change. I've now addressed things much more simply in my library. I'm going to close this as not worth changing.
Lossy conversion of "unpaired surrogate" code points (U+D800 to U+DFFF) is inconsistent, resulting in three Unicode replacement characters on Unix but only one on Windows.
Examples
Let's take code point U+D800 as an example.
Raw byte array:
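To make this concrete, here is a minimal sketch of the raw-byte case (the exact array used in the original test is not shown here; U+D800 encodes as `0xED 0xA0 0x80` in WTF-8/CESU-8):

```rust
fn main() {
    // WTF-8/CESU-8 encoding of the unpaired surrogate U+D800.
    let bytes = [0xEDu8, 0xA0, 0x80];

    // std::str::from_utf8 rejects it with error_len of Some(1): only
    // the leading byte is attributed to the invalid sequence, so lossy
    // conversion emits one U+FFFD per byte.
    let err = std::str::from_utf8(&bytes).unwrap_err();
    assert_eq!(err.valid_up_to(), 0);
    assert_eq!(err.error_len(), Some(1));

    // Hence three replacement characters result.
    assert_eq!(String::from_utf8_lossy(&bytes), "\u{FFFD}\u{FFFD}\u{FFFD}");
    println!("{}", String::from_utf8_lossy(&bytes));
}
```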
This results in three because `core`'s `run_utf8_validation` function returns a `Utf8Error` with an `error_len` of `Some(1)`, due to the byte sequence being outside of a valid range per the match block. The two "continuation" bytes are then assessed individually, each also resulting in the same.

Unix OsStr:
This goes through the same code paths as above.
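A sketch of the same thing via `OsStr` (this assumes a Unix platform, where `OsStrExt` exposes the underlying bytes directly):

```rust
// Unix-only: an OsStr is arbitrary raw bytes here, constructible
// from a byte slice via the OsStrExt extension trait.
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;

fn main() {
    // The WTF-8/CESU-8 encoding of U+D800 again.
    let os = OsStr::from_bytes(&[0xED, 0xA0, 0x80]);
    // Lossy conversion goes through the same run_utf8_validation
    // path as String::from_utf8_lossy, yielding three U+FFFD.
    assert_eq!(os.to_string_lossy(), "\u{FFFD}\u{FFFD}\u{FFFD}");
}
```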
Windows OsStr:

This goes through different code paths; it uses `std::sys_common::wtf8` code. Specifically, `Wtf8::to_string_lossy` is of interest, where it explicitly replaces the surrogate sequences with single Unicode replacement characters.

One reason a single replacement character may have been chosen is efficiency of replacement in lossy conversion: both the sequence to be replaced and the replacement character are three bytes, permitting an in-place replacement in the `self`-consuming `Wtf8Buf::into_string_lossy` implementation.

Background
Or how I ended up here...
I am working on v2.0 of my command line argument parsing library, `gong`. One of the new features is `OsStr` based parsing.

My updated test suite is failing on Windows with a short option set involving byte sequences such as the above. I have determined that this is due to this inconsistency, and to my solution combining lossy conversion with use of `std::str::from_utf8`.

Fyi: for `OsStr` based parsing, I lossily convert to `str`, use the `str` based parser, then convert the resulting "items", extracting portions of the original `OsStr` for data values. (Thus there is a one-to-one mapping between parser "items", e.g. known/unknown short option character, from the `str` parser result to the `OsStr` parser results.) For short option sets, for correct extraction of in-same-arg data values, the number of bytes consumed from the original `OsStr` argument must be tracked, which requires discovering how many bytes a replacement character came from in the lossy conversion. For this I used `std::str::from_utf8`, since the Windows `OsStr` is just UTF-8 with some extra permissible code points. However, in this test the wrong string slice gets taken for the data value, because this inconsistency causes the byte consumption tracking to go wrong.

Note that my solution does not just stop and print an error on encountering a problem like an unknown short option; it returns a detailed analysis to the caller for them to act on.
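The byte-tracking approach described above can be sketched roughly as follows (a simplified illustration, not the actual `gong` code; the helper name `replacement_len` is hypothetical):

```rust
// Hypothetical helper: given the remaining raw bytes at the point where
// lossy conversion emitted a U+FFFD, report how many source bytes that
// replacement character consumed, using from_utf8's error_len.
fn replacement_len(remaining: &[u8]) -> usize {
    match std::str::from_utf8(remaining) {
        // error_len is the length in bytes of the invalid sequence
        // (None only when the input ends with an incomplete sequence,
        // which lossy conversion replaces with a single U+FFFD).
        Err(e) if e.valid_up_to() == 0 => e.error_len().unwrap_or(remaining.len()),
        _ => 0, // no invalid sequence starts at this position
    }
}

fn main() {
    // WTF-8 encoding of U+D800: on the Unix / raw-bytes path each byte
    // becomes its own U+FFFD, so each replacement maps back to 1 byte.
    assert_eq!(replacement_len(&[0xED, 0xA0, 0x80]), 1);
    // A truncated 3-byte sequence at end of input: error_len is None,
    // so the whole remainder is attributed to one U+FFFD.
    assert_eq!(replacement_len(&[0xE2, 0x82]), 2);
}
```

On Windows's `Wtf8::to_string_lossy` path, however, the whole three-byte surrogate sequence becomes a single U+FFFD, which is what throws a per-byte count like this off.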
Until such time as this gets fixed in `core`/`std`, I don't think there's any other good option for my library but to duplicate and modify a chunk of the relevant code to give a consistent count, or to implement my own fixed Windows `OsStr` lossy converter :/

Edit: the latter is what I have done. You can see all the hacks necessary for `OsStr` support in the temporary 'temp' branch I pushed to check compilation of the feature on Windows.