utf-8 surrogate lossy conversion inconsistency

Lossy conversion of "unpaired surrogate" code points (U+D800 to U+DFFF) is inconsistent: it produces three Unicode replacement characters on Unix, but only one on Windows.

# Examples

Let's take code point U+D800 as an example.

Raw byte array:

```rust
// U+D800 written out in the three-byte UTF-8 pattern (not valid UTF-8).
let bytes = [ 0xed, 0xa0, 0x80 ];
let string = String::from_utf8_lossy(&bytes[..]);

assert_eq!(string, "���");
```

This yields three replacement characters because core's `run_utf8_validation` function returns a `Utf8Error` with an `error_len` of `Some(1)`: the second byte (`0xa0`) is outside the range permitted after `0xed` per the match block. The two "continuation" bytes are then assessed individually, and each is rejected in the same way.
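The byte-by-byte rejection described above can be observed through the public `Utf8Error` API. The following is a minimal sketch, not the code `core` actually runs: `invalid_chunk_lengths` is a hypothetical helper that repeatedly calls `std::str::from_utf8` and records how many bytes each reported error covers.

```rust
use std::str;

/// Hypothetical helper: walks `bytes` and returns the length of every
/// invalid chunk that `str::from_utf8` reports along the way.
fn invalid_chunk_lengths(mut bytes: &[u8]) -> Vec<usize> {
    let mut lengths = Vec::new();
    while !bytes.is_empty() {
        match str::from_utf8(bytes) {
            // The remainder is valid UTF-8, so there is nothing left to report.
            Ok(_) => break,
            Err(e) => {
                let start = e.valid_up_to();
                // `None` means the input ended mid-sequence; treat the
                // remaining bytes as a single invalid chunk.
                let len = e.error_len().unwrap_or(bytes.len() - start);
                lengths.push(len);
                bytes = &bytes[start + len..];
            }
        }
    }
    lengths
}
```

For the bytes `[0xed, 0xa0, 0x80]` this reports three chunks of one byte each, matching the three replacement characters in the example.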

Unix OsStr:

```rust
use std::ffi::OsStr;
use std::os::unix::ffi::OsStrExt;

let bytes = [ 0xed, 0xa0, 0x80 ];
let os_str = OsStr::from_bytes(&bytes[..]);

assert_eq!(os_str.to_string_lossy(), "���");
```

This goes through the same code paths as above.

Windows OsStr:

```rust
use std::ffi::OsString;
use std::os::windows::prelude::*;

// A single unpaired surrogate as one UTF-16 code unit.
let source = [ 0xD800 ];
let os_string = OsString::from_wide(&source[..]);
let os_str = os_string.as_os_str();

assert_eq!(os_str.to_string_lossy(), "�");
```

This goes through a different code path: the `std::sys_common::wtf8` code. Of particular interest is `Wtf8::to_string_lossy`, which explicitly replaces each surrogate sequence with a single Unicode replacement character.

One reason a single replacement character may have been chosen is efficiency: the surrogate sequence being replaced and the replacement character are both three bytes long, so the `self`-consuming `Wtf8Buf::into_string_lossy` implementation can do the replacement in place.
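To illustrate why equal lengths make an in-place replacement possible, here is a minimal sketch. It is not the `std` implementation: `replace_surrogates_in_place` is a hypothetical helper, and it assumes the input is otherwise well-formed UTF-8 apart from three-byte surrogate encodings (`0xED`, then `0xA0..=0xBF`, then a continuation byte).

```rust
/// Hypothetical helper: overwrites every three-byte surrogate encoding
/// with the three-byte encoding of U+FFFD, without changing the length.
/// Assumes `bytes` is well-formed apart from surrogate encodings, so
/// 0xED can only ever appear as the first byte of a sequence.
fn replace_surrogates_in_place(bytes: &mut [u8]) {
    // UTF-8 encoding of U+FFFD REPLACEMENT CHARACTER.
    const REPLACEMENT: [u8; 3] = [0xEF, 0xBF, 0xBD];
    let mut i = 0;
    while i + 2 < bytes.len() {
        if bytes[i] == 0xED && (0xA0..=0xBF).contains(&bytes[i + 1]) {
            // Same length in, same length out: no bytes need to shift.
            bytes[i..i + 3].copy_from_slice(&REPLACEMENT);
            i += 3;
        } else {
            i += 1;
        }
    }
}
```

Because nothing moves, the buffer can be reused as-is for the resulting `String`. Emitting three replacement characters per surrogate would instead grow the buffer from three bytes to nine for each one.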

# Background

Or how I ended up here...

I am working on v2.0 of my command line argument parsing library `gong`. One of the new features is `OsStr` based parsing.

My updated test suite is failing on Windows with a short option set involving byte sequences like the one above. I have determined that this is due to this inconsistency, combined with the fact that my solution mixes lossy conversion with use of `std::str::from_utf8`.

FYI: For `OsStr` based parsing, I lossily convert to `str`, use the `str` based parser, then convert the resulting "items", extracting portions of the original `OsStr` for data values. (Thus there is a one-to-one mapping between parser "items", e.g. known/unknown short option character, from the `str` parser results to the `OsStr` parser results.) For short option sets, correct extraction of in-same-arg data values requires tracking the number of bytes consumed from the original `OsStr` argument, which in turn requires discovering how many bytes each replacement character came from in the lossy conversion. For this I used `std::str::from_utf8`, since the Windows `OsStr` is just UTF-8 with some extra permissible code points. However, in this test the wrong string slice gets taken for the data value, because this inconsistency causes the byte consumption tracking to go wrong.

Note that my solution does not just stop and print an error on encountering a problem like an unknown short option; it returns a detailed analysis to the caller for them to take action on.
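The mismatch in byte tracking can be sketched as follows. This is only an illustration of the counting problem, not code from `gong` or `std`: `replaced_len` is a hypothetical helper, and its `wtf8` flag is an assumed way of selecting the Windows-style rule in which a surrogate encoding counts as one unit.

```rust
use std::str;

/// Hypothetical helper: number of source bytes behind the replacement
/// character that a lossy conversion would emit at the start of `bytes`.
/// Returns `None` if `bytes` does not start with an invalid sequence.
fn replaced_len(bytes: &[u8], wtf8: bool) -> Option<usize> {
    let err = str::from_utf8(bytes).err()?;
    if err.valid_up_to() != 0 {
        return None; // the input starts with valid text
    }
    // Three-byte surrogate encoding: 0xED, 0xA0..=0xBF, continuation byte.
    let is_surrogate = bytes.len() >= 3
        && bytes[0] == 0xED
        && (0xA0..=0xBF).contains(&bytes[1])
        && (0x80..=0xBF).contains(&bytes[2]);
    if wtf8 && is_surrogate {
        // Windows-style rule: the whole surrogate becomes one U+FFFD.
        Some(3)
    } else {
        // `from_utf8` rule: the error covers only the bytes it reports.
        Some(err.error_len().unwrap_or(bytes.len()))
    }
}
```

For `[0xed, 0xa0, 0x80, b'x']` the `from_utf8` rule reports one byte consumed, while the Windows-style rule reports three. A tracker that uses the former against a string produced by the latter ends up two bytes short per surrogate.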

Until such time as this gets fixed in `core`/`std`, I don't think there's any other good option for my library but to duplicate and modify a chunk of the relevant code to give a consistent count, or implement my own fixed Windows `OsStr` lossy converter :/

Edit: the latter is what I have done. You can see all the hacks necessary for `OsStr` support in the temporary 'temp' branch I pushed to check compilation of the feature on Windows.
