Tracking Issue for explicit-endian String::from_utf16 #116258

Open
1 of 6 tasks
CAD97 opened this issue Sep 29, 2023 · 2 comments
Labels
C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@CAD97
Contributor

CAD97 commented Sep 29, 2023

Feature gate: #![feature(str_from_utf16_endian)]

This is a tracking issue for versions of String::from_utf16 which take &[u8] and use a specific endianness.

Public API

impl String {
    fn from_utf16le(v: &[u8]) -> Result<String, FromUtf16Error>;
    fn from_utf16le_lossy(v: &[u8]) -> String;
    fn from_utf16be(v: &[u8]) -> Result<String, FromUtf16Error>;
    fn from_utf16be_lossy(v: &[u8]) -> String;
}
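
For readers without a nightly toolchain, the behaviour these signatures describe can be approximated on stable Rust. The following is only a sketch, not the standard library's implementation: `DecodeError` and the function names are hypothetical stand-ins (`FromUtf16Error` cannot be constructed outside std), and pushing U+FFFD for a trailing odd byte in the lossy variant is an assumption.

```rust
use std::char::decode_utf16;

/// Hypothetical stand-in for `FromUtf16Error`, which has no public constructor.
#[derive(Debug, PartialEq, Eq)]
pub struct DecodeError;

/// Turn a byte slice into 16-bit code units using the given byte order.
/// A trailing odd byte is silently ignored here; callers check the length.
fn code_units(v: &[u8], to_u16: fn([u8; 2]) -> u16) -> impl Iterator<Item = u16> + '_ {
    v.chunks_exact(2).map(move |pair| to_u16([pair[0], pair[1]]))
}

/// Strict decoding: fails on odd byte length or on any unpaired surrogate.
fn decode_strict(v: &[u8], to_u16: fn([u8; 2]) -> u16) -> Result<String, DecodeError> {
    if v.len() % 2 != 0 {
        return Err(DecodeError);
    }
    decode_utf16(code_units(v, to_u16))
        .collect::<Result<String, _>>()
        .map_err(|_| DecodeError)
}

/// Sketch of what `String::from_utf16le` is described as doing.
pub fn utf16le_to_string(v: &[u8]) -> Result<String, DecodeError> {
    decode_strict(v, u16::from_le_bytes)
}

/// Sketch of what `String::from_utf16be` is described as doing.
pub fn utf16be_to_string(v: &[u8]) -> Result<String, DecodeError> {
    decode_strict(v, u16::from_be_bytes)
}

/// Lossy little-endian decoding: unpaired surrogates become U+FFFD.
/// Assumption: a trailing odd byte also contributes one U+FFFD.
pub fn utf16le_to_string_lossy(v: &[u8]) -> String {
    let mut out: String = decode_utf16(code_units(v, u16::from_le_bytes))
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect();
    if v.len() % 2 != 0 {
        out.push(char::REPLACEMENT_CHARACTER);
    }
    out
}
```

With this sketch, `utf16le_to_string(&[0x68, 0x00, 0x69, 0x00])` yields `Ok("hi")`, while the same text in big-endian order is `[0x00, 0x68, 0x00, 0x69]`.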

Steps / History

Unresolved Questions

  • Ideal naming; options include from_utf16le, from_utf16_le, from_le_utf16, from_le_utf16_bytes, and other such combinations.
  • Should these methods use the with_capacity+push implementation that from_utf16 uses, for as long as collect doesn't reserve capacity up front? (See: Collecting into a Result<Vec<_>> doesn't reserve the capacity in advance #48994)
  • Tweaks to the error type: FromUtf16Error currently displays as "invalid utf-16: lone surrogate found" which isn't correct for an error due to odd byte length.
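
To make the second question concrete, here is a rough sketch of the two strategies side by side, written against stable Rust. The function names are hypothetical and `Option` stands in for the real error type; the reserved capacity of one byte per code unit is an assumption about a reasonable lower bound, not a statement about what std would choose.

```rust
use std::char::decode_utf16;

/// Strategy A: reserve up front, then push each decoded char.
/// Each UTF-16 code unit needs at least one byte of UTF-8, so
/// `v.len() / 2` is used as a starting capacity.
pub fn utf16le_with_reserve(v: &[u8]) -> Option<String> {
    if v.len() % 2 != 0 {
        return None;
    }
    let mut out = String::with_capacity(v.len() / 2);
    let units = v
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]));
    for decoded in decode_utf16(units) {
        // Bail out on the first unpaired surrogate.
        out.push(decoded.ok()?);
    }
    Some(out)
}

/// Strategy B: collect through `Result`, which cannot pass a useful
/// size hint through to the `String` being built.
pub fn utf16le_with_collect(v: &[u8]) -> Option<String> {
    if v.len() % 2 != 0 {
        return None;
    }
    let units = v
        .chunks_exact(2)
        .map(|pair| u16::from_le_bytes([pair[0], pair[1]]));
    decode_utf16(units).collect::<Result<String, _>>().ok()
}
```

Both functions return the same strings; the difference is only in how often the output buffer may need to grow while decoding.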

Footnotes

  1. https://std-dev-guide.rust-lang.org/feature-lifecycle/stabilization.html

@CAD97 CAD97 added T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC labels Sep 29, 2023
@CAD97 CAD97 changed the title Tracking Issue for endian specific String::from_utf16 Tracking Issue for explicit-endian String::from_utf16 Sep 29, 2023
@zachs18
Contributor

zachs18 commented Jan 19, 2024

Perhaps as an unresolved question: with these added, FromUtf16Error's Display impl is no longer always accurate; it says "invalid utf-16: lone surrogate found", but these functions introduce a new failure case: the &[u8] was of odd length. Making FromUtf16Error hold information about which kind of error occurred would require making it not a ZST anymore, which could degrade performance since currently Result<String, FromUtf16Error> is (non-guaranteed-ly) null-pointer-optimized to be the same size as String. (see below)

Alternately, they could return some new FromUtf16BytesError type which can represent both errors, so that String::from_utf16 can still return the null-pointer-optimized Result<String, FromUtf16Error>.

(Alternately, FromUtf16Error's Display impl could be updated to say something like "invalid utf-16: lone surrogate found, or odd length byte string passed".)
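
As an illustration of the separate-error-type option, a two-variant error could look roughly like the sketch below. Nothing here exists in std: `FromUtf16BytesError` is only the name floated above, and the variant names, the message for the odd-length case, and the `utf16be_to_string` helper are all assumptions made for the example.

```rust
use std::char::decode_utf16;
use std::fmt;

/// Hypothetical error type distinguishing the two failure cases.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FromUtf16BytesError {
    /// The byte slice had an odd length, so it cannot be whole code units.
    OddLength,
    /// A surrogate code unit appeared without its partner.
    LoneSurrogate,
}

impl fmt::Display for FromUtf16BytesError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str(match self {
            // Assumed wording for the new case.
            Self::OddLength => "invalid utf-16: odd number of bytes",
            // Wording quoted in this thread for the existing case.
            Self::LoneSurrogate => "invalid utf-16: lone surrogate found",
        })
    }
}

impl std::error::Error for FromUtf16BytesError {}

/// Hypothetical big-endian decoder that reports which failure occurred.
pub fn utf16be_to_string(v: &[u8]) -> Result<String, FromUtf16BytesError> {
    if v.len() % 2 != 0 {
        return Err(FromUtf16BytesError::OddLength);
    }
    let units = v
        .chunks_exact(2)
        .map(|pair| u16::from_be_bytes([pair[0], pair[1]]));
    decode_utf16(units)
        .collect::<Result<String, _>>()
        .map_err(|_| FromUtf16BytesError::LoneSurrogate)
}
```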

@CAD97
Contributor Author

CAD97 commented Jan 21, 2024

To note, Result<String, enum { L, R }> is still niched. The data pointer is null and the other 2×usize are available to carry the Err payload. The only performance hit would be constructing or inspecting the error payload.

But that said, I also think just rendering the error as invalid utf-16 would be sufficient. Adding a new variant to the existing enum is also fine, but I don't think making a new error type is particularly helpful.

An alternative would be to panic if given an odd-length slice, since that's trivial to precheck. But not a particularly good alternative.
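
The niche claim can be checked locally with `size_of`. The sketch below only reports the sizes; it deliberately asserts nothing about equality, because enum layout is not guaranteed and may differ between compiler versions and targets. `Zst` and `TwoKinds` are hypothetical stand-ins for a zero-sized error and a two-variant error.

```rust
use std::mem::size_of;

/// Zero-sized stand-in for the current shape of `FromUtf16Error`.
pub struct Zst;

/// Two-variant stand-in for an error that records which failure occurred.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TwoKinds {
    OddLength,
    LoneSurrogate,
}

/// Returns the sizes of `String`, `Result<String, Zst>`, and
/// `Result<String, TwoKinds>`, in that order.
pub fn sizes() -> (usize, usize, usize) {
    (
        size_of::<String>(),
        size_of::<Result<String, Zst>>(),
        size_of::<Result<String, TwoKinds>>(),
    )
}

/// Formats the three sizes for a quick comparison on the current toolchain.
pub fn report() -> String {
    let (string, with_zst, with_enum) = sizes();
    format!(
        "String: {string} bytes, Result<String, Zst>: {with_zst} bytes, \
         Result<String, TwoKinds>: {with_enum} bytes"
    )
}
```

If all three numbers match on a given toolchain, the two-variant payload fits alongside the niche as described; if they differ, that toolchain lays the type out differently.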


2 participants