Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add explicit-endian String::from_utf16 variants #95967

Merged
merged 4 commits into from
Oct 11, 2023
Merged

Conversation

CAD97
Copy link
Contributor

@CAD97 CAD97 commented Apr 12, 2022

This adds the following APIs under feature(str_from_utf16_endian):

impl String {
    pub fn from_utf16le(v: &[u8]) -> Result<String, FromUtf16Error>;
    pub fn from_utf16le_lossy(v: &[u8]) -> String;
    pub fn from_utf16be(v: &[u8]) -> Result<String, FromUtf16Error>;
    pub fn from_utf16be_lossy(v: &[u8]) -> String;
}

These are versions of String::from_utf16 that explicitly take UTF-16LE and UTF-16BE. Notably, we can do better than just the obvious decode_utf16(v.array_chunks::<2>().copied().map(u16::from_le_bytes)).collect() in that:

  • We handle the case where the byte slice is not an even number of bytes, and
  • In the case that the UTF-16 is native endian and the slice is aligned, we can forward to String::from_utf16.

If the Unicode Consortium actively defines how to handle character replacement when decoding a UTF-16 bytestream with a trailing odd byte, I was unable to find reference. However, the behavior implemented here is fairly self-evidently correct: replace the single errant byte with the replacement character.

@rust-highfive
Copy link
Collaborator

Thank you for submitting a new PR for the library teams! If this PR contains a stabilization of a library feature that has not already completed FCP in its tracking issue, introduces new or changes existing unstable library APIs, or changes our public documentation in ways that create new stability guarantees then please comment with r? rust-lang/libs-api @rustbot label +T-libs-api to request review from a libs-api team reviewer. If you're unsure where your change falls no worries, just leave it as is and the reviewer will take a look and make a decision to forward on if necessary

@rust-highfive
Copy link
Collaborator

r? @joshtriplett

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 12, 2022
@ChrisDenton ChrisDenton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label May 13, 2022
@ChrisDenton
Copy link
Member

Surely it would be more consistent with String::from_utf16 and u16::from_be to take a &[u16] slice? This would also sidestep the trailing byte issue.

@CAD97
Copy link
Contributor Author

CAD97 commented May 25, 2022

The point of this API surface is to take u8 units like u16::from_be_bytes. If you read a file, you get Vec<u8>, so in order to read a UTF-16 file, someone needs to decode the &[u8] into UTF-16. For the case where you already have &[u16], we have char::decode_utf16.

@ChrisDenton
Copy link
Member

Ah, I see, it's to deserialize bytes from say a file or other stream. That makes sense.

@CAD97
Copy link
Contributor Author

CAD97 commented Jun 22, 2022

(triaging my open PRs)

I've inlined this into the project that needed it, and it's been working fine there. (Though tbf I haven't exercised that codepath much.)

I think this is right on the edge of interesting enough to include in std; the forwarding to the aligned version when possible is beneficial but easy to overlook.

The one thing that maybe should be added is a note to use decode_utf16 if you already have native-endian &[u16], to avoid a &[16] -> &[u8] -> &[u16] loop just to call these functions (and to explain the lack of a ne variant). Along those lines, maybe these should be named with _bytes in the name to match the convention from u16 where [to|from]_xe works on just u16 but [to|from]_xe_bytes converts to [u8; 2].

I'm happy to do either, just S-waiting-on-review.

@JohnCSimon JohnCSimon added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 24, 2022
match (cfg!(target_endian = "little"), unsafe { v.align_to::<u16>() }) {
(true, (&[], v, &[])) => Self::from_utf16(v),
_ => decode_utf16(v.array_chunks::<2>().copied().map(u16::from_le_bytes))
.collect::<Result<_, _>>()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be better to use String::with_capacity() + String::push() instead of collect::<Result<_, _>>() until #48994 is fixed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing this would match the implementation in from_utf16. However, I expect these functions will be much less performance critical, so adding the extra code is perhaps unneeded? I'm not sure.

@JohnCSimon JohnCSimon added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 8, 2022
@pitaj
Copy link
Contributor

pitaj commented Apr 30, 2023

Ping from triage @joshtriplett. It's been 10 months since last activity.

@dtolnay dtolnay assigned dtolnay and unassigned joshtriplett Sep 28, 2023
Copy link
Member

@dtolnay dtolnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me. Please open a tracking issue.

maybe these should be named with _bytes in the name to match the convention from u16

I find the names currently in the PR to be all right and would be happy landing them as is. I'm sure we'll discuss the naming further as part of stabilization, though.

Thanks!

library/alloc/src/string.rs Outdated Show resolved Hide resolved
@rust-log-analyzer

This comment has been minimized.

@CAD97
Copy link
Contributor Author

CAD97 commented Sep 29, 2023

Tracking issue linked: #116258

@rust-log-analyzer

This comment has been minimized.

Copy link
Member

@dtolnay dtolnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

@dtolnay
Copy link
Member

dtolnay commented Oct 11, 2023

@bors r+

@bors
Copy link
Contributor

bors commented Oct 11, 2023

📌 Commit 5facc32 has been approved by dtolnay

It is now in the queue for this repository.

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Oct 11, 2023
bors added a commit to rust-lang-ci/rust that referenced this pull request Oct 11, 2023
Rollup of 4 pull requests

Successful merges:

 - rust-lang#95967 (Add explicit-endian String::from_utf16 variants)
 - rust-lang#116530 (delay a bug when encountering an ambiguity in MIR typeck)
 - rust-lang#116611 (Document `diagnostic_namespace` feature)
 - rust-lang#116612 (Remove unused dominator iterator)

r? `@ghost`
`@rustbot` modify labels: rollup
@bors bors merged commit 38654ad into rust-lang:master Oct 11, 2023
@rustbot rustbot added this to the 1.75.0 milestone Oct 11, 2023
rust-timer added a commit to rust-lang-ci/rust that referenced this pull request Oct 11, 2023
Rollup merge of rust-lang#95967 - CAD97:from-utf16, r=dtolnay

Add explicit-endian String::from_utf16 variants

This adds the following APIs under `feature(str_from_utf16_endian)`:

```rust
impl String {
    pub fn from_utf16le(v: &[u8]) -> Result<String, FromUtf16Error>;
    pub fn from_utf16le_lossy(v: &[u8]) -> String;
    pub fn from_utf16be(v: &[u8]) -> Result<String, FromUtf16Error>;
    pub fn from_utf16be_lossy(v: &[u8]) -> String;
}
```

These are versions of `String::from_utf16` that explicitly take [UTF-16LE and UTF-16BE](https://unicode.org/faq/utf_bom.html#gen7). Notably, we can do better than just the obvious `decode_utf16(v.array_chunks::<2>().copied().map(u16::from_le_bytes)).collect()` in that:

- We handle the case where the byte slice is not an even number of bytes, and
- In the case that the UTF-16 is native endian and the slice is aligned, we can forward to `String::from_utf16`.

If the Unicode Consortium actively defines how to handle character replacement when decoding a UTF-16 bytestream with a trailing odd byte, I was unable to find reference. However, the behavior implemented here is fairly self-evidently correct: replace the single errant byte with the replacement character.
@CAD97 CAD97 deleted the from-utf16 branch October 11, 2023 04:55
bors-ferrocene bot added a commit to ferrocene/ferrocene that referenced this pull request Oct 12, 2023
48: Pull upstream master 2023 10 12 r=tshepang a=Dajamante

* rust-lang/rust#113487
* rust-lang/rust#116506
* rust-lang/rust#116448
* rust-lang/rust#116640
  * rust-lang/rust#116627
  * rust-lang/rust#116597
  * rust-lang/rust#116436
  * rust-lang/rust#116315
  * rust-lang/rust#116219
* rust-lang/rust#113218
* rust-lang/rust#115937
* rust-lang/rust#116014
* rust-lang/rust#116623
* rust-lang/rust#112818
* rust-lang/rust#115948
* rust-lang/rust#116622
* rust-lang/rust#116621
  * rust-lang/rust#116612
  * rust-lang/rust#116611
  * rust-lang/rust#116530
  * rust-lang/rust#95967
* rust-lang/rust#116578
* rust-lang/rust#113915
* rust-lang/rust#116605
  * rust-lang/rust#116574
  * rust-lang/rust#116560
  * rust-lang/rust#116559
  * rust-lang/rust#116503
  * rust-lang/rust#116444
  * rust-lang/rust#116250
  * rust-lang/rust#109422
* rust-lang/rust#116598
  * rust-lang/rust#116596
  * rust-lang/rust#116595
  * rust-lang/rust#116589
  * rust-lang/rust#116586
* rust-lang/rust#116551
* rust-lang/rust#116409
* rust-lang/rust#116548
* rust-lang/rust#116366
* rust-lang/rust#109882
* rust-lang/rust#116497
* rust-lang/rust#116532
* rust-lang/rust#116569
  * rust-lang/rust#116561
  * rust-lang/rust#116556
  * rust-lang/rust#116549
  * rust-lang/rust#116543
  * rust-lang/rust#116537
  * rust-lang/rust#115882
* rust-lang/rust#116142
* rust-lang/rust#115238
* rust-lang/rust#116533
* rust-lang/rust#116096
* rust-lang/rust#116468
* rust-lang/rust#116515
* rust-lang/rust#116454
* rust-lang/rust#116183
* rust-lang/rust#116514
* rust-lang/rust#116509
* rust-lang/rust#116487
* rust-lang/rust#116486
* rust-lang/rust#116450
* rust-lang/rust#114623
* rust-lang/rust#116416
* rust-lang/rust#116437
* rust-lang/rust#100806
* rust-lang/rust#116330
* rust-lang/rust#116310
* rust-lang/rust#115583
* rust-lang/rust#116457
* rust-lang/rust#116508
* rust-lang/rust#109214
* rust-lang/rust#116318
* rust-lang/rust#116501
  * rust-lang/rust#116500
  * rust-lang/rust#116458
  * rust-lang/rust#116400
  * rust-lang/rust#116277
* rust-lang/rust#114709
* rust-lang/rust#116492
  * rust-lang/rust#116484
  * rust-lang/rust#116481
  * rust-lang/rust#116474
  * rust-lang/rust#116466
  * rust-lang/rust#116423
  * rust-lang/rust#116297
  * rust-lang/rust#114564
* rust-lang/rust#114811
* rust-lang/rust#116489
* rust-lang/rust#115304

Co-authored-by: Peter Hall <peter.hall@hyperexponential.com>
Co-authored-by: Emanuele Vannacci <emanuele.vannacci@gmail.com>
Co-authored-by: Neven Villani <vanille@crans.org>
Co-authored-by: Alex Macleod <alex@macleod.io>
Co-authored-by: Tamir Duberstein <tamird@gmail.com>
Co-authored-by: Eduardo Sánchez Muñoz <eduardosm-dev@e64.io>
Co-authored-by: koka <koka.code@gmail.com>
Co-authored-by: bors <bors@rust-lang.org>
Co-authored-by: Philipp Krones <hello@philkrones.com>
Co-authored-by: Camille GILLOT <gillot.camille@gmail.com>
Co-authored-by: Esteban Küber <esteban@kuber.com.ar>
Co-authored-by: Ralf Jung <post@ralfj.de>
bors-ferrocene bot added a commit to ferrocene/ferrocene that referenced this pull request Oct 13, 2023
48: Pull upstream master 2023 10 12 r=tshepang a=Dajamante

* rust-lang/rust#113487
* rust-lang/rust#116506
* rust-lang/rust#116448
* rust-lang/rust#116640
  * rust-lang/rust#116627
  * rust-lang/rust#116597
  * rust-lang/rust#116436
  * rust-lang/rust#116315
  * rust-lang/rust#116219
* rust-lang/rust#113218
* rust-lang/rust#115937
* rust-lang/rust#116014
* rust-lang/rust#116623
* rust-lang/rust#112818
* rust-lang/rust#115948
* rust-lang/rust#116622
* rust-lang/rust#116621
  * rust-lang/rust#116612
  * rust-lang/rust#116611
  * rust-lang/rust#116530
  * rust-lang/rust#95967
* rust-lang/rust#116578
* rust-lang/rust#113915
* rust-lang/rust#116605
  * rust-lang/rust#116574
  * rust-lang/rust#116560
  * rust-lang/rust#116559
  * rust-lang/rust#116503
  * rust-lang/rust#116444
  * rust-lang/rust#116250
  * rust-lang/rust#109422
* rust-lang/rust#116598
  * rust-lang/rust#116596
  * rust-lang/rust#116595
  * rust-lang/rust#116589
  * rust-lang/rust#116586
* rust-lang/rust#116551
* rust-lang/rust#116409
* rust-lang/rust#116548
* rust-lang/rust#116366
* rust-lang/rust#109882
* rust-lang/rust#116497
* rust-lang/rust#116532
* rust-lang/rust#116569
  * rust-lang/rust#116561
  * rust-lang/rust#116556
  * rust-lang/rust#116549
  * rust-lang/rust#116543
  * rust-lang/rust#116537
  * rust-lang/rust#115882
* rust-lang/rust#116142
* rust-lang/rust#115238
* rust-lang/rust#116533
* rust-lang/rust#116096
* rust-lang/rust#116468
* rust-lang/rust#116515
* rust-lang/rust#116454
* rust-lang/rust#116183
* rust-lang/rust#116514
* rust-lang/rust#116509
* rust-lang/rust#116487
* rust-lang/rust#116486
* rust-lang/rust#116450
* rust-lang/rust#114623
* rust-lang/rust#116416
* rust-lang/rust#116437
* rust-lang/rust#100806
* rust-lang/rust#116330
* rust-lang/rust#116310
* rust-lang/rust#115583
* rust-lang/rust#116457
* rust-lang/rust#116508
* rust-lang/rust#109214
* rust-lang/rust#116318
* rust-lang/rust#116501
  * rust-lang/rust#116500
  * rust-lang/rust#116458
  * rust-lang/rust#116400
  * rust-lang/rust#116277
* rust-lang/rust#114709
* rust-lang/rust#116492
  * rust-lang/rust#116484
  * rust-lang/rust#116481
  * rust-lang/rust#116474
  * rust-lang/rust#116466
  * rust-lang/rust#116423
  * rust-lang/rust#116297
  * rust-lang/rust#114564
* rust-lang/rust#114811
* rust-lang/rust#116489
* rust-lang/rust#115304

Co-authored-by: Emanuele Vannacci <emanuele.vannacci@gmail.com>
Co-authored-by: Neven Villani <vanille@crans.org>
Co-authored-by: Alex Macleod <alex@macleod.io>
Co-authored-by: Tamir Duberstein <tamird@gmail.com>
Co-authored-by: Eduardo Sánchez Muñoz <eduardosm-dev@e64.io>
Co-authored-by: koka <koka.code@gmail.com>
Co-authored-by: bors <bors@rust-lang.org>
Co-authored-by: Philipp Krones <hello@philkrones.com>
Co-authored-by: Camille GILLOT <gillot.camille@gmail.com>
Co-authored-by: Esteban Küber <esteban@kuber.com.ar>
Co-authored-by: Ralf Jung <post@ralfj.de>
Co-authored-by: ShE3py <52315535+she3py@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging this pull request may close these issues.