Request: standard library function/type/whatever to parse utf-8 from an iterator #90643

Open
Lokathor opened this issue Nov 6, 2021 · 5 comments
Labels
C-feature-request Category: A feature request, i.e: not implemented / a PR. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@Lokathor
Contributor

Lokathor commented Nov 6, 2021

There's char::decode_utf16, and there's an internal iterator adapter, used by str::chars, that turns a byte iterator over UTF-8 into a char iterator, but there's nothing like decode_utf8 where an iterator over bytes is decoded as it goes.

@the8472
Member

the8472 commented Nov 6, 2021

Do you mean Iterator<Item=u8> -> Iterator<Item=char>? I think that would be much slower than batch validation of a [u8].
Or something more along the lines of String.try_append_from_utf8_iter(it: impl Iterator<Item=u8>) -> Result<...>?
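
For illustration, a minimal sketch of the second option on top of stable APIs. The name try_append_from_utf8_iter comes from the line above, but it is written here as a free function, and its signature and error type are assumptions, not an existing std API. It buffers the whole iterator and validates it in one batch, so it only helps when buffering is acceptable:

```rust
use std::str::{self, Utf8Error};

/// Hypothetical helper (not part of std): collect the bytes, validate them
/// in one batch, and append to `dst` only if the whole input is valid UTF-8.
pub fn try_append_from_utf8_iter<I>(dst: &mut String, it: I) -> Result<(), Utf8Error>
where
    I: IntoIterator<Item = u8>,
{
    // Buffer everything first so `str::from_utf8` can validate a slice.
    let buf: Vec<u8> = it.into_iter().collect();
    // On invalid input, return the error and leave `dst` untouched.
    let s = str::from_utf8(&buf)?;
    dst.push_str(s);
    Ok(())
}
```

On failure the returned Utf8Error reports how many leading bytes were valid via valid_up_to().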

@Lokathor
Contributor Author

Lokathor commented Nov 6, 2021

I mean the first one. Also, you can't presume that the caller has the full slice available, which is the exact situation I found myself in.
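
A rough sketch of what such an adapter could look like outside of std, assuming hypothetical names decode_utf8_iter and DecodeUtf8Iter and an assumed error type that simply carries the lead byte of the bad sequence. It reads at most four bytes per char and leans on std::str::from_utf8 to reject overlong forms, surrogates, and out-of-range values:

```rust
use std::iter::Peekable;
use std::str;

/// Hypothetical adapter (not part of std): decodes UTF-8 lazily from bytes.
pub struct DecodeUtf8Iter<I: Iterator<Item = u8>> {
    bytes: Peekable<I>,
}

/// Hypothetical constructor; the name is an assumption for this sketch.
pub fn decode_utf8_iter<I: IntoIterator<Item = u8>>(iter: I) -> DecodeUtf8Iter<I::IntoIter> {
    DecodeUtf8Iter { bytes: iter.into_iter().peekable() }
}

impl<I: Iterator<Item = u8>> Iterator for DecodeUtf8Iter<I> {
    /// `Err` carries the first byte of the sequence that failed to decode.
    type Item = Result<char, u8>;

    fn next(&mut self) -> Option<Self::Item> {
        let first = self.bytes.next()?;
        // The lead byte determines how long the sequence should be.
        let len = match first {
            0x00..=0x7F => return Some(Ok(first as char)),
            0xC2..=0xDF => 2,
            0xE0..=0xEF => 3,
            0xF0..=0xF4 => 4,
            _ => return Some(Err(first)), // stray continuation or invalid lead
        };
        let mut buf = [first, 0, 0, 0];
        for slot in &mut buf[1..len] {
            match self.bytes.peek() {
                // Consume a byte only if it is a continuation byte (10xxxxxx).
                Some(&b) if b & 0xC0 == 0x80 => {
                    *slot = b;
                    self.bytes.next();
                }
                // Truncated sequence: report it and leave the next byte alone.
                _ => return Some(Err(first)),
            }
        }
        // Let std reject overlong encodings, surrogates, and values > U+10FFFF.
        Some(match str::from_utf8(&buf[..len]) {
            Ok(s) => Ok(s.chars().next().expect("validated, non-empty")),
            Err(_) => Err(first),
        })
    }
}
```

Error recovery here is deliberately simple: after a failed multi-byte sequence, decoding resumes at the first byte that was not consumed. A real API would need to pin that behavior down.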

@hkratz
Contributor

hkratz commented Nov 6, 2021

We had decode_utf8, but it was deprecated in #49970 and removed in #52814. The main reasoning for the decision is apparently outlined here: #33906 (comment).

@clubby789
Contributor

@rustbot label +T-libs-api +C-feature-request

@rustbot rustbot added C-feature-request Category: A feature request, i.e: not implemented / a PR. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Mar 30, 2023
@mqudsi
Contributor

mqudsi commented Apr 29, 2023

I read through the linked issues but unless I missed it, it seems the incongruity of the API wasn't discussed.

I'm not sure why decode_utf8() was deprecated and then removed after char::decode_utf16() was stabilized. They are great counterparts to one another, and it is confusing to have char::encode_utf8() + char::encode_utf16() alongside char::decode_utf16() but no char::decode_utf8().

AFAICT, the closest equivalent of a single-char decode_utf8() built on std::str::from_utf8() would be as follows:

pub fn char_decode_from_utf8(bytes: &[u8]) -> Option<char> {
    let decoded = std::str::from_utf8(bytes).ok()?;
    let mut chars = decoded.chars();
    let result = chars.next()?;
    match chars.next() {
        None => Some(result),
        Some(_) => None, // `bytes` contains more than 1 codepoint!
    }
}

The last bit about ensuring the byte slice decodes to only one char and no more is an important part that a first attempt might overlook - maybe it is worth including it for that reason alone. (Ensuring input is <= 4 bytes as well before calling std::str::from_utf8() might also be worth doing.)

But really this code is much too bloated for what it does, and you'd be relying on the compiler to first inline the UTF-8 decoding routine and then remove the duplicate checks to get reasonable output out of this. It would be much better if the standard library exposed the internal UTF-8 decoder directly behind this API.
