Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

std: Change encode_utf{8,16} to return iterators #32204

Merged
merged 1 commit into from
Mar 23, 2016

Conversation

alexcrichton
Copy link
Member

Currently these have non-traditional APIs which take a buffer and report how
much was filled in, but they're not necessarily ergonomic to use. Returning an
iterator which also exposes an underlying slice shouldn't result in any
performance loss as it's just a lazy version of the same implementation, and
it's also much more ergonomic!

cc #27784

@alexcrichton
Copy link
Member Author

r? @aturon

cc @SimonSapin

fn encode_utf8(self) -> EncodeUtf8 {
let code = self as u32;
let mut buf = [0; 4];
let pos = if code < MAX_ONE_B && !buf.is_empty() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the purpose of !buf.is_empty()? Isn't it always true?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops! Just a holdover from the copy/pasted code below.

@alexcrichton alexcrichton force-pushed the redesign-char-encoding-types branch from 2b3c5ac to e3581c8 Compare March 11, 2016 23:01
fn encode_utf8(self, dst: &mut [u8]) -> Option<usize>;
#[stable(feature = "core", since = "1.6.0")]
fn encode_utf16(self, dst: &mut [u16]) -> Option<usize>;
#[unstable(feature = "unicode", issue = "27784")]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing #[stable] to #[unstable]? Is that acceptable because the trait is #[unstable]?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh right yes thanks for the reminder! I meant to discuss this more at length on the PR comments but forgot.

So it looks like char::encode_utf8 is unstable (as expected), but we accidentally stabilized CharExt::encode_utf8 in libcore. This appears to be accidental (from what I can tell), so it seems that we're within our rights to switch back to unstable here (with libstd being the "source of truth")

That being said I wouldn't mind doing a crater run to see if it affects any crates.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This appears to be accidental (from what I can tell), so it seems that we're within our rights to switch back to unstable here (with libstd being the "source of truth")

That reasoning seems problematic. If a stable channel compiler lets me use some feature/item because it is marked #[stable], I have no way to know whether it is "accidentally" stable. Going back to #[unstable] effectively removes an item as far as the stable channel is concerned, which RFC 1105 says is a major change.

I thought this specific case was OK since the CharExt trait is #[unstable]. If I can’t write use core::char::CharExt; I can’t use CharExt’s methods. But I don’t need that use, it’s in the core prelude.

Maybe these methods have been stable a short enough time that little enough code uses them that the breakage is acceptable. I’m in favor of this change, but “oh but we didn’t mean that stability promise we made” should not be sufficient.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SimonSapin this was a mistake when stabilizing libcore. If it causes breakage we can back out and rethink our strategy, but the hypothesis is that this won't cause breakage as the standard library's copy has been unstable so anyone using stable has been avoiding it anyway.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, in general it's obviously not OK to revert a stabilization just because it was an accident. But the fact that the stable item was isolated to libcore and not stable in libstd meant that the impact of correction should be very low.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, I see @alexcrichton beat me to basically the same point...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a separate issue was opened for this: #32460

@aturon
Copy link
Member

aturon commented Mar 21, 2016

Oof, sorry for the delay, this got buried in my inbox.

Nice improvements here. I don't think we need a crater run for the libcore stability changes -- I'm fine to land as-is.

@bors: r+

@bors
Copy link
Contributor

bors commented Mar 21, 2016

📌 Commit e3581c8 has been approved by aturon

@bors
Copy link
Contributor

bors commented Mar 22, 2016

⌛ Testing commit e3581c8 with merge 93ee81b...

@bors
Copy link
Contributor

bors commented Mar 22, 2016

💔 Test failed - auto-linux-64-x-armhf

Currently these have non-traditional APIs which take a buffer and report how
much was filled in, but they're not necessarily ergonomic to use. Returning an
iterator which *also* exposes an underlying slice shouldn't result in any
performance loss as it's just a lazy version of the same implementation, and
it's also much more ergonomic!

cc rust-lang#27784
@alexcrichton alexcrichton force-pushed the redesign-char-encoding-types branch from e3581c8 to 48d5fe9 Compare March 22, 2016 17:25
@alexcrichton
Copy link
Member Author

@bors: r=aturon 48d5fe9

@bors
Copy link
Contributor

bors commented Mar 22, 2016

⌛ Testing commit 48d5fe9 with merge e1b4dd0...

bors added a commit that referenced this pull request Mar 22, 2016
…turon

std: Change `encode_utf{8,16}` to return iterators

Currently these have non-traditional APIs which take a buffer and report how
much was filled in, but they're not necessarily ergonomic to use. Returning an
iterator which *also* exposes an underlying slice shouldn't result in any
performance loss as it's just a lazy version of the same implementation, and
it's also much more ergonomic!

cc #27784
@bors
Copy link
Contributor

bors commented Mar 22, 2016

💔 Test failed - auto-win-gnu-32-nopt-t

@alexcrichton
Copy link
Member Author

@bors: retry

On Tue, Mar 22, 2016 at 2:44 PM, bors notifications@github.com wrote:

[image: 💔] Test failed - auto-win-gnu-32-nopt-t
http://buildbot.rust-lang.org/builders/auto-win-gnu-32-nopt-t/builds/3527


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
#32204 (comment)

@alexcrichton
Copy link
Member Author

@bors: retry force clean

@bors
Copy link
Contributor

bors commented Mar 22, 2016

⌛ Testing commit 48d5fe9 with merge 0dcc413...

bors added a commit that referenced this pull request Mar 22, 2016
…turon

std: Change `encode_utf{8,16}` to return iterators

Currently these have non-traditional APIs which take a buffer and report how
much was filled in, but they're not necessarily ergonomic to use. Returning an
iterator which *also* exposes an underlying slice shouldn't result in any
performance loss as it's just a lazy version of the same implementation, and
it's also much more ergonomic!

cc #27784
self.bytes.set_len(cur_len + used);
}
let bytes = unsafe {
char::from_u32_unchecked(code_point.value).encode_utf8()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even if only "for a short time" this can create an invalid char (which can be a surrogate). Is that really OK? Doesn’t char provide range assertions to LLVM? Previously we had encode_utf8_raw taking u32 to avoid this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah right. I'd rather not expose those methods though from libcore again and we can just inline them here for now if we need to.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. But I’m asking, because I really don’t know: do we need to? Is this one of those things that appear to work when you try it but is technically undefined behavior and can cause bad things in future compiler versions?

I still have nigthly CI for https://github.com/SimonSapin/rust-wtf8 which has similar code, failing now that encode_utf8_raw is gone, and I’m unsure what to do about it. And I imagine other libraries may be tempted to do something similar: I recall the regex crate doing some contortions internally to avoid creating a surrogate char when "iterating" over character ranges like [a-z].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not actually sure either, it kinda depends what we tell LLVM, how we codgen this, and what operations happen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants